You can index binary data stored in database columns. For example, if you have stored PDF files in a column called binary_content, you can configure LucidWorks Enterprise (LWE) to recognize and extract the PDF data correctly.
Indexing binary data in a database requires modifying the data source configuration file, which does not exist until you create a JDBC data source. For detailed information about working with JDBC data sources, see Create a New JDBC Data Source. After you have created the data source, you can find the configuration file in LWE_HOME/conf/solr/cores/collection/conf/dataconfig_id.xml. If you are familiar with Solr, you will recognize this file as a Data Import Handler configuration file.
This functionality is available in LucidWorks Enterprise but not LucidWorks Cloud.
Follow these steps to modify the configuration file:
- Add a name attribute for the database containing your binary data to the dataSource entry.
- Set the convertType attribute for the dataSource to false. This prevents LWE from treating binary data as strings.
- Add a FieldStreamDataSource to stream the binary data to the Tika entity processor.
- Specify the dataSource name in the root entity.
- Add an entity for your FieldStreamDataSource using the TikaEntityProcessor to take the binary data from the FieldStreamDataSource, parse it, and specify a field for storing the processed data.
- Reload the Solr core to apply your configuration changes.
After you have modified the data source configuration file, do not modify the data source from the LWE Admin UI. LWE will automatically overwrite the convertType attribute, and indexing for the modified data source will fail.
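The reload step can be scripted rather than done by hand. The following is a minimal sketch that uses the standard Solr CoreAdmin RELOAD action. The base URL (http://localhost:8888/solr) and the core name (collection1) are assumptions for illustration and are not taken from this page; substitute the values for your installation. The function names are hypothetical helpers.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_reload_url(solr_base_url, core_name):
    """Return the CoreAdmin URL that asks Solr to reload one core."""
    query = urlencode({"action": "RELOAD", "core": core_name})
    return f"{solr_base_url.rstrip('/')}/admin/cores?{query}"


def reload_core(solr_base_url, core_name, timeout=30):
    """Send the reload request and return the HTTP status code."""
    url = build_reload_url(solr_base_url, core_name)
    with urlopen(url, timeout=timeout) as response:
        return response.status


# Example call (assumed host, port, and core name; adjust before running):
# reload_core("http://localhost:8888/solr", "collection1")
```

Building the URL separately from sending the request makes it easy to print and inspect the URL before anything touches the server.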
Example
In this example, a MySQL database called test contains a table called documents, which stores PDF data in a column called binary_content. When we first create the data source, the data source config file looks like this:
<dataConfig>
  <dataSource autoCommit="true" batchSize="-1" convertType="true" driver="com.mysql.jdbc.Driver" password="admin" url="jdbc:mysql://localhost/test" user="root"/>
  <document name="items">
    <entity name="root" preImportDeleteQuery="data_source:9" query="SELECT * FROM documents" transformer="TemplateTransformer">
      <field column="data_source" template="9"/>
      <field column="data_source_type" template="Jdbc"/>
      <field column="data_source_name" template="MySQL"/>
    </entity>
  </document>
</dataConfig>
To modify this data configuration file, we follow these steps:
- Add the name attribute to the dataSource and set convertType to false:
<dataSource autoCommit="true" batchSize="-1" convertType="false" driver="com.mysql.jdbc.Driver" password="admin" url="jdbc:mysql://localhost/test" user="root" name="test"/>
- Specify another dataSource called fieldReader to handle the binary data:
<dataSource name="fieldReader" type="FieldStreamDataSource" />
- Specify the data source for the root entity:
<entity name="root" preImportDeleteQuery="data_source:9" query="SELECT * FROM documents" transformer="TemplateTransformer" dataSource="test">
- Add an entity for the fieldReader data source that specifies the TikaEntityProcessor, a dataField naming the column that holds the binary data, and a field for storing the extracted text:
<entity dataSource="fieldReader" processor="TikaEntityProcessor" dataField="root.binary_content" format="text">
  <field column="text" name="body" />
</entity>
- Reload the Solr core to apply your configuration changes.
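The configuration edits in the steps above can also be applied with a short script. This is only a sketch, assuming the file has the same shape as the example (one JDBC dataSource followed by a document with a single root entity). The function enable_binary_indexing and its parameters are hypothetical names introduced for illustration; the attribute values mirror the ones used on this page.

```python
import xml.etree.ElementTree as ET


def enable_binary_indexing(config_xml, source_name, binary_column, target_field="body"):
    """Return config_xml rewritten to parse a binary column with Tika."""
    root = ET.fromstring(config_xml)

    # Name the JDBC data source and stop it converting binary data to strings.
    jdbc = root.find("dataSource")
    jdbc.set("name", source_name)
    jdbc.set("convertType", "false")

    # Add the FieldStreamDataSource directly after the JDBC data source.
    reader = ET.Element("dataSource", {"name": "fieldReader", "type": "FieldStreamDataSource"})
    root.insert(list(root).index(jdbc) + 1, reader)

    # Point the root entity at the named JDBC data source.
    entity = root.find("document/entity")
    entity.set("dataSource", source_name)

    # Nest a Tika entity that reads the binary column and maps the text output.
    tika = ET.SubElement(entity, "entity", {
        "dataSource": "fieldReader",
        "processor": "TikaEntityProcessor",
        "dataField": f"{entity.get('name')}.{binary_column}",
        "format": "text",
    })
    ET.SubElement(tika, "field", {"column": "text", "name": target_field})

    return ET.tostring(root, encoding="unicode")
```

Calling enable_binary_indexing(original_xml, "test", "binary_content") on the original file from this example should produce a structure equivalent to the final configuration, although attribute order and whitespace may differ.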
For this example, the final configuration file looks like this:
<dataConfig>
  <dataSource autoCommit="true" batchSize="-1" convertType="false" driver="com.mysql.jdbc.Driver" password="admin" url="jdbc:mysql://localhost/test" user="root" name="test"/>
  <dataSource name="fieldReader" type="FieldStreamDataSource" />
  <document name="items">
    <entity name="root" preImportDeleteQuery="data_source:9" query="SELECT * FROM documents" transformer="TemplateTransformer" dataSource="test">
      <field column="data_source" template="9"/>
      <field column="data_source_type" template="Jdbc"/>
      <field column="data_source_name" template="MySQL"/>
      <entity dataSource="fieldReader" processor="TikaEntityProcessor" dataField="root.binary_content" format="text">
        <field column="text" name="body" />
      </entity>
    </entity>
  </document>
</dataConfig>
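Before reloading the core, it can help to confirm that an edited file still meets the requirements listed on this page, particularly because the Admin UI may reset convertType. The check below is a rough sketch, not an LWE tool. The function check_binary_config and its messages are hypothetical, and it only inspects the attributes discussed here.

```python
import xml.etree.ElementTree as ET


def check_binary_config(config_xml):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    root = ET.fromstring(config_xml)
    sources = root.findall("dataSource")
    readers = [s for s in sources if s.get("type") == "FieldStreamDataSource"]
    jdbc = [s for s in sources if s.get("type") != "FieldStreamDataSource"]

    for source in jdbc:
        if not source.get("name"):
            problems.append("JDBC dataSource has no name attribute")
        if source.get("convertType") != "false":
            problems.append("JDBC dataSource convertType is not false")
    if not readers:
        problems.append("no FieldStreamDataSource defined")

    jdbc_names = {s.get("name") for s in jdbc if s.get("name")}
    reader_names = {r.get("name") for r in readers if r.get("name")}
    for entity in root.findall("document/entity"):
        if entity.get("dataSource") not in jdbc_names:
            problems.append("root entity does not reference a named JDBC dataSource")
        # Look for a nested Tika entity that reads from the field stream source.
        tika = [e for e in entity.findall("entity")
                if e.get("processor") == "TikaEntityProcessor"]
        if not any(e.get("dataSource") in reader_names for e in tika):
            problems.append("no TikaEntityProcessor entity reads from the FieldStreamDataSource")
    return problems
```

Passing the contents of the configuration file as a string returns the list of problems, so a scheduled job could run the check and alert when the list is not empty.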
