This is the documentation for the LucidWorks Platform v2.1, the latest release.

You can index binary data that you have stored in database columns. For example, if you have stored PDF files in a column called binary_content in your database, you can configure LucidWorks Enterprise to recognize and extract the PDF data correctly.

Indexing binary data in a database requires modifying the data source configuration file in LucidWorks Enterprise (LWE). This file does not exist until you create a JDBC data source. For detailed information about working with JDBC data sources, see Create a New JDBC Data Source. After you have created the data source, you can find the configuration file at LWE_HOME/conf/solr/cores/collection/conf/dataconfig_id.xml. If you are familiar with Solr, you will recognize this file as a Data Import Handler configuration file.

This functionality is available in LucidWorks Enterprise but not LucidWorks Cloud.

Follow these steps to modify the configuration file:

  1. Add a name attribute for the database containing your binary data to the dataSource entry.
  2. Set the convertType attribute for the dataSource to false. This prevents LWE from treating binary data as strings.
  3. Add a FieldStreamDataSource to stream the binary data to the Tika entity processor.
  4. Specify the dataSource name in the root entity.
  5. Add an entity for your FieldStreamDataSource that uses the TikaEntityProcessor to read the binary data from the FieldStreamDataSource and parse it, and specify a field for storing the extracted text.
  6. Reload the Solr core to apply your configuration changes.

After you have modified the data source configuration file, do not modify the data source from the LWE Admin UI. LWE will automatically overwrite the convertType attribute, and indexing for the modified data source will fail.
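
The first five steps above can be sketched as a short Python script that edits the configuration XML with the standard library. This is only an illustration under stated assumptions: the function name enable_binary_indexing, its parameters, and the trimmed sample configuration are hypothetical, and editing the file by hand works just as well.

```python
import xml.etree.ElementTree as ET


def enable_binary_indexing(config_xml, db_name, binary_column,
                           target_field="body", reader_name="fieldReader"):
    """Return a modified copy of a Data Import Handler config (hypothetical helper)."""
    root = ET.fromstring(config_xml)

    # Steps 1 and 2: name the JDBC dataSource and turn off type conversion.
    jdbc = root.find("dataSource")
    jdbc.set("name", db_name)
    jdbc.set("convertType", "false")

    # Step 3: declare a FieldStreamDataSource right after the JDBC dataSource.
    reader = ET.Element("dataSource",
                        {"name": reader_name, "type": "FieldStreamDataSource"})
    root.insert(list(root).index(jdbc) + 1, reader)

    # Step 4: point the root entity at the named JDBC dataSource.
    entity = root.find("document/entity")
    entity.set("dataSource", db_name)

    # Step 5: nest a Tika entity that reads the binary column of the root
    # entity and stores the extracted text in target_field.
    tika = ET.SubElement(entity, "entity", {
        "dataSource": reader_name,
        "processor": "TikaEntityProcessor",
        "dataField": entity.get("name") + "." + binary_column,
        "format": "text",
    })
    ET.SubElement(tika, "field", {"column": "text", "name": target_field})

    return ET.tostring(root, encoding="unicode")


# Trimmed sample input; a real file has more attributes and fields.
SAMPLE = """<dataConfig>
  <dataSource convertType="true" driver="com.mysql.jdbc.Driver"
   url="jdbc:mysql://localhost/test" user="root" password="admin"/>
  <document name="items">
    <entity name="root" query="SELECT * FROM documents"/>
  </document>
</dataConfig>"""

modified = enable_binary_indexing(SAMPLE, "test", "binary_content")
print(modified)
```

The script covers only the XML changes. You would still need to write the result back to the configuration file and reload the Solr core (step 6).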

Example

In this example, a MySQL database called test contains a table called documents, which stores PDF data in a column called binary_content. When we first create the data source, the data source configuration file looks like this:

<dataConfig>
  <dataSource autoCommit="true" batchSize="-1" convertType="true" driver="com.mysql.jdbc.Driver" password="admin"
   url="jdbc:mysql://localhost/test" user="root"/>
  <document name="items">
    <entity name="root" preImportDeleteQuery="data_source:9" query="SELECT * FROM documents"
     transformer="TemplateTransformer">
      <field column="data_source" template="9"/>
      <field column="data_source_type" template="Jdbc"/>
      <field column="data_source_name" template="MySQL"/>
    </entity>
  </document>
</dataConfig>

To modify this data configuration file, we follow these steps:

  1. Add the name attribute to the dataSource and set convertType to false:
    <dataSource autoCommit="true" batchSize="-1" convertType="false" driver="com.mysql.jdbc.Driver" password="admin"
     url="jdbc:mysql://localhost/test" user="root" name="test"/>

  2. Specify another dataSource called fieldReader to handle the binary data:
    <dataSource name="fieldReader" type="FieldStreamDataSource" />

  3. Specify the data source for the root entity:
    <entity name="root" preImportDeleteQuery="data_source:9" query="SELECT * FROM documents"
     transformer="TemplateTransformer" dataSource="test">

  4. Inside the root entity, add an entity for the fieldReader data source, specifying the TikaEntityProcessor, the dataField that holds the binary data, and a field for storing the extracted text:
    <entity dataSource="fieldReader" processor="TikaEntityProcessor" dataField="root.binary_content" format="text">
      <field column="text" name="body" />
    </entity>

  5. Reload the Solr core to apply your configuration changes.
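
One way to perform the reload in the last step is through Solr's CoreAdmin API, which accepts an action=RELOAD request for a named core. The sketch below builds that request URL in Python. The base URL http://localhost:8888/solr and the core name collection1 are assumptions, as are the helper names; substitute the values for your own installation.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def core_reload_url(solr_base, core):
    """Build a Solr CoreAdmin RELOAD URL for the given core (hypothetical helper)."""
    query = urlencode({"action": "RELOAD", "core": core})
    return solr_base.rstrip("/") + "/admin/cores?" + query


def reload_core(solr_base, core):
    """Send the RELOAD request and return the HTTP status code."""
    with urlopen(core_reload_url(solr_base, core)) as response:
        return response.status


# Assumed values: adjust the base URL and core name to your installation.
url = core_reload_url("http://localhost:8888/solr", "collection1")
print(url)
```

Calling reload_core with the same arguments sends the request to a running server.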

For this example, the final configuration file looks like this:

<dataConfig>
  <dataSource autoCommit="true" batchSize="-1" convertType="false" driver="com.mysql.jdbc.Driver" password="admin"
   url="jdbc:mysql://localhost/test" user="root" name="test"/>
  <dataSource name="fieldReader" type="FieldStreamDataSource" />
  <document name="items">
    <entity name="root" preImportDeleteQuery="data_source:9" query="SELECT * FROM documents"
     transformer="TemplateTransformer"
     dataSource="test">
      <field column="data_source" template="9"/>
      <field column="data_source_type" template="Jdbc"/>
      <field column="data_source_name" template="MySQL"/>
      <entity dataSource="fieldReader" processor="TikaEntityProcessor" dataField="root.binary_content" format="text">
        <field column="text" name="body" />
      </entity>
    </entity>
  </document>
</dataConfig>
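
Because a later change can silently reset convertType, it may help to check the file before indexing. The following Python sketch is one possible sanity check, not part of LWE: the function check_binary_config and the three rules it applies are assumptions drawn from this example (JDBC data sources use convertType="false", every entity refers to a declared dataSource, and each Tika entity reads a dataField of its parent entity).

```python
import xml.etree.ElementTree as ET


def check_binary_config(config_xml):
    """Return a list of problems found in the config; empty if none (hypothetical helper)."""
    root = ET.fromstring(config_xml)
    problems = []
    declared = {ds.get("name") for ds in root.findall("dataSource")}

    # Rule 1: JDBC dataSources (those with a driver) must not convert types.
    for ds in root.findall("dataSource"):
        if ds.get("driver") and ds.get("convertType") != "false":
            problems.append("dataSource %r: convertType is not false" % ds.get("name"))

    for parent in root.iter("entity"):
        # Rule 2: a dataSource named on an entity must be declared.
        ref = parent.get("dataSource")
        if ref is not None and ref not in declared:
            problems.append("entity %r: unknown dataSource %r" % (parent.get("name"), ref))

        # Rule 3: a nested Tika entity reads "<parent name>.<column>".
        for child in parent.findall("entity"):
            if child.get("processor") == "TikaEntityProcessor":
                prefix = parent.get("name", "") + "."
                if not child.get("dataField", "").startswith(prefix):
                    problems.append("Tika entity under %r: dataField %r does not start with %r"
                                    % (parent.get("name"), child.get("dataField"), prefix))
    return problems


# Abbreviated version of the final configuration from this example.
SAMPLE = """<dataConfig>
  <dataSource name="test" convertType="false" driver="com.mysql.jdbc.Driver"
   url="jdbc:mysql://localhost/test" user="root" password="admin"/>
  <dataSource name="fieldReader" type="FieldStreamDataSource"/>
  <document name="items">
    <entity name="root" dataSource="test" query="SELECT * FROM documents">
      <entity dataSource="fieldReader" processor="TikaEntityProcessor"
       dataField="root.binary_content" format="text">
        <field column="text" name="body"/>
      </entity>
    </entity>
  </document>
</dataConfig>"""

# An empty list means none of the three rules was violated.
print(check_binary_config(SAMPLE))
```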