This is the documentation for LucidWorks Big Data v1.2.

LucidWorks Search data sources are one of the ways content can be loaded into the Big Data system. Documents indexed this way do not go through the extract-transform-load workflow described in Workflows, which means the incoming documents cannot be analyzed for clustering or statistically interesting phrases. Documents can also be sent directly to the system using the Big Data Document Indexing API. More background information on this topic is available in the section on Loading Data.

Because LucidWorks Big Data includes LucidWorks Search "under the hood", all of the data source types available with LucidWorks Search are available to the Big Data system. Below we give a couple of examples of using data sources in the Big Data context, but full details on each type are available in the LucidWorks Search documentation.

Background Information

Data sources describe a target repository of documents and the method used to access it. This description is then used to create a crawl job, which is executed by a specific crawler implementation (called a crawler controller).

A data source is defined by selecting a crawler controller, then specifying a valid type for that crawler. Some crawlers work with several types, while others support only one. It is important to match the correct crawler with the type of data source to be indexed.

Each crawler and specified type has different supported attributes. A few attributes are common across all crawlers and types, and some types share attributes with other types using the same crawler. Review the supported attributes carefully when creating data sources with this API.

At present, LucidWorks Search includes the following built-in crawler controllers. Each entry below lists the crawler controller, its symbolic name in parentheses, and the kinds of data sources it supports:

  • Aperture-based crawlers (lucid.aperture): local file systems; websites
  • DataImportHandler-based JDBC crawler (lucid.jdbc): JDBC databases
  • SolrXML crawler (lucid.solrxml): Solr XML files
  • Google Connector Manager-based crawler (lucid.gcm): Microsoft SharePoint servers (Microsoft Office SharePoint Server 2007, Microsoft Windows SharePoint Services 3.0, SharePoint 2010)
  • Remote file system and pseudo-file system crawler (lucid.fs): SMB/CIFS (Windows Shares) filesystems; Hadoop Distributed File Systems; Amazon S3 buckets; HDFS over Amazon S3; FTP servers
  • MapR crawler (lucid.mapr): MapR Hadoop filesystem
  • MapR high volume crawler (lucid.map.reduce.maprfs): high-volume crawling of MapR filesystems
  • MongoDB crawler (lucid.mongodb): MongoDB instances
  • Azure Blob crawler (lucid.azure_blob): Azure Blob storage containers
  • Azure Table crawler (lucid.azure_table): Azure Table instances
  • External data (lucid.external): externally generated data pushed to LucidWorks via Solr
  • Twitter stream (lucid.twitter.stream): Twitter streams using Twitter's streaming API
  • High-Volume HDFS (lucid.map.reduce.hdfs): high-volume crawling of Hadoop filesystems

Note: We'll only cover the Twitter and High-Volume HDFS connectors in this guide. See the LucidWorks Search documentation for full details of each data source type supported.

API Entry Points

/sda/v1/client/collections/collection/datasources: list or create data sources in a particular collection

/sda/v1/client/collections/collection/datasources/id: start, remove, or get details for a particular data source

Get a List of Data Sources

GET /sda/v1/client/collections/collection/datasources

Input

Path Parameters

  • collection: The collection name.

Query Parameters

None.

Output

Output Content

A JSON map of attributes to values. The exact set of attributes for a particular data source depends on its type. There is, however, a set of attributes common to all data source types. Attributes specific to individual types are discussed later in this section.

Common Attributes

These attributes are used for all data source types (except where specifically noted).

General Attributes

  • id (32-bit integer): The numeric ID for this data source.
  • type (string): The type of this data source. Valid types are:
      • file for a filesystem (remote or local, but must be paired with the correct crawler, as below)
      • web for HTTP or HTTPS web sites
      • jdbc for a JDBC database
      • solrxml for files in Solr XML format
      • sharepoint for a SharePoint repository
      • smb for a Windows file share (CIFS)
      • hdfs for a Hadoop filesystem
      • s3 for a native S3 filesystem
      • s3h for a Hadoop-over-S3 filesystem
      • ftp for an FTP server
      • mapr_fs for a MapR filesystem
      • mapr_hv for high-volume crawling of a MapR filesystem
      • mongodb for a MongoDB instance
      • azure_table for an Azure Table instance
      • azure_blob for an Azure Blob storage container
      • external for an externally-managed data source
      • twitter_stream for a Twitter stream
      • high_volume_hdfs for high-volume crawling of a Hadoop filesystem
  • crawler (string): The crawler implementation that handles this type of data source. The crawler must support the specified type. Valid crawlers are:
      • lucid.aperture for web and file types
      • lucid.fs for file, smb, hdfs, s3h, s3, and ftp types
      • lucid.gcm for the sharepoint type
      • lucid.jdbc for the jdbc type
      • lucid.solrxml for the solrxml type
      • lucid.mapr for the mapr_fs type
      • lucid.map.reduce.maprfs for the mapr_hv type
      • lucid.mongodb for the mongodb type
      • lucid.azureblob for the azure_blob type
      • lucid.azuretable for the azure_table type
      • lucid.external for the external type
      • lucid.twitter.stream for the twitter_stream type
      • lucid.map.reduce.hdfs for the high_volume_hdfs type
  • collection (string): The name of the document collection that documents will be indexed into.
  • name (string): A human-readable name for this data source. Names may consist of any combination of letters, digits, spaces, and other characters. Names are case-insensitive and do not need to be unique: several data sources can share the same name.
  • category (string): The category of this data source: Web, FileSystem, Jdbc, SolrXml, Sharepoint, External, or Other. For informational purposes only.
  • mapping (JSON map): Attributes that define how incoming fields in the content are handled. Incoming fields can be mapped to other fields, fields can be explicitly inserted into the content, or defaults for missing content can be supplied. If you would like to work with mapping, see the LucidWorks Search documentation section on Data Sources.
  • output_type (string): The fully qualified class name of the format of the crawl output. The default is intended for direct indexing by Solr, but LucidWorks Big Data first stores all documents in HBase and then synchronizes with Solr, so for LucidWorks Big Data this must always be set to com.lucid.sda.hbase.lws.HBaseUpdateController. If it is not specified and the default is used, documents will not be available for analysis.
  • output_args (string): A ZooKeeper connect string that the HBase library can understand, such as localhost:5181. If ZooKeeper is running on another server or port, modify the string accordingly.
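
For example, a data source definition for LucidWorks Big Data would include values along these lines for the two attributes above. This is a minimal, hypothetical fragment; the ZooKeeper host and port are placeholders for your own environment.

  {
    "output_type": "com.lucid.sda.hbase.lws.HBaseUpdateController",
    "output_args": "localhost:5181"
  }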

Field Mapping
The output also includes the field mapping for the data source, which is modifiable as part of the regular data source update API.

Note: The default mapping settings are mostly fine, with the exception of the original_content parameter, which should be set to "true" (the default is "false"). This saves the raw bytes of each document to HBase, which improves the ability to create vectors and to use them in machine learning tasks.

Each of these attributes is shown under the main mapping attribute, which contains a JSON map of keys and values. For more information, see the LucidWorks Search documentation section on common attributes for data sources.

Optional Commit Rules
The following attributes are optional and relate to when new documents will be added to the index:

  • commit_within (integer): The maximum interval, in milliseconds, between commits while indexing documents. The default is 900,000 milliseconds (15 minutes).
  • commit_on_finish (boolean): When true (the default), a commit is invoked at the end of the crawl.

Batch Processing
Batch processing allows crawling a repository of content without indexing the content until a later time (perhaps after some additional processing). A few optional attributes control batch processing. They are not covered here; see the LucidWorks Search documentation section on Processing Documents in Batches.

Twitter Stream Attributes

The Twitter Stream data source type uses Twitter's streaming API to index tweets on a continuous basis.

This data source uses the lucid.twitter.stream crawler. Unlike other crawlers, which generally have some kind of defined end point (even if that end point is after hundreds of thousands or millions of documents), this crawler opens a stream and will not stop until Twitter stops. The Data Source Jobs API in LucidWorks Search will allow you to stop the stream if necessary.

This data source is in the early stages of development and does not yet process deletes. Deletes are shown in the Data Source History statistics, but these are delete notices received from the streaming API; the corresponding tweets may or may not be in the index (and if they are, the data source does not yet process them).

In order to successfully configure a Twitter stream, you must first register your application with Twitter and accept their terms of service. The registration process will provide you with the required OAuth tokens you need to access the stream. To get the tokens, follow these steps:

  1. Make sure you have a Twitter account, and go to http://dev.twitter.com/ and sign in.
  2. Choose the "Create an App" link and fill out the required details (the callback field can be skipped). Click "Create Application" to register your application.
  3. The next page will contain the Consumer Key and Consumer Secret, which you will need to configure the data source in LucidWorks.
  4. At the bottom of the same page, choose "Create My Access Token".
  5. The next page will contain the Access Token and Token Secret, which you will also need to configure the data source in LucidWorks.

While you need a Twitter account to register an application, you do not use your Twitter credentials to configure this data source. Take the Consumer Key, Consumer Secret, Access Token, and Token Secret information and store it where you can access it while configuring the data source.

When creating a data source of type twitter_stream, the value lucid.twitter.stream must be supplied for the crawler attribute, described in the section on common attributes above. The common attributes are available for configuration in addition to those listed below.

  • access_token (string, required): The access token is provided after registering with Twitter and requesting an access token (see above).
  • consumer_key (string, required): The consumer key is provided after registering with Twitter (see above).
  • consumer_secret (string, required): The consumer secret is provided after registering with Twitter (see above). It should be treated as a password for your registered application.
  • filter_follow (list, optional, default null): A set of specific Twitter user IDs to filter the stream. If combined with another filter, the filters act as OR statements on the stream (i.e., the tweet must match the user ID or the keyword or the location). Note that this is not the user handle or screen name, but a numeric ID assigned by Twitter. To find the ID, you could make an API request such as https://api.twitter.com/1/users/show.xml?screen_name=usaa, replacing "usaa" with the user handle as necessary. The ID is found in the "id" field of the XML output.
  • filter_locations (list, optional, default null): A set of bounding boxes (latitude/longitude coordinate pairs) to filter the stream by geographic location. If combined with another filter, the filters act as OR statements on the stream (i.e., the tweet must match the user ID or the keyword or the location).
  • filter_track (list, optional, default null): A set of keywords to filter the stream. If combined with another filter, the filters act as OR statements on the stream (i.e., the tweet must match the user ID or the keyword or the location).
  • max_docs (long, optional, default -1): While testing the feed, it may be desirable to limit initial streams to a specific number of tweets. The default of -1 keeps the connection open until it is manually closed.
  • sleep (integer, required, default 10000): Twitter will occasionally throttle streaming, in which case the data source waits the configured amount of time before trying again. The default of 10,000 milliseconds should be sufficient for most scenarios.
  • token_secret (string, required): The token secret is provided after registering with Twitter and requesting an access token (see above). It should be treated as a password for your API access.

Example twitter_stream data source
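
The following is a minimal sketch of what a twitter_stream data source definition might look like, based on the attributes described above. The token values are placeholders to be replaced with your own credentials, and the ZooKeeper connect string is an assumption for a local installation.

  {
    "name": "Twitter stream example",
    "type": "twitter_stream",
    "crawler": "lucid.twitter.stream",
    "collection": "documentation",
    "access_token": "YOUR_ACCESS_TOKEN",
    "token_secret": "YOUR_TOKEN_SECRET",
    "consumer_key": "YOUR_CONSUMER_KEY",
    "consumer_secret": "YOUR_CONSUMER_SECRET",
    "filter_track": ["bigdata", "hadoop"],
    "max_docs": 1000,
    "sleep": 10000,
    "output_type": "com.lucid.sda.hbase.lws.HBaseUpdateController",
    "output_args": "localhost:5181"
  }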

High-Volume HDFS Attributes

The High Volume HDFS (HV-HDFS) data source uses a MapReduce-enabled crawler designed to leverage the scaling qualities of Apache Hadoop while indexing content.

To achieve this, HV-HDFS runs a series of MapReduce-enabled jobs that convert raw content into documents that can be indexed. These jobs rely on the Behemoth project (specifically, the LWE fork of that project) for MapReduce-ready document conversion via Apache Tika and for writing documents to LucidWorks.

The HV-HDFS data source is currently marked as "Early Access" and is thus subject to changes in how it works in future releases.

Note: Before using the HV-HDFS data source type, please review the section on Using the High Volume HDFS Crawler in the LucidWorks Search documentation.

When creating a data source of type high_volume_hdfs, the value lucid.map.reduce.hdfs must be supplied for the crawler attribute, described in the section on common attributes above. The common attributes are available for configuration in addition to those listed below.

  • MIME_type (string, optional, default null): Limits the crawl to content of a specific MIME type. If a value is entered, Behemoth will skip Tika's MIME detection and process the content with the appropriate parser. The MIME type should be entered in full, such as "application/pdf" for PDF documents.
  • hadoop_conf (string, required): The location of the Hadoop configuration directory that contains core-site.xml, mapred-site.xml, and other Hadoop configuration files. This path must reside on the same machine as the LucidWorks server. Hadoop does not need to be running on the same server as LucidWorks, but the configuration directory must be available from the LucidWorks server.
  • path (string, required): The input path where your data resides. It is not required that this be in HDFS to begin with, since the first step of the process converts content to one or more SequenceFiles in HDFS. For example, hdfs://bob:54310/path/to/content and file:///path/to/local/content would both be valid inputs.
  • recurse (boolean, optional, default true): If true, the default, the crawler will crawl all subdirectories of the input path. Set to false if the crawler should stay within the top directory specified with the path attribute.
  • tika_content_handler (string, optional, default com.digitalpebble.behemoth.tika.TikaProcessor): In most cases the default does not need to be changed. If you need to change it, enter the fully qualified class name of a Behemoth Tika processor that is capable of extracting content from documents. The class must be available on the classpath in the Job jar used by LucidWorks.
  • work_path (string, required, default /tmp): A path to use for intermediate storage. Note that the connector does not clean up temporary content, so it can be used for debugging if necessary. Once the job is complete, content stored temporarily in this location can be safely deleted. For example, hdfs://bob:54310/tmp/hv_hdfs would be a valid path.
  • zookeeper_host (string, required): The host and port where ZooKeeper is running and coordinating SolrCloud activity, entered as hostname:port.

Example high_volume_hdfs data source
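
The following is a minimal sketch of what a high_volume_hdfs data source definition might look like, based on the attributes described above. The paths, host names, and ports are placeholders and should be adjusted for your own Hadoop and ZooKeeper environment.

  {
    "name": "HV-HDFS example",
    "type": "high_volume_hdfs",
    "crawler": "lucid.map.reduce.hdfs",
    "collection": "documentation",
    "hadoop_conf": "/etc/hadoop/conf",
    "path": "hdfs://bob:54310/path/to/content",
    "work_path": "hdfs://bob:54310/tmp/hv_hdfs",
    "zookeeper_host": "localhost:2181",
    "recurse": true,
    "output_type": "com.lucid.sda.hbase.lws.HBaseUpdateController",
    "output_args": "localhost:5181"
  }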

Examples

Input
Get all data sources for the "documentation" collection.
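
A request along the lines of the following would list the data sources. The host, port, and credentials are placeholders; substitute the address of your own LucidWorks Big Data API endpoint.

  # Hypothetical host, port, and credentials; adjust for your deployment.
  curl -u user:pass http://localhost:8341/sda/v1/client/collections/documentation/datasources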

Output
The output below omits the mapping sub-attributes, which define how incoming content is handled. This example data source was created automatically by requesting documents to be added to the index with the Document Indexing API.

Create a Data Source

POST /sda/v1/client/collections/collection/datasources

Input

Path Parameters

  • collection: The collection name.

Query Parameters

None

Input content

JSON block with all attributes. The ID field, if present, will be ignored. See attributes in section on getting a list of data sources.

Output

Output Content

JSON representation of new data source. Attributes returned are listed in the section on getting a list of data sources.

Examples

Input

Create a new data source to consume Twitter in the "documentation" collection. Note that the access_token, consumer_key, consumer_secret and token_secret are only examples and should be replaced with your own values.
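
A request along the lines of the following would create the data source. The host, port, and credentials are placeholders, and the ZooKeeper connect string in output_args assumes a local installation.

  # Hypothetical host, port, credentials, and token values; adjust for your deployment.
  curl -u user:pass -X POST -H 'Content-type: application/json' \
    -d '{
          "name": "Twitter stream example",
          "type": "twitter_stream",
          "crawler": "lucid.twitter.stream",
          "collection": "documentation",
          "access_token": "YOUR_ACCESS_TOKEN",
          "token_secret": "YOUR_TOKEN_SECRET",
          "consumer_key": "YOUR_CONSUMER_KEY",
          "consumer_secret": "YOUR_CONSUMER_SECRET",
          "filter_track": ["bigdata"],
          "output_type": "com.lucid.sda.hbase.lws.HBaseUpdateController",
          "output_args": "localhost:5181"
        }' \
    http://localhost:8341/sda/v1/client/collections/documentation/datasources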

Output

Get Data Source Details

This API provides the settings information for a specific data source.

GET /sda/v1/client/collections/collection/datasources/id

Note that the only way to find the id of a data source is to either store it on creation, or use the API call referenced above to get a list of data sources.

Input

Path Parameters

  • collection: The collection name.
  • id: The data source ID.

Query Parameters

None.

Input content

None.

Output

Output Content

  • collection (string): The collection the data source belongs to.
  • id (integer): The ID of the data source.
  • properties (JSON map): All of the attributes for the data source.
  • status (string): The state of the data source. In most cases, this will simply be EXISTS.

Examples

Get all of the parameters for data source 49, created in the previous step.

Input
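
A request along the lines of the following would retrieve the details for data source 49 (the host, port, and credentials are placeholders):

  # Hypothetical host, port, and credentials; adjust for your deployment.
  curl -u user:pass http://localhost:8341/sda/v1/client/collections/documentation/datasources/49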

Output
The mapping attributes have been omitted from this example but will be returned in a successful response.

Update Data Source Details

This API allows updating the settings information for a specific data source.

PUT /sda/v1/client/collections/collection/datasources/id

Note that the only way to find the id of a data source is to either store it on creation, or use the API call referenced above to get a list of data sources.

Input

Path Parameters

  • collection: The collection name.
  • id: The data source ID.

Query Parameters

None.

Input content

JSON block with either all attributes or just those that need updating. Data source type, crawler type, and ID cannot be updated. Possible attributes are listed in the section above on getting a list of data sources.

Output

Output Content

The output is essentially a status report. If successful, the output will contain a line "status":"SUCCEEDED". It may also report FAILED if the submitted document was invalid for some reason. The other parts of the response are not essential at this point, only the status. Children will also be returned; in the case of adding documents, a SUCCEEDED status for a child indicates that the documents were successfully added to the Solr/LucidWorks index.

Examples

Input
Update the "max_docs" attribute for data source 49, created in the previous step.
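
A request along the lines of the following would update the attribute (the host, port, credentials, and new max_docs value are placeholders):

  # Hypothetical host, port, and credentials; adjust for your deployment.
  curl -u user:pass -X PUT -H 'Content-type: application/json' \
    -d '{"max_docs": 5000}' \
    http://localhost:8341/sda/v1/client/collections/documentation/datasources/49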

Output

Start a Data Source

POST /sda/v1/client/collections/collection/datasources/id

Input

Path Parameters

  • collection: The collection name.
  • id: The data source ID.

Query Parameters

None

Input content

JSON block with either all attributes or just those that need updating. Data source type, crawler type, and ID cannot be updated. Other attributes are listed in the section on getting a list of data sources.

Output

Output Content

The output is essentially a status report. If successful, the output will contain a line "status":"RUNNING". The other parts of the response are not essential at this point, only the status.

To check the ongoing status, you can use the LucidWorks Search Data Source Status API.

Examples

Input
Start data source 49.
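
A request along the lines of the following would start the data source (the host, port, and credentials are placeholders):

  # Hypothetical host, port, and credentials; adjust for your deployment.
  curl -u user:pass -X POST http://localhost:8341/sda/v1/client/collections/documentation/datasources/49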

Output

Delete a Data Source

DELETE /sda/v1/client/collections/collection/datasources/id

Input

Path Parameters

  • collection: The collection name.
  • id: The data source ID.

Query Parameters

None.

Input content

None.

Output

Output Content

None.

Examples

Input
Delete data source 48.
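
A request along the lines of the following would delete the data source (the host, port, and credentials are placeholders):

  # Hypothetical host, port, and credentials; adjust for your deployment.
  curl -u user:pass -X DELETE http://localhost:8341/sda/v1/client/collections/documentation/datasources/48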

Output

None. Check the listing of data sources to confirm deletion.