
Data sources describe a target repository of documents and how to access documents in the repository. The definitions include the location of the repository, any authentication credentials that are required, and how to handle links to other documents or available paths on a filesystem tree. Instructions for how to handle the output of the job and how to handle errors can also be provided. This description is then used to create a crawl job to be executed by a crawler.

The available parameters are defined by the data source type, and the available types are defined by the crawler, also referred to as a crawler controller.

A data source is defined by selecting a crawler controller, then specifying a valid type for that crawler. Some crawlers work with several types, while others only support a single type. It is important to match the correct crawler with the type of data source to be indexed. If the type specified is not allowed by the crawler, the API will return an error.

All data sources share some attributes (parameters), but each one also has attributes specific to that type. Review the supported attributes carefully when creating data sources with the API.

For an introduction to how crawling works in LucidWorks Search, please see the section Overview of Crawling.

At present, LucidWorks Search includes the following built-in crawler controllers, listed here with their symbolic names and the data source types each supports:

  • Aperture-based crawlers (lucid.aperture): web and file
  • DataImportHandler-based JDBC crawler (lucid.jdbc): jdbc
  • SolrXML crawler (lucid.solrxml): solrxml
  • Google Connector Manager-based crawler (lucid.gcm): sharepoint
  • Remote file system and pseudo-file system crawler (lucid.fs): file, smb, hdfs, s3, s3h, and ftp
  • Hadoop crawler (lucid.hadoop.*, one crawler per supported Hadoop distribution): hadoop
  • MongoDB crawler (lucid.mongodb): mongodb
  • Twitter search (lucid.twitter.search): twitter_search
  • Twitter stream (lucid.twitter.stream): twitter_stream
  • Azure Blob crawler (lucid.azure_blob): azure_blob
  • Azure Table crawler (lucid.azure_table): azure_table
  • Push crawler (lucid.push): push

API Endpoints

/api/collections/collection/datasources : list or create data sources in a particular collection

/api/collections/collection/datasources/id : update, remove, or get details for a particular data source

Get a List of Data Sources

GET /api/collections/collection/datasources

Input

Path Parameters

  • collection: The collection name.

Query Parameters

None.

Output

Output Content


A JSON map of attributes to values. The exact set of attributes for a particular data source depends on the type. There is, however, a set of attributes common to all data source types. Specific attributes are discussed in sections for those types, linked below.

Common Attributes

These attributes are used for all data source types (except where specifically noted).


General Attributes

The general attributes define the data source name, type, crawler to be used, and collection, among other details.


id (32-bit integer; optional; default: auto-assigned)
The numeric ID for this data source.

type (string; required)
The type of this data source. Valid types are:

  • file for a filesystem (remote or local, but must be paired with the correct crawler, as below)
  • web for HTTP or HTTPS web sites
  • jdbc for a JDBC database
  • solrxml for files in Solr XML format
  • sharepoint for a SharePoint repository
  • smb for a Windows file share (CIFS)
  • hdfs for a Hadoop filesystem
  • s3 for a native S3 filesystem
  • s3h for a Hadoop-over-S3 filesystem
  • azure_blob for an Azure Blob
  • azure_table for an Azure Table
  • mongodb for a MongoDB instance
  • push for an externally-managed data source
  • twitter_search for a Twitter search
  • twitter_stream for a Twitter stream
  • hadoop for high-volume crawling of a Hadoop filesystem. Note that this type is used with several crawlers, which are customized for each distribution of Hadoop that LucidWorks Search supports.

crawler (string; required)
The crawler implementation that handles this type of data source. The crawler must support the specified type; the types supported by each crawler are listed with it below, and a minimal example pairing a type with a compatible crawler follows this table. Valid crawlers are:

  • lucid.aperture for web and file types
  • lucid.fs for file, smb, hdfs, s3h, s3, and ftp types
  • lucid.gcm for sharepoint type
  • lucid.jdbc for jdbc type
  • lucid.solrxml for solrxml type
  • lucid.azure_blob for azure_blob type
  • lucid.azure_table for azure_table type
  • lucid.mongodb for mongodb type
  • lucid.push for push type
  • lucid.twitter.stream for twitter_stream type
  • lucid.twitter.search for twitter_search type
  • lucid.hadoop.apache1 for hadoop type with Apache Hadoop v1.x
  • lucid.hadoop.apache2 for hadoop type with Apache Hadoop v2.x
  • lucid.hadoop.cloudera for hadoop type with Cloudera CDH
  • lucid.hadoop.horton for hadoop type with Hortonworks HDP
  • lucid.hadoop.mapr for hadoop type with MapR Hadoop
  • lucid.hadoop.pivotal for hadoop type with Pivotal Hadoop

collection (string; required)
The name of the document collection that documents will be indexed into.

name (string; required)
A human-readable name for this data source. Names may consist of any combination of letters, digits, spaces, and other characters. Names are case-insensitive and do not need to be unique: several data sources can share the same name.

category (string; optional; default: null)
The category of this data source: Web, FileSystem, Jdbc, SolrXml, SharePoint, External, or Other. For informational purposes only.
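For illustration, here is a minimal sketch of a data source definition that pairs the web type with the lucid.aperture crawler. Only the general attributes described above are shown; the url attribute is assumed here as the type-specific starting point for a web crawl, and the full set of required attributes depends on the data source type.

{
  "name": "Example Web Site",
  "type": "web",
  "crawler": "lucid.aperture",
  "collection": "collection1",
  "url": "http://www.example.com/"
}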

Crawler Output
For most search applications, the default crawler output settings may be sufficient. With the default implementation, the output_type is set to "solr", and the output_args point to the location of Solr (read from master.conf as the setting of the LWE-Core component), along with some performance settings that can be modified as needed. However, if you are using LucidWorks Big Data, or integrating with another system that will consume the crawler output, you may want to modify these settings accordingly.


output_type (string; optional; default: "solr")
Advanced. Defines the way crawl output is handled. The following types are supported:

  • solr: The output will be sent to Solr for indexing.
  • NULL: The crawl output will be discarded. This is always entered as all upper-case.
  • com.lucid.crawl.impl.FileUpdateController: The output will be sent to a file.
  • com.lucid.crawl.script.ScriptPreprocessorUpdateController: This will allow a script to be run on the content before it is sent to Solr for indexing. Only Javascript is supported at this time. The script name is provided in output_args.
  • com.lucid.sda.hbase.lws.HBaseUpdateController: The output will be sent to an HBase implementation and is used in conjunction with LucidWorks Big Data only.

    Alternatively, this can be the fully-qualified class name of a custom implementation of UpdateController, created as part of a custom connector.

output_args (string; optional; default: see description)
Advanced. Defines where crawler output should be sent. Valid values depend on the output_type selected.

  • output_type is "solr": A few parameters are possible. If not defined, output_args will default to the Solr instance as defined in master.conf and the collection that uses the data source (for example, if LucidWorks has been installed in the default location and creating the data source for collection1, the path would be http://127.0.0.1:8888/solr/collection1). Two additional parameters are possible:
    • "threads": Defines the number of concurrent threads to use for sending updates. This does not define threads to use while crawling a data source, but threads to use while updating Solr via SolrJ. The default is 2.
    • "buffer": Defines the number of documents collected before sending to SolrJ in bulk, which can be used to reduce the number of calls to Solr. The default value is 1, which means no additional buffering. In general, setting this value to higher than one has little impact on performance when the number of threads is greater than 1; the performance benefits are usually seen when threads=1, but at the cost of increased Java heap consumption.
    • When using "threads" and "buffer" in configuration, express them as key=value pairs, separated by commas, with no whitespace between them. For example, "output_args":"buffer=2,threads=10" is a valid input (and will use the default Solr location). If one or more of the values is missing, the default is used.
  • output_type is "com.lucid.crawl.impl.FileUpdateController": The output_args must be a URI string for a file path. It must point to either a directory (which must exist) or a filename that will be created during the crawl (which must not exist prior to the crawl). The path will be interpreted as entered, so absolute paths should be used whenever possible. Relative paths will be interpreted relative to the working directory of the Connectors component, which is $LWS_HOME.
  • output_type is "com.lucid.crawl.script.ScriptPreprocessorUpdateController": The output_args are a script name, in the form "script=name", without any file extension. The system automatically looks for the script name with a .js file extension, which means only JavaScript scripts are supported at this time. The script must be located in $LWS_HOME/conf/<data source name>.
  • output_type is "com.lucid.sda.hbase.lws.HBaseUpdateController": The output_args must be the host:port of the ZooKeeper instance, again used with LucidWorks Big Data only. The value must be a string that HBase can interpret, similar to localhost:2181.

    If using a custom implementation of UpdateController, this attribute can contain whatever arguments you defined for that class.
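As a brief sketch of how these settings might appear in a data source definition, the fragment below keeps the default Solr output and uses the buffer and threads arguments described above (all other attributes omitted):

{
  "output_type": "solr",
  "output_args": "buffer=2,threads=10"
}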

Field Mapping
The output also includes the field mapping for the data source, which is modifiable as part of the regular data source update API. The mappings for a data source can also be updated with the Field Mapping API. Note that not all data sources support field mapping.


The data source attribute mapping contains a JSON map with the following keys and values:

mapping (JSON string-string map; optional; default: see the attributes listed in this table)
A map where the keys are case-insensitive names of the original metadata keys, and the values are case-sensitive names of fields that make sense in the current schema. The target field names are verified against the current schema, and mappings to fields that are not valid are removed. Note that null target names are allowed; source fields with such mappings will be discarded.

datasource_field (string; optional; default: "data_source")
A prefix for index fields that are needed for LucidWorks faceting and data source management. In general, this will be adjusted to match the schema.xml value. However, in cases where no indexing will be performed (i.e., batch processing is being performed), the schema.xml is not available for checking, so it may be useful to edit this parameter manually to fit the expected schema value. If performing a normal crawl (i.e., the crawler finds the documents, parses them, and passes them along for indexing), this field should be left as the default.

default_field (string; optional; default: null)
The field name to use if the source name doesn't match any mapping. If null, then dynamic_field will be used; if that is also null, the original name will be used.

dynamic_field (string; optional; default: "attr")
If not null, then source names without specific mappings will be mapped to <dynamic_field>_<source name> (for example, attr_author), after some cleanup of the source name (non-letter characters are replaced with underscores).

literals (JSON string-string map; optional; default: null)
An optional map that specifies static pairs of keys and values to be added to output documents.

lucidworks_fields (boolean; optional; default: true)
If true (the default), the field mapping process will automatically add LucidWorks-specific fields (such as data_source and data_source_type) to the documents. In some cases the data source information is already present in the documents, such as with Solr XML documents, and this setting should then be false. However, without this information, LucidWorks will not be able to properly identify documents from a specific data source, and would not be able to show accurate document counts, display the documents in facets, or delete documents if necessary.

mappings (JSON string-string map; optional; default: see description)
The mappings section contains a list of source fields and the target fields they will be mapped to. Several mappings are defined by default; see the list in the Field Mapping section of the Overview of Crawling.

When the mapping is created or updated, LucidWorks checks the mappings against the schema.xml for the collection and verifies that the target fields exist in the schema.

During indexing, the field mapping process performs the following steps:

  1. The mappings are checked for the existence of the source field name. If it exists, it will be mapped to the target field.
  2. If the source field name does not exist in the mappings, the schema.xml for the collection is checked. If the source field name exists in the schema, it will be indexed to that field.
  3. If a dynamic_field has been defined, a dynamic field will be created according to the dynamic field rule.
  4. If a default_field has been defined, the source field will be mapped to the defined default field.
  5. If none of these steps has produced a match, the field will be discarded.

multi_val (JSON string-boolean map; optional; default: {"acl": true, "author": true, "body": false, "dateCreated": false, "description": false, "fileSize": false, "mimeType": false, "title": false})
A map of target field names that is automatically initialized from the schema, based on each target field's multiValued attribute. In general, this will be adjusted to match the schema.xml value. However, in cases where no indexing will be performed (i.e., batch processing is being performed), the schema.xml is not available for checking, so it may be useful to edit this parameter manually to fit the expected schema value. If performing a normal crawl (i.e., the crawler finds the documents, parses them, and passes them along for indexing), this field should be left as the default.

Field mapping normalization is a step applied after all target names for field values have been resolved, including substitution with dynamic or default field names. This step checks that values are compatible with the index schema. The following checks are performed:

  • For the "mimeType" field: if it is defined as multiValued=false, then only the longest (probably most specific) value is retained, and all other values are discarded.
  • If field type is set to DATE in the field mapping, first the values are checked for validity and invalid values are discarded. If multiValued=false in the target schema, then only the first remaining value will be retained, and all other values are discarded.
  • If field type is STRING, and multiValued=false in the target schema, then all values are concatenated using a single space character, so that the resulting field has only single concatenated value.
  • For all other field types, if multiValued=false and multiple values are encountered, only the first value is retained and all other values are discarded.

original_content (boolean; optional; default: false)
If true, adds the ability to store the original raw bytes of any document. By default it is false. If this is enabled, a field called "original_content" will be added to each document, containing the raw bytes of the original document. The field is subject to normal field mapping rules, which means that if this field is not defined in the schema.xml file, it will be added dynamically as attr_original_content according to the default rules of field mapping. If the "attr_" dynamic rule has been removed, this field may be dropped during field mapping when it is not defined in schema.xml (which it is not by default, so it may need to be added, depending on your configuration).

The data source types that use the lucid.fs, lucid.aperture, and lucid.gcm crawlers (so, data source types Web, File, SMB, HDFS, S3, S3H, FTP, and SharePoint) are the only ones that support this attribute. It is not possible to store original binary content for the Solr XML, JDBC, Push, Twitter Search or Twitter Stream data source types.

types (JSON string-string map; optional; default: {"date": "DATE", "datecreated": "DATE", "filesize": "LONG", "lastmodified": "DATE"})
A map pre-initialized from the current schema. Additional validation can be performed on fields with declared non-string types. Currently supported types are DATE, INT, LONG, DOUBLE, FLOAT, and STRING. Fields not specified in the map are assumed to have the type STRING.

The map is pre-initialized from the types definition in schema.xml in the following ways:

  • Any class with DateField becomes DATE
  • Any class that ends with *DoubleField becomes DOUBLE
  • Any class that ends with *FloatField becomes FLOAT
  • Any class that ends with *IntField or *ShortField becomes INT
  • Any class that ends with *LongField becomes LONG
  • Anything else not listed above becomes STRING

unique_key (string; optional; default: "id")
Defines the document field to use as the unique key in the Solr schema. For example, if the schema uses "id" as the unique key field name, and the unique_key attribute is set to "url", then field mapping will map "url" to "id". By default, this will be adjusted to match the schema.xml value. However, in cases where no indexing will be performed (i.e., batch processing is being performed), the schema.xml is not available for checking, so it may be useful to edit this parameter manually to fit the expected schema value. If performing a normal crawl (i.e., the crawler finds the documents, parses them, and passes them along for indexing), this field should be left as the default. With push data sources, this parameter maps a field from the incoming documents to be the unique key for all documents.

verify_schema (boolean; optional; default: true)
If true (the default), the field mapping will be validated against the current schema at the moment the crawl job is started. This may result in dropping some fields or changing their multiplicity so they conform to the current schema. The modified mapping is not propagated back to the data source definition (i.e., it is not saved permanently). In this way, the schema can be modified without having to modify the data source mapping definition; however, it may also be more difficult to learn what the final field mapping was. If this value is false, then the field mapping rules are not verified and are applied as-is, which may result in exceptions if documents are added that don't match the current schema (e.g., incoming documents have multiple values in a field when the schema expects a single value).
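As a sketch of how these keys fit together, and following the statement above that the data source's mapping attribute contains them, a field mapping block might look like the following. The source and target field names are illustrative only; the default mapping for your data source type may differ.

"mapping": {
  "mappings": {
    "creator": "author",
    "subject": "title"
  },
  "literals": {
    "department": "engineering"
  },
  "dynamic_field": "attr",
  "unique_key": "id",
  "verify_schema": true
}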

Optional Commit Rules


The following attributes are optional and relate to when new documents will be added to the index:

commit_within (integer; optional; default: 900000)
The maximum number of milliseconds between commits while indexing documents. The default is 900,000 milliseconds (15 minutes).

commit_on_finish (boolean; optional; default: true)
When true (the default), a commit will be invoked at the end of the crawl.
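For example, a data source that commits at most every five minutes and also commits when the crawl finishes might include the following (values illustrative):

{
  "commit_within": 300000,
  "commit_on_finish": true
}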

Batch Processing
The following attributes control batch processing and are also optional.


See also Processing Documents in Batches, as some crawlers support only a subset of the batch processing options. Note that the MapR High Volume Data Sources and High-Volume HDFS Data Sources do not support any kind of batch processing.

parsing (boolean; optional; default: true)
When true (the default), the crawlers will parse rich formats immediately. When false, other processing is skipped and raw input documents are stored in a batch.

indexing (boolean; optional; default: true)
When true (the default), parsed documents will be sent immediately for indexing. When false, parsed documents will be stored in a batch.

caching (boolean; optional; default: false)
When true, both raw and parsed documents will always be stored in a batch, in addition to any other requested processing. When false (the default), a batch is not created and documents are not preserved unless one of the other options above causes them to be stored.
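For example, to parse documents but store the parsed output in a batch rather than sending it for indexing, a data source might include the following sketch of these options:

{
  "parsing": true,
  "indexing": false,
  "caching": false
}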

Attributes for Specific Data Source Types

Each data source type has attributes specific to that type. To find the attributes for a specific data source type, see the API documentation page for that type.

Examples

Input
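A sketch of the request, assuming the API is available on the default port 8888 (as in the default Solr location noted above) and the collection is named collection1:

curl http://localhost:8888/api/collections/collection1/datasources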

Output

JSON output of all configured data sources.

Create a Data Source

POST /api/collections/collection/datasources

Input

Path Parameters

  • collection: The collection name.

Query Parameters

None

Input content

JSON block with all attributes. The id attribute should not be included because it will be automatically generated by the system. See the attributes listed in the section on getting a list of data sources.

Output

Output Content

JSON representation of the new data source. Attributes returned are listed in the section on getting a list of data sources.

Examples

Create a data source that includes the content of the LucidWorks web site. To keep the size down, only crawl two levels, and do not index the blog tag links or any search links. Also, do not wander off the site and index any external links.

Input
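A sketch of the request, assuming the API on the default port 8888 and a collection named collection1. The type-specific attributes url and crawl_depth are assumptions for illustration; the exclusion rules described above would be added using the attributes documented for the Web data source type.

curl -X POST -H 'Content-type: application/json' -d '{
  "name": "Lucid web site",
  "type": "web",
  "crawler": "lucid.aperture",
  "collection": "collection1",
  "url": "http://www.lucidworks.com/",
  "crawl_depth": 2
}' http://localhost:8888/api/collections/collection1/datasources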

Output

Get Data Source Details

GET /api/collections/collection/datasources/id


This call requires knowing the ID of the data source. There is no way to query for the ID by name, so the only way to find the ID of a data source is to use the API call to get a list of data sources.

Input

Path Parameters

  • collection: The collection name.
  • id: The data source ID.

Query Parameters

None

Input content

None

Output

Output Content

Attributes returned are listed in the section on getting a list of data sources.

Examples

Get all of the parameters for data source 6, created in the previous step.

Input
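A sketch of the request, again assuming the API on port 8888 and the collection collection1:

curl http://localhost:8888/api/collections/collection1/datasources/6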

Output

Update a Data Source

PUT /api/collections/collection/datasources/id

Input

Path Parameters

  • collection: The collection name.
  • id: The data source ID.

Query Parameters

None

Input content

JSON block with either all attributes or just those that need updating. The attributes type (data source type), crawler (crawler type), and id (ID) cannot be updated.

Output

Output Content

None

Examples

Change the web data source so that it crawls three levels instead of just two:

Input
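A sketch of the request; as in the earlier example, crawl_depth is an assumed attribute name for the Web data source type:

curl -X PUT -H 'Content-type: application/json' -d '{"crawl_depth": 3}' http://localhost:8888/api/collections/collection1/datasources/6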

Output

None. Check data source properties to confirm changes.

Delete a Data Source


The Data Source DELETE command will delete documents associated with the data source as of v2.5 (in prior versions it did not). To keep the documents, add keep_docs=true to the delete request, after the id. For example:
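DELETE /api/collections/collection/datasources/id?keep_docs=true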

DELETE /api/collections/collection/datasources/id

Input

Path Parameters

  • collection: The collection name.
  • id: The data source ID.

Query Parameters

None

Input content

None

Output

Output Content

None

Examples

Input
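A sketch of a request that deletes data source 6 and its documents, assuming the API on port 8888 and the collection collection1:

curl -X DELETE http://localhost:8888/api/collections/collection1/datasources/6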

Output

None. Check the listing of data sources to confirm deletion.