The Indexing Settings screen allows you to configure a few general options to fine-tune your index.

De-duplication can be configured from this screen, with a few options for how to identify and process duplicate documents.

If you update data frequently, or expect new data to be searchable very shortly after it is added, you may want to adjust the default commit settings, which govern how quickly data added to the index becomes searchable to users.

This screen also allows you to schedule index-related activities such as processing the Click Scoring logs for boost data, creating and updating the auto-complete index, and optimizing the index. See the section Activities below for details.

Indexing Options

Several options are available at the top of the page. Most of these options are related to automatic commit parameters, but you can also define the default field type and de-duplication settings.

These settings can also be modified with the Settings API, and the related API attribute names are provided below.

De-duplication

In LucidWorks Search, duplicates can be identified by calculating a hash that identifies very similar documents.

This setting enables de-duplication in general. The specific fields used to detect duplicates should be selected on the Field Configuration screen or with the Fields API. If no fields are selected, all fields of a document are used to judge whether it is a duplicate.

You can choose from three possible methods of handling duplicates:

  • Off does not identify duplicate documents within the index.
  • Tag identifies duplicates with a unique tag stored in the signatureField, but does not remove duplicate documents from the index. This approach is recommended, although it does require using field collapsing or another method to remove duplicates from the search results for users.
  • Overwrite overwrites duplicate documents with incoming documents. This should only be used if you are sure that the duplicate detection is working the way you expect.

Note that de-duplication does not work properly in SolrCloud mode.

If using the LucidWorks Search REST API, this can be set with the de_duplication parameter of the Settings API.
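
The three handling modes can be illustrated with a small in-memory sketch. This is not how LucidWorks Search or Solr implements de-duplication: it uses an exact MD5 hash rather than a hash that matches very similar documents, the index is a plain Python list, and the names `signature`, `add_document`, and `SIGNATURE_FIELD` are hypothetical.

```python
import hashlib

# Hypothetical name for the field that holds the tag (the signatureField).
SIGNATURE_FIELD = "signature"


def signature(doc, fields=None):
    """Hash the selected fields, or every field if none are selected."""
    names = sorted(fields) if fields else sorted(doc)
    digest = hashlib.md5()
    for name in names:
        digest.update(name.encode("utf-8"))
        digest.update(b"\x00")
        digest.update(str(doc.get(name, "")).encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


def add_document(index, doc, mode="tag", fields=None):
    """Add doc to index (a list of dicts) using mode 'off', 'tag', or 'overwrite'."""
    if mode == "off":
        # No duplicate detection at all.
        index.append(dict(doc))
        return
    sig = signature(doc, fields)
    if mode == "overwrite":
        # The incoming document replaces any document with the same signature.
        index[:] = [d for d in index if d.get(SIGNATURE_FIELD) != sig]
    # In both 'tag' and 'overwrite' modes the signature is stored with the document.
    stored = dict(doc)
    stored[SIGNATURE_FIELD] = sig
    index.append(stored)
```

In tag mode both copies stay in the index and share a signature value, which is why something like field collapsing on that field is needed at query time.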

Default Field Type

If fields are found in documents that do not correspond with defined LucidWorks fields, this setting assigns a default field type in order to parse text found in that field.

If using the LucidWorks Search REST API, this can be set with the unknown_type_handling parameter of the Settings API.

Commit Settings

Commits in LucidWorks Search (and Solr) control how often documents being added to the index are made searchable to users. If you are not familiar with commits, please refer also to the Commits section of the Apache Solr Reference Guide.

In general, you can set commits to happen when a set number of documents have been queued, or at set intervals of time, or a combination of the two.
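
To make the combination of a document limit and a time limit concrete, here is a toy model, assuming a commit fires as soon as either limit is reached. It is only a sketch of the idea and not Solr's implementation; the class `AutoCommitBuffer` and its methods are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class AutoCommitBuffer:
    """Toy model: commit when max_docs are queued or max_time_ms has elapsed."""

    def __init__(self, max_docs, max_time_ms):
        self.max_docs = max_docs
        self.max_time_ms = max_time_ms
        self.pending = []             # documents added but not yet searchable
        self.committed = []           # documents visible to searches
        self.first_pending_ms = None  # arrival time of the oldest pending document

    def add(self, doc, now_ms):
        if not self.pending:
            self.first_pending_ms = now_ms
        self.pending.append(doc)
        self.maybe_commit(now_ms)

    def maybe_commit(self, now_ms):
        """Commit pending documents if either limit has been reached."""
        if not self.pending:
            return False
        too_many = len(self.pending) >= self.max_docs
        too_old = now_ms - self.first_pending_ms >= self.max_time_ms
        if too_many or too_old:
            self.committed.extend(self.pending)
            self.pending = []
            self.first_pending_ms = None
            return True
        return False
```

With `AutoCommitBuffer(max_docs=3, max_time_ms=1000)`, the third queued document triggers a commit immediately, while a single queued document becomes searchable once 1000 ms have passed.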

Parameter

Settings API Attribute Name

Description

Auto-commit max docs

update_handler_autocommit_max_docs

This setting defines the number of documents to queue before pushing them to the index, also known as the maxDocs parameter for autoCommit definitions in the solrconfig.xml file for the collection. It works in conjunction with the "Auto-commit max time" parameter in that if either limit is reached, the pending updates will be pushed to the index.

Auto-commit max time (ms)

update_handler_autocommit_max_time

This setting defines the number of milliseconds to wait before pushing documents to the index, also known as the maxTime parameter for autoCommit definitions in the solrconfig.xml file for the collection. It works in conjunction with the "Auto-commit max docs" parameter in that if either limit is reached, the pending updates will be pushed to the index.

Auto-soft-commit max docs

update_handler_autosoftcommit_max_docs

This setting defines the number of documents to queue before performing a "soft commit" (used with Solr's Near Real Time searching) and pushing the documents to the index. It is also known as the maxDocs parameter for autoSoftCommit definitions in the solrconfig.xml file for the collection. It works in conjunction with the "Auto-soft-commit max time" parameter in that if either limit is reached, the documents will be pushed to the index.

Auto-soft-commit max time (ms)

update_handler_autosoftcommit_max_time

This setting defines the number of milliseconds to wait before performing a "soft commit" (used with Solr's Near Real Time searching) and pushing the documents to the index. It is also known as the maxTime parameter for autoSoftCommit definitions in the solrconfig.xml file for the collection. It works in conjunction with the "Auto-soft-commit max docs" parameter in that if either limit is reached, the documents will be pushed to the index.

If you have made changes to any of these settings, click Update to save your changes.
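
The same four values could also be sent through the Settings API. The sketch below only builds the request and does not send it. The attribute names come from the table above; the host, port, URL path, collection name, HTTP method, and the numeric values are assumptions for illustration and should be checked against the Settings API documentation.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8888"  # assumed host and port
COLLECTION = "collection1"          # assumed collection name


def build_settings_request(settings):
    """Build (but do not send) a request that updates collection settings."""
    # Assumed URL layout and method; verify against the Settings API docs.
    url = f"{BASE_URL}/api/collections/{COLLECTION}/settings"
    body = json.dumps(settings).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Illustrative values only, not recommendations.
commit_settings = {
    "update_handler_autocommit_max_docs": 10000,
    "update_handler_autocommit_max_time": 60000,
    "update_handler_autosoftcommit_max_docs": 1000,
    "update_handler_autosoftcommit_max_time": 5000,
}

request = build_settings_request(commit_settings)
# To send it: urllib.request.urlopen(request)
```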

Activities

The second half of the page allows configuration and monitoring of some essential system processes, called Activities. Next to each process name, the status of the process is shown, along with when it was last run and how long the run took. If a schedule has been set for the activity, that will be shown, and the Edit button will allow setting a new schedule or editing an existing one.

You can schedule these activities:

Activity

Description

Optimize index

Optimizes the internal Apache Lucene data structures for better performance in searching. Optimizing a large index can take a long time, so it should be done judiciously based on when indexing completes.

Process click logs

Processes the click.log to create a file for Click Scoring to use in relevancy ranking calculations.

Generate Auto-Complete index

Creates the index required for implementing automatic suggestions for user queries as they type.

Activity schedules can be automatically disabled by LucidWorks Search if they fail on a consistent basis, but not all failures will trigger a deactivation of the schedule. If there is a missing parameter or other fatal error in the configuration of the task that will always cause an error, the schedule will be deactivated. Similarly, if another circumstance causes the activity to fail when it launches (for example, the Solr handler is missing), and the scheduled task fails three times, the schedule will be deactivated. If a task consistently runs for a time and then fails at some point, that will not trigger deactivation of the schedule.
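
The deactivation rules above can be summarized as a small decision sketch. The class `Schedule`, the outcome labels, and the choice not to reset the failure counter after a successful run are all assumptions made for illustration; the product's actual bookkeeping is not described here.

```python
LAUNCH_FAILURE_LIMIT = 3  # "fails three times" in the description above


class Schedule:
    """Toy model of when an activity schedule is deactivated."""

    def __init__(self):
        self.active = True
        self.launch_failures = 0

    def record_run(self, outcome):
        """Record one run. outcome is 'ok', 'config_error',
        'launch_failure', or 'failed_mid_run'. Returns whether the
        schedule is still active."""
        if outcome == "config_error":
            # A fatal configuration error will always fail: deactivate.
            self.active = False
        elif outcome == "launch_failure":
            # Assumption: failures are counted without resetting on success.
            self.launch_failures += 1
            if self.launch_failures >= LAUNCH_FAILURE_LIMIT:
                self.active = False
        # 'ok' and 'failed_mid_run' never deactivate the schedule.
        return self.active
```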

Times shown in the drop-down menu will be saved in GMT, but will be displayed on the Indexing Settings page in the time zone configured for each user. If different users have different time zones configured via the User Management screen, each user will see the schedule for the activity in their own local time.
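
As an illustration of this save-in-GMT, display-in-local-time behavior, here is a minimal sketch using Python's standard library. The helper names `to_gmt` and `for_display`, the fixed-offset time zones, and the example date are all hypothetical; real time zones with daylight saving rules would need a time zone database.

```python
from datetime import datetime, timedelta, timezone


def to_gmt(local_time):
    """Convert a time-zone-aware local time to GMT for storage."""
    return local_time.astimezone(timezone.utc)


def for_display(stored_gmt, user_tz):
    """Convert a stored GMT time to a user's configured time zone."""
    return stored_gmt.astimezone(user_tz)


# Fixed offsets stand in for two users' configured time zones.
user_a_tz = timezone(timedelta(hours=-5))
user_b_tz = timezone(timedelta(hours=9))

# User A schedules an activity for 22:00 in their local time.
chosen = datetime(2013, 1, 15, 22, 0, tzinfo=user_a_tz)
stored = to_gmt(chosen)

shown_to_a = for_display(stored, user_a_tz)
shown_to_b = for_display(stored, user_b_tz)
```

All three values refer to the same instant; only the wall-clock rendering differs for each user.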

Activity schedules can also be set using the Activities API.
