Any term in a query, except for quoted phrases, may contain one or more wildcard characters. A wildcard character indicates that the term is actually a pattern that may match any number of terms. There are two wildcard characters: the asterisk ("*"), which matches zero or more arbitrary characters, and the question mark ("?"), which matches exactly one arbitrary character.
A term consisting only of one or more asterisks will match all terms of the field in which it is used. For example, title:*.
The most common use of asterisk is as the last character of a query term to match all terms that begin with the rest of the query term as a prefix. For example, paint*.
One traditional use of asterisk is to force plurals to match. This use is usually unnecessary because LucidWorks uses a stemming filter to automatically match both singular and plural forms. However, this technique may still be useful if the administrator chooses to disable the stemming filter or for fields that may not have a stemming filter. For example, Sneaker* will match both "sneaker" and "sneakers".
A question mark can be used where there might be variations for a single character. For example:
|?at||"cat", "bat", "fat", "kat", and so on|
|c?t||"cat", "cot", "cut"|
|ca?||"cab", "can", "cat", and so on|
Any combination of asterisk and question mark wildcards can be used in a single term, but care is needed to avoid unexpected results.
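The matching rules above can be illustrated with a minimal Python sketch that translates a wildcard term into a regular expression. This is only an illustration of the semantics, not how Lucene or the Lucid query parser evaluates wildcards; the function names and the sample term list are hypothetical, and escaping is ignored.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard term into a regular expression (illustrative only)."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # zero or more arbitrary characters
        elif ch == "?":
            parts.append(".")    # exactly one arbitrary character
        else:
            parts.append(re.escape(ch))  # any other character matches itself
    return re.compile("".join(parts), re.DOTALL)

def matching_terms(pattern: str, terms: list[str]) -> list[str]:
    """Return the terms that the whole wildcard pattern matches."""
    regex = wildcard_to_regex(pattern)
    return [t for t in terms if regex.fullmatch(t)]

# Hypothetical set of indexed terms for one field.
terms = ["cat", "cot", "cut", "coat", "paint", "painter", "sneaker", "sneakers"]

print(matching_terms("c?t", terms))       # ['cat', 'cot', 'cut']
print(matching_terms("paint*", terms))    # ['paint', 'painter']
print(matching_terms("sneaker*", terms))  # ['sneaker', 'sneakers']
```

Note that "c?t" does not match "coat", because a question mark stands for exactly one character, while a lone "*" matches every term in the list.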
Note that wildcards are not supported within quoted phrases. They will be treated as if they were white space. Wildcards can be used for non-text fields.
If you need to use a non-wildcard asterisk or question mark in a non-text field, be sure to escape each of them with a backslash. For example, ABC\*DEF\?GHI will match the literal term "ABC*DEF?GHI".
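When query strings are built programmatically, the escaping can be applied by a small helper. The following is a minimal sketch; the function name is hypothetical, and it escapes only the two wildcard characters discussed here, not other special query syntax characters.

```python
def escape_wildcards(term: str) -> str:
    """Backslash-escape every '*' and '?' so they are treated as literal characters."""
    return "".join("\\" + ch if ch in "*?" else ch for ch in term)

print(escape_wildcards("ABC*DEF?GHI"))  # ABC\*DEF\?GHI
```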
If you need to use a trailing question mark wildcard at the end of a query that starts with a question word (who, what, when, where, why or how), be sure to add a space and some extraneous syntax such as a +, otherwise the natural language query heuristic will discard that trailing question mark. For example:
|What is aspirin?||The question mark is ignored|
|myField: XX/YY/Z?||The question mark is treated as a wildcard|
|Where is part AB004x?||The question mark is ignored|
|Where is part AB004x? +||The question mark is treated as a wildcard and the extraneous "+" will be ignored|
|myField: XX/YY/Z? +||The question mark is treated as a wildcard and the extraneous "+" will be ignored|
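The behavior shown in the table can be approximated with a short Python sketch. This is a hypothetical approximation inferred from the examples above, not the parser's actual heuristic, which may take other factors into account; the function name is an assumption.

```python
QUESTION_WORDS = ("who", "what", "when", "where", "why", "how")

def trailing_question_mark_is_wildcard(query: str) -> bool:
    """Guess whether a '?' ending a term is kept as a wildcard.

    Hypothetical approximation of the documented behavior only.
    """
    stripped = query.strip()
    words = stripped.split()
    if not words:
        return False
    starts_with_question_word = words[0].lower() in QUESTION_WORDS
    # A natural language question: the final '?' is discarded.
    if starts_with_question_word and stripped.endswith("?"):
        return False
    # Otherwise any term ending in '?' keeps its wildcard.
    return any(word.endswith("?") for word in words)

for q in ["What is aspirin?", "myField: XX/YY/Z?",
          "Where is part AB004x?", "Where is part AB004x? +"]:
    print(q, "->", trailing_question_mark_is_wildcard(q))
```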
Wildcards can be placed at the start of terms, such as *ation, which is known as a leading wildcard or sometimes as a suffix query. The syntaxes are the same as described above, but there may be local performance considerations that need to be evaluated.
Lucene and Solr technically support leading wildcards, but the traditional query parsers usually disable this feature by default because of concerns about query performance, since a leading wildcard tends to select a large percentage of indexed terms. The Lucid query parser supports leading wildcards by default, but this feature may be disabled by setting the leadWild configuration setting in solrconfig.xml to 'false'. To address the performance concerns, Lucene 2.9+ and Solr 1.4+ support a 'reversed wildcards' (or 'reversed tokens') strategy.
This optimization is disabled by default. To enable it, you must manually add the ReversedWildcardFilterFactory filter to the end of the index analyzer chain of each field type in the schema.xml file whose fields require this optimization.
This affects all fields for the selected field types, so if you have multiple fields of a selected type and do not want this feature for all of them, you must create a new field type to use for the selected field.
The Lucid query parser will detect when leading wildcards are used and invoke the reversal filter, if present in the index analyzer, to reverse the wildcard term so that it will generate the proper query term that will match the reversed terms that are stored in the index for this field.
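The idea behind the reversal strategy can be sketched in a few lines of Python. This is a simplified illustration, not the filter's implementation: the marker character used to tell reversed terms apart from original terms, the function names, and the sample terms are all assumptions, and escaped characters are ignored.

```python
MARKER = "\u0001"  # hypothetical marker that flags a reversed term

def index_terms(term: str) -> list[str]:
    """Terms stored for one token: the original plus its reversed form."""
    return [term, MARKER + term[::-1]]

def reverse_wildcard_term(pattern: str) -> str:
    """Reverse a leading-wildcard term so that the wildcard moves to the end."""
    return MARKER + pattern[::-1]

# Hypothetical index content for one field.
indexed = index_terms("nation") + index_terms("station") + index_terms("paint")

query = reverse_wildcard_term("*ation")   # marker + 'noita*'
prefix = query.rstrip("*")                # now an ordinary prefix lookup
hits = [t[1:][::-1] for t in indexed if t.startswith(prefix)]
print(hits)  # ['nation', 'station']
```

Because the reversed query term ends in the wildcard, the lookup becomes a prefix match, which is the cheap case for a sorted term dictionary.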
The rules for what constitutes a leading wildcard are not contained within the Lucid query parser itself. Rather, the query parser invokes the filter factory (if present) to inquire whether a given wildcard term satisfies the rules. There are a variety of optional parameters for the filter factory, described below, to control the rules. By default, a query term is considered to have a leading wildcard, and to be a candidate for reversal, only if both of the following hold: there is an asterisk in the first or second position or a question mark in the first position, and neither of the last two positions is a wildcard. If a wildcard query term does not meet these conditions, the wildcard query will be performed with the usual, un-reversed wildcard term.
Use of the wildcard reversal filter will double the number of terms stored in the index for all fields of the selected field type since the filter stores the original term and the reversed form of the term at the same position.
There is no change to the query analyzer for the optimized field or field type. The reversal filter factory must be specified only for the index analyzer.
As an example, the index analyzer for field type text_en should appear as follows after you have manually edited schema.xml to add the wildcard reversal filter at the end of the index analyzer for this field type:
You must place the wildcard reversal filter at the end of the index analyzer for the field type since it is reversing the final form of the terms as they would normally be stored in the index.
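The placement rule can be checked mechanically. The sketch below embeds a hypothetical field type definition (the tokenizer and the lowercase filter are assumptions, not the actual text_en analyzer chain) and verifies that the reversal filter is the last filter of the index analyzer and is absent from the query analyzer. The function name is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical field type; only the position of the reversal filter matters here.
SCHEMA_FRAGMENT = """
<fieldType name="text_en" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
"""

REVERSAL = "solr.ReversedWildcardFilterFactory"

def reversal_filter_correctly_placed(field_type_xml: str) -> bool:
    """True if the reversal filter ends the index analyzer and appears nowhere else."""
    root = ET.fromstring(field_type_xml.strip())
    index_ok = False
    for analyzer in root.findall("analyzer"):
        classes = [f.get("class") for f in analyzer.findall("filter")]
        if analyzer.get("type") == "index":
            index_ok = (bool(classes) and classes[-1] == REVERSAL
                        and classes.count(REVERSAL) == 1)
        elif REVERSAL in classes:
            return False  # the filter must not be in the query analyzer
    return index_ok

print(reversal_filter_correctly_placed(SCHEMA_FRAGMENT))  # True
```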
Although this feature improves the performance of leading wildcards, it will not improve the performance of search terms that have both leading and trailing wildcards, since such a term will still have a leading wildcard even after being reversed. In such a case, which depends on the rule settings, the filter factory will inform the Lucid query parser that such a wildcard term is not a candidate for reversal. In that case, the Lucid query parser would generate a wildcard query using the un-reversed wildcard term.
The filter factory has several optional parameters to precisely control what forms of wildcard are considered leading and candidates for reversal at query time:
- maxPosAsterisk="n" -- maximum position (1-based) of the asterisk wildcard ('*') that triggers the reversal of a query term. Asterisks that occur at higher positions will not cause the reversal of the query term. The default is 2, meaning that asterisks in positions 1 and 2 will cause a reversal (assuming that the other conditions are met).
- maxPosQuestion="n" -- maximum position (1-based) of the question mark wildcard ('?') that triggers the reversal of a query term. The default is 1. Set this to 0 and set maxPosAsterisk to 1 to reverse only pure suffix queries (i.e., those with a single leading asterisk).
- maxFractionAsterisk="n.m" -- additional parameter that triggers the reversal if the position of at least one asterisk ('*') is at less than this fraction (0.0 to 1.0) of the query term length. The default is 0.0 (disabled).
- minTrailing="n" -- minimum number of trailing characters in the query term after the last wildcard character. For best performance this should be set to a value larger than 1. The default is 2.
These optional parameters only affect query processing, but must be associated with the index analyzer even though they do not affect indexing itself.
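The way these parameters combine can be sketched as a single decision function. This is a minimal Python illustration written from the descriptions above, not the filter factory's actual code: the function and argument names are hypothetical, escaped wildcards are ignored, and the fraction test is assumed to compare the zero-based offset of the first asterisk against the term length.

```python
def should_reverse(term: str,
                   max_pos_asterisk: int = 2,
                   max_pos_question: int = 1,
                   max_fraction_asterisk: float = 0.0,
                   min_trailing: int = 2) -> bool:
    """Decide whether a wildcard term is a candidate for reversal (sketch)."""
    first_star = term.find("*")
    first_question = term.find("?")
    if first_star == -1 and first_question == -1:
        return False  # not a wildcard term at all

    # Characters remaining after the last wildcard of either kind.
    last_wildcard = max(term.rfind("*"), term.rfind("?"))
    if len(term) - last_wildcard - 1 < min_trailing:
        return False

    # Positions in the parameters are 1-based, str.find offsets are 0-based.
    if first_star != -1 and first_star + 1 <= max_pos_asterisk:
        return True
    if first_question != -1 and first_question + 1 <= max_pos_question:
        return True
    # Assumption: the fraction applies to the 0-based offset of the first '*'.
    if (max_fraction_asterisk > 0.0 and first_star != -1
            and first_star < max_fraction_asterisk * len(term)):
        return True
    return False

print(should_reverse("*ation"))    # True: leading asterisk, five trailing characters
print(should_reverse("*paint*"))   # False: no trailing characters after the last wildcard
print(should_reverse("ab*tion"))   # False: asterisk is in position 3
print(should_reverse("ab*tion", max_fraction_asterisk=0.5))  # True
```

With maxPosQuestion set to 0 and maxPosAsterisk set to 1, the same function accepts "*ation" but rejects "?ation" and "a*tion", which corresponds to the pure suffix query setting described above.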