Commit bc5a9ca

Rename edit_distance/min_similarity to fuzziness
Many APIs currently use different names for the same logical parameter. Since Lucene moved away from the notion of a `similarity` and now uses a `fuzziness`, we should generalize this and encapsulate the generation, parsing and creation of these settings across all queries. This commit adds a new `Fuzziness` class that handles the renaming and generalization in a backwards-compatible manner.

This commit also adds a `ParseField` class to better support deprecated Query DSL parameters. The `ParseField` class allows specifying parameters that have been deprecated, so that they can be more easily tracked and removed in a future version. It also allows running queries in `strict` mode per index, which throws an exception if a query is executed with deprecated keys.

Closes elastic#4082
1 parent f7db7eb commit bc5a9ca

46 files changed: 917 additions and 196 deletions.

docs/reference/api-conventions.asciidoc

Lines changed: 72 additions & 0 deletions
@@ -122,6 +122,21 @@ fields within a document indexed treated as boolean fields.
 All REST APIs support providing numbered parameters as `string` on top
 of supporting the native JSON number types.

+[[time-units]]
+[float]
+=== Time units
+
+Whenever durations need to be specified, eg for a `timeout` parameter, the duration
+can be specified as a whole number representing time in milliseconds, or as a time value like `2d` for 2 days. The supported units are:
+
+[horizontal]
+`y`:: Year
+`M`:: Month
+`w`:: Week
+`h`:: Hour
+`m`:: Minute
+`s`:: Second
+
 [[distance-units]]
 [float]
 === Distance Units
@@ -144,6 +159,63 @@ Centimeter:: `cm` or `centimeters`
 Millimeter:: `mm` or `millimeters`


+[[fuzziness]]
+[float]
+=== Fuzziness
+
+Some queries and APIs support parameters to allow inexact _fuzzy_ matching,
+using the `fuzziness` parameter. The `fuzziness` parameter is context
+sensitive which means that it depends on the type of the field being queried:
+
+[float]
+==== Numeric, date and IPv4 fields
+
+When querying numeric, date and IPv4 fields, `fuzziness` is interpreted as a
+`+/- margin. It behaves like a <<query-dsl-range-query>> where:
+
+-fuzziness <= field value <= +fuzziness
+
+The `fuzziness` parameter should be set to a numeric value, eg `2` or `2.0`. A
+`date` field interprets a long as milliseconds, but also accepts a string
+containing a time value -- `"1h"` -- as explained in <<time-units>>. An `ip`
+field accepts a long or another IPv4 address (which will be converted into a
+long).
+
+[float]
+==== String fields
+
+When querying `string` fields, `fuzziness` is interpreted as a
+http://en.wikipedia.org/wiki/Levenshtein_distance[Levenshtein Edit Distance]
+-- the number of one character changes that need to be made to one string to
+make it the same as another string.
+
+The `fuzziness` parameter can be specified as:
+
+`0`, `1`, `2`::
+
+the maximum allowed Levenshtein Edit Distance (or number of edits)
+
+`AUTO`::
++
+--
+generates an edit distance based on the length of the term. For lengths:
+
+`0..1`:: must match exactly
+`1..4`:: one edit allowed
+`>4`:: two edits allowed
+
+`AUTO` should generally be the preferred value for `fuzziness`.
+--
+
+`0.0..1.0`::
+
+converted into an edit distance using the formula: `length(term) * (1.0 -
+fuzziness)`, eg a `fuzziness` of `0.6` with a term of length 10 would result
+in an edit distance of `4`. Note: in all APIs except for the
+<<query-dsl-flt-query>>, the maximum allowed edit distance is `2`.
+
+
+
 [float]
 === Result Casing
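
The string-field rules documented in the Fuzziness section above can be sketched in a few lines of Java. This is only an illustration of the documented behaviour, not the `Fuzziness` class added by this commit: `FuzzinessSketch` and its method names are hypothetical. It assumes that a term of length 1 falls into the "must match exactly" bucket (the documented `0..1` and `1..4` ranges overlap there), that whole numbers are edit distances while other values are similarities, and that the documented cap of `2` always applies.

```java
/** Hypothetical helper, not the commit's Fuzziness class. */
final class FuzzinessSketch {
    /** Documented maximum edit distance outside fuzzy_like_this. */
    static final int MAX_EDITS = 2;

    /** AUTO: derive the edit distance from the term length. */
    static int autoEdits(int termLength) {
        if (termLength <= 1) {
            return 0; // assumption: length 1 must match exactly
        }
        if (termLength <= 4) {
            return 1; // one edit allowed
        }
        return 2;     // two edits allowed
    }

    /** Documented formula: length(term) * (1.0 - fuzziness), truncated. */
    static int fractionToEdits(double fuzziness, int termLength) {
        return (int) (termLength * (1.0 - fuzziness));
    }

    /** Resolves "AUTO", an edit distance, or a similarity to an edit distance. */
    static int resolve(String fuzziness, String term) {
        // String.length() counts UTF-16 code units, which is a simplification.
        int length = term.length();
        if ("AUTO".equalsIgnoreCase(fuzziness)) {
            return autoEdits(length);
        }
        double value = Double.parseDouble(fuzziness);
        // Assumption: whole numbers are edit distances, anything else a similarity.
        boolean wholeNumber = value == Math.rint(value);
        int edits = wholeNumber ? (int) value : fractionToEdits(value, length);
        return Math.min(edits, MAX_EDITS);
    }
}
```

With these assumptions, `FuzzinessSketch.resolve("AUTO", "ki")` allows one edit, and a similarity of `0.6` on a ten-character term yields `4` before the cap and `2` after it.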

docs/reference/query-dsl/queries/flt-field-query.asciidoc

Lines changed: 2 additions & 2 deletions
@@ -33,8 +33,8 @@ The `fuzzy_like_this_field` top level parameters include:
 |`max_query_terms` |The maximum number of query terms that will be
 included in any generated query. Defaults to `25`.

-|`min_similarity` |The minimum similarity of the term variants. Defaults
-to `0.5`.
+|`fuzziness` |The fuzziness of the term variants. Defaults
+to `0.5`. See <<fuzziness>>.

 |`prefix_length` |Length of required common prefix on variant terms.
 Defaults to `0`.

docs/reference/query-dsl/queries/flt-query.asciidoc

Lines changed: 2 additions & 2 deletions
@@ -32,8 +32,8 @@ Defaults to the `_all` field.
 |`max_query_terms` |The maximum number of query terms that will be
 included in any generated query. Defaults to `25`.

-|`min_similarity` |The minimum similarity of the term variants. Defaults
-to `0.5`.
+|`fuzziness` |The minimum similarity of the term variants. Defaults
+to `0.5`. See <<fuzziness>>.

 |`prefix_length` |Length of required common prefix on variant terms.
 Defaults to `0`.
Lines changed: 54 additions & 31 deletions
@@ -1,12 +1,15 @@
 [[query-dsl-fuzzy-query]]
 === Fuzzy Query

-A fuzzy query that uses similarity based on Levenshtein (edit
-distance) algorithm. This maps to Lucene's `FuzzyQuery`.
+The fuzzy query uses similarity based on Levenshtein edit distance for
+`string` fields, and a `+/-` margin on numeric and date fields.

-Warning: this query is not very scalable with its default prefix length
-of 0 - in this case, *every* term will be enumerated and cause an edit
-score calculation or `max_expansions` is not set.
+==== String fields
+
+The `fuzzy` query generates all possible matching terms that are within the
+maximum edit distance specified in `fuzziness` and then checks the term
+dictionary to find out which of those generated terms actually exist in the
+index.

 Here is a simple example:

@@ -17,63 +20,83 @@ Here is a simple example:
 }
 --------------------------------------------------

-More complex settings can be set (the values here are the default
-values):
+Or with more advanced settings:

 [source,js]
 --------------------------------------------------
-    {
-        "fuzzy" : {
-            "user" : {
-                "value" : "ki",
-                "boost" : 1.0,
-                "min_similarity" : 0.5,
-                "prefix_length" : 0
-            }
+{
+    "fuzzy" : {
+        "user" : {
+            "value" : "ki",
+            "boost" : 1.0,
+            "fuzziness" : 2,
+            "prefix_length" : 0,
+            "max_expansions": 100
         }
     }
+}
 --------------------------------------------------

-The `max_expansions` parameter (unbounded by default) controls the
-number of terms the fuzzy query will expand to.
+[float]
+===== Parameters
+
+[horizontal]
+`fuzziness`::
+
+The maximum edit distance. Defaults to `AUTO`. See <<fuzziness>>.
+
+`prefix_length`::
+
+The number of initial characters which will not be ``fuzzified''. This
+helps to reduce the number of terms which must be examined. Defaults
+to `0`.
+
+`max_expansions`::
+
+The maximum number of terms that the `fuzzy` query will expand to.
+Defaults to `0`.
+
+
+WARNING: this query can be very heavy if `prefix_length` and `max_expansions`
+are both set to their defaults of `0`. This could cause every term in the
+index to be examined!
+

 [float]
-==== Numeric / Date Fuzzy
+==== Numeric and date fields
+
+Performs a <<query-dsl-range-query>> ``around'' the value using the
+`fuzziness` value as a `+/-` range, where:
+
+-fuzziness <= field value <= +fuzziness

-`fuzzy` query on a numeric field will result in a range query "around"
-the value using the `min_similarity` value. For example:
+For example:

 [source,js]
 --------------------------------------------------
 {
     "fuzzy" : {
         "price" : {
             "value" : 12,
-            "min_similarity" : 2
+            "fuzziness" : 2
         }
     }
 }
 --------------------------------------------------

-Will result in a range query between 10 and 14. Same applies to dates,
-with support for time format for the `min_similarity` field:
+Will result in a range query between 10 and 14. Date fields support
+<<time-units,time values>>, eg:

 [source,js]
 --------------------------------------------------
 {
     "fuzzy" : {
         "created" : {
             "value" : "2010-02-05T12:05:07",
-            "min_similarity" : "1d"
+            "fuzziness" : "1d"
         }
     }
 }
 --------------------------------------------------

-In the mapping, numeric and date types now allow to configure a
-`fuzzy_factor` mapping value (defaults to 1), which will be used to
-multiply the fuzzy value by it when used in a `query_string` type query.
-For example, for dates, a fuzzy factor of "1d" will result in
-multiplying whatever fuzzy value provided in the min_similarity by it.
-Note, this is explicitly supported since query_string query only allowed
-for similarity valued between 0.0 and 1.0.
+See <<fuzziness>> for more details about accepted values.
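
The numeric and date behaviour described in the hunk above amounts to a range of `value - fuzziness` to `value + fuzziness`. The following Java sketch shows that arithmetic for the two documented examples. `FuzzyRangeSketch` and its methods are hypothetical names, not Elasticsearch code. The time value parser is an assumption as well: it handles only fixed-length units (`w`, `d`, `h`, `m`, `s`) and treats a bare number as milliseconds; `y` and `M` are left out because their length depends on the calendar.

```java
import java.time.Duration;
import java.time.LocalDateTime;

/** Hypothetical helper illustrating the +/- margin; not Elasticsearch code. */
final class FuzzyRangeSketch {
    /** Inclusive [lower, upper] bounds around a numeric value. */
    static double[] numericRange(double value, double fuzziness) {
        return new double[] { value - fuzziness, value + fuzziness };
    }

    /** Parses a time value such as "1d" or "2h"; a bare number is milliseconds. */
    static Duration parseTimeValue(String text) {
        char unit = text.charAt(text.length() - 1);
        if (Character.isDigit(unit)) {
            return Duration.ofMillis(Long.parseLong(text));
        }
        long amount = Long.parseLong(text.substring(0, text.length() - 1));
        switch (unit) {
            case 'w': return Duration.ofDays(7 * amount);
            case 'd': return Duration.ofDays(amount);
            case 'h': return Duration.ofHours(amount);
            case 'm': return Duration.ofMinutes(amount);
            case 's': return Duration.ofSeconds(amount);
            default:
                throw new IllegalArgumentException("unsupported time unit: " + unit);
        }
    }

    /** Inclusive [lower, upper] bounds around a date, using a time value margin. */
    static LocalDateTime[] dateRange(LocalDateTime value, String fuzziness) {
        Duration margin = parseTimeValue(fuzziness);
        return new LocalDateTime[] { value.minus(margin), value.plus(margin) };
    }
}
```

For the `price` example, `numericRange(12, 2)` gives the bounds 10 and 14; for the `created` example, a margin of `"1d"` spans one day on either side of `2010-02-05T12:05:07`.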

docs/reference/query-dsl/queries/match-query.asciidoc

Lines changed: 9 additions & 8 deletions
@@ -34,9 +34,10 @@ The `analyzer` can be set to control which analyzer will perform the
 analysis process on the text. It default to the field explicit mapping
 definition, or the default search analyzer.

-`fuzziness` can be set to a value (depending on the relevant type, for
-string types it should be a value between `0.0` and `1.0`) to constructs
-fuzzy queries for each term analyzed. The `prefix_length` and
+`fuzziness` allows _fuzzy matching_ based on the type of field being queried.
+See <<fuzziness>> for allowed settings.
+
+The `prefix_length` and
 `max_expansions` can be set in this case to control the fuzzy process.
 If the fuzzy option is set the query will use `constant_score_rewrite`
 as its <<query-dsl-multi-term-rewrite,rewrite
@@ -80,9 +81,9 @@ change that the `zero_terms_query` option can be used, which accepts
 .cutoff_frequency
 The match query supports a `cutoff_frequency` that allows
 specifying an absolute or relative document frequency where high
-frequent terms are moved into an optional subquery and are only scored
-if one of the low frequent (below the cutoff) terms in the case of an
-`or` operator or all of the low frequent terms in the case of an `and`
+frequent terms are moved into an optional subquery and are only scored
+if one of the low frequent (below the cutoff) terms in the case of an
+`or` operator or all of the low frequent terms in the case of an `and`
 operator match.

 This query allows handling `stopwords` dynamically at runtime, is domain
@@ -101,8 +102,8 @@ Note: If the `cutoff_frequency` is used and the operator is `and`
 _stacked tokens_ (tokens that are on the same position like `synonym` filter emits)
 are not handled gracefully as they are in a pure `and` query. For instance the query
 `fast fox` is analyzed into 3 terms `[fast, quick, fox]` where `quick` is a synonym
-for `fast` on the same token positions the query might require `fast` and `quick` to
-match if the operator is `and`.
+for `fast` on the same token positions the query might require `fast` and `quick` to
+match if the operator is `and`.

 Here is an example showing a query composed of stopwords exclusivly:

docs/reference/query-dsl/queries/query-string-query.asciidoc

Lines changed: 4 additions & 4 deletions
@@ -46,8 +46,8 @@ increments in result queries. Defaults to `true`.
 |`fuzzy_max_expansions` |Controls the number of terms fuzzy queries will
 expand to. Defaults to `50`

-|`fuzzy_min_sim` |Set the minimum similarity for fuzzy queries. Defaults
-to `0.5`
+|`fuzziness` |Set the fuzziness for fuzzy queries. Defaults
+to `AUTO`. See <<fuzziness>> for allowed settings.

 |`fuzzy_prefix_length` |Set the prefix length for fuzzy queries. Default
 is `0`.
@@ -70,7 +70,7 @@ in the resulting boolean query should match. It can be an absolute value
 both>>.

 |`lenient` |If set to `true` will cause format based failures (like
-providing text to a numeric field) to be ignored.
+providing text to a numeric field) to be ignored.
 |=======================================================================

 When a multi term query is being generated, one can control how it gets
@@ -128,7 +128,7 @@ search on all "city" fields:

 Another option is to provide the wildcard fields search in the query
 string itself (properly escaping the `*` sign), for example:
-`city.\*:something`.
+`city.\*:something`.

 When running the `query_string` query against multiple fields, the
 following additional parameters are allowed:

docs/reference/search/suggesters/completion-suggest.asciidoc

Lines changed: 4 additions & 3 deletions
@@ -199,7 +199,7 @@ curl -X POST 'localhost:9200/music/_suggest?pretty' -d '{
         "completion" : {
             "field" : "suggest",
             "fuzzy" : {
-                "edit_distance" : 2
+                "fuzziness" : 2
             }
         }
     }
@@ -210,8 +210,9 @@ The fuzzy query can take specific fuzzy parameters.
 The following parameters are supported:

 [horizontal]
-`edit_distance`::
-Maximum edit distance, defaults to `1`
+`fuzziness`::
+The fuzziness factor, defaults to `AUTO`.
+See <<fuzziness>> for allowed settings.

 `transpositions`::
 Sets if transpositions should be counted
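
To make the interplay between `fuzziness` and `transpositions` concrete, here is a small Java sketch of an edit distance calculation. It is a textbook dynamic-programming implementation (Levenshtein distance, optionally extended with adjacent transpositions in the "optimal string alignment" style), not the automaton-based code Lucene uses. `EditDistanceSketch` is a hypothetical name, and the assumption that a counted transposition costs a single edit is the sketch's own.

```java
/** Hypothetical illustration of edit distance; not Lucene's implementation. */
final class EditDistanceSketch {
    /**
     * Number of single-character insertions, deletions and substitutions
     * needed to turn a into b. When transpositions is true, swapping two
     * adjacent characters also counts as one edit.
     */
    static int distance(String a, String b, boolean transpositions) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all i characters of a
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                if (transpositions && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    // the last two characters are swapped
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }
}
```

Under this sketch, `nirvana` and `nirvnaa` are two edits apart when transpositions are not counted and one edit apart when they are, so the same `fuzziness` value can match a different set of terms depending on the `transpositions` setting.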

src/main/java/org/apache/lucene/queryparser/classic/MapperQueryParser.java

Lines changed: 2 additions & 1 deletion
@@ -30,6 +30,7 @@
 import org.elasticsearch.common.lucene.Lucene;
 import org.elasticsearch.common.lucene.search.Queries;
 import org.elasticsearch.common.lucene.search.XFilteredQuery;
+import org.elasticsearch.common.unit.Fuzziness;
 import org.elasticsearch.index.mapper.FieldMapper;
 import org.elasticsearch.index.mapper.MapperService;
 import org.elasticsearch.index.query.QueryParseContext;
@@ -435,7 +436,7 @@ private Query getFuzzyQuerySingle(String field, String termStr, String minSimila
         if (currentMapper != null) {
             try {
                 //LUCENE 4 UPGRADE I disabled transpositions here by default - maybe this needs to be changed
-                Query fuzzyQuery = currentMapper.fuzzyQuery(termStr, minSimilarity, fuzzyPrefixLength, settings.fuzzyMaxExpansions(), false);
+                Query fuzzyQuery = currentMapper.fuzzyQuery(termStr, Fuzziness.build(minSimilarity), fuzzyPrefixLength, settings.fuzzyMaxExpansions(), false);
                 return wrapSmartNameQuery(fuzzyQuery, fieldMappers, parseContext);
             } catch (RuntimeException e) {
                 if (settings.lenient()) {
