-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Use optimized text in match_only_text fields #129371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use optimized text in match_only_text fields #129371
Conversation
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
/** | ||
* Reader that decodes UTF-8 formatted bytes into chars. | ||
*/ | ||
public class UTF8DecodingReader extends Reader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is basically equivalent to:
final var reader = new InputStreamReader(new ByteArrayInputStream(utfBytes.bytes(), utfBytes.offset(), utfBytes.length()), StandardCharsets.UTF_8);
But according to the microbenchmarks, using the InputStreamReader
/ByteArrayInputStream
is really slow, and was actually slower than the original string-based implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are the results from the microbenchmark:
Benchmark (nDocs) Mode Cnt Score Error Units
OptimizedTextBenchmark.indexDocuments (baseline) 1048576 avgt 5 581.242 ± 3.050 ms/op
OptimizedTextBenchmark.indexDocuments (UTF8DecodingReader) 1048576 avgt 5 544.477 ± 4.961 ms/op
OptimizedTextBenchmark.indexDocuments (InputStreamReader) 1048576 avgt 5 852.380 ± 6.238 ms/op
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Yes, this implementation makes a lot of sense then. Did you also run this PR against elastic/logs (enterprise) benchmark?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't yet, but I plan to today
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make this a final class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I left a few questions.
/** | ||
* Reader that decodes UTF-8 formatted bytes into chars. | ||
*/ | ||
public class UTF8DecodingReader extends Reader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Yes, this implementation makes a lot of sense then. Did you also run this PR against elastic/logs (enterprise) benchmark?
...per-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java
Show resolved
Hide resolved
...per-extras/src/main/java/org/elasticsearch/index/mapper/extras/SourceConfirmedTextQuery.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
...per-extras/src/main/java/org/elasticsearch/index/mapper/extras/MatchOnlyTextFieldMapper.java
Show resolved
Hide resolved
/** | ||
* Reader that decodes UTF-8 formatted bytes into chars. | ||
*/ | ||
public class UTF8DecodingReader extends Reader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make this a final class?
Looks like serverless check failed, because of:
Which killed the search node and many tests failed because of this. I don't think it is related to this change. |
Ok, I benchmarked this with the In light of that, I'm not sure if we want to continue forward with this change.
|
Given that even median indexing throughput varies, I think that this is within noise. I don't think this change makes things slower. I suspect we don't see a gain here, given that that are many other moving parts here. The micro benchmark suggested a minor improvement. I think not having to create java strings during indexing for match only fields is a beneficial. Additionally block loaders for match only text fields can work with bytes refs instead of strings, which I think is beneficial as well.
So I think we can move forward with the change and merge it in. |
Ok, I'll merge it and keep an eye on the nightlies. We can always revert it if we see something we don't like. |
💚 Backport successful
|
Follow-up to #126492 to use the json parsing optimizations for
match_only_text
fields.Relates to #129072.