-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Adding cardinality support for random_sampler agg #86838
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding cardinality support for random_sampler agg #86838
Conversation
Pinging @elastic/es-analytics-geo (Team:Analytics) |
Hi @benwtrent, I've created a changelog YAML for you. |
…wtrent/elasticsearch into feature/aggs-add-cardinality-support
Pinging @elastic/clients-team (Team:Clients) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code change LGTM
The doc changes sound like a justification and very technical to me. I suggest to describe the user impact and what it means regarding accuracy.
This is misleading in my opinion:
not reflect exactly the true number...
Cardinality returns approximate numbers, so it never pretended to return exact numbers, even a highprecision_threshold
"counts are expected to be close to accurate". I am not sure if we can say something about the effect on precision_threshold
.
A key exception to ... for automatic scaling
What about min
and max
? They don't carry a doc_count
as well and just return as is.
Having that said, I wonder if we even need a "special case" for cardinality. The impact of sampling described in general should be sufficient. For cardinality we put an approximation on an already approximate structure. In my opinion this should be even less surprising than doing an approximation on otherwise exact results.
The idea is that all other things that are "count" related are scaled. Cardinality is not. With cardinality, if you effectively had a unique value for every doc (thus matching doc_count), we don't scale. Which, visually, is confusing. It would match the sampled document count. If they removed using the random sampler, the count would dramatically increase. This kind of significant change does not really occur for other aggs. max/min/avg/percentiles/etc. are not scaled either. |
Thanks, I didn't see the "full unique case". It seems UI specific, because a user wouldn't run the cardinality agg on such a field. We could try to detect that, but it is hard to make the switch: What if we get |
Correct, this is why scaling cardinality just really isn't possible. We don't know how to do it in an unbiased way. Consequently, the only way to really use it is in relation to the number of docs actually sampled. Certain visualizations do this already (our data visualization in ML does cardinality under the old "sampler" agg). |
We intend to use cardinality for the data visualizer which would replace traditional sampler agg - as discussed, providing we document the inherent limited accuracy, then this would be useful. |
@elasticmachine update branch |
counts do not lend themselves for automatic scaling. So, when interpreting the cardinality count, be sure to compare it | ||
to the number of sampled docs provided in the top level `doc_count` within the random_sampler aggregation. This can give | ||
you an idea of unique values as a percentage of total values, but may not reflect exactly the true number of unique values | ||
for the given field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggested some minor changes for improving readability.
docs/reference/aggregations/bucket/random-sampler-aggregation.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/aggregations/bucket/random-sampler-aggregation.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/aggregations/bucket/random-sampler-aggregation.asciidoc
Outdated
Show resolved
Hide resolved
docs/reference/aggregations/bucket/random-sampler-aggregation.asciidoc
Outdated
Show resolved
Hide resolved
Co-authored-by: István Zoltán Szabó <[email protected]>
counts do not lend themselves for automatic scaling. So, when interpreting the cardinality count, be sure to compare it | ||
to the number of sampled docs provided in the top level `doc_count` within the random_sampler aggregation. This can give | ||
you an idea of unique values as a percentage of total values, but may not reflect exactly the true number of unique values | ||
for the given field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs changes LGTM, thank you!
* upstream/master: (40 commits) Fix CI job naming [ML] disallow autoscaling downscaling in two trained model assignment scenarios (elastic#88623) Add "Vector Search" area to changelog schema [DOCS] Update API key API (elastic#88499) Enable the pipeline on the feature branch (elastic#88672) Adding the ability to register a PeerFinderListener to Coordinator (elastic#88626) [DOCS] Fix transform painless example syntax (elastic#88364) [ML] Muting InternalCategorizationAggregationTests testReduceRandom (elastic#88685) Fix double rounding errors for disk usage (elastic#88683) Replace health request with a state observer. (elastic#88641) [ML] Fail model deployment if all allocations cannot be provided (elastic#88656) Upgrade to OpenJDK 18.0.2+9 (elastic#88675) [ML] make bucket_correlation aggregation generally available (elastic#88655) Adding cardinality support for random_sampler agg (elastic#86838) Use custom task instead of generic AckedClusterStateUpdateTask (elastic#88643) Reinstate test cluster throttling behavior (elastic#88664) Mute testReadBlobWithPrematureConnectionClose Simplify plugin descriptor tests (elastic#88659) Add CI job for testing more job parallelism [ML] make deployment infer requests fully cancellable (elastic#88649) ...
This adds support for the
cardinality
aggregation within a random_sampler.This usecase is helpful in determining the ratio of unique values compared to the count of total documents within the sampled set.