Skip to content

Adding cardinality support for random_sampler agg #86838

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

benwtrent
Copy link
Member

This adds support for the cardinality aggregation within a random_sampler.

This usecase is helpful in determining the ratio of unique values compared to the count of total documents within the sampled set.

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 17, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @benwtrent, I've created a changelog YAML for you.

@sethmlarson sethmlarson added the Team:Clients Meta label for clients team label May 17, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/clients-team (Team:Clients)

Copy link

@hendrikmuhs hendrikmuhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code change LGTM

The doc changes sound like a justification and very technical to me. I suggest to describe the user impact and what it means regarding accuracy.

This is misleading in my opinion:

not reflect exactly the true number...

Cardinality returns approximate numbers, so it never pretended to return exact numbers, even a highprecision_threshold "counts are expected to be close to accurate". I am not sure if we can say something about the effect on precision_threshold.

A key exception to ... for automatic scaling

What about min and max? They don't carry a doc_count as well and just return as is.

Having that said, I wonder if we even need a "special case" for cardinality. The impact of sampling described in general should be sufficient. For cardinality we put an approximation on an already approximate structure. In my opinion this should be even less surprising than doing an approximation on otherwise exact results.

@benwtrent
Copy link
Member Author

Having that said, I wonder if we even need a "special case" for cardinality. The impact of sampling described in general should be sufficient. For cardinality we put an approximation on an already approximate structure. In my opinion this should be even less surprising than doing an approximation on otherwise exact results.

The idea is that all other things that are "count" related are scaled. Cardinality is not.

With cardinality, if you effectively had a unique value for every doc (thus matching doc_count), we don't scale. Which, visually, is confusing. It would match the sampled document count. If they removed using the random sampler, the count would dramatically increase. This kind of significant change does not really occur for other aggs.

max/min/avg/percentiles/etc. are not scaled either.

@hendrikmuhs
Copy link

Thanks, I didn't see the "full unique case". It seems UI specific, because a user wouldn't run the cardinality agg on such a field. We could try to detect that, but it is hard to make the switch: What if we get all_docs - 1?

@benwtrent
Copy link
Member Author

We could try to detect that, but it is hard to make the switch:

Correct, this is why scaling cardinality just really isn't possible. We don't know how to do it in an unbiased way. Consequently, the only way to really use it is in relation to the number of docs actually sampled.

Certain visualizations do this already (our data visualization in ML does cardinality under the old "sampler" agg).

@sophiec20
Copy link
Contributor

We intend to use cardinality for the data visualizer which would replace traditional sampler agg - as discussed, providing we document the inherent limited accuracy, then this would be useful.

@benwtrent
Copy link
Member Author

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine removed the Team:Clients Meta label for clients team label Jul 20, 2022
counts do not lend themselves for automatic scaling. So, when interpreting the cardinality count, be sure to compare it
to the number of sampled docs provided in the top level `doc_count` within the random_sampler aggregation. This can give
you an idea of unique values as a percentage of total values, but may not reflect exactly the true number of unique values
for the given field.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lcawl does this read clearly?

@tveasey does this cover your concerns?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine to me.

Copy link
Contributor

@szabosteve szabosteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested some minor changes for improving readability.

Co-authored-by: István Zoltán Szabó <[email protected]>
counts do not lend themselves for automatic scaling. So, when interpreting the cardinality count, be sure to compare it
to the number of sampled docs provided in the top level `doc_count` within the random_sampler aggregation. This can give
you an idea of unique values as a percentage of total values, but may not reflect exactly the true number of unique values
for the given field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine to me.

@benwtrent benwtrent requested a review from szabosteve July 20, 2022 17:58
Copy link
Contributor

@szabosteve szabosteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs changes LGTM, thank you!

@benwtrent benwtrent merged commit 94f2544 into elastic:master Jul 21, 2022
@benwtrent benwtrent deleted the feature/aggs-add-cardinality-support branch July 21, 2022 11:19
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Jul 22, 2022
* upstream/master: (40 commits)
  Fix CI job naming
  [ML] disallow autoscaling downscaling in two trained model assignment scenarios (elastic#88623)
  Add "Vector Search" area to changelog schema
  [DOCS] Update API key API (elastic#88499)
  Enable the pipeline on the feature branch (elastic#88672)
  Adding the ability to register a PeerFinderListener to Coordinator (elastic#88626)
  [DOCS] Fix transform painless example syntax (elastic#88364)
  [ML] Muting InternalCategorizationAggregationTests testReduceRandom (elastic#88685)
  Fix double rounding errors for disk usage (elastic#88683)
  Replace health request with a state observer. (elastic#88641)
  [ML] Fail model deployment if all allocations cannot be provided (elastic#88656)
  Upgrade to OpenJDK 18.0.2+9 (elastic#88675)
  [ML] make bucket_correlation aggregation generally available (elastic#88655)
  Adding cardinality support for random_sampler agg (elastic#86838)
  Use custom task instead of generic AckedClusterStateUpdateTask (elastic#88643)
  Reinstate test cluster throttling behavior (elastic#88664)
  Mute testReadBlobWithPrematureConnectionClose
  Simplify plugin descriptor tests (elastic#88659)
  Add CI job for testing more job parallelism
  [ML] make deployment infer requests fully cancellable (elastic#88649)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.4.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants