Adding cardinality support for random_sampler agg #86838

benwtrent · 2022-05-17T12:14:38Z

This adds support for the cardinality aggregation within a random_sampler.

This use case is helpful for determining the ratio of unique values to the total number of documents within the sampled set.
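As a rough illustration of the use case described above, a request could be assembled as in the sketch below. The helper function, the aggregation names `sampling` and `unique_values`, and the field `user.id` are hypothetical and chosen only for this example; `probability` and `seed` are the random_sampler parameters.

```python
import json


def build_sampled_cardinality_request(field, probability=0.1, seed=None):
    """Build a search body with a cardinality agg nested in a random_sampler.

    Hypothetical helper for illustration; the aggregation names are arbitrary.
    """
    sampler = {"probability": probability}
    if seed is not None:
        # A fixed seed makes repeated runs sample consistently.
        sampler["seed"] = seed
    return {
        "size": 0,  # only the aggregation results are of interest
        "aggs": {
            "sampling": {
                "random_sampler": sampler,
                "aggs": {
                    "unique_values": {"cardinality": {"field": field}},
                },
            }
        },
    }


body = build_sampled_cardinality_request("user.id", probability=0.1, seed=42)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of a `_search` request with whichever client is in use.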

elasticmachine · 2022-05-17T12:14:42Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

elasticsearchmachine · 2022-05-17T12:15:04Z

Hi @benwtrent, I've created a changelog YAML for you.

elasticmachine · 2022-05-17T14:23:35Z

Pinging @elastic/clients-team (Team:Clients)

hendrikmuhs

code change LGTM

The doc changes sound like a justification and are very technical to me. I suggest describing the user impact and what it means regarding accuracy.

This is misleading in my opinion:

> not reflect exactly the true number...

Cardinality returns approximate numbers, so it never pretended to return exact numbers; even with a high `precision_threshold`, "counts are expected to be close to accurate". I am not sure if we can say something about the effect on `precision_threshold`.

> A key exception to ... for automatic scaling

What about min and max? They don't carry a doc_count either and are just returned as is.

Having said that, I wonder if we even need a "special case" for cardinality. The impact of sampling described in general should be sufficient. For cardinality we put an approximation on an already approximate structure. In my opinion this should be even less surprising than doing an approximation on otherwise exact results.

benwtrent · 2022-05-18T11:16:22Z

> Having said that, I wonder if we even need a "special case" for cardinality. The impact of sampling described in general should be sufficient. For cardinality we put an approximation on an already approximate structure. In my opinion this should be even less surprising than doing an approximation on otherwise exact results.

The idea is that all other things that are "count" related are scaled. Cardinality is not.

With cardinality, if you effectively had a unique value for every doc (thus matching doc_count), we don't scale, which is visually confusing: the value would match the sampled document count. If they stopped using the random sampler, the count would increase dramatically. This kind of significant change does not really occur for other aggs.

max/min/avg/percentiles/etc. are not scaled either.
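To make the difference concrete, here is a small simulation, offered as a sketch rather than a description of how Elasticsearch samples internally. It assumes each document is kept independently with the given probability and uses an exact `set` in place of the approximate cardinality structure; the function and variable names are made up for the example.

```python
import random


def simulate(values, probability, seed=0):
    """Sample each document independently and compare counts.

    Simplified model for illustration only: exact distinct counting,
    independent per-document sampling.
    """
    rng = random.Random(seed)
    sample = [v for v in values if rng.random() < probability]
    return {
        "sampled_docs": len(sample),
        # Document counts can be scaled back up by 1 / probability.
        "scaled_docs": round(len(sample) / probability),
        # Distinct values seen in the sample; no scaling is applied.
        "sampled_distinct": len(set(sample)),
        "true_distinct": len(set(values)),
    }


n_docs = 100_000

# Case 1: every document has its own value.
all_unique = simulate(list(range(n_docs)), probability=0.1)

# Case 2: only ten distinct values across all documents.
few_values = simulate([i % 10 for i in range(n_docs)], probability=0.1)

print("all unique:", all_unique)
print("few values:", few_values)
```

In the first case the sampled distinct count equals the sampled document count and would need a factor of roughly `1 / probability` to match the full data; in the second case it already equals the true distinct count and needs no scaling at all. A single scaling factor cannot be right for both.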

hendrikmuhs · 2022-05-18T13:19:41Z

Thanks, I didn't see the "full unique case". It seems UI specific, because a user wouldn't run the cardinality agg on such a field. We could try to detect that, but it is hard to make the switch: What if we get all_docs - 1?

benwtrent · 2022-05-18T14:56:43Z

> We could try to detect that, but it is hard to make the switch:

Correct, this is why scaling cardinality just really isn't possible. We don't know how to do it in an unbiased way. Consequently, the only way to really use it is in relation to the number of docs actually sampled.

Certain visualizations do this already (our data visualization in ML does cardinality under the old "sampler" agg).
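A minimal sketch of reading the result "in relation to the number of docs actually sampled" might look like the following. It relies on the documentation text in this PR, which states that the top-level `doc_count` of the random_sampler aggregation is the number of sampled docs. The function name, the aggregation names, and the response values are invented for illustration.

```python
def unique_ratio(response, sampler_name="sampling", cardinality_name="unique_values"):
    """Return cardinality divided by the sampled doc count, or None if empty.

    Hypothetical helper; aggregation names must match those used in the request.
    """
    sampler = response["aggregations"][sampler_name]
    sampled_docs = sampler["doc_count"]
    if sampled_docs == 0:
        return None  # nothing was sampled, so no ratio can be formed
    return sampler[cardinality_name]["value"] / sampled_docs


# Made-up response fragment, not real output from a cluster.
example_response = {
    "aggregations": {
        "sampling": {
            "seed": 42,
            "probability": 0.1,
            "doc_count": 10_000,
            "unique_values": {"value": 2_500},
        }
    }
}

ratio = unique_ratio(example_response)
print(f"unique values per sampled doc: {ratio:.2f}")
```

The ratio describes the sampled set only; it is not an estimate of the true number of unique values in the full data.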

sophiec20 · 2022-07-20T13:00:25Z

We intend to use cardinality for the data visualizer, where it would replace the traditional sampler agg. As discussed, provided we document the inherent limited accuracy, this would be useful.

benwtrent · 2022-07-20T13:23:44Z

@elasticmachine update branch

benwtrent · 2022-07-20T13:25:04Z

docs/reference/aggregations/bucket/random-sampler-aggregation.asciidoc

+counts do not lend themselves for automatic scaling. So, when interpreting the cardinality count, be sure to compare it
+to the number of sampled docs provided in the top level `doc_count` within the random_sampler aggregation. This can give
+you an idea of unique values as a percentage of total values, but may not reflect exactly the true number of unique values
+for the given field.


@lcawl does this read clearly?

@tveasey does this cover your concerns?

szabosteve

I suggested some minor changes for improving readability.

tveasey · 2022-07-20T17:41:30Z

This seems fine to me.

szabosteve

Docs changes LGTM, thank you!

* upstream/master: (40 commits)
  Fix CI job naming
  [ML] disallow autoscaling downscaling in two trained model assignment scenarios (elastic#88623)
  Add "Vector Search" area to changelog schema
  [DOCS] Update API key API (elastic#88499)
  Enable the pipeline on the feature branch (elastic#88672)
  Adding the ability to register a PeerFinderListener to Coordinator (elastic#88626)
  [DOCS] Fix transform painless example syntax (elastic#88364)
  [ML] Muting InternalCategorizationAggregationTests testReduceRandom (elastic#88685)
  Fix double rounding errors for disk usage (elastic#88683)
  Replace health request with a state observer. (elastic#88641)
  [ML] Fail model deployment if all allocations cannot be provided (elastic#88656)
  Upgrade to OpenJDK 18.0.2+9 (elastic#88675)
  [ML] make bucket_correlation aggregation generally available (elastic#88655)
  Adding cardinality support for random_sampler agg (elastic#86838)
  Use custom task instead of generic AckedClusterStateUpdateTask (elastic#88643)
  Reinstate test cluster throttling behavior (elastic#88664)
  Mute testReadBlobWithPrematureConnectionClose
  Simplify plugin descriptor tests (elastic#88659)
  Add CI job for testing more job parallelism
  [ML] make deployment infer requests fully cancellable (elastic#88649)
  ...

Adding cardinality support for random_sampler agg

45db5ff

benwtrent added >enhancement :Analytics/Aggregations Aggregations v8.3.0 labels May 17, 2022

elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 17, 2022

benwtrent and others added 4 commits May 17, 2022 08:15

Update docs/changelog/86838.yaml

0ea6c01

fixing test

b2ca179

Merge branch 'feature/aggs-add-cardinality-support' of github.com:benwtrent/elasticsearch into feature/aggs-add-cardinality-support

f759e16

fixing bad change

d49764f

sethmlarson added the Team:Clients Meta label for clients team label May 17, 2022

hendrikmuhs reviewed May 18, 2022

craigtaverner added v8.4.0 and removed v8.3.0 labels May 25, 2022

Merge branch 'master' into feature/aggs-add-cardinality-support

3152565

elasticsearchmachine removed the Team:Clients Meta label for clients team label Jul 20, 2022

benwtrent commented Jul 20, 2022

szabosteve reviewed Jul 20, 2022

Apply suggestions from code review

726b18a

Co-authored-by: István Zoltán Szabó

tveasey approved these changes Jul 20, 2022

benwtrent requested a review from szabosteve July 20, 2022 17:58

szabosteve approved these changes Jul 21, 2022

benwtrent merged commit 94f2544 into elastic:master Jul 21, 2022

benwtrent deleted the feature/aggs-add-cardinality-support branch July 21, 2022 11:19
