Skip to content

Avoid dropping aggregate groupings in local plans #129370

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Jun 12, 2025

The local plan optimizer should not change the layout, as it has already been agreed upon. However, CombineProjections can violate this when some grouping elements refer to the same attribute. This occurs when ReplaceFieldWithConstantOrNull replaces missing fields with the same reference for a given data type.

I see several ways to address this issue:

  1. Skip CombineProjections in the local plan. This optimization is useful in the global plan, but likely unnecessary in the local plan, especially since we now need to retain the same number of aggregates and groupings.

  2. Update RuleUtils#aliasedNulls to generate an eval for each missing field separately, rather than one per data type.

This change adopts option 2 as it is simpler and more robust. However, either option is fine, and I am open to other suggestions.

Closes #128054

@dnhatn dnhatn added v8.18.3 v8.19.0 v9.0.3 auto-backport Automatically create backport pull requests when merged :Analytics/ES|QL AKA ESQL >bug labels Jun 12, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @dnhatn, I've created a changelog YAML for you.

@dnhatn dnhatn marked this pull request as ready for review June 12, 2025 22:48
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jun 12, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a first look at it, but I need more time to understand it.
It seems to be doing what the comment in RuleUtils.aliasedNulls is suggesting, which is good.

When I tested this I really had to use "max_concurrent_shards_per_node":1 (as you have it in your IT test). Why is this bit needed, if you can shed some light?

@@ -1680,6 +1680,36 @@ public void testQueryOnEmptyDataIndex() {
}
}

public void testGroupingStatsOnMissingFields() {
assertAcked(client().admin().indices().prepareCreate("test_missing_fields").setMapping("data", "type=long"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here you meant missing_field_index, right?

var field = as(project.child(), FieldExtractExec.class);
var fields = field.attributesToExtract();
assertThat(Expressions.names(fields), contains("_meta_field", "gender", "hire_date", "job", "job.raw", "languages", "long_noidx"));
var eval = as(field.child(), EvalExec.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also, please, change the "Expects" part of this method at the top? It's a thing with the optimizer tests that, at least for me, is very useful when looking at a test.

@bpintea
Copy link
Contributor

bpintea commented Jun 17, 2025

Nice find of the cause!

This change adopts option 2 as it is simpler and more robust.

I think skipping CombineProjections in LocalLogicalPlanOptimizer should be (1) even simpler (basically, similar to this line) and likely needed anyways, as indeed, the rule (1) shouldn't normally apply locally, but (2) could also change the layout (as it happened).

So IMO we should at a minimum apply 1. Then, the NULL aliasing should make a difference with many locally missing fields, so even it has some fragility, I'd opt to keep it (at least until we substitute with something else).

@alex-spies
Copy link
Contributor

alex-spies commented Jun 18, 2025

Hey, nice find @dnhatn !

Update RuleUtils#aliasedNulls to generate an eval for each missing field separately, rather than one per data type.

I believe this is a good quick fix.

I think skipping CombineProjections in LocalLogicalPlanOptimizer should be (1) even simpler (basically, similar to this line) and likely needed anyways, as indeed, the rule (1) shouldn't normally apply locally, but (2) could also change the layout (as it happened)

I wouldn't disable this locally altogether. There are very valid reasons to combine projections on the local node; for one, ProjectAwayColumns introduces a projection that's only present on local nodes; also, ReplaceFieldWithConstantOrNull introduces a projection only locally; combining those makes sense and simplifies our plans.

I think the crux is that on a local node Aggregate can actually mean "this will be the initial aggregate that corresponds to an intermediate/final aggregate downstream and thus mustn't have its output changed".

Thus, I think a more robust approach would be to mark such Aggregates and avoid the deduplication of the groupings here for any aggregates marked in such a way.

I know this requires a new transport version, but we could add a fixedOutput field to Aggregate and check it in CombineProjections.

@bpintea
Copy link
Contributor

bpintea commented Jun 18, 2025

I wouldn't disable this locally altogether. There are very valid reasons to combine projections on the local node; for one, ProjectAwayColumns introduces a projection that's only present on local nodes; also, ReplaceFieldWithConstantOrNull introduces a projection only locally; combining those makes sense and simplifies our plans.

I think ProjectAwayColumns should matter later, as a physical rule, but the projections that ReplaceFieldWithConstantOrNull produces should be combined, you're right!

Thus, I think a more robust approach would be to mark such Aggregates and avoid the deduplication of the groupings here for any aggregates marked in such a way.

CombineProjections would need to become aware of a special Aggregate flavour, then, plus have a rule to mark these instances (coming/going from/in a fragment). An alternative might be to just make CombineProjections location-aware (coordinator vs. local) and based on this discriminator just use a list, rather then a set, when building the output of the agg in #combineUpperGroupingsAndLowerProjections? That shouldn't require transport changes.

@alex-spies
Copy link
Contributor

I think ProjectAwayColumns should matter later, as a physical rule,

It adds a logical projection into the fragment here. Currently, this top projection gets normally combined with anything that rules like ReplaceFieldWithConstantOrNull add.

An alternative might be to just make CombineProjections location-aware (coordinator vs. local) and based on this discriminator just use a list, rather then a set, when building the output of the agg in #combineUpperGroupingsAndLowerProjections? That shouldn't require transport changes.

Oh, right - I thought this wasn't an option because there may be other aggs in the local plan, but of course there can be at most one aggregate 🤦 . So, a local flavor of CombineProjections would do the trick!

@bpintea
Copy link
Contributor

bpintea commented Jun 18, 2025

Laterally, I was also wondering if we couldn't do a verification of the outputs before and after local replanning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.18.4 v8.19.0 v9.0.4 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ES|QL] ClassCastException in PercentileDoubleGroupingAggregatorFunction
5 participants