-
Notifications
You must be signed in to change notification settings - Fork 25.3k
ESQL - KNN function uses LIMIT for setting top k #129353
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
ESQL - KNN function uses LIMIT for setting top k #129353
Conversation
… the query builder is retrieved
…n-limit' into non-issue/esql-knn-function-limit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed these tests to:
- Remove the "k" option and replace it by LIMIT
- Remove STATS use cases, as I've removed the ability to use KNN in STATS for now
@@ -190,7 +190,7 @@ public BiConsumer<LogicalPlan, Failures> postAnalysisPlanVerification() { | |||
* @param plan root plan to check | |||
* @param failures failures found | |||
*/ | |||
private static void checkFullTextQueryFunctions(LogicalPlan plan, Failures failures) { | |||
private void checkFullTextQueryFunctions(LogicalPlan plan, Failures failures) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed visibility of these methods to allow overriding in KNN
@@ -184,11 +184,10 @@ public void execute(EsqlQueryRequest request, EsqlExecutionInfo executionInfo, P | |||
new EsqlCCSUtils.CssPartialErrorsActionListener(executionInfo, listener) { | |||
@Override | |||
public void onResponse(LogicalPlan analyzedPlan) { | |||
LogicalPlan optimizedPlan = optimizedPlan(analyzedPlan); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed PreMapper to be done after the plan has been optimized. This way, KNN can update k before the QueryBuilder is created, and so it takes the k coming from the optimization
* KNN should not be used in aggregations, as it is a top-N query and not a filtering query | ||
*/ | ||
@Override | ||
protected void checkFullTextFunctionsInAggs(Aggregate agg, Failures failures) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the ability to use KNN in aggs. It does not make sense to me - KNN retrieves a top K in terms of similarity, not as a filtering function like other FTFs.
We can undo that decision later, for example forcing it to use a minimum similarity - but still think that it's too much of an edge case as of now, and complicates testing.
Having As a user, I would assume it takes the first one, then allows me to do whatever I want with the table of |
@benwtrent In case there's a LIMIT that comes after that reduces the resultset, then whatever the user does before would get lost when applying the lesser LIMIT afterwards. Given the following:
Then it doesn't matter what I calculated in the bottom 10 rows as it will be discarded by the last LIMIT 10. This comes from an optimization already done for pushing down and combining limits - I wanted to keep the same semantics for KNN, as the returned rows that don't pass a posterior LIMIT would be discarded as well. The only case I can think of is if we're changing the SORT order between limits - I don't see that taken into account in the push down limits optimization, I'll think about it more. |
@carlosdelest yes, sort order changing is exactly my concern |
KNN
is a different kind of function than other text functions in ESQL - it retrieves the top K results instead of returning all the results dictated byLIMIT
.In order not to return less results than the user expects, and to optimize the underlying
KNN
search, KNN function will use theLIMIT
(s) before it to set the k parameter, in case it is not set already as an option by the user.will mean that KNN will set
k
to 100.A
LIMIT
that comes later from KNN is taken into account:will use k = 100 as well.
The only commands that stop the
LIMIT
toKNN
propagation isSTATS
.If there are multiple LIMIT clauses, the lesser one will be applied:
will have k = 10.
If the user has already specified a k value in the KNN function via the k option, the k option will prevail vs using a limit:
will hava k = 10.
k needs to be specified either implicitly (via
LIMIT
) or explicitly (via thek
option). We have a defaultLIMIT
of 1000 that is applied implicitly, except for example in the case of STATS:the query above will fail as there is no default LIMIT applied and k option is not set.
Open questions: