Skip to content

ESQL - KNN function uses LIMIT for setting top k #129353

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

carlosdelest
Copy link
Member

@carlosdelest carlosdelest commented Jun 12, 2025

KNN is a different kind of function than other text functions in ESQL - it retrieves the top K results instead of returning all the results dictated by LIMIT.

In order not to return less results than the user expects, and to optimize the underlying KNN search, KNN function will use the LIMIT(s) before it to set the k parameter, in case it is not set already as an option by the user.

FROM test
| WHERE KNN(vector, [0, 1, 2])
| LIMIT 100

will mean that KNN will set k to 100.

A LIMIT that comes later from KNN is taken into account:

FROM test METADATA _score
| WHERE KNN(vector, [0, 1, 2])
| EVAL t = x + 1
| SORT _score
| LIMIT 100

will use k = 100 as well.

The only commands that stop the LIMIT to KNN propagation is STATS.

If there are multiple LIMIT clauses, the lesser one will be applied:

FROM test METADATA _score
| WHERE KNN(vector, [0, 1, 2])
| LIMIT 20
| EVAL t = x + 1
| SORT _score
| LIMIT 10

will have k = 10.

If the user has already specified a k value in the KNN function via the k option, the k option will prevail vs using a limit:

FROM test
| WHERE KNN(vector, [0, 1, 2],  {k: 10})
| LIMIT 100

will hava k = 10.

k needs to be specified either implicitly (via LIMIT) or explicitly (via the k option). We have a default LIMIT of 1000 that is applied implicitly, except for example in the case of STATS:

FROM test
| WHERE KNN(vector, [0, 1, 2])
| STATS c = count(*)

the query above will fail as there is no default LIMIT applied and k option is not set.

Open questions:

  • Should we just make K a mandatory param and don't depend on LIMIT?
  • Should KNN use K as an optional param instead of an option?
  • Should the default LIMIT of 1000 be considered an error for KNN, and force the user to either set a LIMIT or set K in KNN?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed these tests to:

  • Remove the "k" option and replace it by LIMIT
  • Remove STATS use cases, as I've removed the ability to use KNN in STATS for now

@@ -190,7 +190,7 @@ public BiConsumer<LogicalPlan, Failures> postAnalysisPlanVerification() {
* @param plan root plan to check
* @param failures failures found
*/
private static void checkFullTextQueryFunctions(LogicalPlan plan, Failures failures) {
private void checkFullTextQueryFunctions(LogicalPlan plan, Failures failures) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed visibility of these methods to allow overriding in KNN

@@ -184,11 +184,10 @@ public void execute(EsqlQueryRequest request, EsqlExecutionInfo executionInfo, P
new EsqlCCSUtils.CssPartialErrorsActionListener(executionInfo, listener) {
@Override
public void onResponse(LogicalPlan analyzedPlan) {
LogicalPlan optimizedPlan = optimizedPlan(analyzedPlan);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed PreMapper to be done after the plan has been optimized. This way, KNN can update k before the QueryBuilder is created, and so it takes the k coming from the optimization

* KNN should not be used in aggregations, as it is a top-N query and not a filtering query
*/
@Override
protected void checkFullTextFunctionsInAggs(Aggregate agg, Failures failures) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the ability to use KNN in aggs. It does not make sense to me - KNN retrieves a top K in terms of similarity, not as a filtering function like other FTFs.

We can undo that decision later, for example forcing it to use a minimum similarity - but still think that it's too much of an edge case as of now, and complicates testing.

@benwtrent
Copy link
Member

FROM test METADATA _score
| WHERE KNN(vector, [0, 1, 2])
| LIMIT 20
| EVAL t = x + 1
| SORT _score
| LIMIT 10

Having knn take the lesser one doesn't make sense to me. Why shouldn't it take the first one?

As a user, I would assume it takes the first one, then allows me to do whatever I want with the table of k:20, then further limit based on some subsequent actions.

@carlosdelest
Copy link
Member Author

carlosdelest commented Jun 13, 2025

Having knn take the lesser one doesn't make sense to me. Why shouldn't it take the first one?

As a user, I would assume it takes the first one, then allows me to do whatever I want with the table of k:20, then further limit based on some subsequent actions.

@benwtrent In case there's a LIMIT that comes after that reduces the resultset, then whatever the user does before would get lost when applying the lesser LIMIT afterwards.

Given the following:

| WHERE KNN(..)
| LIMIT 20
| EVAL k = x +1
| LIMIT 10

Then it doesn't matter what I calculated in the bottom 10 rows as it will be discarded by the last LIMIT 10.

This comes from an optimization already done for pushing down and combining limits - I wanted to keep the same semantics for KNN, as the returned rows that don't pass a posterior LIMIT would be discarded as well.

The only case I can think of is if we're changing the SORT order between limits - I don't see that taken into account in the push down limits optimization, I'll think about it more.

@benwtrent
Copy link
Member

@carlosdelest yes, sort order changing is exactly my concern

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants