Skip to content

[ML] Ensure that AD job state update retries if master node is temoporarily unavailable #129329

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 112 commits into from

Conversation

valeriy42
Copy link
Contributor

During cluster upgrade, the anomaly detection jobs must be reassigned from one ML node to another. During this reassignment, the jobs transition through several states, including "opening" and "opened". If, during this transition, the master node becomes temporarily unavailable, e.g., due to reassignment, the new job state is not successfully committed to the cluster state. Therefore, once the new master became available, the cluster state was inconsistent: some anomaly detection jobs were opened, but their state got stuck as "opening".

This PR introduces a retryable action for updating the job state to ensure that the job state is successfully updated and the cluster state remains consistent during the upgrade.

Fixes #126148

@valeriy42 valeriy42 requested a review from Copilot June 12, 2025 11:17
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.1.0 labels Jun 12, 2025
@valeriy42 valeriy42 added >bug :ml Machine learning v8.19.0 v8.17.8 v8.18.3 and removed needs:triage Requires assignment of a team area label labels Jun 12, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a retry mechanism to ensure anomaly detection job state updates are committed even if the master node is temporarily unavailable. It refactors the existing setJobState methods to use a new UpdateStateRetryableAction wrapper and adds comprehensive tests to cover success, failure, retry, and no-retry scenarios.

  • Added UpdateStateRetryableAction inner class to handle retryable persistent task updates.
  • Refactored both setJobState overloads to invoke the retryable action.
  • Expanded unit tests in AutodetectProcessManagerTests to validate retry logic, success, failure, and non-retry conditions.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/job/process/autodetect/AutodetectProcessManagerTests.java Tests for the new retryable state-update behavior.
x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/process/autodetect/AutodetectProcessManager.java Implementation of UpdateStateRetryableAction and refactored setJobState.
Comments suppressed due to low confidence (1)

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/process/autodetect/AutodetectProcessManager.java:1044

  • The catch block references e.getMessage() but the exception variable is named e1, causing a compilation error. Change it to e1.getMessage().
logger.warn("Error while delegating exception [" + e.getMessage() + "]", e1);

TimeValue.MAX_VALUE,
jobTask,
jobTaskState,
ActionListener.wrap(persistentTask -> {
Copy link
Preview

Copilot AI Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure branch is not provided to ActionListener.wrap, so on final failures the user-provided handler will never be invoked. Supply a second lambda that calls handler.accept(e) for the onFailure case.

Copilot uses AI. Check for mistakes.

Comment on lines +1147 to +1149
// retry everything *except* when the task truly no longer exists
// TODO valeriy: is this the only exception we should not retry on?
return (ExceptionsHelper.unwrapCause(e) instanceof ResourceNotFoundException) == false;
Copy link
Preview

Copilot AI Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider removing or resolving this TODO in production code. Clarify the full set of exceptions that should bypass retries or track this in an issue.

Suggested change
// retry everything *except* when the task truly no longer exists
// TODO valeriy: is this the only exception we should not retry on?
return (ExceptionsHelper.unwrapCause(e) instanceof ResourceNotFoundException) == false;
// Retry everything except when the task truly no longer exists or other critical exceptions occur.
return (ExceptionsHelper.unwrapCause(e) instanceof ResourceNotFoundException) == false
&& (ExceptionsHelper.unwrapCause(e) instanceof ElasticsearchStatusException) == false
&& (ExceptionsHelper.unwrapCause(e) instanceof InvalidAliasNameException) == false;

Copilot uses AI. Check for mistakes.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Jun 12, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @valeriy42, I've created a changelog YAML for you.

schase-es and others added 17 commits June 12, 2025 13:27
…stic#129015)

The TransportConnectionListener interface has previously included the
Transport.Connection being closed and unregistered in its onNodeDisconnected
callback. This is not in use, and can be removed as it is also available in the
onConnectionClosed callback. It is being replaced with a Nullable exception that
caused the close. This is being used in pending work (ES-11448) to differentiate
network issues from node restarts.

Closes ES-12007
…tic#128979)

Fwd port ESQL_REGEX_MATCH_WITH_CASE_INSENSITIVITY_8_19.

Related: elastic#128919.
We miss appending null when ignored_source is not available. Our 
randomized tests already cover this case, but we do not check it when
loading fields.

I labelled this non-issue for an unreleased bug.

Closes elastic#128959
Relates elastic#119546
The Identity Provider's Service Provider cache had two issues:

1. It checked for identity based on sequence numbers, but didn't
   include the `seq_no_primary_term` parameter on searches, which
   means the sequence would always by `-2`
2. It didn't track whether the index was deleted, which means it
   could be caching values from an old version of the index

This commit fixes both of these issues.

In practice neither issue was a problem because there are no
deployments that use index-based service providers, however the 2nd
issue did cause some challenges for testing.
When parsing documents, we receive the document as UTF-8 encoded data which
we then parse and convert the fields to java-native UTF-16 encoded Strings. 
We then convert these strings back to UTF-8 for storage in lucene.

This patch skips the redundant conversion, instead passing lucene a
direct reference to the received UTF-8 bytes when possible.
This extends the change from elastic#128176 to validate the "custom
attributes" on a per Service Provider basis.

Each Service Provider (whether registered or wildcard based) has a
field "attributes.extensions" which is a list of attribute names that
may be provided by the caller of "/_idp/saml/init".

Service Providers that have not be configured with extension
attributes will reject any custom attributes in SAML init.

This necessitates a new field in the service provider index (but only
if the new `extensions` attribute is set).
The template has been updated, but there is no data migration because
the `saml-service-provider` index does not exist in any of the
environments into which we wish to deploy this change.
…stic#128930)

Modifies the methods to work with a project scope rather than a cluster
scope.
This is part of an iterative process to make ILM project-aware.
This removes all non-test usages of
```
Metadata.Builder.putCustom(String type, ProjectCustom custom)
```
And replaces it with appropriate calls to the equivalent method on
`ProjectMetadata.Builder`.

In most cases this _does not_ make the code project aware, but does
reduce the number of deprecated methods in use.
…8895)

Lucene's `org.apache.lucene.util.automaton.Operations#getSingleton` fails with an Automaton for a `REGEXP_EMPTY` `RegExp`. This adds a workaround for that, to check the type of automaton before calling into that failing method.

Closes elastic#128813
)

IAE gets reported as a 400 status code, but that's inappropriate when
inconsistent pages are always bugs, and should be reported with a 500.
Throw ISE instead.
This PR fixes an NPE in the semantic highlighter when the document being highlighted doesn't contain value for the semantic text field.

Closes elastic#128975
This commit implements TermsQuery(Collection<?> values, @nullable SearchExecutionContext context) in the 
WildCardFieldMapper to avoid memory pressure when building the query.
…ion and lexical search support (elastic#128927)

* Add clarifications to semantic text documentation

* Regnerate match ESQL docs

* Fix whitespace

* PR feedback

* Update docs/reference/elasticsearch/mapping-reference/semantic-text.md

Co-authored-by: Liam Thompson <[email protected]>

---------

Co-authored-by: Liam Thompson <[email protected]>
nielsbauman and others added 22 commits June 12, 2025 13:27
And replace it with appropriate calls to the equivalent method on
`ProjectMetadata.Builder`.
In most cases this _does not_ make the code project aware, but does
reduce the number of deprecated methods in use.
Concerns both the getter and the setter.
Introduces a new search load metric to the stats infrastructure, measured and tracked on a per-shard basis. The metric represents the Exponentially Weighted Moving Rate (EWMR) of search operations, calculated using the "took" time from each completed search phase.
Now that ESQL has `allow_partial_results` we can reply with a `200` even
though some nodes failed to run ESQL. This could happen because the node
is restarting. Or because of a bug. Or a disconnect. All kinds of
things. This logs those partial failures so an operator can look at them
and get a sense of why they are happening.
This fixes a silly bug where we didn't override `OffHeapStats` for IVF.
* feat: enable date_detection for all apm data streams

* Update resources.yaml

* Create 128913.yml

---------

Co-authored-by: Carson Ip <[email protected]>
This PR introduces several fixes to various IT tests, related to the use and misuse of the version identifier for the start cluster:

    wherever we can, we replace of versions in test code with features
    where we can't, we make sure we use the actual stack version (the one provided by -Dtests.bwc.main.version and not the bogus "0.0.0" version string)
    when requesting the cluster version we make sure we do use the "unresolved" version identifier (the value of the tests.old_cluster_version system property e.g. 0.0.0 ) so we resolve the right distribution

These changes enabled the tests to be used in BC upgrade tests (and potentially in serverless upgrade tests too, where they would have also failed)

Relates to ES-12010

Precedes elastic#128614, elastic#128823 and elastic#128983
)

* [Build] Build maven aggregation zip as part of DRA build

* Update path for aggregation zip
The threadpool-based merge scheduler triggers indexing throttling if merges are still getting enqueued faster than they're executed, while they are also disk IO unthrottled. This PR fixes the case where indexing throttling was incorrectly NOT triggered when disk IO throttling was disabled via the index settings.
…129255)

* Register match_phrase as a function not a snapshot function

* Update usage
- provides better configuration cache support
- requires some rework due to changed defaults
* Adding support to exclude semantic_text subfields

* Update docs/changelog/127664.yaml

* Updating changelog file

* remove duplicate test from yaml file

* Adding support to exclude semantic_text subfields from mapper builders

* Adding support for generic field types

* refactoring to use builder and setting exclude value from semantic_text mapper

* update in semantic_text mapper and fetcher to incorporate the support functionality

* Fix code style issue

* adding node feature for yaml tests

* Adding more restrictive checks on yaml tests and few refactoring

* Returns metadata fields from metadata mappers

* returns all source fields for fieldcaps

* gather all fields and iterate to process for fieldcaps api

* revert back all changes from MappedFieldtype and subclasses

* revert back exclude logic from semantic_text mapper

* fix lint issues

* fix lint issues

* Adding runtime fields into fieldCaps

* Fix linting issue

* removing unused functions that used in previous implementation

* fix multifield tests failure

* getting alias fields for field caps

* adding support for query time runtime fields

* [CI] Auto commit changes from spotless

* Fix empty mapping fieldCaps call

* Address passthrough behavior for mappers

* Fix SearchAsYoutype mapper failures

* rename abstract method to have more meaningful name

* Rename mapper function to match its functionality

* Adding filtering for infernece subfields

* revert back previous implementation changes

* Adding yaml test for field caps not filtering multi-field

* Fixing yaml test

* Adding comment why .infernece filter is added

---------

Co-authored-by: elasticsearchmachine <[email protected]>
Co-authored-by: Elastic Machine <[email protected]>
update to use ES logger instead of infostream and fixing native access warnings
Co-authored-by: ywangd <[email protected]>
Co-authored-by: rjernst <[email protected]>
Relates: ES-11445
@valeriy42 valeriy42 force-pushed the bugfix/is-1543-opening-state branch from e29fe33 to 2596df9 Compare June 13, 2025 07:38
@valeriy42 valeriy42 requested review from a team as code owners June 13, 2025 07:38
@ldematte
Copy link
Contributor

I think you need to rebase your PR (or open a new one)

@valeriy42
Copy link
Contributor Author

Replaced by #129391

@valeriy42 valeriy42 closed this Jun 13, 2025
@valeriy42 valeriy42 deleted the bugfix/is-1543-opening-state branch June 13, 2025 07:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :ml Machine learning Team:ML Meta label for the ML team v8.17.8 v8.18.3 v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ML] Prevent AD jobs from being stuck in the "opening" state