[ML] Ensure that AD job state update retries if master node is temoporarily unavailable #129329

valeriy42 · 2025-06-12T11:17:41Z

During cluster upgrade, the anomaly detection jobs must be reassigned from one ML node to another. During this reassignment, the jobs transition through several states, including "opening" and "opened". If, during this transition, the master node becomes temporarily unavailable, e.g., due to reassignment, the new job state is not successfully committed to the cluster state. Therefore, once the new master became available, the cluster state was inconsistent: some anomaly detection jobs were opened, but their state got stuck as "opening".

This PR introduces a retryable action for updating the job state to ensure that the job state is successfully updated and the cluster state remains consistent during the upgrade.

Fixes #126148

Copilot

Pull Request Overview

This PR introduces a retry mechanism to ensure anomaly detection job state updates are committed even if the master node is temporarily unavailable. It refactors the existing setJobState methods to use a new UpdateStateRetryableAction wrapper and adds comprehensive tests to cover success, failure, retry, and no-retry scenarios.

Added UpdateStateRetryableAction inner class to handle retryable persistent task updates.
Refactored both setJobState overloads to invoke the retryable action.
Expanded unit tests in AutodetectProcessManagerTests to validate retry logic, success, failure, and non-retry conditions.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/job/process/autodetect/AutodetectProcessManagerTests.java	Tests for the new retryable state-update behavior.
x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/process/autodetect/AutodetectProcessManager.java	Implementation of `UpdateStateRetryableAction` and refactored `setJobState`.

Comments suppressed due to low confidence (1)

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/process/autodetect/AutodetectProcessManager.java:1044

The catch block references e.getMessage() but the exception variable is named e1, causing a compilation error. Change it to e1.getMessage().

logger.warn("Error while delegating exception [" + e.getMessage() + "]", e1);

Copilot · 2025-06-12T11:19:44Z

...rc/main/java/org/elasticsearch/xpack/ml/job/process/autodetect/AutodetectProcessManager.java

+            TimeValue.MAX_VALUE,
+            jobTask,
+            jobTaskState,
+            ActionListener.wrap(persistentTask -> {


The failure branch is not provided to ActionListener.wrap, so on final failures the user-provided handler will never be invoked. Supply a second lambda that calls handler.accept(e) for the onFailure case.

Copilot · 2025-06-12T11:19:44Z

...rc/main/java/org/elasticsearch/xpack/ml/job/process/autodetect/AutodetectProcessManager.java

+            // retry everything *except* when the task truly no longer exists
+            // TODO valeriy: is this the only exception we should not retry on?
+            return (ExceptionsHelper.unwrapCause(e) instanceof ResourceNotFoundException) == false;


[nitpick] Consider removing or resolving this TODO in production code. Clarify the full set of exceptions that should bypass retries or track this in an issue.

Suggested change

// retry everything *except* when the task truly no longer exists

// TODO valeriy: is this the only exception we should not retry on?

return (ExceptionsHelper.unwrapCause(e) instanceof ResourceNotFoundException) == false;

// Retry everything except when the task truly no longer exists or other critical exceptions occur.

return (ExceptionsHelper.unwrapCause(e) instanceof ResourceNotFoundException) == false

&& (ExceptionsHelper.unwrapCause(e) instanceof ElasticsearchStatusException) == false

&& (ExceptionsHelper.unwrapCause(e) instanceof InvalidAliasNameException) == false;

elasticsearchmachine · 2025-06-12T11:19:50Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2025-06-12T11:19:51Z

Hi @valeriy42, I've created a changelog YAML for you.

…stic#129015) The TransportConnectionListener interface has previously included the Transport.Connection being closed and unregistered in its onNodeDisconnected callback. This is not in use, and can be removed as it is also available in the onConnectionClosed callback. It is being replaced with a Nullable exception that caused the close. This is being used in pending work (ES-11448) to differentiate network issues from node restarts. Closes ES-12007

…tic#128979) Fwd port ESQL_REGEX_MATCH_WITH_CASE_INSENSITIVITY_8_19. Related: elastic#128919.

We miss appending null when ignored_source is not available. Our randomized tests already cover this case, but we do not check it when loading fields. I labelled this non-issue for an unreleased bug. Closes elastic#128959 Relates elastic#119546

The Identity Provider's Service Provider cache had two issues: 1. It checked for identity based on sequence numbers, but didn't include the `seq_no_primary_term` parameter on searches, which means the sequence would always by `-2` 2. It didn't track whether the index was deleted, which means it could be caching values from an old version of the index This commit fixes both of these issues. In practice neither issue was a problem because there are no deployments that use index-based service providers, however the 2nd issue did cause some challenges for testing.

When parsing documents, we receive the document as UTF-8 encoded data which we then parse and convert the fields to java-native UTF-16 encoded Strings. We then convert these strings back to UTF-8 for storage in lucene. This patch skips the redundant conversion, instead passing lucene a direct reference to the received UTF-8 bytes when possible.

This extends the change from elastic#128176 to validate the "custom attributes" on a per Service Provider basis. Each Service Provider (whether registered or wildcard based) has a field "attributes.extensions" which is a list of attribute names that may be provided by the caller of "/_idp/saml/init". Service Providers that have not be configured with extension attributes will reject any custom attributes in SAML init. This necessitates a new field in the service provider index (but only if the new `extensions` attribute is set). The template has been updated, but there is no data migration because the `saml-service-provider` index does not exist in any of the environments into which we wish to deploy this change.

…stic#128930) Modifies the methods to work with a project scope rather than a cluster scope. This is part of an iterative process to make ILM project-aware.

This removes all non-test usages of ``` Metadata.Builder.putCustom(String type, ProjectCustom custom) ``` And replaces it with appropriate calls to the equivalent method on `ProjectMetadata.Builder`. In most cases this _does not_ make the code project aware, but does reduce the number of deprecated methods in use.

…tic#128948)

…8895) Lucene's `org.apache.lucene.util.automaton.Operations#getSingleton` fails with an Automaton for a `REGEXP_EMPTY` `RegExp`. This adds a workaround for that, to check the type of automaton before calling into that failing method. Closes elastic#128813

) IAE gets reported as a 400 status code, but that's inappropriate when inconsistent pages are always bugs, and should be reported with a 500. Throw ISE instead.

This PR fixes an NPE in the semantic highlighter when the document being highlighted doesn't contain value for the semantic text field. Closes elastic#128975

This commit implements TermsQuery(Collection<?> values, @nullable SearchExecutionContext context) in the WildCardFieldMapper to avoid memory pressure when building the query.

…128980)

…ion and lexical search support (elastic#128927) * Add clarifications to semantic text documentation * Regnerate match ESQL docs * Fix whitespace * PR feedback * Update docs/reference/elasticsearch/mapping-reference/semantic-text.md Co-authored-by: Liam Thompson <[email protected]> --------- Co-authored-by: Liam Thompson <[email protected]>

And replace it with appropriate calls to the equivalent method on `ProjectMetadata.Builder`. In most cases this _does not_ make the code project aware, but does reduce the number of deprecated methods in use. Concerns both the getter and the setter.

…reaker elastic#129257

Introduces a new search load metric to the stats infrastructure, measured and tracked on a per-shard basis. The metric represents the Exponentially Weighted Moving Rate (EWMR) of search operations, calculated using the "took" time from each completed search phase.

^^

Now that ESQL has `allow_partial_results` we can reply with a `200` even though some nodes failed to run ESQL. This could happen because the node is restarting. Or because of a bug. Or a disconnect. All kinds of things. This logs those partial failures so an operator can look at them and get a sense of why they are happening.

This fixes a silly bug where we didn't override `OffHeapStats` for IVF.

* feat: enable date_detection for all apm data streams * Update resources.yaml * Create 128913.yml --------- Co-authored-by: Carson Ip <[email protected]>

This PR introduces several fixes to various IT tests, related to the use and misuse of the version identifier for the start cluster: wherever we can, we replace of versions in test code with features where we can't, we make sure we use the actual stack version (the one provided by -Dtests.bwc.main.version and not the bogus "0.0.0" version string) when requesting the cluster version we make sure we do use the "unresolved" version identifier (the value of the tests.old_cluster_version system property e.g. 0.0.0 ) so we resolve the right distribution These changes enabled the tests to be used in BC upgrade tests (and potentially in serverless upgrade tests too, where they would have also failed) Relates to ES-12010 Precedes elastic#128614, elastic#128823 and elastic#128983

) * [Build] Build maven aggregation zip as part of DRA build * Update path for aggregation zip

The threadpool-based merge scheduler triggers indexing throttling if merges are still getting enqueued faster than they're executed, while they are also disk IO unthrottled. This PR fixes the case where indexing throttling was incorrectly NOT triggered when disk IO throttling was disabled via the index settings.

…129255) * Register match_phrase as a function not a snapshot function * Update usage

- provides better configuration cache support - requires some rework due to changed defaults

* Adding support to exclude semantic_text subfields * Update docs/changelog/127664.yaml * Updating changelog file * remove duplicate test from yaml file * Adding support to exclude semantic_text subfields from mapper builders * Adding support for generic field types * refactoring to use builder and setting exclude value from semantic_text mapper * update in semantic_text mapper and fetcher to incorporate the support functionality * Fix code style issue * adding node feature for yaml tests * Adding more restrictive checks on yaml tests and few refactoring * Returns metadata fields from metadata mappers * returns all source fields for fieldcaps * gather all fields and iterate to process for fieldcaps api * revert back all changes from MappedFieldtype and subclasses * revert back exclude logic from semantic_text mapper * fix lint issues * fix lint issues * Adding runtime fields into fieldCaps * Fix linting issue * removing unused functions that used in previous implementation * fix multifield tests failure * getting alias fields for field caps * adding support for query time runtime fields * [CI] Auto commit changes from spotless * Fix empty mapping fieldCaps call * Address passthrough behavior for mappers * Fix SearchAsYoutype mapper failures * rename abstract method to have more meaningful name * Rename mapper function to match its functionality * Adding filtering for infernece subfields * revert back previous implementation changes * Adding yaml test for field caps not filtering multi-field * Fixing yaml test * Adding comment why .infernece filter is added --------- Co-authored-by: elasticsearchmachine <[email protected]> Co-authored-by: Elastic Machine <[email protected]>

This reverts commit 6370d60.

update to use ES logger instead of infostream and fixing native access warnings

Co-authored-by: ywangd <[email protected]> Co-authored-by: rjernst <[email protected]> Relates: ES-11445

…entations (elastic#128293)" (elastic#129206) This reverts commit de7c91c.

ldematte · 2025-06-13T07:42:24Z

I think you need to rebase your PR (or open a new one)

valeriy42 · 2025-06-13T07:46:41Z

Replaced by #129391

valeriy42 requested a review from Copilot June 12, 2025 11:17

elasticsearchmachine added needs:triage Requires assignment of a team area label v9.1.0 labels Jun 12, 2025

valeriy42 added >bug :ml Machine learning v8.19.0 v8.17.8 v8.18.3 and removed needs:triage Requires assignment of a team area label labels Jun 12, 2025

Copilot AI reviewed Jun 12, 2025

View reviewed changes

elasticsearchmachine added the Team:ML Meta label for the ML team label Jun 12, 2025

schase-es and others added 17 commits June 12, 2025 13:27

ESQL: Forward port 8.19 RegexMatch serialization change version (elas…

c07678f

…tic#128979) Fwd port ESQL_REGEX_MATCH_WITH_CASE_INSENSITIVITY_8_19. Related: elastic#128919.

[ES|QL] Fix test releases for LookupJoinTypesIT. (elastic#128985)

53dfd00

Make utility methods in IndexLifecycleTransition project-aware (ela…

8612084

…stic#128930) Modifies the methods to work with a project scope rather than a cluster scope. This is part of an iterative process to make ILM project-aware.

[ES|QL] Enable the completion command as a tech preview feature (elas…

1dd1f61

…tic#128948)

[ES|QL] Fix test releases for telemetry. (elastic#128971)

8503c64

ESQL: Throw ISE instead of IAE for illegal block in page (elastic#128960

bebe561

) IAE gets reported as a 400 status code, but that's inappropriate when inconsistent pages are always bugs, and should be reported with a 500. Throw ISE instead.

Fix NPE in semantic highlighter (elastic#128989)

b48d827

This PR fixes an NPE in the semantic highlighter when the document being highlighted doesn't contain value for the semantic text field. Closes elastic#128975

Improve execution of terms queries over wildcard fields (elastic#128986)

e31f5ed

This commit implements TermsQuery(Collection<?> values, @nullable SearchExecutionContext context) in the WildCardFieldMapper to avoid memory pressure when building the query.

ESQL: Simplify agg construction when reducing on node-level (elastic#…

b40d842

…128980)

nielsbauman and others added 22 commits June 12, 2025 13:27

Mute org.elasticsearch.compute.data.sort.LongTopNSetTests testCrankyB…

78c6d1c

…reaker elastic#129257

[ESQL] Fix typo in search-functions.md (elastic#129260)

6920b1e

^^

Update Gradle wrapper to 8.14.2 (elastic#129179)

6504b27

Fix ivf nodestats impl for getOffHeapByteSize (elastic#129259)

3d79e52

This fixes a silly bug where we didn't override `OffHeapStats` for IVF.

feat: enable date_detection for all apm data streams (elastic#128913)

8a09fc6

* feat: enable date_detection for all apm data streams * Update resources.yaml * Create 128913.yml --------- Co-authored-by: Carson Ip <[email protected]>

[Build] Build maven aggregation zip as part of DRA build (elastic#129175

97e5192

) * [Build] Build maven aggregation zip as part of DRA build * Update path for aggregation zip

Register match_phrase as a function not a snapshot function (elastic#…

4535fbb

…129255) * Register match_phrase as a function not a snapshot function * Update usage

[Gradle] Spotless plugin update (elastic#115750)

c9d22cd

- provides better configuration cache support - requires some rework due to changed defaults

Revert "[Gradle] Spotless plugin update (elastic#115750)"

57c108a

This reverts commit 6370d60.

Switch IVF Writer to ES Logger (elastic#129224)

020e03a

update to use ES logger instead of infostream and fixing native access warnings

Add heap usage estimate to ClusterInfo (elastic#128723)

0f129a2

Co-authored-by: ywangd <[email protected]> Co-authored-by: rjernst <[email protected]> Relates: ES-11445

Revert "Use IndexOrDocValuesQuery in NumberFieldType#termQuery implem…

11cfedf

…entations (elastic#128293)" (elastic#129206) This reverts commit de7c91c.

initial implementation

4aa8e6b

Unit tests implemented

0797b17

Unit tests commented and fixed

5fb283a

Update docs/changelog/129329.yaml

2596df9

valeriy42 force-pushed the bugfix/is-1543-opening-state branch from e29fe33 to 2596df9 Compare June 13, 2025 07:38

valeriy42 requested review from a team as code owners June 13, 2025 07:38

valeriy42 closed this Jun 13, 2025

valeriy42 deleted the bugfix/is-1543-opening-state branch June 13, 2025 07:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ML] Ensure that AD job state update retries if master node is temoporarily unavailable #129329

[ML] Ensure that AD job state update retries if master node is temoporarily unavailable #129329

Uh oh!

valeriy42 commented Jun 12, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jun 12, 2025

Uh oh!

Copilot AI Jun 12, 2025

Uh oh!

elasticsearchmachine commented Jun 12, 2025

Uh oh!

elasticsearchmachine commented Jun 12, 2025

Uh oh!

ldematte commented Jun 13, 2025

Uh oh!

valeriy42 commented Jun 13, 2025

Uh oh!

Uh oh!

[ML] Ensure that AD job state update retries if master node is temoporarily unavailable #129329

[ML] Ensure that AD job state update retries if master node is temoporarily unavailable #129329

Uh oh!

Conversation

valeriy42 commented Jun 12, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Jun 12, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jun 12, 2025

Choose a reason for hiding this comment

Uh oh!

elasticsearchmachine commented Jun 12, 2025

Uh oh!

elasticsearchmachine commented Jun 12, 2025

Uh oh!

ldematte commented Jun 13, 2025

Uh oh!

valeriy42 commented Jun 13, 2025

Uh oh!

Uh oh!