-
Notifications
You must be signed in to change notification settings - Fork 25.3k
[ML] Ensure that AD job state update retries if master node is temoporarily unavailable #129329
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a retry mechanism to ensure anomaly detection job state updates are committed even if the master node is temporarily unavailable. It refactors the existing setJobState
methods to use a new UpdateStateRetryableAction
wrapper and adds comprehensive tests to cover success, failure, retry, and no-retry scenarios.
- Added
UpdateStateRetryableAction
inner class to handle retryable persistent task updates. - Refactored both
setJobState
overloads to invoke the retryable action. - Expanded unit tests in
AutodetectProcessManagerTests
to validate retry logic, success, failure, and non-retry conditions.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
x-pack/plugin/ml/src/test/java/org/elasticsearch/xpack/ml/job/process/autodetect/AutodetectProcessManagerTests.java | Tests for the new retryable state-update behavior. |
x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/process/autodetect/AutodetectProcessManager.java | Implementation of UpdateStateRetryableAction and refactored setJobState . |
Comments suppressed due to low confidence (1)
x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/process/autodetect/AutodetectProcessManager.java:1044
- The catch block references
e.getMessage()
but the exception variable is namede1
, causing a compilation error. Change it toe1.getMessage()
.
logger.warn("Error while delegating exception [" + e.getMessage() + "]", e1);
TimeValue.MAX_VALUE, | ||
jobTask, | ||
jobTaskState, | ||
ActionListener.wrap(persistentTask -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The failure branch is not provided to ActionListener.wrap
, so on final failures the user-provided handler will never be invoked. Supply a second lambda that calls handler.accept(e)
for the onFailure case.
Copilot uses AI. Check for mistakes.
// retry everything *except* when the task truly no longer exists | ||
// TODO valeriy: is this the only exception we should not retry on? | ||
return (ExceptionsHelper.unwrapCause(e) instanceof ResourceNotFoundException) == false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Consider removing or resolving this TODO in production code. Clarify the full set of exceptions that should bypass retries or track this in an issue.
// retry everything *except* when the task truly no longer exists | |
// TODO valeriy: is this the only exception we should not retry on? | |
return (ExceptionsHelper.unwrapCause(e) instanceof ResourceNotFoundException) == false; | |
// Retry everything except when the task truly no longer exists or other critical exceptions occur. | |
return (ExceptionsHelper.unwrapCause(e) instanceof ResourceNotFoundException) == false | |
&& (ExceptionsHelper.unwrapCause(e) instanceof ElasticsearchStatusException) == false | |
&& (ExceptionsHelper.unwrapCause(e) instanceof InvalidAliasNameException) == false; |
Copilot uses AI. Check for mistakes.
Pinging @elastic/ml-core (Team:ML) |
Hi @valeriy42, I've created a changelog YAML for you. |
…stic#129015) The TransportConnectionListener interface has previously included the Transport.Connection being closed and unregistered in its onNodeDisconnected callback. This is not in use, and can be removed as it is also available in the onConnectionClosed callback. It is being replaced with a Nullable exception that caused the close. This is being used in pending work (ES-11448) to differentiate network issues from node restarts. Closes ES-12007
…tic#128979) Fwd port ESQL_REGEX_MATCH_WITH_CASE_INSENSITIVITY_8_19. Related: elastic#128919.
We miss appending null when ignored_source is not available. Our randomized tests already cover this case, but we do not check it when loading fields. I labelled this non-issue for an unreleased bug. Closes elastic#128959 Relates elastic#119546
The Identity Provider's Service Provider cache had two issues: 1. It checked for identity based on sequence numbers, but didn't include the `seq_no_primary_term` parameter on searches, which means the sequence would always by `-2` 2. It didn't track whether the index was deleted, which means it could be caching values from an old version of the index This commit fixes both of these issues. In practice neither issue was a problem because there are no deployments that use index-based service providers, however the 2nd issue did cause some challenges for testing.
When parsing documents, we receive the document as UTF-8 encoded data which we then parse and convert the fields to java-native UTF-16 encoded Strings. We then convert these strings back to UTF-8 for storage in lucene. This patch skips the redundant conversion, instead passing lucene a direct reference to the received UTF-8 bytes when possible.
This extends the change from elastic#128176 to validate the "custom attributes" on a per Service Provider basis. Each Service Provider (whether registered or wildcard based) has a field "attributes.extensions" which is a list of attribute names that may be provided by the caller of "/_idp/saml/init". Service Providers that have not be configured with extension attributes will reject any custom attributes in SAML init. This necessitates a new field in the service provider index (but only if the new `extensions` attribute is set). The template has been updated, but there is no data migration because the `saml-service-provider` index does not exist in any of the environments into which we wish to deploy this change.
…stic#128930) Modifies the methods to work with a project scope rather than a cluster scope. This is part of an iterative process to make ILM project-aware.
This removes all non-test usages of ``` Metadata.Builder.putCustom(String type, ProjectCustom custom) ``` And replaces it with appropriate calls to the equivalent method on `ProjectMetadata.Builder`. In most cases this _does not_ make the code project aware, but does reduce the number of deprecated methods in use.
…8895) Lucene's `org.apache.lucene.util.automaton.Operations#getSingleton` fails with an Automaton for a `REGEXP_EMPTY` `RegExp`. This adds a workaround for that, to check the type of automaton before calling into that failing method. Closes elastic#128813
This PR fixes an NPE in the semantic highlighter when the document being highlighted doesn't contain value for the semantic text field. Closes elastic#128975
This commit implements TermsQuery(Collection<?> values, @nullable SearchExecutionContext context) in the WildCardFieldMapper to avoid memory pressure when building the query.
…ion and lexical search support (elastic#128927) * Add clarifications to semantic text documentation * Regnerate match ESQL docs * Fix whitespace * PR feedback * Update docs/reference/elasticsearch/mapping-reference/semantic-text.md Co-authored-by: Liam Thompson <[email protected]> --------- Co-authored-by: Liam Thompson <[email protected]>
And replace it with appropriate calls to the equivalent method on `ProjectMetadata.Builder`. In most cases this _does not_ make the code project aware, but does reduce the number of deprecated methods in use. Concerns both the getter and the setter.
Introduces a new search load metric to the stats infrastructure, measured and tracked on a per-shard basis. The metric represents the Exponentially Weighted Moving Rate (EWMR) of search operations, calculated using the "took" time from each completed search phase.
Now that ESQL has `allow_partial_results` we can reply with a `200` even though some nodes failed to run ESQL. This could happen because the node is restarting. Or because of a bug. Or a disconnect. All kinds of things. This logs those partial failures so an operator can look at them and get a sense of why they are happening.
This fixes a silly bug where we didn't override `OffHeapStats` for IVF.
* feat: enable date_detection for all apm data streams * Update resources.yaml * Create 128913.yml --------- Co-authored-by: Carson Ip <[email protected]>
This PR introduces several fixes to various IT tests, related to the use and misuse of the version identifier for the start cluster: wherever we can, we replace of versions in test code with features where we can't, we make sure we use the actual stack version (the one provided by -Dtests.bwc.main.version and not the bogus "0.0.0" version string) when requesting the cluster version we make sure we do use the "unresolved" version identifier (the value of the tests.old_cluster_version system property e.g. 0.0.0 ) so we resolve the right distribution These changes enabled the tests to be used in BC upgrade tests (and potentially in serverless upgrade tests too, where they would have also failed) Relates to ES-12010 Precedes elastic#128614, elastic#128823 and elastic#128983
The threadpool-based merge scheduler triggers indexing throttling if merges are still getting enqueued faster than they're executed, while they are also disk IO unthrottled. This PR fixes the case where indexing throttling was incorrectly NOT triggered when disk IO throttling was disabled via the index settings.
…129255) * Register match_phrase as a function not a snapshot function * Update usage
- provides better configuration cache support - requires some rework due to changed defaults
* Adding support to exclude semantic_text subfields * Update docs/changelog/127664.yaml * Updating changelog file * remove duplicate test from yaml file * Adding support to exclude semantic_text subfields from mapper builders * Adding support for generic field types * refactoring to use builder and setting exclude value from semantic_text mapper * update in semantic_text mapper and fetcher to incorporate the support functionality * Fix code style issue * adding node feature for yaml tests * Adding more restrictive checks on yaml tests and few refactoring * Returns metadata fields from metadata mappers * returns all source fields for fieldcaps * gather all fields and iterate to process for fieldcaps api * revert back all changes from MappedFieldtype and subclasses * revert back exclude logic from semantic_text mapper * fix lint issues * fix lint issues * Adding runtime fields into fieldCaps * Fix linting issue * removing unused functions that used in previous implementation * fix multifield tests failure * getting alias fields for field caps * adding support for query time runtime fields * [CI] Auto commit changes from spotless * Fix empty mapping fieldCaps call * Address passthrough behavior for mappers * Fix SearchAsYoutype mapper failures * rename abstract method to have more meaningful name * Rename mapper function to match its functionality * Adding filtering for infernece subfields * revert back previous implementation changes * Adding yaml test for field caps not filtering multi-field * Fixing yaml test * Adding comment why .infernece filter is added --------- Co-authored-by: elasticsearchmachine <[email protected]> Co-authored-by: Elastic Machine <[email protected]>
This reverts commit 6370d60.
update to use ES logger instead of infostream and fixing native access warnings
Co-authored-by: ywangd <[email protected]> Co-authored-by: rjernst <[email protected]> Relates: ES-11445
…entations (elastic#128293)" (elastic#129206) This reverts commit de7c91c.
e29fe33
to
2596df9
Compare
I think you need to rebase your PR (or open a new one) |
Replaced by #129391 |
During cluster upgrade, the anomaly detection jobs must be reassigned from one ML node to another. During this reassignment, the jobs transition through several states, including "opening" and "opened". If, during this transition, the master node becomes temporarily unavailable, e.g., due to reassignment, the new job state is not successfully committed to the cluster state. Therefore, once the new master became available, the cluster state was inconsistent: some anomaly detection jobs were opened, but their state got stuck as "opening".
This PR introduces a retryable action for updating the job state to ensure that the job state is successfully updated and the cluster state remains consistent during the upgrade.
Fixes #126148