
Old Coordinator Leader keeps emitting stale metrics for a specific datasource #17730


Open
aruraghuwanshi opened this issue Feb 14, 2025 · 7 comments


@aruraghuwanshi
Contributor

Affected Version

28.0.1

Description

After a coordinator leader election concludes, the old coordinator leader sometimes keeps emitting stale metrics for a specific datasource, even though the new coordinator leader is emitting the real-time metrics. This creates duplication and erroneous reporting.

Some details that were noted:

  • It's always the high-ingestion-volume datasource that goes into this problem state, with two coordinators emitting that metric (one stale, one actual). All other datasources' metrics are emitted only by the new coordinator leader.
  • There is no difference in the logs emitted by the old coordinator leader for the problematic datasource versus the healthy ones.

Reference image attached (blue: old coordinator emitting the metric for one specific datasource in a stuck/stale state; green: new coordinator emitting the same metric with the real-time values).

@aruraghuwanshi
Contributor Author

[Screenshot: log line showing the metrics reporter being marked closed during the leader transition]

The metrics reporter was marked closed for that datasource during the leader transition, but the stale metrics are still being emitted by the old coordinator leader.

@kfaraz
Contributor

kfaraz commented Feb 17, 2025

@aruraghuwanshi , I am not sure if you are referring to a metric emitted by Druid itself or some metric emitted by Kafka (since the logs you shared above indicate something originating in Kafka code).
If it's the former, can you please share the stale metric names that are coming from the old coordinator leader?

@aruraghuwanshi
Contributor Author

Attached two examples here. I'm referring to all of the Druid metrics listed in the Kafka section of the metrics docs.

[Screenshots: the two metric examples mentioned above]

@aruraghuwanshi
Contributor Author

Adding the metric names here for reference @kfaraz :
  • ingest/kafka/lag
  • ingest/kafka/maxLag
  • ingest/kafka/avgLag
  • ingest/kafka/partitionLag
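
One way to confirm the duplicate emission independently of the dashboard is to group the raw metric events by emitting host. Below is a rough, untested sketch that assumes the metric events are available as JSON lines in the shape produced by Druid's JSON-based emitters (fields such as metric, host, and dataSource); the field names and the input file are assumptions, so adjust them for whichever emitter is actually in use:

```python
import json
import sys
from collections import defaultdict

# The Kafka lag metrics reported as stale above.
LAG_METRICS = {
    "ingest/kafka/lag",
    "ingest/kafka/maxLag",
    "ingest/kafka/avgLag",
    "ingest/kafka/partitionLag",
}


def main(path):
    # (metric, dataSource) -> set of hosts seen emitting it
    emitters = defaultdict(set)
    with open(path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            # Field names assume Druid's JSON metric events; adjust if needed.
            if event.get("metric") in LAG_METRICS:
                key = (event["metric"], event.get("dataSource"))
                emitters[key].add(event.get("host"))

    for (metric, datasource), hosts in sorted(emitters.items()):
        flag = "DUPLICATE" if len(hosts) > 1 else "ok"
        print(f"{flag:9} {metric:26} {datasource}: {sorted(hosts)}")


if __name__ == "__main__":
    main(sys.argv[1])
```

Any (metric, dataSource) pair reported by more than one host would correspond to the duplication described above.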

@aruraghuwanshi
Contributor Author

For more context, we've faced this issue a few times before, and the only resolution seems to be killing the old coordinator leader pod. Once that pod restarts and stabilizes, the stale metric disappears and the real values of the Kafka lag metric(s) are emitted only by the current coordinator leader.

@kfaraz
Contributor

kfaraz commented Mar 6, 2025

Could you please check the service/heartbeat metric with the leader dimension for the two coordinators, and see if both of them consider themselves to be the leader at the same time?

Side note: Are you running coordinator and overlord as a single service?
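
If the heartbeat metric isn't available, leadership can also be checked directly against the Coordinator and Overlord leader-election APIs. A minimal sketch follows (the host names are placeholders; in combined Coordinator-Overlord mode the Overlord endpoint is served by the same process):

```python
import requests

# Placeholder base URLs for the two coordinator pods; replace with the real ones.
COORDINATORS = [
    "http://old-coordinator:8081",
    "http://new-coordinator:8081",
]

# The isLeader endpoints are expected to return HTTP 200 with {"leader": true}
# on the current leader and a non-200 status on non-leaders.
ENDPOINTS = {
    "coordinator": "/druid/coordinator/v1/isLeader",
    "overlord": "/druid/indexer/v1/isLeader",  # same process in combined mode
}

for base in COORDINATORS:
    for role, path in ENDPOINTS.items():
        try:
            resp = requests.get(base + path, timeout=5)
            claims_leader = resp.status_code == 200 and resp.json().get("leader", False)
        except requests.RequestException as exc:
            print(f"{base} [{role}]: request failed ({exc})")
            continue
        print(f"{base} [{role}]: leader={claims_leader}")
```

If both pods report leader=true for either role at the same time, that would point to a leader-election split-brain rather than a metrics-reporting issue.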

@aruraghuwanshi
Contributor Author

aruraghuwanshi commented Mar 8, 2025

Hey @kfaraz, unfortunately we're not emitting the heartbeat metric, but we did confirm that we had only one active leader at the time of this incident, as per the logs shown here.

> Are you running coordinator and overlord as a single service?

Yes, that is accurate.

[Screenshot: logs showing a single active leader during the incident]
