
Old Coordinator Leader keeps emitting stale metrics for a specific datasource #17730


Open
aruraghuwanshi opened this issue Feb 14, 2025 · 7 comments


@aruraghuwanshi
Contributor

Affected Version

28.0.1

Description

After a coordinator leader election concludes, the old coordinator leader sometimes keeps emitting stale metrics for a specific datasource, even though the new coordinator leader is emitting the real-time metrics. This creates duplication and erroneous reporting.

Some details that were noted:

  • It's always the high-ingestion-volume datasource that goes into this problem state, with two coordinators emitting that metric (one stale, one actual). All other datasources' metrics are emitted only by the new coordinator leader.
  • There is no difference in the logs emitted by the old coordinator leader for the problematic datasource versus the healthy ones.

Reference image attached (blue: old coordinator emitting the metric for one specific datasource in a stuck/stale state; green: new coordinator emitting the same metric with the real-time values).

@aruraghuwanshi
Contributor Author

[Screenshot: log line showing the metrics reporter being marked closed during the leader transition]

The metrics reporter was marked closed for that datasource during the leader transition, but the stale metrics are still being emitted by the old coordinator leader.

@kfaraz
Contributor

kfaraz commented Feb 17, 2025

@aruraghuwanshi , I am not sure if you are referring to a metric emitted by Druid itself or some metric emitted by Kafka (since the logs you shared above indicate something originating in Kafka code).
If it's the former, can you please share the stale metric names that are coming from the old coordinator leader?

@aruraghuwanshi
Contributor Author

Attached two examples here. I'm referring to all of the Druid metrics listed in the Kafka section of the metrics docs.

[Screenshots: the two metric examples mentioned above]

@aruraghuwanshi
Contributor Author

Adding the metric names here for reference @kfaraz :
  • ingest/kafka/lag
  • ingest/kafka/maxLag
  • ingest/kafka/avgLag
  • ingest/kafka/partitionLag
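
One way to confirm the duplicate emission independently of the dashboard is to group the raw metric events by emitting host. Below is a rough, untested sketch that assumes the metric events are available as JSON lines in the shape produced by Druid's JSON-based emitters (fields such as metric, host, and dataSource); the field names and the input file are assumptions, so adjust them for whichever emitter is actually in use:

```python
import json
import sys
from collections import defaultdict

# The Kafka lag metrics reported as stale above.
LAG_METRICS = {
    "ingest/kafka/lag",
    "ingest/kafka/maxLag",
    "ingest/kafka/avgLag",
    "ingest/kafka/partitionLag",
}


def main(path):
    # (metric, dataSource) -> set of hosts seen emitting it
    emitters = defaultdict(set)
    with open(path) as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue
            # Field names assume Druid's JSON metric events; adjust if needed.
            if event.get("metric") in LAG_METRICS:
                key = (event["metric"], event.get("dataSource"))
                emitters[key].add(event.get("host"))

    for (metric, datasource), hosts in sorted(emitters.items()):
        flag = "DUPLICATE" if len(hosts) > 1 else "ok"
        print(f"{flag:9} {metric:26} {datasource}: {sorted(hosts)}")


if __name__ == "__main__":
    main(sys.argv[1])
```

Any (metric, dataSource) pair reported by more than one host would correspond to the duplication described above.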

@aruraghuwanshi
Contributor Author

For more context, we've faced this issue a few times before, and the only resolution seems to be killing the old coordinator leader pod. Once that pod restarts and stabilizes, the stale metric disappears and the real values of the Kafka lag metric(s) are emitted only by the current coordinator leader.

@kfaraz
Contributor

kfaraz commented Mar 6, 2025

Could you please check the service/heartbeat metric with the leader dimension for the two coordinators, and see if both of them consider themselves to be the leader at the same time?

Side note: Are you running coordinator and overlord as a single service?
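
If the heartbeat metric isn't available, leadership can also be checked directly against the Coordinator and Overlord leader-election APIs. A minimal sketch follows (the host names are placeholders; in combined Coordinator-Overlord mode the Overlord endpoint is served by the same process):

```python
import requests

# Placeholder base URLs for the two coordinator pods; replace with the real ones.
COORDINATORS = [
    "http://old-coordinator:8081",
    "http://new-coordinator:8081",
]

# The isLeader endpoints are expected to return HTTP 200 with {"leader": true}
# on the current leader and a non-200 status on non-leaders.
ENDPOINTS = {
    "coordinator": "/druid/coordinator/v1/isLeader",
    "overlord": "/druid/indexer/v1/isLeader",  # same process in combined mode
}

for base in COORDINATORS:
    for role, path in ENDPOINTS.items():
        try:
            resp = requests.get(base + path, timeout=5)
            claims_leader = resp.status_code == 200 and resp.json().get("leader", False)
        except requests.RequestException as exc:
            print(f"{base} [{role}]: request failed ({exc})")
            continue
        print(f"{base} [{role}]: leader={claims_leader}")
```

If both pods report leader=true for either role at the same time, that would point to a leader-election split-brain rather than a metrics-reporting issue.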

@aruraghuwanshi
Contributor Author

aruraghuwanshi commented Mar 8, 2025

Hey @kfaraz, unfortunately we're not emitting the heartbeat metric, but we did confirm that we had only one active leader at the time of this incident, as per the logs shown here.

> Are you running coordinator and overlord as a single service?

Yes, that is accurate.

[Screenshot: logs showing a single active leader during the incident]
