Old Coordinator Leader keeps emitting stale metrics for a specific datasource #17730
Comments
@aruraghuwanshi, I am not sure if you are referring to a metric emitted by Druid itself or some metric emitted by Kafka (since the logs you shared above indicate something originating in Kafka code).
Attached two examples here (screenshots). I'm referring to all of the Druid metrics mentioned in this Kafka section.
Adding the metric names here for reference, @kfaraz:
For more context, we've faced this issue a few times before, and the only resolution seems to be to kill the old coordinator leader pod. Once that pod restarts and stabilizes, the stale metric disappears and the real values of the Kafka lag metric(s) are emitted only by the current coordinator leader. (A sketch of that workaround is shown below.)
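For anyone else hitting this, a minimal sketch of that workaround using the Kubernetes Python client might look like the following. The pod name, namespace, and the assumption that the coordinator runs under a Deployment/StatefulSet (so the deleted pod is recreated) are placeholders, not details from this issue.

```python
# Hypothetical sketch: restart the stale coordinator leader pod by deleting it
# and letting its Deployment/StatefulSet recreate it.
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config() when run in-cluster
v1 = client.CoreV1Api()

STALE_POD = "druid-coordinator-1"  # placeholder: the old-leader pod identified as stale
NAMESPACE = "druid"                # placeholder namespace

# Deleting the pod forces the restart described in the comment above.
v1.delete_namespaced_pod(name=STALE_POD, namespace=NAMESPACE)
```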
Could you please check the heartbeat metric? Side note: are you running the coordinator and overlord as a single service?
Hey @kfaraz, unfortunately we're not emitting the heartbeat metric, but we did confirm that we only had one active leader at the time of this incident, as per the logs shown here.
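As a side note for anyone trying to verify the same thing without the heartbeat metric: Druid exposes a leadership endpoint on each coordinator (`/druid/coordinator/v1/isLeader`), so you can poll every coordinator and check that exactly one claims leadership. A minimal sketch follows; the hostnames and port are placeholders for your own deployment.

```python
# Sketch: ask each coordinator whether it believes it is the leader.
# A coordinator that is the leader responds with HTTP 200; non-leaders respond 404.
import requests

COORDINATORS = [
    "http://druid-coordinator-0:8081",  # placeholder hosts/ports
    "http://druid-coordinator-1:8081",
]

leaders = []
for host in COORDINATORS:
    resp = requests.get(f"{host}/druid/coordinator/v1/isLeader", timeout=5)
    if resp.status_code == 200:
        leaders.append(host)

print("coordinators claiming leadership:", leaders)
```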
Affected Version
28.0.1
Description
After a coordinator leader election concludes, the old coordinator leader sometimes keeps emitting stale metrics for a specific datasource, even though the new coordinator leader emits the up-to-date values. This creates duplicate and erroneous reporting.
Some details that were noted:
Reference image attached (blue: old coordinator's metric for one specific datasource in a stuck/stale state; green: new coordinator emitting the same metric with the real-time values).