
Coordinator fails to elect leader when zookeeper connection transitions from LOST state to RECONNECTING state #17786


Open
vimil-saju opened this issue Mar 9, 2025 · 0 comments


vimil-saju commented Mar 9, 2025

Affected Version

29.0.0


Description

We have Druid 29.0.0 deployed on a Kubernetes cluster, along with Zookeeper, which is configured with Istio-proxy enabled. Recently, we disabled Istio-proxy on the Zookeeper pods and restarted Zookeeper. Following this change, we observed that the Druid coordinators lost leadership. Specifically, the LeaderLatch did not invoke reset() to create the ephemeral node when the Zookeeper connection state transitioned from LOST to RECONNECTING. This resulted in 503 errors for requests to the coordinators, as there was no leader available.

Upon further investigation, we discovered that this issue is present in the Curator library version 5.5, which Druid currently uses. The problem has been fixed in version 5.8 of the Curator library. More details can be found in the related issues:

  • JIRA issue: CURATOR-724.
  • GitHub issue: LeaderLatch isn't able to recover after zk recover/leaderPath missing CURATOR-724

I believe upgrading the Curator library to version 5.8.0 will resolve this issue.
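
For anyone who wants to check whether their cluster is in the same state (no coordinator holding leadership), here is a minimal sketch that probes each coordinator's `/druid/coordinator/v1/isLeader` endpoint, which should return 200 on the current leader and 404 otherwise. The class name, the helper methods, and the idea of passing coordinator base URLs as command-line arguments are my own assumptions for illustration, not part of Druid or Curator.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of Druid.
public class CoordinatorLeaderCheck {

    // Maps the HTTP status of GET /druid/coordinator/v1/isLeader to a label.
    static String classify(int status) {
        if (status == 200) return "LEADER";
        if (status == 404) return "NOT_LEADER";
        return "UNKNOWN(" + status + ")";
    }

    // True if at least one probed coordinator reported itself as leader.
    static boolean hasLeader(List<Integer> statuses) {
        return statuses.stream().anyMatch(s -> s == 200);
    }

    // Sends one probe and returns the HTTP status code.
    static int probe(HttpClient client, String baseUrl) throws Exception {
        HttpRequest request = HttpRequest
            .newBuilder(URI.create(baseUrl + "/druid/coordinator/v1/isLeader"))
            .timeout(Duration.ofSeconds(5))
            .GET()
            .build();
        return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
    }

    public static void main(String[] args) throws Exception {
        // args: coordinator base URLs, e.g. http://coordinator-0:8081 (assumed hosts)
        HttpClient client = HttpClient.newHttpClient();
        List<Integer> statuses = new ArrayList<>();
        for (String baseUrl : args) {
            int status = probe(client, baseUrl);
            statuses.add(status);
            System.out.println(baseUrl + " -> " + classify(status));
        }
        if (!hasLeader(statuses)) {
            System.out.println("No coordinator currently reports leadership.");
        }
    }
}
```

If every coordinator comes back as something other than `LEADER` after ZooKeeper has recovered, that would be consistent with the behavior described above. It does not by itself prove the Curator bug is the cause.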
