Enhanced Cassandra health checks #1
Draft
Motivation:
The current Cassandra health check simply queries one randomly
chosen node in the cluster and sets the status to DOWN if that
node cannot be queried.
This is admittedly misleading, since Cassandra clusters are usually
resilient to the loss of many nodes, and can still continue to
serve application queries even if many nodes are down.
Modification:
This commit replaces the current health check with the following
checks:
and globally, and how many were up or down.
certain token ranges are unavailable for the configured
keyspace, local datacenter and consistency level.
Note that these checks do not require querying the cluster; they
rely solely on metadata exposed by the driver and Gossip events,
and thus execute very fast.
The status is set to DOWN if:
Result:
The health report now accurately describes the cluster health
by setting its status to DOWN when queries are likely to
fail due to UnavailableException errors.
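
As an illustration of the token-range availability idea described
above, here is a minimal sketch in plain Java. It is not the code in
this pull request and does not use the driver API: the class name,
the method names, and the map of live replica counts per token range
are all hypothetical. It also assumes a quorum-style consistency
level and assumes the status goes DOWN as soon as one range cannot
reach the required replica count.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from the pull request
// or from the driver API.
final class TokenRangeAvailabilitySketch {

    // Number of replicas that must be up for a quorum-style
    // consistency level, given the replication factor.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // A token range is considered available when at least the
    // required number of its replicas are up.
    static boolean isAvailable(int liveReplicas, int requiredReplicas) {
        return liveReplicas >= requiredReplicas;
    }

    // Assumption: report "DOWN" if any token range is unavailable,
    // otherwise "UP". Keys are illustrative range labels; values are
    // the number of replicas currently up for that range.
    static String status(Map<String, Integer> liveReplicasByRange,
                         int requiredReplicas) {
        for (int liveReplicas : liveReplicasByRange.values()) {
            if (!isAvailable(liveReplicas, requiredReplicas)) {
                return "DOWN";
            }
        }
        return "UP";
    }

    public static void main(String[] args) {
        // Made-up sample data: three ranges, replication factor 3.
        Map<String, Integer> liveReplicasByRange = new LinkedHashMap<>();
        liveReplicasByRange.put("range-a", 3);
        liveReplicasByRange.put("range-b", 2);
        liveReplicasByRange.put("range-c", 1);

        int required = quorum(3);
        System.out.println("required replicas: " + required);
        System.out.println("status: " + status(liveReplicasByRange, required));
    }
}
```

Because the inputs are plain counts held in memory, a check shaped
like this needs no query against the cluster, which matches the note
above that the checks rely only on metadata.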