@adutra adutra commented Apr 17, 2020

Motivation:

The current Cassandra health check simply attempts to query one
node in the cluster, randomly chosen, and sets the status to
DOWN if that node cannot be queried.

This is admittedly misleading, since Cassandra clusters are usually
resilient to the loss of many nodes, and can still continue to
serve application queries even if many nodes are down.

Modification:

This commit replaces the current health check with the following
checks:

  • Topology: reports how many nodes were found in each datacenter
    and globally, and how many were up or down.
  • Token range availability: inspects the ring and detects if
    certain token ranges are unavailable for the configured
    keyspace, local datacenter and consistency level.

Note that these checks do not require querying the cluster; they
rely solely on metadata exposed by the driver and on Gossip events,
and thus execute very fast.

The status is set to DOWN if:

  1. The entire cluster is down; or
  2. There are unavailable token ranges.
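
To make the rule above concrete, here is a minimal, self-contained sketch of how the two DOWN conditions could be evaluated from in-memory metadata alone. It is not the code in this PR and does not use the driver's API: the class `HealthSketch`, the map-based ring model and the `requiredReplicas` parameter are hypothetical stand-ins. A real check would derive the required replica count from the configured keyspace, local datacenter and consistency level.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; names and data model are assumptions.
public class HealthSketch {

    // Counts token ranges that have fewer live replicas than required.
    // ring: token range name -> names of the nodes replicating that range.
    // nodeUp: node name -> true if the node is currently considered up.
    static long unavailableRanges(
            Map<String, Set<String>> ring,
            Map<String, Boolean> nodeUp,
            int requiredReplicas) {
        return ring.values().stream()
                .filter(replicas -> replicas.stream()
                        .filter(node -> nodeUp.getOrDefault(node, false))
                        .count() < requiredReplicas)
                .count();
    }

    // DOWN if no node is up, or if at least one token range is unavailable.
    static String status(
            Map<String, Set<String>> ring,
            Map<String, Boolean> nodeUp,
            int requiredReplicas) {
        boolean entireClusterDown =
                nodeUp.values().stream().noneMatch(Boolean::booleanValue);
        boolean hasUnavailableRanges =
                unavailableRanges(ring, nodeUp, requiredReplicas) > 0;
        return (entireClusterDown || hasUnavailableRanges) ? "DOWN" : "UP";
    }

    public static void main(String[] args) {
        // Three nodes, every range replicated on all of them;
        // two live replicas are assumed to be required (a quorum of three).
        Map<String, Set<String>> ring = Map.of(
                "range-1", Set.of("n1", "n2", "n3"),
                "range-2", Set.of("n1", "n2", "n3"));

        Map<String, Boolean> oneDown = Map.of("n1", false, "n2", true, "n3", true);
        Map<String, Boolean> twoDown = Map.of("n1", false, "n2", false, "n3", true);

        System.out.println(status(ring, oneDown, 2));
        System.out.println(status(ring, twoDown, 2));
    }
}
```

With one node down, every range still has two live replicas, so the sketch reports UP, whereas the old single-node probe could have reported DOWN had it picked `n1`. With two nodes down, no range can satisfy the assumed requirement and the status becomes DOWN.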

Result:

The health report now accurately describes the cluster health
by setting its status to DOWN when queries are likely to
fail due to UnavailableException errors.

@adutra adutra force-pushed the java2731 branch 10 times, most recently from f464c0a to bb556e2 Compare April 18, 2020 16:48

adutra commented Apr 18, 2020

The PR has been retrofitted to rely on JAVA-2742 (health diagnostics in the driver itself).


adutra commented Apr 18, 2020

PR for JAVA-2742 can be found here: apache/cassandra-java-driver#1435.

@adutra adutra force-pushed the java2731 branch 2 times, most recently from 0455b95 to cfc5d04 Compare April 19, 2020 13:26
adutra added 2 commits April 19, 2020 15:29
Motivation:

The current Cassandra health check simply attempts to query one
node in the cluster, randomly chosen, and sets the status to
DOWN if that node cannot be queried.

This is admittedly misleading, since Cassandra clusters are usually
resilient to the loss of many nodes, and can still continue to
serve application queries even if many nodes are down.

Modification:

This commit replaces the current health check with the DataStax
driver's SessionDiagnostics feature, which performs the following
checks:

- Topology: reports how many nodes were found in each datacenter
and globally, and how many were up or down.
- Token range availability: inspects the ring and detects if
certain token ranges are unavailable for the configured
keyspace, local datacenter and consistency level.

Note that these checks do not require querying the cluster; they
rely solely on metadata exposed by the driver and on Gossip events,
and thus execute very fast.

The status is set to DOWN if:

1. The entire cluster is down; or
2. There are unavailable token ranges.

Result:

The health report now accurately describes the cluster health
by setting its status to DOWN when queries are likely to
fail due to UnavailableException errors.