Enhanced Cassandra health checks #1

adutra · 2020-04-17T09:24:48Z

Motivation:

The current Cassandra health check simply attempts to query one
node in the cluster, randomly chosen, and sets the status to
DOWN if that node cannot be queried.

This is misleading, since Cassandra clusters are usually
resilient to the loss of many nodes and can continue to
serve application queries even when many nodes are down.

Modification:

This commit replaces the current health check with the following
checks:

- Topology: reports how many nodes were found in each datacenter
  and globally, and how many were up or down.
- Token range availability: inspects the ring and detects if
  certain token ranges are unavailable for the configured
  keyspace, local datacenter and consistency level.

Note that these checks do not require querying the cluster; they
rely solely on metadata exposed by the driver and on Gossip events,
and thus execute very fast.
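
To illustrate the token range availability check, here is a minimal, self-contained sketch. It does not use the driver API: the class `TokenRangeAvailabilitySketch`, its methods, and the range labels are all hypothetical, and the quorum formula is the standard `floor(RF / 2) + 1`. It assumes the up/down state of each range's local-datacenter replicas is already known from metadata, and flags every range with fewer live replicas than the consistency level requires.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the driver's implementation.
class TokenRangeAvailabilitySketch {

    /** Replicas needed for a quorum-style consistency level: floor(rf / 2) + 1. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /**
     * Returns the labels of ranges whose live replica count is below `required`.
     * Keys are made-up range labels; each value holds the up/down flags of that
     * range's replicas in the local datacenter.
     */
    static List<String> unavailableRanges(
            Map<String, List<Boolean>> replicasUpByRange, int required) {
        List<String> unavailable = new ArrayList<>();
        for (Map.Entry<String, List<Boolean>> entry : replicasUpByRange.entrySet()) {
            int live = 0;
            for (boolean up : entry.getValue()) {
                if (up) {
                    live++;
                }
            }
            if (live < required) {
                unavailable.add(entry.getKey());
            }
        }
        return unavailable;
    }

    public static void main(String[] args) {
        Map<String, List<Boolean>> ring = new LinkedHashMap<>();
        ring.put("(0,100]", Arrays.asList(true, true, false));    // 2 of 3 replicas up
        ring.put("(100,200]", Arrays.asList(true, false, false)); // 1 of 3 replicas up
        // With RF = 3 a quorum needs 2 live replicas, so only the second range fails.
        System.out.println(unavailableRanges(ring, quorum(3))); // prints [(100,200]]
    }
}
```

No query is sent anywhere in this sketch, which mirrors why the checks are cheap: the decision is a pure function of already-known replica state.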

The status is set to DOWN if:

1. The entire cluster is down; or
2. There are unavailable token ranges.
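
The topology report and the DOWN rule can be sketched as follows. This is an assumption-laden illustration, not the code in this PR: `ClusterHealthSketch`, `NodeInfo`, `topology` and `status` are hypothetical names, and `NodeInfo` merely stands in for the node metadata the driver exposes.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; names are not taken from the driver or this PR.
class ClusterHealthSketch {

    /** Stand-in for per-node metadata: datacenter name and up/down state. */
    static final class NodeInfo {
        final String datacenter;
        final boolean up;

        NodeInfo(String datacenter, boolean up) {
            this.datacenter = datacenter;
            this.up = up;
        }
    }

    /** Per-datacenter counts as int[]{total, up, down}. */
    static Map<String, int[]> topology(List<NodeInfo> nodes) {
        Map<String, int[]> counts = new TreeMap<>();
        for (NodeInfo node : nodes) {
            int[] c = counts.computeIfAbsent(node.datacenter, dc -> new int[3]);
            c[0]++;
            if (node.up) {
                c[1]++;
            } else {
                c[2]++;
            }
        }
        return counts;
    }

    /**
     * DOWN if no node is up (an empty node list also counts as down),
     * or if at least one token range is unavailable; UP otherwise.
     */
    static String status(List<NodeInfo> nodes, int unavailableRangeCount) {
        boolean anyUp = false;
        for (NodeInfo node : nodes) {
            anyUp |= node.up;
        }
        return (!anyUp || unavailableRangeCount > 0) ? "DOWN" : "UP";
    }
}
```

Under this rule a cluster with some nodes down still reports UP as long as every token range remains available, which is the behavior the old single-node check could not express.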

Result:

The health report now accurately describes the cluster health
by setting its status to DOWN when queries are likely to
fail due to UnavailableException errors.

adutra · 2020-04-18T16:49:45Z

The PR has been retrofitted to rely on JAVA-2742 (health diagnostics in the driver itself).

adutra · 2020-04-18T16:55:01Z

PR for JAVA-2742 can be found here: apache/cassandra-java-driver#1435.

Motivation:

The current Cassandra health check simply attempts to query one
node in the cluster, randomly chosen, and sets the status to
DOWN if that node cannot be queried.

This is admittedly misleading, since Cassandra clusters are usually
resilient to the loss of many nodes, and can still continue to
serve application queries even if many nodes are down.

Modification:

This commit replaces the current health check with the DataStax
driver's SessionDiagnostics feature, which performs the following
checks:

- Topology: reports how many nodes were found in each datacenter
  and globally, and how many were up or down.
- Token range availability: inspects the ring and detects if
  certain token ranges are unavailable for the configured
  keyspace, local datacenter and consistency level.

Note that these checks do not require querying the cluster; they
rely solely on metadata exposed by the driver and on Gossip events,
and thus execute very fast.

The status is set to DOWN if:

1. The entire cluster is down; or
2. There are unavailable token ranges.

Result:

The health report now accurately describes the cluster health
by setting its status to DOWN when queries are likely to
fail due to UnavailableException errors.

adutra force-pushed the java2731 branch 10 times, most recently from f464c0a to bb556e2 Compare April 18, 2020 16:48

adutra force-pushed the java2731 branch 2 times, most recently from 0455b95 to cfc5d04 Compare April 19, 2020 13:26

adutra added 2 commits April 19, 2020 15:29

[temp] use driver 4.6.0-alpha3

50518e7

adutra force-pushed the java2731 branch from cfc5d04 to 39270db Compare April 19, 2020 13:30
