Improve PostgreSQL replication lag detection #395

Merged

Conversation

@jsosic jsosic (Contributor) commented May 8, 2020

In some cases the master can report a pg_last_xact_replay_timestamp() value from the past, which causes the exporter to show an ever-growing value for the lag. This fixes the issue.

I got the idea for the fix from:

Example (from master):

[local]:5432 postgres@postgres=# SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) as lag;
+----------------+
|      lag       |
+----------------+
| 3567749.711759 |
+----------------+
(1 row)

Same master, fixed query:

[local]:5432 postgres@postgres=# SELECT CASE WHEN NOT pg_is_in_recovery() THEN 0 ELSE GREATEST (0, EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))) END AS lag;
+-----+
| lag |
+-----+
|   0 |
+-----+
(1 row)

Master recovery status:

[local]:5432 postgres@postgres=# SELECT pg_is_in_recovery();
+-------------------+
| pg_is_in_recovery |
+-------------------+
| f                 |
+-------------------+
(1 row)

Streaming slave:

[local]:5432 postgres@postgres=# SELECT pg_is_in_recovery();
+-------------------+
| pg_is_in_recovery |
+-------------------+
| t                 |
+-------------------+
(1 row)

[local]:5432 postgres@postgres=# SELECT CASE WHEN NOT pg_is_in_recovery() THEN 0 ELSE GREATEST (0, EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))) END AS lag;
+---------+
|   lag   |
+---------+
| 0.31566 |
+---------+
(1 row)

Paused slave:

[local]:5432 postgres@postgres=# SELECT CASE WHEN NOT pg_is_in_recovery() THEN 0 ELSE GREATEST (0, EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))) END AS lag;
+--------------+
|     lag      |
+--------------+
| 28053.150006 |
+--------------+
(1 row)


[local]:5432 postgres@postgres=# SELECT pg_is_in_recovery();
+-------------------+
| pg_is_in_recovery |
+-------------------+
| t                 |
+-------------------+
(1 row)

From the documentation:

True if recovery is still in progress.

In some cases the master can report a pg_last_xact_replay_timestamp() value from the
past, which can cause the exporter to show an ever-growing value for the lag.

By checking whether the instance is in recovery, we avoid reporting a
huge number for the master instance.
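
For exporter builds that predate this change, the same guarded query can be supplied as a user-defined metric. Below is a minimal sketch, assuming the exporter's standard queries.yaml layout (query, metrics, usage, description keys); the metric name and description are illustrative:

# Illustrative user-defined metric; assumes the standard queries.yaml format.
pg_replication:
  query: "SELECT CASE WHEN NOT pg_is_in_recovery() THEN 0 ELSE GREATEST(0, EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))) END AS lag"
  metrics:
    - lag:
        usage: "GAUGE"
        description: "Replication lag behind master in seconds"

With this definition a primary always reports 0 and a streaming replica reports its actual delay, matching the psql output shown above.
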
@coveralls

Coverage Status

Coverage remained the same at 64.7% when pulling dc87b6e on jsosic:fix-psql-replication-lag into e2df41f on wrouesnel:master.

@wrouesnel wrouesnel merged commit f188bde into prometheus-community:master Dec 24, 2020
SuperQ added a commit that referenced this pull request Feb 26, 2021
* Add CHANGELOG from existing tags.

Now released under Prometheus Community

* [CHANGE] Update build to use standard Prometheus promu/Dockerfile
* [ENHANCEMENT] Remove duplicate column in queries.yml #433
* [ENHANCEMENT] Add query for 'pg_replication_slots' #465
* [ENHANCEMENT] Allow a custom prefix for metric namespace #387
* [ENHANCEMENT] Improve PostgreSQL replication lag detection #395
* [ENHANCEMENT] Support connstring syntax when discovering databases #473
* [ENHANCEMENT] Detect SIReadLock locks in the pg_locks metric #421
* [BUGFIX] Fix pg_database_size_bytes metric in queries.yaml #357
* [BUGFIX] Don't ignore errors in parseUserQueries #362
* [BUGFIX] Fix queries.yaml for AWS RDS #370
* [BUGFIX] Recover when connection cannot be established at startup #415
* [BUGFIX] Don't retry if an error occurs #426
* [BUGFIX] Do not panic on incorrect env #457

Signed-off-by: Ben Kochie <[email protected]>
@SuperQ SuperQ mentioned this pull request Feb 26, 2021
SuperQ added a commit that referenced this pull request Mar 1, 2021
angaz pushed a commit to angaz/postgres_exporter that referenced this pull request Mar 3, 2022
ritbl pushed a commit to heniek/postgres_exporter that referenced this pull request Mar 19, 2023