Improve PostgreSQL replication lag detection #395

Merged

Conversation

@jsosic jsosic (Contributor) commented May 8, 2020

In some cases the master can report a pg_last_xact_replay_timestamp() value from the past, which causes the exporter to show an ever-growing value for the lag. This fixes the issue.

I got the idea for the fix from:

Example (from master):

[local]:5432 postgres@postgres=# SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) as lag;
+----------------+
|      lag       |
+----------------+
| 3567749.711759 |
+----------------+
(1 row)

Same master, fixed query:

[local]:5432 postgres@postgres=# SELECT CASE WHEN NOT pg_is_in_recovery() THEN 0 ELSE GREATEST (0, EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))) END AS lag;
+-----+
| lag |
+-----+
|   0 |
+-----+
(1 row)

Master recovery status:

[local]:5432 postgres@postgres=# SELECT pg_is_in_recovery();
+-------------------+
| pg_is_in_recovery |
+-------------------+
| f                 |
+-------------------+
(1 row)

Streaming slave:

[local]:5432 postgres@postgres=# SELECT pg_is_in_recovery();
+-------------------+
| pg_is_in_recovery |
+-------------------+
| t                 |
+-------------------+
(1 row)

[local]:5432 postgres@postgres=# SELECT CASE WHEN NOT pg_is_in_recovery() THEN 0 ELSE GREATEST (0, EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))) END AS lag;
+---------+
|   lag   |
+---------+
| 0.31566 |
+---------+
(1 row)

Paused slave:

[local]:5432 postgres@postgres=# SELECT CASE WHEN NOT pg_is_in_recovery() THEN 0 ELSE GREATEST (0, EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))) END AS lag;
+--------------+
|     lag      |
+--------------+
| 28053.150006 |
+--------------+
(1 row)


[local]:5432 postgres@postgres=# SELECT pg_is_in_recovery();
+-------------------+
| pg_is_in_recovery |
+-------------------+
| t                 |
+-------------------+
(1 row)

From the documentation:

True if recovery is still in progress.

In some cases the master can report a pg_last_xact_replay_timestamp() value from the
past, which can cause the exporter to show an ever-growing value for the lag.

By checking whether the instance is in recovery, we avoid reporting a
huge number for the master instance.
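
For exporter builds that predate this change, the same guarded query can be supplied as a user-defined metric. Below is a minimal sketch, assuming the exporter's standard queries.yaml layout (query, metrics, usage, description keys); the metric name and description are illustrative:

# Illustrative user-defined metric; assumes the standard queries.yaml format.
pg_replication:
  query: "SELECT CASE WHEN NOT pg_is_in_recovery() THEN 0 ELSE GREATEST(0, EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))) END AS lag"
  metrics:
    - lag:
        usage: "GAUGE"
        description: "Replication lag behind master in seconds"

With this definition a primary always reports 0 and a streaming replica reports its actual delay, matching the psql output shown above.
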
@coveralls

Coverage Status

Coverage remained the same at 64.7% when pulling dc87b6e on jsosic:fix-psql-replication-lag into e2df41f on wrouesnel:master.

@wrouesnel wrouesnel merged commit f188bde into prometheus-community:master Dec 24, 2020
SuperQ added a commit that referenced this pull request Feb 26, 2021
* Add CHANGELOG from existing tags.

Now released under Prometheus Community

* [CHANGE] Update build to use standard Prometheus promu/Dockerfile
* [ENHANCEMENT] Remove duplicate column in queries.yml #433
* [ENHANCEMENT] Add query for 'pg_replication_slots' #465
* [ENHANCEMENT] Allow a custom prefix for metric namespace #387
* [ENHANCEMENT] Improve PostgreSQL replication lag detection #395
* [ENHANCEMENT] Support connstring syntax when discovering databases #473
* [ENHANCEMENT] Detect SIReadLock locks in the pg_locks metric #421
* [BUGFIX] Fix pg_database_size_bytes metric in queries.yaml #357
* [BUGFIX] Don't ignore errors in parseUserQueries #362
* [BUGFIX] Fix queries.yaml for AWS RDS #370
* [BUGFIX] Recover when connection cannot be established at startup #415
* [BUGFIX] Don't retry if an error occurs #426
* [BUGFIX] Do not panic on incorrect env #457

Signed-off-by: Ben Kochie <[email protected]>
@SuperQ SuperQ mentioned this pull request Feb 26, 2021
SuperQ added a commit that referenced this pull request Mar 1, 2021
angaz pushed a commit to angaz/postgres_exporter that referenced this pull request Mar 3, 2022
ritbl pushed a commit to heniek/postgres_exporter that referenced this pull request Mar 19, 2023