Commit c843528

Add playbook: update_pgcluster (vitabaks#281)
1 parent 446f6de commit c843528

14 files changed: +873 -1 lines changed

README.md

+23-1
@@ -44,6 +44,8 @@ In addition to deploying new clusters, this playbook also support the deployment
 - [Create cluster with WAL-G:](#create-cluster-with-wal-g)
 - [Point-In-Time-Recovery:](#point-in-time-recovery)
 - [Maintenance](#maintenance)
+- [Update the PostgreSQL HA Cluster](#update-the-postgresql-ha-cluster)
+- [Using Git for cluster configuration management](#using-git-for-cluster-configuration-management-iacgitops)
 - [Disaster Recovery](#disaster-recovery)
 - [etcd](#etcd)
 - [PostgreSQL (databases)](#postgresql-databases)
@@ -460,7 +462,27 @@ I recommend that you study the following materials for further maintenance of th
 - [Patroni documentation](https://patroni.readthedocs.io/en/latest/)
 - [etcd operations guide](https://etcd.io/docs/v3.3.12/op-guide/)

-## Using Git for cluster configuration management (IaC/GitOps)
+#### Update the PostgreSQL HA Cluster
+
+The `update_pgcluster.yml` playbook is designed to update the PostgreSQL HA Cluster to a new minor version (for example, 15.1 -> 15.2).
+
+Usage:
+
+- Update PostgreSQL:
+
+`ansible-playbook update_pgcluster.yml`
+
+- Update Patroni:
+
+`ansible-playbook update_pgcluster.yml -e target=patroni`
+
+- Update all system packages:
+
+`ansible-playbook update_pgcluster.yml -e target=system`
+
+More details [here](roles/update)
+
+#### Using Git for cluster configuration management (IaC/GitOps)

 Infrastructure as Code (IaC) is the managing and provisioning of infrastructure through code instead of through manual processes. \
 GitOps automates infrastructure updates using a Git workflow with continuous integration (CI) and continuous delivery (CI/CD). When new code is merged, the CI/CD pipeline enacts the change in the environment. Any configuration drift, such as manual changes or errors, is overwritten by GitOps automation so the environment converges on the desired state defined in Git.
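
The `-e target=...` overrides in the usage lines of this hunk can also be kept in a vars file, which fits the Git-based workflow described in the same README section. The fragment below is a minimal sketch, not part of the commit: the file name `update_vars.yml` is hypothetical, and the variable names and default values are the ones documented in `roles/update/README.md`.

```yaml
# update_vars.yml (hypothetical file name, not part of this commit)
# Values other than `target` are the defaults documented in roles/update/README.md.
target: patroni                      # one of: postgres, patroni, system
max_replication_lag_bytes: 10485760  # 10 MiB
max_transaction_sec: 15
update_extensions: true
```

It would be passed as `ansible-playbook update_pgcluster.yml -e @update_vars.yml`. One reason to prefer a file: values given as `-e key=value` reach Ansible as strings, while a YAML file keeps booleans and integers typed.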

roles/update/README.md

+129
@@ -0,0 +1,129 @@
+## Update the PostgreSQL HA Cluster
+
+This role is designed to update the PostgreSQL HA cluster to a new minor version (for example, 15.1 -> 15.2).
+
+By default, only the PostgreSQL packages defined in the `postgresql_packages` variable (vars/Debian.yml or vars/RedHat.yml) are updated. In addition, you can update Patroni or the entire system.
+
+#### Usage
+
+Update PostgreSQL:
+
+`ansible-playbook update_pgcluster.yml`
+
+Update Patroni:
+
+`ansible-playbook update_pgcluster.yml -e target=patroni`
+
+Update all system packages:
+
+`ansible-playbook update_pgcluster.yml -e target=system`
+
+
+#### Variables
+
+- `target`
+  - Defines the target for the update.
+  - Available values: 'postgres', 'patroni', 'system'
+  - Default value: postgres
+- `max_replication_lag_bytes`
+  - Determines the replication lag, in bytes, above which the update will not be performed.
+  - If the lag is high, you will be prompted to try again later.
+  - Default value: 10485760 (10 MiB)
+- `max_transaction_sec`
+  - Determines the maximum transaction duration; if any transaction has been running longer than this, the update will not be performed.
+  - If long-running transactions are present, you will be prompted to try again later.
+  - Default value: 15 (seconds)
+- `update_extensions`
+  - If 'true', an attempt will be made to automatically update all extensions for all databases.
+  - Specify 'false' to avoid updating extensions.
+  - Default value: true
+---
+
+## Plan:
+
+Note on the expected database downtime during the update:
+
+When using load balancing for read-only traffic (the "Type A" and "Type C" schemes), zero downtime is expected (for read traffic), provided there is more than one replica in the cluster. For write traffic (to the Primary), the expected downtime is ~5-10 seconds.
+
+#### 1. PRE-UPDATE: Perform Pre-Checks
+- Test PostgreSQL DB Access
+- Make sure that physical replication is active
+  - Stop, if there are no active replicas
+- Make sure there is no high replication lag
+  - Note: no more than `max_replication_lag_bytes`
+  - Stop, if replication lag is high
+- Make sure there are no long-running transactions
+  - Note: no more than `max_transaction_sec`
+  - Stop, if long-running transactions are detected
+#### 2. UPDATE: Secondary (one by one)
+- Stop read-only traffic
+  - Enable `noloadbalance`, `nosync`, `nofailover` parameters in the patroni.yml
+  - Reload patroni service
+  - Make sure replica endpoint is unavailable
+  - Wait for active transactions to complete
+- Stop Services
+  - Execute CHECKPOINT before stopping PostgreSQL
+  - Stop Patroni service on the Cluster Replica
+- Update PostgreSQL
+  - if `target` variable is not defined or `target=postgres`
+  - Install the latest version of PostgreSQL packages
+- Update Patroni
+  - if `target=patroni` (or `system`)
+  - Install the latest version of Patroni package
+- Update all system packages (includes PostgreSQL and Patroni)
+  - if `target=system`
+  - Update all system packages
+- Start Services
+  - Start Patroni service
+  - Wait for Patroni port to become open on the host
+  - Check that Patroni is healthy
+  - Check PostgreSQL is started and accepting connections
+- Start read-only traffic
+  - Disable `noloadbalance`, `nosync`, `nofailover` parameters in the patroni.yml
+  - Reload patroni service
+  - Make sure replica endpoint is available
+- Perform the same steps for the next replica server.
+#### 3. UPDATE: Primary
+- Switchover Patroni leader role
+  - Perform switchover of the leader for the Patroni cluster
+  - Make sure that Patroni is healthy and is a replica
+  - Notes:
+    - At this stage, the leader becomes a replica
+    - The database downtime is ~5 seconds (write traffic)
+- Stop read-only traffic
+  - Enable `noloadbalance`, `nosync`, `nofailover` parameters in the patroni.yml
+  - Reload patroni service
+  - Make sure replica endpoint is unavailable
+  - Wait for active transactions to complete
+- Stop Services
+  - Execute CHECKPOINT before stopping PostgreSQL
+  - Stop Patroni service on the old Cluster Leader
+- Update PostgreSQL
+  - if `target` variable is not defined or `target=postgres`
+  - Install the latest version of PostgreSQL packages
+- Update Patroni
+  - if `target=patroni` (or `system`)
+  - Install the latest version of Patroni package
+- Update all system packages (includes PostgreSQL and Patroni)
+  - if `target=system`
+  - Update all system packages
+- Start Services
+  - Start Patroni service
+  - Wait for Patroni port to become open on the host
+  - Check that Patroni is healthy
+  - Check PostgreSQL is started and accepting connections
+- Start read-only traffic
+  - Disable `noloadbalance`, `nosync`, `nofailover` parameters in the patroni.yml
+  - Reload patroni service
+  - Make sure replica endpoint is available
+#### 4. POST-UPDATE: Update extensions
+- Update extensions
+  - Get the current Patroni Cluster Leader Node
+  - Get a list of databases
+  - Update extensions in each database
+    - Get a list of old PostgreSQL extensions
+    - Update old PostgreSQL extensions (if an update is required)
+- Check the Patroni cluster state
+- Check the current PostgreSQL version
+- List the Patroni cluster members
+- Update completed.
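
The "Stop read-only traffic" steps in the plan rely on three Patroni tags. As a rough illustration only, the `tags` section of `patroni.yml` on a node being updated might look like the fragment below; the exact way the role edits the file is not shown in this diff, so treat the layout as an assumption.

```yaml
# Sketch of the tags section in patroni.yml while a node is being updated.
# How the role toggles these values is an assumption; only the tag names
# come from the plan above.
tags:
  noloadbalance: true   # Patroni's /replica health check stops returning 200
  nosync: true          # node is not chosen as a synchronous replica
  nofailover: true      # node is excluded from leader election
```

After the packages are updated and Patroni is healthy again, the plan sets the three tags back to `false` and reloads the service so the replica endpoint becomes available again.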

roles/update/tasks/extensions.yml

+23
@@ -0,0 +1,23 @@
+---
+- name: 'Get the current Patroni Cluster Leader Node'
+  uri:
+    url: http://{{ inventory_hostname }}:{{ patroni_restapi_port }}/leader
+    status_code: 200
+  register: patroni_leader_result
+  changed_when: false
+  failed_when: false
+
+- name: Get a list of databases
+  command: psql -tAXc "select datname from pg_catalog.pg_database where not datistemplate"
+  register: databases_list
+  changed_when: false
+  when:
+    - patroni_leader_result.status == 200
+
+- name: Update extensions in each database
+  include_tasks: update_extensions.yml
+  loop: "{{ databases_list.stdout_lines }}"
+  loop_control:
+    loop_var: pg_target_dbname
+  when: databases_list.stdout_lines is defined
+...
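
The included `update_extensions.yml` is not shown in this part of the diff. As a hedged sketch of what a per-database step along these lines could look like (not the commit's actual file), the tasks below find extensions whose installed version differs from the default available version and run `ALTER EXTENSION ... UPDATE` for each. The register name `pg_old_extensions` is hypothetical; `pg_target_dbname` is the loop variable defined above.

```yaml
# Hypothetical sketch, not the contents of update_extensions.yml in this commit.
- name: Get a list of extensions with a newer version available
  command: >-
    psql -d {{ pg_target_dbname }} -tAXc "select name
    from pg_catalog.pg_available_extensions
    where installed_version is not null
    and installed_version <> default_version"
  register: pg_old_extensions  # hypothetical variable name
  changed_when: false

- name: Update each outdated extension
  command: >-
    psql -d {{ pg_target_dbname }} -tAXc
    "alter extension \"{{ item }}\" update"
  loop: "{{ pg_old_extensions.stdout_lines }}"
```

The extension name is double-quoted in SQL so that names containing a hyphen, such as `uuid-ossp`, are accepted.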

roles/update/tasks/patroni.yml

+73
@@ -0,0 +1,73 @@
+---
+# patroni_installation_method: "pip"
+- block:
+    - name: Install the latest version of Patroni
+      pip:
+        name: patroni
+        state: latest
+        executable: pip3
+        extra_args: "--trusted-host=pypi.python.org --trusted-host=pypi.org --trusted-host=files.pythonhosted.org"
+        umask: "0022"
+      environment:
+        PATH: "{{ ansible_env.PATH }}:/usr/local/bin:/usr/bin"
+  when: installation_method == "repo" and patroni_installation_method == "pip"
+  environment: "{{ proxy_env | default({}) }}"
+  vars:
+    ansible_python_interpreter: /usr/bin/python3
+
+# patroni_installation_method: "rpm/deb"
+- block:
+    # Debian
+    - name: Install the latest version of Patroni packages
+      package:
+        name: "{{ patroni_packages | default('patroni') }}"
+        state: latest
+      when: ansible_os_family == "Debian" and patroni_deb_package_repo | length < 1
+
+    # RedHat
+    - name: Install the latest version of Patroni packages
+      package:
+        name: "{{ patroni_packages | default('patroni') }}"
+        state: latest
+      when: ansible_os_family == "RedHat" and patroni_rpm_package_repo | length < 1
+
+    # when patroni_deb_package_repo or patroni_rpm_package_repo URL is defined
+    # Debian
+    - name: Download Patroni deb package
+      get_url:
+        url: "{{ item }}"
+        dest: /tmp/
+        timeout: 60
+        validate_certs: false
+      loop: "{{ patroni_deb_package_repo | list }}"
+      when: ansible_os_family == "Debian" and patroni_deb_package_repo | length > 0
+
+    - name: Install Patroni from deb package
+      apt:
+        force_apt_get: true
+        deb: "/tmp/{{ item }}"
+        state: present
+      loop: "{{ patroni_deb_package_repo | map('basename') | list }}"
+      when: ansible_os_family == "Debian" and patroni_deb_package_repo | length > 0
+
+    # RedHat
+    - name: Download Patroni rpm package
+      get_url:
+        url: "{{ item }}"
+        dest: /tmp/
+        timeout: 60
+        validate_certs: false
+      loop: "{{ patroni_rpm_package_repo | list }}"
+      when: ansible_os_family == "RedHat" and patroni_rpm_package_repo | length > 0
+
+    - name: Install Patroni from rpm package
+      package:
+        name: "/tmp/{{ item }}"
+        state: present
+      loop: "{{ patroni_rpm_package_repo | map('basename') | list }}"
+      when: ansible_os_family == "RedHat" and patroni_rpm_package_repo | length > 0
+  environment: "{{ proxy_env | default({}) }}"
+  when:
+    - installation_method == "repo"
+    - (patroni_installation_method == "rpm" or patroni_installation_method == "deb")
+...

roles/update/tasks/postgres.yml

+25
@@ -0,0 +1,25 @@
+---
+- name: Clean yum cache
+  command: yum clean all
+  when:
+    - ansible_os_family == "RedHat"
+    - ansible_distribution_major_version == '7'
+
+- name: Clean dnf cache
+  command: dnf clean all
+  when:
+    - ansible_os_family == "RedHat"
+    - ansible_distribution_major_version is version('8', '>=')
+
+- name: Update apt cache
+  apt:
+    update_cache: true
+    cache_valid_time: 3600
+  when: ansible_os_family == "Debian"
+
+- name: Install the latest version of PostgreSQL packages
+  package:
+    name: "{{ item }}"
+    state: latest
+  loop: "{{ postgresql_packages }}"
+...
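
The last task loops over `postgresql_packages`, which the role README says is defined in vars/Debian.yml or vars/RedHat.yml. Those files are not part of this diff, so the fragment below is only an illustration of the expected shape: a list of package names pinned to one major version. The `postgresql_version` variable and the concrete package names are assumptions for a Debian-family host.

```yaml
# Hypothetical vars fragment for a Debian-family host; the real list lives in
# the repository's vars/Debian.yml and may differ.
postgresql_version: "15"  # assumed variable name
postgresql_packages:
  - "postgresql-{{ postgresql_version }}"
  - "postgresql-client-{{ postgresql_version }}"
```

Because the package names carry the major version, `state: latest` moves the node to the newest minor release available in the configured repository and does not jump to a new major version.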

roles/update/tasks/pre_checks.yml

+81
@@ -0,0 +1,81 @@
+---
+- name: '[Pre-Check] (ALL) Test PostgreSQL DB Access'
+  command: psql -tAXc 'select 1'
+  changed_when: false
+
+- name: '[Pre-Check] Make sure that physical replication is active'
+  command: >-
+    psql -tAXc "select count(*) from pg_stat_replication
+    where application_name != 'pg_basebackup'"
+  register: pg_replication_state
+  changed_when: false
+  when:
+    - inventory_hostname in groups['primary']
+
+# Stop, if there are no active replicas
+- name: "Pre-Check error. Print physical replication state"
+  fail:
+    msg: "There are no active replica servers (pg_stat_replication returned 0 entries)."
+  when:
+    - inventory_hostname in groups['primary']
+    - pg_replication_state.stdout | int == 0
+
+- name: '[Pre-Check] Make sure there is no high replication lag (more than {{ max_replication_lag_bytes | human_readable }})'
+  command: >-
+    psql -tAXc "select pg_wal_lsn_diff(pg_current_wal_lsn(),
+    replay_lsn) pg_lag_bytes from pg_stat_replication
+    order by pg_lag_bytes desc limit 1"
+  register: pg_lag_bytes
+  changed_when: false
+  failed_when: false
+  until: pg_lag_bytes.stdout|int < max_replication_lag_bytes|int
+  retries: 30
+  delay: 5
+  when:
+    - inventory_hostname in groups['primary']
+
+# Stop, if replication lag is high
+- block:
+    - name: "Print replication lag"
+      debug:
+        msg: "Current replication lag:
+          {{ pg_lag_bytes.stdout | int | human_readable }}"
+
+    - name: "Pre-Check error. Please try again later"
+      fail:
+        msg: High replication lag on the Patroni Cluster, please try again later.
+  when:
+    - pg_lag_bytes.stdout is defined
+    - pg_lag_bytes.stdout|int >= max_replication_lag_bytes|int
+
+- name: '[Pre-Check] Make sure there are no long-running transactions (more than {{ max_transaction_sec }} seconds)'
+  command: >-
+    psql -tAXc "select pid, usename, client_addr, clock_timestamp() - xact_start as xact_age,
+    state, wait_event_type ||':'|| wait_event as wait_events,
+    left(regexp_replace(query, E'[ \\t\\n\\r]+', ' ', 'g'),100) as query
+    from pg_stat_activity
+    where clock_timestamp() - xact_start > '{{ max_transaction_sec }} seconds'::interval
+    and backend_type = 'client backend' and pid <> pg_backend_pid()
+    order by xact_age desc limit 10"
+  register: pg_long_transactions
+  changed_when: false
+  failed_when: false
+  until: pg_long_transactions.stdout | length < 1
+  retries: 30
+  delay: 2
+  when:
+    - inventory_hostname in groups['primary']
+
+# Stop, if long-running transactions detected
+- block:
+    - name: "Print long-running (>{{ max_transaction_sec }}s) transactions"
+      debug:
+        msg: "{{ pg_long_transactions.stdout_lines }}"
+
+    - name: "Pre-Check error. Please try again later"
+      fail:
+        msg: long-running transactions detected (more than {{ max_transaction_sec }} seconds), please try again later.
+  when:
+    - pg_long_transactions.stdout is defined
+    - pg_long_transactions.stdout | length > 0
+...
