Commit 9345ce3

Merge branch 'master' into dev/bugfix
2 parents 1301a59 + 280ef60 commit 9345ce3

3 files changed (+19, -19 lines)

doc/polardb/arch.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
## Architecture and Main Functions

-PolarDB for PostgreSQL implements a share-nothing architecture using PostgreSQL as a main component. Without specific mention, we use PolarDB to represent PolarDB for PostgreSQL. PolarDB is fully compatible to PostgreSQL and supports PostgreSQL's most SQL functions. As a distributed database system, PolarDB achieves same data consistency and ACID as a single-node database system. PolarDB implements Paxos based replication, which offers high availability, data redundancy, and consistency across nodes, during node failure and cluster reconfiguration. Fine-grained sharding and application-transparent shard relocation allow PolarDB to efficiently utilize cloud-wise resource to adapt to varying computation and storage requirements. PolarDB's distributed SQL engine achieves fast processing of complex queries by combining the comprehensiveness of PostgreSQL optimizer and the efficiency of parallel execution among or inside nodes.
+PolarDB for PostgreSQL implements a shared-nothing architecture using PostgreSQL as the main component. Unless otherwise noted, we use PolarDB to refer to PolarDB for PostgreSQL. PolarDB is fully compatible with PostgreSQL and supports most of PostgreSQL's SQL functions. As a distributed database system, PolarDB achieves the same data consistency and ACID guarantees as a standalone database system. PolarDB implements Paxos-based replication, which offers high availability, data redundancy, and consistency across nodes during node failures and cluster reconfiguration. Fine-grained sharding and application-transparent shard relocation allow PolarDB to efficiently utilize cloud resources to adapt to varying computation and storage requirements. PolarDB's distributed SQL engine achieves fast processing of complex queries by combining the comprehensiveness of the PostgreSQL optimizer with the efficiency of parallel execution across and within nodes.

Overall, PolarDB provides scalable SQL computation and fully-ACID relational data storage on commodity hardware or standard cloud resources, such as ECS and block storage service.

@@ -9,7 +9,7 @@ Overall, PolarDB provides scalable SQL computation and fully-ACID relational dat
<img src="sharding_plug_in.png" alt="Sharding and Plug-in" width="250"/>

PolarDB cluster is formed of three main components: database node (DN), cluster manager (CM), and
-transaction & time service (TM). They are different processes and can be deployed in multiple servers or ECS. DNs are main database engines, which receive SQL requests from clients or load balancer and process them. Each DN can act as a coordinator to distribute queries and coordinate transactions across multiple DNs. Each DN is responsible for part of data stored in a database. Any operations, including read and write, to those data, are processed by their corresponding DN. The data in one DN is further partitioned into shards. Those shards can be relocated to other DNs when PolarDB scales out or re-balances workload. DNs have their replicas storing same data through Paxos based replication. DNs and their replicas form a replication group. In such group, a primary DN handles all write requests and propogate their results to replica DNs, or called follower DNs. Our Paxos replication also supports logger DNs, which only stores log records rather than data.
+transaction & time service (TM). They are separate processes and can be deployed on multiple servers or ECS instances. DNs are the main database engines; they receive SQL requests from clients or a load balancer and process them. Each DN can act as a coordinator to distribute queries and coordinate transactions across many DNs. Each DN handles part of the data stored in a database. Any operations on those data, including reads and writes, are processed by their corresponding DN. The data in one DN is further partitioned into shards. Those shards can be relocated to other DNs when PolarDB scales out or rebalances the workload. DNs have replicas storing the same data through Paxos-based replication. DNs and their replicas form a replication group. In such a group, a primary DN handles all write requests and propagates their results to replica DNs (also called follower DNs). Our Paxos replication also supports logger DNs, which store only log records rather than data.

TM is a centralized service to support globally consistent and in-order status or counters. Ascending timestamp and global transaction ID are two examples.
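
To make the sharding and shard-relocation model above concrete, here is a minimal, hypothetical C sketch (not taken from the PolarDB source): a distribution key hashes to a shard, a shard map records which DN currently owns each shard, and relocation only updates that map. All names (`ShardMap`, `route_to_dn`, the shard count) are invented for illustration.

```c
/*
 * Hypothetical sketch (not PolarDB source): fine-grained sharding where a key
 * hashes to a shard and a shard map records the DN that currently owns each
 * shard, so shards can be relocated without changing the hash function.
 */
#include <stdint.h>
#include <stdio.h>

#define NUM_SHARDS 64              /* assumed shard count for the example */

typedef struct ShardMap {
    int owner_dn[NUM_SHARDS];      /* shard id -> owning DN id */
} ShardMap;

/* Map a distribution key to a shard; the hash stays stable across rebalancing. */
static int key_to_shard(uint64_t key) {
    return (int)(key % NUM_SHARDS);
}

/* Look up the DN that currently serves the key's shard. */
static int route_to_dn(const ShardMap *map, uint64_t key) {
    return map->owner_dn[key_to_shard(key)];
}

/* Relocating a shard only updates the map entry; data movement happens elsewhere. */
static void relocate_shard(ShardMap *map, int shard, int new_dn) {
    map->owner_dn[shard] = new_dn;
}

int main(void) {
    ShardMap map;
    for (int s = 0; s < NUM_SHARDS; s++)
        map.owner_dn[s] = s % 3;                 /* spread 64 shards over 3 DNs */

    printf("key 42 -> DN %d\n", route_to_dn(&map, 42));
    relocate_shard(&map, key_to_shard(42), 2);   /* scale-out / rebalance */
    printf("key 42 -> DN %d\n", route_to_dn(&map, 42));
    return 0;
}
```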

doc/polardb/cts.md

Lines changed: 10 additions & 10 deletions
@@ -2,41 +2,41 @@
## HLC, CTS and (Distributed) Snapshot Isolation

PolarDB for PG (or PolarDB for short) implements a multi-core scaling transaction processing system
-and supports distributed transactions (upcoming) by using commit timestamp based MVCC.
-The traditional PostgreSQL uses xid-based snapshot to provide transactional isolation
+and supports distributed transactions (upcoming) by using commit timestamp-based MVCC.
+Traditional PostgreSQL uses an xid-based snapshot to provide transactional isolation
on top of MVCC which can introduce scaling bottleneck on many-core machines.
Specifically, each transaction snapshots the currently running transactions in the Proc-Array at its start
by holding the Proc-Array lock. Meanwhile, the Proc-Array lock is acquired to clear a transaction at its end.
The frequent use of the global Proc-Array lock in the critical path could lead to severe contention.
For read-committed (RC) isolation, each snapshot is generated at the statement level which can further exacerbate the performance.

-To tackle this problem, PolarDB adopts the commit timestamp based MVCC for better
+To tackle this problem, PolarDB adopts commit timestamp-based MVCC for better
scalability. Specifically, each transaction assigns a start timestamp and commit timestamp
for each transaction when it starts and commits. The timestamps are monotonically increasing
to maintain transaction order for isolation.
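
As a rough illustration of the commit timestamp-based visibility rule (a sketch with invented names, not the actual PolarDB code): a tuple version is visible to a snapshot exactly when its creating transaction has a commit timestamp no later than the snapshot's start timestamp, so no Proc-Array scan or global lock is needed.

```c
/*
 * Hypothetical sketch of commit timestamp-based visibility (not PolarDB source).
 * A tiny in-memory table stands in for the CTS: it maps an xid to its commit
 * timestamp, with 0 meaning "in progress or aborted".
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
typedef uint64_t CommitTs;

#define MAX_XID 1024                       /* toy capacity for the example */
static CommitTs commit_ts_table[MAX_XID];  /* xid -> commit timestamp (0 = none) */

/* Stand-in for a CTS lookup. */
static CommitTs cts_get_commit_ts(TransactionId xid) {
    return (xid < MAX_XID) ? commit_ts_table[xid] : 0;
}

/*
 * A version created by `xmin` is visible to a snapshot taken at `snapshot_ts`
 * iff xmin committed with a timestamp <= snapshot_ts.
 */
static bool version_visible(TransactionId xmin, CommitTs snapshot_ts) {
    CommitTs commit_ts = cts_get_commit_ts(xmin);
    return commit_ts != 0 && commit_ts <= snapshot_ts;
}

int main(void) {
    commit_ts_table[100] = 50;                 /* xid 100 committed at ts 50 */
    printf("%d\n", version_visible(100, 60));  /* 1: committed before snapshot */
    printf("%d\n", version_visible(100, 40));  /* 0: committed after snapshot  */
    printf("%d\n", version_visible(200, 60));  /* 0: xid 200 never committed   */
    return 0;
}
```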

PolarDB implements a CTS (Commit Timestamp Store) which is carefully designed for scalability to store commit timestamps
and transaction status.
-As being in critical path, PolarDB employs well-known concurrent data structures and lock-free algorithms for CTS to avoid scaling bottlenecks on multi-core machines.
+Since the CTS is in the critical path, PolarDB employs well-known concurrent data structures and lock-free algorithms for it to avoid scaling bottlenecks on multi-core machines.
Specifically, PolarDB avoids acquiring exclusive lock as far as possible when accessing the CTS buffers in memory.
-Exclusive locking is needed only when a LRU replacement occurs.
+Exclusive locking is needed only when an LRU replacement occurs.
To parallelize LRU buffer replacement and flush, CTS adopts multi-lru structures for multi-core architectures.
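
One way to picture the multi-LRU design is a CTS buffer pool partitioned so that an LRU replacement locks only its own partition while plain lookups take no exclusive lock. The sketch below is a hypothetical outline, not the actual data structure; the partition count, page size, and names are assumptions.

```c
/*
 * Hypothetical sketch of a partitioned ("multi-LRU") CTS buffer pool
 * (not PolarDB source). Lookups of cached pages take no exclusive lock;
 * only an LRU replacement locks its own partition, so replacements and
 * flushes in different partitions can run in parallel.
 */
#include <pthread.h>
#include <stdint.h>

#define CTS_LRU_PARTITIONS 16          /* assumed partition count            */
#define XIDS_PER_PAGE      2048        /* assumed commit timestamps per page */

typedef struct CtsLruPartition {
    pthread_mutex_t replace_lock;      /* taken only for replacement/flush   */
    /* per-partition page table and LRU list would live here */
} CtsLruPartition;

static CtsLruPartition cts_lru[CTS_LRU_PARTITIONS];

static void cts_lru_init(void) {
    for (int i = 0; i < CTS_LRU_PARTITIONS; i++)
        pthread_mutex_init(&cts_lru[i].replace_lock, NULL);
}

/* An xid maps to a CTS page, and the page maps to one LRU partition. */
static CtsLruPartition *cts_partition_for_xid(uint32_t xid) {
    uint64_t pageno = xid / XIDS_PER_PAGE;
    return &cts_lru[pageno % CTS_LRU_PARTITIONS];
}

/* Replacement path: lock only the partition that needs a new buffer. */
static void cts_load_page_for_xid(uint32_t xid) {
    CtsLruPartition *p = cts_partition_for_xid(xid);
    pthread_mutex_lock(&p->replace_lock);
    /* evict this partition's LRU victim and read the page in (omitted) */
    pthread_mutex_unlock(&p->replace_lock);
}

int main(void) {
    cts_lru_init();
    cts_load_page_for_xid(123456);     /* loads the page holding xid 123456 */
    return 0;
}
```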

As xid-based snapshots are removed for better performance, PolarDB implements CTS-based vacuum, hot-chain pruning and logical replication, etc.
More interestingly, by using CTS PolarDB simplifies the snapshot building process of logical replication.
-The original logical replication in PostgreSQL consists of four steps which is complex and is hard to understand and to reason its correctness. In contrast, the CTS based snapshot building undergoes only two steps and is significantly easier to understand
+The original snapshot building for logical replication in PostgreSQL consists of four steps, which is complex and makes it hard to understand and to reason about its correctness. In contrast, the CTS-based snapshot building undergoes only two steps and is significantly easier to understand
than the original one.

-PolarDB designs a hybrid logical clock (HLC) to generate start and commit timestamps for transactions to maintain snapshot isolation (supporting RR and RC isolations). The adoption of HLC is to support decentralized distributed transaction management in our upcoming distributed shared-nothing PolarDB-PG. The HLC consists of a logical part (strictly increasing counter) and physical part. The logical part is used to track transaction order to ensure snapshot isolation while the physical part is used to generate freshness snapshots across different machines. The physical clock on each node can be synchronized by using NTP (Network Time Protocol) or PTP (Precision Time Protocol).
+PolarDB designs a hybrid logical clock (HLC) to generate start and commit timestamps for transactions to maintain snapshot isolation (supporting the RR and RC isolation levels). HLC is adopted to support decentralized distributed transaction management in our upcoming distributed shared-nothing PolarDB-PG. The HLC consists of a logical part (a strictly increasing counter) and a physical part. The logical part is used to track transaction order to ensure snapshot isolation, while the physical part is used to generate fresh snapshots across different machines. The physical clock on each node can be synchronized by using NTP (Network Time Protocol) or PTP (Precision Time Protocol).
The PTP within a local area network can keep the maximum clock skew between any two machines as small as several microseconds.
-The adoption of advanced PTP in a single data center can enable PolarDB-PG to provide strong external consistency across different nodes like Google Spanner. However, our upcoming open-sourced distributed version assumes machines being synchronized by NTP and only aims to guarantee snapshot isolation and internal consistency across nodes. A 64 bit HLC timestamp consists of 16 lowest bit logical counter, 48 higher bit physical time and 2 reserved bits.
+The adoption of advanced PTP in a single data center can enable PolarDB-PG to provide strong external consistency across different nodes, as Google Spanner does. However, our upcoming open-sourced distributed version assumes machines are synchronized by NTP and only aims to guarantee snapshot isolation and internal consistency across nodes. A 64-bit HLC timestamp consists of a 16-bit logical counter in the lowest bits, a 48-bit physical time in the higher bits, and 2 reserved bits.
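
As a rough sketch of how such a hybrid logical clock could be packed and advanced (a hypothetical, single-threaded illustration rather than the PolarDB implementation; it assumes the 16-bit logical counter occupies the low bits, places the physical time directly above it, and ignores the reserved bits):

```c
/*
 * Hypothetical HLC sketch (not PolarDB source). A 64-bit timestamp packs the
 * physical clock above a 16-bit logical counter; the reserved bits mentioned
 * in the text and any overflow handling are omitted for simplicity, and a
 * real implementation would use atomics instead of a plain static variable.
 */
#include <stdint.h>
#include <time.h>

#define HLC_LOGICAL_BITS 16
#define HLC_LOGICAL_MASK ((1ULL << HLC_LOGICAL_BITS) - 1)

static uint64_t hlc_current;   /* latest HLC value issued or observed locally */

/* Physical part: wall-clock milliseconds (assumed granularity). */
static uint64_t physical_now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

static uint64_t hlc_pack(uint64_t physical_ms, uint64_t logical) {
    return (physical_ms << HLC_LOGICAL_BITS) | (logical & HLC_LOGICAL_MASK);
}

/* New timestamp for a local event, e.g. assigning a start or commit timestamp. */
uint64_t hlc_next(void) {
    uint64_t wall = hlc_pack(physical_now_ms(), 0);
    uint64_t base = (wall > hlc_current) ? wall : hlc_current;
    hlc_current = base + 1;    /* +1 bumps the logical counter */
    return hlc_current;
}

/* Merge a timestamp arriving with a remote transaction begin or commit. */
uint64_t hlc_update(uint64_t received) {
    uint64_t base = hlc_pack(physical_now_ms(), 0);
    if (hlc_current > base) base = hlc_current;
    if (received   > base) base = received;
    hlc_current = base + 1;
    return hlc_current;
}
```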

To maintain distributed snapshot isolation, PolarDB adopts HLC to generate snapshot start timestamp for each transaction.
Any node which accepts one transaction acts as its coordinator node to assign its start and commit timestamp.
-To commit a distributed transaction, its coordinator uses two phase commit protocol (2PC), collects prepared HLC timestamps from all the participating nodes during the prepare phase and determines its commit timestamp by choosing the maximum timestamp from all the prepared timestamps. Finally, the commit timestamp is passed to all the participating nodes to commit the transaction.
+To commit a distributed transaction, its coordinator uses the two-phase commit protocol (2PC): it collects prepared HLC timestamps from all the participating nodes during the prepare phase and determines the commit timestamp by choosing the maximum of the prepared timestamps. Finally, the commit timestamp is passed to all the participating nodes to commit the transaction.
The hybrid logical clock on each node is updated using the arriving start and commit timestamps when a transaction accesses it, e.g., transaction begin and commit.
PolarDB uses 2PC prepared wait mechanism to resolve causality ordering between transactions like Google Percolator, i.e.,
-a MVCC scan must wait for one prepared transaction to complete when validating the visibility of its writes.
+an MVCC scan must wait for a prepared transaction to complete when validating the visibility of its writes.
The prepared status is maintained in CTS for fast access and is replaced with a commit timestamp when the prepared transaction commits. The HLC based distributed transaction will appear soon in our distributed shared-nothing version of PolarDB-PG.
The main design goal of PolarDB is to provide scaling OLTP performance within each many-core machine and across many machines.
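
The coordinator-side rule described above can be sketched as follows (a toy illustration with invented names, not the PolarDB code): prepare on every participant, take the maximum prepared HLC timestamp as the commit timestamp, then broadcast it in the commit phase.

```c
/*
 * Hypothetical sketch of the coordinator-side 2PC rule described above
 * (not PolarDB source): the commit timestamp of a distributed transaction
 * is the maximum of the prepared HLC timestamps collected from all
 * participants, and is then broadcast in the commit phase.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

typedef uint64_t HlcTs;

/* Toy stand-ins for the prepare/commit RPCs to a participant DN. */
static HlcTs toy_prepared_ts[] = { 105, 99, 131 };   /* per-node prepared HLC */

static HlcTs participant_prepare(int node_id) {
    return toy_prepared_ts[node_id];                 /* pretend RPC result */
}

static void participant_commit(int node_id, HlcTs commit_ts) {
    printf("node %d commits at HLC %llu\n", node_id, (unsigned long long)commit_ts);
}

static HlcTs commit_distributed_txn(const int *nodes, size_t n_nodes) {
    HlcTs commit_ts = 0;

    /* Phase 1: prepare everywhere, taking the maximum prepared timestamp. */
    for (size_t i = 0; i < n_nodes; i++) {
        HlcTs prepared_ts = participant_prepare(nodes[i]);
        if (prepared_ts > commit_ts)
            commit_ts = prepared_ts;
    }

    /* Phase 2: broadcast the chosen commit timestamp to all participants. */
    for (size_t i = 0; i < n_nodes; i++)
        participant_commit(nodes[i], commit_ts);

    return commit_ts;
}

int main(void) {
    int nodes[] = { 0, 1, 2 };
    commit_distributed_txn(nodes, 3);   /* commits at HLC 131, the maximum */
    return 0;
}
```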

doc/polardb/ha_paxos.md

Lines changed: 7 additions & 7 deletions
@@ -1,15 +1,15 @@
-PolarDB for PostgreSQL ensures zero data loss after any node failure by using X-Paxos based replication. X-Paxos consensus protocol was implemented by Alibaba and deployed in product systems supporting Alibaba major applications and platforms. X-Paxos achieves strong consistency cross replicas and high availability at replica failure. It allows cross-AZ and cross-DC network latency while maintaining high throughput and correctness.
+PolarDB for PostgreSQL ensures zero data loss after any node failure by using X-Paxos-based replication. The X-Paxos consensus protocol was implemented by Alibaba and is deployed in production systems supporting Alibaba's major applications and platforms. X-Paxos achieves strong consistency across replicas and high availability upon replica failure. It tolerates cross-AZ and cross-DC network latency while maintaining high throughput and correctness.

-PolarDB for PostgreSQL combined X-Paxos with PostgreSQL’s streaming replication function. X-Paxos is responsible for maintaining consistent replicas’ status, such their received and applied redo log positions, replicas’ role, and leader election, etc. PostgreSQL’s streaming replication offers functions of WAL’s transferring and receiving and persistency.
+PolarDB for PostgreSQL combines X-Paxos with PostgreSQL’s streaming replication. X-Paxos is responsible for maintaining consistent replica status, such as the received and applied redo log positions, replica roles, and leader election. PostgreSQL’s streaming replication provides WAL transfer, receipt, and persistence.

-On leader node, a consensus service is deployed to negotiate the log synchronization positions with follower nodes. Those positions are used to determine the starting WAL records to be sent to followers. Transactions can commit only after majority nodes have received WAL records and they are persistent to disks, called reaching consensus. At the same time, any data related persistence operations (write IO) needs to wait for the corresponding WAL records reach consensus among replicas. For follower nodes, their recovery process applies a WAL record only when it reaches consensus. This can prevent any uncommitted data to become visible to users through follower nodes.
+On the leader node, a consensus service is deployed to negotiate the log synchronization positions with follower nodes. Those positions are used to determine the starting WAL records to be sent to followers. A transaction can commit only after a majority of nodes have received its WAL records and persisted them to disk, which is called reaching consensus. At the same time, any data-related persistence operation (write IO) needs to wait for the corresponding WAL records to reach consensus among replicas. On follower nodes, the recovery process applies a WAL record only when it has reached consensus. This prevents any uncommitted data from becoming visible to users through follower nodes.
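
The commit rule above can be sketched as a wait on a "consensus LSN" that the consensus service advances once a majority of nodes has persisted WAL up to that point (hypothetical names, not the actual replication code):

```c
/*
 * Hypothetical sketch of the "commit waits for consensus" rule described
 * above (not PolarDB source). `consensus_lsn` models the highest WAL
 * position known to be persisted on a majority of nodes; a backend may
 * acknowledge a commit only once its commit record is at or below it.
 */
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* WAL position (LSN) */

/* Advanced by the consensus service as followers acknowledge persistence. */
static _Atomic XLogRecPtr consensus_lsn;

/* Assumed helper: block briefly, e.g. on a latch or condition variable. */
static void wait_a_moment(void) { /* omitted */ }

/* Returns once the commit record at `commit_record_lsn` has reached consensus. */
void wait_for_consensus(XLogRecPtr commit_record_lsn) {
    while (atomic_load(&consensus_lsn) < commit_record_lsn)
        wait_a_moment();
}

/* Called by the consensus service when a majority has persisted up to `lsn`. */
void advance_consensus_lsn(XLogRecPtr lsn) {
    XLogRecPtr cur = atomic_load(&consensus_lsn);
    while (lsn > cur &&
           !atomic_compare_exchange_weak(&consensus_lsn, &cur, lsn))
        ;   /* another updater raced us; cur now holds the newer value */
}
```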

-The consensus service maintains a consensus log. Its records store WAL entries’ locations. A leader node generates a consensus log entry based on the current local WAL position, and sends it to follower nodes. Followers receive the consensus log entry. They write it in local disks after ensuring that the WAL record corresponding to the consensus log entry is successfully persisted. A consensus log entry becomes persistent in majority nodes then the log entry reaches consensus. The leader node uses the WAL location recorded in the consensus log entry, which already reaches consensus, to determine the WAL location reaching consensus.
+The consensus service maintains a consensus log whose records store the locations of WAL entries. A leader node generates a consensus log entry based on the current local WAL position and sends it to follower nodes. Followers receive the consensus log entry and write it to local disk after ensuring that the WAL record the entry refers to has been successfully persisted. Once a consensus log entry is persistent on a majority of nodes, the entry reaches consensus. The leader node uses the WAL locations recorded in the consensus log entries that have already reached consensus to determine the WAL position that has reached consensus.
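
A possible shape for such a consensus log entry, and the follower-side rule of persisting it only after the referenced WAL is durable, might look like this (hypothetical structures and helpers, not the actual X-Paxos code):

```c
/*
 * Hypothetical sketch of a consensus log entry that references a WAL
 * position (not the actual X-Paxos structures). A follower appends the
 * entry to its local consensus log only after the WAL up to `wal_end`
 * is durable, so "entry persisted" implies "referenced WAL persisted".
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef struct ConsensusLogEntry {
    uint64_t   term;        /* leader term that produced the entry */
    uint64_t   index;       /* position in the consensus log       */
    XLogRecPtr wal_end;     /* WAL location this entry vouches for */
} ConsensusLogEntry;

/* Assumed helpers standing in for WAL flush tracking and local persistence. */
extern XLogRecPtr local_wal_flushed_upto(void);
extern void       append_to_local_consensus_log(const ConsensusLogEntry *e);

/* Follower-side handling of an entry received from the leader. */
bool follower_accept_entry(const ConsensusLogEntry *entry) {
    if (local_wal_flushed_upto() < entry->wal_end)
        return false;                     /* WAL not durable yet: retry later   */
    append_to_local_consensus_log(entry); /* now safe to persist the entry      */
    return true;                          /* leader counts this toward majority */
}
```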

-Follower nodes use X-Paxos state machine to manage streaming replication and log applying. WAL is pulled from leader node to follower nodes. When leadership changes, follower nodes use local received WAL position and send WAL requests to new leader node for any WAL after the position. When a follower is elected as a leader, it automatically exits from recovery and promotes itself as primary PostgreSQL instance, offering read and write service to clients
+Follower nodes use the X-Paxos state machine to manage streaming replication and log applying. WAL is pulled from the leader node to follower nodes. When leadership changes, a follower node uses its locally received WAL position and requests from the new leader any WAL after that position. When a follower is elected as leader, it automatically exits recovery and promotes itself to the primary PostgreSQL instance, offering read and write service to clients.

-PolarDB for PostgreSQL supports multiple X-Paxos roles, such as leader, follower, and logger. The logger node differs from follower nodes in whether storing data and whether applying WAL to recovery data. Logger nodes only retain real-time WAL logs. It conducts streaming replication as followers but the received WAL logs are not played back. This means a logger node does not stores data or run recovery. Using logger nodes allows X-Paxos to achieve similar consensus across nodes with less storage cost paid.
+PolarDB for PostgreSQL supports multiple X-Paxos roles: leader, follower, and logger. A logger node differs from follower nodes in whether it stores data and whether it applies WAL to recover data. Logger nodes only retain real-time WAL logs; they conduct streaming replication like followers, but the received WAL is not replayed. This means a logger node neither stores data nor runs recovery. Using logger nodes allows X-Paxos to reach the same consensus across nodes at a lower storage cost.

___

-Copyright © Alibaba Group, Inc.
+Copyright © Alibaba Group, Inc.
