Commit 9345ce3

Merge branch 'master' into dev/bugfix
2 parents 1301a59 + 280ef60 commit 9345ce3

3 files changed (+19, -19 lines)

doc/polardb/arch.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
## Architecture and Main Functions

-PolarDB for PostgreSQL implements a share-nothing architecture using PostgreSQL as a main component. Without specific mention, we use PolarDB to represent PolarDB for PostgreSQL. PolarDB is fully compatible to PostgreSQL and supports PostgreSQL's most SQL functions. As a distributed database system, PolarDB achieves same data consistency and ACID as a single-node database system. PolarDB implements Paxos based replication, which offers high availability, data redundancy, and consistency across nodes, during node failure and cluster reconfiguration. Fine-grained sharding and application-transparent shard relocation allow PolarDB to efficiently utilize cloud-wise resource to adapt to varying computation and storage requirements. PolarDB's distributed SQL engine achieves fast processing of complex queries by combining the comprehensiveness of PostgreSQL optimizer and the efficiency of parallel execution among or inside nodes.
+PolarDB for PostgreSQL implements a shared-nothing architecture using PostgreSQL as the main component. Unless otherwise noted, we use PolarDB to refer to PolarDB for PostgreSQL. PolarDB is fully compatible with PostgreSQL and supports most of PostgreSQL's SQL functions. As a distributed database system, PolarDB achieves the same data consistency and ACID guarantees as a standalone database system. PolarDB implements Paxos-based replication, which offers high availability, data redundancy, and consistency across nodes during node failures and cluster reconfiguration. Fine-grained sharding and application-transparent shard relocation allow PolarDB to efficiently utilize cloud resources to adapt to varying computation and storage requirements. PolarDB's distributed SQL engine achieves fast processing of complex queries by combining the comprehensiveness of the PostgreSQL optimizer with the efficiency of parallel execution across and within nodes.

Overall, PolarDB provides scalable SQL computation and fully-ACID relational data storage on commodity hardware or standard cloud resources, such as ECS and block storage service.

@@ -9,7 +9,7 @@ Overall, PolarDB provides scalable SQL computation and fully-ACID relational dat
<img src="sharding_plug_in.png" alt="Sharding and Plug-in" width="250"/>

PolarDB cluster is formed of three main components: database node (DN), cluster manager (CM), and
-transaction & time service (TM). They are different processes and can be deployed in multiple servers or ECS. DNs are main database engines, which receive SQL requests from clients or load balancer and process them. Each DN can act as a coordinator to distribute queries and coordinate transactions across multiple DNs. Each DN is responsible for part of data stored in a database. Any operations, including read and write, to those data, are processed by their corresponding DN. The data in one DN is further partitioned into shards. Those shards can be relocated to other DNs when PolarDB scales out or re-balances workload. DNs have their replicas storing same data through Paxos based replication. DNs and their replicas form a replication group. In such group, a primary DN handles all write requests and propogate their results to replica DNs, or called follower DNs. Our Paxos replication also supports logger DNs, which only stores log records rather than data.
+transaction & time service (TM). They are separate processes and can be deployed on multiple servers or ECS instances. DNs are the main database engines; they receive SQL requests from clients or a load balancer and process them. Each DN can act as a coordinator to distribute queries and coordinate transactions across many DNs. Each DN handles part of the data stored in a database. Any operations on those data, including reads and writes, are processed by their corresponding DN. The data in one DN is further partitioned into shards. Those shards can be relocated to other DNs when PolarDB scales out or rebalances the workload. DNs have replicas storing the same data through Paxos-based replication. DNs and their replicas form a replication group. In such a group, a primary DN handles all write requests and propagates their results to replica DNs (also called follower DNs). Our Paxos replication also supports logger DNs, which store only log records rather than data.

TM is a centralized service to support globally consistent and in-order status or counters. Ascending timestamp and global transaction ID are two examples.
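
To make the sharding and shard-relocation model above concrete, here is a minimal, hypothetical C sketch (not taken from the PolarDB source): a distribution key hashes to a shard, a shard map records which DN currently owns each shard, and relocation only updates that map. All names (`ShardMap`, `route_to_dn`, the shard count) are invented for illustration.

```c
/*
 * Hypothetical sketch (not PolarDB source): fine-grained sharding where a key
 * hashes to a shard and a shard map records the DN that currently owns each
 * shard, so shards can be relocated without changing the hash function.
 */
#include <stdint.h>
#include <stdio.h>

#define NUM_SHARDS 64              /* assumed shard count for the example */

typedef struct ShardMap {
    int owner_dn[NUM_SHARDS];      /* shard id -> owning DN id */
} ShardMap;

/* Map a distribution key to a shard; the hash stays stable across rebalancing. */
static int key_to_shard(uint64_t key) {
    return (int)(key % NUM_SHARDS);
}

/* Look up the DN that currently serves the key's shard. */
static int route_to_dn(const ShardMap *map, uint64_t key) {
    return map->owner_dn[key_to_shard(key)];
}

/* Relocating a shard only updates the map entry; data movement happens elsewhere. */
static void relocate_shard(ShardMap *map, int shard, int new_dn) {
    map->owner_dn[shard] = new_dn;
}

int main(void) {
    ShardMap map;
    for (int s = 0; s < NUM_SHARDS; s++)
        map.owner_dn[s] = s % 3;                 /* spread 64 shards over 3 DNs */

    printf("key 42 -> DN %d\n", route_to_dn(&map, 42));
    relocate_shard(&map, key_to_shard(42), 2);   /* scale-out / rebalance */
    printf("key 42 -> DN %d\n", route_to_dn(&map, 42));
    return 0;
}
```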

doc/polardb/cts.md

Lines changed: 10 additions & 10 deletions
@@ -2,41 +2,41 @@
## HLC, CTS and (Distributed) Snapshot Isolation

PolarDB for PG (or PolarDB for short) implements a multi-core scaling transaction processing system
-and supports distributed transactions (upcoming) by using commit timestamp based MVCC.
-The traditional PostgreSQL uses xid-based snapshot to provide transactional isolation
+and supports distributed transactions (upcoming) by using commit timestamp-based MVCC.
+Traditional PostgreSQL uses an xid-based snapshot to provide transactional isolation
on top of MVCC which can introduce scaling bottleneck on many-core machines.
Specifically, each transaction snapshots the currently running transactions in the Proc-Array at its start
by holding the Proc-Array lock. Meanwhile, the Proc-Array lock is acquired to clear a transaction at its end.
The frequent use of the global Proc-Array lock in the critical path could lead to severe contention.
For read-committed (RC) isolation, each snapshot is generated at the statement level which can further exacerbate the performance.

-To tackle this problem, PolarDB adopts the commit timestamp based MVCC for better
+To tackle this problem, PolarDB adopts commit timestamp-based MVCC for better
scalability. Specifically, each transaction assigns a start timestamp and commit timestamp
for each transaction when it starts and commits. The timestamps are monotonically increasing
to maintain transaction order for isolation.
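
As a rough illustration of the commit timestamp-based visibility rule (a sketch with invented names, not the actual PolarDB code): a tuple version is visible to a snapshot exactly when its creating transaction has a commit timestamp no later than the snapshot's start timestamp, so no Proc-Array scan or global lock is needed.

```c
/*
 * Hypothetical sketch of commit timestamp-based visibility (not PolarDB source).
 * A tiny in-memory table stands in for the CTS: it maps an xid to its commit
 * timestamp, with 0 meaning "in progress or aborted".
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
typedef uint64_t CommitTs;

#define MAX_XID 1024                       /* toy capacity for the example */
static CommitTs commit_ts_table[MAX_XID];  /* xid -> commit timestamp (0 = none) */

/* Stand-in for a CTS lookup. */
static CommitTs cts_get_commit_ts(TransactionId xid) {
    return (xid < MAX_XID) ? commit_ts_table[xid] : 0;
}

/*
 * A version created by `xmin` is visible to a snapshot taken at `snapshot_ts`
 * iff xmin committed with a timestamp <= snapshot_ts.
 */
static bool version_visible(TransactionId xmin, CommitTs snapshot_ts) {
    CommitTs commit_ts = cts_get_commit_ts(xmin);
    return commit_ts != 0 && commit_ts <= snapshot_ts;
}

int main(void) {
    commit_ts_table[100] = 50;                 /* xid 100 committed at ts 50 */
    printf("%d\n", version_visible(100, 60));  /* 1: committed before snapshot */
    printf("%d\n", version_visible(100, 40));  /* 0: committed after snapshot  */
    printf("%d\n", version_visible(200, 60));  /* 0: xid 200 never committed   */
    return 0;
}
```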

PolarDB implements a CTS (Commit Timestamp Store) which is carefully designed for scalability to store commit timestamps
and transaction status.
-As being in critical path, PolarDB employs well-known concurrent data structures and lock-free algorithms for CTS to avoid scaling bottlenecks on multi-core machines.
+Since the CTS is in the critical path, PolarDB employs well-known concurrent data structures and lock-free algorithms for it to avoid scaling bottlenecks on multi-core machines.
Specifically, PolarDB avoids acquiring exclusive lock as far as possible when accessing the CTS buffers in memory.
-Exclusive locking is needed only when a LRU replacement occurs.
+Exclusive locking is needed only when an LRU replacement occurs.
To parallelize LRU buffer replacement and flush, CTS adopts multi-lru structures for multi-core architectures.
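
One way to picture the multi-LRU design is a CTS buffer pool partitioned so that an LRU replacement locks only its own partition while plain lookups take no exclusive lock. The sketch below is a hypothetical outline, not the actual data structure; the partition count, page size, and names are assumptions.

```c
/*
 * Hypothetical sketch of a partitioned ("multi-LRU") CTS buffer pool
 * (not PolarDB source). Lookups of cached pages take no exclusive lock;
 * only an LRU replacement locks its own partition, so replacements and
 * flushes in different partitions can run in parallel.
 */
#include <pthread.h>
#include <stdint.h>

#define CTS_LRU_PARTITIONS 16          /* assumed partition count            */
#define XIDS_PER_PAGE      2048        /* assumed commit timestamps per page */

typedef struct CtsLruPartition {
    pthread_mutex_t replace_lock;      /* taken only for replacement/flush   */
    /* per-partition page table and LRU list would live here */
} CtsLruPartition;

static CtsLruPartition cts_lru[CTS_LRU_PARTITIONS];

static void cts_lru_init(void) {
    for (int i = 0; i < CTS_LRU_PARTITIONS; i++)
        pthread_mutex_init(&cts_lru[i].replace_lock, NULL);
}

/* An xid maps to a CTS page, and the page maps to one LRU partition. */
static CtsLruPartition *cts_partition_for_xid(uint32_t xid) {
    uint64_t pageno = xid / XIDS_PER_PAGE;
    return &cts_lru[pageno % CTS_LRU_PARTITIONS];
}

/* Replacement path: lock only the partition that needs a new buffer. */
static void cts_load_page_for_xid(uint32_t xid) {
    CtsLruPartition *p = cts_partition_for_xid(xid);
    pthread_mutex_lock(&p->replace_lock);
    /* evict this partition's LRU victim and read the page in (omitted) */
    pthread_mutex_unlock(&p->replace_lock);
}

int main(void) {
    cts_lru_init();
    cts_load_page_for_xid(123456);     /* loads the page holding xid 123456 */
    return 0;
}
```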

As xid-based snapshots are removed for better performance, PolarDB implements CTS-based vacuum, hot-chain pruning and logical replication, etc.
More interestingly, by using CTS PolarDB simplifies the snapshot building process of logical replication.
-The original logical replication in PostgreSQL consists of four steps which is complex and is hard to understand and to reason its correctness. In contrast, the CTS based snapshot building undergoes only two steps and is significantly easier to understand
+The original snapshot building for logical replication in PostgreSQL consists of four steps, which is complex and makes it hard to understand and to reason about its correctness. In contrast, the CTS-based snapshot building undergoes only two steps and is significantly easier to understand
than the original one.

-PolarDB designs a hybrid logical clock (HLC) to generate start and commit timestamps for transactions to maintain snapshot isolation (supporting RR and RC isolations). The adoption of HLC is to support decentralized distributed transaction management in our upcoming distributed shared-nothing PolarDB-PG. The HLC consists of a logical part (strictly increasing counter) and physical part. The logical part is used to track transaction order to ensure snapshot isolation while the physical part is used to generate freshness snapshots across different machines. The physical clock on each node can be synchronized by using NTP (Network Time Protocol) or PTP (Precision Time Protocol).
+PolarDB designs a hybrid logical clock (HLC) to generate start and commit timestamps for transactions to maintain snapshot isolation (supporting the RR and RC isolation levels). HLC is adopted to support decentralized distributed transaction management in our upcoming distributed shared-nothing PolarDB-PG. The HLC consists of a logical part (a strictly increasing counter) and a physical part. The logical part is used to track transaction order to ensure snapshot isolation, while the physical part is used to generate fresh snapshots across different machines. The physical clock on each node can be synchronized by using NTP (Network Time Protocol) or PTP (Precision Time Protocol).
The PTP within a local area network can keep the maximum clock skew between any two machines as small as several microseconds.
-The adoption of advanced PTP in a single data center can enable PolarDB-PG to provide strong external consistency across different nodes like Google Spanner. However, our upcoming open-sourced distributed version assumes machines being synchronized by NTP and only aims to guarantee snapshot isolation and internal consistency across nodes. A 64 bit HLC timestamp consists of 16 lowest bit logical counter, 48 higher bit physical time and 2 reserved bits.
+The adoption of advanced PTP in a single data center can enable PolarDB-PG to provide strong external consistency across different nodes, as Google Spanner does. However, our upcoming open-sourced distributed version assumes machines are synchronized by NTP and only aims to guarantee snapshot isolation and internal consistency across nodes. A 64-bit HLC timestamp consists of a 16-bit logical counter in the lowest bits, a 48-bit physical time in the higher bits, and 2 reserved bits.
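
As a rough sketch of how such a hybrid logical clock could be packed and advanced (a hypothetical, single-threaded illustration rather than the PolarDB implementation; it assumes the 16-bit logical counter occupies the low bits, places the physical time directly above it, and ignores the reserved bits):

```c
/*
 * Hypothetical HLC sketch (not PolarDB source). A 64-bit timestamp packs the
 * physical clock above a 16-bit logical counter; the reserved bits mentioned
 * in the text and any overflow handling are omitted for simplicity, and a
 * real implementation would use atomics instead of a plain static variable.
 */
#include <stdint.h>
#include <time.h>

#define HLC_LOGICAL_BITS 16
#define HLC_LOGICAL_MASK ((1ULL << HLC_LOGICAL_BITS) - 1)

static uint64_t hlc_current;   /* latest HLC value issued or observed locally */

/* Physical part: wall-clock milliseconds (assumed granularity). */
static uint64_t physical_now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

static uint64_t hlc_pack(uint64_t physical_ms, uint64_t logical) {
    return (physical_ms << HLC_LOGICAL_BITS) | (logical & HLC_LOGICAL_MASK);
}

/* New timestamp for a local event, e.g. assigning a start or commit timestamp. */
uint64_t hlc_next(void) {
    uint64_t wall = hlc_pack(physical_now_ms(), 0);
    uint64_t base = (wall > hlc_current) ? wall : hlc_current;
    hlc_current = base + 1;    /* +1 bumps the logical counter */
    return hlc_current;
}

/* Merge a timestamp arriving with a remote transaction begin or commit. */
uint64_t hlc_update(uint64_t received) {
    uint64_t base = hlc_pack(physical_now_ms(), 0);
    if (hlc_current > base) base = hlc_current;
    if (received   > base) base = received;
    hlc_current = base + 1;
    return hlc_current;
}
```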

To maintain distributed snapshot isolation, PolarDB adopts HLC to generate snapshot start timestamp for each transaction.
Any node which accepts one transaction acts as its coordinator node to assign its start and commit timestamp.
-To commit a distributed transaction, its coordinator uses two phase commit protocol (2PC), collects prepared HLC timestamps from all the participating nodes during the prepare phase and determines its commit timestamp by choosing the maximum timestamp from all the prepared timestamps. Finally, the commit timestamp is passed to all the participating nodes to commit the transaction.
+To commit a distributed transaction, its coordinator uses the two-phase commit protocol (2PC): it collects prepared HLC timestamps from all the participating nodes during the prepare phase and determines the commit timestamp by choosing the maximum of the prepared timestamps. Finally, the commit timestamp is passed to all the participating nodes to commit the transaction.
The hybrid logical clock on each node is updated using the arriving start and commit timestamps when a transaction accesses it, e.g., transaction begin and commit.
PolarDB uses 2PC prepared wait mechanism to resolve causality ordering between transactions like Google Percolator, i.e.,
-a MVCC scan must wait for one prepared transaction to complete when validating the visibility of its writes.
+an MVCC scan must wait for a prepared transaction to complete when validating the visibility of its writes.
The prepared status is maintained in CTS for fast access and is replaced with a commit timestamp when the prepared transaction commits. The HLC based distributed transaction will appear soon in our distributed shared-nothing version of PolarDB-PG.
The main design goal of PolarDB is to provide scaling OLTP performance within each many-core machine and across many machines.
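
The coordinator-side rule described above can be sketched as follows (a toy illustration with invented names, not the PolarDB code): prepare on every participant, take the maximum prepared HLC timestamp as the commit timestamp, then broadcast it in the commit phase.

```c
/*
 * Hypothetical sketch of the coordinator-side 2PC rule described above
 * (not PolarDB source): the commit timestamp of a distributed transaction
 * is the maximum of the prepared HLC timestamps collected from all
 * participants, and is then broadcast in the commit phase.
 */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

typedef uint64_t HlcTs;

/* Toy stand-ins for the prepare/commit RPCs to a participant DN. */
static HlcTs toy_prepared_ts[] = { 105, 99, 131 };   /* per-node prepared HLC */

static HlcTs participant_prepare(int node_id) {
    return toy_prepared_ts[node_id];                 /* pretend RPC result */
}

static void participant_commit(int node_id, HlcTs commit_ts) {
    printf("node %d commits at HLC %llu\n", node_id, (unsigned long long)commit_ts);
}

static HlcTs commit_distributed_txn(const int *nodes, size_t n_nodes) {
    HlcTs commit_ts = 0;

    /* Phase 1: prepare everywhere, taking the maximum prepared timestamp. */
    for (size_t i = 0; i < n_nodes; i++) {
        HlcTs prepared_ts = participant_prepare(nodes[i]);
        if (prepared_ts > commit_ts)
            commit_ts = prepared_ts;
    }

    /* Phase 2: broadcast the chosen commit timestamp to all participants. */
    for (size_t i = 0; i < n_nodes; i++)
        participant_commit(nodes[i], commit_ts);

    return commit_ts;
}

int main(void) {
    int nodes[] = { 0, 1, 2 };
    commit_distributed_txn(nodes, 3);   /* commits at HLC 131, the maximum */
    return 0;
}
```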

doc/polardb/ha_paxos.md

Lines changed: 7 additions & 7 deletions
@@ -1,15 +1,15 @@
-PolarDB for PostgreSQL ensures zero data loss after any node failure by using X-Paxos based replication. X-Paxos consensus protocol was implemented by Alibaba and deployed in product systems supporting Alibaba major applications and platforms. X-Paxos achieves strong consistency cross replicas and high availability at replica failure. It allows cross-AZ and cross-DC network latency while maintaining high throughput and correctness.
+PolarDB for PostgreSQL ensures zero data loss after any node failure by using X-Paxos-based replication. The X-Paxos consensus protocol was implemented by Alibaba and is deployed in production systems supporting Alibaba's major applications and platforms. X-Paxos achieves strong consistency across replicas and high availability upon replica failure. It tolerates cross-AZ and cross-DC network latency while maintaining high throughput and correctness.

-PolarDB for PostgreSQL combined X-Paxos with PostgreSQL’s streaming replication function. X-Paxos is responsible for maintaining consistent replicas’ status, such their received and applied redo log positions, replicas’ role, and leader election, etc. PostgreSQL’s streaming replication offers functions of WAL’s transferring and receiving and persistency.
+PolarDB for PostgreSQL combines X-Paxos with PostgreSQL’s streaming replication. X-Paxos is responsible for maintaining consistent replica status, such as the received and applied redo log positions, replica roles, and leader election. PostgreSQL’s streaming replication provides WAL transfer, receipt, and persistence.

-On leader node, a consensus service is deployed to negotiate the log synchronization positions with follower nodes. Those positions are used to determine the starting WAL records to be sent to followers. Transactions can commit only after majority nodes have received WAL records and they are persistent to disks, called reaching consensus. At the same time, any data related persistence operations (write IO) needs to wait for the corresponding WAL records reach consensus among replicas. For follower nodes, their recovery process applies a WAL record only when it reaches consensus. This can prevent any uncommitted data to become visible to users through follower nodes.
+On the leader node, a consensus service is deployed to negotiate the log synchronization positions with follower nodes. Those positions are used to determine the starting WAL records to be sent to followers. A transaction can commit only after a majority of nodes have received its WAL records and persisted them to disk, which is called reaching consensus. At the same time, any data-related persistence operation (write IO) needs to wait for the corresponding WAL records to reach consensus among replicas. On follower nodes, the recovery process applies a WAL record only when it has reached consensus. This prevents any uncommitted data from becoming visible to users through follower nodes.
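
The commit rule above can be sketched as a wait on a "consensus LSN" that the consensus service advances once a majority of nodes has persisted WAL up to that point (hypothetical names, not the actual replication code):

```c
/*
 * Hypothetical sketch of the "commit waits for consensus" rule described
 * above (not PolarDB source). `consensus_lsn` models the highest WAL
 * position known to be persisted on a majority of nodes; a backend may
 * acknowledge a commit only once its commit record is at or below it.
 */
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* WAL position (LSN) */

/* Advanced by the consensus service as followers acknowledge persistence. */
static _Atomic XLogRecPtr consensus_lsn;

/* Assumed helper: block briefly, e.g. on a latch or condition variable. */
static void wait_a_moment(void) { /* omitted */ }

/* Returns once the commit record at `commit_record_lsn` has reached consensus. */
void wait_for_consensus(XLogRecPtr commit_record_lsn) {
    while (atomic_load(&consensus_lsn) < commit_record_lsn)
        wait_a_moment();
}

/* Called by the consensus service when a majority has persisted up to `lsn`. */
void advance_consensus_lsn(XLogRecPtr lsn) {
    XLogRecPtr cur = atomic_load(&consensus_lsn);
    while (lsn > cur &&
           !atomic_compare_exchange_weak(&consensus_lsn, &cur, lsn))
        ;   /* another updater raced us; cur now holds the newer value */
}
```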

-The consensus service maintains a consensus log. Its records store WAL entries’ locations. A leader node generates a consensus log entry based on the current local WAL position, and sends it to follower nodes. Followers receive the consensus log entry. They write it in local disks after ensuring that the WAL record corresponding to the consensus log entry is successfully persisted. A consensus log entry becomes persistent in majority nodes then the log entry reaches consensus. The leader node uses the WAL location recorded in the consensus log entry, which already reaches consensus, to determine the WAL location reaching consensus.
+The consensus service maintains a consensus log whose records store the locations of WAL entries. A leader node generates a consensus log entry based on the current local WAL position and sends it to follower nodes. Followers receive the consensus log entry and write it to local disk after ensuring that the WAL record the entry refers to has been successfully persisted. Once a consensus log entry is persistent on a majority of nodes, the entry reaches consensus. The leader node uses the WAL locations recorded in the consensus log entries that have already reached consensus to determine the WAL position that has reached consensus.
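
A possible shape for such a consensus log entry, and the follower-side rule of persisting it only after the referenced WAL is durable, might look like this (hypothetical structures and helpers, not the actual X-Paxos code):

```c
/*
 * Hypothetical sketch of a consensus log entry that references a WAL
 * position (not the actual X-Paxos structures). A follower appends the
 * entry to its local consensus log only after the WAL up to `wal_end`
 * is durable, so "entry persisted" implies "referenced WAL persisted".
 */
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef struct ConsensusLogEntry {
    uint64_t   term;        /* leader term that produced the entry */
    uint64_t   index;       /* position in the consensus log       */
    XLogRecPtr wal_end;     /* WAL location this entry vouches for */
} ConsensusLogEntry;

/* Assumed helpers standing in for WAL flush tracking and local persistence. */
extern XLogRecPtr local_wal_flushed_upto(void);
extern void       append_to_local_consensus_log(const ConsensusLogEntry *e);

/* Follower-side handling of an entry received from the leader. */
bool follower_accept_entry(const ConsensusLogEntry *entry) {
    if (local_wal_flushed_upto() < entry->wal_end)
        return false;                     /* WAL not durable yet: retry later   */
    append_to_local_consensus_log(entry); /* now safe to persist the entry      */
    return true;                          /* leader counts this toward majority */
}
```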

-Follower nodes use X-Paxos state machine to manage streaming replication and log applying. WAL is pulled from leader node to follower nodes. When leadership changes, follower nodes use local received WAL position and send WAL requests to new leader node for any WAL after the position. When a follower is elected as a leader, it automatically exits from recovery and promotes itself as primary PostgreSQL instance, offering read and write service to clients
+Follower nodes use the X-Paxos state machine to manage streaming replication and log applying. WAL is pulled from the leader node to follower nodes. When leadership changes, a follower node uses its locally received WAL position and requests from the new leader any WAL after that position. When a follower is elected as leader, it automatically exits recovery and promotes itself to the primary PostgreSQL instance, offering read and write service to clients.

-PolarDB for PostgreSQL supports multiple X-Paxos roles, such as leader, follower, and logger. The logger node differs from follower nodes in whether storing data and whether applying WAL to recovery data. Logger nodes only retain real-time WAL logs. It conducts streaming replication as followers but the received WAL logs are not played back. This means a logger node does not stores data or run recovery. Using logger nodes allows X-Paxos to achieve similar consensus across nodes with less storage cost paid.
+PolarDB for PostgreSQL supports multiple X-Paxos roles: leader, follower, and logger. A logger node differs from follower nodes in whether it stores data and whether it applies WAL to recover data. Logger nodes only retain real-time WAL logs; they conduct streaming replication like followers, but the received WAL is not replayed. This means a logger node neither stores data nor runs recovery. Using logger nodes allows X-Paxos to reach the same consensus across nodes at a lower storage cost.

___

-Copyright © Alibaba Group, Inc.
+Copyright © Alibaba Group, Inc.
