Comparing changes

The view exposes the contents of the shared buffer lookup table for debugging, testing and investigation.

TODO: It would be better to place this view in pg_buffercache. It is added as a system view because BufHashTable is not exposed outside buf_table.c. To move it to pg_buffercache, we should move the function pg_get_buffer_lookup_table() to pg_buffercache and have it invoke BufTableGetContent(), passing it the tuple store and tuple descriptor. BufTableGetContent() fills the tuple store. The partitions are locked by pg_get_buffer_lookup_table().

Author: Ashutosh Bapat

Currently AIO workers process interrupts only via CHECK_FOR_INTERRUPTS, which does not include ConfigReloadPending. Thus we need to check for it explicitly.

Currently an assign hook can perform some preprocessing of a new value, but it cannot change the behavior, which dictates that the new value is applied immediately after the hook. Certain GUC options (like shared_buffers, coming in subsequent patches) may need coordinated work between backends to change, meaning we cannot apply the value right away.

Add a new flag "pending" for an assign hook to allow the hook to indicate exactly that. If the pending flag is set after the hook, the new value will not be applied and its handling becomes the responsibility of the hook's implementation.

Note that this also requires changes in the way GUCs are reported, but the patch does not cover that yet.

Currently WaitForProcSignalBarrier makes it possible to ensure that the message sent via EmitProcSignalBarrier was processed by all participants of the ProcSignal mechanism.

Add pss_barrierReceivedGeneration alongside pss_barrierGeneration. It is updated when a process has received the message but has not processed it yet. This makes it possible to support a new mode of waiting, in which ProcSignal participants want to synchronize message processing. To do that, a participant can wait via WaitForProcSignalBarrierReceived when processing a message, effectively making sure that all processes start processing the ProcSignalBarrier simultaneously.
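The difference between the two generation counters can be illustrated with a single-threaded sketch. SignalSlot, all_received and all_processed are hypothetical names; the actual code keeps these counters in shared memory, reads them atomically and sleeps on condition variables, none of which is modeled here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-process slot with the two counters described above. */
typedef struct SignalSlot
{
    uint64_t received_generation;   /* newest barrier the process has seen */
    uint64_t processed_generation;  /* newest barrier it has fully handled */
} SignalSlot;

/* True once every participant has at least received the given barrier. */
static bool
all_received(const SignalSlot *slots, size_t nslots, uint64_t generation)
{
    for (size_t i = 0; i < nslots; i++)
    {
        if (slots[i].received_generation < generation)
            return false;
    }
    return true;
}

/* True once every participant has finished processing the given barrier. */
static bool
all_processed(const SignalSlot *slots, size_t nslots, uint64_t generation)
{
    for (size_t i = 0; i < nslots; i++)
    {
        if (slots[i].processed_generation < generation)
            return false;
    }
    return true;
}
```

A participant that waits for all_received() before doing its own work gets the "start together" behavior, while the emitter can still wait for all_processed() as before.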

Currently all the work with shared memory is done via a single anonymous memory mapping, which limits how the shared memory can be organized.

Introduce the possibility to allocate multiple shared memory mappings, where a single mapping is associated with a specified shared memory segment. There is only a fixed number of available segments; currently only one main shared memory segment is allocated. A new shared memory API is introduced, extended with a segment as a new parameter. As a path of least resistance, the original API is kept in place, utilizing the main shared memory segment.

Currently the shared memory layout is designed to pack everything tightly together, leaving no space between mappings for resizing. Here is how it looks for one mapping in /proc/$PID/maps, where /dev/zero represents the anonymous shared memory we talk about:

    00400000-00490000         /path/bin/postgres
    ...
    012d9000-0133e000         [heap]
    7f443a800000-7f470a800000 /dev/zero (deleted)
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
    ...

Make the layout more dynamic by splitting every shared memory segment into two parts:

* An anonymous file, which actually contains the shared memory content. Such an anonymous file is created via memfd_create. It lives in memory, behaves like a regular file and is semantically equivalent to anonymous memory allocated via mmap with MAP_ANONYMOUS.

* A reservation mapping, whose size is much larger than the required shared segment size. This mapping is created with the flags PROT_NONE (which makes sure the reserved space is not used) and MAP_NORESERVE (so that the reserved space is not counted against memory limits). The anonymous file is mapped into this reservation mapping.

The resulting layout looks like this:

    00400000-00490000         /path/bin/postgres
    ...
    3f526000-3f590000         rw-p [heap]
    7fbd827fe000-7fbd8bdde000 rw-s /memfd:main (deleted) -- anon file
    7fbd8bdde000-7fbe82800000 ---s /memfd:main (deleted) -- reservation
    7fbe82800000-7fbe90670000 r--p /usr/lib/locale/locale-archive
    7fbe90800000-7fbe90941000 r-xp /usr/lib64/libstdc++.so.6.0.34

To resize a shared memory segment in this layout, it is possible to use ftruncate on the anonymous file, adjusting access permissions on the reserved space as needed.

This approach also does not affect the actual memory usage as reported by the kernel. Here is the output of /proc/$PID/status for the master version with shared_buffers = 128 MB:

    // Peak virtual memory size, which is described as total pages
    // mapped in mm_struct. It corresponds to the mapped reserved space
    // and is the only number that grows with it.
    VmPeak:           2043192 kB
    // Size of memory portions. It contains RssAnon + RssFile + RssShmem
    VmRSS:              22908 kB
    // Size of resident anonymous memory
    RssAnon:              768 kB
    // Size of resident file mappings
    RssFile:            10364 kB
    // Size of resident shmem memory (includes SysV shm, mapping of tmpfs and
    // shared anonymous mappings)
    RssShmem:           11776 kB

Here is the same for the patch when reserving 20GB of space:

    VmPeak:          21255824 kB
    VmRSS:              25020 kB
    RssAnon:              768 kB
    RssFile:            10812 kB
    RssShmem:           13440 kB

Cgroup v2 has no problem with this either. To verify, a new cgroup was created with a memory limit of 256 MB, then PostgreSQL was launched within this cgroup with shared_buffers = 128 MB:

    $ cd /sys/fs/cgroup
    $ mkdir postgres
    $ cd postgres
    $ echo 268435456 > memory.max

    $ echo $MASTER_PID_SHELL > cgroup.procs
    # postgres from the master branch has been successfully launched
    # from that shell
    $ cat memory.current
    17465344 (~16.6 MB)
    # stop postgres

    $ echo $PATCH_PID_SHELL > cgroup.procs
    # postgres from the patch has been successfully launched from that shell
    $ cat memory.current
    20770816 (~19.8 MB)

To control the amount of space reserved, a new GUC max_available_memory is introduced. Ideally it should be based on the maximum available memory, hence the name.

There are also a few unrelated advantages of using anon files:

* We get a file descriptor, which can be used for regular file operations (modification, truncation, you name it).

* The file can be given a name, which improves readability when it comes to process maps.

* By default, Linux will not add file-backed shared mappings to a core dump, making it more convenient to work with them in PostgreSQL: no more huge dumps to process.

The downside is that memfd_create is Linux specific.
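The reserve-then-grow technique described in this commit message can be sketched with the underlying Linux calls. This is a minimal illustration, not the patch's code: ResizableSegment, segment_create and segment_grow are hypothetical names, sizes are assumed to be multiples of the page size, and the sketch maps the anonymous file once over the whole reservation and then opens up its usable prefix with mprotect.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical descriptor of one resizable shared memory segment. */
typedef struct ResizableSegment
{
    int     fd;        /* anonymous file created by memfd_create() */
    char   *base;      /* start of the reserved address range */
    size_t  reserved;  /* size of the reserved address range, in bytes */
    size_t  size;      /* currently usable size, in bytes */
} ResizableSegment;

/* Create the anonymous file and reserve address space for future growth. */
static int
segment_create(ResizableSegment *seg, const char *name,
               size_t size, size_t reserved)
{
    void *addr;

    if (size > reserved)
        return -1;

    seg->fd = memfd_create(name, 0);
    if (seg->fd < 0)
        return -1;
    if (ftruncate(seg->fd, (off_t) size) != 0)
    {
        close(seg->fd);
        return -1;
    }

    /* PROT_NONE keeps the range unusable; MAP_NORESERVE avoids accounting. */
    addr = mmap(NULL, reserved, PROT_NONE,
                MAP_SHARED | MAP_NORESERVE, seg->fd, 0);
    if (addr == MAP_FAILED)
    {
        close(seg->fd);
        return -1;
    }

    /* Open up only the part that is backed by the file right now. */
    if (mprotect(addr, size, PROT_READ | PROT_WRITE) != 0)
    {
        munmap(addr, reserved);
        close(seg->fd);
        return -1;
    }

    seg->base = addr;
    seg->reserved = reserved;
    seg->size = size;
    return 0;
}

/* Grow the usable part in place, staying within the reservation. */
static int
segment_grow(ResizableSegment *seg, size_t new_size)
{
    if (new_size < seg->size || new_size > seg->reserved)
        return -1;
    if (ftruncate(seg->fd, (off_t) new_size) != 0)
        return -1;
    if (mprotect(seg->base, new_size, PROT_READ | PROT_WRITE) != 0)
        return -1;
    seg->size = new_size;
    return 0;
}
```

Because the start address never changes, pointers into the already usable part stay valid across a grow; only the boundary between the readable part and the PROT_NONE part moves.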

Add more shmem segments to split shared buffers into the following chunks:

* BUFFERS_SHMEM_SEGMENT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

The size of the corresponding shared data directly depends on NBuffers, meaning that if we would like to change NBuffers, they have to be resized correspondingly. Placing each of them in a separate shmem segment makes that possible.

Some assumptions are made about the upper size limit of each shmem segment. The buffer blocks have the largest limit, while the rest claim less extra room for resizing. Ideally those limits would be deduced from the maximum allowed shared memory.

shm_total_page_count is used uninitialized. If this variable has a random value to start with, the final sum would be wrong.

Also include pg_shmem.h where shared memory segment macros are used.

Author: Ashutosh Bapat

This function calls many functions which return the amount of shared memory required for different shared memory data structures. Up until now, the returned total of these sizes was used to create a single shared memory segment. But starting with the previous patch, we create multiple shared memory segments, each of which contains one shared memory structure related to shared buffers, and one main memory segment containing the rest of the structures. Since CalculateShmemSize() is called for every shared memory segment, and its return value is added to the memory required for all the shared memory segments, we end up allocating more memory than required.

Instead, CalculateShmemSize() is called only once. Each of its callees is expected to

a. return the size required from the main segment
b. add sizes to the AnonymousMappings corresponding to the other memory segments.

For individual modules to add memory to their respective AnonymousMappings, we need to know the different mappings upfront. Hence ANON_MAPPINGS replaces next_free_segment.

TODOs:

1. This change requires that the AnonymousMappings array and the macros defining identifiers of each of the segments be platform-independent. This patch doesn't achieve that goal for all platforms, for example Windows. We need to fix that.

2. If postgres is invoked with -C shared_memory_size, it reports 0. That's because it reports the GUC values before the shared memory sizes are set in AnonymousMappings. Fix that too.

3. Eliminate this asymmetry in CalculateShmemSize(). See the TODO in the prologue of CalculateShmemSize().

4. This is one way to avoid requesting more memory in each segment. But there may be other ways to design CalculateShmemSize(). Need to think and implement it better.

Author: Ashutosh Bapat
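The calling convention for the callees (return the main-segment share, add the rest to per-segment totals) can be sketched as follows. All names and byte counts here are hypothetical placeholders chosen for illustration; they are not the identifiers or sizes used in the patch.

```c
#include <stddef.h>

/* Hypothetical segment identifiers, known up front. */
enum
{
    MAIN_SEGMENT,
    BUFFERS_SEGMENT,
    DESCRIPTORS_SEGMENT,
    NUM_SEGMENTS
};

/* Per-segment totals, filled in by a single sizing pass. */
static size_t segment_sizes[NUM_SEGMENTS];

/*
 * A module that owns dedicated segments: it adds to those segments and
 * returns only what it needs from the main segment.  The byte counts
 * are illustrative.
 */
static size_t
buffer_module_size(size_t nbuffers)
{
    segment_sizes[BUFFERS_SEGMENT] += nbuffers * 8192;
    segment_sizes[DESCRIPTORS_SEGMENT] += nbuffers * 64;
    return 1024;
}

/* A module that lives entirely in the main segment. */
static size_t
other_module_size(void)
{
    return 4096;
}

/* One pass over all modules; nothing is counted twice. */
static void
calculate_segment_sizes(size_t nbuffers)
{
    size_t main_size = 0;

    for (int i = 0; i < NUM_SEGMENTS; i++)
        segment_sizes[i] = 0;

    main_size += buffer_module_size(nbuffers);
    main_size += other_module_size();
    segment_sizes[MAIN_SEGMENT] = main_size;
}
```

Running the pass once yields every segment's size; running it once per segment and summing the return values is what led to the over-allocation described above.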

Modifies pg_shmem_allocations to report shared memory segment as well. Adds pg_shmem_segments to report shared memory segment information. TODO: This commit should be merged with the earlier commit introducing multiple shared memory segments. Author: Ashutosh Bapat

Add an assign hook for shared_buffers that resizes shared memory using the space introduced in the previous commits, without requiring a PostgreSQL restart. The implementation is based on two mechanisms: a ProcSignalBarrier is used to make sure all processes start the resize procedure simultaneously, and a global Barrier is used to coordinate after that and make sure all finished processes wait for others that are still in progress.

The resize process looks like this:

* The GUC assign hook sets a flag to let the postmaster know that a resize was requested.

* The postmaster checks the flag in the event loop and starts the resize by emitting a ProcSignal barrier.

* All processes that participate in the ProcSignal mechanism begin to process the ProcSignal barrier. First, a process waits until all processes have confirmed they received the message and can start simultaneously.

* Every process recalculates the shared memory size based on the new NBuffers, adjusts its size using ftruncate and adjusts reservation permissions with mprotect. One elected process signals the postmaster to do the same.

* When finished, every process waits on a global ShmemControl barrier until all others are finished as well. This way we ensure three stages with clear boundaries: before the resize, when all processes use the old NBuffers; during the resize, when processes have a mix of old and new NBuffers and wait until it's done; after the resize, when all processes use the new NBuffers.

* After all processes are using the new value, one of them initializes new shared structures (buffer blocks, descriptors, etc.) as needed and broadcasts the new value of NBuffers via ShmemControl in shared memory. Other backends wait for this operation to finish as well. Then the barrier is lifted and everything goes as usual.

Since resizing takes time, we need to take into account that during that time:

- New backends can be spawned. They will check the status of the barrier early during bootstrap, and wait until everything is over to work with the new NBuffers value.

- Old backends can exit before attempting to resize. The synchronization used between backends relies on ProcSignalBarrier and first waits until all participants have received the message, in order to gather all existing backends.

- Some backends might be blocked and not responding either before or after receiving the message. In the first case such a backend still has a ProcSignalSlot and should be waited for; in the second case the shared barrier makes sure we are still waiting for those backends. In either case there is an unbounded wait.

- Backends might join the barrier in disjoint groups with some time in between. That means that relying only on the shared dynamic barrier is not enough -- it will only synchronize the resize procedure within those groups. That's why we wait first for all participants of the ProcSignal mechanism who received the message.

Here is how it looks after raising shared_buffers from 128 MB to 512 MB and calling pg_reload_conf():

    -- 128 MB
    7f87909fc000-7f8798248000 rw-s /memfd:strategy (deleted)
    7f8798248000-7f879d6ca000 ---s /memfd:strategy (deleted)
    7f879d6ca000-7f87a4e84000 rw-s /memfd:checkpoint (deleted)
    7f87a4e84000-7f87aa398000 ---s /memfd:checkpoint (deleted)
    7f87aa398000-7f87b1b42000 rw-s /memfd:iocv (deleted)
    7f87b1b42000-7f87c3d32000 ---s /memfd:iocv (deleted)
    7f87c3d32000-7f87cb59c000 rw-s /memfd:descriptors (deleted)
    7f87cb59c000-7f87dd6cc000 ---s /memfd:descriptors (deleted)
    7f87dd6cc000-7f87ece38000 rw-s /memfd:buffers (deleted)
    ^ buffers content, ~247 MB
    7f87ece38000-7f8877066000 ---s /memfd:buffers (deleted)
    ^ reserved space, ~2210 MB
    7f8877066000-7f887e7d0000 rw-s /memfd:main (deleted)
    7f887e7d0000-7f8890a00000 ---s /memfd:main (deleted)

    -- 512 MB
    7f87909fc000-7f879866a000 rw-s /memfd:strategy (deleted)
    7f879866a000-7f879d6ca000 ---s /memfd:strategy (deleted)
    7f879d6ca000-7f87a50f4000 rw-s /memfd:checkpoint (deleted)
    7f87a50f4000-7f87aa398000 ---s /memfd:checkpoint (deleted)
    7f87aa398000-7f87b1d82000 rw-s /memfd:iocv (deleted)
    7f87b1d82000-7f87c3d32000 ---s /memfd:iocv (deleted)
    7f87c3d32000-7f87cba1c000 rw-s /memfd:descriptors (deleted)
    7f87cba1c000-7f87dd6cc000 ---s /memfd:descriptors (deleted)
    7f87dd6cc000-7f8804fb8000 rw-s /memfd:buffers (deleted)
    ^ buffers content, ~632 MB
    7f8804fb8000-7f8877066000 ---s /memfd:buffers (deleted)
    ^ reserved space, ~1824 MB
    7f8877066000-7f887e950000 rw-s /memfd:main (deleted)
    7f887e950000-7f8890a00000 ---s /memfd:main (deleted)

The implementation supports only increasing shared_buffers. Decreasing the value needs a similar procedure, but the buffer blocks with data have to be drained first, so that the actual data set fits into the new smaller space.

Experiments show that shared mappings have to be extended separately for each process that uses them. Another rough edge is that a backend blocked on ReadCommand will not apply the shared_buffers change until it receives something.

Authors: Dmitrii Dolgov, Ashutosh Bapat
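The sizes annotated in the /proc/$PID/maps listings can be cross-checked from the address ranges themselves. The helper below is a small sketch for that arithmetic; map_range_bytes is a hypothetical name, and only the leading "start-end" field of a maps line is parsed.

```c
#include <stdio.h>

/*
 * Return the length in bytes of the address range at the start of a
 * /proc/PID/maps line ("start-end perms ..."), or 0 if it cannot be
 * parsed.
 */
static unsigned long long
map_range_bytes(const char *line)
{
    unsigned long long start;
    unsigned long long end;

    if (sscanf(line, "%llx-%llx", &start, &end) != 2 || end < start)
        return 0;
    return end - start;
}
```

For the rw-s /memfd:buffers entries this gives 259440640 bytes before the resize and 663666688 bytes after it, which correspond to the ~247 MB and ~632 MB annotations when divided by 1048576.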

The assign_hook for shared_buffers (assign_shared_buffers()) is called twice during server startup. The first time it sets the default value of shared_buffers, and the second time it sets the value specified in the configuration file or on the command line. At those times the shared buffer pool is yet to be initialized. Hence there is no need to keep the GUC change pending or to go through the entire process of resizing memory maps, reinitializing the shared memory and synchronizing processes. Instead the given value should be assigned directly to NBuffers, which will be used when creating the shared memory and also when initializing the buffer pool the first time. Any change to shared_buffers after that requires remapping the shared memory segment and synchronizing buffer pool reinitialization across the backends.

If BufferBlocks is not initialized, assign_shared_buffers() assigns the given value to NBuffers directly. Otherwise it marks the change as pending and sets the flag pending_pm_shmem_resize so that the postmaster can start the buffer pool reinitialization.

TODO:

1. The change depends upon the C convention that global pointer variables are initialized to NULL. Maybe initialize BufferBlocks to NULL explicitly.

2. We might think of a better way to check whether the buffer pool has been initialized or not. See the comment in assign_shared_buffers().

Author: Ashutosh Bapat

… structures

Update totalsize and end address in segment and mapping: Once a shared memory segment has been resized, its total size and end address need to be updated in the corresponding AnonymousMapping and Segment structures.

Update allocated_size for resized shared memory structure: Reallocating the shared memory structure after resizing needs a bit more work. But at least update the allocated_size along with the size of the shared memory structure.

Author: Ashutosh Bapat

Buffer eviction
===============

When shrinking the shared buffer pool, each buffer in the area being shrunk needs to be flushed if it's dirty, so as not to lose the changes to that buffer after shrinking. Also, each such buffer needs to be removed from the buffer mapping table so that backends do not access it after shrinking.

Buffer eviction requires a separate barrier phase for two reasons:

1. No other backend should map a new page to any of the buffers being evicted while eviction is in progress. So they wait while eviction is in progress.

2. Since a pinned buffer has the pin recorded in backend-local memory as well as in the buffer descriptor (which is in shared memory), eviction should not coincide with remapping the shared memory of a backend. Otherwise we might lose consistency between the local and shared pinning records. Hence it needs to be carried out in ProcessBarrierShmemResize() and not in AnonymousShmemResize(), as indicated by the now removed comment.

If a buffer being evicted is pinned, we raise a FATAL error, but this should improve. There are multiple options: 1. wait for the pinned buffer to get unpinned, 2. have the backend killed, or have it cancel the query itself, or 3. roll back the operation. Note that options 1 and 2 would require the pinning-related local and shared records to be accessed. But we would need infrastructure to do either of these, which does not exist right now.

Removing the evicted buffers from buffer ring
=============================================

If the buffer pool has been shrunk, the buffers in the buffer ring may not be valid anymore. Modify GetBufferFromRing to check whether the buffer is still valid before using it. This makes GetBufferFromRing() a bit more expensive because of the additional boolean condition, and it masks any bug that introduces an invalid buffer into the ring. The alternative fix is more complex, as explained below.

The strategy object is created in CurrentMemoryContext and is not available in any global structure, and is thus not accessible when processing buffer resizing barriers. We may modify GetAccessStrategy() to register the strategy in a global linked list and then arrange to deregister it once it's no longer in use. Looking at the places which use GetAccessStrategy(), fixing all of those may be some work.

Author: Ashutosh Bapat
Reviewed-by: Tomas Vondra
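The extra validity check on ring entries can be sketched as below. This is a simplified stand-in, not GetBufferFromRing() itself: BufferRing and ring_next_buffer are hypothetical names, buffer ids are assumed to be 1-based with 0 meaning "no buffer", and clearing a stale slot is a choice made for this sketch only.

```c
#include <stddef.h>

#define RING_MAX        8
#define INVALID_BUFFER  0   /* assumption: ids are 1-based, 0 means "none" */

/* Hypothetical, simplified buffer ring. */
typedef struct BufferRing
{
    size_t length;              /* number of slots in use */
    size_t current;             /* slot handed out most recently */
    int    buffers[RING_MAX];   /* buffer ids, INVALID_BUFFER if empty */
} BufferRing;

/*
 * Advance the ring and return the buffer in the next slot, or
 * INVALID_BUFFER if the slot is empty or refers to a buffer that no
 * longer exists because the pool now has only pool_size buffers.
 */
static int
ring_next_buffer(BufferRing *ring, int pool_size)
{
    int buf;

    ring->current = (ring->current + 1) % ring->length;
    buf = ring->buffers[ring->current];

    if (buf == INVALID_BUFFER)
        return INVALID_BUFFER;

    if (buf > pool_size)
    {
        /* Stale entry left over from before the pool was shrunk. */
        ring->buffers[ring->current] = INVALID_BUFFER;
        return INVALID_BUFFER;
    }

    return buf;
}
```

A caller that gets INVALID_BUFFER back would fall through to the regular allocation path, the same way it would for a ring slot that has not been filled yet.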

... and BgBufferSync and ClockSweepTick adjustments

Reinitializing strategy control area
====================================

The commit introduces a separate function StrategyReInitialize() instead of reusing StrategyInitialize(), since some of the things that the latter does are not required in the former. Here is a list of what StrategyReInitialize() does and how it differs from StrategyInitialize().

1. The StrategyControl pointer needn't be fetched again since it should not change. But an Assert is added to make sure the pointer is valid.

2. &StrategyControl->buffer_strategy_lock need not be initialized again.

3. nextVictimBuffer, completePasses and numBufferAllocs are viewed in the context of NBuffers. Now that NBuffers itself has changed, those three do not make sense. Reset them as if the server had restarted.

Ability to delay resizing operation
===================================

This commit introduces a flag delay_shmem_resize, which PostgreSQL backends and workers can use to signal the coordinator to delay the resizing operation. The background writer sets this flag when it's scanning buffers.

Background writer operation
===========================

The background writer is blocked while the actual resizing is in progress. It stops a scan in progress when it sees that the resizing has begun or is about to begin. Once the buffer resizing is finished, before resuming regular operation, bgwriter resets the information saved so far. This information is viewed in the context of NBuffers and hence does not make sense after resizing, which changes NBuffers.

Buffer lookup table
===================

Right now there is no way to free shared memory. Even if we shrink the buffer lookup table when shrinking the buffer pool, the unused hash table entries cannot be freed. When we expand the buffer pool, more entries can be allocated, but we cannot resize the hash table directory without rehashing all the entries. Just allocating more entries will lead to more contention. Hence we set up the buffer lookup table only once, at the beginning, sized for the maximum possible size of the buffer pool, which is MaxAvailableMemory.

The shared buffer lookup table and StrategyControl are not resized even if the buffer pool is resized, hence they are allocated in the main shared memory segment.

TODO:
====

1. The way BgBufferSync is written today, it packs four functionalities: setting up the buffer sync state, performing the buffer sync, resetting the buffer sync state when bgwriter_lru_maxpages <= 0 and setting it up again after bgwriter_lru_maxpages > 0. That makes the code hard to read. It would be good to divide this function into three or four different functions, each performing one functionality. Then pack all the state (the local variables from that function converted to static global) into a structure, which is passed to these functions. Once that happens, BgBufferSyncReset() will call one of the functions to reset the state when the buffer pool is resized.

2. The condition (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) == NBuffers) checked in BgBufferSync() to check whether buffer resizing is "about to begin" is wrong. NBuffers is not changed until AnonymousShmemResize() is called, and it won't be called unless BgBufferSync() finishes if it has already begun. Need a better condition to check whether buffer resizing is about to begin.

Author: Ashutosh Bapat
Reviewed-by: Tomas Vondra

The commit adds two tests:

1. TAP test to stress test buffer pool resizing under concurrent load.

2. SQL test to test sanity of shared memory allocations and mappings after buffer pool resizing operation.

Author: Palak Chaturvedi
Author: Ashutosh Bapat

This branch was automatically generated by a robot using patches from an email thread registered at: https://commitfest.postgresql.org/patch/5319 The branch will be overwritten each time a new patch version is posted to the thread, and also periodically to check for bitrot caused by changes on the master branch. Patch(es): https://www.postgresql.org/message-id/CAExHW5vB8sAmDtkEN5dcYYeBok3D8eAzMFCOH1k+krxht1yFjA@mail.gmail.com Author(s): Dmitry Dolgov


Commits on Oct 3, 2025
