Improving hot-page detection and promotion
There are two aspects to the memory-promotion problem: detecting which pages should be moved, and actually migrating them. On the detection side, there are a number of sources for data that can be used to detect hot (frequently accessed) pages; Rao is mostly focused on approaches that involve scanning page tables to see which pages have been accessed recently. He thinks that the current scanning implementation needs an overhaul; it is tied to the NUMA-balancing code, and its operation tends to create latency spikes for applications. Since both the page-table-entry (PTE) scanning and page migration are done in process context, they interrupt user-space execution in potentially disruptive ways.
Additionally, he said, there is an increasing number of ways to detect memory access — new data-temperature sources are becoming available. That suggests a need for a way to gather this data together and act on it in one place. His proposal is a new kernel thread, kmmscand, that would take over this task. It would maintain a list of process address spaces, then perform "A-bit" scanning (clearing the "accessed" bit in the PTEs, then scanning later to see which PTEs have had the bit set, indicating an access in the meantime) on each. The results of the scan are then used to create a list of pages to promote to faster memory.
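As a rough sketch of what one A-bit pass over a process's page tables could look like (using the kernel's pagewalk API; this is an illustration, not code from Rao's kmmscand patches), the scanner clears the accessed bit as it goes, so any PTE found "young" on the next pass must have been touched in the meantime:

```c
/*
 * Minimal sketch of one A-bit scan pass, assuming a kernel context and
 * the pagewalk API; the real kmmscand implementation will differ.
 */
#include <linux/mm.h>
#include <linux/mmap_lock.h>
#include <linux/pagewalk.h>

struct scan_stats {
        unsigned long scanned;  /* PTEs examined in this pass */
        unsigned long young;    /* PTEs whose accessed bit was set */
};

static int abit_pte_entry(pte_t *pte, unsigned long addr,
                          unsigned long next, struct mm_walk *walk)
{
        struct scan_stats *stats = walk->private;

        if (!pte_present(ptep_get(pte)))
                return 0;       /* nothing mapped here */

        stats->scanned++;
        /*
         * Test and clear the accessed (A) bit; a page reported young on
         * the *next* pass must have been touched since this one.
         */
        if (ptep_test_and_clear_young(walk->vma, addr, pte))
                stats->young++; /* candidate for promotion */
        return 0;
}

static const struct mm_walk_ops abit_walk_ops = {
        .pte_entry = abit_pte_entry,
};

static void abit_scan_mm(struct mm_struct *mm, struct scan_stats *stats)
{
        mmap_read_lock(mm);
        walk_page_range(mm, 0, TASK_SIZE, &abit_walk_ops, stats);
        mmap_read_unlock(mm);
}
```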
In Rao's implementation, the migration task is separated into its own thread that runs independently of the scanning thread. The performance impact of the scanning can be regulated by adjusting the rate at which PTEs are scanned or by exempting some processes from scanning altogether. There is still room for improvement, though, he said. In particular, there is always a need for better hot-page detection, and there is a need for throttling and batching of the migration work as well.
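A separate migration thread naturally lends itself to that batching and throttling. The sketch below shows the general producer/consumer shape only; the candidate list, batch size, delay, and promote_batch() helper are all hypothetical rather than taken from the posted patches:

```c
/*
 * Sketch of a migration thread that drains the scanner's candidate
 * list in bounded batches, pausing between batches to keep migration
 * overhead away from the application.  Names and numbers here are
 * hypothetical; the posted patches organize this differently.
 */
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/spinlock.h>

static LIST_HEAD(promote_candidates);   /* filled by the scanning thread */
static DEFINE_SPINLOCK(promote_lock);

#define PROMOTE_BATCH           64      /* pages migrated per wakeup */
#define PROMOTE_DELAY_MS        100     /* throttle between batches */

/* Hypothetical helper: would hand the batch to migrate_pages(). */
static void promote_batch(struct list_head *batch);

static int kpromoted_fn(void *unused)
{
        while (!kthread_should_stop()) {
                LIST_HEAD(batch);
                struct list_head *pos, *tmp;
                int taken = 0;

                /* Move at most one batch off the shared candidate list. */
                spin_lock(&promote_lock);
                list_for_each_safe(pos, tmp, &promote_candidates) {
                        list_move_tail(pos, &batch);
                        if (++taken >= PROMOTE_BATCH)
                                break;
                }
                spin_unlock(&promote_lock);

                if (taken)
                        promote_batch(&batch);

                msleep(PROMOTE_DELAY_MS);       /* crude throttling */
        }
        return 0;
}
```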
A participant asked how often address spaces are scanned; Rao answered that, initially, a scan is performed every two seconds. Over time, the rate is adjusted according to conditions; it can end up anywhere between 0.5 and five seconds.
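The adaptation heuristic itself was not described; purely as an illustration of how an interval could be kept within that 0.5-to-five-second window, a policy might scan more often while accesses are being found and back off otherwise:

```c
/*
 * One plausible, purely illustrative adjustment policy: speed up when a
 * noticeable share of PTEs were found young, slow down otherwise, and
 * clamp to the range mentioned in the session.  Not from the patches.
 */
#define SCAN_MIN_MS      500
#define SCAN_MAX_MS     5000
#define SCAN_INIT_MS    2000

static unsigned int scan_interval_ms = SCAN_INIT_MS;

static void adjust_scan_interval(unsigned long young, unsigned long scanned)
{
        if (scanned && young * 10 > scanned) {  /* >10% young: speed up */
                scan_interval_ms /= 2;
                if (scan_interval_ms < SCAN_MIN_MS)
                        scan_interval_ms = SCAN_MIN_MS;
        } else {                                /* mostly cold: back off */
                scan_interval_ms *= 2;
                if (scan_interval_ms > SCAN_MAX_MS)
                        scan_interval_ms = SCAN_MAX_MS;
        }
}
```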
Another participant said that effective scanning requires flushing the translation lookaside buffer (TLB); otherwise, cached TLB entries let translations bypass the page-table walk that sets the A bit in the first place. TLB flushes, of course, can hurt performance. Rao answered that the current implementation promotes pages the first time an access is observed, so missing data on subsequent accesses is not a problem. SeongJae Park said that DAMON, which also performs this sort of scanning, is not currently doing TLB flushes. That drew another question about whether this scanning should just be integrated into DAMON rather than being implemented as a separate thread; Rao answered that integrating the systems is something he will be looking into. Davidlohr Bueso pointed out that enterprise distributions do not normally enable DAMON, but Park answered that Red Hat and Debian are indeed enabling it.
Gregory Price said that the scanning thread will have to hold locks while passing over the page tables, and asked whether that has been observed to cause latencies for the processes being scanned. Rao said that the overall result of his implementation is a significant reduction in latencies experienced by applications, but did not address the locking issue specifically. Jonathan Cameron asked how the scanning of huge pages was handled; Rao said that the thread is just scanning PTEs and not doing anything special for huge pages.
Another participant worried about the cost of scanning especially large address spaces and asked if the overhead had been measured. Rao answered that there is still a need to understand the full cost of A-bit scanning. If it turns out to be a problem, there are optimizations, such as scanning at the PMD level first before dropping down to the PTE level in hot areas, that can be considered. There have been some experiments done with 64GB address spaces, he said, that have shown an improvement over current kernels.
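A sketch of that PMD-first optimization, assuming the pagewalk API and hardware that maintains accessed bits in non-leaf page-table entries (as x86 can; see arch_has_hw_nonleaf_pmd_young()), might look like the following; again, this is an illustration rather than code from the patches:

```c
#include <linux/mm.h>
#include <linux/pagewalk.h>

/*
 * Pre-filter at the PMD level: if nothing under this PMD has been
 * touched since its accessed bit was last cleared, skip the 512 PTEs
 * below it entirely rather than scanning each one.
 */
static int abit_pmd_entry(pmd_t *pmd, unsigned long addr,
                          unsigned long next, struct mm_walk *walk)
{
        if (!pmdp_test_and_clear_young(walk->vma, addr, pmd)) {
                /* Cold region: do not descend to the PTE level. */
                walk->action = ACTION_CONTINUE;
        }
        return 0;
}

static const struct mm_walk_ops abit_pmd_first_ops = {
        .pmd_entry = abit_pmd_entry,
        .pte_entry = abit_pte_entry,    /* the PTE scan from the earlier sketch */
};
```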
A remote participant pointed out that the multi-generational LRU also optimizes its scanning by looking at higher-level page tables first; among other things, that helps it to avoid scanning unmapped memory. Cameron pointed out that scanning the higher-level tables will only be useful if the TLB is regularly flushed. But Yuanchu Xie said that the multi-generational LRU scanning seems to be optimized well, and is integrated with the reclaim system as well. Rao agreed that this avenue needs more exploration.
At the end of the session, Rao raised one last problem: while A-bit scanning can identify hot pages, it provides no information about which NUMA nodes have accessed any given page. As a result, there is no way to know which node a hot page should be promoted to. There needs to be a way to maintain home-node information for pages, he said. Alternatively, the system could scan pages in the fastest tiers to see where the pages in a given address space are clustered now, then promote other pages to the same nodes. The group seemed to think that this could be an interesting heuristic to explore.
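That clustering heuristic could even be prototyped from user space: the move_pages() system call reports each page's current node when passed a NULL node array, so a tool could pick the fast-tier node that already holds most of a process's pages and promote candidates there. The sketch below only illustrates the idea; the fast_nodes[] list and the helper's name are assumptions, not anything presented in the session:

```c
/*
 * User-space illustration of the "promote to wherever this address
 * space's pages already cluster" heuristic, using move_pages(2).
 * Which nodes count as fast is assumed to be known (fast_nodes[]).
 */
#include <numaif.h>             /* move_pages(); link with -lnuma */
#include <stdio.h>
#include <stdlib.h>

#define MAX_NODES 64

/* Return the fast node holding the most of the given pages, or -1. */
static int pick_target_node(void **pages, unsigned long count,
                            const int *fast_nodes, int nr_fast)
{
        int *status = calloc(count, sizeof(*status));
        long counts[MAX_NODES] = { 0 };
        int best = -1, i;
        unsigned long j;

        if (!status)
                return -1;

        /* With a NULL node array, move_pages() only queries placement. */
        if (move_pages(0 /* self */, count, pages, NULL, status, 0) < 0) {
                perror("move_pages");
                free(status);
                return -1;
        }

        for (j = 0; j < count; j++)
                if (status[j] >= 0 && status[j] < MAX_NODES)
                        counts[status[j]]++;    /* tally current placement */

        for (i = 0; i < nr_fast; i++)
                if (best < 0 || counts[fast_nodes[i]] > counts[best])
                        best = fast_nodes[i];

        free(status);
        return best;    /* promote hot pages from slower tiers to this node */
}
```

Actually promoting the chosen pages would then be a second move_pages() call with the target node filled in for each entry.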
Rao has posted the slides from this session.
| Index entries for this article | |
|---|---|
| Kernel | Memory management/Tiered-memory systems |
| Conference | Storage, Filesystem, Memory-Management and BPF Summit/2025 |
DAMON enabled Linux kernels
Posted Apr 9, 2025 14:45 UTC (Wed) by sjpark (subscriber, #87716)

I mistakenly mentioned RedHat instead of CentOS and Fedora. I am using a tool from Oracle[1] to get the list of kernels that enabled DAMON.

[1] https://oracle.github.io/kconfigs/?config=UTS_RELEASE&...
DAMON enabled Linux kernels
Posted Apr 9, 2025 15:08 UTC (Wed) by rahulsundaram (subscriber, #21946)

Is the source for generating this public?