1.. SPDX-License-Identifier: GPL-2.0 2 3============= 4Multi-Gen LRU 5============= 6The multi-gen LRU is an alternative LRU implementation that optimizes 7page reclaim and improves performance under memory pressure. Page 8reclaim decides the kernel's caching policy and ability to overcommit 9memory. It directly impacts the kswapd CPU usage and RAM efficiency. 10 11Design overview 12=============== 13Objectives 14---------- 15The design objectives are: 16 17* Good representation of access recency 18* Try to profit from spatial locality 19* Fast paths to make obvious choices 20* Simple self-correcting heuristics 21 22The representation of access recency is at the core of all LRU 23implementations. In the multi-gen LRU, each generation represents a 24group of pages with similar access recency. Generations establish a 25(time-based) common frame of reference and therefore help make better 26choices, e.g., between different memcgs on a computer or different 27computers in a data center (for job scheduling). 28 29Exploiting spatial locality improves efficiency when gathering the 30accessed bit. A rmap walk targets a single page and does not try to 31profit from discovering a young PTE. A page table walk can sweep all 32the young PTEs in an address space, but the address space can be too 33sparse to make a profit. The key is to optimize both methods and use 34them in combination. 35 36Fast paths reduce code complexity and runtime overhead. Unmapped pages 37do not require TLB flushes; clean pages do not require writeback. 38These facts are only helpful when other conditions, e.g., access 39recency, are similar. With generations as a common frame of reference, 40additional factors stand out. But obvious choices might not be good 41choices; thus self-correction is necessary. 42 43The benefits of simple self-correcting heuristics are self-evident. 44Again, with generations as a common frame of reference, this becomes 45attainable. Specifically, pages in the same generation can be 46categorized based on additional factors, and a feedback loop can 47statistically compare the refault percentages across those categories 48and infer which of them are better choices. 49 50Assumptions 51----------- 52The protection of hot pages and the selection of cold pages are based 53on page access channels and patterns. There are two access channels: 54 55* Accesses through page tables 56* Accesses through file descriptors 57 58The protection of the former channel is by design stronger because: 59 601. The uncertainty in determining the access patterns of the former 61 channel is higher due to the approximation of the accessed bit. 622. The cost of evicting the former channel is higher due to the TLB 63 flushes required and the likelihood of encountering the dirty bit. 643. The penalty of underprotecting the former channel is higher because 65 applications usually do not prepare themselves for major page 66 faults like they do for blocked I/O. E.g., GUI applications 67 commonly use dedicated I/O threads to avoid blocking rendering 68 threads. 69 70There are also two access patterns: 71 72* Accesses exhibiting temporal locality 73* Accesses not exhibiting temporal locality 74 75For the reasons listed above, the former channel is assumed to follow 76the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is 77present, and the latter channel is assumed to follow the latter 78pattern unless outlying refaults have been observed. 79 80Workflow overview 81================= 82Evictable pages are divided into multiple generations for each 83``lruvec``. The youngest generation number is stored in 84``lrugen->max_seq`` for both anon and file types as they are aged on 85an equal footing. The oldest generation numbers are stored in 86``lrugen->min_seq[]`` separately for anon and file types as clean file 87pages can be evicted regardless of swap constraints. These three 88variables are monotonically increasing. 89 90Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` 91bits in order to fit into the gen counter in ``folio->flags``. Each 92truncated generation number is an index to ``lrugen->folios[]``. The 93sliding window technique is used to track at least ``MIN_NR_GENS`` and 94at most ``MAX_NR_GENS`` generations. The gen counter stores a value 95within ``[1, MAX_NR_GENS]`` while a page is on one of 96``lrugen->folios[]``; otherwise it stores zero. 97 98Each generation is divided into multiple tiers. A page accessed ``N`` 99times through file descriptors is in tier ``order_base_2(N)``. Unlike 100generations, tiers do not have dedicated ``lrugen->folios[]``. In 101contrast to moving across generations, which requires the LRU lock, 102moving across tiers only involves atomic operations on 103``folio->flags`` and therefore has a negligible cost. A feedback loop 104modeled after the PID controller monitors refaults over all the tiers 105from anon and file types and decides which tiers from which types to 106evict or protect. 107 108There are two conceptually independent procedures: the aging and the 109eviction. They form a closed-loop system, i.e., the page reclaim. 110 111Aging 112----- 113The aging produces young generations. Given an ``lruvec``, it 114increments ``max_seq`` when ``max_seq-min_seq+1`` approaches 115``MIN_NR_GENS``. The aging promotes hot pages to the youngest 116generation when it finds them accessed through page tables; the 117demotion of cold pages happens consequently when it increments 118``max_seq``. The aging uses page table walks and rmap walks to find 119young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list`` 120and calls ``walk_page_range()`` with each ``mm_struct`` on this list 121to scan PTEs, and after each iteration, it increments ``max_seq``. For 122the latter, when the eviction walks the rmap and finds a young PTE, 123the aging scans the adjacent PTEs. For both, on finding a young PTE, 124the aging clears the accessed bit and updates the gen counter of the 125page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``. 126 127Eviction 128-------- 129The eviction consumes old generations. Given an ``lruvec``, it 130increments ``min_seq`` when ``lrugen->folios[]`` indexed by 131``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to 132evict from, it first compares ``min_seq[]`` to select the older type. 133If both types are equally old, it selects the one whose first tier has 134a lower refault percentage. The first tier contains single-use 135unmapped clean pages, which are the best bet. The eviction sorts a 136page according to its gen counter if the aging has found this page 137accessed through page tables and updated its gen counter. It also 138moves a page to the next generation, i.e., ``min_seq+1``, if this page 139was accessed multiple times through file descriptors and the feedback 140loop has detected outlying refaults from the tier this page is in. To 141this end, the feedback loop uses the first tier as the baseline, for 142the reason stated earlier. 143 144Working set protection 145---------------------- 146Each generation is timestamped at birth. If ``lru_gen_min_ttl`` is 147set, an ``lruvec`` is protected from the eviction when its oldest 148generation was born within ``lru_gen_min_ttl`` milliseconds. In other 149words, it prevents the working set of ``lru_gen_min_ttl`` milliseconds 150from getting evicted. The OOM killer is triggered if this working set 151cannot be kept in memory. 152 153This time-based approach has the following advantages: 154 1551. It is easier to configure because it is agnostic to applications 156 and memory sizes. 1572. It is more reliable because it is directly wired to the OOM killer. 158 159Rmap/PT walk feedback 160--------------------- 161Searching the rmap for PTEs mapping each page on an LRU list (to test 162and clear the accessed bit) can be expensive because pages from 163different VMAs (PA space) are not cache friendly to the rmap (VA 164space). For workloads mostly using mapped pages, searching the rmap 165can incur the highest CPU cost in the reclaim path. 166 167``lru_gen_look_around()`` exploits spatial locality to reduce the 168trips into the rmap. It scans the adjacent PTEs of a young PTE and 169promotes hot pages. If the scan was done cacheline efficiently, it 170adds the PMD entry pointing to the PTE table to the Bloom filter. This 171forms a feedback loop between the eviction and the aging. 172 173Bloom Filters 174------------- 175Bloom filters are a space and memory efficient data structure for set 176membership test, i.e., test if an element is not in the set or may be 177in the set. 178 179In the eviction path, specifically, in ``lru_gen_look_around()``, if a 180PMD has a sufficient number of hot pages, its address is placed in the 181filter. In the aging path, set membership means that the PTE range 182will be scanned for young pages. 183 184Note that Bloom filters are probabilistic on set membership. If a test 185is false positive, the cost is an additional scan of a range of PTEs, 186which may yield hot pages anyway. Parameters of the filter itself can 187control the false positive rate in the limit. 188 189Memcg LRU 190--------- 191An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs, 192since each node and memcg combination has an LRU of folios (see 193``mem_cgroup_lruvec()``). Its goal is to improve the scalability of 194global reclaim, which is critical to system-wide memory overcommit in 195data centers. Note that memcg LRU only applies to global reclaim. 196 197The basic structure of an memcg LRU can be understood by an analogy to 198the active/inactive LRU (of folios): 199 2001. It has the young and the old (generations), i.e., the counterparts 201 to the active and the inactive; 2022. The increment of ``max_seq`` triggers promotion, i.e., the 203 counterpart to activation; 2043. Other events trigger similar operations, e.g., offlining an memcg 205 triggers demotion, i.e., the counterpart to deactivation. 206 207In terms of global reclaim, it has two distinct features: 208 2091. Sharding, which allows each thread to start at a random memcg (in 210 the old generation) and improves parallelism; 2112. Eventual fairness, which allows direct reclaim to bail out at will 212 and reduces latency without affecting fairness over some time. 213 214In terms of traversing memcgs during global reclaim, it improves the 215best-case complexity from O(n) to O(1) and does not affect the 216worst-case complexity O(n). Therefore, on average, it has a sublinear 217complexity. 218 219Summary 220------- 221The multi-gen LRU (of folios) can be disassembled into the following 222parts: 223 224* Generations 225* Rmap walks 226* Page table walks 227* Bloom filters 228* PID controller 229 230The aging and the eviction form a producer-consumer model; 231specifically, the latter drives the former by the sliding window over 232generations. Within the aging, rmap walks drive page table walks by 233inserting hot densely populated page tables to the Bloom filters. 234Within the eviction, the PID controller uses refaults as the feedback 235to select types to evict and tiers to protect. 236