xref: /openbmc/linux/Documentation/mm/multigen_lru.rst (revision b4646da0573fae9dfa2b8f1f10936cb6eedd7230)
1.. SPDX-License-Identifier: GPL-2.0
2
3=============
4Multi-Gen LRU
5=============
6The multi-gen LRU is an alternative LRU implementation that optimizes
7page reclaim and improves performance under memory pressure. Page
8reclaim decides the kernel's caching policy and ability to overcommit
9memory. It directly impacts the kswapd CPU usage and RAM efficiency.
10
11Design overview
12===============
13Objectives
14----------
15The design objectives are:
16
17* Good representation of access recency
18* Try to profit from spatial locality
19* Fast paths to make obvious choices
20* Simple self-correcting heuristics
21
22The representation of access recency is at the core of all LRU
23implementations. In the multi-gen LRU, each generation represents a
24group of pages with similar access recency. Generations establish a
25(time-based) common frame of reference and therefore help make better
26choices, e.g., between different memcgs on a computer or different
27computers in a data center (for job scheduling).
28
29Exploiting spatial locality improves efficiency when gathering the
30accessed bit. A rmap walk targets a single page and does not try to
31profit from discovering a young PTE. A page table walk can sweep all
32the young PTEs in an address space, but the address space can be too
33sparse to make a profit. The key is to optimize both methods and use
34them in combination.
35
36Fast paths reduce code complexity and runtime overhead. Unmapped pages
37do not require TLB flushes; clean pages do not require writeback.
38These facts are only helpful when other conditions, e.g., access
39recency, are similar. With generations as a common frame of reference,
40additional factors stand out. But obvious choices might not be good
41choices; thus self-correction is necessary.
42
43The benefits of simple self-correcting heuristics are self-evident.
44Again, with generations as a common frame of reference, this becomes
45attainable. Specifically, pages in the same generation can be
46categorized based on additional factors, and a feedback loop can
47statistically compare the refault percentages across those categories
48and infer which of them are better choices.
49
50Assumptions
51-----------
52The protection of hot pages and the selection of cold pages are based
53on page access channels and patterns. There are two access channels:
54
55* Accesses through page tables
56* Accesses through file descriptors
57
58The protection of the former channel is by design stronger because:
59
601. The uncertainty in determining the access patterns of the former
61   channel is higher due to the approximation of the accessed bit.
622. The cost of evicting the former channel is higher due to the TLB
63   flushes required and the likelihood of encountering the dirty bit.
643. The penalty of underprotecting the former channel is higher because
65   applications usually do not prepare themselves for major page
66   faults like they do for blocked I/O. E.g., GUI applications
67   commonly use dedicated I/O threads to avoid blocking rendering
68   threads.
69
70There are also two access patterns:
71
72* Accesses exhibiting temporal locality
73* Accesses not exhibiting temporal locality
74
75For the reasons listed above, the former channel is assumed to follow
76the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is
77present, and the latter channel is assumed to follow the latter
78pattern unless outlying refaults have been observed.
79
80Workflow overview
81=================
82Evictable pages are divided into multiple generations for each
83``lruvec``. The youngest generation number is stored in
84``lrugen->max_seq`` for both anon and file types as they are aged on
85an equal footing. The oldest generation numbers are stored in
86``lrugen->min_seq[]`` separately for anon and file types as clean file
87pages can be evicted regardless of swap constraints. These three
88variables are monotonically increasing.
89
90Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
91bits in order to fit into the gen counter in ``folio->flags``. Each
92truncated generation number is an index to ``lrugen->folios[]``. The
93sliding window technique is used to track at least ``MIN_NR_GENS`` and
94at most ``MAX_NR_GENS`` generations. The gen counter stores a value
95within ``[1, MAX_NR_GENS]`` while a page is on one of
96``lrugen->folios[]``; otherwise it stores zero.
97
98Each generation is divided into multiple tiers. A page accessed ``N``
99times through file descriptors is in tier ``order_base_2(N)``. Unlike
100generations, tiers do not have dedicated ``lrugen->folios[]``. In
101contrast to moving across generations, which requires the LRU lock,
102moving across tiers only involves atomic operations on
103``folio->flags`` and therefore has a negligible cost. A feedback loop
104modeled after the PID controller monitors refaults over all the tiers
105from anon and file types and decides which tiers from which types to
106evict or protect. The desired effect is to balance refault percentages
107between anon and file types proportional to the swappiness level.
108
109There are two conceptually independent procedures: the aging and the
110eviction. They form a closed-loop system, i.e., the page reclaim.
111
112Aging
113-----
114The aging produces young generations. Given an ``lruvec``, it
115increments ``max_seq`` when ``max_seq-min_seq+1`` approaches
116``MIN_NR_GENS``. The aging promotes hot pages to the youngest
117generation when it finds them accessed through page tables; the
118demotion of cold pages happens consequently when it increments
119``max_seq``. The aging uses page table walks and rmap walks to find
120young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list``
121and calls ``walk_page_range()`` with each ``mm_struct`` on this list
122to scan PTEs, and after each iteration, it increments ``max_seq``. For
123the latter, when the eviction walks the rmap and finds a young PTE,
124the aging scans the adjacent PTEs. For both, on finding a young PTE,
125the aging clears the accessed bit and updates the gen counter of the
126page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``.
127
128Eviction
129--------
130The eviction consumes old generations. Given an ``lruvec``, it
131increments ``min_seq`` when ``lrugen->folios[]`` indexed by
132``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
133evict from, it first compares ``min_seq[]`` to select the older type.
134If both types are equally old, it selects the one whose first tier has
135a lower refault percentage. The first tier contains single-use
136unmapped clean pages, which are the best bet. The eviction sorts a
137page according to its gen counter if the aging has found this page
138accessed through page tables and updated its gen counter. It also
139moves a page to the next generation, i.e., ``min_seq+1``, if this page
140was accessed multiple times through file descriptors and the feedback
141loop has detected outlying refaults from the tier this page is in. To
142this end, the feedback loop uses the first tier as the baseline, for
143the reason stated earlier.
144
145Working set protection
146----------------------
147Each generation is timestamped at birth. If ``lru_gen_min_ttl`` is
148set, an ``lruvec`` is protected from the eviction when its oldest
149generation was born within ``lru_gen_min_ttl`` milliseconds. In other
150words, it prevents the working set of ``lru_gen_min_ttl`` milliseconds
151from getting evicted. The OOM killer is triggered if this working set
152cannot be kept in memory.
153
154This time-based approach has the following advantages:
155
1561. It is easier to configure because it is agnostic to applications
157   and memory sizes.
1582. It is more reliable because it is directly wired to the OOM killer.
159
160``mm_struct`` list
161------------------
162An ``mm_struct`` list is maintained for each memcg, and an
163``mm_struct`` follows its owner task to the new memcg when this task
164is migrated.
165
166A page table walker iterates ``lruvec_memcg()->mm_list`` and calls
167``walk_page_range()`` with each ``mm_struct`` on this list to scan
168PTEs. When multiple page table walkers iterate the same list, each of
169them gets a unique ``mm_struct``, and therefore they can run in
170parallel.
171
172Page table walkers ignore any misplaced pages, e.g., if an
173``mm_struct`` was migrated, pages left in the previous memcg will be
174ignored when the current memcg is under reclaim. Similarly, page table
175walkers will ignore pages from nodes other than the one under reclaim.
176
177This infrastructure also tracks the usage of ``mm_struct`` between
178context switches so that page table walkers can skip processes that
179have been sleeping since the last iteration.
180
181Rmap/PT walk feedback
182---------------------
183Searching the rmap for PTEs mapping each page on an LRU list (to test
184and clear the accessed bit) can be expensive because pages from
185different VMAs (PA space) are not cache friendly to the rmap (VA
186space). For workloads mostly using mapped pages, searching the rmap
187can incur the highest CPU cost in the reclaim path.
188
189``lru_gen_look_around()`` exploits spatial locality to reduce the
190trips into the rmap. It scans the adjacent PTEs of a young PTE and
191promotes hot pages. If the scan was done cacheline efficiently, it
192adds the PMD entry pointing to the PTE table to the Bloom filter. This
193forms a feedback loop between the eviction and the aging.
194
195Bloom filters
196-------------
197Bloom filters are a space and memory efficient data structure for set
198membership test, i.e., test if an element is not in the set or may be
199in the set.
200
201In the eviction path, specifically, in ``lru_gen_look_around()``, if a
202PMD has a sufficient number of hot pages, its address is placed in the
203filter. In the aging path, set membership means that the PTE range
204will be scanned for young pages.
205
206Note that Bloom filters are probabilistic on set membership. If a test
207is false positive, the cost is an additional scan of a range of PTEs,
208which may yield hot pages anyway. Parameters of the filter itself can
209control the false positive rate in the limit.
210
211PID controller
212--------------
213A feedback loop modeled after the Proportional-Integral-Derivative
214(PID) controller monitors refaults over anon and file types and
215decides which type to evict when both types are available from the
216same generation.
217
218The PID controller uses generations rather than the wall clock as the
219time domain because a CPU can scan pages at different rates under
220varying memory pressure. It calculates a moving average for each new
221generation to avoid being permanently locked in a suboptimal state.
222
223Memcg LRU
224---------
225An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
226since each node and memcg combination has an LRU of folios (see
227``mem_cgroup_lruvec()``). Its goal is to improve the scalability of
228global reclaim, which is critical to system-wide memory overcommit in
229data centers. Note that memcg LRU only applies to global reclaim.
230
231The basic structure of an memcg LRU can be understood by an analogy to
232the active/inactive LRU (of folios):
233
2341. It has the young and the old (generations), i.e., the counterparts
235   to the active and the inactive;
2362. The increment of ``max_seq`` triggers promotion, i.e., the
237   counterpart to activation;
2383. Other events trigger similar operations, e.g., offlining an memcg
239   triggers demotion, i.e., the counterpart to deactivation.
240
241In terms of global reclaim, it has two distinct features:
242
2431. Sharding, which allows each thread to start at a random memcg (in
244   the old generation) and improves parallelism;
2452. Eventual fairness, which allows direct reclaim to bail out at will
246   and reduces latency without affecting fairness over some time.
247
248In terms of traversing memcgs during global reclaim, it improves the
249best-case complexity from O(n) to O(1) and does not affect the
250worst-case complexity O(n). Therefore, on average, it has a sublinear
251complexity.
252
253Summary
254-------
255The multi-gen LRU (of folios) can be disassembled into the following
256parts:
257
258* Generations
259* Rmap walks
260* Page table walks via ``mm_struct`` list
261* Bloom filters for rmap/PT walk feedback
262* PID controller for refault feedback
263
264The aging and the eviction form a producer-consumer model;
265specifically, the latter drives the former by the sliding window over
266generations. Within the aging, rmap walks drive page table walks by
267inserting hot densely populated page tables to the Bloom filters.
268Within the eviction, the PID controller uses refaults as the feedback
269to select types to evict and tiers to protect.
270