.. SPDX-License-Identifier: GPL-2.0

=============
Multi-Gen LRU
=============
The multi-gen LRU is an alternative LRU implementation that optimizes
page reclaim and improves performance under memory pressure. Page
reclaim decides the kernel's caching policy and ability to overcommit
memory. It directly impacts the kswapd CPU usage and RAM efficiency.

Quick start
===========
Build the kernel with the following configurations.

* ``CONFIG_LRU_GEN=y``
* ``CONFIG_LRU_GEN_ENABLED=y``

All set!
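
Whether an existing kernel was built with both options can be checked
against its config file. The following is a minimal sketch: the helper
name ``lru_gen_built_in`` is hypothetical, and the config file location
in the usage comment is an assumption that varies by distribution.
::

    # Hypothetical helper: succeed only if both options are set to "y"
    # in the kernel config file given as $1.
    lru_gen_built_in() {
        grep -qx 'CONFIG_LRU_GEN=y' "$1" &&
        grep -qx 'CONFIG_LRU_GEN_ENABLED=y' "$1"
    }

    # Usage (the path is an assumption):
    #   lru_gen_built_in "/boot/config-$(uname -r)" && echo "built in"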

Runtime options
===============
``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
following subsections.

Kill switch
-----------
``enabled`` accepts different values to enable or disable the
following components. Its default value depends on
``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
unless some of them have unforeseen side effects. Writing to
``enabled`` has no effect when a component is not supported by the
hardware, and valid values will be accepted even when the main switch
is off.

====== ===============================================================
Values Components
====== ===============================================================
0x0001 The main switch for the multi-gen LRU.
0x0002 Clearing the accessed bit in leaf page table entries in large
       batches, when MMU sets it (e.g., on x86). This behavior can
       theoretically worsen lock contention (mmap_lock). If it is
       disabled, the multi-gen LRU will suffer a minor performance
       degradation for workloads that contiguously map hot pages,
       whose accessed bits can be otherwise cleared by fewer larger
       batches.
0x0004 Clearing the accessed bit in non-leaf page table entries as
       well, when MMU sets it (e.g., on x86). This behavior was not
       verified on x86 varieties other than Intel and AMD. If it is
       disabled, the multi-gen LRU will suffer a negligible
       performance degradation.
[yYnN] Apply to all the components above.
====== ===============================================================

E.g.,
::

    echo y >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0007
    echo 5 >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0005
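
The value read back is a bitmask of the components in the table above.
As a minimal sketch, it can be decoded as follows. The helper name
``lru_gen_decode`` and the short labels ``main``, ``leaf`` and
``non-leaf`` are made up for illustration and are not part of the ABI.
::

    # Hypothetical helper: print the components enabled in a mask such
    # as the one read from /sys/kernel/mm/lru_gen/enabled.
    lru_gen_decode() {
        mask=$(($1))
        out=
        [ $((mask & 0x0001)) -ne 0 ] && out="$out main"
        [ $((mask & 0x0002)) -ne 0 ] && out="$out leaf"
        [ $((mask & 0x0004)) -ne 0 ] && out="$out non-leaf"
        out=${out# }
        echo "${out:-none}"
    }

For instance, ``lru_gen_decode 0x0005`` prints ``main non-leaf``.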

Thrashing prevention
--------------------
Personal computers are more sensitive to thrashing because it can
cause janks (lags when rendering UI) and negatively impact user
experience. The multi-gen LRU offers thrashing prevention to the
majority of laptop and desktop users who do not have ``oomd``.

Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
``N`` milliseconds from getting evicted. The OOM killer is triggered
if this working set cannot be kept in memory. In other words, this
option works as an adjustable pressure relief valve, and when open, it
terminates applications that are hopefully not being used.

Based on the average human detectable lag (~100ms), ``N=1000`` usually
eliminates intolerable janks due to thrashing. Larger values like
``N=3000`` make janks less noticeable at the risk of premature OOM
kills.

The default value ``0`` means disabled.
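
A minimal sketch of setting this option from a script is shown below.
The helper name ``lru_gen_set_min_ttl`` and the ``LRU_GEN_DIR``
variable are assumptions introduced for illustration. The variable
only lets the helper be pointed at another directory for testing.
::

    # Hypothetical helper: check that $1 is a non-negative integer and
    # write it to min_ttl_ms. LRU_GEN_DIR is an assumed override that
    # defaults to the sysfs directory described above.
    lru_gen_set_min_ttl() {
        case $1 in
        ''|*[!0-9]*)
            echo "min_ttl_ms: not a non-negative integer: $1" >&2
            return 1
            ;;
        esac
        echo "$1" >"${LRU_GEN_DIR:-/sys/kernel/mm/lru_gen}/min_ttl_ms"
    }

    # Usage (as root):
    #   lru_gen_set_min_ttl 1000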

Experimental features
=====================
``/sys/kernel/debug/lru_gen`` accepts commands described in the
following subsections. Multiple command lines are supported, as is
concatenation with the delimiters ``,`` and ``;``.

``/sys/kernel/debug/lru_gen_full`` provides additional stats for
debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
evicted generations in this file.

Working set estimation
----------------------
Working set estimation measures how much memory an application needs
in a given time interval, and it is usually done with little impact on
the performance of the application. E.g., data centers want to
optimize job scheduling (bin packing) to improve memory utilization.
When a new job comes in, the job scheduler needs to find out whether
each server it manages can allocate a certain amount of memory for
this new job before it can pick a candidate. To do so, the job
scheduler needs to estimate the working sets of the existing jobs.

When it is read, ``lru_gen`` returns a histogram of numbers of pages
accessed over different time intervals for each memcg and node.
``MAX_NR_GENS`` decides the number of bins for each histogram. The
histograms are noncumulative.
::

    memcg  memcg_id  memcg_path
       node  node_id
           min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
           ...
           max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages

Each bin contains an estimated number of pages that have been accessed
within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
the former is the largest and that of the latter is the smallest.
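
Because the histograms are noncumulative, the number of cold pages for
a node is the sum of the bins whose ``age_in_ms`` is at or above a
chosen threshold. The following is a minimal sketch that assumes the
input follows the layout shown above. The helper name
``lru_gen_cold_pages`` and its output format are hypothetical.
::

    # Hypothetical helper: read lru_gen-formatted text on stdin and
    # print, per memcg and node, the number of anon plus file pages in
    # generations that are at least $1 milliseconds old.
    lru_gen_cold_pages() {
        awk -v min_age="$1" '
            function emit() {
                if (have)
                    print "memcg", memcg, "node", node, "cold", cold
                have = 0
                cold = 0
            }
            $1 == "memcg" { emit(); memcg = $2; next }
            $1 == "node"  { emit(); node = $2; have = 1; next }
            NF == 4 && $2 >= min_age { cold += $3 + $4 }
            END { emit() }
        '
    }

    # Usage (as root, with debugfs mounted):
    #   lru_gen_cold_pages 1000 </sys/kernel/debug/lru_gen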

Users can write the following command to ``lru_gen`` to create a new
generation ``max_gen_nr+1``:

    ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``

``can_swap`` defaults to the swap setting and, if it is set to ``1``,
it forces the scan of anon pages when swap is off, and vice versa.
``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
employs heuristics to reduce the overhead, which is likely to reduce
the coverage as well.
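
Since the command takes the current ``max_gen_nr``, a script would
typically read it from ``lru_gen`` first. The following is a minimal
sketch that assumes the last bin listed for a node is ``max_gen_nr``,
as in the layout shown earlier. The helper name ``lru_gen_age_cmd`` is
hypothetical, and the optional arguments are left at their defaults.
::

    # Hypothetical helper: read lru_gen-formatted text on stdin and
    # print an aging command for memcg_id $1 and node_id $2, using the
    # last generation listed for that node as max_gen_nr. Fails if the
    # memcg and node pair is not found.
    lru_gen_age_cmd() {
        awk -v m="$1" -v n="$2" '
            $1 == "memcg" { memcg = $2; next }
            $1 == "node"  { node = $2; next }
            NF == 4 && memcg == m && node == n { max = $1; found = 1 }
            END { if (!found) exit 1; print "+", m, n, max }
        '
    }

    # Usage (as root, with debugfs mounted):
    #   cmd=$(lru_gen_age_cmd 1 0 </sys/kernel/debug/lru_gen) &&
    #       echo "$cmd" >/sys/kernel/debug/lru_gen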

A typical use case is that a job scheduler runs this command at a
certain time interval to create new generations, and it ranks the
servers it manages based on the sizes of their cold pages defined by
this time interval.

Proactive reclaim
-----------------
Proactive reclaim induces page reclaim when there is no memory
pressure. It usually targets cold pages only. E.g., when a new job
comes in, the job scheduler wants to proactively reclaim cold pages on
the server it selected, to improve the chance of successfully landing
this new job.

Users can write the following command to ``lru_gen`` to evict
generations less than or equal to ``min_gen_nr``.

    ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``

``min_gen_nr`` should be less than ``max_gen_nr-1``, since
``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
the active list) and therefore cannot be evicted. ``swappiness``
overrides the default value in ``/proc/sys/vm/swappiness``.
``nr_to_reclaim`` limits the number of pages to evict.
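
The constraint on ``min_gen_nr`` can be checked before the command is
written. The following is a minimal sketch. The helper name
``lru_gen_evict_cmd`` and its argument order, which adds ``max_gen_nr``
after ``min_gen_nr`` for the check only, are assumptions made for
illustration.
::

    # Hypothetical helper: print an eviction command, refusing a
    # min_gen_nr that is not less than max_gen_nr-1.
    # Arguments: memcg_id node_id min_gen_nr max_gen_nr
    #            [swappiness [nr_to_reclaim]]
    lru_gen_evict_cmd() {
        [ $# -ge 4 ] || return 1
        memcg=$1 node=$2 min_gen=$3 max_gen=$4
        shift 4
        [ "$min_gen" -lt $((max_gen - 1)) ] || return 1
        # max_gen_nr is used for the check only and is not emitted.
        set -- - "$memcg" "$node" "$min_gen" "$@"
        printf '%s\n' "$*"
    }

    # Usage (as root, with debugfs mounted):
    #   lru_gen_evict_cmd 1 0 3 7 0 100 >/sys/kernel/debug/lru_gen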

A typical use case is that a job scheduler runs this command before it
tries to land a new job on a server. If it fails to materialize enough
cold pages because of overestimation, it retries on the next server
according to the ranking result obtained from the working set
estimation step. This less forceful approach limits the impact on the
existing jobs.