.. SPDX-License-Identifier: GPL-2.0

=============
Multi-Gen LRU
=============
The multi-gen LRU is an alternative LRU implementation that optimizes
page reclaim and improves performance under memory pressure. Page
reclaim decides the kernel's caching policy and ability to overcommit
memory. It directly impacts the kswapd CPU usage and RAM efficiency.

Quick start
===========
Build the kernel with the following configurations.

* ``CONFIG_LRU_GEN=y``
* ``CONFIG_LRU_GEN_ENABLED=y``

All set!

Runtime options
===============
``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
following subsections.

Kill switch
-----------
``enabled`` accepts different values to enable or disable the
following components. Its default value depends on
``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
unless some of them have unforeseen side effects. Writing to
``enabled`` has no effect when a component is not supported by the
hardware, and valid values will be accepted even when the main switch
is off.

====== ===============================================================
Values Components
====== ===============================================================
0x0001 The main switch for the multi-gen LRU.
0x0002 Clearing the accessed bit in leaf page table entries in large
       batches, when MMU sets it (e.g., on x86). This behavior can
       theoretically worsen lock contention (mmap_lock). If it is
       disabled, the multi-gen LRU will suffer a minor performance
       degradation for workloads that contiguously map hot pages,
       whose accessed bits can be otherwise cleared by fewer larger
       batches.
0x0004 Clearing the accessed bit in non-leaf page table entries as
       well, when MMU sets it (e.g., on x86). This behavior was not
       verified on x86 varieties other than Intel and AMD. If it is
       disabled, the multi-gen LRU will suffer a negligible
       performance degradation.
[yYnN] Apply to all the components above.
====== ===============================================================

E.g.,
::

    echo y >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0007
    echo 5 >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0005
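
The value read back from ``enabled`` is a bit mask of the components in
the table above, so it can be decoded with shell arithmetic. The
following is a minimal sketch; ``lru_gen_decode`` is a hypothetical
helper name used only for illustration, not an existing tool.
::

    # Print one line per component that is set in the given mask,
    # e.g. 0x0005. The descriptions paraphrase the table above.
    lru_gen_decode() {
        mask=$(( $1 ))
        if [ $(( mask & 0x0001 )) -ne 0 ]; then
            echo "main switch"
        fi
        if [ $(( mask & 0x0002 )) -ne 0 ]; then
            echo "leaf accessed-bit clearing in batches"
        fi
        if [ $(( mask & 0x0004 )) -ne 0 ]; then
            echo "non-leaf accessed-bit clearing"
        fi
    }

It could be used as ``lru_gen_decode "$(cat /sys/kernel/mm/lru_gen/enabled)"``.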

Thrashing prevention
--------------------
Personal computers are more sensitive to thrashing because it can
cause janks (lags when rendering UI) and negatively impact user
experience. The multi-gen LRU offers thrashing prevention to the
majority of laptop and desktop users who do not have ``oomd``.

Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
``N`` milliseconds from getting evicted. The OOM killer is triggered
if this working set cannot be kept in memory. In other words, this
option works as an adjustable pressure relief valve, and when open, it
terminates applications that are hopefully not being used.

Based on the average human detectable lag (~100ms), ``N=1000`` usually
eliminates intolerable janks due to thrashing. Larger values like
``N=3000`` make janks less noticeable at the risk of premature OOM
kills.

The default value ``0`` means disabled.
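
As a sketch of how ``min_ttl_ms`` might be set from a script, the
function below validates its argument before writing it. The function
name and the ``LRU_GEN_SYSFS`` override are assumptions made for this
example (the override only exists so the function can be tried on a
scratch directory); writing to the real file requires root.
::

    # Hypothetical override; defaults to the sysfs directory described above.
    LRU_GEN_SYSFS=${LRU_GEN_SYSFS:-/sys/kernel/mm/lru_gen}

    # Protect the working set of the last N milliseconds from eviction.
    # N=0 disables thrashing prevention.
    lru_gen_set_min_ttl() {
        case $1 in
            ''|*[!0-9]*)
                echo "min_ttl_ms must be a non-negative integer" >&2
                return 1
                ;;
        esac
        echo "$1" >"$LRU_GEN_SYSFS/min_ttl_ms"
    }

For instance, ``lru_gen_set_min_ttl 1000`` applies the ``N=1000``
setting discussed above.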

Experimental features
=====================
``/sys/kernel/debug/lru_gen`` accepts commands described in the
following subsections. Multiple command lines are supported, as is
concatenation with the delimiters ``,`` and ``;``.

``/sys/kernel/debug/lru_gen_full`` provides additional stats for
debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
evicted generations in this file.

Working set estimation
----------------------
Working set estimation measures how much memory an application needs
in a given time interval, and it is usually done with little impact on
the performance of the application. E.g., data centers want to
optimize job scheduling (bin packing) to improve memory utilization.
When a new job comes in, the job scheduler needs to find out whether
each server it manages can allocate a certain amount of memory for
this new job before it can pick a candidate. To do so, the job
scheduler needs to estimate the working sets of the existing jobs.

When it is read, ``lru_gen`` returns a histogram of numbers of pages
accessed over different time intervals for each memcg and node.
``MAX_NR_GENS`` decides the number of bins for each histogram. The
histograms are noncumulative.
::

    memcg  memcg_id  memcg_path
       node  node_id
           min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
           ...
           max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages

Each bin contains an estimated number of pages that have been accessed
within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
the former is the largest and that of the latter is the smallest.
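
Because the histograms are noncumulative, the amount of memory not
accessed for at least some interval is the sum of the bins whose
``age_in_ms`` is at least that interval. The sketch below does this
with ``awk``. It assumes that the ``memcg`` and ``node`` lines begin
with those literal keywords, as in the layout shown above, and that
every other line with four fields is a generation line.
``lru_gen_cold_pages`` is a hypothetical name.
::

    # Read lru_gen output on stdin and print "memcg_id node_id pages"
    # for each memcg and node, where pages is the number of anon and
    # file pages in generations that are at least $1 milliseconds old.
    lru_gen_cold_pages() {
        awk -v min_age="$1" '
            # Emit the total for the node being accumulated, if any.
            function flush() {
                if (have)
                    print memcg, node, cold
                have = 0
            }
            $1 == "memcg" { flush(); memcg = $2; next }
            $1 == "node"  { flush(); node = $2; cold = 0; have = 1; next }
            NF >= 4 && $2 + 0 >= min_age + 0 { cold += $3 + $4 }
            END { flush() }
        '
    }

With debugfs mounted, it could be run as
``lru_gen_cold_pages 60000 </sys/kernel/debug/lru_gen`` to count pages
in generations that are at least one minute old.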

Users can write the following command to ``lru_gen`` to create a new
generation ``max_gen_nr+1``:

    ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``

``can_swap`` defaults to the swap setting and, if it is set to ``1``,
it forces the scan of anon pages when swap is off, and vice versa.
``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
employs heuristics to reduce the overhead, which is likely to reduce
the coverage as well.
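
A minimal sketch of issuing this command from a script follows. The
function name and the ``LRU_GEN_DEBUGFS`` override are assumptions for
this example; the override lets the function be tried against a scratch
file. The caller is expected to have read the current ``max_gen_nr``
from ``lru_gen`` beforehand.
::

    # Hypothetical override; defaults to the debugfs file described above.
    LRU_GEN_DEBUGFS=${LRU_GEN_DEBUGFS:-/sys/kernel/debug/lru_gen}

    # Request a new generation for one memcg and node.
    # Arguments: memcg_id node_id max_gen_nr [can_swap [force_scan]]
    lru_gen_age() {
        if [ $# -lt 3 ] || [ $# -gt 5 ]; then
            echo "usage: lru_gen_age memcg_id node_id max_gen_nr [can_swap [force_scan]]" >&2
            return 1
        fi
        printf '%s\n' "+ $*" >"$LRU_GEN_DEBUGFS"
    }

For example, ``lru_gen_age 1 0 3 1 0`` writes ``+ 1 0 3 1 0``, which
asks for anon pages to be scanned and allows the overhead-reducing
heuristics.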

A typical use case is that a job scheduler runs this command at a
certain time interval to create new generations, and it ranks the
servers it manages based on the sizes of their cold pages defined by
this time interval.
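
The ranking step could look like the sketch below. The input format
(one ``server cold_pages`` pair per line, possibly several lines per
server) and the function name are assumptions for illustration; how a
scheduler actually collects these numbers from its servers is outside
the scope of this document.
::

    # Sum the cold page counts per server and print "server total",
    # with the server holding the most cold memory first.
    rank_by_cold_pages() {
        awk '{ total[$1] += $2 }
             END { for (s in total) print s, total[s] }' |
            sort -k2,2nr -k1,1
    }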

Proactive reclaim
-----------------
Proactive reclaim induces page reclaim when there is no memory
pressure. It usually targets cold pages only. E.g., when a new job
comes in, the job scheduler wants to proactively reclaim cold pages on
the server it selected, to improve the chance of successfully landing
this new job.

Users can write the following command to ``lru_gen`` to evict
generations less than or equal to ``min_gen_nr``.

    ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``

``min_gen_nr`` should be less than ``max_gen_nr-1``, since
``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
the active list) and therefore cannot be evicted. ``swappiness``
overrides the default value in ``/proc/sys/vm/swappiness``.
``nr_to_reclaim`` limits the number of pages to evict.
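
The sketch below builds the eviction command and checks the
``min_gen_nr < max_gen_nr-1`` constraint before writing. The function
name, the extra ``max_gen_nr`` argument (used only for the check and
not sent to the kernel) and the ``LRU_GEN_DEBUGFS`` override are
assumptions made for this example.
::

    # Hypothetical override; defaults to the debugfs file described above.
    LRU_GEN_DEBUGFS=${LRU_GEN_DEBUGFS:-/sys/kernel/debug/lru_gen}

    # Evict generations less than or equal to min_gen_nr.
    # Arguments: memcg_id node_id min_gen_nr max_gen_nr
    #            [swappiness [nr_to_reclaim]]
    lru_gen_evict() {
        if [ $# -lt 4 ] || [ $# -gt 6 ]; then
            echo "usage: lru_gen_evict memcg_id node_id min_gen_nr max_gen_nr [swappiness [nr_to_reclaim]]" >&2
            return 1
        fi
        memcg_id=$1; node_id=$2; min_gen=$3; max_gen=$4
        shift 4
        # The two youngest generations are not fully aged.
        if [ "$min_gen" -ge $(( max_gen - 1 )) ]; then
            echo "generation $min_gen is not fully aged" >&2
            return 1
        fi
        cmd="- $memcg_id $node_id $min_gen"
        if [ $# -gt 0 ]; then
            cmd="$cmd $*"
        fi
        printf '%s\n' "$cmd" >"$LRU_GEN_DEBUGFS"
    }

For example, ``lru_gen_evict 1 0 2 5 0 1000`` writes ``- 1 0 2 0 1000``,
while ``lru_gen_evict 1 0 4 5`` is rejected because generation ``4``
is ``max_gen_nr-1``.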

A typical use case is that a job scheduler runs this command before it
tries to land a new job on a server. If it fails to materialize enough
cold pages because of overestimation, it retries on the next server
according to the ranking result obtained from the working set
estimation step. This less forceful approach limits the impact on the
existing jobs.
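
The retry logic can be sketched as a loop over the ranked candidates.
Everything here is hypothetical: ``land_job`` is an illustrative name,
and the command passed to it stands in for whatever a scheduler uses to
reclaim on a server and check that enough memory was freed.
::

    # Read ranked "server ..." lines on stdin, best candidate first.
    # Run the given command with each server name appended until it
    # succeeds, then print that server. Fail if no server qualifies.
    land_job() {
        while read -r server rest; do
            # Keep the command from consuming the candidate list.
            if "$@" "$server" </dev/null; then
                echo "$server"
                return 0
            fi
        done
        return 1
    }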