.. SPDX-License-Identifier: GPL-2.0

=============
Multi-Gen LRU
=============
The multi-gen LRU is an alternative LRU implementation that optimizes
page reclaim and improves performance under memory pressure. Page
reclaim decides the kernel's caching policy and ability to overcommit
memory. It directly impacts the kswapd CPU usage and RAM efficiency.

Quick start
===========
Build the kernel with the following configurations.

* ``CONFIG_LRU_GEN=y``
* ``CONFIG_LRU_GEN_ENABLED=y``

All set!

Runtime options
===============
``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
following subsections.

Kill switch
-----------
``enabled`` accepts different values to enable or disable the
following components. Its default value depends on
``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
unless some of them have unforeseen side effects. Writing to
``enabled`` has no effect when a component is not supported by the
hardware, and valid values will be accepted even when the main switch
is off.

====== ===============================================================
Values Components
====== ===============================================================
0x0001 The main switch for the multi-gen LRU.
0x0002 Clearing the accessed bit in leaf page table entries in large
       batches, when MMU sets it (e.g., on x86). This behavior can
       theoretically worsen lock contention (mmap_lock). If it is
       disabled, the multi-gen LRU will suffer a minor performance
       degradation for workloads that contiguously map hot pages,
       whose accessed bits can be otherwise cleared by fewer larger
       batches.
0x0004 Clearing the accessed bit in non-leaf page table entries as
       well, when MMU sets it (e.g., on x86). This behavior was not
       verified on x86 varieties other than Intel and AMD. If it is
       disabled, the multi-gen LRU will suffer a negligible
       performance degradation.
[yYnN] Apply to all the components above.
====== ===============================================================

E.g.,
::

    echo y >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0007
    echo 5 >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0005

Thrashing prevention
--------------------
Personal computers are more sensitive to thrashing because it can
cause janks (lags when rendering UI) and negatively impact user
experience. The multi-gen LRU offers thrashing prevention to the
majority of laptop and desktop users who do not have ``oomd``.

Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
``N`` milliseconds from getting evicted. The OOM killer is triggered
if this working set cannot be kept in memory. In other words, this
option works as an adjustable pressure relief valve, and when open, it
terminates applications that are hopefully not being used.

Based on the average human detectable lag (~100ms), ``N=1000`` usually
eliminates intolerable janks due to thrashing. Larger values like
``N=3000`` make janks less noticeable at the risk of premature OOM
kills.

The default value ``0`` means disabled.

Experimental features
=====================
``/sys/kernel/debug/lru_gen`` accepts commands described in the
following subsections. Multiple command lines are supported, as is
concatenation with delimiters ``,`` and ``;``.

``/sys/kernel/debug/lru_gen_full`` provides additional stats for
debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
evicted generations in this file.

Working set estimation
----------------------
Working set estimation measures how much memory an application needs
in a given time interval, and it is usually done with little impact on
the performance of the application. E.g., data centers want to
optimize job scheduling (bin packing) to improve memory utilization.
When a new job comes in, the job scheduler needs to find out whether
each server it manages can allocate a certain amount of memory for
this new job before it can pick a candidate. To do so, the job
scheduler needs to estimate the working sets of the existing jobs.

When it is read, ``lru_gen`` returns a histogram of numbers of pages
accessed over different time intervals for each memcg and node.
``MAX_NR_GENS`` decides the number of bins for each histogram. The
histograms are noncumulative.
::

    memcg  memcg_id  memcg_path
       node  node_id
           min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
           ...
           max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages

Each bin contains an estimated number of pages that have been accessed
within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
the former is the largest and that of the latter is the smallest.

Users can write the following command to ``lru_gen`` to create a new
generation ``max_gen_nr+1``:

    ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``

``can_swap`` defaults to the swap setting and, if it is set to ``1``,
it forces the scan of anon pages when swap is off, and vice versa.
``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
employs heuristics to reduce the overhead, which is likely to reduce
the coverage as well.

A typical use case is that a job scheduler runs this command at a
certain time interval to create new generations, and it ranks the
servers it manages based on the sizes of their cold pages defined by
this time interval.

Proactive reclaim
-----------------
Proactive reclaim induces page reclaim when there is no memory
pressure. It usually targets cold pages only. E.g., when a new job
comes in, the job scheduler wants to proactively reclaim cold pages on
the server it selected, to improve the chance of successfully landing
this new job.

Users can write the following command to ``lru_gen`` to evict
generations less than or equal to ``min_gen_nr``.

    ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``

``min_gen_nr`` should be less than ``max_gen_nr-1``, since
``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
the active list) and therefore cannot be evicted. ``swappiness``
overrides the default value in ``/proc/sys/vm/swappiness``.
``nr_to_reclaim`` limits the number of pages to evict.

A typical use case is that a job scheduler runs this command before it
tries to land a new job on a server. If it fails to materialize enough
cold pages because of overestimation, it retries on the next
server according to the ranking result obtained from the working set
estimation step. This less forceful approach limits the impact on the
existing jobs.
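
The two debugfs commands and the histogram layout described above can
be combined in a small userspace helper. The following is a minimal
sketch in Python, not part of the kernel or of any existing tool. All
function names (``parse_lru_gen``, ``cold_pages``, ``aging_command``,
``eviction_command``, ``write_command``) are hypothetical, and the
sample numbers are made up. It assumes the plain ``lru_gen`` layout
shown earlier (not ``lru_gen_full``), and it treats generations older
than ``max_gen_nr-1`` as "cold", which is one possible reading of the
eviction rule above.

```python
"""Sketch: parse lru_gen histograms and build aging/eviction commands."""
from dataclasses import dataclass

LRU_GEN_PATH = "/sys/kernel/debug/lru_gen"  # needs root and a mounted debugfs

# Made-up sample in the documented layout; the numbers are illustrative only.
SAMPLE = """\
memcg 1 /
 node 0
 4 60000 100 200
 5 30000 300 400
 6 10000 500 600
 7 1000 700 800
"""


@dataclass
class Generation:
    gen_nr: int
    age_in_ms: int
    nr_anon_pages: int
    nr_file_pages: int


def parse_lru_gen(text):
    """Map (memcg_id, node_id) to its generations, in file order."""
    histograms = {}
    memcg_id = key = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "memcg":
            memcg_id = int(fields[1])
        elif fields[0] == "node":
            key = (memcg_id, int(fields[1]))
            histograms[key] = []
        else:
            histograms[key].append(Generation(*(int(f) for f in fields[:4])))
    return histograms


def cold_pages(generations):
    """Assumption: "cold" means generations older than max_gen_nr-1."""
    max_gen_nr = max(g.gen_nr for g in generations)
    return sum(g.nr_anon_pages + g.nr_file_pages
               for g in generations if g.gen_nr < max_gen_nr - 1)


def _command(op, *args):
    """Join arguments; trailing None values are omitted optional arguments."""
    args = list(args)
    while args and args[-1] is None:
        args.pop()
    if None in args:
        # Optional arguments are positional, so none can be skipped.
        raise ValueError("an optional argument cannot be skipped")
    return " ".join([op] + [str(int(a)) for a in args])


def aging_command(memcg_id, node_id, max_gen_nr,
                  can_swap=None, force_scan=None):
    return _command("+", memcg_id, node_id, max_gen_nr, can_swap, force_scan)


def eviction_command(memcg_id, node_id, min_gen_nr,
                     swappiness=None, nr_to_reclaim=None):
    return _command("-", memcg_id, node_id, min_gen_nr,
                    swappiness, nr_to_reclaim)


def write_command(command, path=LRU_GEN_PATH):
    """Send one command to debugfs; not called in the demo below."""
    with open(path, "w") as f:
        f.write(command)


if __name__ == "__main__":
    (memcg_id, node_id), gens = next(iter(parse_lru_gen(SAMPLE).items()))
    max_gen_nr = max(g.gen_nr for g in gens)
    print(cold_pages(gens))                              # 1000
    print(aging_command(memcg_id, node_id, max_gen_nr))  # + 1 0 7
    # Evict generations <= max_gen_nr-2, without swapping, up to 1000 pages.
    print(eviction_command(memcg_id, node_id, max_gen_nr - 2, 0, 1000))
```

A job scheduler along the lines described above might call
``write_command(aging_command(...))`` at its chosen interval, read
``lru_gen`` back through ``parse_lru_gen()`` to rank servers by
``cold_pages()``, and issue ``eviction_command(...)`` only on the
server it selects. The demo runs on the embedded sample so that it
does not require root or debugfs.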