.. SPDX-License-Identifier: GPL-2.0

=============
Multi-Gen LRU
=============
The multi-gen LRU is an alternative LRU implementation that optimizes
page reclaim and improves performance under memory pressure. Page
reclaim decides the kernel's caching policy and ability to overcommit
memory. It directly impacts the kswapd CPU usage and RAM efficiency.

Quick start
===========
Build the kernel with the following configurations.

* ``CONFIG_LRU_GEN=y``
* ``CONFIG_LRU_GEN_ENABLED=y``

All set!

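The following is a minimal sketch, not an official tool, for checking
that the running kernel exposes the multi-gen LRU and that its main
switch (bit ``0x0001`` of ``enabled``) is on. The function name
``lru_gen_check`` and the ``LRU_GEN_SYSFS`` override are hypothetical
and exist only to make the sketch testable.
::

    # Report "unavailable", "on" or "off" for the multi-gen LRU.
    # LRU_GEN_SYSFS is a hypothetical override, not a kernel knob.
    lru_gen_check() {
        dir="${LRU_GEN_SYSFS:-/sys/kernel/mm/lru_gen}"
        if [ ! -r "$dir/enabled" ]; then
            echo "unavailable"
            return 1
        fi
        val=$(cat "$dir/enabled")
        # Bit 0x0001 is the main switch.
        if [ $(( $val & 1 )) -eq 1 ]; then
            echo "on"
        else
            echo "off"
        fi
    }
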
Runtime options
===============
``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
following subsections.

Kill switch
-----------
``enabled`` accepts different values to enable or disable the
following components. Its default value depends on
``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
unless some of them have unforeseen side effects. Writing to
``enabled`` has no effect when a component is not supported by the
hardware, and valid values will be accepted even when the main switch
is off.

====== ===============================================================
Values Components
====== ===============================================================
0x0001 The main switch for the multi-gen LRU.
0x0002 Clearing the accessed bit in leaf page table entries in large
       batches, when MMU sets it (e.g., on x86). This behavior can
       theoretically worsen lock contention (mmap_lock). If it is
       disabled, the multi-gen LRU will suffer a minor performance
       degradation for workloads that contiguously map hot pages,
       whose accessed bits can be otherwise cleared by fewer larger
       batches.
0x0004 Clearing the accessed bit in non-leaf page table entries as
       well, when MMU sets it (e.g., on x86). This behavior was not
       verified on x86 varieties other than Intel and AMD. If it is
       disabled, the multi-gen LRU will suffer a negligible
       performance degradation.
[yYnN] Apply to all the components above.
====== ===============================================================

E.g.,
::

    echo y >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0007
    echo 5 >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0005

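To read a value such as ``0x0005`` back as a list of components, the
bitmask from the table above can be decoded as in the sketch below.
The function name ``decode_lru_gen_enabled`` and the short labels it
prints are illustrative assumptions, not kernel output.
::

    # Print one line per component whose bit is set in the given value.
    decode_lru_gen_enabled() {
        val=$(( $1 ))
        [ $(( val & 0x0001 )) -ne 0 ] && echo "main switch"
        [ $(( val & 0x0002 )) -ne 0 ] && echo "leaf accessed bit batching"
        [ $(( val & 0x0004 )) -ne 0 ] && echo "non-leaf accessed bit clearing"
        return 0
    }

For instance, ``decode_lru_gen_enabled "$(cat
/sys/kernel/mm/lru_gen/enabled)"`` lists the components currently
reported by the kernel.
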
Thrashing prevention
--------------------
Personal computers are more sensitive to thrashing because it can
cause janks (lags when rendering UI) and negatively impact user
experience. The multi-gen LRU offers thrashing prevention to the
majority of laptop and desktop users who do not have ``oomd``.

Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
``N`` milliseconds from getting evicted. The OOM killer is triggered
if this working set cannot be kept in memory. In other words, this
option works as an adjustable pressure relief valve, and when open, it
terminates applications that are hopefully not being used.

Based on the average human detectable lag (~100ms), ``N=1000`` usually
eliminates intolerable janks due to thrashing. Larger values like
``N=3000`` make janks less noticeable at the risk of premature OOM
kills.

The default value ``0`` means disabled.

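A small wrapper can reject malformed values before they reach
``min_ttl_ms``. This is only a sketch: ``set_min_ttl_ms`` and the
``LRU_GEN_SYSFS`` override are hypothetical names, and writing to the
real file requires root.
::

    # Write a non-negative integer number of milliseconds to min_ttl_ms.
    # LRU_GEN_SYSFS is a hypothetical override, not a kernel knob.
    set_min_ttl_ms() {
        case "$1" in
            ''|*[!0-9]*)
                echo "usage: set_min_ttl_ms <milliseconds>" >&2
                return 1
                ;;
        esac
        echo "$1" >"${LRU_GEN_SYSFS:-/sys/kernel/mm/lru_gen}/min_ttl_ms"
    }

E.g., ``set_min_ttl_ms 1000`` protects the working set of the last
second, and ``set_min_ttl_ms 0`` disables the feature again.
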
Experimental features
=====================
``/sys/kernel/debug/lru_gen`` accepts commands described in the
following subsections. Multiple command lines are supported, as is
concatenation with delimiters ``,`` and ``;``.

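As a sketch of the concatenation form, the helper below joins several
commands with ``;`` so that they can be submitted in a single write.
``join_lru_gen_cmds`` is a hypothetical name, and the commands passed
to it follow the formats given in the following subsections.
::

    # Join all arguments into one string separated by ';'.
    join_lru_gen_cmds() {
        out=""
        for cmd in "$@"; do
            out="${out:+$out;}$cmd"
        done
        echo "$out"
    }

E.g., with made-up IDs, ``join_lru_gen_cmds "+ 1 0 3" "- 1 0 1"``
prints ``+ 1 0 3;- 1 0 1``, which can then be redirected to
``/sys/kernel/debug/lru_gen``.
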
``/sys/kernel/debug/lru_gen_full`` provides additional stats for
debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
evicted generations in this file.

Working set estimation
----------------------
Working set estimation measures how much memory an application needs
in a given time interval, and it is usually done with little impact on
the performance of the application. E.g., data centers want to
optimize job scheduling (bin packing) to improve memory utilization.
When a new job comes in, the job scheduler needs to find out whether
each server it manages can allocate a certain amount of memory for
this new job before it can pick a candidate. To do so, the job
scheduler needs to estimate the working sets of the existing jobs.

When it is read, ``lru_gen`` returns a histogram of numbers of pages
accessed over different time intervals for each memcg and node.
``MAX_NR_GENS`` decides the number of bins for each histogram. The
histograms are noncumulative.
::

    memcg  memcg_id  memcg_path
       node  node_id
           min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
           ...
           max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages

Each bin contains an estimated number of pages that have been accessed
within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
the former is the largest and that of the latter is the smallest.

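Because the bins are noncumulative, the number of cold pages for a
given age threshold is the sum of the bins at least that old. The
sketch below does this with ``awk``, assuming the layout shown above
with whitespace-separated fields. ``cold_pages`` is a hypothetical
helper name.
::

    # Read the lru_gen histogram on stdin and print, for each memcg and
    # node, "memcg_id node_id pages", where pages is the number of anon
    # plus file pages in generations whose age_in_ms is >= $1.
    cold_pages() {
        awk -v min_age_ms="$1" '
            $1 == "memcg" { memcg = $2; next }
            $1 == "node"  { node = $2; seen[memcg " " node] = 1; next }
            NF >= 4 && $2 + 0 >= min_age_ms + 0 {
                cold[memcg " " node] += $3 + $4
            }
            END { for (k in seen) print k, cold[k] + 0 }
        ' | sort -k1,1n -k2,2n
    }

E.g., ``cold_pages 10000 </sys/kernel/debug/lru_gen`` reports the pages
not accessed within the last 10 seconds. Reading the file requires
root and a mounted debugfs.
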
Users can write the following command to ``lru_gen`` to create a new
generation ``max_gen_nr+1``:

    ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``

``can_swap`` defaults to the swap setting and, if it is set to ``1``,
it forces the scan of anon pages when swap is off, and vice versa.
``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
employs heuristics to reduce the overhead, which is likely to reduce
the coverage as well.

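Since the command must name the current ``max_gen_nr``, a script would
typically read it from the histogram first. The sketch below builds
the command from the last bin listed for one memcg and node, assuming
the layout shown earlier. ``age_cmd`` is a hypothetical helper name,
and the optional ``can_swap`` and ``force_scan`` arguments are left at
their defaults.
::

    # Read the lru_gen histogram on stdin and print the aging command
    # for memcg_id $1 and node_id $2. Fails if the pair is not found.
    age_cmd() {
        awk -v want_memcg="$1" -v want_node="$2" '
            $1 == "memcg" { memcg = $2; next }
            $1 == "node"  { node = $2; next }
            # Bins are listed from min_gen_nr to max_gen_nr, so the
            # last matching line holds max_gen_nr.
            NF >= 4 && memcg == want_memcg && node == want_node {
                max_gen = $1; found = 1
            }
            END {
                if (found) print "+", want_memcg, want_node, max_gen
                else exit 1
            }
        '
    }

E.g., with made-up IDs::

    cmd=$(age_cmd 1 0 </sys/kernel/debug/lru_gen) &&
        echo "$cmd" >/sys/kernel/debug/lru_gen
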
A typical use case is that a job scheduler runs this command at a
certain time interval to create new generations, and it ranks the
servers it manages based on the sizes of their cold pages defined by
this time interval.

Proactive reclaim
-----------------
Proactive reclaim induces page reclaim when there is no memory
pressure. It usually targets cold pages only. E.g., when a new job
comes in, the job scheduler wants to proactively reclaim cold pages on
the server it selected, to improve the chance of successfully landing
this new job.

Users can write the following command to ``lru_gen`` to evict
generations less than or equal to ``min_gen_nr``.

    ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``

``min_gen_nr`` should be less than ``max_gen_nr-1``, since
``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
the active list) and therefore cannot be evicted. ``swappiness``
overrides the default value in ``/proc/sys/vm/swappiness``.
``nr_to_reclaim`` limits the number of pages to evict.

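The constraint on ``min_gen_nr`` can be checked before the command is
written. The sketch below does so; ``evict_cmd`` is a hypothetical
helper name, and it takes the current ``max_gen_nr`` as an extra
argument that is used only for the check and is not part of the
emitted command.
::

    # Usage: evict_cmd memcg_id node_id min_gen_nr max_gen_nr \
    #                  [swappiness [nr_to_reclaim]]
    # Prints the eviction command, or fails if min_gen_nr is not less
    # than max_gen_nr-1.
    evict_cmd() {
        [ "$#" -ge 4 ] && [ "$#" -le 6 ] || return 1
        memcg=$1 node=$2 min_gen=$3 max_gen=$4
        shift 4
        [ "$min_gen" -lt $(( max_gen - 1 )) ] || return 1
        # Remaining arguments, if any, are swappiness and nr_to_reclaim.
        echo "- $memcg $node $min_gen${*:+ $*}"
    }

E.g., with made-up IDs, ``evict_cmd 1 0 1 3 50 1000`` prints
``- 1 0 1 50 1000``, whereas ``evict_cmd 1 0 2 3`` fails because
generation ``2`` is ``max_gen_nr-1``.
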
A typical use case is that a job scheduler runs this command before it
tries to land a new job on a server. If it fails to materialize enough
cold pages because of the overestimation, it retries on the next
server according to the ranking result obtained from the working set
estimation step. This less forceful approach limits the impact on the
existing jobs.