===============================
Documentation for /proc/sys/vm/
===============================

kernel version 2.6.29

Copyright (c) 1998, 1999,  Rik van Riel <riel@nl.linux.org>

Copyright (c) 2008         Peter W. Morreale <pmorreale@novell.com>

For general info and legal blurb, please look in index.rst.

------------------------------------------------------------------------------

This file contains the documentation for the sysctl files in
/proc/sys/vm and is valid for Linux kernel version 2.6.29.

The files in this directory can be used to tune the operation
of the virtual memory (VM) subsystem of the Linux kernel and
the writeout of dirty data to disk.

Default values and initialization routines for most of these
files can be found in mm/swap.c.

Currently, these files are in /proc/sys/vm:

- admin_reserve_kbytes
- compact_memory
- compaction_proactiveness
- compact_unevictable_allowed
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
- dirty_expire_centisecs
- dirty_ratio
- dirtytime_expire_seconds
- dirty_writeback_centisecs
- drop_caches
- extfrag_threshold
- highmem_is_dirtyable
- hugetlb_shm_group
- laptop_mode
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
- min_slab_ratio
- min_unmapped_ratio
- mmap_min_addr
- mmap_rnd_bits
- mmap_rnd_compat_bits
- nr_hugepages
- nr_hugepages_mempolicy
- nr_overcommit_hugepages
- nr_trim_pages         (only if CONFIG_MMU=n)
- numa_zonelist_order
- oom_dump_tasks
- oom_kill_allocating_task
- overcommit_kbytes
- overcommit_memory
- overcommit_ratio
- page-cluster
- page_lock_unfairness
- panic_on_oom
- percpu_pagelist_high_fraction
- stat_interval
- stat_refresh
- numa_stat
- swappiness
- unprivileged_userfaultfd
- user_reserve_kbytes
- vfs_cache_pressure
- watermark_boost_factor
- watermark_scale_factor
- zone_reclaim_mode


admin_reserve_kbytes
====================

The amount of free memory in the system that should be reserved for users
with the capability cap_sys_admin.

admin_reserve_kbytes defaults to min(3% of free pages, 8MB).

That should provide enough for the admin to log in and kill a process,
if necessary, under the default overcommit 'guess' mode.

Systems running under overcommit 'never' should increase this to account
for the full Virtual Memory Size of programs used to recover. Otherwise,
root may not be able to log in to recover the system.

How do you calculate a minimum useful reserve? Consider the programs needed
to recover the system:

sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

For overcommit 'guess', we can sum resident set sizes (RSS).
On x86_64 this is about 8MB.

For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS.
On x86_64 this is about 128MB.

Changing this takes effect whenever an application requests memory.

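As a rough illustration of the arithmetic above, the following Python sketch
computes the default reserve and a minimum useful reserve for both overcommit
modes. The function names and the sample RSS/VSZ figures are hypothetical, and
the default is computed in kilobytes rather than pages for simplicity.

```python
def default_admin_reserve_kb(free_kb):
    """Default described above: min(3% of free memory, 8MB), in kB."""
    return min(free_kb * 3 // 100, 8 * 1024)


def minimum_useful_reserve_kb(programs, overcommit_never):
    """Estimate a reserve from (rss_kb, vsz_kb) pairs of recovery programs.

    'guess' mode: sum of the RSS values.
    'never' mode: largest VSZ plus the sum of the RSS values.
    """
    total_rss = sum(rss for rss, _vsz in programs)
    if not overcommit_never:
        return total_rss
    return max(vsz for _rss, vsz in programs) + total_rss


# Hypothetical (rss_kb, vsz_kb) figures for sshd, bash and top.
recovery_tools = [(3000, 60000), (2500, 20000), (2000, 15000)]

print(default_admin_reserve_kb(100000))
print(minimum_useful_reserve_kb(recovery_tools, overcommit_never=False))
print(minimum_useful_reserve_kb(recovery_tools, overcommit_never=True))
```

Real figures would have to be measured on the target system, for example
with ps, before choosing a value.
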

compact_memory
==============

Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
all zones are compacted such that free memory is available in contiguous
blocks where possible. This can be important, for example, in the allocation
of huge pages, although processes will also directly compact memory as
required.

compaction_proactiveness
========================

This tunable takes a value in the range [0, 100] with a default value of
20. This tunable determines how aggressively compaction is done in the
background. Writing a nonzero value to this tunable immediately triggers
proactive compaction. Setting it to 0 disables proactive compaction.

Note that compaction has a non-trivial system-wide impact as pages
belonging to different processes are moved around, which could also lead
to latency spikes in unsuspecting applications. The kernel employs
various heuristics to avoid wasting CPU cycles if it detects that
proactive compaction is not being effective.

Be careful when setting it to extreme values like 100, as that may
cause excessive background compaction activity.

compact_unevictable_allowed
===========================

Available only when CONFIG_COMPACTION is set. When set to 1, compaction is
allowed to examine the unevictable lru (mlocked pages) for pages to compact.
This should be used on systems where stalls for minor page faults are an
acceptable trade for large contiguous free memory.  Set to 0 to prevent
compaction from moving pages that are unevictable.  Default value is 1.
On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
to compaction, which would block the task from becoming active until the fault
is resolved.


dirty_background_bytes
======================

Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.

Note:
  dirty_background_bytes is the counterpart of dirty_background_ratio. Only
  one of them may be specified at a time. When one sysctl is written it is
  immediately taken into account to evaluate the dirty memory limits and the
  other appears as 0 when read.


dirty_background_ratio
======================

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which the background kernel
flusher threads will start writing out dirty data.

The total available memory is not equal to total system memory.


dirty_bytes
===========

Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.

Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
specified at a time. When one sysctl is written it is immediately taken into
account to evaluate the dirty memory limits and the other appears as 0 when
read.

Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.


dirty_expire_centisecs
======================

This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads.  It is expressed in hundredths
of a second.  Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.


dirty_ratio
===========

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which a process which is
generating disk writes will itself start writing out dirty data.

The total available memory is not equal to total system memory.

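The relationship between the ratio and bytes forms of the dirty limits can be
sketched as follows. This is a simplified model, not the kernel's code: the
function name is hypothetical, ``available_pages`` stands for free plus
reclaimable pages, and a 4096-byte page size is assumed.

```python
def dirty_threshold_pages(available_pages, ratio=0, nbytes=0, page_size=4096):
    """Return a dirty threshold in pages.

    Exactly one of ratio (percent) or nbytes is expected to be nonzero,
    mirroring the rule that the unused counterpart reads back as 0.
    """
    if nbytes:
        # Round up so that a partial page still counts as a whole page.
        return -(-nbytes // page_size)
    return available_pages * ratio // 100


# Ratio form: 20% of one million available pages.
print(dirty_threshold_pages(1000000, ratio=20))
# Bytes form: the threshold no longer depends on available memory.
print(dirty_threshold_pages(1000000, nbytes=8193))
```

The same arithmetic applies to the background pair, dirty_background_ratio
and dirty_background_bytes.
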

dirtytime_expire_seconds
========================

When a lazytime inode is constantly having its pages dirtied, the inode with
an updated timestamp will never get a chance to be written out.  And, if the
only thing that has happened on the file system is a dirtytime inode caused
by an atime update, a worker will be scheduled to make sure that inode
eventually gets pushed out to disk.  This tunable is used to define when a
dirty inode is old enough to be eligible for writeback by the kernel flusher
threads.  It is also used as the interval at which the dirtytime_writeback
thread wakes up.


dirty_writeback_centisecs
=========================

The kernel flusher threads will periodically wake up and write `old` data
out to disk.  This tunable expresses the interval between those wakeups, in
hundredths of a second.

Setting this to zero disables periodic writeback altogether.


drop_caches
===========

Writing to this will cause the kernel to drop clean caches, as well as
reclaimable slab objects like dentries and inodes.  Once dropped, their
memory becomes free.

To free pagecache::

	echo 1 > /proc/sys/vm/drop_caches

To free reclaimable slab objects (includes dentries and inodes)::

	echo 2 > /proc/sys/vm/drop_caches

To free slab objects and pagecache::

	echo 3 > /proc/sys/vm/drop_caches

This is a non-destructive operation and will not free any dirty objects.
To increase the number of objects freed by this operation, the user may run
`sync` prior to writing to /proc/sys/vm/drop_caches.  This will minimize the
number of dirty objects on the system and create more candidates to be
dropped.

This file is not a means to control the growth of the various kernel caches
(inodes, dentries, pagecache, etc.).  These objects are automatically
reclaimed by the kernel when memory is needed elsewhere on the system.

Use of this file can cause performance problems.  Since it discards cached
objects, it may cost a significant amount of I/O and CPU to recreate the
dropped objects, especially if they were under heavy use.  Because of this,
use outside of a testing or debugging environment is not recommended.

You may see informational messages in your kernel log when this file is
used::

	cat (1234): drop_caches: 3

These are informational only.  They do not mean that anything is wrong
with your system.  To disable them, echo 4 (bit 2) into drop_caches.

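The values written to drop_caches behave as a bit mask. The sketch below
decodes a value into the actions described above; the constant and function
names are hypothetical and exist only for this illustration.

```python
DROP_PAGECACHE = 1  # bit 0: free pagecache
DROP_SLAB = 2       # bit 1: free reclaimable slab objects
DROP_QUIET = 4      # bit 2: disable the informational log messages


def describe_drop_caches(value):
    """Return the list of actions selected by a drop_caches value."""
    actions = []
    if value & DROP_PAGECACHE:
        actions.append("free pagecache")
    if value & DROP_SLAB:
        actions.append("free reclaimable slab objects")
    if value & DROP_QUIET:
        actions.append("disable informational log messages")
    return actions


for value in (1, 2, 3, 4):
    print(value, describe_drop_caches(value))
```
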

extfrag_threshold
=================

This parameter affects whether the kernel will compact memory or perform
direct reclaim to satisfy a high-order allocation. The extfrag/extfrag_index
file in debugfs shows what the fragmentation index for each order is in each
zone in the system. Values tending towards 0 imply allocations would fail due
to lack of memory, values towards 1000 imply failures are due to fragmentation
and -1 implies that the allocation will succeed as long as watermarks are met.

The kernel will not compact memory in a zone if the
fragmentation index is <= extfrag_threshold. The default value is 500.

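The decision described above can be summarised in a few lines of Python. This
is only a sketch of the documented rule; the function name and the returned
strings are hypothetical.

```python
def interpret_fragmentation_index(index, extfrag_threshold=500):
    """Classify a fragmentation index in the range -1..1000."""
    if index == -1:
        return "allocation succeeds if watermarks are met"
    if index <= extfrag_threshold:
        # At or below the threshold the kernel will not compact the zone.
        return "no compaction"
    return "compaction may be attempted"


for index in (-1, 0, 500, 501, 1000):
    print(index, interpret_fragmentation_index(index))
```
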

highmem_is_dirtyable
====================

Available only for systems with CONFIG_HIGHMEM enabled (32-bit systems).

This parameter controls whether the high memory is considered for dirty
writers throttling.  This is not the case by default which means that
only the amount of memory directly visible/usable by the kernel can
be dirtied. As a result, on systems with a large amount of memory and
lowmem basically depleted, writers might be throttled too early and
streaming writes can get very slow.

Changing the value to nonzero would allow more memory to be dirtied
and thus allow writers to write more data which can be flushed to the
storage more effectively. Note this also comes with a risk of premature
OOM killer invocation because some writers (e.g. direct block device writes)
can only use the low memory and they can fill it up with dirty data without
any throttling.


hugetlb_shm_group
=================

hugetlb_shm_group contains the group ID that is allowed to create SysV
shared memory segments using hugetlb pages.


laptop_mode
===========

laptop_mode is a knob that controls "laptop mode". All the things that are
controlled by this knob are discussed in Documentation/admin-guide/laptops/laptop-mode.rst.


legacy_va_layout
================

If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
will use the legacy (2.4) layout for all processes.


lowmem_reserve_ratio
====================

For some specialised workloads on highmem machines it is dangerous for
the kernel to allow process memory to be allocated from the "lowmem"
zone.  This is because that memory could then be pinned via the mlock()
system call, or by unavailability of swapspace.

And on large highmem machines this lack of reclaimable lowmem memory
can be fatal.

So the Linux page allocator has a mechanism which prevents allocations
which *could* use highmem from using too much lowmem.  This means that
a certain amount of lowmem is defended from the possibility of being
captured into pinned user memory.

(The same argument applies to the old 16 megabyte ISA DMA region.  This
mechanism will also defend that region from allocations which could use
highmem or lowmem).

The `lowmem_reserve_ratio` tunable determines how aggressive the kernel is
in defending these lower zones.

If you have a machine which uses highmem or ISA DMA and your
applications are using mlock(), or if you are running with no swap then
you probably should change the lowmem_reserve_ratio setting.

The lowmem_reserve_ratio is an array. You can see its values by reading this
file::

	% cat /proc/sys/vm/lowmem_reserve_ratio
	256     256     32

But these values are not used directly. The kernel calculates the number of
protection pages for each zone from them. These are shown as an array of
protection pages in /proc/zoneinfo like the following. (This is an example
from an x86-64 box.)
Each zone has an array of protection pages like this::

  Node 0, zone      DMA
    pages free     1355
          min      3
          low      3
          high     4
	:
	:
      numa_other   0
          protection: (0, 2004, 2004, 2004)
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    pagesets
      cpu: 0 pcp: 0
          :

These protections are added to the score used to judge whether this zone
should be used for page allocation or should be reclaimed.

In this example, if normal pages (index=2) are requested from this DMA zone and
watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should
not be used because pages_free(1355) is smaller than watermark + protection[2]
(4 + 2004 = 2008). If this protection value is 0, this zone would be used for
the normal page request. If the request is for the DMA zone (index=0),
protection[0] (=0) is used.

zone[i]'s protection[j] is calculated by the following expression::

  (i < j):
    zone[i]->protection[j]
    = (total sums of managed_pages from zone[i+1] to zone[j] on the node)
      / lowmem_reserve_ratio[i];
  (i = j):
     (should not be protected) = 0;
  (i > j):
     (not necessary, but looks 0)

The default values of lowmem_reserve_ratio[i] are

    === ====================================
    256 (if zone[i] means DMA or DMA32 zone)
    32  (others)
    === ====================================

As the expression above shows, they are reciprocals of the ratio.
256 means 1/256. The number of protection pages becomes about 0.39% of the
total managed pages of higher zones on the node.

If you would like to protect more pages, smaller values are effective.
The minimum value is 1 (1/1 -> 100%). A value less than 1 completely
disables protection of the pages.

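The protection expression and the watermark check from the example can be
reproduced with a short Python sketch. The function names, the four-zone
layout and the managed page counts are hypothetical; they were chosen so that
the DMA row matches the ``protection: (0, 2004, 2004, 2004)`` line shown
above. The fourth ratio is likewise an assumption, since the example file
shows only three values.

```python
def protection_table(managed_pages, ratios):
    """Compute protection[i][j] for the zones of one node.

    For i < j: sum of managed pages of zones i+1..j divided by ratios[i].
    For i >= j the entry stays 0. A ratio below 1 disables protection.
    """
    n = len(managed_pages)
    table = [[0] * n for _ in range(n)]
    for i in range(n):
        if ratios[i] < 1:
            continue
        for j in range(i + 1, n):
            table[i][j] = sum(managed_pages[i + 1:j + 1]) // ratios[i]
    return table


def zone_usable(free_pages, watermark, protection_pages):
    """A zone is skipped when free pages do not exceed watermark + protection."""
    return free_pages > watermark + protection_pages


# Hypothetical node: DMA, DMA32, and two empty higher zones.
managed = [3976, 513024, 0, 0]
ratios = [256, 256, 32, 32]
dma_row = protection_table(managed, ratios)[0]
print(dma_row)
print(zone_usable(1355, 4, dma_row[2]))
```

With a protection value of 0 the same zone would pass the check, which
matches the behaviour described for requests that target the DMA zone itself.
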

max_map_count:
==============

This file contains the maximum number of memory map areas a process
may have. Memory map areas are used as a side-effect of calling
malloc, directly by mmap, mprotect, and madvise, and also when loading
shared libraries.

While most applications need fewer than a thousand maps, certain
programs, particularly malloc debuggers, may consume lots of them,
e.g., up to one or two maps per allocation.

The default value is 65530.


memory_failure_early_kill:
==========================

Control how to kill processes when an uncorrected memory error (typically
a 2-bit error in a memory module) that cannot be handled by the kernel is
detected in the background by hardware. In some cases (like the page
still having a valid copy on disk) the kernel will handle the failure
transparently without affecting any applications. But if there is
no other up-to-date copy of the data it will kill processes to prevent any
data corruption from propagating.

1: Kill all processes that have the corrupted and not reloadable page mapped
as soon as the corruption is detected.  Note this is not supported
for a few types of pages, like kernel internally allocated data or
the swap cache, but works for the majority of user pages.

0: Only unmap the corrupted page from all processes and only kill a process
that tries to access it.

The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
handle this if they want to.

This is only active on architectures/platforms with advanced machine
check handling and depends on the hardware capabilities.

Applications can override this setting individually with the PR_MCE_KILL prctl.


memory_failure_recovery
=======================

Enable memory failure recovery (when supported by the platform).

1: Attempt recovery.

0: Always panic on a memory failure.


46657043247SMauro Carvalho Chehabmin_free_kbytes
46757043247SMauro Carvalho Chehab===============
46857043247SMauro Carvalho Chehab
46957043247SMauro Carvalho ChehabThis is used to force the Linux VM to keep a minimum number
47057043247SMauro Carvalho Chehabof kilobytes free.  The VM uses this number to compute a
47157043247SMauro Carvalho Chehabwatermark[WMARK_MIN] value for each lowmem zone in the system.
47257043247SMauro Carvalho ChehabEach lowmem zone gets a number of reserved free pages based
47357043247SMauro Carvalho Chehabproportionally on its size.
47457043247SMauro Carvalho Chehab
47557043247SMauro Carvalho ChehabSome minimal amount of memory is needed to satisfy PF_MEMALLOC
47657043247SMauro Carvalho Chehaballocations; if you set this to lower than 1024KB, your system will
47757043247SMauro Carvalho Chehabbecome subtly broken, and prone to deadlock under high loads.
47857043247SMauro Carvalho Chehab
47957043247SMauro Carvalho ChehabSetting this too high will OOM your machine instantly.
48057043247SMauro Carvalho Chehab
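As a rough illustration of the proportional split described above, the
following Python sketch divides the reserve across lowmem zones by size.
The zone names and sizes, the 4 KB page size and the helper name are
assumptions made for this example; the kernel's real computation has
further details that are not modelled here.

```python
def split_min_free(min_free_kbytes, zone_pages, page_kb=4):
    """Split the reserve across zones in proportion to zone size.

    zone_pages maps a (hypothetical) zone name to its size in pages.
    Returns the number of reserved pages per zone.
    """
    total_pages = sum(zone_pages.values())
    reserve_pages = min_free_kbytes // page_kb
    return {
        name: reserve_pages * pages // total_pages
        for name, pages in zone_pages.items()
    }


# Example: 64 MB reserve, a 3 GB Normal zone and a 1 GB DMA32 zone.
zones = {"Normal": 786432, "DMA32": 262144}
shares = split_min_free(65536, zones)
print(shares)
```

The larger zone receives three quarters of the reserved pages because it
holds three quarters of the lowmem pages.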

min_slab_ratio
==============

This is available only on NUMA kernels.

A percentage of the total pages in each zone.  On zone reclaim
(fallback from the local zone occurs) slabs will be reclaimed if more
than this percentage of pages in a zone are reclaimable slab pages.
This ensures that the slab growth stays under control even in NUMA
systems that rarely perform global reclaim.

The default is 5 percent.

Note that slab reclaim is triggered in a per zone / node fashion.
The process of reclaiming slab memory is currently not node specific
and may not be fast.


min_unmapped_ratio
==================

This is available only on NUMA kernels.

This is a percentage of the total pages in each zone. Zone reclaim will
only occur if more than this percentage of pages are in a state that
zone_reclaim_mode allows to be reclaimed.

If zone_reclaim_mode has the value 4 OR'd in, then the percentage is compared
against all file-backed unmapped pages including swapcache pages and tmpfs
files. Otherwise, only unmapped pages backed by normal files but not tmpfs
files and similar are considered.

The default is 1 percent.

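The threshold check described above can be sketched in Python as follows.
This is only an illustration: the function and parameter names are
hypothetical, and which pages count as reclaimable depends on
zone_reclaim_mode as explained in the text.

```python
def zone_reclaim_allowed(zone_pages, reclaimable_unmapped_pages,
                         min_unmapped_ratio=1):
    """Return True if zone reclaim may run for this zone.

    Zone reclaim only occurs if more than min_unmapped_ratio percent of
    the zone's pages are in a reclaimable (unmapped) state.
    """
    threshold = zone_pages * min_unmapped_ratio // 100
    return reclaimable_unmapped_pages > threshold


# A zone of 1,000,000 pages with the default ratio of 1 percent has a
# threshold of 10,000 pages.
print(zone_reclaim_allowed(1_000_000, 10_001))
print(zone_reclaim_allowed(1_000_000, 10_000))
```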

mmap_min_addr
=============

This file indicates the amount of address space which a user process will
be restricted from mmapping.  Since kernel null dereference bugs could
accidentally operate based on the information in the first couple of pages
of memory, userspace processes should not be allowed to write to them.  By
default this value is set to 0 and no protections will be enforced by the
security module.  Setting this value to something like 64k will allow the
vast majority of applications to work correctly and provide defense in depth
against future potential kernel bugs.


mmap_rnd_bits
=============

This value can be used to select the number of bits to use to
determine the random offset to the base address of vma regions
resulting from mmap allocations on architectures which support
tuning address space randomization.  This value will be bounded
by the architecture's minimum and maximum supported values.

This value can be changed after boot using the
/proc/sys/vm/mmap_rnd_bits tunable.

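To get a feel for what a given number of bits means, the sketch below
computes the size of the range over which the mmap base may be shifted.
It assumes that the random offset is counted in pages and that pages are
4 KB; both are assumptions of this example rather than statements from
this document, and the helper name is hypothetical.

```python
def mmap_random_span_bytes(rnd_bits, page_size=4096):
    """Size of the range covered by a random offset of rnd_bits bits,
    assuming the offset is expressed in units of pages."""
    return (1 << rnd_bits) * page_size


# 8 bits of randomness span 1 MiB, 28 bits span 1 TiB (with 4 KB pages).
print(mmap_random_span_bytes(8))
print(mmap_random_span_bytes(28))
```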

mmap_rnd_compat_bits
====================

This value can be used to select the number of bits to use to
determine the random offset to the base address of vma regions
resulting from mmap allocations for applications run in
compatibility mode on architectures which support tuning address
space randomization.  This value will be bounded by the
architecture's minimum and maximum supported values.

This value can be changed after boot using the
/proc/sys/vm/mmap_rnd_compat_bits tunable.


nr_hugepages
============

Change the minimum size of the hugepage pool.

See Documentation/admin-guide/mm/hugetlbpage.rst


hugetlb_optimize_vmemmap
========================

This knob is not available when the size of 'struct page' (a structure defined
in include/linux/mm_types.h) is not a power of two (an unusual system config
could result in this).

Enable (set to 1) or disable (set to 0) HugeTLB Vmemmap Optimization (HVO).

Once enabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095 pages
per 1GB HugeTLB page), whereas already allocated HugeTLB pages will not be
optimized.  When those optimized HugeTLB pages are freed from the HugeTLB pool
to the buddy allocator, the vmemmap pages representing that range need to be
remapped again and the vmemmap pages discarded earlier need to be reallocated.
If your use case is that HugeTLB pages are allocated 'on the fly' (e.g.
never explicitly allocating HugeTLB pages with 'nr_hugepages' but only setting
'nr_overcommit_hugepages', so that those overcommitted HugeTLB pages are
allocated 'on the fly') instead of being pulled from the HugeTLB pool, you
should weigh the benefits of memory savings against the extra overhead (~2x
slower than before) of allocating or freeing HugeTLB pages between the HugeTLB
pool and the buddy allocator.  Another behavior to note is that if the system
is under heavy memory pressure, it could prevent the user from freeing HugeTLB
pages from the HugeTLB pool to the buddy allocator, since the allocation of
vmemmap pages could fail; you have to retry later if your system encounters
this situation.

Once disabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
buddy allocator will not be optimized, meaning the extra overhead at allocation
time from buddy allocator disappears, whereas already optimized HugeTLB pages
will not be affected.  If you want to make sure there are no optimized HugeTLB
pages, you can set "nr_hugepages" to 0 first and then disable this.  Note that
writing 0 to nr_hugepages will make any "in use" HugeTLB pages become surplus
pages.  So, those surplus pages are still optimized until they are no longer
in use.  You would need to wait for those surplus pages to be released before
there are no optimized pages in the system.

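The figures of 7 and 4095 pages quoted above can be reproduced with a
little arithmetic.  The sketch below assumes 4 KB base pages, a 64-byte
'struct page', and that one vmemmap page is kept per HugeTLB page; these
values and the helper name are assumptions made for the illustration.

```python
def vmemmap_pages_saved(hugepage_bytes, base_page=4096, struct_page=64,
                        kept_pages=1):
    """Number of vmemmap pages freed for one HugeTLB page."""
    # One 'struct page' describes each base page inside the huge page.
    struct_pages = hugepage_bytes // base_page
    # Those structures themselves occupy this many base pages of vmemmap.
    vmemmap_pages = struct_pages * struct_page // base_page
    return vmemmap_pages - kept_pages


print(vmemmap_pages_saved(2 * 1024 ** 2))   # 2MB HugeTLB page
print(vmemmap_pages_saved(1024 ** 3))       # 1GB HugeTLB page
```

Under these assumptions a 2MB HugeTLB page needs 8 vmemmap pages, of
which 7 are freed, and a 1GB HugeTLB page needs 4096, of which 4095 are
freed.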

nr_hugepages_mempolicy
======================

Change the size of the hugepage pool at run-time on a specific
set of NUMA nodes.

See Documentation/admin-guide/mm/hugetlbpage.rst


nr_overcommit_hugepages
=======================

Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.

See Documentation/admin-guide/mm/hugetlbpage.rst


nr_trim_pages
=============

This is available only on NOMMU kernels.

This value adjusts the excess page trimming behaviour of power-of-2 aligned
NOMMU mmap allocations.

A value of 0 disables trimming of allocations entirely, while a value of 1
trims excess pages aggressively. Any value >= 1 acts as the watermark where
trimming of allocations is initiated.

The default value is 1.

See Documentation/admin-guide/mm/nommu-mmap.rst for more information.


numa_zonelist_order
===================

This sysctl is only for NUMA and it is deprecated. Anything but
Node order will fail!

'where the memory is allocated from' is controlled by zonelists.

(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation.
You may be able to read ZONE_DMA as ZONE_DMA32...)

In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows::

  ZONE_NORMAL -> ZONE_DMA

This means that a memory allocation request for GFP_KERNEL will
get memory from ZONE_DMA only when ZONE_NORMAL is not available.

In the NUMA case, you can think of the following 2 types of order.
Assume a 2-node NUMA system; below is the zonelist of Node(0)'s GFP_KERNEL::

  (A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
  (B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA.

Type(A) offers the best locality for processes on Node(0), but ZONE_DMA
will be used before ZONE_NORMAL exhaustion. This increases the possibility of
out-of-memory (OOM) of ZONE_DMA because ZONE_DMA tends to be small.

Type(B) cannot offer the best locality but is more robust against OOM of
the DMA zone.

Type(A) is called "Node" order. Type(B) is "Zone" order.

"Node order" orders the zonelists by node, then by zone within each node.
Specify "[Nn]ode" for node order.

"Zone Order" orders the zonelists by zone type, then by node within each
zone.  Specify "[Zz]one" for zone order.

Specify "[Dd]efault" to request automatic configuration.

On 32-bit, the Normal zone needs to be preserved for allocations accessible
by the kernel, so "zone" order will be selected.

On 64-bit, devices that require DMA32/DMA are relatively rare, so "node"
order will be selected.

Default order is recommended unless this is causing problems for your
system/application.

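The two orderings can be reproduced with a small Python sketch.  It is a
simplified model only: remote nodes are visited in ascending id order
(an assumption of this example), and the function name and data layout
are hypothetical.

```python
def build_zonelist(local_node, nodes, order):
    """Build a fallback list of (node, zone) pairs for local_node.

    nodes maps a node id to its zones, most preferred zone first.
    order is "node" (type A) or "zone" (type B).
    """
    # Local node first, then the remaining nodes in ascending id order.
    node_ids = [local_node] + [n for n in sorted(nodes) if n != local_node]

    if order == "node":
        # Exhaust every zone of a node before moving to the next node.
        return [(n, z) for n in node_ids for z in nodes[n]]

    if order == "zone":
        # Collect zone types in preference order, then walk all nodes
        # for one zone type before falling back to the next zone type.
        zone_types = []
        for n in node_ids:
            for z in nodes[n]:
                if z not in zone_types:
                    zone_types.append(z)
        return [(n, z) for z in zone_types for n in node_ids if z in nodes[n]]

    raise ValueError("order must be 'node' or 'zone'")


# The 2-node system from the text: only Node(0) has a ZONE_DMA.
system = {0: ["ZONE_NORMAL", "ZONE_DMA"], 1: ["ZONE_NORMAL"]}
print(build_zonelist(0, system, "node"))
print(build_zonelist(0, system, "zone"))
```

With this input, "node" yields the type (A) list and "zone" yields the
type (B) list shown above.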

oom_dump_tasks
==============

Enables a system-wide task dump (excluding kernel threads) to be produced
when the kernel performs an OOM kill.  The dump includes such information as
pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
score, and name.  This is helpful to determine why the OOM killer was
invoked, to identify the rogue task that caused it, and to determine why
the OOM killer chose the task it did to kill.

If this is set to zero, this information is suppressed.  On very
large systems with thousands of tasks it may not be feasible to dump
the memory state information for each one.  Such systems should not
be forced to incur a performance penalty in OOM conditions when the
information may not be desired.

If this is set to non-zero, this information is shown whenever the
OOM killer actually kills a memory-hogging task.

The default value is 1 (enabled).


oom_kill_allocating_task
========================

This enables or disables killing the OOM-triggering task in
out-of-memory situations.

If this is set to zero, the OOM killer will scan through the entire
tasklist and select a task based on heuristics to kill.  This normally
selects a rogue memory-hogging task that frees up a large amount of
memory when killed.

If this is set to non-zero, the OOM killer simply kills the task that
triggered the out-of-memory condition.  This avoids the expensive
tasklist scan.

If panic_on_oom is selected, it takes precedence over whatever value
is used in oom_kill_allocating_task.

The default value is 0.


overcommit_kbytes
=================

When overcommit_memory is set to 2, the committed address space is not
permitted to exceed swap plus this amount of physical RAM. See below.

Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
of them may be specified at a time. Setting one disables the other (which
then appears as 0 when read).


overcommit_memory
=================

This value contains a flag that enables memory overcommitment.

When this flag is 0, the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.

When this flag is 1, the kernel pretends there is always enough
memory until it actually runs out.

When this flag is 2, the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
Note that user_reserve_kbytes affects this policy.

This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.

The default value is 0.

See Documentation/mm/overcommit-accounting.rst and
mm/util.c::__vm_enough_memory() for more information.


overcommit_ratio
================

When overcommit_memory is set to 2, the committed address
space is not permitted to exceed swap plus this percentage
of physical RAM.  See above.

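The limit that applies when overcommit_memory is 2 can be sketched as
follows.  This is a simplified model of the rule stated for
overcommit_kbytes and overcommit_ratio; the function name is
hypothetical, and adjustments the kernel makes beyond "swap plus a share
of RAM" (for example for huge pages) are deliberately left out.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=0, overcommit_kbytes=0):
    """Upper bound on committed address space, in kilobytes.

    Only one of overcommit_ratio / overcommit_kbytes is in effect at a
    time; the disabled one reads as 0.
    """
    if overcommit_kbytes:
        allowed_ram_kb = overcommit_kbytes
    else:
        allowed_ram_kb = ram_kb * overcommit_ratio // 100
    return swap_kb + allowed_ram_kb


# 8,000,000 KB of RAM and 2,000,000 KB of swap:
print(commit_limit_kb(8_000_000, 2_000_000, overcommit_ratio=50))
print(commit_limit_kb(8_000_000, 2_000_000, overcommit_kbytes=1_000_000))
```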

page-cluster
============

page-cluster controls the number of pages up to which consecutive pages
are read in from swap in a single attempt. This is the swap counterpart
to page cache readahead.
The mentioned consecutivity is not in terms of virtual/physical addresses,
but consecutive on swap space - that means they were swapped out together.

It is a logarithmic value - setting it to zero means "1 page", setting
it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
Zero disables swap readahead completely.

The default value is three (eight pages at a time).  There may be some
small benefits in tuning this to a different value if your workload is
swap-intensive.

Lower values mean lower latencies for initial faults, but at the same time
extra faults and I/O delays for following faults if they would have been part
of the consecutive pages that readahead would have brought in.

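Because the value is logarithmic, the number of pages read per attempt
is a power of two.  A minimal Python sketch of the mapping (the helper
name is hypothetical):

```python
def swap_readahead_pages(page_cluster):
    """Pages read from swap in one attempt for a given page-cluster."""
    return 1 << page_cluster


for value in range(4):
    print(value, swap_readahead_pages(value))
```

The default of three corresponds to eight pages, and zero corresponds to
a single page, i.e. no readahead.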

page_lock_unfairness
====================

This value determines the number of times that the page lock can be
stolen from under a waiter. After the lock is stolen the number of times
specified in this file (default is 5), the "fair lock handoff" semantics
will apply, and the waiter will only be awakened if the lock can be taken.


panic_on_oom
============

This enables or disables the panic on out-of-memory feature.

If this is set to 0, the kernel will kill some rogue process using the
OOM killer (oom_killer).  Usually, the OOM killer can kill rogue processes
and the system will survive.

If this is set to 1, the kernel panics when out-of-memory happens.
However, if a process limits its allocations to certain nodes using
mempolicy/cpusets, and those nodes run out of memory, one process
may be killed by the OOM killer. No panic occurs in this case,
because other nodes' memory may be free. This means the system as a whole
may not be in a fatal state yet.

If this is set to 2, the kernel panics compulsorily even in the
above-mentioned case. Even if OOM happens under a memory cgroup, the whole
system panics.

The default value is 0.

1 and 2 are for failover of clustering. Please select either
according to your policy of failover.

panic_on_oom=2+kdump gives you a very strong tool to investigate
why OOM happens. You can get a snapshot.

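The three settings can be summarised as a small decision function.  This
is a sketch of the behaviour described above, not kernel code; the
function name and the "constrained" flag (an OOM limited to the nodes of
a mempolicy/cpuset, or to a memory cgroup) are names chosen for this
example.

```python
def panics_on_oom(panic_on_oom, constrained):
    """Return True if the kernel panics instead of killing a task.

    constrained is True when the OOM is confined to a mempolicy/cpuset
    node set or a memory cgroup rather than being system-wide.
    """
    if panic_on_oom == 0:
        return False            # always use the OOM killer
    if panic_on_oom == 1:
        return not constrained  # panic only on system-wide OOM
    return True                 # 2: panic in every case


for mode in (0, 1, 2):
    print(mode, panics_on_oom(mode, False), panics_on_oom(mode, True))
```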

percpu_pagelist_high_fraction
=============================

This is the fraction of pages in each zone that can be stored on
per-cpu page lists. It is an upper boundary that is divided depending
on the number of online CPUs. The min value for this is 8 which means
that we do not allow more than 1/8th of pages in each zone to be stored
on per-cpu page lists. This entry only changes the value of hot per-cpu
page lists. A user can specify a number like 100 to allocate 1/100th of
each zone between per-cpu lists.

The batch value of each per-cpu page list remains the same regardless of
the value of the high fraction so allocation latencies are unaffected.

The initial value is zero. With this value, the kernel sets the pcp->high
mark based on the low watermark for the zone and the number of local
online CPUs.  If the user writes '0' to this sysctl, it will revert to
this default behavior.

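A rough model of how the fraction translates into a per-cpu limit is
sketched below.  It only illustrates "1/N of the zone, divided between
the online CPUs"; the function name is hypothetical and the kernel's
real calculation may apply further rules that are not shown.

```python
MIN_FRACTION = 8  # smallest non-zero value accepted, per the text above


def pcp_high_pages(zone_pages, high_fraction, online_cpus):
    """Approximate per-cpu list limit, in pages, for one zone.

    Returns None for high_fraction == 0, where the kernel falls back to
    its default behaviour based on the zone's low watermark.
    """
    if high_fraction == 0:
        return None
    if high_fraction < MIN_FRACTION:
        raise ValueError("high_fraction must be 0 or at least 8")
    return zone_pages // high_fraction // online_cpus


# 1/100th of a 1,000,000-page zone shared between 4 CPUs.
print(pcp_high_pages(1_000_000, 100, 4))
```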

stat_interval
=============

The time interval between which vm statistics are updated.  The default
is 1 second.


stat_refresh
============

Any read or write (by root only) flushes all the per-cpu vm statistics
into their global totals, for more accurate reports when testing,
e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo.

As a side-effect, it also checks for negative totals (elsewhere reported
as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
(At time of writing, a few stats are known sometimes to be found negative,
with no ill effects: errors and warnings on these stats are suppressed.)


numa_stat
=========

This interface allows runtime configuration of numa statistics.

When page allocation performance becomes a bottleneck and you can tolerate
some possible tool breakage and decreased numa counter precision, you can
do::

	echo 0 > /proc/sys/vm/numa_stat

When page allocation performance is not a bottleneck and you want all
tooling to work, you can do::

	echo 1 > /proc/sys/vm/numa_stat


swappiness
==========

This control is used to define the rough relative IO cost of swapping
and filesystem paging, as a value between 0 and 200. At 100, the VM
assumes equal IO cost and will thus apply memory pressure to the page
cache and swap-backed pages equally; lower values signify more
expensive swap IO, higher values indicate cheaper swap IO.

Keep in mind that filesystem IO patterns under memory pressure tend to
be more efficient than swap's random IO. An optimal value will require
experimentation and will also be workload-dependent.

The default value is 60.

For in-memory swap, like zram or zswap, as well as hybrid setups that
have swap on faster devices than the filesystem, values beyond 100 can
be considered. For example, if the random IO against the swap device
is on average 2x faster than IO from the filesystem, swappiness should
be 133 (x + 2x = 200, 2x = 133.33).

At 0, the kernel will not initiate swap until the amount of free and
file-backed pages is less than the high watermark in a zone.

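The arithmetic of the 2x example generalises to any speed ratio: if swap
IO is r times faster than filesystem IO, solving x + r*x = 200 gives a
swappiness of r*x = 200*r/(1+r).  The Python sketch below encodes this;
the function name is hypothetical and the result is only a starting
point for the experimentation recommended above.

```python
def swappiness_for_speedup(swap_speedup):
    """Suggested swappiness when swap IO is swap_speedup times faster
    than filesystem IO (1.0 means equal cost)."""
    return round(200 * swap_speedup / (1 + swap_speedup))


print(swappiness_for_speedup(1))   # equal cost
print(swappiness_for_speedup(2))   # swap twice as fast
```

Equal IO cost gives 100 and a 2x faster swap device gives 133, matching
the figures in the text.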

unprivileged_userfaultfd
========================

This flag controls the mode in which unprivileged users can use the
userfaultfd system calls. Set this to 0 to restrict unprivileged users
to handling page faults in user mode only. In this case, users without
CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd to
succeed. Prohibiting use of userfaultfd for handling faults from kernel
mode may make certain vulnerabilities more difficult to exploit.

Set this to 1 to allow unprivileged users to use the userfaultfd system
calls without any restrictions.

The default value is 0.

Another way to control permissions for userfaultfd is to use
/dev/userfaultfd instead of userfaultfd(2). See
Documentation/admin-guide/mm/userfaultfd.rst.


user_reserve_kbytes
===================

When overcommit_memory is set to 2, "never overcommit" mode, reserve
min(3% of current process size, user_reserve_kbytes) of free memory.
This is intended to prevent a user from starting a single memory hogging
process, such that they cannot recover (kill the hog).

user_reserve_kbytes defaults to min(3% of the current process size, 128MB).

If this is reduced to zero, then the user will be allowed to allocate
all free memory with a single process, minus admin_reserve_kbytes.
Any subsequent attempts to execute a command will result in
"fork: Cannot allocate memory".

Changing this takes effect whenever an application requests memory.

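The reserve rule stated above, min(3% of current process size,
user_reserve_kbytes), can be written out as a short Python sketch.  The
function name is hypothetical, and the kernel may compute the 3% share
with a cheaper approximation than the exact percentage used here.

```python
def user_reserve_kb(process_size_kb, user_reserve_kbytes):
    """Memory held back in 'never overcommit' mode, in kilobytes."""
    three_percent_kb = process_size_kb * 3 // 100
    return min(three_percent_kb, user_reserve_kbytes)


# With user_reserve_kbytes at 128MB (131072 KB):
print(user_reserve_kb(1_000_000, 131072))    # small process: 3% applies
print(user_reserve_kb(10_000_000, 131072))   # large process: cap applies
print(user_reserve_kb(10_000_000, 0))        # reserve disabled
```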

vfs_cache_pressure
==================

This percentage value controls the tendency of the kernel to reclaim
the memory which is used for caching of directory and inode objects.

At the default value of vfs_cache_pressure=100 the kernel will attempt to
reclaim dentries and inodes at a "fair" rate with respect to pagecache and
swapcache reclaim.  Decreasing vfs_cache_pressure causes the kernel to prefer
to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
never reclaim dentries and inodes due to memory pressure and this can easily
lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes.

Increasing vfs_cache_pressure significantly beyond 100 may have a negative
performance impact. The reclaim code needs to take various locks to find
freeable directory and inode objects. With vfs_cache_pressure=1000, it will
look for ten times more freeable objects than there are.

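The "ten times more freeable objects" figure follows from treating the value
as a percentage. A minimal Python sketch of that scaling is shown here; the
function name is hypothetical and the sketch models only the percentage
arithmetic, not the reclaim code itself.

```python
def scaled_freeable(freeable_objects, vfs_cache_pressure=100):
    """Scale a count of freeable dentries/inodes by the pressure percentage.

    Illustrative only: models the percentage described in the text,
    not the locking or scanning done by reclaim.
    """
    return freeable_objects * vfs_cache_pressure // 100


if __name__ == "__main__":
    for pressure in (0, 50, 100, 1000):
        print(pressure, scaled_freeable(5000, pressure))
```

With 5000 freeable objects, a pressure of 1000 asks reclaim to look for 50000,
which is why very large values mostly add lock traffic.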

watermark_boost_factor
======================

This factor controls the level of reclaim when memory is being fragmented.
It defines the percentage of the high watermark of a zone that will be
reclaimed if pages of different mobility are being mixed within pageblocks.
The intent is to leave compaction less work to do in the future and to
increase the success rate of future high-order allocations such as SLUB
allocations, THP and hugetlbfs pages.

To make it sensible with respect to the watermark_scale_factor
parameter, the unit is in fractions of 10,000. The default value of
15,000 means that up to 150% of the high watermark will be reclaimed in the
event of a pageblock being mixed due to fragmentation. The level of reclaim
is determined by the number of fragmentation events that occurred in the
recent past. If this value is smaller than a pageblock then a pageblock's
worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
of 0 will disable the feature.

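The upper bound on the boost can be sketched in Python as follows. The
function name is hypothetical, and the default of 512 pages per pageblock
assumes 4kB pages with 2MB pageblocks; other architectures and configurations
will differ.

```python
def max_boost_pages(high_wmark_pages, watermark_boost_factor=15000,
                    pageblock_pages=512):
    """Upper bound, in pages, on the reclaim boost for one zone (sketch).

    Assumption: pageblock_pages=512 corresponds to 2MB pageblocks
    made of 4kB pages.
    """
    if watermark_boost_factor == 0:
        # A factor of 0 disables the feature.
        return 0
    # The unit is fractions of 10,000, so 15000 means 150%.
    boost = high_wmark_pages * watermark_boost_factor // 10_000
    # Never reclaim less than one pageblock's worth of pages.
    return max(boost, pageblock_pages)


if __name__ == "__main__":
    print(max_boost_pages(10_000))
```

The actual amount reclaimed depends on how many fragmentation events occurred
recently; this sketch only shows the ceiling and the one-pageblock floor.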

watermark_scale_factor
======================

This factor controls the aggressiveness of kswapd. It defines the
amount of memory left in a node/system before kswapd is woken up and
how much memory needs to be free before kswapd goes back to sleep.

The unit is in fractions of 10,000. The default value of 10 means the
distances between watermarks are 0.1% of the available memory in the
node/system. The maximum value is 3000, or 30% of memory.

A high rate of threads entering direct reclaim (allocstall) or kswapd
going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
that the number of free pages kswapd maintains for latency reasons is
too small for the allocation bursts occurring in the system. This knob
can then be used to tune kswapd aggressiveness accordingly.

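To make the "fractions of 10,000" unit concrete, the Python sketch below
converts a scale factor into a distance between watermarks. The function name
and the ``managed_pages`` argument are hypothetical, and any additional
minimums the kernel applies to the watermarks are not modeled.

```python
def watermark_distance_pages(managed_pages, watermark_scale_factor=10):
    """Distance between watermarks, in pages, for a given factor (sketch).

    Only the documented maximum of 3000 is enforced here.
    """
    if watermark_scale_factor > 3000:
        raise ValueError("watermark_scale_factor must not exceed 3000")
    # The unit is fractions of 10,000: 10 -> 0.1%, 3000 -> 30%.
    return managed_pages * watermark_scale_factor // 10_000


if __name__ == "__main__":
    print(watermark_distance_pages(1_000_000))
```

On a node with 1,000,000 pages, the default of 10 spaces the watermarks 1000
pages apart, while the maximum of 3000 spaces them 300,000 pages apart.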

zone_reclaim_mode
=================

zone_reclaim_mode selects how aggressively memory is reclaimed when a zone
runs out of memory. If it is set to zero then no zone reclaim occurs.
Allocations will be satisfied from other zones / nodes in the system.

This value is a bitwise OR of the following flags:

=	===================================
1	Zone reclaim on
2	Zone reclaim writes dirty pages out
4	Zone reclaim swaps pages
=	===================================

zone_reclaim_mode is disabled by default.  For file servers or workloads
that benefit from having their data cached, zone_reclaim_mode should be
left disabled as the caching effect is likely to be more important than
data locality.

Consider enabling one or more zone_reclaim_mode bits if it's known that the
workload is partitioned such that each partition fits within a NUMA node
and that accessing remote memory would cause a measurable performance
reduction.  The page allocator will take additional actions before
allocating off-node pages.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore, but it preserves the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.
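
Because zone_reclaim_mode is a bitmask, a value such as 3 combines several
behaviours. The Python sketch below decodes a value into the descriptions
listed for bits 1, 2 and 4. The names ``ZONE_RECLAIM_BITS`` and
``describe_zone_reclaim_mode`` are hypothetical helpers, not kernel
identifiers.

```python
# Bit values and descriptions as listed for zone_reclaim_mode.
ZONE_RECLAIM_BITS = {
    1: "Zone reclaim on",
    2: "Zone reclaim writes dirty pages out",
    4: "Zone reclaim swaps pages",
}


def describe_zone_reclaim_mode(mode):
    """Return the descriptions of every bit set in mode (illustrative)."""
    return [text for bit, text in ZONE_RECLAIM_BITS.items() if mode & bit]


if __name__ == "__main__":
    for mode in (0, 1, 3, 7):
        print(mode, describe_zone_reclaim_mode(mode))
```

An empty list corresponds to the default of 0, where zone reclaim is disabled.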