===============================
Documentation for /proc/sys/vm/
===============================

kernel version 2.6.29

Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org>

Copyright (c) 2008 Peter W. Morreale <pmorreale@novell.com>

For general info and legal blurb, please look in index.rst.

------------------------------------------------------------------------------

This file contains the documentation for the sysctl files in
/proc/sys/vm and is valid for Linux kernel version 2.6.29.

The files in this directory can be used to tune the operation
of the virtual memory (VM) subsystem of the Linux kernel and
the writeout of dirty data to disk.

Default values and initialization routines for most of these
files can be found in mm/swap.c.

Currently, these files are in /proc/sys/vm:

- admin_reserve_kbytes
- compact_memory
- compaction_proactiveness
- compact_unevictable_allowed
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
- dirty_expire_centisecs
- dirty_ratio
- dirtytime_expire_seconds
- dirty_writeback_centisecs
- drop_caches
- extfrag_threshold
- highmem_is_dirtyable
- hugetlb_shm_group
- laptop_mode
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
- min_slab_ratio
- min_unmapped_ratio
- mmap_min_addr
- mmap_rnd_bits
- mmap_rnd_compat_bits
- nr_hugepages
- nr_hugepages_mempolicy
- nr_overcommit_hugepages
- nr_trim_pages         (only if CONFIG_MMU=n)
- numa_zonelist_order
- oom_dump_tasks
- oom_kill_allocating_task
- overcommit_kbytes
- overcommit_memory
- overcommit_ratio
- page-cluster
- page_lock_unfairness
- panic_on_oom
- percpu_pagelist_high_fraction
- stat_interval
- stat_refresh
- numa_stat
- swappiness
- unprivileged_userfaultfd
- user_reserve_kbytes
- vfs_cache_pressure
- watermark_boost_factor
- watermark_scale_factor
- zone_reclaim_mode


admin_reserve_kbytes
====================

The amount of free memory in the system that should be reserved for users
with the capability cap_sys_admin.

admin_reserve_kbytes defaults to min(3% of free pages, 8MB)

That should provide enough for the admin to log in and kill a process,
if necessary, under the default overcommit 'guess' mode.

Systems running under overcommit 'never' should increase this to account
for the full Virtual Memory Size of programs used to recover. Otherwise,
root may not be able to log in to recover the system.

How do you calculate a minimum useful reserve?

sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

For overcommit 'guess', we can sum resident set sizes (RSS).
On x86_64 this is about 8MB.

For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS.
On x86_64 this is about 128MB.

Changing this takes effect whenever an application requests memory.


compact_memory
==============

Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
all zones are compacted such that free memory is available in contiguous
blocks where possible. This can be important, for example, in the allocation
of huge pages, although processes will also directly compact memory as
required.

compaction_proactiveness
========================

This tunable takes a value in the range [0, 100] with a default value of
20. This tunable determines how aggressively compaction is done in the
background. Writing a non-zero value to this tunable immediately
triggers proactive compaction. Setting it to 0 disables proactive compaction.

Note that compaction has a non-trivial system-wide impact as pages
belonging to different processes are moved around, which could also lead
to latency spikes in unsuspecting applications. The kernel employs
various heuristics to avoid wasting CPU cycles if it detects that
proactive compaction is not being effective.

Be careful when setting it to extreme values like 100, as that may
cause excessive background compaction activity.

compact_unevictable_allowed
===========================

Available only when CONFIG_COMPACTION is set. When set to 1, compaction is
allowed to examine the unevictable lru (mlocked pages) for pages to compact.
This should be used on systems where stalls for minor page faults are an
acceptable trade for large contiguous free memory. Set to 0 to prevent
compaction from moving pages that are unevictable. Default value is 1.
On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
to compaction, which would block the task from becoming active until the fault
is resolved.


dirty_background_bytes
======================

Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.

Note:
  dirty_background_bytes is the counterpart of dirty_background_ratio. Only
  one of them may be specified at a time. When one sysctl is written it is
  immediately taken into account to evaluate the dirty memory limits and the
  other appears as 0 when read.


dirty_background_ratio
======================

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which the background kernel
flusher threads will start writing out dirty data.

The total available memory is not equal to total system memory.


dirty_bytes
===========

Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.

Note: dirty_bytes is the counterpart of dirty_ratio.
Only one of them may be specified at a time. When one sysctl is written it is
immediately taken into account to evaluate the dirty memory limits and the
other appears as 0 when read.

Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.


dirty_expire_centisecs
======================

This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads. It is expressed in hundredths
of a second. Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.


dirty_ratio
===========

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which a process which is
generating disk writes will itself start writing out dirty data.

The total available memory is not equal to total system memory.
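
The interaction between dirty_ratio and dirty_bytes described above can be
sketched as a small model. This is an illustration only, not kernel code: the
class name ``DirtyLimit``, its methods, the 4 KiB page size and the starting
ratio of 20 are all assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


class DirtyLimit:
    """Toy model of the dirty_ratio / dirty_bytes pair (hypothetical names)."""

    def __init__(self, ratio=20):
        self.dirty_ratio = ratio  # illustrative starting value, an assumption
        self.dirty_bytes = 0      # the unused counterpart reads back as 0

    def write_ratio(self, ratio):
        # Writing one knob makes the other read as 0.
        self.dirty_ratio = ratio
        self.dirty_bytes = 0

    def write_bytes(self, nbytes):
        # Values below two pages are ignored; the old configuration stays.
        if nbytes < 2 * PAGE_SIZE:
            return False
        self.dirty_bytes = nbytes
        self.dirty_ratio = 0
        return True

    def threshold(self, available_bytes):
        # available_bytes stands for "total available memory" (free plus
        # reclaimable pages), which is not the same as total system memory.
        if self.dirty_bytes:
            return self.dirty_bytes
        return available_bytes * self.dirty_ratio // 100
```

With 8 GiB of available memory and a ratio of 20, the model yields a threshold
of roughly 1.6 GiB; writing a valid dirty_bytes value replaces that with a
fixed byte count.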


dirtytime_expire_seconds
========================

When a lazytime inode is constantly having its pages dirtied, the inode with
an updated timestamp will never get a chance to be written out. And, if the
only thing that has happened on the file system is a dirtytime inode caused
by an atime update, a worker will be scheduled to make sure that inode
eventually gets pushed out to disk. This tunable is used to define when a
dirty inode is old enough to be eligible for writeback by the kernel flusher
threads. It is also used as the interval at which the dirtytime_writeback
thread wakes up.


dirty_writeback_centisecs
=========================

The kernel flusher threads will periodically wake up and write `old` data
out to disk. This tunable expresses the interval between those wakeups, in
hundredths of a second.

Setting this to zero disables periodic writeback altogether.


drop_caches
===========

Writing to this will cause the kernel to drop clean caches, as well as
reclaimable slab objects like dentries and inodes.
Once dropped, their memory becomes free.

To free pagecache::

    echo 1 > /proc/sys/vm/drop_caches

To free reclaimable slab objects (includes dentries and inodes)::

    echo 2 > /proc/sys/vm/drop_caches

To free slab objects and pagecache::

    echo 3 > /proc/sys/vm/drop_caches

This is a non-destructive operation and will not free any dirty objects.
To increase the number of objects freed by this operation, the user may run
`sync` prior to writing to /proc/sys/vm/drop_caches. This will minimize the
number of dirty objects on the system and create more candidates to be
dropped.

This file is not a means to control the growth of the various kernel caches
(inodes, dentries, pagecache, etc.). These objects are automatically
reclaimed by the kernel when memory is needed elsewhere on the system.

Use of this file can cause performance problems. Since it discards cached
objects, it may cost a significant amount of I/O and CPU to recreate the
dropped objects, especially if they were under heavy use.
Because of this, use outside of a testing or debugging environment is not
recommended.

You may see informational messages in your kernel log when this file is
used::

    cat (1234): drop_caches: 3

These are informational only. They do not mean that anything is wrong
with your system. To disable them, echo 4 (bit 2) into drop_caches.


extfrag_threshold
=================

This parameter affects whether the kernel will compact memory or direct
reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in
debugfs shows what the fragmentation index for each order is in each zone in
the system. Values tending towards 0 imply allocations would fail due to lack
of memory, values towards 1000 imply failures are due to fragmentation and -1
implies that the allocation will succeed as long as watermarks are met.

The kernel will not compact memory in a zone if the
fragmentation index is <= extfrag_threshold. The default value is 500.
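
The threshold rule above can be written as a one-line predicate. The sketch
below only restates the comparison given in the text; the function name
``should_compact`` is hypothetical and the real decision in the kernel
involves further checks that are not modelled here.

```python
def should_compact(fragmentation_index, extfrag_threshold=500):
    """Return True when the index is above the threshold.

    Mirrors the documented rule: no compaction in a zone whose
    fragmentation index is <= extfrag_threshold. An index of -1
    (allocation expected to succeed) therefore never triggers it.
    """
    return fragmentation_index > extfrag_threshold
```

For example, an index of 800 (failures mostly due to fragmentation) is above
the default threshold of 500, while an index of 100 (failures mostly due to
lack of memory) is not.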


highmem_is_dirtyable
====================

Available only for systems with CONFIG_HIGHMEM enabled (32-bit systems).

This parameter controls whether the high memory is considered for dirty
writers throttling. This is not the case by default, which means that
only the amount of memory directly visible/usable by the kernel can
be dirtied. As a result, on systems with a large amount of memory and
lowmem basically depleted, writers might be throttled too early and
streaming writes can get very slow.

Changing the value to non-zero would allow more memory to be dirtied
and thus allow writers to write more data which can be flushed to the
storage more effectively. Note this also comes with a risk of premature
OOM killer invocation because some writers (e.g. direct block device writes)
can only use the low memory and they can fill it up with dirty data without
any throttling.


hugetlb_shm_group
=================

hugetlb_shm_group contains the group id that is allowed to create SysV
shared memory segments using hugetlb pages.


laptop_mode
===========

laptop_mode is a knob that controls "laptop mode". All the things that are
controlled by this knob are discussed in Documentation/admin-guide/laptops/laptop-mode.rst.


legacy_va_layout
================

If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
will use the legacy (2.4) layout for all processes.


lowmem_reserve_ratio
====================

For some specialised workloads on highmem machines it is dangerous for
the kernel to allow process memory to be allocated from the "lowmem"
zone. This is because that memory could then be pinned via the mlock()
system call, or by unavailability of swapspace.

And on large highmem machines this lack of reclaimable lowmem memory
can be fatal.

So the Linux page allocator has a mechanism which prevents allocations
which *could* use highmem from using too much lowmem.
This means that a certain amount of lowmem is defended from the possibility
of being captured into pinned user memory.

(The same argument applies to the old 16 megabyte ISA DMA region. This
mechanism will also defend that region from allocations which could use
highmem or lowmem).

The `lowmem_reserve_ratio` tunable determines how aggressive the kernel is
in defending these lower zones.

If you have a machine which uses highmem or ISA DMA and your
applications are using mlock(), or if you are running with no swap then
you probably should change the lowmem_reserve_ratio setting.

The lowmem_reserve_ratio is an array. You can see its values by reading this
file::

    % cat /proc/sys/vm/lowmem_reserve_ratio
    256     256     32

But these values are not used directly. The kernel calculates the number of
protection pages for each zone from them. These are shown as an array of
protection pages in /proc/zoneinfo like the following. (This is an example
from an x86-64 box).
Each zone has an array of protection pages like this::

    Node 0, zone      DMA
      pages free     1355
            min      3
            low      3
            high     4
            :
            :
        numa_other   0
            protection: (0, 2004, 2004, 2004)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      pagesets
        cpu: 0 pcp: 0
            :

These protections are added to the score used to judge whether this zone
should be used for page allocation or should be reclaimed.

In this example, if normal pages (index=2) are requested from this DMA zone and
watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should
not be used because pages_free(1355) is smaller than watermark + protection[2]
(4 + 2004 = 2008). If this protection value is 0, this zone would be used for
normal page requirement. If requirement is DMA zone(index=0), protection[0]
(=0) is used.
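
The arithmetic of this example can be replayed in a few lines. The helper
below is a sketch, not the allocator's actual code: the name ``zone_usable``
is hypothetical, and treating the boundary case (free pages exactly equal to
watermark + protection) as "not usable" is an assumption, since the text only
covers the "smaller than" case.

```python
def zone_usable(pages_free, watermark, protection, request_index):
    """Check a zone against watermark + protection[request_index].

    protection is the per-zone array shown in /proc/zoneinfo and
    request_index is the index of the zone type being requested.
    """
    return pages_free > watermark + protection[request_index]


# Values taken from the DMA zone example above.
dma_protection = (0, 2004, 2004, 2004)
normal_request = zone_usable(1355, 4, dma_protection, 2)  # 1355 vs 2008
dma_request = zone_usable(1355, 4, dma_protection, 0)     # 1355 vs 4
```

A request for normal pages is turned away from the DMA zone, while a request
that specifically needs DMA memory sees no protection and may use it.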

zone[i]'s protection[j] is calculated by the following expression::

    (i < j):
      zone[i]->protection[j]
      = (total sums of managed_pages from zone[i+1] to zone[j] on the node)
        / lowmem_reserve_ratio[i];
    (i = j):
       (should not be protected) = 0;
    (i > j):
       (not necessary, but looks 0)

The default values of lowmem_reserve_ratio[i] are

    === ====================================
    256 (if zone[i] means DMA or DMA32 zone)
    32  (others)
    === ====================================

As the expression above shows, these values are the reciprocals of the ratio.
256 means 1/256. The number of protection pages becomes about 0.39% of the
total managed pages of higher zones on the node.

If you would like to protect more pages, smaller values are effective.
The minimum value is 1 (1/1 -> 100%). A value less than 1 completely
disables protection of the pages.
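
As a rough illustration of the expression above, the following sketch builds
the whole protection table for one node. It is not the kernel's
implementation: the function name ``protection_table`` is hypothetical,
integer division is assumed, and the managed page counts are invented so that
the first row reproduces the ``(0, 2004, 2004, 2004)`` array from the
/proc/zoneinfo example.

```python
def protection_table(managed_pages, ratios):
    """Return table[i][j] = protection[j] of zone[i] for one node.

    managed_pages[k] is the managed page count of zone k and
    ratios[k] is lowmem_reserve_ratio[k].
    """
    n = len(managed_pages)
    table = [[0] * n for _ in range(n)]
    for i in range(n):
        if ratios[i] < 1:
            continue  # a ratio below 1 disables protection for zone[i]
        for j in range(i + 1, n):
            # Sum of managed pages from zone[i+1] up to and including zone[j].
            higher = sum(managed_pages[i + 1:j + 1])
            table[i][j] = higher // ratios[i]
    return table


# Hypothetical node: DMA, DMA32, Normal, Movable (the last two empty).
pages = [3976, 513024, 0, 0]
ratios = [256, 256, 32, 32]
table = protection_table(pages, ratios)
```

Entries with i >= j stay 0, matching the second and third cases of the
expression, and 1/256 of the higher zones corresponds to the roughly 0.39%
mentioned above.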


max_map_count:
==============

This file contains the maximum number of memory map areas a process
may have. Memory map areas are used as a side-effect of calling
malloc, directly by mmap, mprotect, and madvise, and also when loading
shared libraries.

While most applications need less than a thousand maps, certain
programs, particularly malloc debuggers, may consume lots of them,
e.g., up to one or two maps per allocation.

The default value is 65530.


memory_failure_early_kill:
==========================

Control how to kill processes when an uncorrected memory error (typically
a 2-bit error in a memory module) is detected in the background by hardware
that cannot be handled by the kernel. In some cases (like the page
still having a valid copy on disk) the kernel will handle the failure
transparently without affecting any applications. But if there is
no other up-to-date copy of the data it will kill to prevent any data
corruptions from propagating.

1: Kill all processes that have the corrupted and not reloadable page mapped
as soon as the corruption is detected. Note this is not supported
for a few types of pages, like kernel internally allocated data or
the swap cache, but works for the majority of user pages.

0: Only unmap the corrupted page from all processes and only kill a process
that tries to access it.

The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
handle this if they want to.

This is only active on architectures/platforms with advanced machine
check handling and depends on the hardware capabilities.

Applications can override this setting individually with the PR_MCE_KILL prctl.


memory_failure_recovery
=======================

Enable memory failure recovery (when supported by the platform)

1: Attempt recovery.

0: Always panic on a memory failure.


min_free_kbytes
===============

This is used to force the Linux VM to keep a minimum number
of kilobytes free. The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.


min_slab_ratio
==============

This is available only on NUMA kernels.

A percentage of the total pages in each zone. On zone reclaim
(fallback from the local zone occurs) slabs will be reclaimed if more
than this percentage of pages in a zone are reclaimable slab pages.
This ensures that the slab growth stays under control even in NUMA
systems that rarely perform global reclaim.

The default is 5 percent.
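
The min_slab_ratio condition can be expressed as a simple percentage
comparison. This is a sketch of the rule as stated above, with a hypothetical
function name and made-up page counts; it does not model how or when the
kernel actually performs the reclaim.

```python
def slab_reclaim_wanted(reclaimable_slab_pages, zone_pages, min_slab_ratio=5):
    """True if reclaimable slab exceeds min_slab_ratio percent of the zone."""
    # Cross-multiplied to stay in integer arithmetic.
    return reclaimable_slab_pages * 100 > zone_pages * min_slab_ratio
```

In a zone of 100000 pages with the default of 5 percent, slab reclaim would
be wanted once more than 5000 pages are reclaimable slab.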

Note that slab reclaim is triggered in a per zone / node fashion.
The process of reclaiming slab memory is currently not node specific
and may not be fast.


min_unmapped_ratio
==================

This is available only on NUMA kernels.

This is a percentage of the total pages in each zone. Zone reclaim will
only occur if more than this percentage of pages are in a state that
zone_reclaim_mode allows to be reclaimed.

If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
against all file-backed unmapped pages including swapcache pages and tmpfs
files. Otherwise, only unmapped pages backed by normal files but not tmpfs
files and similar are considered.

The default is 1 percent.


mmap_min_addr
=============

This file indicates the amount of address space which a user process will
be restricted from mmapping.
Since kernel null dereference bugs could
accidentally operate based on the information in the first couple of pages
of memory, userspace processes should not be allowed to write to them.  By
default this value is set to 0 and no protections will be enforced by the
security module.  Setting this value to something like 64k will allow the
vast majority of applications to work correctly and provide defense in depth
against future potential kernel bugs.


mmap_rnd_bits
=============

This value can be used to select the number of bits to use to
determine the random offset to the base address of vma regions
resulting from mmap allocations on architectures which support
tuning address space randomization.  This value will be bounded
by the architecture's minimum and maximum supported values.

This value can be changed after boot using the
/proc/sys/vm/mmap_rnd_bits tunable.


mmap_rnd_compat_bits
====================

This value can be used to select the number of bits to use to
determine the random offset to the base address of vma regions
resulting from mmap allocations for applications run in
compatibility mode on architectures which support tuning address
space randomization.  This value will be bounded by the
architecture's minimum and maximum supported values.

This value can be changed after boot using the
/proc/sys/vm/mmap_rnd_compat_bits tunable.


nr_hugepages
============

Change the minimum size of the hugepage pool.

See Documentation/admin-guide/mm/hugetlbpage.rst


hugetlb_optimize_vmemmap
========================

This knob is not available when the size of 'struct page' (a structure defined
in include/linux/mm_types.h) is not a power of two (an unusual system config
could result in this).

Enable (set to 1) or disable (set to 0) HugeTLB Vmemmap Optimization (HVO).

Once enabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
buddy allocator will be optimized (7 pages per 2MB HugeTLB page and 4095 pages
per 1GB HugeTLB page), whereas already allocated HugeTLB pages will not be
optimized.  When those optimized HugeTLB pages are freed from the HugeTLB pool
to the buddy allocator, the vmemmap pages representing that range need to be
remapped again and the vmemmap pages discarded earlier need to be reallocated
again.  If your use case is that HugeTLB pages are allocated 'on the fly' (e.g.
never explicitly allocating HugeTLB pages with 'nr_hugepages' but only set
'nr_overcommit_hugepages', those overcommitted HugeTLB pages are allocated 'on
the fly') instead of being pulled from the HugeTLB pool, you should weigh the
benefits of memory savings against the additional overhead (~2x slower than
before) of allocating or freeing HugeTLB pages between the HugeTLB pool and the
buddy allocator.  Another behavior to note is that if the system is under heavy
memory pressure, it could prevent the user from freeing HugeTLB pages from the
HugeTLB pool to the buddy allocator since the allocation of vmemmap pages could
fail; you have to retry later if your system encounters this situation.
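
As a rough illustration of the savings quoted above, the sketch below converts
the per-page figures (7 vmemmap pages per 2MB HugeTLB page, 4095 per 1GB
HugeTLB page) into bytes. The 4 KiB base page size and the helper name
``hvo_saved_bytes`` are assumptions made for this example only; they are not
taken from the kernel sources.

```python
# Assumption: 4 KiB base pages. Other page sizes change the result.
BASE_PAGE_SIZE = 4096

# vmemmap pages freed per HugeTLB page, as quoted in the text above.
VMEMMAP_PAGES_SAVED = {"2MB": 7, "1GB": 4095}


def hvo_saved_bytes(hugepage_size, nr_hugepages):
    """Estimate bytes of vmemmap memory saved for a pool of HugeTLB pages.

    hugepage_size is "2MB" or "1GB"; nr_hugepages is the pool size.
    Hypothetical helper, for illustration only.
    """
    if nr_hugepages < 0:
        raise ValueError("nr_hugepages must be non-negative")
    return VMEMMAP_PAGES_SAVED[hugepage_size] * BASE_PAGE_SIZE * nr_hugepages


if __name__ == "__main__":
    for size, count in (("2MB", 512), ("1GB", 1)):
        saved_kib = hvo_saved_bytes(size, count) // 1024
        print(f"{count} x {size} HugeTLB pages: about {saved_kib} KiB saved")
```

Under these assumptions the saving grows linearly with the pool size, which is
why it has to be weighed against the allocation and freeing overhead described
above.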

Once disabled, the vmemmap pages of subsequent allocation of HugeTLB pages from
buddy allocator will not be optimized meaning the extra overhead at allocation
time from buddy allocator disappears, whereas already optimized HugeTLB pages
will not be affected.  If you want to make sure there are no optimized HugeTLB
pages, you can set "nr_hugepages" to 0 first and then disable this.  Note that
writing 0 to nr_hugepages will make any "in use" HugeTLB pages become surplus
pages.  So, those surplus pages are still optimized until they are no longer
in use.  You would need to wait for those surplus pages to be released before
there are no optimized pages in the system.


nr_hugepages_mempolicy
======================

Change the size of the hugepage pool at run-time on a specific
set of NUMA nodes.

See Documentation/admin-guide/mm/hugetlbpage.rst


nr_overcommit_hugepages
=======================

Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.

See Documentation/admin-guide/mm/hugetlbpage.rst


nr_trim_pages
=============

This is available only on NOMMU kernels.

This value adjusts the excess page trimming behaviour of power-of-2 aligned
NOMMU mmap allocations.

A value of 0 disables trimming of allocations entirely, while a value of 1
trims excess pages aggressively. Any value >= 1 acts as the watermark where
trimming of allocations is initiated.

The default value is 1.

See Documentation/admin-guide/mm/nommu-mmap.rst for more information.


numa_zonelist_order
===================

This sysctl is only for NUMA and it is deprecated. Anything but
Node order will fail!

'where the memory is allocated from' is controlled by zonelists.

(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation.
You may be able to read ZONE_DMA as ZONE_DMA32...)

In non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows.
ZONE_NORMAL -> ZONE_DMA
This means that a memory allocation request for GFP_KERNEL will
get memory from ZONE_DMA only when ZONE_NORMAL is not available.

In NUMA case, you can think of following 2 types of order.
Assume 2 node NUMA and below is zonelist of Node(0)'s GFP_KERNEL::

  (A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
  (B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA.

Type(A) offers the best locality for processes on Node(0), but ZONE_DMA
will be used before ZONE_NORMAL exhaustion. This increases the possibility of
out-of-memory (OOM) of ZONE_DMA because ZONE_DMA tends to be small.

Type(B) cannot offer the best locality but is more robust against OOM of
the DMA zone.

Type(A) is called "Node" order. Type(B) is "Zone" order.

"Node order" orders the zonelists by node, then by zone within each node.
Specify "[Nn]ode" for node order.

"Zone Order" orders the zonelists by zone type, then by node within each
zone.  Specify "[Zz]one" for zone order.

Specify "[Dd]efault" to request automatic configuration.

On 32-bit, the Normal zone needs to be preserved for allocations accessible
by the kernel, so "zone" order will be selected.

On 64-bit, devices that require DMA32/DMA are relatively rare, so "node"
order will be selected.

Default order is recommended unless this is causing problems for your
system/application.


oom_dump_tasks
==============

Enables a system-wide task dump (excluding kernel threads) to be produced
when the kernel performs an OOM-killing and includes such information as
pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
score, and name.  This is helpful to determine why the OOM killer was
invoked, to identify the rogue task that caused it, and to determine why
the OOM killer chose the task it did to kill.

If this is set to zero, this information is suppressed.  On very
large systems with thousands of tasks it may not be feasible to dump
the memory state information for each one.  Such systems should not
be forced to incur a performance penalty in OOM conditions when the
information may not be desired.

If this is set to non-zero, this information is shown whenever the
OOM killer actually kills a memory-hogging task.

The default value is 1 (enabled).


oom_kill_allocating_task
========================

This enables or disables killing the OOM-triggering task in
out-of-memory situations.

If this is set to zero, the OOM killer will scan through the entire
tasklist and select a task based on heuristics to kill.  This normally
selects a rogue memory-hogging task that frees up a large amount of
memory when killed.

If this is set to non-zero, the OOM killer simply kills the task that
triggered the out-of-memory condition.  This avoids the expensive
tasklist scan.

If panic_on_oom is selected, it takes precedence over whatever value
is used in oom_kill_allocating_task.

The default value is 0.
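
The precedence rules above can be summarised in a small decision sketch. It is
deliberately simplified: any non-zero panic_on_oom is treated as "panic", and
the mempolicy/cpuset exception documented for panic_on_oom=1 is not modelled.
The function name and the returned strings are hypothetical labels, not kernel
identifiers.

```python
def oom_action(panic_on_oom, oom_kill_allocating_task):
    """Return a label for what happens on OOM, per the rules described above.

    Simplification (assumption): every non-zero panic_on_oom value panics.
    """
    if panic_on_oom != 0:
        # panic_on_oom takes precedence over oom_kill_allocating_task.
        return "panic"
    if oom_kill_allocating_task != 0:
        # Skip the tasklist scan and kill the task that triggered the OOM.
        return "kill_allocating_task"
    # Default: scan the tasklist and pick a victim heuristically.
    return "scan_tasklist_and_kill"


if __name__ == "__main__":
    for panic, kill_alloc in ((0, 0), (0, 1), (1, 1)):
        print(panic, kill_alloc, oom_action(panic, kill_alloc))
```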


overcommit_kbytes
=================

When overcommit_memory is set to 2, the committed address space is not
permitted to exceed swap plus this amount of physical RAM. See below.

Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
of them may be specified at a time. Setting one disables the other (which
then appears as 0 when read).


overcommit_memory
=================

This value contains a flag that enables memory overcommitment.

When this flag is 0, the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.

When this flag is 1, the kernel pretends there is always enough
memory until it actually runs out.

When this flag is 2, the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
Note that user_reserve_kbytes affects this policy.
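
The mode 2 limit described for overcommit_kbytes and its counterpart
overcommit_ratio can be sketched as simple arithmetic. The sketch follows only
the wording of this document (swap plus either a fixed amount or a percentage
of physical RAM). The function name and its parameters are hypothetical, and
the kernel's real accounting in __vm_enough_memory() applies further
adjustments that are not modelled here.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio, overcommit_kbytes=0):
    """Approximate the mode-2 commit limit in kilobytes.

    A non-zero overcommit_kbytes is used instead of overcommit_ratio, since
    the text says setting one of the two disables the other.
    """
    if overcommit_kbytes:
        allowed_ram_kb = overcommit_kbytes
    else:
        allowed_ram_kb = ram_kb * overcommit_ratio // 100
    return swap_kb + allowed_ram_kb


if __name__ == "__main__":
    ram_kb = 8 * 1024 * 1024    # example machine: 8 GiB of RAM
    swap_kb = 2 * 1024 * 1024   # example machine: 2 GiB of swap
    print(commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50))
    print(commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=0,
                          overcommit_kbytes=1024 * 1024))
```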

This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.

The default value is 0.

See Documentation/mm/overcommit-accounting.rst and
mm/util.c::__vm_enough_memory() for more information.


overcommit_ratio
================

When overcommit_memory is set to 2, the committed address
space is not permitted to exceed swap plus this percentage
of physical RAM.  See above.


page-cluster
============

page-cluster controls the number of pages up to which consecutive pages
are read in from swap in a single attempt.  This is the swap counterpart
to page cache readahead.
The mentioned consecutivity is not in terms of virtual/physical addresses,
but consecutive on swap space - that means they were swapped out together.

It is a logarithmic value - setting it to zero means "1 page", setting
it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
Zero disables swap readahead completely.
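
The logarithmic encoding can be made concrete with a short sketch: the number
of pages read in one attempt is 2 raised to the page-cluster value. The
function name is hypothetical and chosen for this example.

```python
def swap_readahead_pages(page_cluster):
    """Return the number of pages read from swap in a single attempt."""
    if page_cluster < 0:
        raise ValueError("page-cluster must be non-negative")
    # page-cluster is a base-2 logarithm: 0 -> 1 page, 1 -> 2 pages, ...
    return 1 << page_cluster


if __name__ == "__main__":
    for value in range(4):
        print(f"page-cluster={value}: {swap_readahead_pages(value)} page(s)")
```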

The default value is three (eight pages at a time).  There may be some
small benefits in tuning this to a different value if your workload is
swap-intensive.

Lower values mean lower latencies for initial faults, but at the same time
extra faults and I/O delays for following faults if they would have been part
of the consecutive pages that readahead would have brought in.


page_lock_unfairness
====================

This value determines the number of times that the page lock can be
stolen from under a waiter. After the lock is stolen the number of times
specified in this file (default is 5), the "fair lock handoff" semantics
will apply, and the waiter will only be awakened if the lock can be taken.

panic_on_oom
============

This enables or disables the panic on out-of-memory feature.

If this is set to 0, the kernel invokes the OOM killer (oom_killer), which
kills some rogue process.  Usually, oom_killer can kill rogue processes and
the system will survive.

If this is set to 1, the kernel panics when out-of-memory happens.
However, if a process limits its node usage with mempolicy/cpusets,
and those nodes reach memory exhaustion, one process
may be killed by the OOM killer. No panic occurs in this case,
because other nodes' memory may be free. This means the system as a whole
may not be in a fatal state yet.

If this is set to 2, the kernel panics unconditionally, even in the
above-mentioned case. Even if OOM happens under a memory cgroup, the whole
system panics.

The default value is 0.

1 and 2 are for failover of clustering. Please select either
according to your policy of failover.

panic_on_oom=2+kdump gives you a very strong tool to investigate
why OOM happens. You can get a snapshot.


percpu_pagelist_high_fraction
=============================

This is the fraction of pages in each zone that can be stored on
per-cpu page lists. It is an upper boundary that is divided depending
on the number of online CPUs. The min value for this is 8 which means
that we do not allow more than 1/8th of pages in each zone to be stored
on per-cpu page lists. This entry only changes the value of hot per-cpu
page lists.
A user can specify a number like 100 to allocate 1/100th of
each zone between per-cpu lists.

The batch value of each per-cpu page list remains the same regardless of
the value of the high fraction so allocation latencies are unaffected.

The initial value is zero. The kernel uses this value to set the pcp->high
mark based on the low watermark for the zone and the number of local
online CPUs.  If the user writes '0' to this sysctl, it will revert to
this default behavior.


stat_interval
=============

The time interval between which vm statistics are updated.  The default
is 1 second.


stat_refresh
============

Any read or write (by root only) flushes all the per-cpu vm statistics
into their global totals, for more accurate reports when testing
e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo

As a side-effect, it also checks for negative totals (elsewhere reported
as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
(At time of writing, a few stats are known sometimes to be found negative,
with no ill effects: errors and warnings on these stats are suppressed.)


numa_stat
=========

This interface allows runtime configuration of numa statistics.

When page allocation performance becomes a bottleneck and you can tolerate
some possible tool breakage and decreased numa counter precision, you can
do::

        echo 0 > /proc/sys/vm/numa_stat

When page allocation performance is not a bottleneck and you want all
tooling to work, you can do::

        echo 1 > /proc/sys/vm/numa_stat


swappiness
==========

This control is used to define the rough relative IO cost of swapping
and filesystem paging, as a value between 0 and 200. At 100, the VM
assumes equal IO cost and will thus apply memory pressure to the page
cache and swap-backed pages equally; lower values signify more
expensive swap IO, higher values indicate cheaper swap IO.

Keep in mind that filesystem IO patterns under memory pressure tend to
be more efficient than swap's random IO. An optimal value will require
experimentation and will also be workload-dependent.
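
One way to read the relative-cost model is that swappiness and
(200 - swappiness) weight swap IO and filesystem IO respectively. Under that
reading, which is an interpretation for illustration and not something the
kernel computes for you, a swap device that is a given number of times faster
than the filesystem suggests the starting value sketched below. The function
name is hypothetical.

```python
def suggested_swappiness(swap_speedup):
    """Suggest a swappiness value from the relative speed of swap IO.

    swap_speedup is how many times faster swap IO is than filesystem IO
    (1.0 means equal cost). Solves s / (200 - s) = swap_speedup for s.
    """
    if swap_speedup <= 0:
        raise ValueError("swap_speedup must be positive")
    return int(200 * swap_speedup / (1 + swap_speedup))


if __name__ == "__main__":
    for speedup in (0.5, 1, 2):
        print(f"swap {speedup}x as fast: swappiness {suggested_swappiness(speedup)}")
```

Any value obtained this way is only a starting point; as noted above, the
optimal setting still requires experimentation.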

The default value is 60.

For in-memory swap, like zram or zswap, as well as hybrid setups that
have swap on faster devices than the filesystem, values beyond 100 can
be considered. For example, if the random IO against the swap device
is on average 2x faster than IO from the filesystem, swappiness should
be 133 (x + 2x = 200, 2x = 133.33).

At 0, the kernel will not initiate swap until the amount of free and
file-backed pages is less than the high watermark in a zone.


unprivileged_userfaultfd
========================

This flag controls the mode in which unprivileged users can use the
userfaultfd system calls. Set this to 0 to restrict unprivileged users
to handle page faults in user mode only. In this case, users without
CAP_SYS_PTRACE must pass UFFD_USER_MODE_ONLY in order for userfaultfd to
succeed. Prohibiting use of userfaultfd for handling faults from kernel
mode may make certain vulnerabilities more difficult to exploit.

Set this to 1 to allow unprivileged users to use the userfaultfd system
calls without any restrictions.

The default value is 0.

Another way to control permissions for userfaultfd is to use
/dev/userfaultfd instead of userfaultfd(2). See
Documentation/admin-guide/mm/userfaultfd.rst.

user_reserve_kbytes
===================

When overcommit_memory is set to 2, "never overcommit" mode, reserve
min(3% of current process size, user_reserve_kbytes) of free memory.
This is intended to prevent a user from starting a single memory hogging
process, such that they cannot recover (kill the hog).

user_reserve_kbytes defaults to min(3% of the current process size, 128MB).

If this is reduced to zero, then the user will be allowed to allocate
all free memory with a single process, minus admin_reserve_kbytes.
Any subsequent attempts to execute a command will result in
"fork: Cannot allocate memory".

Changing this takes effect whenever an application requests memory.


vfs_cache_pressure
==================

This percentage value controls the tendency of the kernel to reclaim
the memory which is used for caching of directory and inode objects.

At the default value of vfs_cache_pressure=100 the kernel will attempt to
reclaim dentries and inodes at a "fair" rate with respect to pagecache and
swapcache reclaim.  Decreasing vfs_cache_pressure causes the kernel to prefer
to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
never reclaim dentries and inodes due to memory pressure and this can easily
lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes.

Increasing vfs_cache_pressure significantly beyond 100 may have negative
performance impact. Reclaim code needs to take various locks to find freeable
directory and inode objects. With vfs_cache_pressure=1000, it will look for
ten times more freeable objects than there are.


watermark_boost_factor
======================

This factor controls the level of reclaim when memory is being fragmented.
It defines the percentage of the high watermark of a zone that will be
reclaimed if pages of different mobility are being mixed within pageblocks.
The intent is that compaction has less work to do in the future and to
increase the success rate of future high-order allocations such as SLUB
allocations, THP and hugetlbfs pages.

To make it sensible with respect to the watermark_scale_factor
parameter, the unit is in fractions of 10,000. The default value of
15,000 means that up to 150% of the high watermark will be reclaimed in the
event of a pageblock being mixed due to fragmentation. The level of reclaim
is determined by the number of fragmentation events that occurred in the
recent past. If this value is smaller than a pageblock then a pageblock's
worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
of 0 will disable the feature.


watermark_scale_factor
======================

This factor controls the aggressiveness of kswapd. It defines the
amount of memory left in a node/system before kswapd is woken up and
how much memory needs to be free before kswapd goes back to sleep.

The unit is in fractions of 10,000. The default value of 10 means the
distances between watermarks are 0.1% of the available memory in the
node/system. The maximum value is 3000, or 30% of memory.
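
As a rough illustration of the "fractions of 10,000" unit shared by
watermark_boost_factor and watermark_scale_factor, the sketch below converts
a factor into a share of a base quantity. The helper name ``fraction_of``,
the page counts, and the plain multiply-and-divide are assumptions made for
illustration only; the kernel's real watermark calculation involves further
per-zone details that are not modelled here.

```python
# Illustrative arithmetic only; this is not the kernel's implementation.

def fraction_of(base_pages: int, factor: int) -> int:
    """Scale base_pages by factor, where factor is in units of 1/10,000."""
    return base_pages * factor // 10_000


# Hypothetical numbers: 1,000,000 pages of available memory and a
# high watermark of 8,000 pages.
available_pages = 1_000_000
high_watermark_pages = 8_000

# watermark_scale_factor=10 corresponds to 0.1% of available memory.
scale_gap = fraction_of(available_pages, 10)

# watermark_boost_factor=15000 corresponds to 150% of the high watermark.
max_boost = fraction_of(high_watermark_pages, 15_000)

print(scale_gap, max_boost)
```

With these made-up inputs, the maximum watermark_scale_factor of 3000 would
correspond to 30% of the available pages, matching the limit stated above.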

A high rate of threads entering direct reclaim (allocstall) or kswapd
going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
that the number of free pages kswapd maintains for latency reasons is
too small for the allocation bursts occurring in the system. This knob
can then be used to tune kswapd aggressiveness accordingly.


zone_reclaim_mode
=================

Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.

This value is a bitwise OR of the following:

= ===================================
1 Zone reclaim on
2 Zone reclaim writes dirty pages out
4 Zone reclaim swaps pages
= ===================================

zone_reclaim_mode is disabled by default.
For file servers or workloads
that benefit from having their data cached, zone_reclaim_mode should be
left disabled as the caching effect is likely to be more important than
data locality.

Consider enabling one or more zone_reclaim mode bits if it's known that the
workload is partitioned such that each partition fits within a NUMA node
and that accessing remote memory would cause a measurable performance
reduction. The page allocator will take additional actions before
allocating off node pages.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore but it preserves the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.
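
Because zone_reclaim_mode is a combination of bit values, it can help to
decode a setting into the behaviours it enables. The sketch below does that
using the bit values listed in this section. The helper name
``describe_zone_reclaim_mode`` is hypothetical, and the code does not read
or write /proc/sys/vm/zone_reclaim_mode; it only interprets a number.

```python
# Bit values and descriptions as listed for zone_reclaim_mode in this section.
ZONE_RECLAIM_BITS = {
    1: "Zone reclaim on",
    2: "Zone reclaim writes dirty pages out",
    4: "Zone reclaim swaps pages",
}


def describe_zone_reclaim_mode(value: int) -> list:
    """Return the descriptions of the bits set in a zone_reclaim_mode value."""
    if value == 0:
        # Zero means no zone reclaim occurs (the default).
        return ["No zone reclaim occurs"]
    return [text for bit, text in ZONE_RECLAIM_BITS.items() if value & bit]


# Example: 1 | 2 turns zone reclaim on and lets it write out dirty pages.
print(describe_zone_reclaim_mode(1 | 2))
```

A value of 7 would select all three behaviours, while 0 leaves zone reclaim
disabled.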