==========================
Memory Resource Controller
==========================

NOTE:
   This document is hopelessly outdated and needs a complete rewrite.
   It still contains useful information, so we are keeping it here, but
   make sure to check the current code if you need a deeper
   understanding.

NOTE:
   The Memory Resource Controller is generically referred to as the
   memory controller in this document. Do not confuse the memory
   controller used here with the memory controller that is used in
   hardware.

(For editors) In this document:
   When we mention a cgroup (cgroupfs directory) with the memory
   controller, we call it a "memory cgroup". In git logs and source
   code, patch titles and function names tend to use the shorter
   "memcg"; in this document, we avoid using it.

Benefits and Purpose of the memory controller
=============================================

The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12] mentions some probable
uses of the memory controller. The memory controller can be used to

a. Isolate an application or a group of applications.
   Memory-hungry applications can be isolated and limited to a smaller
   amount of memory.
b. Create a cgroup with a limited amount of memory; this can be used
   as a good alternative to booting with mem=XXXX.
c. Virtualization solutions can control the amount of memory they want
   to assign to a virtual machine instance.
d. A CD/DVD burner could control the amount of memory used by the
   rest of the system to ensure that burning does not fail due to lack
   of available memory.
e. There are several other use cases; find one, or use the controller just
   for fun (to learn and hack on the VM subsystem).

Current Status: linux-2.6.34-mmotm (development version of April 2010)

Features:

 - accounting of anonymous pages, file caches, and swap caches, and
   limiting their usage.
 - pages are linked to per-memcg LRU lists exclusively; there is no
   global LRU.
 - optionally, memory+swap usage can be accounted and limited.
 - hierarchical accounting.
 - soft limits.
 - moving (recharging) charges at task migration is selectable.
 - usage threshold notifier.
 - memory pressure notifier.
 - oom-killer disable knob and oom-notifier.
 - the root cgroup has no limit controls.

 Kernel memory support is a work in progress; the current version provides
 basic functionality. (See Section 2.7.)

Brief summary of control files.


==================================== ==========================================
 tasks                               attach a task(thread) and show list of
                                     threads
 cgroup.procs                        show list of processes
 cgroup.event_control                an interface for event_fd()
 memory.usage_in_bytes               show current usage for memory
                                     (See 5.5 for details)
 memory.memsw.usage_in_bytes         show current usage for memory+Swap
                                     (See 5.5 for details)
 memory.limit_in_bytes               set/show limit of memory usage
 memory.memsw.limit_in_bytes         set/show limit of memory+Swap usage
 memory.failcnt                      show the number of memory usage hits limits
 memory.memsw.failcnt                show the number of memory+Swap hits limits
 memory.max_usage_in_bytes           show max memory usage recorded
 memory.memsw.max_usage_in_bytes     show max memory+Swap usage recorded
 memory.soft_limit_in_bytes          set/show soft limit of memory usage
 memory.stat                         show various statistics
 memory.use_hierarchy                set/show hierarchical account enabled
 memory.force_empty                  trigger forced page reclaim
 memory.pressure_level               set memory pressure notifications
 memory.swappiness                   set/show swappiness parameter of vmscan
                                     (See sysctl's vm.swappiness)
 memory.move_charge_at_immigrate     set/show controls of moving charges
 memory.oom_control                  set/show oom controls.
 memory.numa_stat                    show the number of memory usage per numa
                                     node

 memory.kmem.limit_in_bytes          set/show hard limit for kernel memory
 memory.kmem.usage_in_bytes          show current kernel memory allocation
 memory.kmem.failcnt                 show the number of kernel memory usage
                                     hits limits
 memory.kmem.max_usage_in_bytes      show max kernel memory usage recorded

 memory.kmem.tcp.limit_in_bytes      set/show hard limit for tcp buf memory
 memory.kmem.tcp.usage_in_bytes      show current tcp buf memory allocation
 memory.kmem.tcp.failcnt             show the number of tcp buf memory usage
                                     hits limits
 memory.kmem.tcp.max_usage_in_bytes  show max tcp buf memory usage recorded
==================================== ==========================================

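
Once the controller is mounted (see section 3.1 below), these files appear
in every memory cgroup directory. As a quick orientation, a hypothetical
sketch (the group name "demo" is made up)::

    # mkdir /sys/fs/cgroup/memory/demo
    # ls /sys/fs/cgroup/memory/demo
    # cat /sys/fs/cgroup/memory/demo/memory.usage_in_bytes
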

1. History
==========

The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]. At the time the RFC was posted,
there were several implementations for memory control. The goal of the
RFC was to build consensus and agreement for the minimal features required
for memory control. The first RSS controller was posted by Balbir Singh [2]
in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
RSS controller. At OLS, at the resource management BoF, everyone suggested
that we handle both page cache and RSS together. Another request was raised
to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11].

2. Memory Control
=================

Memory is a unique resource in the sense that it is present in a limited
amount. If a task requires a lot of CPU processing, the task can spread
its processing over a period of hours, days, months or years, but with
memory, the same physical memory needs to be reused to accomplish the task.

The memory controller implementation has been divided into phases. These
are:

1. Memory controller
2. mlock(2) controller
3. Kernel user memory accounting and slab control
4. user mappings length controller

The memory controller is the first controller developed.

2.1. Design
-----------

The core of the design is a counter called the page_counter. The
page_counter tracks the current memory usage and limit of the group of
processes associated with the controller. Each cgroup has a memory controller
specific data structure (mem_cgroup) associated with it.

2.2. Accounting
---------------

::

                 +--------------------+
                 |     mem_cgroup     |
                 |   (page_counter)   |
                 +--------------------+
                  /        ^        \
                 /         |         \
       +---------------+   |   +---------------+
       |  mm_struct    |   |...|  mm_struct    |
       |               |   |   |               |
       +---------------+   |   +---------------+
                           |
                           +-----------+
                                       |
       +---------------+       +-------+-------+
       |     page      +------>|  page_cgroup  |
       |               |       |               |
       +---------------+       +---------------+

           (Figure 1: Hierarchy of Accounting)

Figure 1 shows the important aspects of the controller

1. Accounting happens per cgroup
2. Each mm_struct knows about which cgroup it belongs to
3. Each page has a pointer to the page_cgroup, which in turn knows the
   cgroup it belongs to

The accounting is done as follows: mem_cgroup_charge_common() is invoked to
set up the necessary data structures and check if the cgroup that is being
charged is over its limit. If it is, then reclaim is invoked on the cgroup.
More details can be found in the reclaim section of this document.
If everything goes well, a page meta-data structure called page_cgroup is
updated. page_cgroup has its own LRU on the cgroup.
(*) The page_cgroup structure is allocated at boot/memory-hotplug time.

2.2.1 Accounting details
------------------------

All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
Some pages which are never reclaimable and will not be on the LRU
are not accounted. We just account pages under usual VM management.

RSS pages are accounted at page fault time unless they've already been
accounted for earlier. A file page is accounted as Page Cache when it's
inserted into the inode (radix-tree). While it's mapped into the page
tables of processes, duplicate accounting is carefully avoided.

An RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from the radix-tree. Even if RSS pages are
fully unmapped (by kswapd), they may exist as SwapCache in the system until
they are really freed. Such SwapCaches are also accounted.
A swapped-in page is not accounted until it's mapped.

Note: The kernel does swapin-readahead and reads multiple swap entries at
once. This means swapped-in pages may contain pages for tasks other than
the task causing the page fault. So, we avoid accounting at swap-in I/O.

At page migration, accounting information is kept.

Note: we just account pages-on-LRU because our purpose is to control the
amount of used pages; not-on-LRU pages tend to be out of control from the
VM's point of view.

2.3 Shared Page Accounting
--------------------------

Shared pages are accounted on the basis of the first-touch approach: the
cgroup that first touches a page is charged for it (see the sketch at the
end of this section). The principle behind this approach is that a cgroup
that aggressively uses a shared page will eventually get charged for it
(once it is uncharged from the cgroup that brought it in -- this will
happen on memory pressure).

But see section 8.2: when moving a task to another cgroup, its pages may
be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.

Exception: when CONFIG_MEMCG_SWAP is not used.
When you do swapoff, forcing swapped-out pages of shmem (tmpfs) back into
memory, the charges for those pages are accounted to the caller of swapoff
rather than to the users of the shmem.

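
A hypothetical illustration of the first-touch rule (the group names and the
file path are made up; paths follow the mount layout from section 3.1).
Reading a file from group A charges the page cache to A; re-reading the
already-cached file from group B adds little or nothing to B's usage::

    # mkdir /sys/fs/cgroup/memory/A /sys/fs/cgroup/memory/B
    # echo $$ > /sys/fs/cgroup/memory/A/tasks
    # cat /some/big/file > /dev/null    # page cache charged to A
    # echo $$ > /sys/fs/cgroup/memory/B/tasks
    # cat /some/big/file > /dev/null    # already cached; B is barely charged
    # cat /sys/fs/cgroup/memory/A/memory.usage_in_bytes
    # cat /sys/fs/cgroup/memory/B/memory.usage_in_bytes
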

2.4 Swap Extension (CONFIG_MEMCG_SWAP)
--------------------------------------

The Swap Extension allows you to record charges for swap as well. A
swapped-in page is charged back to the original page allocator if possible.

When swap is accounted, the following files are added:

 - memory.memsw.usage_in_bytes
 - memory.memsw.limit_in_bytes

memsw means memory+swap. Usage of memory+swap is limited by
memsw.limit_in_bytes.

Example: Assume a system with 4G of swap. A task which allocates 6G of
memory (by mistake) under a 2G memory limit will use all the swap.
In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap
(see the sketch at the end of this section).
By using the memsw limit, you can avoid a system OOM which can be caused by
swap shortage.

**why 'memory+swap' rather than swap**

The global LRU (kswapd) can swap out arbitrary pages. Swap-out means
moving a charge from memory to swap; there is no change in the usage of
memory+swap. In other words, when we want to limit the usage of swap
without affecting the global LRU, a memory+swap limit is better than just
limiting swap, from an OS point of view.

**What happens when a cgroup hits memory.memsw.limit_in_bytes**

When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by the cgroup routine and
file caches are dropped instead. But as mentioned above, the global LRU can
still swap out memory from the cgroup for the sanity of the system's memory
management state. You cannot forbid it by cgroup.

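
A minimal sketch of the example above, assuming the cgroup "0" created in
section 3.2. The memory limit must be set first, since memsw.limit_in_bytes
must always be greater than or equal to limit_in_bytes::

    # echo 2G > /sys/fs/cgroup/memory/0/memory.limit_in_bytes
    # echo 3G > /sys/fs/cgroup/memory/0/memory.memsw.limit_in_bytes

With this setup, the runaway 6G allocation from the example is capped at 2G
of memory plus at most 1G of swap.
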

2.5 Reclaim
-----------

Each cgroup maintains a per-cgroup LRU which has the same structure as the
global VM's. When a cgroup goes over its limit, we first try
to reclaim memory from the cgroup so as to make space for the new
pages that the cgroup has touched. If the reclaim is unsuccessful,
an OOM routine is invoked to select and kill the bulkiest task in the
cgroup. (See 10. OOM Control below.)

The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per-cgroup LRU
list.

NOTE:
   Reclaim does not work for the root cgroup, since we cannot set any
   limits on the root cgroup.

Note2:
   When panic_on_oom is set to "2", the whole system will panic.

When an oom event notifier is registered, the event will be delivered.
(See the oom_control section.)

2.6 Locking
-----------

   lock_page_cgroup()/unlock_page_cgroup() should not be called under
   the i_pages lock.

   Other lock order is as follows:

      PG_locked.
        mm->page_table_lock
          pgdat->lru_lock
            lock_page_cgroup.

   In many cases, just lock_page_cgroup() is called.

   The per-zone-per-cgroup LRU (the cgroup's private LRU) is guarded only
   by pgdat->lru_lock; it has no lock of its own.

2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------

With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is
fundamentally different from user memory, since it can't be swapped out,
which makes it possible to DoS the system by consuming too much of this
precious resource.

Kernel memory accounting is enabled for all memory cgroups by default. But
it can be disabled system-wide by passing cgroup.memory=nokmem to the kernel
at boot time. In this case, kernel memory will not be accounted at all.

Kernel memory limits are not imposed for the root cgroup. Usage for the root
cgroup may or may not be accounted. The memory used is accumulated into
memory.kmem.usage_in_bytes, or in a separate counter when it makes sense
(currently only for tcp).

The main "kmem" counter is fed into the main counter, so kmem charges will
also be visible from the user counter.

Currently no soft limit is implemented for kernel memory. It is future work
to trigger slab reclaim when those limits are reached.

2.7.1 Current Kernel Memory resources accounted
-----------------------------------------------

stack pages:
   every process consumes some stack pages. By accounting into
   kernel memory, we prevent new processes from being created when the
   kernel memory usage is too high.

slab pages:
   pages allocated by the SLAB or SLUB allocator are tracked. A copy
   of each kmem_cache is created the first time the cache is touched from
   inside the memcg. The creation is done lazily, so some objects can still
   be skipped while the cache is being created. All objects in a slab page
   should belong to the same memcg. This only fails to hold when a task is
   migrated to a different memcg during the page allocation by the cache.

sockets memory pressure:
   some socket protocols have memory pressure
   thresholds. The Memory Controller allows them to be controlled
   individually per cgroup, instead of globally.

tcp memory pressure:
   sockets memory pressure for the tcp protocol.

2.7.2 Common use cases
----------------------

Because the "kmem" counter is fed into the main user counter, kernel memory
can never be limited completely independently of user memory. Say "U" is
the user limit, and "K" the kernel limit. There are three possible ways
limits can be set (see the sketch after this list):

U != 0, K = unlimited:
    This is the standard memcg limitation mechanism already present before
    kmem accounting. Kernel memory is completely ignored.

U != 0, K < U:
    Kernel memory is a subset of the user memory. This setup is useful in
    deployments where the total amount of memory per-cgroup is
    overcommitted. Overcommitting kernel memory limits is definitely not
    recommended, since the box can still run out of non-reclaimable memory.
    In this case, the admin could set up K so that the sum of all groups is
    never greater than the total memory, and freely set U at the cost of
    QoS.

WARNING:
    In the current implementation, memory reclaim will NOT be
    triggered for a cgroup when it hits K while staying below U, which
    makes this setup impractical.

U != 0, K >= U:
    Since kmem charges are also fed into the user counter, reclaim is
    triggered for the cgroup for both kinds of memory. This setup gives the
    admin a unified view of memory, and it is also useful for people who
    just want to track kernel memory usage.

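
As a sketch of the "K < U" case with hypothetical values (a 512M user limit
and a 64M kernel-memory limit on a made-up group "g" under the section 3.1
mount). Note the WARNING above: hitting K while staying below U does not
trigger reclaim in the current implementation::

    # mkdir /sys/fs/cgroup/memory/g
    # echo 512M > /sys/fs/cgroup/memory/g/memory.limit_in_bytes       # U
    # echo 64M  > /sys/fs/cgroup/memory/g/memory.kmem.limit_in_bytes  # K < U
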

3. User Interface
=================

3.0. Configuration
------------------

a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG
c. Enable CONFIG_MEMCG_SWAP (to use the swap extension)
d. Enable CONFIG_MEMCG_KMEM (to use the kmem extension)

3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
-------------------------------------------------------------------

::

    # mount -t tmpfs none /sys/fs/cgroup
    # mkdir /sys/fs/cgroup/memory
    # mount -t cgroup none /sys/fs/cgroup/memory -o memory

3.2. Make the new group and move bash into it::

    # mkdir /sys/fs/cgroup/memory/0
    # echo $$ > /sys/fs/cgroup/memory/0/tasks

Now that we're in the 0 cgroup, we can alter the memory limit::

    # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes

NOTE:
   We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
   mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
   Gibibytes.)

NOTE:
   We can write "-1" to reset ``*.limit_in_bytes`` (unlimited).

NOTE:
   We cannot set limits on the root cgroup any more.

::

    # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
    4194304

We can check the usage::

    # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
    1216512

A successful write to this file does not guarantee that the limit was set
to the value written into the file. This can be due to a
number of factors, such as rounding up to page boundaries or the total
availability of memory on the system. The user is required to re-read
this file after a write to get the value committed by the kernel::

    # echo 1 > memory.limit_in_bytes
    # cat memory.limit_in_bytes
    4096

The memory.failcnt field gives the number of times that the cgroup limit
was exceeded.

The memory.stat file gives accounting information. Currently, the number of
caches, RSS and Active/Inactive pages are shown.

4. Testing
==========

For testing features and implementation, see memcg_test.txt.

Performance testing is also important. To see the pure overhead of the
memory controller, testing on tmpfs will give you good numbers with small
overheads. Example: do a kernel make on tmpfs.

Page-fault scalability is also important. When measuring a parallel
page-fault test, a multi-process test may be better than a multi-thread
test because the latter adds noise from shared objects/status.

But the above two are tests of extreme situations.
Trying a usual test under the memory controller is always helpful.

4.1 Troubleshooting
-------------------

Sometimes a user might find that the application under a cgroup is
terminated by the OOM killer. There are several causes for this:

1. The cgroup limit is too low (just too low to do anything useful).
2. The user is using anonymous memory and swap is turned off or too low.

A sync followed by echo 1 > /proc/sys/vm/drop_caches will help get rid of
some of the pages cached in the cgroup (page cache pages), as shown below.

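
As a sketch (both commands need root, and drop_caches is system-wide, not
per-cgroup)::

    # sync
    # echo 1 > /proc/sys/vm/drop_caches
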
"Move charges at task migration" 492 4934.3 Removing a cgroup 494--------------------- 495 496A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a 497cgroup might have some charge associated with it, even though all 498tasks have migrated away from it. (because we charge against pages, not 499against tasks.) 500 501We move the stats to root (if use_hierarchy==0) or parent (if 502use_hierarchy==1), and no change on the charge except uncharging 503from the child. 504 505Charges recorded in swap information is not updated at removal of cgroup. 506Recorded information is discarded and a cgroup which uses swap (swapcache) 507will be charged as a new owner of it. 508 509About use_hierarchy, see Section 6. 510 5115. Misc. interfaces 512=================== 513 5145.1 force_empty 515--------------- 516 memory.force_empty interface is provided to make cgroup's memory usage empty. 517 When writing anything to this:: 518 519 # echo 0 > memory.force_empty 520 521 the cgroup will be reclaimed and as many pages reclaimed as possible. 522 523 The typical use case for this interface is before calling rmdir(). 524 Though rmdir() offlines memcg, but the memcg may still stay there due to 525 charged file caches. Some out-of-use page caches may keep charged until 526 memory pressure happens. If you want to avoid that, force_empty will be useful. 527 528 Also, note that when memory.kmem.limit_in_bytes is set the charges due to 529 kernel pages will still be seen. This is not considered a failure and the 530 write will still return success. In this case, it is expected that 531 memory.kmem.usage_in_bytes == memory.usage_in_bytes. 532 533 About use_hierarchy, see Section 6. 534 5355.2 stat file 536------------- 537 538memory.stat file includes following statistics 539 540per-memory cgroup local status 541^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 542 543=============== =============================================================== 544cache # of bytes of page cache memory. 545rss # of bytes of anonymous and swap cache memory (includes 546 transparent hugepages). 547rss_huge # of bytes of anonymous transparent hugepages. 548mapped_file # of bytes of mapped file (includes tmpfs/shmem) 549pgpgin # of charging events to the memory cgroup. The charging 550 event happens each time a page is accounted as either mapped 551 anon page(RSS) or cache page(Page Cache) to the cgroup. 552pgpgout # of uncharging events to the memory cgroup. The uncharging 553 event happens each time a page is unaccounted from the cgroup. 554swap # of bytes of swap usage 555dirty # of bytes that are waiting to get written back to the disk. 556writeback # of bytes of file/anon cache that are queued for syncing to 557 disk. 558inactive_anon # of bytes of anonymous and swap cache memory on inactive 559 LRU list. 560active_anon # of bytes of anonymous and swap cache memory on active 561 LRU list. 562inactive_file # of bytes of file-backed memory on inactive LRU list. 563active_file # of bytes of file-backed memory on active LRU list. 564unevictable # of bytes of memory that cannot be reclaimed (mlocked etc). 

status considering hierarchy (see memory.use_hierarchy settings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================= ===================================================
hierarchical_memory_limit # of bytes of memory limit with regard to the
                          hierarchy under which the memory cgroup is
hierarchical_memsw_limit  # of bytes of memory+swap limit with regard to
                          the hierarchy under which the memory cgroup is.

total_<counter>           # hierarchical version of <counter>, which in
                          addition to the cgroup's own value includes the
                          sum of all hierarchical children's values of
                          <counter>, e.g. total_cache
========================= ===================================================

The following additional stats are dependent on CONFIG_DEBUG_VM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================= ========================================
recent_rotated_anon       VM internal parameter. (see mm/vmscan.c)
recent_rotated_file       VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon       VM internal parameter. (see mm/vmscan.c)
recent_scanned_file       VM internal parameter. (see mm/vmscan.c)
========================= ========================================

Memo:
   recent_rotated means recent frequency of LRU rotation.
   recent_scanned means recent # of scans to LRU.
   These are shown for easier debugging; please see the code for their
   precise meaning.

Note:
   Only anonymous and swap cache memory is listed as part of the 'rss'
   stat. This should not be confused with the true 'resident set size' or
   the amount of physical memory used by the cgroup.

   'rss + mapped_file' will give you the resident set size of the cgroup.

   (Note: file and shmem may be shared among other cgroups. In that case,
   mapped_file is accounted only when the memory cgroup is the owner of the
   page cache.)

5.3 swappiness
--------------

Overrides /proc/sys/vm/swappiness for the particular group. The tunable
in the root cgroup corresponds to the global swappiness setting.

Please note that, unlike during global reclaim, limit reclaim enforces
that a swappiness of 0 really prevents any swapping even if swap storage
is available. This might lead to the memcg OOM killer being invoked if
there are no file pages to reclaim.

5.4 failcnt
-----------

A memory cgroup provides the memory.failcnt and memory.memsw.failcnt files.
This failcnt (== failure count) shows the number of times that a usage
counter hit its limit. When a memory cgroup hits a limit, failcnt increases
and memory under it will be reclaimed.

You can reset failcnt by writing 0 to the failcnt file::

    # echo 0 > .../memory.failcnt

5.5 usage_in_bytes
------------------

For efficiency, as with other kernel components, the memory cgroup uses
some optimization to avoid unnecessary cacheline false sharing.
usage_in_bytes is affected by this optimization and doesn't show the
'exact' value of memory (and swap) usage; it's a fuzzy value for efficient
access. (Of course, it's synchronized when necessary.)
If you want to know the more exact memory usage, you should use the
RSS+CACHE(+SWAP) value in memory.stat (see 5.2), as in the sketch below.

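
A sketch of computing that more exact value with awk, assuming the cgroup
directory is the current directory (rss, cache and swap are the local
memory.stat fields from 5.2; swap appears only with CONFIG_MEMCG_SWAP)::

    # awk '$1 == "rss" || $1 == "cache" || $1 == "swap" { sum += $2 }
           END { print sum }' memory.stat
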

5.6 numa_stat
-------------

This is similar to numa_maps but operates on a per-memcg basis. This is
useful for providing visibility into the numa locality information within
a memcg, since the pages are allowed to be allocated from any physical
node. One of the use cases is evaluating application performance by
combining this information with the application's CPU allocation.

Each memcg's numa_stat file includes "total", "file", "anon" and
"unevictable" per-node page counts, as well as "hierarchical_<counter>"
entries which sum up all hierarchical children's values in addition to the
memcg's own value.

The output format of memory.numa_stat is::

  total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
  file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
  anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
  unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...
  hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...

The "total" count is the sum of file + anon + unevictable.

6. Hierarchy support
====================

The memory controller supports a deep hierarchy and hierarchical
accounting. The hierarchy is created by creating the appropriate cgroups in
the cgroup filesystem. Consider, for example, the following cgroup
filesystem hierarchy::

               root
              /  |  \
             /   |   \
            a    b    c
                      |  \
                      |   \
                      d    e

In the diagram above, with hierarchical accounting enabled, all memory
usage of e is accounted to its ancestors up until the root (i.e., c and
root) that have memory.use_hierarchy enabled. If one of the ancestors goes
over its limit, the reclaim algorithm reclaims from the tasks in the
ancestor and the children of the ancestor.

6.1 Enabling hierarchical accounting and reclaim
------------------------------------------------

A memory cgroup by default disables the hierarchy feature. Support
can be enabled by writing 1 to the memory.use_hierarchy file of the root
cgroup::

    # echo 1 > memory.use_hierarchy

The feature can be disabled by::

    # echo 0 > memory.use_hierarchy

NOTE1:
   Enabling/disabling will fail if either the cgroup already has other
   cgroups created below it, or if the parent cgroup has use_hierarchy
   enabled.

NOTE2:
   When panic_on_oom is set to "2", the whole system will panic in
   case of an OOM event in any cgroup.

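
As a sketch of building the "c"/"d" part of the diagram above (paths assume
the section 3.1 mount; the 100M limit is a made-up value). With
use_hierarchy enabled on c, usage in d is charged to c as well, so c's
limit caps c and d together::

    # mkdir /sys/fs/cgroup/memory/c
    # echo 1 > /sys/fs/cgroup/memory/c/memory.use_hierarchy
    # mkdir /sys/fs/cgroup/memory/c/d
    # echo 100M > /sys/fs/cgroup/memory/c/memory.limit_in_bytes
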

7. Soft limits
==============

Soft limits allow for greater sharing of memory. The idea behind soft
limits is to allow control groups to use as much of the memory as needed,
provided

a. There is no memory contention
b. They do not exceed their hard limit

When the system detects memory contention or low memory, control groups
are pushed back to their soft limits. If the soft limit of each control
group is very high, they are pushed back as much as possible to make
sure that one control group does not starve the others of memory.

Please note that soft limits are a best-effort feature; they come with
no guarantees, but they do their best to make sure that when memory is
heavily contended for, memory is allocated based on the soft limit
hints/setup. Currently, soft limit based reclaim is set up such that
it gets invoked from balance_pgdat (kswapd).

7.1 Interface
-------------

Soft limits can be set up by using the following commands (in this example
we assume a soft limit of 256 MiB)::

    # echo 256M > memory.soft_limit_in_bytes

If we want to change this to 1G, we can at any time use::

    # echo 1G > memory.soft_limit_in_bytes

NOTE1:
   Soft limits take effect over a long period of time, since they involve
   reclaiming memory for balancing between memory cgroups.
NOTE2:
   It is recommended to set the soft limit always below the hard limit,
   otherwise the hard limit will take precedence.

8. Move charges at task migration
=================================

Users can move the charges associated with a task along with task
migration, that is, uncharge the task's pages from the old cgroup and
charge them to the new cgroup.
This feature is not supported in !CONFIG_MMU environments because of lack
of page tables.

8.1 Interface
-------------

This feature is disabled by default. It can be enabled (and disabled again)
by writing to memory.move_charge_at_immigrate of the destination cgroup.

If you want to enable it::

    # echo (some positive value) > memory.move_charge_at_immigrate

Note:
   Each bit of move_charge_at_immigrate has its own meaning about what type
   of charges should be moved. See 8.2 for details, and the sketch after
   the table there.
Note:
   Charges are moved only when you move mm->owner, in other words,
   a leader of a thread group.
Note:
   If we cannot find enough space for the task in the destination cgroup,
   we try to make space by reclaiming memory. Task migration may fail if
   we cannot make enough space.
Note:
   It can take several seconds if you move a large amount of charges.

And if you want to disable it again::

    # echo 0 > memory.move_charge_at_immigrate

8.2 Type of charges which can be moved
--------------------------------------

Each bit in move_charge_at_immigrate has its own meaning about what type of
charges should be moved. But in any case, it must be noted that an account
of a page or a swap can be moved only when it is charged to the task's
current (old) memory cgroup.

+---+--------------------------------------------------------------------------+
|bit| what type of charges would be moved?                                     |
+===+==========================================================================+
| 0 | A charge of an anonymous page (or swap of it) used by the target task.  |
|   | You must enable Swap Extension (see 2.4) to enable move of swap charges.|
+---+--------------------------------------------------------------------------+
| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory)|
|   | and swaps of tmpfs file) mmapped by the target task. Unlike the case of |
|   | anonymous pages, file pages (and swaps) in the range mmapped by the task|
|   | will be moved even if the task hasn't done page fault, i.e. they might  |
|   | not be the task's "RSS", but other task's "RSS" that maps the same file.|
|   | And mapcount of the page is ignored (the page can be moved even if      |
|   | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to   |
|   | enable move of swap charges.                                             |
+---+--------------------------------------------------------------------------+

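
As a sketch combining 8.1 and the table above, run from the destination
cgroup's directory: writing 3 (bits 0 and 1 set) enables moving both
anonymous and file-page charges, and the charges move when the thread-group
leader is migrated::

    # echo 3 > memory.move_charge_at_immigrate    # bit 0 + bit 1
    # echo $$ > tasks                             # migrate; charges follow
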

8.3 TODO
--------

- All moving-charge operations are done under cgroup_mutex. It's not good
  behavior to hold the mutex too long, so we may need some trick.

9. Memory thresholds
====================

The memory cgroup implements memory thresholds using the cgroups
notification API (see cgroups.txt). It allows you to register multiple
memory and memsw thresholds and receive notifications when a threshold is
crossed.

To register a threshold, an application must:

- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>"
  to cgroup.event_control.

The application will be notified through the eventfd when memory usage
crosses the threshold in either direction.

This works for both root and non-root cgroups.

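
As a sketch, the cgroup_event_listener helper shipped in the kernel tree
(tools/cgroup/; it is also used by the test in section 11) performs these
three steps for you. Here it is assumed to be in PATH and run from the
cgroup directory, with a made-up 5M threshold::

    # cgroup_event_listener memory.usage_in_bytes 5M

The helper blocks on the eventfd and reports each time usage crosses the
threshold.
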

10. OOM Control
===============

The memory.oom_control file is for OOM notification and other controls.

The memory cgroup implements an OOM notifier using the cgroup notification
API (see cgroups.txt). It allows you to register multiple OOM notification
deliveries and receive a notification when an OOM happens.

To register a notifier, an application must:

 - create an eventfd using eventfd(2)
 - open the memory.oom_control file
 - write a string like "<event_fd> <fd of memory.oom_control>" to
   cgroup.event_control

The application will be notified through the eventfd when an OOM happens.
OOM notification doesn't work for the root cgroup.

You can disable the OOM-killer by writing "1" to the memory.oom_control
file, as::

    # echo 1 > memory.oom_control

If the OOM-killer is disabled, tasks under the cgroup will hang/sleep
in the memory cgroup's OOM-waitqueue when they request accountable memory.

To let them run again, you have to relax the memory cgroup's OOM status by

    * enlarging the limit or reducing usage.

To reduce usage,

    * kill some tasks.
    * move some tasks to another group with account migration.
    * remove some files (on tmpfs?)

Then, the stopped tasks will work again.

On reading, the current status of OOM is shown:

    - oom_kill_disable 0 or 1
      (if 1, the oom-killer is disabled)
    - under_oom        0 or 1
      (if 1, the memory cgroup is under OOM, and tasks may be stopped.)

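
A sketch of this flow, run from the cgroup directory with a hypothetical 1G
limit: disable the killer, check the status fields, then relax the limit so
that tasks sleeping on the OOM waitqueue can resume::

    # echo 1 > memory.oom_control      # disable the OOM-killer
    # cat memory.oom_control           # shows oom_kill_disable and under_oom
    # echo 1G > memory.limit_in_bytes  # enlarge the limit to resume tasks
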

11. Memory Pressure
===================

The pressure level notifications can be used to monitor the memory
allocation cost; based on the pressure, applications can implement
different strategies for managing their memory resources. The pressure
levels are defined as follows:

The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache levels. Upon notification, the program (typically an
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shut down unimportant services).

The "medium" level means that the system is experiencing medium memory
pressure; the system might be swapping, paging out active file caches,
etc. Upon this event, applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from disk.

The "critical" level means that the system is actively thrashing, it is
about to run out of memory (OOM), or the in-kernel OOM killer is about to
trigger. Applications should do whatever they can to help the system. It
might be too late to consult with vmstat or any other statistics, so it's
advisable to take immediate action.

By default, events are propagated upward until the event is handled, i.e.
the events are not pass-through. For example, say you have three cgroups:
A->B->C. Now you set up an event listener on cgroups A, B and C, and
suppose group C experiences some pressure. In this situation, only group C
will receive the notification, i.e. groups A and B will not receive it.
This is done to avoid excessive "broadcasting" of messages, which disturbs
the system and which is especially bad if we are low on memory or
thrashing. Group B will receive a notification only if there are no event
listeners for group C.

There are three optional modes that specify different propagation behavior:

 - "default": this is the default behavior specified above. This mode is
   the same as omitting the optional mode parameter, preserved for
   backwards compatibility.

 - "hierarchy": events always propagate up to the root, similar to the
   default behavior, except that propagation continues regardless of
   whether there are event listeners at each level. In the above example,
   groups A, B, and C will receive notification of memory pressure.

 - "local": events are pass-through, i.e. a listener only receives
   notifications when memory pressure is experienced in the memcg for which
   the notification is registered. In the above example, group C will
   receive a notification if registered for "local" notification and the
   group experiences memory pressure. However, group B will never receive a
   notification if it is registered for local notification, regardless of
   whether there is an event listener for group C.

The level and event notification mode ("hierarchy" or "local", if
necessary) are specified by a comma-delimited string, e.g. "low,hierarchy"
specifies hierarchical, pass-through notification for all ancestor memcgs.
Notification that is the default, non pass-through behavior does not
specify a mode. "medium,local" specifies pass-through notification for the
medium level.

The file memory.pressure_level is only used to set up an eventfd. To
register a notification, an application must:

- create an eventfd using eventfd(2);
- open memory.pressure_level;
- write a string like "<event_fd> <fd of memory.pressure_level> <level[,mode]>"
  to cgroup.event_control.

The application will be notified through the eventfd when memory pressure
is at the specific level (or higher). Read/write operations on
memory.pressure_level are not implemented.

Test:

   Here is a small script example that makes a new cgroup, sets up a
   memory limit, sets up a notification in the cgroup and then makes the
   child cgroup experience a critical pressure::

    # cd /sys/fs/cgroup/memory/
    # mkdir foo
    # cd foo
    # cgroup_event_listener memory.pressure_level low,hierarchy &
    # echo 8000000 > memory.limit_in_bytes
    # echo 8000000 > memory.memsw.limit_in_bytes
    # echo $$ > tasks
    # dd if=/dev/zero | read x

   (Expect a bunch of notifications, and eventually, the oom-killer will
   trigger.)

12. TODO
========

1. Make per-cgroup scanner reclaim not-shared pages first
2. Teach controller to account for shared-pages
3. Start reclamation in the background when the limit is
   not yet hit but the usage is getting closer

Summary
=======

Overall, the memory controller has been a stable controller and has been
commented on and discussed quite extensively in the community.

References
==========

1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
2. Singh, Balbir. Memory Controller (RSS Control),
   http://lwn.net/Articles/222762/
3. Emelianov, Pavel. Resource controllers based on process cgroups,
   http://lkml.org/lkml/2007/3/6/198
4. Emelianov, Pavel. RSS controller based on process cgroups (v2),
   http://lkml.org/lkml/2007/4/9/78
5. Emelianov, Pavel. RSS controller based on process cgroups (v3),
   http://lkml.org/lkml/2007/5/30/244
6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
7. Vaidyanathan, Srinivasan. Control Groups: Pagecache accounting and
   control subsystem (v3), http://lwn.net/Articles/235534/
8. Singh, Balbir. RSS controller v2 test results (lmbench),
   http://lkml.org/lkml/2007/5/17/232
9. Singh, Balbir. RSS controller v2 AIM9 results,
   http://lkml.org/lkml/2007/5/18/1
10. Singh, Balbir. Memory controller v6 test results,
    http://lkml.org/lkml/2007/8/19/36
11. Singh, Balbir. Memory controller introduction (v6),
    http://lkml.org/lkml/2007/8/17/69
12. Corbet, Jonathan. Controlling memory use in cgroups,
    http://lwn.net/Articles/243795/