1.. _cgroup-v2: 2 3================ 4Control Group v2 5================ 6 7:Date: October, 2015 8:Author: Tejun Heo <tj@kernel.org> 9 10This is the authoritative documentation on the design, interface and 11conventions of cgroup v2. It describes all userland-visible aspects 12of cgroup including core and specific controller behaviors. All 13future changes must be reflected in this document. Documentation for 14v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`. 15 16.. CONTENTS 17 18 1. Introduction 19 1-1. Terminology 20 1-2. What is cgroup? 21 2. Basic Operations 22 2-1. Mounting 23 2-2. Organizing Processes and Threads 24 2-2-1. Processes 25 2-2-2. Threads 26 2-3. [Un]populated Notification 27 2-4. Controlling Controllers 28 2-4-1. Enabling and Disabling 29 2-4-2. Top-down Constraint 30 2-4-3. No Internal Process Constraint 31 2-5. Delegation 32 2-5-1. Model of Delegation 33 2-5-2. Delegation Containment 34 2-6. Guidelines 35 2-6-1. Organize Once and Control 36 2-6-2. Avoid Name Collisions 37 3. Resource Distribution Models 38 3-1. Weights 39 3-2. Limits 40 3-3. Protections 41 3-4. Allocations 42 4. Interface Files 43 4-1. Format 44 4-2. Conventions 45 4-3. Core Interface Files 46 5. Controllers 47 5-1. CPU 48 5-1-1. CPU Interface Files 49 5-2. Memory 50 5-2-1. Memory Interface Files 51 5-2-2. Usage Guidelines 52 5-2-3. Memory Ownership 53 5-3. IO 54 5-3-1. IO Interface Files 55 5-3-2. Writeback 56 5-3-3. IO Latency 57 5-3-3-1. How IO Latency Throttling Works 58 5-3-3-2. IO Latency Interface Files 59 5-4. PID 60 5-4-1. PID Interface Files 61 5-5. Cpuset 62 5.5-1. Cpuset Interface Files 63 5-6. Device 64 5-7. RDMA 65 5-7-1. RDMA Interface Files 66 5-8. HugeTLB 67 5.8-1. HugeTLB Interface Files 68 5-9. Misc 69 5.9-1 Miscellaneous cgroup Interface Files 70 5.9-2 Migration and Ownership 71 5-10. Others 72 5-10-1. perf_event 73 5-N. Non-normative information 74 5-N-1. CPU controller root cgroup process behaviour 75 5-N-2. IO controller root cgroup process behaviour 76 6. Namespace 77 6-1. Basics 78 6-2. The Root and Views 79 6-3. Migration and setns(2) 80 6-4. Interaction with Other Namespaces 81 P. Information on Kernel Programming 82 P-1. Filesystem Support for Writeback 83 D. Deprecated v1 Core Features 84 R. Issues with v1 and Rationales for v2 85 R-1. Multiple Hierarchies 86 R-2. Thread Granularity 87 R-3. Competition Between Inner Nodes and Threads 88 R-4. Other Interface Issues 89 R-5. Controller Issues and Remedies 90 R-5-1. Memory 91 92 93Introduction 94============ 95 96Terminology 97----------- 98 99"cgroup" stands for "control group" and is never capitalized. The 100singular form is used to designate the whole feature and also as a 101qualifier as in "cgroup controllers". When explicitly referring to 102multiple individual control groups, the plural form "cgroups" is used. 103 104 105What is cgroup? 106--------------- 107 108cgroup is a mechanism to organize processes hierarchically and 109distribute system resources along the hierarchy in a controlled and 110configurable manner. 111 112cgroup is largely composed of two parts - the core and controllers. 113cgroup core is primarily responsible for hierarchically organizing 114processes. A cgroup controller is usually responsible for 115distributing a specific type of system resource along the hierarchy 116although there are utility controllers which serve purposes other than 117resource distribution. 

cgroups form a tree structure and every process in the system belongs
to one and only one cgroup.  All threads of a process belong to the
same cgroup.  On creation, all processes are put in the cgroup that
the parent process belongs to at the time.  A process can be migrated
to another cgroup.  Migration of a process doesn't affect already
existing descendant processes.

Following certain structural constraints, controllers may be enabled or
disabled selectively on a cgroup.  All controller behaviors are
hierarchical - if a controller is enabled on a cgroup, it affects all
processes which belong to the cgroups comprising the inclusive
sub-hierarchy of the cgroup.  When a controller is enabled on a nested
cgroup, it always restricts the resource distribution further.  The
restrictions set closer to the root in the hierarchy cannot be
overridden from further away.


Basic Operations
================

Mounting
--------

Unlike v1, cgroup v2 has only a single hierarchy.  The cgroup v2
hierarchy can be mounted with the following mount command::

  # mount -t cgroup2 none $MOUNT_POINT

cgroup2 filesystem has the magic number 0x63677270 ("cgrp").  All
controllers which support v2 and are not bound to a v1 hierarchy are
automatically bound to the v2 hierarchy and show up at the root.
Controllers which are not in active use in the v2 hierarchy can be
bound to other hierarchies.  This allows mixing the v2 hierarchy with
the legacy v1 multiple hierarchies in a fully backward compatible way.

A controller can be moved across hierarchies only after the controller
is no longer referenced in its current hierarchy.  Because per-cgroup
controller states are destroyed asynchronously and controllers may
have lingering references, a controller may not show up immediately on
the v2 hierarchy after the final umount of the previous hierarchy.
Similarly, a controller should be fully disabled to be moved out of
the unified hierarchy and it may take some time for the disabled
controller to become available for other hierarchies; furthermore, due
to inter-controller dependencies, other controllers may need to be
disabled too.

While useful for development and manual configurations, moving
controllers dynamically between the v2 and other hierarchies is
strongly discouraged for production use.  It is recommended to settle
on the hierarchies and controller associations before starting to use
the controllers after system boot.

During transition to v2, system management software might still
automount the v1 cgroup filesystem and so hijack all controllers
during boot, before manual intervention is possible.  To make testing
and experimenting easier, the kernel parameter cgroup_no_v1= allows
disabling controllers in v1 and making them always available in v2.

cgroup v2 currently supports the following mount options.

  nsdelegate
        Consider cgroup namespaces as delegation boundaries.  This
        option is system wide and can only be set on mount or modified
        through remount from the init namespace.  The mount option is
        ignored on non-init namespace mounts.  Please refer to the
        Delegation section for details.

  memory_localevents
        Only populate memory.events with data for the current cgroup,
        and not any subtrees.  This is the legacy behaviour; the
        default behaviour without this option is to include subtree
        counts.
190 This option is system wide and can only be set on mount or 191 modified through remount from the init namespace. The mount 192 option is ignored on non-init namespace mounts. 193 194 memory_recursiveprot 195 Recursively apply memory.min and memory.low protection to 196 entire subtrees, without requiring explicit downward 197 propagation into leaf cgroups. This allows protecting entire 198 subtrees from one another, while retaining free competition 199 within those subtrees. This should have been the default 200 behavior but is a mount-option to avoid regressing setups 201 relying on the original semantics (e.g. specifying bogusly 202 high 'bypass' protection values at higher tree levels). 203 204 205Organizing Processes and Threads 206-------------------------------- 207 208Processes 209~~~~~~~~~ 210 211Initially, only the root cgroup exists to which all processes belong. 212A child cgroup can be created by creating a sub-directory:: 213 214 # mkdir $CGROUP_NAME 215 216A given cgroup may have multiple child cgroups forming a tree 217structure. Each cgroup has a read-writable interface file 218"cgroup.procs". When read, it lists the PIDs of all processes which 219belong to the cgroup one-per-line. The PIDs are not ordered and the 220same PID may show up more than once if the process got moved to 221another cgroup and then back or the PID got recycled while reading. 222 223A process can be migrated into a cgroup by writing its PID to the 224target cgroup's "cgroup.procs" file. Only one process can be migrated 225on a single write(2) call. If a process is composed of multiple 226threads, writing the PID of any thread migrates all threads of the 227process. 228 229When a process forks a child process, the new process is born into the 230cgroup that the forking process belongs to at the time of the 231operation. After exit, a process stays associated with the cgroup 232that it belonged to at the time of exit until it's reaped; however, a 233zombie process does not appear in "cgroup.procs" and thus can't be 234moved to another cgroup. 235 236A cgroup which doesn't have any children or live processes can be 237destroyed by removing the directory. Note that a cgroup which doesn't 238have any children and is associated only with zombie processes is 239considered empty and can be removed:: 240 241 # rmdir $CGROUP_NAME 242 243"/proc/$PID/cgroup" lists a process's cgroup membership. If legacy 244cgroup is in use in the system, this file may contain multiple lines, 245one for each hierarchy. The entry for cgroup v2 is always in the 246format "0::$PATH":: 247 248 # cat /proc/842/cgroup 249 ... 250 0::/test-cgroup/test-cgroup-nested 251 252If the process becomes a zombie and the cgroup it was associated with 253is removed subsequently, " (deleted)" is appended to the path:: 254 255 # cat /proc/842/cgroup 256 ... 257 0::/test-cgroup/test-cgroup-nested (deleted) 258 259 260Threads 261~~~~~~~ 262 263cgroup v2 supports thread granularity for a subset of controllers to 264support use cases requiring hierarchical resource distribution across 265the threads of a group of processes. By default, all threads of a 266process belong to the same cgroup, which also serves as the resource 267domain to host resource consumptions which are not specific to a 268process or thread. The thread mode allows threads to be spread across 269a subtree while still maintaining the common resource domain for them. 270 271Controllers which support thread mode are called threaded controllers. 
The ones which don't are called domain controllers.

Marking a cgroup threaded makes it join the resource domain of its
parent as a threaded cgroup.  The parent may be another threaded
cgroup whose resource domain is further up in the hierarchy.  The root
of a threaded subtree, that is, the nearest ancestor which is not
threaded, is called the threaded domain or thread root interchangeably
and serves as the resource domain for the entire subtree.

Inside a threaded subtree, threads of a process can be put in
different cgroups and are not subject to the no internal process
constraint - threaded controllers can be enabled on non-leaf cgroups
whether they have threads in them or not.

As the threaded domain cgroup hosts all the domain resource
consumptions of the subtree, it is considered to have internal
resource consumptions whether there are processes in it or not and
can't have populated child cgroups which aren't threaded.  Because the
root cgroup is not subject to the no internal process constraint, it
can serve both as a threaded domain and a parent to domain cgroups.

The current operation mode or type of the cgroup is shown in the
"cgroup.type" file which indicates whether the cgroup is a normal
domain, a domain which is serving as the domain of a threaded subtree,
or a threaded cgroup.

On creation, a cgroup is always a domain cgroup and can be made
threaded by writing "threaded" to the "cgroup.type" file.  The
operation is unidirectional::

  # echo threaded > cgroup.type

Once threaded, the cgroup can't be made a domain again.  To enable the
thread mode, the following conditions must be met.

- As the cgroup will join the parent's resource domain, the parent
  must either be a valid (threaded) domain or a threaded cgroup.

- When the parent is an unthreaded domain, it must not have any domain
  controllers enabled or populated domain children.  The root is
  exempt from this requirement.

Topology-wise, a cgroup can be in an invalid state.  Please consider
the following topology::

  A (threaded domain) - B (threaded) - C (domain, just created)

C is created as a domain but isn't connected to a parent which can
host child domains.  C can't be used until it is turned into a
threaded cgroup.  The "cgroup.type" file will report "domain (invalid)"
in these cases.  Operations which fail due to invalid topology use
EOPNOTSUPP as the errno.

A domain cgroup is turned into a threaded domain when one of its child
cgroups becomes threaded or threaded controllers are enabled in the
"cgroup.subtree_control" file while there are processes in the cgroup.
A threaded domain reverts to a normal domain when the conditions
clear.

When read, "cgroup.threads" contains the list of the thread IDs of all
threads in the cgroup.  Except that the operations are per-thread
instead of per-process, "cgroup.threads" has the same format and
behaves the same way as "cgroup.procs".  While "cgroup.threads" can be
written to in any cgroup, as it can only move threads inside the same
threaded domain, its operations are confined inside each threaded
subtree.

The threaded domain cgroup serves as the resource domain for the whole
subtree, and, while the threads can be scattered across the subtree,
all the processes are considered to be in the threaded domain cgroup.
"cgroup.procs" in a threaded domain cgroup contains the PIDs of all
processes in the subtree and is not readable in the subtree proper.
However, "cgroup.procs" can be written to from anywhere in the subtree
to migrate all threads of the matching process to the cgroup.
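
Putting the above together, the following sketch creates a small
threaded subtree and moves a single thread into it.  The cgroup name
and the thread ID are hypothetical, and the current directory is
assumed to be a domain cgroup which contains the target process, has
no domain controllers enabled in "cgroup.subtree_control" and has no
populated domain children::

  # mkdir pool
  # echo threaded > pool/cgroup.type      # one-way switch to threaded
  # cat cgroup.type                       # this cgroup became the threaded domain
  domain threaded
  # echo "+cpu" > cgroup.subtree_control  # cpu supports thread mode
  # echo 1234 > pool/cgroup.threads       # move one thread (TID 1234) into "pool"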
342"cgroup.procs" in a threaded domain cgroup contains the PIDs of all 343processes in the subtree and is not readable in the subtree proper. 344However, "cgroup.procs" can be written to from anywhere in the subtree 345to migrate all threads of the matching process to the cgroup. 346 347Only threaded controllers can be enabled in a threaded subtree. When 348a threaded controller is enabled inside a threaded subtree, it only 349accounts for and controls resource consumptions associated with the 350threads in the cgroup and its descendants. All consumptions which 351aren't tied to a specific thread belong to the threaded domain cgroup. 352 353Because a threaded subtree is exempt from no internal process 354constraint, a threaded controller must be able to handle competition 355between threads in a non-leaf cgroup and its child cgroups. Each 356threaded controller defines how such competitions are handled. 357 358 359[Un]populated Notification 360-------------------------- 361 362Each non-root cgroup has a "cgroup.events" file which contains 363"populated" field indicating whether the cgroup's sub-hierarchy has 364live processes in it. Its value is 0 if there is no live process in 365the cgroup and its descendants; otherwise, 1. poll and [id]notify 366events are triggered when the value changes. This can be used, for 367example, to start a clean-up operation after all processes of a given 368sub-hierarchy have exited. The populated state updates and 369notifications are recursive. Consider the following sub-hierarchy 370where the numbers in the parentheses represent the numbers of processes 371in each cgroup:: 372 373 A(4) - B(0) - C(1) 374 \ D(0) 375 376A, B and C's "populated" fields would be 1 while D's 0. After the one 377process in C exits, B and C's "populated" fields would flip to "0" and 378file modified events will be generated on the "cgroup.events" files of 379both cgroups. 380 381 382Controlling Controllers 383----------------------- 384 385Enabling and Disabling 386~~~~~~~~~~~~~~~~~~~~~~ 387 388Each cgroup has a "cgroup.controllers" file which lists all 389controllers available for the cgroup to enable:: 390 391 # cat cgroup.controllers 392 cpu io memory 393 394No controller is enabled by default. Controllers can be enabled and 395disabled by writing to the "cgroup.subtree_control" file:: 396 397 # echo "+cpu +memory -io" > cgroup.subtree_control 398 399Only controllers which are listed in "cgroup.controllers" can be 400enabled. When multiple operations are specified as above, either they 401all succeed or fail. If multiple operations on the same controller 402are specified, the last one is effective. 403 404Enabling a controller in a cgroup indicates that the distribution of 405the target resource across its immediate children will be controlled. 406Consider the following sub-hierarchy. The enabled controllers are 407listed in parentheses:: 408 409 A(cpu,memory) - B(memory) - C() 410 \ D() 411 412As A has "cpu" and "memory" enabled, A will control the distribution 413of CPU cycles and memory to its children, in this case, B. As B has 414"memory" enabled but not "CPU", C and D will compete freely on CPU 415cycles but their division of memory available to B will be controlled. 416 417As a controller regulates the distribution of the target resource to 418the cgroup's children, enabling it creates the controller's interface 419files in the child cgroups. In the above example, enabling "cpu" on B 420would create the "cpu." prefixed controller interface files in C and 421D. 
Likewise, disabling "memory" from B would remove the "memory." 422prefixed controller interface files from C and D. This means that the 423controller interface files - anything which doesn't start with 424"cgroup." are owned by the parent rather than the cgroup itself. 425 426 427Top-down Constraint 428~~~~~~~~~~~~~~~~~~~ 429 430Resources are distributed top-down and a cgroup can further distribute 431a resource only if the resource has been distributed to it from the 432parent. This means that all non-root "cgroup.subtree_control" files 433can only contain controllers which are enabled in the parent's 434"cgroup.subtree_control" file. A controller can be enabled only if 435the parent has the controller enabled and a controller can't be 436disabled if one or more children have it enabled. 437 438 439No Internal Process Constraint 440~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 441 442Non-root cgroups can distribute domain resources to their children 443only when they don't have any processes of their own. In other words, 444only domain cgroups which don't contain any processes can have domain 445controllers enabled in their "cgroup.subtree_control" files. 446 447This guarantees that, when a domain controller is looking at the part 448of the hierarchy which has it enabled, processes are always only on 449the leaves. This rules out situations where child cgroups compete 450against internal processes of the parent. 451 452The root cgroup is exempt from this restriction. Root contains 453processes and anonymous resource consumption which can't be associated 454with any other cgroups and requires special treatment from most 455controllers. How resource consumption in the root cgroup is governed 456is up to each controller (for more information on this topic please 457refer to the Non-normative information section in the Controllers 458chapter). 459 460Note that the restriction doesn't get in the way if there is no 461enabled controller in the cgroup's "cgroup.subtree_control". This is 462important as otherwise it wouldn't be possible to create children of a 463populated cgroup. To control resource distribution of a cgroup, the 464cgroup must create children and transfer all its processes to the 465children before enabling controllers in its "cgroup.subtree_control" 466file. 467 468 469Delegation 470---------- 471 472Model of Delegation 473~~~~~~~~~~~~~~~~~~~ 474 475A cgroup can be delegated in two ways. First, to a less privileged 476user by granting write access of the directory and its "cgroup.procs", 477"cgroup.threads" and "cgroup.subtree_control" files to the user. 478Second, if the "nsdelegate" mount option is set, automatically to a 479cgroup namespace on namespace creation. 480 481Because the resource control interface files in a given directory 482control the distribution of the parent's resources, the delegatee 483shouldn't be allowed to write to them. For the first method, this is 484achieved by not granting access to these files. For the second, the 485kernel rejects writes to all files other than "cgroup.procs" and 486"cgroup.subtree_control" on a namespace root from inside the 487namespace. 488 489The end results are equivalent for both delegation types. Once 490delegated, the user can build sub-hierarchy under the directory, 491organize processes inside it as it sees fit and further distribute the 492resources it received from the parent. 

Currently, cgroup doesn't impose any restrictions on the number of
cgroups in or nesting depth of a delegated sub-hierarchy; however,
this may be limited explicitly in the future.


Delegation Containment
~~~~~~~~~~~~~~~~~~~~~~

A delegated sub-hierarchy is contained in the sense that processes
can't be moved into or out of the sub-hierarchy by the delegatee.

For delegations to a less privileged user, this is achieved by
requiring the following conditions for a process with a non-root euid
to migrate a target process into a cgroup by writing its PID to the
"cgroup.procs" file.

- The writer must have write access to the "cgroup.procs" file.

- The writer must have write access to the "cgroup.procs" file of the
  common ancestor of the source and destination cgroups.

The above two constraints ensure that while a delegatee may migrate
processes around freely in the delegated sub-hierarchy, it can't pull
processes in from or push them out to cgroups outside the
sub-hierarchy.

As an example, let's assume cgroups C0 and C1 have been delegated to
user U0 who created C00, C01 under C0 and C10 under C1 as follows and
all processes under C0 and C1 belong to U0::

  ~~~~~~~~~~~~~ - C0 - C00
  ~ cgroup    ~      \ C01
  ~ hierarchy ~
  ~~~~~~~~~~~~~ - C1 - C10

Let's also say U0 wants to write the PID of a process which is
currently in C10 into "C00/cgroup.procs".  U0 has write access to the
file; however, the common ancestor of the source cgroup C10 and the
destination cgroup C00 is above the points of delegation and U0 would
not have write access to its "cgroup.procs" file and thus the write
will be denied with -EACCES.

For delegations to namespaces, containment is achieved by requiring
that both the source and destination cgroups are reachable from the
namespace of the process which is attempting the migration.  If either
is not reachable, the migration is rejected with -ENOENT.


Guidelines
----------

Organize Once and Control
~~~~~~~~~~~~~~~~~~~~~~~~~

Migrating a process across cgroups is a relatively expensive operation
and stateful resources such as memory are not moved together with the
process.  This is an explicit design decision as there often exist
inherent trade-offs between migration and various hot paths in terms
of synchronization cost.

As such, migrating processes across cgroups frequently as a means to
apply different resource restrictions is discouraged.  A workload
should be assigned to a cgroup according to the system's logical and
resource structure once on start-up.  Dynamic adjustments to resource
distribution can be made by changing controller configuration through
the interface files.


Avoid Name Collisions
~~~~~~~~~~~~~~~~~~~~~

Interface files for a cgroup and its child cgroups occupy the same
directory and it is possible to create child cgroups which collide
with interface files.

All cgroup core interface files are prefixed with "cgroup." and each
controller's interface files are prefixed with the controller name and
a dot.
A controller's name is composed of lower case alphabets and 574'_'s but never begins with an '_' so it can be used as the prefix 575character for collision avoidance. Also, interface file names won't 576start or end with terms which are often used in categorizing workloads 577such as job, service, slice, unit or workload. 578 579cgroup doesn't do anything to prevent name collisions and it's the 580user's responsibility to avoid them. 581 582 583Resource Distribution Models 584============================ 585 586cgroup controllers implement several resource distribution schemes 587depending on the resource type and expected use cases. This section 588describes major schemes in use along with their expected behaviors. 589 590 591Weights 592------- 593 594A parent's resource is distributed by adding up the weights of all 595active children and giving each the fraction matching the ratio of its 596weight against the sum. As only children which can make use of the 597resource at the moment participate in the distribution, this is 598work-conserving. Due to the dynamic nature, this model is usually 599used for stateless resources. 600 601All weights are in the range [1, 10000] with the default at 100. This 602allows symmetric multiplicative biases in both directions at fine 603enough granularity while staying in the intuitive range. 604 605As long as the weight is in range, all configuration combinations are 606valid and there is no reason to reject configuration changes or 607process migrations. 608 609"cpu.weight" proportionally distributes CPU cycles to active children 610and is an example of this type. 611 612 613Limits 614------ 615 616A child can only consume upto the configured amount of the resource. 617Limits can be over-committed - the sum of the limits of children can 618exceed the amount of resource available to the parent. 619 620Limits are in the range [0, max] and defaults to "max", which is noop. 621 622As limits can be over-committed, all configuration combinations are 623valid and there is no reason to reject configuration changes or 624process migrations. 625 626"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume 627on an IO device and is an example of this type. 628 629 630Protections 631----------- 632 633A cgroup is protected upto the configured amount of the resource 634as long as the usages of all its ancestors are under their 635protected levels. Protections can be hard guarantees or best effort 636soft boundaries. Protections can also be over-committed in which case 637only upto the amount available to the parent is protected among 638children. 639 640Protections are in the range [0, max] and defaults to 0, which is 641noop. 642 643As protections can be over-committed, all configuration combinations 644are valid and there is no reason to reject configuration changes or 645process migrations. 646 647"memory.low" implements best-effort memory protection and is an 648example of this type. 649 650 651Allocations 652----------- 653 654A cgroup is exclusively allocated a certain amount of a finite 655resource. Allocations can't be over-committed - the sum of the 656allocations of children can not exceed the amount of resource 657available to the parent. 658 659Allocations are in the range [0, max] and defaults to 0, which is no 660resource. 661 662As allocations can't be over-committed, some configuration 663combinations are invalid and should be rejected. 
Also, if the 664resource is mandatory for execution of processes, process migrations 665may be rejected. 666 667"cpu.rt.max" hard-allocates realtime slices and is an example of this 668type. 669 670 671Interface Files 672=============== 673 674Format 675------ 676 677All interface files should be in one of the following formats whenever 678possible:: 679 680 New-line separated values 681 (when only one value can be written at once) 682 683 VAL0\n 684 VAL1\n 685 ... 686 687 Space separated values 688 (when read-only or multiple values can be written at once) 689 690 VAL0 VAL1 ...\n 691 692 Flat keyed 693 694 KEY0 VAL0\n 695 KEY1 VAL1\n 696 ... 697 698 Nested keyed 699 700 KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01... 701 KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11... 702 ... 703 704For a writable file, the format for writing should generally match 705reading; however, controllers may allow omitting later fields or 706implement restricted shortcuts for most common use cases. 707 708For both flat and nested keyed files, only the values for a single key 709can be written at a time. For nested keyed files, the sub key pairs 710may be specified in any order and not all pairs have to be specified. 711 712 713Conventions 714----------- 715 716- Settings for a single feature should be contained in a single file. 717 718- The root cgroup should be exempt from resource control and thus 719 shouldn't have resource control interface files. 720 721- The default time unit is microseconds. If a different unit is ever 722 used, an explicit unit suffix must be present. 723 724- A parts-per quantity should use a percentage decimal with at least 725 two digit fractional part - e.g. 13.40. 726 727- If a controller implements weight based resource distribution, its 728 interface file should be named "weight" and have the range [1, 729 10000] with 100 as the default. The values are chosen to allow 730 enough and symmetric bias in both directions while keeping it 731 intuitive (the default is 100%). 732 733- If a controller implements an absolute resource guarantee and/or 734 limit, the interface files should be named "min" and "max" 735 respectively. If a controller implements best effort resource 736 guarantee and/or limit, the interface files should be named "low" 737 and "high" respectively. 738 739 In the above four control files, the special token "max" should be 740 used to represent upward infinity for both reading and writing. 741 742- If a setting has a configurable default value and keyed specific 743 overrides, the default entry should be keyed with "default" and 744 appear as the first entry in the file. 745 746 The default value can be updated by writing either "default $VAL" or 747 "$VAL". 748 749 When writing to update a specific override, "default" can be used as 750 the value to indicate removal of the override. Override entries 751 with "default" as the value must not appear when read. 
752 753 For example, a setting which is keyed by major:minor device numbers 754 with integer values may look like the following:: 755 756 # cat cgroup-example-interface-file 757 default 150 758 8:0 300 759 760 The default value can be updated by:: 761 762 # echo 125 > cgroup-example-interface-file 763 764 or:: 765 766 # echo "default 125" > cgroup-example-interface-file 767 768 An override can be set by:: 769 770 # echo "8:16 170" > cgroup-example-interface-file 771 772 and cleared by:: 773 774 # echo "8:0 default" > cgroup-example-interface-file 775 # cat cgroup-example-interface-file 776 default 125 777 8:16 170 778 779- For events which are not very high frequency, an interface file 780 "events" should be created which lists event key value pairs. 781 Whenever a notifiable event happens, file modified event should be 782 generated on the file. 783 784 785Core Interface Files 786-------------------- 787 788All cgroup core files are prefixed with "cgroup." 789 790 cgroup.type 791 A read-write single value file which exists on non-root 792 cgroups. 793 794 When read, it indicates the current type of the cgroup, which 795 can be one of the following values. 796 797 - "domain" : A normal valid domain cgroup. 798 799 - "domain threaded" : A threaded domain cgroup which is 800 serving as the root of a threaded subtree. 801 802 - "domain invalid" : A cgroup which is in an invalid state. 803 It can't be populated or have controllers enabled. It may 804 be allowed to become a threaded cgroup. 805 806 - "threaded" : A threaded cgroup which is a member of a 807 threaded subtree. 808 809 A cgroup can be turned into a threaded cgroup by writing 810 "threaded" to this file. 811 812 cgroup.procs 813 A read-write new-line separated values file which exists on 814 all cgroups. 815 816 When read, it lists the PIDs of all processes which belong to 817 the cgroup one-per-line. The PIDs are not ordered and the 818 same PID may show up more than once if the process got moved 819 to another cgroup and then back or the PID got recycled while 820 reading. 821 822 A PID can be written to migrate the process associated with 823 the PID to the cgroup. The writer should match all of the 824 following conditions. 825 826 - It must have write access to the "cgroup.procs" file. 827 828 - It must have write access to the "cgroup.procs" file of the 829 common ancestor of the source and destination cgroups. 830 831 When delegating a sub-hierarchy, write access to this file 832 should be granted along with the containing directory. 833 834 In a threaded cgroup, reading this file fails with EOPNOTSUPP 835 as all the processes belong to the thread root. Writing is 836 supported and moves every thread of the process to the cgroup. 837 838 cgroup.threads 839 A read-write new-line separated values file which exists on 840 all cgroups. 841 842 When read, it lists the TIDs of all threads which belong to 843 the cgroup one-per-line. The TIDs are not ordered and the 844 same TID may show up more than once if the thread got moved to 845 another cgroup and then back or the TID got recycled while 846 reading. 847 848 A TID can be written to migrate the thread associated with the 849 TID to the cgroup. The writer should match all of the 850 following conditions. 851 852 - It must have write access to the "cgroup.threads" file. 853 854 - The cgroup that the thread is currently in must be in the 855 same resource domain as the destination cgroup. 

        - It must have write access to the "cgroup.procs" file of the
          common ancestor of the source and destination cgroups.

        When delegating a sub-hierarchy, write access to this file
        should be granted along with the containing directory.

  cgroup.controllers
        A read-only space separated values file which exists on all
        cgroups.

        It shows a space separated list of all controllers available
        to the cgroup.  The controllers are not ordered.

  cgroup.subtree_control
        A read-write space separated values file which exists on all
        cgroups.  Starts out empty.

        When read, it shows a space separated list of the controllers
        which are enabled to control resource distribution from the
        cgroup to its children.

        A space separated list of controllers prefixed with '+' or '-'
        can be written to enable or disable controllers.  A controller
        name prefixed with '+' enables the controller and '-'
        disables.  If a controller appears more than once on the list,
        the last one is effective.  When multiple enable and disable
        operations are specified, either all succeed or all fail.

  cgroup.events
        A read-only flat-keyed file which exists on non-root cgroups.
        The following entries are defined.  Unless specified
        otherwise, a value change in this file generates a file
        modified event.

          populated
                1 if the cgroup or its descendants contains any live
                processes; otherwise, 0.
          frozen
                1 if the cgroup is frozen; otherwise, 0.

  cgroup.max.descendants
        A read-write single value file.  The default is "max".

        Maximum allowed number of descendant cgroups.
        If the actual number of descendants is equal to or larger,
        an attempt to create a new cgroup in the hierarchy will fail.

  cgroup.max.depth
        A read-write single value file.  The default is "max".

        Maximum allowed descent depth below the current cgroup.
        If the actual descent depth is equal to or larger,
        an attempt to create a new child cgroup will fail.

  cgroup.stat
        A read-only flat-keyed file with the following entries:

          nr_descendants
                Total number of visible descendant cgroups.

          nr_dying_descendants
                Total number of dying descendant cgroups.  A cgroup
                becomes dying after being deleted by a user.  The
                cgroup will remain in dying state for some undefined
                time (which can depend on system load) before being
                completely destroyed.

                A process can't enter a dying cgroup under any
                circumstances, and a dying cgroup can't revive.

                A dying cgroup can consume system resources not
                exceeding limits, which were active at the moment of
                cgroup deletion.

  cgroup.freeze
        A read-write single value file which exists on non-root
        cgroups.  Allowed values are "0" and "1".  The default is "0".

        Writing "1" to the file causes freezing of the cgroup and all
        descendant cgroups.  This means that all processes belonging
        to the cgroup will be stopped and will not run until the
        cgroup is explicitly unfrozen.  Freezing of the cgroup may
        take some time; when this action is completed, the "frozen"
        value in the cgroup.events control file will be updated to "1"
        and the corresponding notification will be issued.

        A cgroup can be frozen either by its own settings, or by
        settings of any ancestor cgroups.  If any of the ancestor
        cgroups is frozen, the cgroup will remain frozen.

        Processes in the frozen cgroup can be killed by a fatal
        signal.  They can also enter and leave a frozen cgroup: either
        by an explicit move by a user, or if freezing of the cgroup
        races with fork().  If a process is moved to a frozen cgroup,
        it stops.  If a process is moved out of a frozen cgroup, it
        resumes running.

        The frozen status of a cgroup doesn't affect any cgroup tree
        operations: it's possible to delete a frozen (and empty)
        cgroup, as well as create new sub-cgroups.
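
As a short illustration of the freezer described above, the cgroup
name below is hypothetical; freezing is asynchronous, so the "frozen"
field should be checked (or its notification waited for) rather than
assumed::

  # echo 1 > jobs/batch-1/cgroup.freeze
  # cat jobs/batch-1/cgroup.events        # after freezing completes
  populated 1
  frozen 1
  # echo 0 > jobs/batch-1/cgroup.freeze   # thaw; the processes resume running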


Controllers
===========

.. _cgroup-v2-cpu:

CPU
---

The "cpu" controller regulates the distribution of CPU cycles.  This
controller implements weight and absolute bandwidth limit models for
normal scheduling policy and absolute bandwidth allocation model for
realtime scheduling policy.

In all the above models, cycles distribution is defined only on a temporal
basis and it does not account for the frequency at which tasks are executed.
The (optional) utilization clamping support allows hinting the schedutil
cpufreq governor about the minimum desired frequency which should always be
provided by a CPU, as well as the maximum desired frequency, which should not
be exceeded by a CPU.

WARNING: cgroup2 doesn't yet support control of realtime processes and
the cpu controller can only be enabled when all RT processes are in
the root cgroup.  Be aware that system management software may already
have placed RT processes into nonroot cgroups during the system boot
process, and these processes may need to be moved to the root cgroup
before the cpu controller can be enabled.


CPU Interface Files
~~~~~~~~~~~~~~~~~~~

All time durations are in microseconds.

  cpu.stat
        A read-only flat-keyed file.
        This file exists whether the controller is enabled or not.

        It always reports the following three stats:

        - usage_usec
        - user_usec
        - system_usec

        and the following three when the controller is enabled:

        - nr_periods
        - nr_throttled
        - throttled_usec

  cpu.weight
        A read-write single value file which exists on non-root
        cgroups.  The default is "100".

        The weight in the range [1, 10000].

  cpu.weight.nice
        A read-write single value file which exists on non-root
        cgroups.  The default is "0".

        The nice value is in the range [-20, 19].

        This interface file is an alternative interface for
        "cpu.weight" and allows reading and setting weight using the
        same values used by nice(2).  Because the range is smaller and
        granularity is coarser for the nice values, the read value is
        the closest approximation of the current weight.

  cpu.max
        A read-write two value file which exists on non-root cgroups.
        The default is "max 100000".

        The maximum bandwidth limit.  It's in the following format::

          $MAX $PERIOD

        which indicates that the group may consume up to $MAX in each
        $PERIOD duration.  "max" for $MAX indicates no limit.  If only
        one number is written, $MAX is updated.

  cpu.pressure
        A read-write nested-keyed file.

        Shows pressure stall information for CPU.  See
        :ref:`Documentation/accounting/psi.rst <psi>` for details.

  cpu.uclamp.min
        A read-write single value file which exists on non-root
        cgroups.  The default is "0", i.e. no utilization boosting.

        The requested minimum utilization (protection) as a percentage
        rational number, e.g. 12.34 for 12.34%.

        This interface allows reading and setting minimum utilization
        clamp values similar to sched_setattr(2).  This minimum
        utilization value is used to clamp the task specific minimum
        utilization clamp.

        The requested minimum utilization (protection) is always
        capped by the current value for the maximum utilization
        (limit), i.e. `cpu.uclamp.max`.

  cpu.uclamp.max
        A read-write single value file which exists on non-root
        cgroups.  The default is "max", i.e. no utilization capping.

        The requested maximum utilization (limit) as a percentage
        rational number, e.g. 98.76 for 98.76%.

        This interface allows reading and setting maximum utilization
        clamp values similar to sched_setattr(2).  This maximum
        utilization value is used to clamp the task specific maximum
        utilization clamp.
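
To tie the interface files above together, here is a small sketch of
one possible configuration; the cgroup name and the numbers are purely
illustrative and the current cgroup is assumed to be allowed to enable
the "cpu" controller for its children::

  # echo "+cpu" > cgroup.subtree_control   # create cpu.* files in the children
  # mkdir batch
  # echo 50 > batch/cpu.weight             # half of the default weight of 100
  # echo "50000 100000" > batch/cpu.max    # at most 50ms of CPU per 100ms period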


Memory
------

The "memory" controller regulates the distribution of memory.  Memory
is stateful and implements both limit and protection models.  Due to
the intertwining between memory usage and reclaim pressure and the
stateful nature of memory, the distribution model is relatively
complex.

While not completely water-tight, all major memory usages by a given
cgroup are tracked so that the total memory consumption can be
accounted and controlled to a reasonable extent.  Currently, the
following types of memory usages are tracked.

- Userland memory - page cache and anonymous memory.

- Kernel data structures such as dentries and inodes.

- TCP socket buffers.

The above list may expand in the future for better coverage.


Memory Interface Files
~~~~~~~~~~~~~~~~~~~~~~

All memory amounts are in bytes.  If a value which is not aligned to
PAGE_SIZE is written, the value may be rounded up to the closest
PAGE_SIZE multiple when read back.

  memory.current
        A read-only single value file which exists on non-root
        cgroups.

        The total amount of memory currently being used by the cgroup
        and its descendants.

  memory.min
        A read-write single value file which exists on non-root
        cgroups.  The default is "0".

        Hard memory protection.  If the memory usage of a cgroup
        is within its effective min boundary, the cgroup's memory
        won't be reclaimed under any conditions.  If there is no
        unprotected reclaimable memory available, OOM killer
        is invoked.  Above the effective min boundary (or
        effective low boundary if it is higher), pages are reclaimed
        proportionally to the overage, reducing reclaim pressure for
        smaller overages.

        Effective min boundary is limited by memory.min values of
        all ancestor cgroups.  If there is memory.min overcommitment
        (child cgroup or cgroups are requiring more protected memory
        than parent will allow), then each child cgroup will get
        the part of parent's protection proportional to its
        actual memory usage below memory.min.

        Putting more memory than generally available under this
        protection is discouraged and may lead to constant OOMs.

        If a memory cgroup is not populated with processes,
        its memory.min is ignored.

  memory.low
        A read-write single value file which exists on non-root
        cgroups.  The default is "0".

        Best-effort memory protection.  If the memory usage of a
        cgroup is within its effective low boundary, the cgroup's
        memory won't be reclaimed unless there is no reclaimable
        memory available in unprotected cgroups.
        Above the effective low boundary (or
        effective min boundary if it is higher), pages are reclaimed
        proportionally to the overage, reducing reclaim pressure for
        smaller overages.

        Effective low boundary is limited by memory.low values of
        all ancestor cgroups.  If there is memory.low overcommitment
        (child cgroup or cgroups are requiring more protected memory
        than parent will allow), then each child cgroup will get
        the part of parent's protection proportional to its
        actual memory usage below memory.low.

        Putting more memory than generally available under this
        protection is discouraged.

  memory.high
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".

        Memory usage throttle limit.  This is the main mechanism to
        control memory usage of a cgroup.  If a cgroup's usage goes
        over the high boundary, the processes of the cgroup are
        throttled and put under heavy reclaim pressure.

        Going over the high limit never invokes the OOM killer and
        under extreme conditions the limit may be breached.

  memory.max
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".

        Memory usage hard limit.  This is the final protection
        mechanism.  If a cgroup's memory usage reaches this limit and
        can't be reduced, the OOM killer is invoked in the cgroup.
        Under certain circumstances, the usage may go over the limit
        temporarily.

        In the default configuration, regular 0-order allocations
        always succeed unless the OOM killer chooses the current task
        as a victim.

        Some kinds of allocations don't invoke the OOM killer.  The
        caller could retry them differently, return -ENOMEM to
        userspace or silently ignore the failure in cases like disk
        readahead.

        This is the ultimate protection mechanism.  As long as the
        high limit is used and monitored properly, this limit's
        utility is limited to providing the final safety net.

  memory.oom.group
        A read-write single value file which exists on non-root
        cgroups.  The default value is "0".

        Determines whether the cgroup should be treated as
        an indivisible workload by the OOM killer.  If set,
        all tasks belonging to the cgroup or to its descendants
        (if the memory cgroup is not a leaf cgroup) are killed
        together or not at all.  This can be used to avoid
        partial kills to guarantee workload integrity.

        Tasks with the OOM protection (oom_score_adj set to -1000)
        are treated as an exception and are never killed.

        If the OOM killer is invoked in a cgroup, it's not going
        to kill any tasks outside of this cgroup, regardless of
        the memory.oom.group values of ancestor cgroups.

  memory.events
        A read-only flat-keyed file which exists on non-root cgroups.
        The following entries are defined.  Unless specified
        otherwise, a value change in this file generates a file
        modified event.

        Note that all fields in this file are hierarchical and the
        file modified event can be generated due to an event down the
        hierarchy.  For the local events at the cgroup level see
        memory.events.local.

          low
                The number of times the cgroup is reclaimed due to
                high memory pressure even though its usage is under
                the low boundary.  This usually indicates that the low
                boundary is over-committed.

          high
                The number of times processes of the cgroup are
                throttled and routed to perform direct memory reclaim
                because the high memory boundary was exceeded.  For a
                cgroup whose memory usage is capped by the high limit
                rather than global memory pressure, this event's
                occurrences are expected.

          max
                The number of times the cgroup's memory usage was
                about to go over the max boundary.  If direct reclaim
                fails to bring it down, the cgroup goes to OOM state.

          oom
                The number of times the cgroup's memory usage reached
                the limit and allocation was about to fail.

                This event is not raised if the OOM killer is not
                considered as an option, e.g. for failed high-order
                allocations or if the caller asked not to retry
                attempts.

          oom_kill
                The number of processes belonging to this cgroup
                killed by any kind of OOM killer.

  memory.events.local
        Similar to memory.events but the fields in the file are local
        to the cgroup, i.e. not hierarchical.  The file modified event
        generated on this file reflects only the local events.

  memory.stat
        A read-only flat-keyed file which exists on non-root cgroups.

        This breaks down the cgroup's memory footprint into different
        types of memory, type-specific details, and other information
        on the state and past events of the memory management system.

        All memory amounts are in bytes.

        The entries are ordered to be human readable, and new entries
        can show up in the middle.  Don't rely on items remaining in a
        fixed position; use the keys to look up specific values!

        Entries marked with the 'npn' (non-per-node) tag have no
        per-node counter and will not show up in memory.numa_stat.

          anon
                Amount of memory used in anonymous mappings such as
                brk(), sbrk(), and mmap(MAP_ANONYMOUS)

          file
                Amount of memory used to cache filesystem data,
                including tmpfs and shared memory.

          kernel_stack
                Amount of memory allocated to kernel stacks.

          pagetables
                Amount of memory allocated for page tables.

          percpu (npn)
                Amount of memory used for storing per-cpu kernel
                data structures.

          sock (npn)
                Amount of memory used in network transmission buffers

          shmem
                Amount of cached filesystem data that is swap-backed,
                such as tmpfs, shm segments, shared anonymous mmap()s

          file_mapped
                Amount of cached filesystem data mapped with mmap()

          file_dirty
                Amount of cached filesystem data that was modified but
                not yet written back to disk

          file_writeback
                Amount of cached filesystem data that was modified and
                is currently being written back to disk

          swapcached
                Amount of swap cached in memory.  The swapcache is
                accounted against both memory and swap usage.
1308 1309 anon_thp 1310 Amount of memory used in anonymous mappings backed by 1311 transparent hugepages 1312 1313 file_thp 1314 Amount of cached filesystem data backed by transparent 1315 hugepages 1316 1317 shmem_thp 1318 Amount of shm, tmpfs, shared anonymous mmap()s backed by 1319 transparent hugepages 1320 1321 inactive_anon, active_anon, inactive_file, active_file, unevictable 1322 Amount of memory, swap-backed and filesystem-backed, 1323 on the internal memory management lists used by the 1324 page reclaim algorithm. 1325 1326 As these represent internal list state (eg. shmem pages are on anon 1327 memory management lists), inactive_foo + active_foo may not be equal to 1328 the value for the foo counter, since the foo counter is type-based, not 1329 list-based. 1330 1331 slab_reclaimable 1332 Part of "slab" that might be reclaimed, such as 1333 dentries and inodes. 1334 1335 slab_unreclaimable 1336 Part of "slab" that cannot be reclaimed on memory 1337 pressure. 1338 1339 slab (npn) 1340 Amount of memory used for storing in-kernel data 1341 structures. 1342 1343 workingset_refault_anon 1344 Number of refaults of previously evicted anonymous pages. 1345 1346 workingset_refault_file 1347 Number of refaults of previously evicted file pages. 1348 1349 workingset_activate_anon 1350 Number of refaulted anonymous pages that were immediately 1351 activated. 1352 1353 workingset_activate_file 1354 Number of refaulted file pages that were immediately activated. 1355 1356 workingset_restore_anon 1357 Number of restored anonymous pages which have been detected as 1358 an active workingset before they got reclaimed. 1359 1360 workingset_restore_file 1361 Number of restored file pages which have been detected as an 1362 active workingset before they got reclaimed. 1363 1364 workingset_nodereclaim 1365 Number of times a shadow node has been reclaimed 1366 1367 pgfault (npn) 1368 Total number of page faults incurred 1369 1370 pgmajfault (npn) 1371 Number of major page faults incurred 1372 1373 pgrefill (npn) 1374 Amount of scanned pages (in an active LRU list) 1375 1376 pgscan (npn) 1377 Amount of scanned pages (in an inactive LRU list) 1378 1379 pgsteal (npn) 1380 Amount of reclaimed pages 1381 1382 pgactivate (npn) 1383 Amount of pages moved to the active LRU list 1384 1385 pgdeactivate (npn) 1386 Amount of pages moved to the inactive LRU list 1387 1388 pglazyfree (npn) 1389 Amount of pages postponed to be freed under memory pressure 1390 1391 pglazyfreed (npn) 1392 Amount of reclaimed lazyfree pages 1393 1394 thp_fault_alloc (npn) 1395 Number of transparent hugepages which were allocated to satisfy 1396 a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE 1397 is not set. 1398 1399 thp_collapse_alloc (npn) 1400 Number of transparent hugepages which were allocated to allow 1401 collapsing an existing range of pages. This counter is not 1402 present when CONFIG_TRANSPARENT_HUGEPAGE is not set. 1403 1404 memory.numa_stat 1405 A read-only nested-keyed file which exists on non-root cgroups. 1406 1407 This breaks down the cgroup's memory footprint into different 1408 types of memory, type-specific details, and other information 1409 per node on the state of the memory management system. 1410 1411 This is useful for providing visibility into the NUMA locality 1412 information within an memcg since the pages are allowed to be 1413 allocated from any physical node. 
One of the use cases is evaluating application performance by
combining this information with the application's CPU allocation.

        All memory amounts are in bytes.

        The output format of memory.numa_stat is::

          type N0=<bytes in node 0> N1=<bytes in node 1> ...

        The entries are ordered to be human readable, and new entries
        can show up in the middle.  Don't rely on items remaining in a
        fixed position; use the keys to look up specific values!

        For the meaning of each entry, refer to memory.stat.

  memory.swap.current
        A read-only single value file which exists on non-root
        cgroups.

        The total amount of swap currently being used by the cgroup
        and its descendants.

  memory.swap.high
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".

        Swap usage throttle limit.  If a cgroup's swap usage exceeds
        this limit, all its further allocations will be throttled to
        allow userspace to implement custom out-of-memory procedures.

        This limit marks a point of no return for the cgroup.  It is
        NOT designed to manage the amount of swapping a workload does
        during regular operation.  Compare to memory.swap.max, which
        prohibits swapping past a set amount, but lets the cgroup
        continue unimpeded as long as other memory can be reclaimed.

        Healthy workloads are not expected to reach this limit.

  memory.swap.max
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".

        Swap usage hard limit.  If a cgroup's swap usage reaches this
        limit, anonymous memory of the cgroup will not be swapped out.

  memory.swap.events
        A read-only flat-keyed file which exists on non-root cgroups.
        The following entries are defined.  Unless specified
        otherwise, a value change in this file generates a file
        modified event.

          high
                The number of times the cgroup's swap usage was over
                the high threshold.

          max
                The number of times the cgroup's swap usage was about
                to go over the max boundary and swap allocation
                failed.

          fail
                The number of times swap allocation failed either
                because of running out of swap system-wide or max
                limit.

        When reduced under the current usage, the existing swap
        entries are reclaimed gradually and the swap usage may stay
        higher than the limit for an extended period of time.  This
        reduces the impact on the workload and memory management.

  memory.pressure
        A read-only nested-keyed file.

        Shows pressure stall information for memory.  See
        :ref:`Documentation/accounting/psi.rst <psi>` for details.


Usage Guidelines
~~~~~~~~~~~~~~~~

"memory.high" is the main mechanism to control memory usage.
Over-committing on the high limit (sum of high limits > available
memory) and letting global memory pressure distribute memory according
to usage is a viable strategy.

Because a breach of the high limit doesn't trigger the OOM killer but
throttles the offending cgroup, a management agent has ample
opportunities to monitor and take appropriate actions such as granting
more memory or terminating the workload.

Determining whether a cgroup has enough memory is not trivial as
memory usage doesn't indicate whether the workload can benefit from
more memory.  For example, a workload which writes data received from
the network to a file can use all available memory but can also
perform just as well with a small amount of memory.  A measure of
memory pressure - how much the workload is being impacted due to lack
of memory - is necessary to determine whether a workload needs more
memory; the "memory.pressure" file described above provides such a
measure (see :ref:`Documentation/accounting/psi.rst <psi>`).
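
As a rough sketch of the strategy described above, the cgroup name and
the sizes are purely illustrative and the "memory" controller is
assumed to be enabled for the "workload" cgroup; the high limit is set
below the hard limit so that throttling and monitoring kick in well
before the OOM killer would::

  # echo 4G > workload/memory.high        # throttle and reclaim above 4G
  # echo 5G > workload/memory.max         # hard cap, the final safety net
  # echo 1 > workload/memory.oom.group    # if OOM does happen, kill the job as a whole
  # cat workload/memory.events            # watch the "high" and "oom" counters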
For example, a workload which writes data received from the
network to a file can use all available memory but can also operate
just as well with a small amount of memory.  A measure of memory
pressure - how much the workload is being impacted due to lack of
memory - is necessary to determine whether a workload needs more
memory; the "memory.pressure" file documented above exposes such a
measure in the form of pressure stall information (PSI).


Memory Ownership
~~~~~~~~~~~~~~~~

A memory area is charged to the cgroup which instantiated it and stays
charged to the cgroup until the area is released.  Migrating a process
to a different cgroup doesn't move the memory usages that it
instantiated while in the previous cgroup to the new cgroup.

A memory area may be used by processes belonging to different cgroups.
Which cgroup the area will be charged to is indeterminate; however,
over time, the memory area is likely to end up in a cgroup which has
enough memory allowance to avoid high reclaim pressure.

If a cgroup sweeps a considerable amount of memory which is expected
to be accessed repeatedly by other cgroups, it may make sense to use
POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
belonging to the affected files to ensure correct memory ownership.


IO
--

The "io" controller regulates the distribution of IO resources.  This
controller implements both weight-based and absolute bandwidth or IOPS
limit distribution; however, weight-based distribution is available
only if cfq-iosched is in use and neither scheme is available for
blk-mq devices.


IO Interface Files
~~~~~~~~~~~~~~~~~~

  io.stat
        A read-only nested-keyed file.

        Lines are keyed by $MAJ:$MIN device numbers and not ordered.
        The following nested keys are defined.

          ======  =====================
          rbytes  Bytes read
          wbytes  Bytes written
          rios    Number of read IOs
          wios    Number of write IOs
          dbytes  Bytes discarded
          dios    Number of discard IOs
          ======  =====================

        An example read output follows::

          8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
          8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021

  io.cost.qos
        A read-write nested-keyed file which exists only on the root
        cgroup.

        This file configures the Quality of Service of the IO cost
        model based controller (CONFIG_BLK_CGROUP_IOCOST) which
        currently implements "io.weight" proportional control.  Lines
        are keyed by $MAJ:$MIN device numbers and not ordered.  The
        line for a given device is populated on the first write for
        the device on "io.cost.qos" or "io.cost.model".  The following
        nested keys are defined.

          ======  =====================================
          enable  Weight-based control enable
          ctrl    "auto" or "user"
          rpct    Read latency percentile [0, 100]
          rlat    Read latency threshold
          wpct    Write latency percentile [0, 100]
          wlat    Write latency threshold
          min     Minimum scaling percentage [1, 10000]
          max     Maximum scaling percentage [1, 10000]
          ======  =====================================

        The controller is disabled by default and can be enabled by
        setting "enable" to 1.
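
        As a minimal sketch, weight-based control could be switched
        on for one device by writing to "io.cost.qos" in the root
        cgroup (the 8:16 device number below is only illustrative and
        should be replaced with the target device)::

          # echo "8:16 enable=1" > io.cost.qos
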
"rpct" and "wpct" parameters default 1592 to zero and the controller uses internal device saturation 1593 state to adjust the overall IO rate between "min" and "max". 1594 1595 When a better control quality is needed, latency QoS 1596 parameters can be configured. For example:: 1597 1598 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.0 1599 1600 shows that on sdb, the controller is enabled, will consider 1601 the device saturated if the 95th percentile of read completion 1602 latencies is above 75ms or write 150ms, and adjust the overall 1603 IO issue rate between 50% and 150% accordingly. 1604 1605 The lower the saturation point, the better the latency QoS at 1606 the cost of aggregate bandwidth. The narrower the allowed 1607 adjustment range between "min" and "max", the more conformant 1608 to the cost model the IO behavior. Note that the IO issue 1609 base rate may be far off from 100% and setting "min" and "max" 1610 blindly can lead to a significant loss of device capacity or 1611 control quality. "min" and "max" are useful for regulating 1612 devices which show wide temporary behavior changes - e.g. a 1613 ssd which accepts writes at the line speed for a while and 1614 then completely stalls for multiple seconds. 1615 1616 When "ctrl" is "auto", the parameters are controlled by the 1617 kernel and may change automatically. Setting "ctrl" to "user" 1618 or setting any of the percentile and latency parameters puts 1619 it into "user" mode and disables the automatic changes. The 1620 automatic mode can be restored by setting "ctrl" to "auto". 1621 1622 io.cost.model 1623 A read-write nested-keyed file which exists only on the root 1624 cgroup. 1625 1626 This file configures the cost model of the IO cost model based 1627 controller (CONFIG_BLK_CGROUP_IOCOST) which currently 1628 implements "io.weight" proportional control. Lines are keyed 1629 by $MAJ:$MIN device numbers and not ordered. The line for a 1630 given device is populated on the first write for the device on 1631 "io.cost.qos" or "io.cost.model". The following nested keys 1632 are defined. 1633 1634 ===== ================================ 1635 ctrl "auto" or "user" 1636 model The cost model in use - "linear" 1637 ===== ================================ 1638 1639 When "ctrl" is "auto", the kernel may change all parameters 1640 dynamically. When "ctrl" is set to "user" or any other 1641 parameters are written to, "ctrl" become "user" and the 1642 automatic changes are disabled. 1643 1644 When "model" is "linear", the following model parameters are 1645 defined. 1646 1647 ============= ======================================== 1648 [r|w]bps The maximum sequential IO throughput 1649 [r|w]seqiops The maximum 4k sequential IOs per second 1650 [r|w]randiops The maximum 4k random IOs per second 1651 ============= ======================================== 1652 1653 From the above, the builtin linear model determines the base 1654 costs of a sequential and random IO and the cost coefficient 1655 for the IO size. While simple, this model can cover most 1656 common device classes acceptably. 1657 1658 The IO cost model isn't expected to be accurate in absolute 1659 sense and is scaled to the device behavior dynamically. 1660 1661 If needed, tools/cgroup/iocost_coef_gen.py can be used to 1662 generate device-specific coefficients. 1663 1664 io.weight 1665 A read-write flat-keyed file which exists on non-root cgroups. 1666 The default is "default 100". 
1667 1668 The first line is the default weight applied to devices 1669 without specific override. The rest are overrides keyed by 1670 $MAJ:$MIN device numbers and not ordered. The weights are in 1671 the range [1, 10000] and specifies the relative amount IO time 1672 the cgroup can use in relation to its siblings. 1673 1674 The default weight can be updated by writing either "default 1675 $WEIGHT" or simply "$WEIGHT". Overrides can be set by writing 1676 "$MAJ:$MIN $WEIGHT" and unset by writing "$MAJ:$MIN default". 1677 1678 An example read output follows:: 1679 1680 default 100 1681 8:16 200 1682 8:0 50 1683 1684 io.max 1685 A read-write nested-keyed file which exists on non-root 1686 cgroups. 1687 1688 BPS and IOPS based IO limit. Lines are keyed by $MAJ:$MIN 1689 device numbers and not ordered. The following nested keys are 1690 defined. 1691 1692 ===== ================================== 1693 rbps Max read bytes per second 1694 wbps Max write bytes per second 1695 riops Max read IO operations per second 1696 wiops Max write IO operations per second 1697 ===== ================================== 1698 1699 When writing, any number of nested key-value pairs can be 1700 specified in any order. "max" can be specified as the value 1701 to remove a specific limit. If the same key is specified 1702 multiple times, the outcome is undefined. 1703 1704 BPS and IOPS are measured in each IO direction and IOs are 1705 delayed if limit is reached. Temporary bursts are allowed. 1706 1707 Setting read limit at 2M BPS and write at 120 IOPS for 8:16:: 1708 1709 echo "8:16 rbps=2097152 wiops=120" > io.max 1710 1711 Reading returns the following:: 1712 1713 8:16 rbps=2097152 wbps=max riops=max wiops=120 1714 1715 Write IOPS limit can be removed by writing the following:: 1716 1717 echo "8:16 wiops=max" > io.max 1718 1719 Reading now returns the following:: 1720 1721 8:16 rbps=2097152 wbps=max riops=max wiops=max 1722 1723 io.pressure 1724 A read-only nested-keyed file. 1725 1726 Shows pressure stall information for IO. See 1727 :ref:`Documentation/accounting/psi.rst <psi>` for details. 1728 1729 1730Writeback 1731~~~~~~~~~ 1732 1733Page cache is dirtied through buffered writes and shared mmaps and 1734written asynchronously to the backing filesystem by the writeback 1735mechanism. Writeback sits between the memory and IO domains and 1736regulates the proportion of dirty memory by balancing dirtying and 1737write IOs. 1738 1739The io controller, in conjunction with the memory controller, 1740implements control of page cache writeback IOs. The memory controller 1741defines the memory domain that dirty memory ratio is calculated and 1742maintained for and the io controller defines the io domain which 1743writes out dirty pages for the memory domain. Both system-wide and 1744per-cgroup dirty memory states are examined and the more restrictive 1745of the two is enforced. 1746 1747cgroup writeback requires explicit support from the underlying 1748filesystem. Currently, cgroup writeback is implemented on ext2, ext4, 1749btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are 1750attributed to the root cgroup. 1751 1752There are inherent differences in memory and writeback management 1753which affects how cgroup ownership is tracked. Memory is tracked per 1754page while writeback per inode. For the purpose of writeback, an 1755inode is assigned to a cgroup and all IO requests to write dirty pages 1756from the inode are attributed to that cgroup. 
1757 1758As cgroup ownership for memory is tracked per page, there can be pages 1759which are associated with different cgroups than the one the inode is 1760associated with. These are called foreign pages. The writeback 1761constantly keeps track of foreign pages and, if a particular foreign 1762cgroup becomes the majority over a certain period of time, switches 1763the ownership of the inode to that cgroup. 1764 1765While this model is enough for most use cases where a given inode is 1766mostly dirtied by a single cgroup even when the main writing cgroup 1767changes over time, use cases where multiple cgroups write to a single 1768inode simultaneously are not supported well. In such circumstances, a 1769significant portion of IOs are likely to be attributed incorrectly. 1770As memory controller assigns page ownership on the first use and 1771doesn't update it until the page is released, even if writeback 1772strictly follows page ownership, multiple cgroups dirtying overlapping 1773areas wouldn't work as expected. It's recommended to avoid such usage 1774patterns. 1775 1776The sysctl knobs which affect writeback behavior are applied to cgroup 1777writeback as follows. 1778 1779 vm.dirty_background_ratio, vm.dirty_ratio 1780 These ratios apply the same to cgroup writeback with the 1781 amount of available memory capped by limits imposed by the 1782 memory controller and system-wide clean memory. 1783 1784 vm.dirty_background_bytes, vm.dirty_bytes 1785 For cgroup writeback, this is calculated into ratio against 1786 total available memory and applied the same way as 1787 vm.dirty[_background]_ratio. 1788 1789 1790IO Latency 1791~~~~~~~~~~ 1792 1793This is a cgroup v2 controller for IO workload protection. You provide a group 1794with a latency target, and if the average latency exceeds that target the 1795controller will throttle any peers that have a lower latency target than the 1796protected workload. 1797 1798The limits are only applied at the peer level in the hierarchy. This means that 1799in the diagram below, only groups A, B, and C will influence each other, and 1800groups D and F will influence each other. Group G will influence nobody:: 1801 1802 [root] 1803 / | \ 1804 A B C 1805 / \ | 1806 D F G 1807 1808 1809So the ideal way to configure this is to set io.latency in groups A, B, and C. 1810Generally you do not want to set a value lower than the latency your device 1811supports. Experiment to find the value that works best for your workload. 1812Start at higher than the expected latency for your device and watch the 1813avg_lat value in io.stat for your workload group to get an idea of the 1814latency you see during normal operation. Use the avg_lat value as a basis for 1815your real setting, setting at 10-15% higher than the value in io.stat. 1816 1817How IO Latency Throttling Works 1818~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1819 1820io.latency is work conserving; so as long as everybody is meeting their latency 1821target the controller doesn't do anything. Once a group starts missing its 1822target it begins throttling any peer group that has a higher target than itself. 1823This throttling takes 2 forms: 1824 1825- Queue depth throttling. This is the number of outstanding IO's a group is 1826 allowed to have. We will clamp down relatively quickly, starting at no limit 1827 and going all the way down to 1 IO at a time. 1828 1829- Artificial delay induction. There are certain types of IO that cannot be 1830 throttled without possibly adversely affecting higher priority groups. 
  This includes swapping and metadata IO.  These types of IO are
  allowed to occur normally, however they are "charged" to the
  originating group.  If the originating group is being throttled you
  will see the use_delay and delay fields in io.stat increase.  The
  delay value is the number of microseconds that is being added to any
  process that runs in this group.  Because this number can grow quite
  large if there is a lot of swapping or metadata IO occurring, we
  limit the individual delay events to 1 second at a time.

Once the victimized group starts meeting its latency target again, it
will start unthrottling any peer groups that were throttled
previously.  If the victimized group simply stops doing IO, the global
counter will unthrottle appropriately.

IO Latency Interface Files
~~~~~~~~~~~~~~~~~~~~~~~~~~

  io.latency
        This takes a similar format to the other controllers.

        "MAJOR:MINOR target=<target time in microseconds>"

  io.stat
        If the controller is enabled you will see extra stats in
        io.stat in addition to the normal ones.

          depth
                This is the current queue depth for the group.

          avg_lat
                This is an exponential moving average with a decay
                rate of 1/exp bound by the sampling interval.  The
                decay rate interval can be calculated by multiplying
                the win value in io.stat by the corresponding number
                of samples based on the win value.

          win
                The sampling window size in milliseconds.  This is the
                minimum duration of time between evaluation events.
                Windows only elapse with IO activity.  Idle periods
                extend the most recent window.


PID
---

The process number controller is used to allow a cgroup to stop any
new tasks from being fork()'d or clone()'d after a specified limit is
reached.

The number of tasks in a cgroup can be exhausted in ways which other
controllers cannot prevent, thus warranting its own controller.  For
example, a fork bomb is likely to exhaust the number of tasks before
hitting memory restrictions.

Note that PIDs used in this controller refer to TIDs, process IDs as
used by the kernel.


PID Interface Files
~~~~~~~~~~~~~~~~~~~

  pids.max
        A read-write single value file which exists on non-root
        cgroups.  The default is "max".

        Hard limit of number of processes.

  pids.current
        A read-only single value file which exists on all cgroups.

        The number of processes currently in the cgroup and its
        descendants.

Organisational operations are not blocked by cgroup policies, so it is
possible to have pids.current > pids.max.  This can be done by either
setting the limit to be smaller than pids.current, or attaching enough
processes to the cgroup such that pids.current is larger than
pids.max.  However, it is not possible to violate a cgroup PID policy
through fork() or clone().  These will return -EAGAIN if the creation
of a new process would cause a cgroup policy to be violated.


Cpuset
------

The "cpuset" controller provides a mechanism for constraining
the CPU and memory node placement of tasks to only the resources
specified in the cpuset interface files in a task's current cgroup.
1915This is especially valuable on large NUMA systems where placing jobs 1916on properly sized subsets of the systems with careful processor and 1917memory placement to reduce cross-node memory access and contention 1918can improve overall system performance. 1919 1920The "cpuset" controller is hierarchical. That means the controller 1921cannot use CPUs or memory nodes not allowed in its parent. 1922 1923 1924Cpuset Interface Files 1925~~~~~~~~~~~~~~~~~~~~~~ 1926 1927 cpuset.cpus 1928 A read-write multiple values file which exists on non-root 1929 cpuset-enabled cgroups. 1930 1931 It lists the requested CPUs to be used by tasks within this 1932 cgroup. The actual list of CPUs to be granted, however, is 1933 subjected to constraints imposed by its parent and can differ 1934 from the requested CPUs. 1935 1936 The CPU numbers are comma-separated numbers or ranges. 1937 For example:: 1938 1939 # cat cpuset.cpus 1940 0-4,6,8-10 1941 1942 An empty value indicates that the cgroup is using the same 1943 setting as the nearest cgroup ancestor with a non-empty 1944 "cpuset.cpus" or all the available CPUs if none is found. 1945 1946 The value of "cpuset.cpus" stays constant until the next update 1947 and won't be affected by any CPU hotplug events. 1948 1949 cpuset.cpus.effective 1950 A read-only multiple values file which exists on all 1951 cpuset-enabled cgroups. 1952 1953 It lists the onlined CPUs that are actually granted to this 1954 cgroup by its parent. These CPUs are allowed to be used by 1955 tasks within the current cgroup. 1956 1957 If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows 1958 all the CPUs from the parent cgroup that can be available to 1959 be used by this cgroup. Otherwise, it should be a subset of 1960 "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus" 1961 can be granted. In this case, it will be treated just like an 1962 empty "cpuset.cpus". 1963 1964 Its value will be affected by CPU hotplug events. 1965 1966 cpuset.mems 1967 A read-write multiple values file which exists on non-root 1968 cpuset-enabled cgroups. 1969 1970 It lists the requested memory nodes to be used by tasks within 1971 this cgroup. The actual list of memory nodes granted, however, 1972 is subjected to constraints imposed by its parent and can differ 1973 from the requested memory nodes. 1974 1975 The memory node numbers are comma-separated numbers or ranges. 1976 For example:: 1977 1978 # cat cpuset.mems 1979 0-1,3 1980 1981 An empty value indicates that the cgroup is using the same 1982 setting as the nearest cgroup ancestor with a non-empty 1983 "cpuset.mems" or all the available memory nodes if none 1984 is found. 1985 1986 The value of "cpuset.mems" stays constant until the next update 1987 and won't be affected by any memory nodes hotplug events. 1988 1989 cpuset.mems.effective 1990 A read-only multiple values file which exists on all 1991 cpuset-enabled cgroups. 1992 1993 It lists the onlined memory nodes that are actually granted to 1994 this cgroup by its parent. These memory nodes are allowed to 1995 be used by tasks within the current cgroup. 1996 1997 If "cpuset.mems" is empty, it shows all the memory nodes from the 1998 parent cgroup that will be available to be used by this cgroup. 1999 Otherwise, it should be a subset of "cpuset.mems" unless none of 2000 the memory nodes listed in "cpuset.mems" can be granted. In this 2001 case, it will be treated just like an empty "cpuset.mems". 2002 2003 Its value will be affected by memory nodes hotplug events. 
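
Putting the above interface files together, a minimal sketch of
pinning a sub-cgroup to a couple of CPUs and one memory node might
look as follows (the cgroup name and the CPU/node numbers are purely
illustrative and assume those CPUs and that node are available to the
parent)::

  # cd $MOUNT_POINT
  # echo "+cpuset" > cgroup.subtree_control
  # mkdir build
  # echo "2-3" > build/cpuset.cpus
  # echo "0" > build/cpuset.mems
  # cat build/cpuset.cpus.effective
  2-3
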

  cpuset.cpus.partition
        A read-write single value file which exists on non-root
        cpuset-enabled cgroups.  This flag is owned by the parent
        cgroup and is not delegatable.

        It accepts only the following input values when written to.

          ========      ================================
          "root"        a partition root
          "member"      a non-root member of a partition
          ========      ================================

        When set to be a partition root, the current cgroup is the
        root of a new partition or scheduling domain that comprises
        itself and all its descendants except those that are separate
        partition roots themselves and their descendants.  The root
        cgroup is always a partition root.

        There are constraints on where a partition root can be set.
        It can only be set in a cgroup if all the following conditions
        are true.

        1) The "cpuset.cpus" is not empty and the list of CPUs are
           exclusive, i.e. they are not shared by any of its siblings.
        2) The parent cgroup is a partition root.
        3) The "cpuset.cpus" is also a proper subset of the parent's
           "cpuset.cpus.effective".
        4) There are no child cgroups with cpuset enabled.  This is
           for eliminating corner cases that have to be handled if
           such a condition is allowed.

        Setting it to a partition root will take the CPUs away from
        the effective CPUs of the parent cgroup.  Once it is set, this
        file cannot be reverted back to "member" if there are any
        child cgroups with cpuset enabled.

        A parent partition cannot distribute all its CPUs to its
        child partitions.  There must be at least one cpu left in the
        parent partition.

        Once becoming a partition root, changes to "cpuset.cpus" are
        generally allowed as long as the first condition above holds,
        the change will not take away all the CPUs from the parent
        partition and the new "cpuset.cpus" value is a superset of its
        children's "cpuset.cpus" values.

        Sometimes, external factors like changes to ancestors'
        "cpuset.cpus" or cpu hotplug can cause the state of the
        partition root to change.  On read, the
        "cpuset.cpus.partition" file can show the following values.

          ==============        ==============================
          "member"              Non-root member of a partition
          "root"                Partition root
          "root invalid"        Invalid partition root
          ==============        ==============================

        It is a partition root if the first 2 partition root
        conditions above are true and at least one CPU from
        "cpuset.cpus" is granted by the parent cgroup.

        A partition root can become invalid if none of the CPUs
        requested in "cpuset.cpus" can be granted by the parent cgroup
        or the parent cgroup is no longer a partition root itself.  In
        this case, it is not a real partition even though the
        restriction of the first partition root condition above will
        still apply.  The cpu affinity of all the tasks in the cgroup
        will then be associated with CPUs in the nearest ancestor
        partition.

        An invalid partition root can be transitioned back to a real
        partition root if at least one of the requested CPUs can now
        be granted by its parent.  In this case, the cpu affinity of
        all the tasks in the formerly invalid partition will be
        associated with the CPUs of the newly formed partition.
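
        As a rough sketch of the rules above (the cgroup name and CPU
        numbers are hypothetical, and the parent must satisfy the
        listed conditions), carving out a partition could look like::

          # echo "+cpuset" > cgroup.subtree_control
          # mkdir isolated
          # echo "0-1" > isolated/cpuset.cpus
          # echo root > isolated/cpuset.cpus.partition
          # cat isolated/cpuset.cpus.partition
          root
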
2079 Changing the partition state of an invalid partition root to 2080 "member" is always allowed even if child cpusets are present. 2081 2082 2083Device controller 2084----------------- 2085 2086Device controller manages access to device files. It includes both 2087creation of new device files (using mknod), and access to the 2088existing device files. 2089 2090Cgroup v2 device controller has no interface files and is implemented 2091on top of cgroup BPF. To control access to device files, a user may 2092create bpf programs of the BPF_CGROUP_DEVICE type and attach them 2093to cgroups. On an attempt to access a device file, corresponding 2094BPF programs will be executed, and depending on the return value 2095the attempt will succeed or fail with -EPERM. 2096 2097A BPF_CGROUP_DEVICE program takes a pointer to the bpf_cgroup_dev_ctx 2098structure, which describes the device access attempt: access type 2099(mknod/read/write) and device (type, major and minor numbers). 2100If the program returns 0, the attempt fails with -EPERM, otherwise 2101it succeeds. 2102 2103An example of BPF_CGROUP_DEVICE program may be found in the kernel 2104source tree in the tools/testing/selftests/bpf/progs/dev_cgroup.c file. 2105 2106 2107RDMA 2108---- 2109 2110The "rdma" controller regulates the distribution and accounting of 2111RDMA resources. 2112 2113RDMA Interface Files 2114~~~~~~~~~~~~~~~~~~~~ 2115 2116 rdma.max 2117 A readwrite nested-keyed file that exists for all the cgroups 2118 except root that describes current configured resource limit 2119 for a RDMA/IB device. 2120 2121 Lines are keyed by device name and are not ordered. 2122 Each line contains space separated resource name and its configured 2123 limit that can be distributed. 2124 2125 The following nested keys are defined. 2126 2127 ========== ============================= 2128 hca_handle Maximum number of HCA Handles 2129 hca_object Maximum number of HCA Objects 2130 ========== ============================= 2131 2132 An example for mlx4 and ocrdma device follows:: 2133 2134 mlx4_0 hca_handle=2 hca_object=2000 2135 ocrdma1 hca_handle=3 hca_object=max 2136 2137 rdma.current 2138 A read-only file that describes current resource usage. 2139 It exists for all the cgroup except root. 2140 2141 An example for mlx4 and ocrdma device follows:: 2142 2143 mlx4_0 hca_handle=1 hca_object=20 2144 ocrdma1 hca_handle=1 hca_object=23 2145 2146HugeTLB 2147------- 2148 2149The HugeTLB controller allows to limit the HugeTLB usage per control group and 2150enforces the controller limit during page fault. 2151 2152HugeTLB Interface Files 2153~~~~~~~~~~~~~~~~~~~~~~~ 2154 2155 hugetlb.<hugepagesize>.current 2156 Show current usage for "hugepagesize" hugetlb. It exists for all 2157 the cgroup except root. 2158 2159 hugetlb.<hugepagesize>.max 2160 Set/show the hard limit of "hugepagesize" hugetlb usage. 2161 The default value is "max". It exists for all the cgroup except root. 2162 2163 hugetlb.<hugepagesize>.events 2164 A read-only flat-keyed file which exists on non-root cgroups. 2165 2166 max 2167 The number of allocation failure due to HugeTLB limit 2168 2169 hugetlb.<hugepagesize>.events.local 2170 Similar to hugetlb.<hugepagesize>.events but the fields in the file 2171 are local to the cgroup i.e. not hierarchical. The file modified event 2172 generated on this file reflects only the local events. 
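
For example, assuming 2MB hugepages are supported and a non-root
cgroup has already been created, usage could be capped at 1GiB and
inspected along these lines (the size string and values are only
illustrative)::

  # echo 1073741824 > hugetlb.2MB.max
  # cat hugetlb.2MB.current
  0
  # cat hugetlb.2MB.events
  max 0
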
2173 2174Misc 2175---- 2176 2177The Miscellaneous cgroup provides the resource limiting and tracking 2178mechanism for the scalar resources which cannot be abstracted like the other 2179cgroup resources. Controller is enabled by the CONFIG_CGROUP_MISC config 2180option. 2181 2182A resource can be added to the controller via enum misc_res_type{} in the 2183include/linux/misc_cgroup.h file and the corresponding name via misc_res_name[] 2184in the kernel/cgroup/misc.c file. Provider of the resource must set its 2185capacity prior to using the resource by calling misc_cg_set_capacity(). 2186 2187Once a capacity is set then the resource usage can be updated using charge and 2188uncharge APIs. All of the APIs to interact with misc controller are in 2189include/linux/misc_cgroup.h. 2190 2191Misc Interface Files 2192~~~~~~~~~~~~~~~~~~~~ 2193 2194Miscellaneous controller provides 3 interface files. If two misc resources (res_a and res_b) are registered then: 2195 2196 misc.capacity 2197 A read-only flat-keyed file shown only in the root cgroup. It shows 2198 miscellaneous scalar resources available on the platform along with 2199 their quantities:: 2200 2201 $ cat misc.capacity 2202 res_a 50 2203 res_b 10 2204 2205 misc.current 2206 A read-only flat-keyed file shown in the non-root cgroups. It shows 2207 the current usage of the resources in the cgroup and its children.:: 2208 2209 $ cat misc.current 2210 res_a 3 2211 res_b 0 2212 2213 misc.max 2214 A read-write flat-keyed file shown in the non root cgroups. Allowed 2215 maximum usage of the resources in the cgroup and its children.:: 2216 2217 $ cat misc.max 2218 res_a max 2219 res_b 4 2220 2221 Limit can be set by:: 2222 2223 # echo res_a 1 > misc.max 2224 2225 Limit can be set to max by:: 2226 2227 # echo res_a max > misc.max 2228 2229 Limits can be set higher than the capacity value in the misc.capacity 2230 file. 2231 2232Migration and Ownership 2233~~~~~~~~~~~~~~~~~~~~~~~ 2234 2235A miscellaneous scalar resource is charged to the cgroup in which it is used 2236first, and stays charged to that cgroup until that resource is freed. Migrating 2237a process to a different cgroup does not move the charge to the destination 2238cgroup where the process has moved. 2239 2240Others 2241------ 2242 2243perf_event 2244~~~~~~~~~~ 2245 2246perf_event controller, if not mounted on a legacy hierarchy, is 2247automatically enabled on the v2 hierarchy so that perf events can 2248always be filtered by cgroup v2 path. The controller can still be 2249moved to a legacy hierarchy after v2 hierarchy is populated. 2250 2251 2252Non-normative information 2253------------------------- 2254 2255This section contains information that isn't considered to be a part of 2256the stable kernel API and so is subject to change. 2257 2258 2259CPU controller root cgroup process behaviour 2260~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2261 2262When distributing CPU cycles in the root cgroup each thread in this 2263cgroup is treated as if it was hosted in a separate child cgroup of the 2264root cgroup. This child cgroup weight is dependent on its thread nice 2265level. 2266 2267For details of this mapping see sched_prio_to_weight array in 2268kernel/sched/core.c file (values from this array should be scaled 2269appropriately so the neutral - nice 0 - value is 100 instead of 1024). 2270 2271 2272IO controller root cgroup process behaviour 2273~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2274 2275Root cgroup processes are hosted in an implicit leaf child node. 
2276When distributing IO resources this implicit child node is taken into 2277account as if it was a normal child cgroup of the root cgroup with a 2278weight value of 200. 2279 2280 2281Namespace 2282========= 2283 2284Basics 2285------ 2286 2287cgroup namespace provides a mechanism to virtualize the view of the 2288"/proc/$PID/cgroup" file and cgroup mounts. The CLONE_NEWCGROUP clone 2289flag can be used with clone(2) and unshare(2) to create a new cgroup 2290namespace. The process running inside the cgroup namespace will have 2291its "/proc/$PID/cgroup" output restricted to cgroupns root. The 2292cgroupns root is the cgroup of the process at the time of creation of 2293the cgroup namespace. 2294 2295Without cgroup namespace, the "/proc/$PID/cgroup" file shows the 2296complete path of the cgroup of a process. In a container setup where 2297a set of cgroups and namespaces are intended to isolate processes the 2298"/proc/$PID/cgroup" file may leak potential system level information 2299to the isolated processes. For example:: 2300 2301 # cat /proc/self/cgroup 2302 0::/batchjobs/container_id1 2303 2304The path '/batchjobs/container_id1' can be considered as system-data 2305and undesirable to expose to the isolated processes. cgroup namespace 2306can be used to restrict visibility of this path. For example, before 2307creating a cgroup namespace, one would see:: 2308 2309 # ls -l /proc/self/ns/cgroup 2310 lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835] 2311 # cat /proc/self/cgroup 2312 0::/batchjobs/container_id1 2313 2314After unsharing a new namespace, the view changes:: 2315 2316 # ls -l /proc/self/ns/cgroup 2317 lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183] 2318 # cat /proc/self/cgroup 2319 0::/ 2320 2321When some thread from a multi-threaded process unshares its cgroup 2322namespace, the new cgroupns gets applied to the entire process (all 2323the threads). This is natural for the v2 hierarchy; however, for the 2324legacy hierarchies, this may be unexpected. 2325 2326A cgroup namespace is alive as long as there are processes inside or 2327mounts pinning it. When the last usage goes away, the cgroup 2328namespace is destroyed. The cgroupns root and the actual cgroups 2329remain. 2330 2331 2332The Root and Views 2333------------------ 2334 2335The 'cgroupns root' for a cgroup namespace is the cgroup in which the 2336process calling unshare(2) is running. For example, if a process in 2337/batchjobs/container_id1 cgroup calls unshare, cgroup 2338/batchjobs/container_id1 becomes the cgroupns root. For the 2339init_cgroup_ns, this is the real root ('/') cgroup. 2340 2341The cgroupns root cgroup does not change even if the namespace creator 2342process later moves to a different cgroup:: 2343 2344 # ~/unshare -c # unshare cgroupns in some cgroup 2345 # cat /proc/self/cgroup 2346 0::/ 2347 # mkdir sub_cgrp_1 2348 # echo 0 > sub_cgrp_1/cgroup.procs 2349 # cat /proc/self/cgroup 2350 0::/sub_cgrp_1 2351 2352Each process gets its namespace-specific view of "/proc/$PID/cgroup" 2353 2354Processes running inside the cgroup namespace will be able to see 2355cgroup paths (in /proc/self/cgroup) only inside their root cgroup. 
2356From within an unshared cgroupns:: 2357 2358 # sleep 100000 & 2359 [1] 7353 2360 # echo 7353 > sub_cgrp_1/cgroup.procs 2361 # cat /proc/7353/cgroup 2362 0::/sub_cgrp_1 2363 2364From the initial cgroup namespace, the real cgroup path will be 2365visible:: 2366 2367 $ cat /proc/7353/cgroup 2368 0::/batchjobs/container_id1/sub_cgrp_1 2369 2370From a sibling cgroup namespace (that is, a namespace rooted at a 2371different cgroup), the cgroup path relative to its own cgroup 2372namespace root will be shown. For instance, if PID 7353's cgroup 2373namespace root is at '/batchjobs/container_id2', then it will see:: 2374 2375 # cat /proc/7353/cgroup 2376 0::/../container_id2/sub_cgrp_1 2377 2378Note that the relative path always starts with '/' to indicate that 2379its relative to the cgroup namespace root of the caller. 2380 2381 2382Migration and setns(2) 2383---------------------- 2384 2385Processes inside a cgroup namespace can move into and out of the 2386namespace root if they have proper access to external cgroups. For 2387example, from inside a namespace with cgroupns root at 2388/batchjobs/container_id1, and assuming that the global hierarchy is 2389still accessible inside cgroupns:: 2390 2391 # cat /proc/7353/cgroup 2392 0::/sub_cgrp_1 2393 # echo 7353 > batchjobs/container_id2/cgroup.procs 2394 # cat /proc/7353/cgroup 2395 0::/../container_id2 2396 2397Note that this kind of setup is not encouraged. A task inside cgroup 2398namespace should only be exposed to its own cgroupns hierarchy. 2399 2400setns(2) to another cgroup namespace is allowed when: 2401 2402(a) the process has CAP_SYS_ADMIN against its current user namespace 2403(b) the process has CAP_SYS_ADMIN against the target cgroup 2404 namespace's userns 2405 2406No implicit cgroup changes happen with attaching to another cgroup 2407namespace. It is expected that the someone moves the attaching 2408process under the target cgroup namespace root. 2409 2410 2411Interaction with Other Namespaces 2412--------------------------------- 2413 2414Namespace specific cgroup hierarchy can be mounted by a process 2415running inside a non-init cgroup namespace:: 2416 2417 # mount -t cgroup2 none $MOUNT_POINT 2418 2419This will mount the unified cgroup hierarchy with cgroupns root as the 2420filesystem root. The process needs CAP_SYS_ADMIN against its user and 2421mount namespaces. 2422 2423The virtualization of /proc/self/cgroup file combined with restricting 2424the view of cgroup hierarchy by namespace-private cgroupfs mount 2425provides a properly isolated cgroup view inside the container. 2426 2427 2428Information on Kernel Programming 2429================================= 2430 2431This section contains kernel programming information in the areas 2432where interacting with cgroup is necessary. cgroup core and 2433controllers are not covered. 2434 2435 2436Filesystem Support for Writeback 2437-------------------------------- 2438 2439A filesystem can support cgroup writeback by updating 2440address_space_operations->writepage[s]() to annotate bio's using the 2441following two functions. 2442 2443 wbc_init_bio(@wbc, @bio) 2444 Should be called for each bio carrying writeback data and 2445 associates the bio with the inode's owner cgroup and the 2446 corresponding request queue. This must be called after 2447 a queue (device) has been associated with the bio and 2448 before submission. 2449 2450 wbc_account_cgroup_owner(@wbc, @page, @bytes) 2451 Should be called for each data segment being written out. 
2452 While this function doesn't care exactly when it's called 2453 during the writeback session, it's the easiest and most 2454 natural to call it as data segments are added to a bio. 2455 2456With writeback bio's annotated, cgroup support can be enabled per 2457super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for 2458selective disabling of cgroup writeback support which is helpful when 2459certain filesystem features, e.g. journaled data mode, are 2460incompatible. 2461 2462wbc_init_bio() binds the specified bio to its cgroup. Depending on 2463the configuration, the bio may be executed at a lower priority and if 2464the writeback session is holding shared resources, e.g. a journal 2465entry, may lead to priority inversion. There is no one easy solution 2466for the problem. Filesystems can try to work around specific problem 2467cases by skipping wbc_init_bio() and using bio_associate_blkg() 2468directly. 2469 2470 2471Deprecated v1 Core Features 2472=========================== 2473 2474- Multiple hierarchies including named ones are not supported. 2475 2476- All v1 mount options are not supported. 2477 2478- The "tasks" file is removed and "cgroup.procs" is not sorted. 2479 2480- "cgroup.clone_children" is removed. 2481 2482- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file 2483 at the root instead. 2484 2485 2486Issues with v1 and Rationales for v2 2487==================================== 2488 2489Multiple Hierarchies 2490-------------------- 2491 2492cgroup v1 allowed an arbitrary number of hierarchies and each 2493hierarchy could host any number of controllers. While this seemed to 2494provide a high level of flexibility, it wasn't useful in practice. 2495 2496For example, as there is only one instance of each controller, utility 2497type controllers such as freezer which can be useful in all 2498hierarchies could only be used in one. The issue is exacerbated by 2499the fact that controllers couldn't be moved to another hierarchy once 2500hierarchies were populated. Another issue was that all controllers 2501bound to a hierarchy were forced to have exactly the same view of the 2502hierarchy. It wasn't possible to vary the granularity depending on 2503the specific controller. 2504 2505In practice, these issues heavily limited which controllers could be 2506put on the same hierarchy and most configurations resorted to putting 2507each controller on its own hierarchy. Only closely related ones, such 2508as the cpu and cpuacct controllers, made sense to be put on the same 2509hierarchy. This often meant that userland ended up managing multiple 2510similar hierarchies repeating the same steps on each hierarchy 2511whenever a hierarchy management operation was necessary. 2512 2513Furthermore, support for multiple hierarchies came at a steep cost. 2514It greatly complicated cgroup core implementation but more importantly 2515the support for multiple hierarchies restricted how cgroup could be 2516used in general and what controllers was able to do. 2517 2518There was no limit on how many hierarchies there might be, which meant 2519that a thread's cgroup membership couldn't be described in finite 2520length. The key might contain any number of entries and was unlimited 2521in length, which made it highly awkward to manipulate and led to 2522addition of controllers which existed only to identify membership, 2523which in turn exacerbated the original problem of proliferating number 2524of hierarchies. 
2525 2526Also, as a controller couldn't have any expectation regarding the 2527topologies of hierarchies other controllers might be on, each 2528controller had to assume that all other controllers were attached to 2529completely orthogonal hierarchies. This made it impossible, or at 2530least very cumbersome, for controllers to cooperate with each other. 2531 2532In most use cases, putting controllers on hierarchies which are 2533completely orthogonal to each other isn't necessary. What usually is 2534called for is the ability to have differing levels of granularity 2535depending on the specific controller. In other words, hierarchy may 2536be collapsed from leaf towards root when viewed from specific 2537controllers. For example, a given configuration might not care about 2538how memory is distributed beyond a certain level while still wanting 2539to control how CPU cycles are distributed. 2540 2541 2542Thread Granularity 2543------------------ 2544 2545cgroup v1 allowed threads of a process to belong to different cgroups. 2546This didn't make sense for some controllers and those controllers 2547ended up implementing different ways to ignore such situations but 2548much more importantly it blurred the line between API exposed to 2549individual applications and system management interface. 2550 2551Generally, in-process knowledge is available only to the process 2552itself; thus, unlike service-level organization of processes, 2553categorizing threads of a process requires active participation from 2554the application which owns the target process. 2555 2556cgroup v1 had an ambiguously defined delegation model which got abused 2557in combination with thread granularity. cgroups were delegated to 2558individual applications so that they can create and manage their own 2559sub-hierarchies and control resource distributions along them. This 2560effectively raised cgroup to the status of a syscall-like API exposed 2561to lay programs. 2562 2563First of all, cgroup has a fundamentally inadequate interface to be 2564exposed this way. For a process to access its own knobs, it has to 2565extract the path on the target hierarchy from /proc/self/cgroup, 2566construct the path by appending the name of the knob to the path, open 2567and then read and/or write to it. This is not only extremely clunky 2568and unusual but also inherently racy. There is no conventional way to 2569define transaction across the required steps and nothing can guarantee 2570that the process would actually be operating on its own sub-hierarchy. 2571 2572cgroup controllers implemented a number of knobs which would never be 2573accepted as public APIs because they were just adding control knobs to 2574system-management pseudo filesystem. cgroup ended up with interface 2575knobs which were not properly abstracted or refined and directly 2576revealed kernel internal details. These knobs got exposed to 2577individual applications through the ill-defined delegation mechanism 2578effectively abusing cgroup as a shortcut to implementing public APIs 2579without going through the required scrutiny. 2580 2581This was painful for both userland and kernel. Userland ended up with 2582misbehaving and poorly abstracted interfaces and kernel exposing and 2583locked into constructs inadvertently. 
2584 2585 2586Competition Between Inner Nodes and Threads 2587------------------------------------------- 2588 2589cgroup v1 allowed threads to be in any cgroups which created an 2590interesting problem where threads belonging to a parent cgroup and its 2591children cgroups competed for resources. This was nasty as two 2592different types of entities competed and there was no obvious way to 2593settle it. Different controllers did different things. 2594 2595The cpu controller considered threads and cgroups as equivalents and 2596mapped nice levels to cgroup weights. This worked for some cases but 2597fell flat when children wanted to be allocated specific ratios of CPU 2598cycles and the number of internal threads fluctuated - the ratios 2599constantly changed as the number of competing entities fluctuated. 2600There also were other issues. The mapping from nice level to weight 2601wasn't obvious or universal, and there were various other knobs which 2602simply weren't available for threads. 2603 2604The io controller implicitly created a hidden leaf node for each 2605cgroup to host the threads. The hidden leaf had its own copies of all 2606the knobs with ``leaf_`` prefixed. While this allowed equivalent 2607control over internal threads, it was with serious drawbacks. It 2608always added an extra layer of nesting which wouldn't be necessary 2609otherwise, made the interface messy and significantly complicated the 2610implementation. 2611 2612The memory controller didn't have a way to control what happened 2613between internal tasks and child cgroups and the behavior was not 2614clearly defined. There were attempts to add ad-hoc behaviors and 2615knobs to tailor the behavior to specific workloads which would have 2616led to problems extremely difficult to resolve in the long term. 2617 2618Multiple controllers struggled with internal tasks and came up with 2619different ways to deal with it; unfortunately, all the approaches were 2620severely flawed and, furthermore, the widely different behaviors 2621made cgroup as a whole highly inconsistent. 2622 2623This clearly is a problem which needs to be addressed from cgroup core 2624in a uniform way. 2625 2626 2627Other Interface Issues 2628---------------------- 2629 2630cgroup v1 grew without oversight and developed a large number of 2631idiosyncrasies and inconsistencies. One issue on the cgroup core side 2632was how an empty cgroup was notified - a userland helper binary was 2633forked and executed for each event. The event delivery wasn't 2634recursive or delegatable. The limitations of the mechanism also led 2635to in-kernel event delivery filtering mechanism further complicating 2636the interface. 2637 2638Controller interfaces were problematic too. An extreme example is 2639controllers completely ignoring hierarchical organization and treating 2640all cgroups as if they were all located directly under the root 2641cgroup. Some controllers exposed a large amount of inconsistent 2642implementation details to userland. 2643 2644There also was no consistency across controllers. When a new cgroup 2645was created, some controllers defaulted to not imposing extra 2646restrictions while others disallowed any resource usage until 2647explicitly configured. Configuration knobs for the same type of 2648control used widely differing naming schemes and formats. Statistics 2649and information knobs were named arbitrarily and used different 2650formats and units even in the same controller. 
2651 2652cgroup v2 establishes common conventions where appropriate and updates 2653controllers so that they expose minimal and consistent interfaces. 2654 2655 2656Controller Issues and Remedies 2657------------------------------ 2658 2659Memory 2660~~~~~~ 2661 2662The original lower boundary, the soft limit, is defined as a limit 2663that is per default unset. As a result, the set of cgroups that 2664global reclaim prefers is opt-in, rather than opt-out. The costs for 2665optimizing these mostly negative lookups are so high that the 2666implementation, despite its enormous size, does not even provide the 2667basic desirable behavior. First off, the soft limit has no 2668hierarchical meaning. All configured groups are organized in a global 2669rbtree and treated like equal peers, regardless where they are located 2670in the hierarchy. This makes subtree delegation impossible. Second, 2671the soft limit reclaim pass is so aggressive that it not just 2672introduces high allocation latencies into the system, but also impacts 2673system performance due to overreclaim, to the point where the feature 2674becomes self-defeating. 2675 2676The memory.low boundary on the other hand is a top-down allocated 2677reserve. A cgroup enjoys reclaim protection when it's within its 2678effective low, which makes delegation of subtrees possible. It also 2679enjoys having reclaim pressure proportional to its overage when 2680above its effective low. 2681 2682The original high boundary, the hard limit, is defined as a strict 2683limit that can not budge, even if the OOM killer has to be called. 2684But this generally goes against the goal of making the most out of the 2685available memory. The memory consumption of workloads varies during 2686runtime, and that requires users to overcommit. But doing that with a 2687strict upper limit requires either a fairly accurate prediction of the 2688working set size or adding slack to the limit. Since working set size 2689estimation is hard and error prone, and getting it wrong results in 2690OOM kills, most users tend to err on the side of a looser limit and 2691end up wasting precious resources. 2692 2693The memory.high boundary on the other hand can be set much more 2694conservatively. When hit, it throttles allocations by forcing them 2695into direct reclaim to work off the excess, but it never invokes the 2696OOM killer. As a result, a high boundary that is chosen too 2697aggressively will not terminate the processes, but instead it will 2698lead to gradual performance degradation. The user can monitor this 2699and make corrections until the minimal memory footprint that still 2700gives acceptable performance is found. 2701 2702In extreme cases, with many concurrent allocations and a complete 2703breakdown of reclaim progress within the group, the high boundary can 2704be exceeded. But even then it's mostly better to satisfy the 2705allocation from the slack available in other groups or the rest of the 2706system than killing the group. Otherwise, memory.max is there to 2707limit this type of spillover and ultimately contain buggy or even 2708malicious applications. 2709 2710Setting the original memory.limit_in_bytes below the current usage was 2711subject to a race condition, where concurrent charges could cause the 2712limit setting to fail. memory.max on the other hand will first set the 2713limit to prevent new charges, and then reclaim and OOM kill until the 2714new limit is met - or the task writing to memory.max is killed. 
2715 2716The combined memory+swap accounting and limiting is replaced by real 2717control over swap space. 2718 2719The main argument for a combined memory+swap facility in the original 2720cgroup design was that global or parental pressure would always be 2721able to swap all anonymous memory of a child group, regardless of the 2722child's own (possibly untrusted) configuration. However, untrusted 2723groups can sabotage swapping by other means - such as referencing its 2724anonymous memory in a tight loop - and an admin can not assume full 2725swappability when overcommitting untrusted jobs. 2726 2727For trusted jobs, on the other hand, a combined counter is not an 2728intuitive userspace interface, and it flies in the face of the idea 2729that cgroup controllers should account and limit specific physical 2730resources. Swap space is a resource like all others in the system, 2731and that's why unified hierarchy allows distributing it separately. 2732