==========================
Memory Resource Controller
==========================

NOTE:
      This document is hopelessly outdated and needs a complete
      rewrite. It still contains useful information, so we are keeping it
      here, but make sure to check the current code if you need a deeper
      understanding.

NOTE:
      The Memory Resource Controller has generically been referred to as the
      memory controller in this document. Do not confuse memory controller
      used here with the memory controller that is used in hardware.

(For editors) In this document:
      When we mention a cgroup (cgroupfs's directory) with memory controller,
      we call it "memory cgroup". When you see git-log and source code, you'll
      see that patch titles and function names tend to use "memcg".
      In this document, we avoid using it.

Benefits and Purpose of the memory controller
=============================================

The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12] mentions some probable
uses of the memory controller. The memory controller can be used to

a. Isolate an application or a group of applications
   Memory-hungry applications can be isolated and limited to a smaller
   amount of memory.
b. Create a cgroup with a limited amount of memory; this can be used
   as a good alternative to booting with mem=XXXX.
c. Virtualization solutions can control the amount of memory they want
   to assign to a virtual machine instance.
d. A CD/DVD burner could control the amount of memory used by the
   rest of the system to ensure that burning does not fail due to lack
   of available memory.
e. There are several other use cases; find one or use the controller just
   for fun (to learn and hack on the VM subsystem).

Current Status: linux-2.6.34-mmotm (development version of 2010/April)

Features:

 - accounting and limiting the usage of anonymous pages, file caches and
   swap caches.
 - pages are linked to per-memcg LRU exclusively, and there is no global LRU.
 - optionally, memory+swap usage can be accounted and limited.
 - hierarchical accounting
 - soft limit
 - moving (recharging) the account when a task is moved is selectable.
 - usage threshold notifier
 - memory pressure notifier
 - oom-killer disable knob and oom-notifier
 - Root cgroup has no limit controls.

 Kernel memory support is a work in progress, and the current version provides
 basic functionality. (See Section 2.7)

Brief summary of control files.

==================================== ==========================================
 tasks                               attach a task (thread) and show list of
                                     threads
 cgroup.procs                        show list of processes
 cgroup.event_control                an interface for eventfd()
                                     This knob is not available on CONFIG_PREEMPT_RT systems.
 memory.usage_in_bytes               show current usage for memory
                                     (See 5.5 for details)
 memory.memsw.usage_in_bytes         show current usage for memory+Swap
                                     (See 5.5 for details)
 memory.limit_in_bytes               set/show limit of memory usage
 memory.memsw.limit_in_bytes         set/show limit of memory+Swap usage
 memory.failcnt                      show how often memory usage hit the limit
 memory.memsw.failcnt                show how often memory+Swap hit the limit
 memory.max_usage_in_bytes           show max memory usage recorded
 memory.memsw.max_usage_in_bytes     show max memory+Swap usage recorded
 memory.soft_limit_in_bytes          set/show soft limit of memory usage
                                     This knob is not available on CONFIG_PREEMPT_RT systems.
 memory.stat                         show various statistics
 memory.use_hierarchy                set/show hierarchical account enabled
                                     This knob is deprecated and shouldn't be
                                     used.
 memory.force_empty                  trigger forced page reclaim
 memory.pressure_level               set memory pressure notifications
 memory.swappiness                   set/show swappiness parameter of vmscan
                                     (See sysctl's vm.swappiness)
 memory.move_charge_at_immigrate     set/show controls of moving charges
                                     This knob is deprecated and shouldn't be
                                     used.
 memory.oom_control                  set/show oom controls.
 memory.numa_stat                    show memory usage per NUMA node
 memory.kmem.limit_in_bytes          This knob is deprecated and writing to
                                     it will return -ENOTSUPP.
 memory.kmem.usage_in_bytes          show current kernel memory allocation
 memory.kmem.failcnt                 show how often kernel memory usage hit
                                     the limit
 memory.kmem.max_usage_in_bytes      show max kernel memory usage recorded

 memory.kmem.tcp.limit_in_bytes      set/show hard limit for tcp buf memory
 memory.kmem.tcp.usage_in_bytes      show current tcp buf memory allocation
 memory.kmem.tcp.failcnt             show how often tcp buf memory usage hit
                                     the limit
 memory.kmem.tcp.max_usage_in_bytes  show max tcp buf memory usage recorded
==================================== ==========================================

1. History
==========

The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]. At the time the RFC was posted
there were several implementations for memory control. The goal of the
RFC was to build consensus and agreement for the minimal features required
for memory control. The first RSS controller was posted by Balbir Singh [2]
in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
RSS controller. At OLS, at the resource management BoF, everyone suggested
that we handle both page cache and RSS together. Another request was raised
to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11].

2. Memory Control
=================

Memory is a unique resource in the sense that it is present in a limited
amount. If a task requires a lot of CPU processing, the task can spread
its processing over a period of hours, days, months or years, but with
memory, the same physical memory needs to be reused to accomplish the task.

The memory controller implementation has been divided into phases. These
are:

1. Memory controller
2. mlock(2) controller
3. Kernel user memory accounting and slab control
4. user mappings length controller

The memory controller is the first controller developed.

2.1. Design
-----------

The core of the design is a counter called the page_counter. The
page_counter tracks the current memory usage and limit of the group of
processes associated with the controller. Each cgroup has a memory controller
specific data structure (mem_cgroup) associated with it.

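As a rough illustration of the counter described in 2.1, the following shell
sketch models a single counter with a usage, a limit, a high-water mark and a
failure count. It is a toy model and not the kernel's page_counter code: the
variable names, the example limit of 4 pages and the try_charge/uncharge
helpers are all assumptions made for this example.

```shell
# Toy model of a single usage counter; all names here are illustrative.
usage=0        # pages currently charged
limit=4        # maximum pages allowed (assumed value for the example)
max_usage=0    # highest usage seen so far
failcnt=0      # number of charges refused because of the limit

# Charge nr_pages if the limit allows it; otherwise count a failure.
try_charge() {
    nr_pages=$1
    if [ $(( usage + nr_pages )) -gt "$limit" ]; then
        failcnt=$(( failcnt + 1 ))
        return 1
    fi
    usage=$(( usage + nr_pages ))
    if [ "$usage" -gt "$max_usage" ]; then
        max_usage=$usage
    fi
    return 0
}

# Give back nr_pages that were charged earlier.
uncharge() {
    usage=$(( usage - $1 ))
}
```

The three values the model keeps correspond loosely to what
memory.usage_in_bytes, memory.max_usage_in_bytes and memory.failcnt report.
Unlike this sketch, the real controller tries to reclaim memory from the
cgroup when a charge would exceed the limit (see 2.5).
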
2.2. Accounting
---------------

::

                +--------------------+
                |  mem_cgroup        |
                |  (page_counter)    |
                +--------------------+
                 /            ^      \
                /             |       \
           +---------------+  |        +---------------+
           | mm_struct     |  |....    | mm_struct     |
           |               |  |        |               |
           +---------------+  |        +---------------+
                              |
                              + --------------+
                                              |
           +---------------+           +------+--------+
           | page          +---------->  page_cgroup|
           |               |           |               |
           +---------------+           +---------------+

             (Figure 1: Hierarchy of Accounting)


Figure 1 shows the important aspects of the controller:

1. Accounting happens per cgroup
2. Each mm_struct knows which cgroup it belongs to
3. Each page has a pointer to the page_cgroup, which in turn knows the
   cgroup it belongs to

The accounting is done as follows: mem_cgroup_charge_common() is invoked to
set up the necessary data structures and check if the cgroup that is being
charged is over its limit. If it is, then reclaim is invoked on the cgroup.
More details can be found in the reclaim section of this document.
If everything goes well, a page meta-data-structure called page_cgroup is
updated. page_cgroup has its own LRU on cgroup.
(*) page_cgroup structure is allocated at boot/memory-hotplug time.

2.2.1 Accounting details
------------------------

All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
Some pages which are never reclaimable and will not be on the LRU
are not accounted. We just account pages under usual VM management.

RSS pages are accounted at page_fault unless they've already been accounted
for earlier. A file page will be accounted for as Page Cache when it's
inserted into the inode (radix-tree). While it's mapped into the page tables
of processes, duplicate accounting is carefully avoided.

An RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from the radix-tree. Even if RSS pages are fully
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. Such SwapCaches are also accounted.
A swapped-in page is accounted after it is added to the swapcache.

Note: The kernel does swapin-readahead and reads multiple swap entries at
once. Since the page's memcg is recorded in swap regardless of whether memsw
is enabled, the page will be accounted after swapin.

At page migration, accounting information is kept.

Note: we just account pages-on-LRU because our purpose is to control the
amount of used pages; not-on-LRU pages tend to be out-of-control from the VM
view.

2.3 Shared Page Accounting
--------------------------

Shared pages are accounted on the basis of the first touch approach. The
cgroup that first touches a page is accounted for the page. The principle
behind this approach is that a cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).

But see section 8.2: when moving a task to another cgroup, its pages may
be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.

2.4 Swap Extension
------------------

Swap usage is always recorded for each cgroup. Swap Extension allows you to
read and limit it.

When CONFIG_SWAP is enabled, the following files are added.

 - memory.memsw.usage_in_bytes.
 - memory.memsw.limit_in_bytes.

memsw means memory+swap. Usage of memory+swap is limited by
memsw.limit_in_bytes.

Example: Assume a system with 4G of swap. A task which allocates 6G of memory
(by mistake) under 2G memory limitation will use all swap.
In this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
By using the memsw limit, you can avoid system OOM which can be caused by swap
shortage.

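The arithmetic behind the example above can be sketched in shell. Once a
cgroup's memory usage has reached memory.limit_in_bytes, the swap it can still
consume is the difference between the two limits. The helper name
max_swap_bytes is hypothetical and exists only for this illustration.

```shell
# Swap a cgroup can use once its memory usage sits at the memory limit.
# Illustrative helper only; it is not part of any kernel or cgroup tool.
max_swap_bytes() {
    mem_limit=$1      # value of memory.limit_in_bytes
    memsw_limit=$2    # value of memory.memsw.limit_in_bytes
    echo $(( memsw_limit - mem_limit ))
}

G=$(( 1024 * 1024 * 1024 ))
# 2G memory limit and 3G memory+swap limit, as in the example above
max_swap_bytes $(( 2 * G )) $(( 3 * G ))
```

With these limits the runaway task can occupy at most 1G of the 4G of swap
instead of all of it.
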
**Why 'memory+swap' rather than swap**

The global LRU (kswapd) can swap out arbitrary pages. Swap-out means
moving the account from memory to swap; there is no change in the usage of
memory+swap. In other words, when we want to limit the usage of swap without
affecting the global LRU, a memory+swap limit is better than just limiting
swap from an OS point of view.

**What happens when a cgroup hits memory.memsw.limit_in_bytes**

When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by the cgroup routine and file
caches are dropped. But as mentioned above, the global LRU can still swap out
memory from it for the sanity of the system's memory management state. You
can't forbid it by cgroup.

2.5 Reclaim
-----------

Each cgroup maintains a per cgroup LRU which has the same structure as
the global VM. When a cgroup goes over its limit, we first try
to reclaim memory from the cgroup so as to make space for the new
pages that the cgroup has touched. If the reclaim is unsuccessful,
an OOM routine is invoked to select and kill the bulkiest task in the
cgroup. (See 10. OOM Control below.)

The reclaim algorithm has not been modified for cgroups, except that
pages that are selected for reclaiming come from the per-cgroup LRU
list.

NOTE:
  Reclaim does not work for the root cgroup, since we cannot set any
  limits on the root cgroup.

Note2:
  When panic_on_oom is set to "2", the whole system will panic.

When an OOM event notifier is registered, the event will be delivered.
(See oom_control section)

2.6 Locking
-----------

Lock order is as follows::

  Page lock (PG_locked bit of page->flags)
    mm->page_table_lock or split pte_lock
      lock_page_memcg (memcg->move_lock)
        mapping->i_pages lock
          lruvec->lru_lock.

Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
lruvec->lru_lock; PG_lru bit of page->flags is cleared before
isolating a page from its LRU under lruvec->lru_lock.

2.7 Kernel Memory Extension
---------------------------

With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is fundamentally
different from user memory, since it can't be swapped out, which makes it
possible to DoS the system by consuming too much of this precious resource.

Kernel memory accounting is enabled for all memory cgroups by default. But
it can be disabled system-wide by passing cgroup.memory=nokmem to the kernel
at boot time. In this case, kernel memory will not be accounted at all.

Kernel memory limits are not imposed for the root cgroup. Usage for the root
cgroup may or may not be accounted. The memory used is accumulated into
memory.kmem.usage_in_bytes, or in a separate counter when it makes sense
(currently only for tcp).

The main "kmem" counter is fed into the main counter, so kmem charges will
also be visible from the user counter.

Currently no soft limit is implemented for kernel memory. It is future work
to trigger slab reclaim when those limits are reached.

2.7.1 Current Kernel Memory resources accounted
-----------------------------------------------

stack pages:
  every process consumes some stack pages. By accounting into
  kernel memory, we prevent new processes from being created when the kernel
  memory usage is too high.

slab pages:
  pages allocated by the SLAB or SLUB allocator are tracked. A copy
  of each kmem_cache is created the first time the cache is touched
  from inside the memcg. The creation is done lazily, so some objects can still be
  skipped while the cache is being created. All objects in a slab page should
  belong to the same memcg. This only fails to hold when a task is migrated to a
  different memcg during the page allocation by the cache.

sockets memory pressure:
  some socket protocols have memory pressure
  thresholds. The Memory Controller allows them to be controlled individually
  per cgroup, instead of globally.

tcp memory pressure:
  sockets memory pressure for the tcp protocol.

2.7.2 Common use cases
----------------------

Because the "kmem" counter is fed to the main user counter, kernel memory can
never be limited completely independently of user memory. Say "U" is the user
limit, and "K" the kernel limit. There are three possible ways limits can be
set:

U != 0, K = unlimited:
    This is the standard memcg limitation mechanism already present before kmem
    accounting. Kernel memory is completely ignored.

U != 0, K < U:
    Kernel memory is a subset of the user memory. This setup is useful in
    deployments where the total amount of memory per-cgroup is overcommitted.
    Overcommitting kernel memory limits is definitely not recommended, since the
    box can still run out of non-reclaimable memory.
    In this case, the admin could set up K so that the sum of all groups is
    never greater than the total memory, and freely set U at the cost of
    QoS.

WARNING:
    In the current implementation, memory reclaim will NOT be
    triggered for a cgroup when it hits K while staying below U, which makes
    this setup impractical.

U != 0, K >= U:
    Since kmem charges are also fed to the user counter, reclaim will be
    triggered for the cgroup for both kinds of memory. This setup gives the
    admin a unified view of memory, and it is also useful for people who just
    want to track kernel memory usage.

3. User Interface
=================

3.0. Configuration
------------------

a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG

3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
-------------------------------------------------------------------

::

        # mount -t tmpfs none /sys/fs/cgroup
        # mkdir /sys/fs/cgroup/memory
        # mount -t cgroup none /sys/fs/cgroup/memory -o memory

3.2. Make the new group and move bash into it::

        # mkdir /sys/fs/cgroup/memory/0
        # echo $$ > /sys/fs/cgroup/memory/0/tasks

Since we're now in the 0 cgroup, we can alter the memory limit::

        # echo 4M > /sys/fs/cgroup/memory/0/memory.limit_in_bytes

NOTE:
  We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
  mega or gigabytes. (Here, Kilo, Mega, Giga are Kibibytes, Mebibytes,
  Gibibytes.)

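To make the suffix rule concrete, here is a small shell sketch that converts a
suffixed size to bytes using binary multiples. The to_bytes helper is
hypothetical and written only for this illustration; the kernel parses the
suffix itself when the value is written to the control file.

```shell
# Convert a size such as 4M or 512k to bytes, using binary multiples.
# Hypothetical helper for illustration; the kernel does this parsing itself.
to_bytes() {
    value=$1
    number=${value%[kKmMgG]}       # strip one trailing suffix letter, if any
    case $value in
        *[kK]) echo $(( number * 1024 )) ;;
        *[mM]) echo $(( number * 1024 * 1024 )) ;;
        *[gG]) echo $(( number * 1024 * 1024 * 1024 )) ;;
        *)     echo "$number" ;;
    esac
}

to_bytes 4M    # 4 * 1024 * 1024 = 4194304
```

The result for 4M matches the 4194304 that memory.limit_in_bytes reports
after the limit is set to 4M.
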
NOTE:
  We can write "-1" to reset ``*.limit_in_bytes`` (unlimited).

NOTE:
  We cannot set limits on the root cgroup any more.

::

  # cat /sys/fs/cgroup/memory/0/memory.limit_in_bytes
  4194304

We can check the usage::

  # cat /sys/fs/cgroup/memory/0/memory.usage_in_bytes
  1216512

A successful write to memory.limit_in_bytes does not guarantee that the limit
is set to the value written into the file. This can be due to a
number of factors, such as rounding up to page boundaries or the total
availability of memory on the system. The user is required to re-read
this file after a write to confirm the value committed by the kernel::

  # echo 1 > memory.limit_in_bytes
  # cat memory.limit_in_bytes
  4096

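The write-then-re-read step can be wrapped in a small shell function. This is
a minimal sketch: set_and_verify is a hypothetical helper name, and the file
path is passed in by the caller rather than assumed.

```shell
# Write a limit and print the value that reads back from the same file.
# "set_and_verify" is a hypothetical helper name, not an existing tool.
set_and_verify() {
    file=$1       # e.g. /sys/fs/cgroup/memory/0/memory.limit_in_bytes
    wanted=$2     # value to write, e.g. 4M
    echo "$wanted" > "$file" || return 1
    committed=$(cat "$file")
    echo "$committed"
}
```

Comparing the printed value with the requested one shows whether the kernel
adjusted the limit.
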
441da82c92fSMauro Carvalho ChehabThe memory.failcnt field gives the number of times that the cgroup limit was
442da82c92fSMauro Carvalho Chehabexceeded.
443da82c92fSMauro Carvalho Chehab
444da82c92fSMauro Carvalho ChehabThe memory.stat file gives accounting information. Now, the number of
445da82c92fSMauro Carvalho Chehabcaches, RSS and Active pages/Inactive pages are shown.
446da82c92fSMauro Carvalho Chehab
447da82c92fSMauro Carvalho Chehab4. Testing
448da82c92fSMauro Carvalho Chehab==========
449da82c92fSMauro Carvalho Chehab
450da82c92fSMauro Carvalho ChehabFor testing features and implementation, see memcg_test.txt.
451da82c92fSMauro Carvalho Chehab
452da82c92fSMauro Carvalho ChehabPerformance test is also important. To see pure memory controller's overhead,
453da82c92fSMauro Carvalho Chehabtesting on tmpfs will give you good numbers of small overheads.
454da82c92fSMauro Carvalho ChehabExample: do kernel make on tmpfs.
455da82c92fSMauro Carvalho Chehab
456da82c92fSMauro Carvalho ChehabPage-fault scalability is also important. At measuring parallel
457da82c92fSMauro Carvalho Chehabpage fault test, multi-process test may be better than multi-thread
458da82c92fSMauro Carvalho Chehabtest because it has noise of shared objects/status.
459da82c92fSMauro Carvalho Chehab
460da82c92fSMauro Carvalho ChehabBut the above two are testing extreme situations.
461da82c92fSMauro Carvalho ChehabTrying usual test under memory controller is always helpful.

4.1 Troubleshooting
-------------------

Sometimes a user might find that the application under a cgroup is
terminated by the OOM killer. There are several causes for this:

1. The cgroup limit is too low (just too low to do anything useful)
2. The user is using anonymous memory and swap is turned off or too low

A sync followed by ``echo 1 > /proc/sys/vm/drop_caches`` will help get rid of
some of the pages cached in the cgroup (page cache pages).

To find out what is happening, disable the OOM killer as per
"10. OOM Control" (below) and observe how the cgroup behaves.

4.2 Task migration
------------------

When a task migrates from one cgroup to another, its charge is not
carried forward by default. The pages allocated from the original cgroup still
remain charged to it; the charge is dropped when the page is freed or
reclaimed.

You can move charges of a task along with task migration.
See 8. "Move charges at task migration".

4.3 Removing a cgroup
---------------------

A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
cgroup might have some charge associated with it, even though all
tasks have migrated away from it (because we charge against pages, not
against tasks).

We move the stats to the parent, and the charge does not change except for
uncharging from the child.

Charges recorded in swap information are not updated at removal of a cgroup.
The recorded information is discarded and a cgroup which uses the swap
(swapcache) will be charged as its new owner.

5. Misc. interfaces
===================

5.1 force_empty
---------------
  The memory.force_empty interface is provided to make the cgroup's memory
  usage empty. When writing anything to this::

    # echo 0 > memory.force_empty

  the cgroup will be reclaimed and as many pages reclaimed as possible.

  The typical use case for this interface is before calling rmdir().
  Though rmdir() offlines the memcg, the memcg may still stay there due to
  charged file caches. Some out-of-use page caches may stay charged until
  memory pressure happens. If you want to avoid that, force_empty will be useful.

5.2 stat file
-------------

The memory.stat file includes the following statistics:

per-memory cgroup local status
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

=============== ===============================================================
cache           # of bytes of page cache memory.
rss             # of bytes of anonymous and swap cache memory (includes
                transparent hugepages).
rss_huge        # of bytes of anonymous transparent hugepages.
mapped_file     # of bytes of mapped file (includes tmpfs/shmem)
pgpgin          # of charging events to the memory cgroup. The charging
                event happens each time a page is accounted as either mapped
                anon page(RSS) or cache page(Page Cache) to the cgroup.
pgpgout         # of uncharging events to the memory cgroup. The uncharging
                event happens each time a page is unaccounted from the cgroup.
swap            # of bytes of swap usage
dirty           # of bytes that are waiting to get written back to the disk.
writeback       # of bytes of file/anon cache that are queued for syncing to
                disk.
inactive_anon   # of bytes of anonymous and swap cache memory on inactive
                LRU list.
active_anon     # of bytes of anonymous and swap cache memory on active
                LRU list.
inactive_file   # of bytes of file-backed memory and MADV_FREE anonymous memory
                (LazyFree pages) on inactive LRU list.
active_file     # of bytes of file-backed memory on active LRU list.
unevictable     # of bytes of memory that cannot be reclaimed (mlocked etc).
=============== ===============================================================
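
Assuming memory.stat prints one "<counter> <value>" pair per line, a single
counter can be picked out with awk. The helper name memcg_stat below is
hypothetical; this is a sketch, not an interface provided by the kernel.

```shell
# Hypothetical helper: print the value of counter $2 from the
# memory.stat-style file $1; prints nothing if the counter is absent.
memcg_stat() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}
```

For example, ``memcg_stat /sys/fs/cgroup/memory/0/memory.stat cache`` would
print the page cache bytes charged to cgroup 0.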

status considering hierarchy (see memory.use_hierarchy settings)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================= ===================================================
hierarchical_memory_limit # of bytes of memory limit with regard to hierarchy
                          under which the memory cgroup is
hierarchical_memsw_limit  # of bytes of memory+swap limit with regard to
                          hierarchy under which memory cgroup is.

total_<counter>           # hierarchical version of <counter>, which in
                          addition to the cgroup's own value includes the
                          sum of all hierarchical children's values of
                          <counter>, e.g. total_cache
========================= ===================================================

The following additional stats are dependent on CONFIG_DEBUG_VM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

========================= ========================================
recent_rotated_anon       VM internal parameter. (see mm/vmscan.c)
recent_rotated_file       VM internal parameter. (see mm/vmscan.c)
recent_scanned_anon       VM internal parameter. (see mm/vmscan.c)
recent_scanned_file       VM internal parameter. (see mm/vmscan.c)
========================= ========================================

Memo:
        recent_rotated means recent frequency of LRU rotation.
        recent_scanned means recent # of scans to LRU.
        These are shown to aid debugging; please see the code for their
        meanings.

Note:
        Only anonymous and swap cache memory is listed as part of 'rss' stat.
        This should not be confused with the true 'resident set size' or the
        amount of physical memory used by the cgroup.

        'rss + mapped_file' will give you the resident set size of the cgroup.

        (Note: file and shmem may be shared among other cgroups. In that case,
        mapped_file is accounted only when the memory cgroup is owner of page
        cache.)

5.3 swappiness
--------------

Overrides /proc/sys/vm/swappiness for the particular group. The tunable
in the root cgroup corresponds to the global swappiness setting.

Please note that, unlike global reclaim, limit reclaim enforces that a
swappiness of 0 really prevents any swapping even if swap storage is
available. This might lead to the memcg OOM killer being invoked if there
are no file pages to reclaim.

5.4 failcnt
-----------

A memory cgroup provides memory.failcnt and memory.memsw.failcnt files.
This failcnt (== failure count) shows the number of times that a usage counter
hit its limit. When a memory cgroup hits a limit, failcnt increases and
memory under it will be reclaimed.

You can reset failcnt by writing 0 to the failcnt file::

        # echo 0 > .../memory.failcnt

5.5 usage_in_bytes
------------------

For efficiency, like other kernel components, the memory cgroup uses some
optimizations to avoid unnecessary cacheline false sharing. usage_in_bytes is
affected by this method and doesn't show the 'exact' value of memory (and
swap) usage; it's a fuzzy value for efficient access. (Of course, when
necessary, it's synchronized.) If you want to know the memory usage more
exactly, you should use the RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
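
The RSS+CACHE(+SWAP) sum can be computed from memory.stat with a short awk
call. This is a minimal sketch, assuming the "<counter> <value>" line format
of memory.stat; the helper name memcg_stat_usage is hypothetical.

```shell
# Hypothetical helper: sum the rss, cache and (when present) swap
# counters from the memory.stat-style file $1 and print the total in bytes.
memcg_stat_usage() {
    awk '$1 == "rss" || $1 == "cache" || $1 == "swap" { sum += $2 }
         END { printf "%.0f\n", sum }' "$1"
}
```

The exact match on the counter name keeps rss_huge and the total_<counter>
lines out of the sum.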

5.6 numa_stat
-------------

This is similar to numa_maps but operates on a per-memcg basis. This is
useful for providing visibility into the NUMA locality information within
a memcg since the pages are allowed to be allocated from any physical
node. One of the use cases is evaluating application performance by
combining this information with the application's CPU allocation.

Each memcg's numa_stat file includes "total", "file", "anon" and "unevictable"
per-node page counts including "hierarchical_<counter>" which sums up all
hierarchical children's values in addition to the memcg's own value.

The output format of memory.numa_stat is::

  total=<total pages> N0=<node 0 pages> N1=<node 1 pages> ...
  file=<total file pages> N0=<node 0 pages> N1=<node 1 pages> ...
  anon=<total anon pages> N0=<node 0 pages> N1=<node 1 pages> ...
  unevictable=<total unevictable pages> N0=<node 0 pages> N1=<node 1 pages> ...
  hierarchical_<counter>=<counter pages> N0=<node 0 pages> N1=<node 1 pages> ...

The "total" count is the sum of file + anon + unevictable.
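
Given the output format above, the page count of one counter on one node can
be extracted as sketched below. The helper name memcg_numa_pages is
hypothetical and the sketch relies only on the format shown above.

```shell
# Hypothetical helper: print the page count of counter $2 on node $3
# from the memory.numa_stat-style file $1.
memcg_numa_pages() {
    awk -v key="$2" -v node="N$3" '
        index($1, key "=") == 1 {          # line starts with "<counter>="
            for (i = 2; i <= NF; i++) {    # walk the N<id>=<pages> fields
                split($i, kv, "=")
                if (kv[1] == node) print kv[2]
            }
        }' "$1"
}
```

For example, ``memcg_numa_pages memory.numa_stat file 1`` would print the
number of file pages on node 1.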

6. Hierarchy support
====================

The memory controller supports a deep hierarchy and hierarchical accounting.
The hierarchy is created by creating the appropriate cgroups in the
cgroup filesystem. Consider, for example, the following cgroup filesystem
hierarchy::

               root
             /  |   \
            /   |    \
           a    b     c
                      | \
                      |  \
                      d   e

In the diagram above, with hierarchical accounting enabled, all memory
usage of e is accounted to its ancestors up until the root (i.e., c and root).
If one of the ancestors goes over its limit, the reclaim algorithm reclaims
from the tasks in the ancestor and the children of the ancestor.
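
To make the accounting chain concrete, the sketch below lists every cgroup
directory whose counters would include the usage of a given cgroup. The
helper name memcg_charged_chain is hypothetical, and the mount point is
passed in as an argument rather than assumed.

```shell
# Hypothetical helper: print cgroup $2 and each of its ancestors up to
# and including the hierarchy root $1, one directory per line.
memcg_charged_chain() {
    root=${1%/}
    dir=${2%/}
    while [ "$dir" != "$root" ] && [ -n "$dir" ]; do
        echo "$dir"
        dir=${dir%/*}            # step up to the parent directory
    done
    echo "$root"
}
```

With the hierarchy above mounted at /sys/fs/cgroup/memory, calling
``memcg_charged_chain /sys/fs/cgroup/memory /sys/fs/cgroup/memory/c/e``
would print the directories for e, c and the root.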

6.1 Hierarchical accounting and reclaim
---------------------------------------

Hierarchical accounting is enabled by default. Disabling the hierarchical
accounting is deprecated. An attempt to do it will result in a failure
and a warning printed to dmesg.

For compatibility reasons writing 1 to memory.use_hierarchy will always pass::

        # echo 1 > memory.use_hierarchy

7. Soft limits
==============

Soft limits allow for greater sharing of memory. The idea behind soft limits
is to allow control groups to use as much of the memory as needed, provided

a. There is no memory contention
b. They do not exceed their hard limit

When the system detects memory contention or low memory, control groups
are pushed back to their soft limits. If the soft limit of each control
group is very high, they are pushed back as much as possible to make
sure that one control group does not starve the others of memory.

Please note that soft limits are a best-effort feature; they come with
no guarantees, but they do their best to make sure that when memory is
heavily contended for, memory is allocated based on the soft limit
hints/setup. Currently, soft limit based reclaim is set up such that
it gets invoked from balance_pgdat (kswapd).

7.1 Interface
-------------

Soft limits can be set up by using the following commands (in this example we
assume a soft limit of 256 MiB)::

        # echo 256M > memory.soft_limit_in_bytes

If we want to change this to 1G, we can at any time use::

        # echo 1G > memory.soft_limit_in_bytes

NOTE1:
       Soft limits take effect over a long period of time, since they involve
       reclaiming memory for balancing between memory cgroups.
NOTE2:
       It is recommended to set the soft limit always below the hard limit,
       otherwise the hard limit will take precedence.
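
The recommendation in NOTE2 can be checked with a few lines of shell. This
is a sketch under the assumption that both files are read from the same
cgroup directory; the helper name check_soft_limit is hypothetical.

```shell
# Hypothetical helper: report whether the soft limit of the cgroup
# directory $1 is below its hard limit.
check_soft_limit() {
    soft=$(cat "$1/memory.soft_limit_in_bytes")
    hard=$(cat "$1/memory.limit_in_bytes")
    if [ "$soft" -lt "$hard" ]; then
        echo "ok"
    else
        echo "soft limit $soft is not below hard limit $hard"
    fi
}
```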

8. Move charges at task migration (DEPRECATED!)
===============================================

THIS IS DEPRECATED!

It's expensive and unreliable! It's better practice to launch workload
tasks directly from inside their target cgroup. Use dedicated workload
cgroups to allow fine-grained policy adjustments without having to
move physical pages between control domains.

Users can move charges associated with a task along with task migration, that
is, uncharge the task's pages from the old cgroup and charge them to the new
cgroup. This feature is not supported in !CONFIG_MMU environments because of
the lack of page tables.

8.1 Interface
-------------

This feature is disabled by default. It can be enabled (and disabled again) by
writing to memory.move_charge_at_immigrate of the destination cgroup.

If you want to enable it::

        # echo (some positive value) > memory.move_charge_at_immigrate

Note:
      Each bit of move_charge_at_immigrate has its own meaning about what type
      of charges should be moved. See 8.2 for details.
Note:
      Charges are moved only when you move mm->owner, in other words,
      a leader of a thread group.
Note:
      If we cannot find enough space for the task in the destination cgroup, we
      try to make space by reclaiming memory. Task migration may fail if we
      cannot make enough space.
Note:
      It can take several seconds if you move many charges.

And if you want to disable it again::

        # echo 0 > memory.move_charge_at_immigrate

8.2 Type of charges which can be moved
--------------------------------------

Each bit in move_charge_at_immigrate has its own meaning about what type of
charges should be moved. But in any case, it must be noted that an account of
a page or a swap can be moved only when it is charged to the task's current
(old) memory cgroup.

+---+--------------------------------------------------------------------------+
|bit| what type of charges would be moved?                                     |
+===+==========================================================================+
| 0 | A charge of an anonymous page (or swap of it) used by the target task.   |
|   | You must enable Swap Extension (see 2.4) to enable move of swap charges. |
+---+--------------------------------------------------------------------------+
| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) |
|   | and swaps of tmpfs file) mmapped by the target task. Unlike the case of  |
|   | anonymous pages, file pages (and swaps) in the range mmapped by the task |
|   | will be moved even if the task hasn't done page fault, i.e. they might   |
|   | not be the task's "RSS", but other task's "RSS" that maps the same file. |
|   | And mapcount of the page is ignored (the page can be moved even if       |
|   | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to    |
|   | enable move of swap charges.                                             |
+---+--------------------------------------------------------------------------+
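
The value written to move_charge_at_immigrate is the bitwise OR of the bits
in the table above. The sketch below only computes that value; the helper
name move_charge_mask is hypothetical, and the feature itself remains
deprecated.

```shell
# Hypothetical helper: build a move_charge_at_immigrate value.
# $1 = 1 selects anonymous page charges (bit 0),
# $2 = 1 selects file page charges (bit 1); other values leave a bit clear.
move_charge_mask() {
    mask=0
    if [ "$1" = 1 ]; then
        mask=$((mask | 1))       # bit 0
    fi
    if [ "$2" = 1 ]; then
        mask=$((mask | 2))       # bit 1
    fi
    echo "$mask"
}
```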

8.3 TODO
--------

- All of the charge moving operations are done under cgroup_mutex. It's not
  good behavior to hold the mutex too long, so we may need some trick.

9. Memory thresholds
====================

Memory cgroup implements memory thresholds using the cgroups notification
API (see cgroups.txt). It allows registering multiple memory and memsw
thresholds and getting notifications when usage crosses them.

To register a threshold, an application must:

- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
  cgroup.event_control.

The application will be notified through eventfd when memory usage crosses
a threshold in any direction.

This is applicable to both root and non-root cgroups.

10. OOM Control
===============

The memory.oom_control file is for OOM notification and other controls.

Memory cgroup implements an OOM notifier using the cgroup notification
API (see cgroups.txt). It allows registering multiple OOM notification
deliveries and getting a notification when OOM happens.

To register a notifier, an application must:

 - create an eventfd using eventfd(2)
 - open memory.oom_control file
 - write a string like "<event_fd> <fd of memory.oom_control>" to
   cgroup.event_control

The application will be notified through eventfd when OOM happens.
OOM notification doesn't work for the root cgroup.

You can disable the OOM-killer by writing "1" to memory.oom_control file, as::

        # echo 1 > memory.oom_control

If the OOM-killer is disabled, tasks under the cgroup will hang/sleep
in the memory cgroup's OOM-waitqueue when they request accountable memory.

To let them run again, you have to relax the memory cgroup's OOM status by

        * enlarging the limit or reducing usage.

To reduce usage,

        * kill some tasks.
        * move some tasks to another group with account migration.
        * remove some files (on tmpfs?)

Then, the stopped tasks will work again.

On reading, the current status of OOM is shown.

        - oom_kill_disable 0 or 1
          (if 1, oom-killer is disabled)
        - under_oom        0 or 1
          (if 1, the memory cgroup is under OOM, tasks may be stopped.)
        - oom_kill         integer counter
          The number of processes belonging to this cgroup killed by any
          kind of OOM killer.
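
The status fields listed above can be tested from a script. This is a
minimal sketch, assuming memory.oom_control prints one "<field> <value>"
pair per line; the helper name memcg_under_oom is hypothetical.

```shell
# Hypothetical helper: exit with status 0 when the memory.oom_control-style
# file $1 reports "under_oom 1", and with a non-zero status otherwise.
memcg_under_oom() {
    [ "$(awk '$1 == "under_oom" { print $2 }' "$1")" = "1" ]
}
```

For example, ``memcg_under_oom memory.oom_control && echo "tasks may be
stopped"`` prints the message only while the cgroup is under OOM.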
860da82c92fSMauro Carvalho Chehab
861da82c92fSMauro Carvalho Chehab11. Memory Pressure
862da82c92fSMauro Carvalho Chehab===================
863da82c92fSMauro Carvalho Chehab
864da82c92fSMauro Carvalho ChehabThe pressure level notifications can be used to monitor the memory
865da82c92fSMauro Carvalho Chehaballocation cost; based on the pressure, applications can implement
866da82c92fSMauro Carvalho Chehabdifferent strategies of managing their memory resources. The pressure
867da82c92fSMauro Carvalho Chehablevels are defined as following:
868da82c92fSMauro Carvalho Chehab
869da82c92fSMauro Carvalho ChehabThe "low" level means that the system is reclaiming memory for new
870da82c92fSMauro Carvalho Chehaballocations. Monitoring this reclaiming activity might be useful for
871da82c92fSMauro Carvalho Chehabmaintaining cache level. Upon notification, the program (typically
872da82c92fSMauro Carvalho Chehab"Activity Manager") might analyze vmstat and act in advance (i.e.
873da82c92fSMauro Carvalho Chehabprematurely shutdown unimportant services).
874da82c92fSMauro Carvalho Chehab
875da82c92fSMauro Carvalho ChehabThe "medium" level means that the system is experiencing medium memory
876da82c92fSMauro Carvalho Chehabpressure, the system might be making swap, paging out active file caches,
877da82c92fSMauro Carvalho Chehabetc. Upon this event applications may decide to further analyze
878da82c92fSMauro Carvalho Chehabvmstat/zoneinfo/memcg or internal memory usage statistics and free any
879da82c92fSMauro Carvalho Chehabresources that can be easily reconstructed or re-read from a disk.
880da82c92fSMauro Carvalho Chehab
881da82c92fSMauro Carvalho ChehabThe "critical" level means that the system is actively thrashing, it is
882da82c92fSMauro Carvalho Chehababout to out of memory (OOM) or even the in-kernel OOM killer is on its
883da82c92fSMauro Carvalho Chehabway to trigger. Applications should do whatever they can to help the
884da82c92fSMauro Carvalho Chehabsystem. It might be too late to consult with vmstat or any other
885da82c92fSMauro Carvalho Chehabstatistics, so it's advisable to take an immediate action.

By default, events are propagated upward until the event is handled, i.e. the
events are not pass-through. For example, you have three cgroups: A->B->C. Now
you set up an event listener on cgroups A, B and C, and suppose group C
experiences some pressure. In this situation, only group C will receive the
notification, i.e. groups A and B will not receive it. This is done to avoid
excessive "broadcasting" of messages, which disturbs the system and which is
especially bad if we are low on memory or thrashing. Group B will receive
a notification only if there are no event listeners for group C.

There are three optional modes that specify different propagation behavior:

 - "default": this is the default behavior specified above. This mode is the
   same as omitting the optional mode parameter, preserved for backwards
   compatibility.

 - "hierarchy": events always propagate up to the root. This is similar to
   the default behavior, except that propagation continues regardless of
   whether there are event listeners at each level. In the above
   example, groups A, B, and C will receive notification of memory pressure.

 - "local": events are pass-through, i.e. listeners only receive notifications
   when memory pressure is experienced in the memcg for which the notification
   is registered. In the above example, group C will receive notification if
   registered for "local" notification and the group experiences memory
   pressure. However, if group B is registered for local notification, it
   will never receive notification of pressure in group C, regardless of
   whether there is an event listener for group C.
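
The three modes can be compared with a small user-space model. This is a
sketch of the behavior as described in this section, not the kernel
implementation. Listener and notified are hypothetical names, and the
level a listener registers for is treated as a threshold (events at that
level or higher are delivered).

```python
from dataclasses import dataclass

LEVELS = {"low": 0, "medium": 1, "critical": 2}


@dataclass
class Listener:
    group: str             # cgroup the listener is registered on
    level: str             # "low", "medium" or "critical"
    mode: str = "default"  # "default", "hierarchy" or "local"


def notified(path, listeners, level):
    """Return the listeners signalled when path[0] experiences pressure.

    path lists the pressured cgroup first, then its ancestors up to the
    root, e.g. ["C", "B", "A"] for the hierarchy A->B->C.
    """
    result = []
    handled = False  # True once a listener lower in the path was signalled
    for depth, group in enumerate(path):
        is_ancestor = depth > 0
        signalled_here = False
        for listener in listeners:
            if listener.group != group:
                continue
            if LEVELS[level] < LEVELS[listener.level]:
                continue  # pressure is below the registered level
            if listener.mode == "local" and is_ancestor:
                continue  # local listeners ignore pressure in descendants
            if listener.mode == "default" and handled:
                continue  # default listeners stop once the event is handled
            result.append(listener)
            signalled_here = True
        handled = handled or signalled_here
    return result
```

With default listeners on A, B and C and pressure in C, the model reports
only the listener on C; with "hierarchy" listeners it reports all three,
in the order C, B, A.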

The level and event notification mode ("hierarchy" or "local", if necessary) are
specified by a comma-delimited string, e.g. "low,hierarchy" specifies
hierarchical, pass-through notification for all ancestor memcgs. To get
the default, non pass-through behavior, do not specify a mode.
"medium,local" specifies pass-through notification for the medium level.
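
The string format can be illustrated with a short user-space sketch.
parse_pressure_spec is a hypothetical helper written for this example;
the kernel's own parsing may accept or reject inputs differently.

```python
LEVELS = ("low", "medium", "critical")
MODES = ("default", "hierarchy", "local")


def parse_pressure_spec(spec):
    """Split "<level>[,<mode>]" into a (level, mode) tuple."""
    parts = spec.split(",")
    if len(parts) > 2:
        raise ValueError(f"too many fields in {spec!r}")
    level = parts[0]
    # Omitting the mode selects the default, non pass-through behavior.
    mode = parts[1] if len(parts) == 2 else "default"
    if level not in LEVELS:
        raise ValueError(f"unknown level in {spec!r}")
    if mode not in MODES:
        raise ValueError(f"unknown mode in {spec!r}")
    return level, mode
```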

The file memory.pressure_level is only used to set up an eventfd. To
register a notification, an application must:

- create an eventfd using eventfd(2);
- open memory.pressure_level;
- write string as "<event_fd> <fd of memory.pressure_level> <level[,mode]>"
  to cgroup.event_control.

The application will be notified through the eventfd when memory pressure is
at the specified level (or higher). Read/write operations on
memory.pressure_level are not implemented.
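
The registration steps can be sketched in user space as follows. This is
a minimal example, not a reference implementation: it assumes Linux with
the cgroup v1 memory controller mounted, sufficient privileges, and
Python 3.10 or later (for os.eventfd). The function names are
hypothetical, and the sketch keeps the memory.pressure_level descriptor
open and returns it to the caller.

```python
import os


def control_line(event_fd, pressure_fd, spec):
    """Build the string written to cgroup.event_control."""
    return f"{event_fd} {pressure_fd} {spec}"


def register_pressure_listener(cgroup_dir, spec):
    """Register for pressure events; return (event_fd, pressure_fd)."""
    event_fd = os.eventfd(0)  # create an eventfd, see eventfd(2)
    pressure_fd = -1
    try:
        # Open memory.pressure_level; it is never read or written.
        pressure_fd = os.open(
            os.path.join(cgroup_dir, "memory.pressure_level"), os.O_RDONLY)
        control_fd = os.open(
            os.path.join(cgroup_dir, "cgroup.event_control"), os.O_WRONLY)
        try:
            line = control_line(event_fd, pressure_fd, spec)
            os.write(control_fd, line.encode())
        finally:
            os.close(control_fd)
    except OSError:
        # Do not leak descriptors if any step fails.
        if pressure_fd >= 0:
            os.close(pressure_fd)
        os.close(event_fd)
        raise
    return event_fd, pressure_fd


def wait_for_pressure(event_fd):
    """Block until a notification arrives; return the eventfd counter."""
    return os.eventfd_read(event_fd)
```

A caller might register with
register_pressure_listener("/sys/fs/cgroup/memory/foo", "low,hierarchy")
and then call wait_for_pressure() in a loop on the returned eventfd.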

Test:

   Here is a small script example that makes a new cgroup, sets up a
   memory limit, sets up a notification in the cgroup and then makes the
   cgroup experience critical pressure::

	# cd /sys/fs/cgroup/memory/
	# mkdir foo
	# cd foo
	# cgroup_event_listener memory.pressure_level low,hierarchy &
	# echo 8000000 > memory.limit_in_bytes
	# echo 8000000 > memory.memsw.limit_in_bytes
	# echo $$ > tasks
	# dd if=/dev/zero | read x

   (Expect a bunch of notifications, and eventually, the oom-killer will
   trigger.)

12. TODO
========

1. Make per-cgroup scanner reclaim not-shared pages first
2. Teach controller to account for shared-pages
3. Start reclamation in the background when the limit is
   not yet hit but the usage is getting closer

Summary
=======

Overall, the memory controller has been a stable controller and has been
commented on and discussed quite extensively in the community.

References
==========

1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
2. Singh, Balbir. Memory Controller (RSS Control),
   http://lwn.net/Articles/222762/
3. Emelianov, Pavel. Resource controllers based on process cgroups,
   https://lore.kernel.org/r/45ED7DEC.7010403@sw.ru
4. Emelianov, Pavel. RSS controller based on process cgroups (v2),
   https://lore.kernel.org/r/461A3010.90403@sw.ru
5. Emelianov, Pavel. RSS controller based on process cgroups (v3),
   https://lore.kernel.org/r/465D9739.8070209@openvz.org
6. Menage, Paul. Control Groups v10, http://lwn.net/Articles/236032/
7. Vaidyanathan, Srinivasan. Control Groups: Pagecache accounting and control
   subsystem (v3), http://lwn.net/Articles/235534/
8. Singh, Balbir. RSS controller v2 test results (lmbench),
   https://lore.kernel.org/r/464C95D4.7070806@linux.vnet.ibm.com
9. Singh, Balbir. RSS controller v2 AIM9 results,
   https://lore.kernel.org/r/464D267A.50107@linux.vnet.ibm.com
10. Singh, Balbir. Memory controller v6 test results,
    https://lore.kernel.org/r/20070819094658.654.84837.sendpatchset@balbir-laptop
11. Singh, Balbir. Memory controller introduction (v6),
    https://lore.kernel.org/r/20070817084228.26003.12568.sendpatchset@balbir-laptop
12. Corbet, Jonathan. Controlling memory use in cgroups,
    http://lwn.net/Articles/243795/