============================
Transparent Hugepage Support
============================

Objective
=========

Performance critical computing applications dealing with large memory
working sets are already running on top of libhugetlbfs and in turn
hugetlbfs. Transparent HugePage Support (THP) is an alternative means
of using huge pages to back virtual memory that supports the automatic
promotion and demotion of page sizes and does not have the
shortcomings of hugetlbfs.

Currently THP only works for anonymous memory mappings and tmpfs/shmem.
In the future it can expand to other filesystems.

.. note::
   In the examples below we presume that the basic page size is 4K and
   the huge page size is 2M, although the actual numbers may vary
   depending on the CPU architecture.

Applications run faster because of two factors. The first factor is
almost completely irrelevant and not of significant interest because
it also has the downside of requiring larger clear-page and copy-page
operations in page faults, which is a potentially negative effect. The
first factor consists of taking a single page fault for each 2M
virtual region touched by userland (so reducing the enter/exit kernel
frequency by a factor of 512). This only matters the first time the
memory is accessed for the lifetime of a memory mapping. The second,
long lasting and much more important factor affects all subsequent
accesses to the memory for the whole runtime of the application. The
second factor consists of two components:

1) the TLB miss will run faster (especially with virtualization using
   nested pagetables but almost always also on bare metal without
   virtualization)

2) a single TLB entry will be mapping a much larger amount of virtual
   memory, in turn reducing the number of TLB misses. With
   virtualization and nested pagetables the TLB can map the larger
   size only if both KVM and the Linux guest are using hugepages, but
   a significant speedup already happens if only one of the two is
   using hugepages, just because the TLB miss is going to run faster.

THP can be enabled system wide or restricted to certain tasks or even
memory ranges inside a task's address space. Unless THP is completely
disabled, there is a ``khugepaged`` daemon that scans memory and
collapses sequences of basic pages into huge pages.

The THP behaviour is controlled via the :ref:`sysfs <thp_sysfs>`
interface and the madvise(2) and prctl(2) system calls.

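For example, the per-task opt-out can be requested from C with
prctl(2); the following is only a minimal illustrative sketch and
assumes a libc that exposes ``PR_SET_THP_DISABLE`` and
``PR_GET_THP_DISABLE`` in ``<sys/prctl.h>``::

	/* Sketch: opt the calling task (and later children) out of THP. */
	#include <stdio.h>
	#include <sys/prctl.h>

	int main(void)
	{
		if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
			perror("PR_SET_THP_DISABLE");

		/* Read the flag back: 1 means THP is disabled for this task. */
		printf("THP disabled: %d\n",
		       (int)prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0));
		return 0;
	}
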
Transparent Hugepage Support maximizes the usefulness of free memory
compared to the reservation approach of hugetlbfs by allowing all
unused memory to be used as cache or other movable (or even unmovable)
entities. It doesn't require reservation to prevent hugepage
allocation failures from being noticeable from userland. It allows
paging and all other advanced VM features to be available on the
hugepages. It requires no modifications for applications to take
advantage of it.

Applications however can be further optimized to take advantage of
this feature, as for example they've been optimized before to avoid a
flood of mmap system calls for every malloc(4k). Optimizing userland
is not mandatory and khugepaged can already take care of long lived
page allocations even for hugepage unaware applications that deal
with large amounts of memory.

In certain cases when hugepages are enabled system wide, applications
may end up allocating more memory resources. An application may mmap a
large region but only touch 1 byte of it; in that case a 2M page might
be allocated instead of a 4k page for no good reason. This is why it's
possible to disable hugepages system-wide and to only have them inside
MADV_HUGEPAGE madvise regions.

Embedded systems should enable hugepages only inside madvise regions
to eliminate any risk of wasting any precious byte of memory and to
only run faster.

Applications that get a lot of benefit from hugepages and that don't
risk losing memory by using hugepages should use madvise(MADV_HUGEPAGE)
on their critical mmapped regions.

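For illustration, a minimal sketch of such a region (the mapping size
is an arbitrary example and error handling is kept simple) could look
like::

	/* Sketch: map an anonymous region and hint it should use THP. */
	#define _DEFAULT_SOURCE
	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 256UL << 20;	/* 256M, arbitrary example size */
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* Ask the kernel to back this range with huge pages. */
		if (madvise(p, len, MADV_HUGEPAGE))
			perror("madvise(MADV_HUGEPAGE)");

		/* ... touch and use the memory ... */
		munmap(p, len);
		return 0;
	}
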
.. _thp_sysfs:

sysfs
=====

Global THP controls
-------------------

Transparent Hugepage Support for anonymous memory can be entirely disabled
(mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE
regions (to avoid the risk of consuming more memory resources) or enabled
system wide. This can be achieved with one of::
	echo always >/sys/kernel/mm/transparent_hugepage/enabled
	echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
	echo never >/sys/kernel/mm/transparent_hugepage/enabled

It's also possible to limit defrag efforts in the VM to generate
anonymous hugepages in case they're not immediately free to madvise
regions or to never try to defrag memory and simply fall back to regular
pages unless hugepages are immediately available. Clearly if we spend CPU
time to defrag memory, we would expect to gain even more by the fact we
use hugepages later instead of regular pages. This isn't always
guaranteed, but it may be more likely in case the allocation is for a
MADV_HUGEPAGE region.

::

	echo always >/sys/kernel/mm/transparent_hugepage/defrag
	echo defer >/sys/kernel/mm/transparent_hugepage/defrag
	echo defer+madvise >/sys/kernel/mm/transparent_hugepage/defrag
	echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
	echo never >/sys/kernel/mm/transparent_hugepage/defrag

always
	means that an application requesting THP will stall on
	allocation failure and directly reclaim pages and compact
	memory in an effort to allocate a THP immediately. This may be
	desirable for virtual machines that benefit heavily from THP
	use and are willing to delay the VM start to utilise them.

defer
	means that an application will wake kswapd in the background
	to reclaim pages and wake kcompactd to compact memory so that
	THP is available in the near future. It's the responsibility
	of khugepaged to then install the THP pages later.

defer+madvise
	will enter direct reclaim and compaction like ``always``, but
	only for regions that have used madvise(MADV_HUGEPAGE); all
	other regions will wake kswapd in the background to reclaim
	pages and wake kcompactd to compact memory so that THP is
	available in the near future.

madvise
	will enter direct reclaim like ``always`` but only for regions
	that have used madvise(MADV_HUGEPAGE). This is the default
	behaviour.

never
	should be self-explanatory.

By default the kernel tries to use the huge zero page on read page
faults to anonymous mappings. It's possible to disable the huge zero
page by writing 0 or enable it back by writing 1::

	echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
	echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

Some userspace (such as a test program, or an optimized memory allocation
library) may want to know the size (in bytes) of a transparent hugepage::

	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size

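For instance, a C program could read this knob as follows; this is a
minimal sketch with deliberately simple parsing and error handling::

	/* Sketch: query the transparent hugepage size in bytes from sysfs. */
	#include <stdio.h>

	static long thp_size(void)
	{
		long size = -1;
		FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size",
				"r");

		if (!f)
			return -1;
		if (fscanf(f, "%ld", &size) != 1)
			size = -1;
		fclose(f);
		return size;	/* typically 2097152 (2M) in the examples above */
	}

	int main(void)
	{
		printf("THP size: %ld bytes\n", thp_size());
		return 0;
	}
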
161khugepaged will be automatically started when
162transparent_hugepage/enabled is set to "always" or "madvise, and it'll
163be automatically shutdown if it's set to "never".
164
165Khugepaged controls
166-------------------
167
khugepaged usually runs at low frequency so while one may not want to
invoke defrag algorithms synchronously during the page faults, it
should be worth invoking defrag at least in khugepaged. However it's
also possible to disable defrag in khugepaged by writing 0 or enable
defrag in khugepaged by writing 1::

	echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
	echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag

You can also control how many pages khugepaged should scan at each
pass::

	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and how many milliseconds to wait in khugepaged between each pass (you
can set this to 0 to run khugepaged at 100% utilization of one core)::

	/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

and how many milliseconds to wait in khugepaged if there's a hugepage
allocation failure to throttle the next allocation attempt::

	/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

The khugepaged progress can be seen in the number of pages collapsed (note
that this counter may not be an exact count of the number of pages
collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping
being replaced by a PMD mapping, or (2) All 4K physical pages replaced by
one 2M hugepage. Each may happen independently, or together, depending on
the type of memory and the failures that occur. As such, this value should
be interpreted roughly as a sign of progress, and counters in /proc/vmstat
consulted for more accurate accounting)::

	/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed

and in the number of full scans performed, one per pass::

	/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans

``max_ptes_none`` specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page::

	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

A higher value allows more of a huge page's small pages to be unmapped
before collapse, which can make programs use additional memory. For
example, with 2M huge pages made of 512 4K pages, a value of 511 allows
collapsing a region in which only a single small page is mapped. A
lower value reduces the THP performance gain because fewer regions are
eligible for collapse. The effect of max_ptes_none on CPU time is
negligible and can be ignored.

``max_ptes_swap`` specifies how many pages can be brought in from
swap when collapsing a group of pages into a transparent huge page::

	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

A higher value can cause excessive swap IO and waste
memory. A lower value can prevent THPs from being
collapsed, resulting in fewer pages being collapsed into
THPs, and lower memory access performance.

``max_ptes_shared`` specifies how many pages can be shared across multiple
processes. Exceeding the number would block the collapse::

	/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

A higher value may increase memory footprint for some workloads.

Boot parameter
==============

You can change the sysfs boot time defaults of Transparent Hugepage
Support by passing the parameter ``transparent_hugepage=always`` or
``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
to the kernel command line.

Hugepages in tmpfs/shmem
========================

You can control hugepage allocation policy in tmpfs with mount option
``huge=``. It can have the following values:

always
    Attempt to allocate huge pages every time we need a new page;

never
    Do not allocate huge pages;

within_size
    Only allocate a huge page if it will be fully within i_size.
    Also respect fadvise()/madvise() hints;

advise
    Only allocate huge pages if requested with fadvise()/madvise();

The default policy is ``never``.

``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.

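The same policy can be selected from C through mount(2) by passing
``huge=`` in the filesystem data string; the sketch below is only an
illustration (the mount point path is made up, and mounting requires
the appropriate privileges)::

	/* Sketch: mount a tmpfs instance with huge=within_size. */
	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/* /mnt/thp-tmpfs is a hypothetical, pre-existing mount point. */
		if (mount("tmpfs", "/mnt/thp-tmpfs", "tmpfs", 0,
			  "huge=within_size,size=1G")) {
			perror("mount");
			return 1;
		}
		return 0;
	}
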
There's also a sysfs knob to control hugepage allocation policy for the
internal shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled.
The mount is used for SysV SHM, memfds, shared anonymous mmaps (of
/dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.

In addition to policies listed above, shmem_enabled allows two further
values:

deny
    For use in emergencies, to force the huge option off from
    all mounts;
force
    Force the huge option on for all - very useful for testing;

Need of application restart
===========================

The transparent_hugepage/enabled values and tmpfs mount option only affect
future behavior. So to make them effective you need to restart any
application that could have been using hugepages. This also applies to the
regions registered in khugepaged.

Monitoring usage
================

The number of anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
To identify what applications are using anonymous transparent huge pages,
it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields
for each mapping.

The number of file transparent huge pages mapped to userspace is available
by reading the ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
To identify what applications are mapping file transparent huge pages, it
is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
for each mapping.

Note that reading the smaps file is expensive and reading it
frequently will incur overhead.

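A rough sketch of system-wide monitoring from C, which only picks the
THP related fields out of ``/proc/meminfo`` without any robust parsing,
could be::

	/* Sketch: print the THP related fields from /proc/meminfo. */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f) {
			perror("/proc/meminfo");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "AnonHugePages:", 14) ||
			    !strncmp(line, "ShmemHugePages:", 15) ||
			    !strncmp(line, "ShmemPmdMapped:", 15))
				fputs(line, stdout);
		}
		fclose(f);
		return 0;
	}
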
There are a number of counters in ``/proc/vmstat`` that may be used to
monitor how successfully the system is providing huge pages for use.

thp_fault_alloc
	is incremented every time a huge page is successfully
	allocated to handle a page fault.

thp_collapse_alloc
	is incremented by khugepaged when it has found
	a range of pages to collapse into one huge page and has
	successfully allocated a new huge page to store the data.

thp_fault_fallback
	is incremented if a page fault fails to allocate
	a huge page and instead falls back to using small pages.

thp_fault_fallback_charge
	is incremented if a page fault fails to charge a huge page and
	instead falls back to using small pages even though the
	allocation was successful.

thp_collapse_alloc_failed
	is incremented if khugepaged found a range
	of pages that should be collapsed into one huge page but failed
	the allocation.

thp_file_alloc
	is incremented every time a file huge page is successfully
	allocated.

thp_file_fallback
	is incremented if a file huge page is attempted to be allocated
	but fails and instead falls back to using small pages.

thp_file_fallback_charge
	is incremented if a file huge page cannot be charged and instead
	falls back to using small pages even though the allocation was
	successful.

thp_file_mapped
	is incremented every time a file huge page is mapped into
	user address space.

thp_split_page
	is incremented every time a huge page is split into base
	pages. This can happen for a variety of reasons but a common
	reason is that a huge page is old and is being reclaimed.
	This action implies splitting all PMDs the page is mapped with.

thp_split_page_failed
	is incremented if the kernel fails to split a huge
	page. This can happen if the page was pinned by somebody.

thp_deferred_split_page
	is incremented when a huge page is put onto the split
	queue. This happens when a huge page is partially unmapped and
	splitting it would free up some memory. Pages on the split queue
	are going to be split under memory pressure.

thp_split_pmd
	is incremented every time a PMD is split into a table of PTEs.
	This can happen, for instance, when an application calls
	mprotect() or munmap() on part of a huge page. It doesn't split
	the huge page, only the page table entry.

thp_zero_page_alloc
	is incremented every time a huge zero page used for thp is
	successfully allocated. Note, it doesn't count every map of
	the huge zero page, only its allocation.

thp_zero_page_alloc_failed
	is incremented if the kernel fails to allocate the
	huge zero page and falls back to using small pages.

thp_swpout
	is incremented every time a huge page is swapped out in one
	piece without splitting.

thp_swpout_fallback
	is incremented if a huge page has to be split before swapout.
	Usually because the kernel failed to allocate some contiguous
	swap space for the huge page.

As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
monitor this overhead.

compact_stall
	is incremented every time a process stalls to run
	memory compaction so that a huge page is free for use.

compact_success
	is incremented if the system compacted memory and
	freed a huge page for use.

compact_fail
	is incremented if the system tries to compact memory
	but fails.

It is possible to establish how long the stalls were using the function
tracer to record how long was spent in __alloc_pages() and
using the mm_page_alloc tracepoint to identify which allocations were
for huge pages.

Optimizing the applications
===========================

To be guaranteed that the kernel will map a 2M page immediately in any
memory region, the mmap region has to be hugepage naturally
aligned. posix_memalign() can provide that guarantee.

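A minimal sketch of such an allocation (hard-coding the 2M hugepage
size from the examples above; a real application could read
hpage_pmd_size instead) could be::

	/* Sketch: hugepage-aligned allocation plus an explicit THP hint. */
	#define _DEFAULT_SOURCE
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t align = 2UL << 20;	/* 2M, matching the examples above */
		size_t len = 64UL << 20;	/* 64M, arbitrary example size */
		void *p = NULL;
		int err = posix_memalign(&p, align, len);

		if (err) {
			fprintf(stderr, "posix_memalign: %s\n", strerror(err));
			return 1;
		}

		/* The region is hugepage aligned, so the kernel can map 2M pages
		 * immediately; madvise() makes the request explicit. */
		if (madvise(p, len, MADV_HUGEPAGE))
			perror("madvise(MADV_HUGEPAGE)");

		/* ... use the memory ... */
		free(p);
		return 0;
	}
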
Hugetlbfs
=========

You can use hugetlbfs on a kernel that has transparent hugepage
support enabled just fine as always. No difference can be noted in
hugetlbfs other than there will be less overall fragmentation. All
usual features belonging to hugetlbfs are preserved and
unaffected. libhugetlbfs will also work fine as usual.