=============
HugeTLB Pages
=============

Overview
========

The intent of this file is to give a brief summary of hugetlbpage support in
the Linux kernel.  This support is built on top of multiple page size support
that is provided by most modern architectures.  For example, x86 CPUs normally
support 4K and 2M (1G if architecturally supported) page sizes, the ia64
architecture supports multiple page sizes 4K, 8K, 64K, 256K, 1M, 4M, 16M,
256M and ppc64 supports 4K and 16M.  A TLB is a cache of virtual-to-physical
translations.  Typically this is a very scarce resource on a processor.
Operating systems try to make the best use of the limited number of TLB
resources.  This optimization is more critical now as bigger and bigger
physical memories (several GBs) are more readily available.

Users can use the huge page support in the Linux kernel by either using the
mmap system call or standard SYSV shared memory system calls (shmget, shmat).

First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
automatically when CONFIG_HUGETLBFS is selected) configuration
options.

The ``/proc/meminfo`` file provides information about the total number of
persistent hugetlb pages in the kernel's huge page pool.  It also displays
the default huge page size and information about the number of free, reserved
and surplus huge pages in the pool of huge pages of default size.
The huge page size is needed for generating the proper alignment and
size of the arguments to system calls that map huge page regions.

The output of ``cat /proc/meminfo`` will include lines like::

	HugePages_Total: uuu
	HugePages_Free:  vvv
	HugePages_Rsvd:  www
	HugePages_Surp:  xxx
	Hugepagesize:    yyy kB
	Hugetlb:         zzz kB

where:

HugePages_Total
	is the size of the pool of huge pages.
HugePages_Free
	is the number of huge pages in the pool that are not yet
	allocated.
HugePages_Rsvd
	is short for "reserved," and is the number of huge pages for
	which a commitment to allocate from the pool has been made,
	but no allocation has yet been made.  Reserved huge pages
	guarantee that an application will be able to allocate a
	huge page from the pool of huge pages at fault time.
HugePages_Surp
	is short for "surplus," and is the number of huge pages in
	the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
	maximum number of surplus huge pages is controlled by
	``/proc/sys/vm/nr_overcommit_hugepages``.
	Note: When the feature of freeing unused vmemmap pages associated
	with each hugetlb page is enabled, the number of surplus huge pages
	may be temporarily larger than the maximum number of surplus huge
	pages when the system is under memory pressure.
Hugepagesize
	is the default hugepage size (in kB).
Hugetlb
	is the total amount of memory (in kB) consumed by huge
	pages of all sizes.
	If huge pages of different sizes are in use, this number
	will exceed HugePages_Total \* Hugepagesize. For more
	detailed information, refer to
	``/sys/kernel/mm/hugepages`` (described below).

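The fields above are easy to pick out of ``/proc/meminfo`` programmatically.
The following is a minimal sketch, not part of the kernel tree: the function
name ``parse_hugetlb_meminfo`` and the sample values are made up for
illustration, and the sample text only mimics the layout shown above.

```python
def parse_hugetlb_meminfo(text):
    """Extract the hugetlb-related fields from /proc/meminfo-style text."""
    wanted = ("HugePages_Total", "HugePages_Free", "HugePages_Rsvd",
              "HugePages_Surp", "Hugepagesize", "Hugetlb")
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() in wanted:
            # The first token is the number; a trailing "kB" unit is dropped.
            fields[name.strip()] = int(rest.split()[0])
    return fields


# Hypothetical sample; real values come from reading /proc/meminfo.
SAMPLE = """\
MemTotal:       16384000 kB
HugePages_Total:      20
HugePages_Free:       18
HugePages_Rsvd:        1
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:           40960 kB
"""

info = parse_hugetlb_meminfo(SAMPLE)
# Free pages that have not already been promised to an existing mapping.
available = info["HugePages_Free"] - info["HugePages_Rsvd"]
```

On a live system the same function can be fed the contents of
``/proc/meminfo``; the counts are in pages while ``Hugepagesize`` and
``Hugetlb`` are in kB.
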
``/proc/filesystems`` should also show a filesystem of type "hugetlbfs"
configured in the kernel.

``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge
pages in the kernel's huge page pool.  "Persistent" huge pages will be
returned to the huge page pool when freed by a task.  A user with root
privileges can dynamically allocate more or free some persistent huge pages
by increasing or decreasing the value of ``nr_hugepages``.

Note: When the feature of freeing unused vmemmap pages associated with each
hugetlb page is enabled, a request by the user to free huge pages can fail
when the system is under memory pressure.  In that case, try again later.

Pages that are used as huge pages are reserved inside the kernel and cannot
be used for other purposes.  Huge pages cannot be swapped out under
memory pressure.

Once a number of huge pages have been pre-allocated to the kernel huge page
pool, a user with appropriate privilege can use either the mmap system call
or shared memory system calls to use the huge pages.  See the discussion of
:ref:`Using Huge Pages <using_huge_pages>`, below.

The administrator can allocate persistent huge pages on the kernel boot
command line by specifying the "hugepages=N" parameter, where 'N' = the
number of huge pages requested.  This is the most reliable method of
allocating huge pages as memory has not yet become fragmented.

Some platforms support multiple huge page sizes.  To allocate huge pages
of a specific size, one must precede the huge pages boot command parameters
with a huge page size selection parameter "hugepagesz=<size>".  <size> must
be specified in bytes with optional scale suffix [kKmMgG].  The default huge
page size may be selected with the "default_hugepagesz=<size>" boot parameter.

Hugetlb boot command line parameter semantics

hugepagesz
	Specify a huge page size.  Used in conjunction with hugepages
	parameter to preallocate a number of huge pages of the specified
	size.  Hence, hugepagesz and hugepages are typically specified in
	pairs such as::

		hugepagesz=2M hugepages=512

	hugepagesz can only be specified once on the command line for a
	specific huge page size.  Valid huge page sizes are architecture
	dependent.
hugepages
	Specify the number of huge pages to preallocate.  This typically
	follows a valid hugepagesz or default_hugepagesz parameter.  However,
	if hugepages is the first or only hugetlb command line parameter it
	implicitly specifies the number of huge pages of default size to
	allocate.  If the number of huge pages of default size is implicitly
	specified, it can not be overwritten by a hugepagesz,hugepages
	parameter pair for the default size.  This parameter also has a
	node format.  The node format specifies the number of huge pages
	to allocate on specific nodes.

	For example, on an architecture with 2M default huge page size::

		hugepages=256 hugepagesz=2M hugepages=512

	will result in 256 2M huge pages being allocated and a warning message
	indicating that the hugepages=512 parameter is ignored.  If a hugepages
	parameter is preceded by an invalid hugepagesz parameter, it will
	be ignored.

	Node format example::

		hugepagesz=2M hugepages=0:1,1:2

	This allocates one 2M huge page on node 0 and two 2M huge pages on
	node 1.  If the node number is invalid, the parameter will be ignored.

default_hugepagesz
	Specify the default huge page size.  This parameter can
	only be specified once on the command line.  default_hugepagesz can
	optionally be followed by the hugepages parameter to preallocate a
	specific number of huge pages of default size.  The number of default
	sized huge pages to preallocate can also be implicitly specified as
	mentioned in the hugepages section above.  Therefore, on an
	architecture with 2M default huge page size::

		hugepages=256
		default_hugepagesz=2M hugepages=256
		hugepages=256 default_hugepagesz=2M

	will all result in 256 2M huge pages being allocated.  Valid default
	huge page sizes are architecture dependent.
hugetlb_free_vmemmap
	When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables HugeTLB
	Vmemmap Optimization (HVO).

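The two forms of the ``hugepages`` value described above (a plain count, or
the node format) can be told apart with a few lines of code.  This is only an
illustrative sketch: ``parse_hugepages_value`` is a hypothetical helper, and
unlike the kernel it does not check that the node numbers exist.

```python
def parse_hugepages_value(value):
    """Parse the value of a hugepages= boot parameter.

    Returns an int for the plain form ("512"), or a dict mapping
    node id -> page count for the node format ("0:1,1:2").
    """
    if ":" not in value:
        return int(value)
    per_node = {}
    for item in value.split(","):
        node, count = item.split(":")
        per_node[int(node)] = int(count)
    return per_node
```

For the node format example above, ``parse_hugepages_value("0:1,1:2")``
yields one page for node 0 and two pages for node 1.
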
When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
indicates the current number of pre-allocated huge pages of the default size.
Thus, one can use the following command to dynamically allocate/deallocate
default sized persistent huge pages::

	echo 20 > /proc/sys/vm/nr_hugepages

This command will try to adjust the number of default sized huge pages in the
huge page pool to 20, allocating or freeing huge pages, as required.

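Because the kernel only tries to reach the requested pool size, it is worth
reading the value back after writing it.  A minimal sketch, assuming the
caller has root privileges; ``resize_pool`` is a hypothetical helper name and
the ``path`` argument exists only so the function can be pointed at another
file:

```python
def resize_pool(count, path="/proc/sys/vm/nr_hugepages"):
    """Request `count` persistent huge pages and return the resulting value."""
    with open(path, "w") as f:
        f.write(f"{count}\n")
    # Read the file back: the pool may be smaller than requested.
    with open(path) as f:
        return int(f.read().split()[0])
```

If the returned number is lower than ``count``, the kernel could not find
enough physically contiguous memory for the remaining pages.
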
On a NUMA platform, the kernel will attempt to distribute the huge page pool
over the set of allowed nodes specified by the NUMA memory policy of the
task that modifies ``nr_hugepages``. The default for the allowed nodes--when the
task has default memory policy--is all on-line nodes with memory.  Allowed
nodes with insufficient available, contiguous memory for a huge page will be
silently skipped when allocating persistent huge pages.  See the
:ref:`discussion below <mem_policy_and_hp_alloc>`
of the interaction of task memory policy, cpusets and per node attributes
with the allocation and freeing of persistent huge pages.

The success or failure of huge page allocation depends on the amount of
physically contiguous memory that is present in the system at the time of the
allocation attempt.  If the kernel is unable to allocate huge pages from
some nodes in a NUMA system, it will attempt to make up the difference by
allocating extra pages on other nodes with sufficient available contiguous
memory, if any.

System administrators may want to put this command in one of the local rc
init files.  This will enable the kernel to allocate huge pages early in
the boot process when the possibility of getting physically contiguous pages
is still very high.  Administrators can verify the number of huge pages
actually allocated by checking the sysctl or meminfo.  To check the per node
distribution of huge pages in a NUMA system, use::

	cat /sys/devices/system/node/node*/meminfo | fgrep Huge

``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of
huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are
requested by applications.  Writing any non-zero value into this file
indicates that the hugetlb subsystem is allowed to try to obtain that
number of "surplus" huge pages from the kernel's normal page pool, when the
persistent huge page pool is exhausted. As these surplus huge pages become
unused, they are freed back to the kernel's normal page pool.

When increasing the huge page pool size via ``nr_hugepages``, any existing
surplus pages will first be promoted to persistent huge pages.  Then, additional
huge pages will be allocated, if necessary and if possible, to fulfill
the new persistent huge page pool size.

The administrator may shrink the pool of persistent huge pages for
the default huge page size by setting the ``nr_hugepages`` sysctl to a
smaller value.  The kernel will attempt to balance the freeing of huge pages
across all nodes in the memory policy of the task modifying ``nr_hugepages``.
Any free huge pages on the selected nodes will be freed back to the kernel's
normal page pool.

Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that
it becomes less than the number of huge pages in use will convert the balance
of the in-use huge pages to surplus huge pages.  This will occur even if
the number of surplus pages would exceed the overcommit value.  As long as
this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is
increased sufficiently, or the surplus huge pages go out of use and are freed--
no more surplus huge pages will be allowed to be allocated.

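The arithmetic behind the caveat can be written down as a small model.  This
is a deliberately simplified sketch, not the kernel's accounting: it ignores
free pages and per-node bookkeeping, and ``shrink_pool`` is a hypothetical
name.

```python
def shrink_pool(in_use, new_nr_hugepages, nr_overcommit):
    """Model the effect of lowering nr_hugepages below the in-use count.

    Returns (surplus, more_surplus_allowed).
    """
    # In-use pages beyond the new persistent pool size become surplus pages.
    surplus = max(0, in_use - new_nr_hugepages)
    # Further surplus pages are only allowed while below the overcommit limit.
    return surplus, surplus < nr_overcommit
```

For example, with 10 huge pages in use, shrinking the pool to 4 with an
overcommit value of 2 leaves 6 surplus pages, which is above the limit, so no
further surplus pages can be obtained until some of them are freed.
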
With support for multiple huge page pools at run-time available, much of
the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in
sysfs.
The ``/proc`` interfaces discussed above have been retained for backwards
compatibility. The root huge page control directory in sysfs is::

	/sys/kernel/mm/hugepages

For each huge page size supported by the running kernel, a subdirectory
will exist, of the form::

	hugepages-${size}kB

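The directory naming scheme makes it straightforward to discover which huge
page sizes the running kernel supports.  A minimal sketch; both helper names
are hypothetical and the ``root`` argument is there only so the code can be
pointed at another directory:

```python
import os
import re


def hugepage_dir_size_kb(name):
    """Return the size in kB encoded in a 'hugepages-<size>kB' name, or None."""
    m = re.fullmatch(r"hugepages-(\d+)kB", name)
    return int(m.group(1)) if m else None


def supported_sizes_kb(root="/sys/kernel/mm/hugepages"):
    """List the huge page sizes (in kB) that have a control directory."""
    sizes = []
    for name in os.listdir(root):
        size = hugepage_dir_size_kb(name)
        if size is not None:
            sizes.append(size)
    return sorted(sizes)
```

On an x86 system with 2MB and 1GB huge pages one would expect directories
named ``hugepages-2048kB`` and ``hugepages-1048576kB``.
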
Inside each of these directories, the set of files contained in ``/proc``
will exist.  In addition, two interfaces for demoting huge pages may
exist::

	demote
	demote_size
	nr_hugepages
	nr_hugepages_mempolicy
	nr_overcommit_hugepages
	free_hugepages
	resv_hugepages
	surplus_hugepages

The demote interfaces provide the ability to split a huge page into
smaller huge pages.  For example, the x86 architecture supports both
1GB and 2MB huge page sizes.  A 1GB huge page can be split into 512
2MB huge pages.  Demote interfaces are not available for the smallest
huge page size.  The demote interfaces are:

demote_size
	is the size of demoted pages.  When a page is demoted a corresponding
	number of huge pages of demote_size will be created.  By default,
	demote_size is set to the next smaller huge page size.  If there are
	multiple smaller huge page sizes, demote_size can be set to any of
	these smaller sizes.  Only huge page sizes less than the current huge
	page size are allowed.

demote
	is used to demote a number of huge pages.  A user with root privileges
	can write to this file.  It may not be possible to demote the
	requested number of huge pages.  To determine how many pages were
	actually demoted, compare the value of nr_hugepages before and after
	writing to the demote interface.  demote is a write only interface.

The interfaces which are the same as in ``/proc`` (all except demote and
demote_size) function as described above for the default huge page-sized case.

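The number of smaller pages produced by a demotion follows directly from the
two sizes.  A small worked sketch, with ``pages_after_demote`` as a
hypothetical helper that takes sizes in kB as they appear in the sysfs
directory names:

```python
def pages_after_demote(page_kb, demote_kb, count=1):
    """Number of demote_kb-sized pages created by demoting `count` pages."""
    if demote_kb >= page_kb or page_kb % demote_kb:
        raise ValueError("demote size must be a smaller size that divides "
                         "the page size evenly")
    return count * (page_kb // demote_kb)
```

With ``page_kb=1048576`` (1GB) and ``demote_kb=2048`` (2MB) this gives the 512
pages mentioned above for each demoted page.
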
.. _mem_policy_and_hp_alloc:

Interaction of Task Memory Policy with Huge Page Allocation/Freeing
===================================================================

Whether huge pages are allocated and freed via the ``/proc`` interface or
the ``sysfs`` interface using the ``nr_hugepages_mempolicy`` attribute, the
NUMA nodes from which huge pages are allocated or freed are controlled by the
NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy``
sysctl or attribute.  When the ``nr_hugepages`` attribute is used, mempolicy
is ignored.

The recommended method to allocate or free huge pages to/from the kernel
huge page pool, using the ``nr_hugepages`` example above, is::

    numactl --interleave <node-list> echo 20 \
				>/proc/sys/vm/nr_hugepages_mempolicy

or, more succinctly::

    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy

This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes
specified in <node-list>, depending on whether the number of persistent huge
pages is initially less than or greater than 20, respectively.  No huge pages
will be allocated or freed on any node not included in the specified
<node-list>.

When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any
memory policy mode--bind, preferred, local or interleave--may be used.  The
resulting effect on persistent huge page allocation is as follows:

#. Regardless of mempolicy mode [see
   Documentation/admin-guide/mm/numa_memory_policy.rst],
   persistent huge pages will be distributed across the node or nodes
   specified in the mempolicy as if "interleave" had been specified.
   However, if a node in the policy does not contain sufficient contiguous
   memory for a huge page, the allocation will not "fallback" to the nearest
   neighbor node with sufficient contiguous memory.  To do this would cause
   undesirable imbalance in the distribution of the huge page pool, or
   possibly, allocation of persistent huge pages on nodes not allowed by
   the task's memory policy.

#. One or more nodes may be specified with the bind or interleave policy.
   If more than one node is specified with the preferred policy, only the
   lowest numeric id will be used.  Local policy will select the node where
   the task is running at the time the nodes_allowed mask is constructed.
   For local policy to be deterministic, the task must be bound to a cpu or
   cpus in a single node.  Otherwise, the task could be migrated to some
   other node at any time after launch and the resulting node will be
   indeterminate.  Thus, local policy is not very useful for this purpose.
   Any of the other mempolicy modes may be used to specify a single node.

#. The nodes allowed mask will be derived from any non-default task mempolicy,
   whether this policy was set explicitly by the task itself or one of its
   ancestors, such as numactl.  This means that if the task is invoked from a
   shell with non-default policy, that policy will be used.  One can specify a
   node list of "all" with numactl --interleave or --membind [-m] to achieve
   interleaving over all nodes in the system or cpuset.

#. Any task mempolicy specified--e.g., using numactl--will be constrained by
   the resource limits of any cpuset in which the task runs.  Thus, there will
   be no way for a task with non-default policy running in a cpuset with a
   subset of the system nodes to allocate huge pages outside the cpuset
   without first moving to a cpuset that contains all of the desired nodes.

#. Boot-time huge page allocation attempts to distribute the requested number
   of huge pages over all on-line nodes with memory.

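The interleave-like distribution described in the first item can be pictured
with a toy model.  This is a sketch of the behaviour as documented here, not
of the kernel's allocator: ``distribute`` is a hypothetical function, and
``free_slots`` stands in for "how many huge pages this node could still
supply", which a real system does not expose in this form.

```python
def distribute(count, allowed_nodes, free_slots):
    """Spread `count` huge pages round-robin over `allowed_nodes`.

    Nodes that cannot supply a page are skipped, and pages are never
    placed on a node outside `allowed_nodes`.
    """
    order = sorted(allowed_nodes)
    remaining = {n: free_slots.get(n, 0) for n in order}
    placed = {n: 0 for n in order}
    i = 0
    while count > 0 and any(remaining[n] > 0 for n in order):
        node = order[i % len(order)]
        i += 1
        if remaining[node] == 0:
            continue  # skipped silently; no fallback outside the mask
        remaining[node] -= 1
        placed[node] += 1
        count -= 1
    return placed
```

If node 0 can only supply one page, a request for four pages over nodes 0 and
1 ends up as one page on node 0 and three on node 1; a third node that is not
in the allowed set is never used, so the request may be only partly
satisfied.
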
Per Node Hugepages Attributes
=============================

A subset of the contents of the root huge page control directory in sysfs,
described above, will be replicated under the system device of each
NUMA node with memory in::

	/sys/devices/system/node/node[0-9]*/hugepages/

Under this directory, the subdirectory for each supported huge page size
contains the following attribute files::

	nr_hugepages
	free_hugepages
	surplus_hugepages

The ``free_hugepages`` and ``surplus_hugepages`` attribute files are
read-only.  They return the number of free and surplus [overcommitted] huge
pages, respectively, on the parent node.

The ``nr_hugepages`` attribute returns the total number of huge pages on the
specified node.  When this attribute is written, the number of persistent huge
pages on the parent node will be adjusted to the specified value, if sufficient
resources exist, regardless of the task's mempolicy or cpuset constraints.

Note that the numbers of overcommit and reserve pages remain global quantities,
as we don't know until fault time, when the faulting task's mempolicy is
applied, from which node the huge page allocation will be attempted.

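Reading a per node attribute only requires building the right path.  A
minimal sketch, assuming the ``hugepages-${size}kB`` subdirectory naming
described earlier; ``node_attr_path`` and ``read_node_attr`` are hypothetical
helpers and ``root`` is a parameter only so another directory can be
substituted:

```python
import os


def node_attr_path(node, size_kb, attr, root="/sys/devices/system/node"):
    """Build the path of a per node huge page attribute file."""
    return os.path.join(root, f"node{node}", "hugepages",
                        f"hugepages-{size_kb}kB", attr)


def read_node_attr(node, size_kb, attr, root="/sys/devices/system/node"):
    """Return the integer value stored in a per node attribute file."""
    with open(node_attr_path(node, size_kb, attr, root)) as f:
        return int(f.read())
```

For instance, ``read_node_attr(0, 2048, "free_hugepages")`` would report the
free 2MB huge pages on node 0, provided that node and page size exist.
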
.. _using_huge_pages:

Using Huge Pages
================

If the user applications are going to request huge pages using the mmap system
call, then it is required that the system administrator mount a file system of
type hugetlbfs::

  mount -t hugetlbfs \
	-o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
	min_size=<value>,nr_inodes=<value> none /mnt/huge

This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
``/mnt/huge``.  Any file created on ``/mnt/huge`` uses huge pages.

The ``uid`` and ``gid`` options set the owner and group of the root of the
file system.  By default the ``uid`` and ``gid`` of the current process
are taken.

The ``mode`` option sets the mode of the root of the file system to
value & 01777.  This value is given in octal. By default the value 0755 is
picked.

If the platform supports multiple huge page sizes, the ``pagesize`` option can
be used to specify the huge page size and associated pool. ``pagesize``
is specified in bytes. If ``pagesize`` is not specified the platform's
default huge page size and associated pool will be used.

The ``size`` option sets the maximum amount of memory (huge pages) allowed
for that filesystem (``/mnt/huge``). The ``size`` option can be specified
in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``).
The size is rounded down to an HPAGE_SIZE boundary.

The ``min_size`` option sets the minimum amount of memory (huge pages) allowed
for the filesystem. ``min_size`` can be specified in the same way as ``size``,
either bytes or a percentage of the huge page pool.
At mount time, the number of huge pages specified by ``min_size`` is reserved
for use by the filesystem.
If there are not enough free huge pages available, the mount will fail.
As huge pages are allocated to the filesystem and freed, the reserve count
is adjusted so that the sum of allocated and reserved huge pages is always
at least ``min_size``.

The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge``
can use.

If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on
the command line then no limits are set.

For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can
use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
For example, size=2K has the same meaning as size=2048.

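The suffix rule can be illustrated with a short conversion routine.  This is
a sketch of the rule as stated above, not the kernel's parser; ``parse_size``
is a hypothetical name, and the percentage form accepted by ``size`` and
``min_size`` is not handled.

```python
_SCALE = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}


def parse_size(value):
    """Convert a value such as '2K', '512m' or '2048' to a plain number."""
    value = value.strip()
    suffix = value[-1:].lower()
    if suffix in _SCALE:
        return int(value[:-1]) * _SCALE[suffix]
    return int(value)
```

With this helper, ``parse_size("2K")`` and ``parse_size("2048")`` give the
same result, matching the example above.
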
While read system calls are supported on files that reside on hugetlb
file systems, write system calls are not.

Regular chown, chgrp, and chmod commands (with the right permissions) can be
used to change the file attributes on hugetlbfs.

Also, it is important to note that no such mount command is required if
applications are going to use only shmat/shmget system calls or mmap with
MAP_HUGETLB.  For an example of how to use mmap with MAP_HUGETLB see
:ref:`map_hugetlb <map_hugetlb>` below.

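For a quick experiment without a hugetlbfs mount, an anonymous mapping with
MAP_HUGETLB can also be requested from a scripting language.  The sketch
below uses Python's standard ``mmap`` module; ``map_anonymous_huge`` is a
hypothetical helper, and the fallback flag value ``0x40000`` is an assumption
that holds for x86 and other architectures using the generic definition but
not for every architecture.

```python
import mmap

# Assumption: 0x40000 is the MAP_HUGETLB value on x86 and on architectures
# that use the generic mman definitions; some architectures differ.
MAP_HUGETLB = getattr(mmap, "MAP_HUGETLB", 0x40000)


def map_anonymous_huge(length):
    """Try to create an anonymous huge page mapping of `length` bytes.

    Returns the mmap object, or None if the kernel refuses the request
    (for example because no huge pages are available in the pool).
    """
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS | MAP_HUGETLB
    try:
        return mmap.mmap(-1, length, flags=flags)
    except (OSError, ValueError):
        return None
```

``length`` should be a multiple of the huge page size reported in
``/proc/meminfo``.  The selftest referenced above remains the authoritative
example.
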
Users who wish to use hugetlb memory via a shared memory segment should be
members of a supplementary group, and the system administrator needs to
configure that gid into ``/proc/sys/vm/hugetlb_shm_group``.  It is possible
for the same or different applications to use any combination of mmaps and
shm* calls, though the mount of the filesystem will be required for using mmap
calls without MAP_HUGETLB.

Syscalls that operate on memory backed by hugetlb pages only have their lengths
aligned to the native page size of the processor; they will normally fail with
errno set to EINVAL or exclude hugetlb pages that extend beyond the length if
not hugepage aligned.  For example, munmap(2) will fail if memory is backed by
a hugetlb page and the length is smaller than the hugepage size.

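To avoid such failures, applications usually round lengths up to a multiple
of the huge page size before calling into the kernel.  A minimal sketch, with
``align_up`` as a hypothetical helper and both arguments given in bytes:

```python
def align_up(length, hugepage_size):
    """Round `length` up to the next multiple of `hugepage_size`."""
    return -(-length // hugepage_size) * hugepage_size
```

With a 2MB huge page size, a 3MB request is rounded up to 4MB, while a length
that is already a multiple of the huge page size is returned unchanged.
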
Examples
========

.. _map_hugetlb:

``map_hugetlb``
	see tools/testing/selftests/mm/map_hugetlb.c

``hugepage-shm``
	see tools/testing/selftests/mm/hugepage-shm.c

``hugepage-mmap``
	see tools/testing/selftests/mm/hugepage-mmap.c

The `libhugetlbfs`_  library provides a wide range of userspace tools
to help with huge page usability, environment setup, and control.

.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs