=============
HugeTLB Pages
=============

Overview
========

The intent of this file is to give a brief summary of hugetlbpage support in
the Linux kernel.  This support is built on top of multiple page size support
that is provided by most modern architectures.  For example, x86 CPUs normally
support 4K and 2M (1G if architecturally supported) page sizes, the ia64
architecture supports multiple page sizes (4K, 8K, 64K, 256K, 1M, 4M, 16M,
256M), and ppc64 supports 4K and 16M.  A TLB is a cache of virtual-to-physical
translations.  Typically this is a very scarce resource on a processor.
Operating systems try to make the best use of the limited number of TLB
resources.  This optimization is more critical now as bigger and bigger
physical memories (several GBs) are more readily available.

Users can use the huge page support in the Linux kernel by either using the
mmap system call or standard SYSV shared memory system calls (shmget, shmat).

First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
automatically when CONFIG_HUGETLBFS is selected) configuration
options.

The ``/proc/meminfo`` file provides information about the total number of
persistent hugetlb pages in the kernel's huge page pool.
It also displays
default huge page size and information about the number of free, reserved
and surplus huge pages in the pool of huge pages of default size.
The huge page size is needed for generating the proper alignment and
size of the arguments to system calls that map huge page regions.

The output of ``cat /proc/meminfo`` will include lines like::

        HugePages_Total: uuu
        HugePages_Free:  vvv
        HugePages_Rsvd:  www
        HugePages_Surp:  xxx
        Hugepagesize:    yyy kB
        Hugetlb:         zzz kB

where:

HugePages_Total
        is the size of the pool of huge pages.
HugePages_Free
        is the number of huge pages in the pool that are not yet
        allocated.
HugePages_Rsvd
        is short for "reserved," and is the number of huge pages for
        which a commitment to allocate from the pool has been made,
        but no allocation has yet been made.  Reserved huge pages
        guarantee that an application will be able to allocate a
        huge page from the pool of huge pages at fault time.
HugePages_Surp
        is short for "surplus," and is the number of huge pages in
        the pool above the value in ``/proc/sys/vm/nr_hugepages``. The
        maximum number of surplus huge pages is controlled by
        ``/proc/sys/vm/nr_overcommit_hugepages``.
        Note: When the feature of freeing unused vmemmap pages associated
        with each hugetlb page is enabled, the number of surplus huge pages
        may be temporarily larger than the maximum number of surplus huge
        pages when the system is under memory pressure.
Hugepagesize
        is the default hugepage size (in kB).
Hugetlb
        is the total amount of memory (in kB) consumed by huge
        pages of all sizes.
        If huge pages of different sizes are in use, this number
        will exceed HugePages_Total \* Hugepagesize. For more
        detailed information, refer to
        ``/sys/kernel/mm/hugepages`` (described below).


``/proc/filesystems`` should also show a filesystem of type "hugetlbfs"
configured in the kernel.

``/proc/sys/vm/nr_hugepages`` indicates the current number of "persistent" huge
pages in the kernel's huge page pool.  "Persistent" huge pages will be
returned to the huge page pool when freed by a task.  A user with root
privileges can dynamically allocate more or free some persistent huge pages
by increasing or decreasing the value of ``nr_hugepages``.

Note: When the feature of freeing unused vmemmap pages associated with each
hugetlb page is enabled, the kernel may fail to free huge pages as requested
by the user when the system is under memory pressure.  In that case, try
again later.
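
The ``/proc/meminfo`` fields described above can also be read from a program.
The following is a minimal sketch, not an official interface: the helper name
``parse_hugetlb_meminfo`` and the sample values are made up for illustration,
and the text is parsed from a string so the example runs on any system.

```python
import re

# Matches only the hugetlb-related lines of /proc/meminfo-style text.
_HUGETLB_LINE = re.compile(
    r"^(HugePages_\w+|Hugepagesize|Hugetlb):\s+(\d+)(?:\s+kB)?\s*$"
)


def parse_hugetlb_meminfo(text):
    """Return the hugetlb fields of /proc/meminfo-style text as a dict of ints."""
    fields = {}
    for line in text.splitlines():
        match = _HUGETLB_LINE.match(line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


# Hypothetical sample; on a real system read open("/proc/meminfo").read().
sample = """\
MemTotal:       16384000 kB
HugePages_Total:      20
HugePages_Free:       12
HugePages_Rsvd:        4
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:           40960 kB
"""

info = parse_hugetlb_meminfo(sample)
# Free pages that are not already committed to an existing reservation.
available = info["HugePages_Free"] - info["HugePages_Rsvd"]
print(info["HugePages_Total"], available)
```

Subtracting ``HugePages_Rsvd`` from ``HugePages_Free`` gives a rough estimate
of how many default-sized huge pages a new mapping could still claim.
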

Pages that are used as huge pages are reserved inside the kernel and cannot
be used for other purposes.  Huge pages cannot be swapped out under
memory pressure.

Once a number of huge pages have been pre-allocated to the kernel huge page
pool, a user with appropriate privilege can use either the mmap system call
or shared memory system calls to use the huge pages.  See the discussion of
:ref:`Using Huge Pages <using_huge_pages>`, below.

The administrator can allocate persistent huge pages on the kernel boot
command line by specifying the "hugepages=N" parameter, where 'N' = the
number of huge pages requested.  This is the most reliable method of
allocating huge pages as memory has not yet become fragmented.

Some platforms support multiple huge page sizes.  To allocate huge pages
of a specific size, one must precede the huge pages boot command parameters
with a huge page size selection parameter "hugepagesz=<size>".  <size> must
be specified in bytes with optional scale suffix [kKmMgG].  The default huge
page size may be selected with the "default_hugepagesz=<size>" boot parameter.

Hugetlb boot command line parameter semantics

hugepagesz
        Specify a huge page size.  Used in conjunction with hugepages
        parameter to preallocate a number of huge pages of the specified
        size.
        Hence, hugepagesz and hugepages are typically specified in
        pairs such as::

                hugepagesz=2M hugepages=512

        hugepagesz can only be specified once on the command line for a
        specific huge page size.  Valid huge page sizes are architecture
        dependent.
hugepages
        Specify the number of huge pages to preallocate.  This typically
        follows a valid hugepagesz or default_hugepagesz parameter.  However,
        if hugepages is the first or only hugetlb command line parameter it
        implicitly specifies the number of huge pages of default size to
        allocate.  If the number of huge pages of default size is implicitly
        specified, it can not be overwritten by a hugepagesz,hugepages
        parameter pair for the default size.  This parameter also has a
        node format.  The node format specifies the number of huge pages
        to allocate on specific nodes.

        For example, on an architecture with 2M default huge page size::

                hugepages=256 hugepagesz=2M hugepages=512

        will result in 256 2M huge pages being allocated and a warning message
        indicating that the hugepages=512 parameter is ignored.  If a hugepages
        parameter is preceded by an invalid hugepagesz parameter, it will
        be ignored.

        Node format example::

                hugepagesz=2M hugepages=0:1,1:2

        This allocates one 2M huge page on node0 and two 2M huge pages on
        node1.  If the node number is invalid, the parameter will be ignored.

default_hugepagesz
        Specify the default huge page size.  This parameter can
        only be specified once on the command line.  default_hugepagesz can
        optionally be followed by the hugepages parameter to preallocate a
        specific number of huge pages of default size.  The number of default
        sized huge pages to preallocate can also be implicitly specified as
        mentioned in the hugepages section above.  Therefore, on an
        architecture with 2M default huge page size::

                hugepages=256
                default_hugepagesz=2M hugepages=256
                hugepages=256 default_hugepagesz=2M

        will all result in 256 2M huge pages being allocated.  Valid default
        huge page size is architecture dependent.
hugetlb_free_vmemmap
        When CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is set, this enables HugeTLB
        Vmemmap Optimization (HVO).

When multiple huge page sizes are supported, ``/proc/sys/vm/nr_hugepages``
indicates the current number of pre-allocated huge pages of the default size.
Thus, one can use the following command to dynamically allocate/deallocate
default sized persistent huge pages::

        echo 20 > /proc/sys/vm/nr_hugepages

This command will try to adjust the number of default sized huge pages in the
huge page pool to 20, allocating or freeing huge pages, as required.

On a NUMA platform, the kernel will attempt to distribute the huge page pool
over the set of all allowed nodes specified by the NUMA memory policy of the
task that modifies ``nr_hugepages``. The default for the allowed nodes--when the
task has default memory policy--is all on-line nodes with memory.  Allowed
nodes with insufficient available, contiguous memory for a huge page will be
silently skipped when allocating persistent huge pages.  See the
:ref:`discussion below <mem_policy_and_hp_alloc>`
of the interaction of task memory policy, cpusets and per node attributes
with the allocation and freeing of persistent huge pages.

The success or failure of huge page allocation depends on the amount of
physically contiguous memory that is present in the system at the time of the
allocation attempt.  If the kernel is unable to allocate huge pages from
some nodes in a NUMA system, it will attempt to make up the difference by
allocating extra pages on other nodes with sufficient available contiguous
memory, if any.
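
Because the kernel only *tries* to reach the requested pool size, a script
that resizes the pool may want to read the value back.  The sketch below
illustrates that write-then-verify pattern.  The helper name ``resize_pool``
is hypothetical, and the demonstration writes to a scratch file standing in
for ``/proc/sys/vm/nr_hugepages`` so that it runs without root privileges.

```python
import os
import tempfile


def resize_pool(requested, path="/proc/sys/vm/nr_hugepages"):
    """Write the requested pool size to path and return the value read back."""
    with open(path, "w") as f:
        f.write(f"{requested}\n")
    with open(path) as f:
        return int(f.read().strip())


# Demonstration against a scratch file (an assumption made for this example).
# A plain file always returns what was written; the real sysctl may report
# fewer pages if contiguous memory is scarce.
fd, scratch = tempfile.mkstemp()
os.close(fd)
try:
    actual = resize_pool(20, path=scratch)
finally:
    os.remove(scratch)

if actual < 20:
    print(f"only {actual} of 20 huge pages could be allocated")
else:
    print("pool resized to", actual)
```

On a real system the function would be called as root with its default path,
and a returned value smaller than the request indicates a partial allocation.
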

System administrators may want to put this command in one of the local rc
init files.  This will enable the kernel to allocate huge pages early in
the boot process when the possibility of getting physically contiguous pages
is still very high.  Administrators can verify the number of huge pages
actually allocated by checking the sysctl or meminfo.  To check the per node
distribution of huge pages in a NUMA system, use::

        cat /sys/devices/system/node/node*/meminfo | fgrep Huge

``/proc/sys/vm/nr_overcommit_hugepages`` specifies how large the pool of
huge pages can grow, if more huge pages than ``/proc/sys/vm/nr_hugepages`` are
requested by applications.  Writing any non-zero value into this file
indicates that the hugetlb subsystem is allowed to try to obtain that
number of "surplus" huge pages from the kernel's normal page pool, when the
persistent huge page pool is exhausted. As these surplus huge pages become
unused, they are freed back to the kernel's normal page pool.

When increasing the huge page pool size via ``nr_hugepages``, any existing
surplus pages will first be promoted to persistent huge pages.  Then, additional
huge pages will be allocated, if necessary and if possible, to fulfill
the new persistent huge page pool size.
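
The promotion rule in the previous paragraph amounts to simple arithmetic.
The following sketch is a simplified model of that rule only, under the
assumption that every fresh allocation succeeds; the function name
``grow_pool`` is made up for illustration and does not correspond to a kernel
interface.

```python
def grow_pool(persistent, surplus, new_nr_hugepages):
    """Model a pool increase; return (promoted_surplus, newly_allocated)."""
    if new_nr_hugepages < persistent:
        raise ValueError("this sketch only models increasing the pool")
    wanted = new_nr_hugepages - persistent
    # Existing surplus pages are converted to persistent pages first.
    promoted = min(surplus, wanted)
    # Whatever is still missing must come from fresh allocations.
    newly_allocated = wanted - promoted
    return promoted, newly_allocated


# 10 persistent pages, 3 surplus pages, pool raised to 15:
print(grow_pool(persistent=10, surplus=3, new_nr_hugepages=15))  # (3, 2)
```

In this example three surplus pages become persistent and only two further
huge pages need to be allocated.
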

The administrator may shrink the pool of persistent huge pages for
the default huge page size by setting the ``nr_hugepages`` sysctl to a
smaller value.  The kernel will attempt to balance the freeing of huge pages
across all nodes in the memory policy of the task modifying ``nr_hugepages``.
Any free huge pages on the selected nodes will be freed back to the kernel's
normal page pool.

Caveat: Shrinking the persistent huge page pool via ``nr_hugepages`` such that
it becomes less than the number of huge pages in use will convert the balance
of the in-use huge pages to surplus huge pages.  This will occur even if
the number of surplus pages would exceed the overcommit value.  As long as
this condition holds--that is, until ``nr_hugepages+nr_overcommit_hugepages`` is
increased sufficiently, or the surplus huge pages go out of use and are freed--
no more surplus huge pages will be allowed to be allocated.

With support for multiple huge page pools at run-time available, much of
the huge page userspace interface in ``/proc/sys/vm`` has been duplicated in
sysfs.
The ``/proc`` interfaces discussed above have been retained for backwards
compatibility.
The root huge page control directory in sysfs is::

        /sys/kernel/mm/hugepages

For each huge page size supported by the running kernel, a subdirectory
will exist, of the form::

        hugepages-${size}kB

Inside each of these directories, the set of files contained in ``/proc``
will exist.  In addition, two additional interfaces for demoting huge
pages may exist::

        demote
        demote_size
        nr_hugepages
        nr_hugepages_mempolicy
        nr_overcommit_hugepages
        free_hugepages
        resv_hugepages
        surplus_hugepages

The demote interfaces provide the ability to split a huge page into
smaller huge pages.  For example, the x86 architecture supports both
1GB and 2MB huge page sizes.  A 1GB huge page can be split into 512
2MB huge pages.  Demote interfaces are not available for the smallest
huge page size.  The demote interfaces are:

demote_size
        is the size of demoted pages.  When a page is demoted a corresponding
        number of huge pages of demote_size will be created.  By default,
        demote_size is set to the next smaller huge page size.  If there are
        multiple smaller huge page sizes, demote_size can be set to any of
        these smaller sizes.
        Only huge page sizes less than the current huge
        page size are allowed.

demote
        is used to demote a number of huge pages.  A user with root privileges
        can write to this file.  It may not be possible to demote the
        requested number of huge pages.  To determine how many pages were
        actually demoted, compare the value of nr_hugepages before and after
        writing to the demote interface.  demote is a write only interface.

The interfaces which are the same as in ``/proc`` (all except demote and
demote_size) function as described above for the default huge page-sized case.

.. _mem_policy_and_hp_alloc:

Interaction of Task Memory Policy with Huge Page Allocation/Freeing
===================================================================

Whether huge pages are allocated and freed via the ``/proc`` interface or
the sysfs interface using the ``nr_hugepages_mempolicy`` attribute, the
NUMA nodes from which huge pages are allocated or freed are controlled by the
NUMA memory policy of the task that modifies the ``nr_hugepages_mempolicy``
sysctl or attribute.  When the ``nr_hugepages`` attribute is used, mempolicy
is ignored.

The recommended method to allocate or free huge pages to/from the kernel
huge page pool, using the ``nr_hugepages`` example above, is::

    numactl --interleave <node-list> echo 20 \
                                >/proc/sys/vm/nr_hugepages_mempolicy

or, more succinctly::

    numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy

This will allocate or free ``abs(20 - nr_hugepages)`` to or from the nodes
specified in <node-list>, depending on whether the number of persistent huge
pages is initially less than or greater than 20, respectively.  No huge pages
will be allocated nor freed on any node not included in the specified
<node-list>.

When adjusting the persistent hugepage count via ``nr_hugepages_mempolicy``, any
memory policy mode--bind, preferred, local or interleave--may be used.  The
resulting effect on persistent huge page allocation is as follows:

#. Regardless of mempolicy mode [see
   Documentation/admin-guide/mm/numa_memory_policy.rst],
   persistent huge pages will be distributed across the node or nodes
   specified in the mempolicy as if "interleave" had been specified.
   However, if a node in the policy does not contain sufficient contiguous
   memory for a huge page, the allocation will not fall back to the nearest
   neighbor node with sufficient contiguous memory.
   To do this would cause
   undesirable imbalance in the distribution of the huge page pool, or
   possibly, allocation of persistent huge pages on nodes not allowed by
   the task's memory policy.

#. One or more nodes may be specified with the bind or interleave policy.
   If more than one node is specified with the preferred policy, only the
   lowest numeric id will be used.  Local policy will select the node where
   the task is running at the time the nodes_allowed mask is constructed.
   For local policy to be deterministic, the task must be bound to a cpu or
   cpus in a single node.  Otherwise, the task could be migrated to some
   other node at any time after launch and the resulting node will be
   indeterminate.  Thus, local policy is not very useful for this purpose.
   Any of the other mempolicy modes may be used to specify a single node.

#. The nodes allowed mask will be derived from any non-default task mempolicy,
   whether this policy was set explicitly by the task itself or one of its
   ancestors, such as numactl.  This means that if the task is invoked from a
   shell with non-default policy, that policy will be used.  One can specify a
   node list of "all" with numactl --interleave or --membind [-m] to achieve
   interleaving over all nodes in the system or cpuset.

#. Any task mempolicy specified--e.g., using numactl--will be constrained by
   the resource limits of any cpuset in which the task runs.
   Thus, there will
   be no way for a task with non-default policy running in a cpuset with a
   subset of the system nodes to allocate huge pages outside the cpuset
   without first moving to a cpuset that contains all of the desired nodes.

#. Boot-time huge page allocation attempts to distribute the requested number
   of huge pages over all on-line nodes with memory.

Per Node Hugepages Attributes
=============================

A subset of the contents of the root huge page control directory in sysfs,
described above, will be replicated under the system device of each
NUMA node with memory in::

        /sys/devices/system/node/node[0-9]*/hugepages/

Under this directory, the subdirectory for each supported huge page size
contains the following attribute files::

        nr_hugepages
        free_hugepages
        surplus_hugepages

The ``free_hugepages`` and ``surplus_hugepages`` attribute files are read-only.
They return the number of free and surplus [overcommitted] huge pages,
respectively, on the parent node.

The ``nr_hugepages`` attribute returns the total number of huge pages on the
specified node.
When this attribute is written, the number of persistent huge
pages on the parent node will be adjusted to the specified value, if sufficient
resources exist, regardless of the task's mempolicy or cpuset constraints.

Note that the number of overcommit and reserve pages remain global quantities,
as we don't know until fault time, when the faulting task's mempolicy is
applied, from which node the huge page allocation will be attempted.

.. _using_huge_pages:

Using Huge Pages
================

If the user applications are going to request huge pages using the mmap system
call, then the system administrator must mount a file system of
type hugetlbfs::

  mount -t hugetlbfs \
        -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
        min_size=<value>,nr_inodes=<value> none /mnt/huge

This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
``/mnt/huge``.  Any file created on ``/mnt/huge`` uses huge pages.

The ``uid`` and ``gid`` options set the owner and group of the root of the
file system.  By default the ``uid`` and ``gid`` of the current process
are taken.

The ``mode`` option sets the mode of the root of the file system to
value & 01777.  This value is given in octal.  By default the value 0755 is
picked.
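
The masking applied to the ``mode`` option can be checked with a few lines of
arithmetic.  The following is a minimal sketch of the ``value & 01777`` rule
stated above; the helper name ``effective_root_mode`` is hypothetical and the
function does not touch any real mount.

```python
def effective_root_mode(mode_option):
    """Return the root directory mode for an octal ``mode=`` option string."""
    # The option is given in octal; bits outside 01777 are masked off.
    return int(mode_option, 8) & 0o1777


print(oct(effective_root_mode("4755")))  # 0o755  (setuid bit is dropped)
print(oct(effective_root_mode("1770")))  # 0o1770 (sticky bit is kept)
```

Only the permission bits and the sticky bit survive the mask, so setuid and
setgid bits passed in ``mode`` have no effect on the root of the file system.
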

If the platform supports multiple huge page sizes, the ``pagesize`` option can
be used to specify the huge page size and associated pool.  ``pagesize``
is specified in bytes.  If ``pagesize`` is not specified the platform's
default huge page size and associated pool will be used.

The ``size`` option sets the maximum value of memory (huge pages) allowed
for that filesystem (``/mnt/huge``).  The ``size`` option can be specified
in bytes, or as a percentage of the specified huge page pool (``nr_hugepages``).
The size is rounded down to HPAGE_SIZE boundary.

The ``min_size`` option sets the minimum value of memory (huge pages) allowed
for the filesystem.  ``min_size`` can be specified in the same way as ``size``,
either bytes or a percentage of the huge page pool.
At mount time, the number of huge pages specified by ``min_size`` are reserved
for use by the filesystem.
If there are not enough free huge pages available, the mount will fail.
As huge pages are allocated to the filesystem and freed, the reserve count
is adjusted so that the sum of allocated and reserved huge pages is always
at least ``min_size``.

The option ``nr_inodes`` sets the maximum number of inodes that ``/mnt/huge``
can use.

If the ``size``, ``min_size`` or ``nr_inodes`` option is not provided on
command line then no limits are set.
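
To see how a ``size`` or ``min_size`` value translates into a number of huge
pages, the rounding described above can be modelled as integer division.  This
is an illustrative sketch only: both function names are hypothetical, and the
assumption that a percentage is also rounded down to whole pages is not stated
in this document.

```python
def bytes_to_hpages(size_bytes, hpage_size):
    """Round a byte count down to a whole number of huge pages."""
    return size_bytes // hpage_size


def percent_to_hpages(percent, pool_pages):
    """Huge pages for a percentage of the pool (assumed to round down)."""
    return pool_pages * percent // 100


HPAGE_SIZE = 2 * 1024 * 1024  # example: a 2M huge page size

# 5 MiB is not a multiple of 2 MiB, so only 2 huge pages are usable.
print(bytes_to_hpages(5 * 1024 * 1024, HPAGE_SIZE))  # 2
# Half of a 21-page pool under the round-down assumption.
print(percent_to_hpages(50, 21))  # 10
```

A limit that is not a multiple of the huge page size therefore grants less
memory than the number written on the command line suggests.
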

For ``pagesize``, ``size``, ``min_size`` and ``nr_inodes`` options, you can
use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo.
For example, size=2K has the same meaning as size=2048.

While read system calls are supported on files that reside on hugetlb
file systems, write system calls are not.

Regular chown, chgrp, and chmod commands (with the right permissions) can be
used to change the file attributes on hugetlbfs.

Also, it is important to note that no such mount command is required if
applications are going to use only shmat/shmget system calls or mmap with
MAP_HUGETLB.  For an example of how to use mmap with MAP_HUGETLB see
:ref:`map_hugetlb <map_hugetlb>` below.

Users who wish to use hugetlb memory via a shared memory segment should be
members of a supplementary group, and the system administrator needs to
configure that gid into ``/proc/sys/vm/hugetlb_shm_group``.  It is possible for
the same or different applications to use any combination of mmaps and shm*
calls, though the mount of the filesystem will be required for using mmap calls
without MAP_HUGETLB.

Syscalls that operate on memory backed by hugetlb pages only have their lengths
aligned to the native page size of the processor; they will normally fail with
errno set to EINVAL or exclude hugetlb pages that extend beyond the length if
not hugepage aligned.
For example, munmap(2) will fail if memory is backed by
a hugetlb page and the length is smaller than the hugepage size.


Examples
========

.. _map_hugetlb:

``map_hugetlb``
        see tools/testing/selftests/mm/map_hugetlb.c

``hugepage-shm``
        see tools/testing/selftests/mm/hugepage-shm.c

``hugepage-mmap``
        see tools/testing/selftests/mm/hugepage-mmap.c

The `libhugetlbfs`_ library provides a wide range of userspace tools
to help with huge page usability, environment setup, and control.

.. _libhugetlbfs: https://github.com/libhugetlbfs/libhugetlbfs
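
Because syscalls on hugetlb-backed memory expect huge page aligned lengths,
callers usually round their sizes before mapping or unmapping.  The following
is a small sketch of that rounding, separate from the selftests listed above;
the helper names ``align_up`` and ``is_hpage_aligned`` are made up for
illustration, and the huge page size would normally be taken from the
``Hugepagesize`` line of ``/proc/meminfo``.

```python
def align_up(length, hpage_size):
    """Round length up to the next multiple of the huge page size."""
    return (length + hpage_size - 1) // hpage_size * hpage_size


def is_hpage_aligned(value, hpage_size):
    """Return True if value is a multiple of the huge page size."""
    return value % hpage_size == 0


HPAGE_SIZE = 2 * 1024 * 1024  # example: 2M huge pages

request = 3 * 1024 * 1024  # the application needs 3 MiB
mapped = align_up(request, HPAGE_SIZE)
print(mapped, is_hpage_aligned(mapped, HPAGE_SIZE))  # 4194304 True
```

Passing the rounded length to both the mapping and the unmapping call avoids
the EINVAL failures described in the section on using huge pages.
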