==================
NUMA Memory Policy
==================

What is NUMA Memory Policy?
============================

In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system.  Linux has
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
The current memory policy support was added to Linux 2.6 around May 2004.  This
document attempts to describe the concepts and APIs of the 2.6 memory policy
support.

Memory policies should not be confused with cpusets
(``Documentation/admin-guide/cgroup-v1/cpusets.rst``),
which are an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes.  Memory policies are a
programming interface that a NUMA-aware application can take advantage of.  When
both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority.  See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
below for more details.

Memory Policy Concepts
======================

Scope of Memory Policies
------------------------

The Linux kernel supports *scopes* of memory policy, described here from
most general to most specific:

System Default Policy
    this policy is "hard coded" into the kernel.  It is the policy
    that governs all page allocations that aren't controlled by
    one of the more specific policy scopes discussed below.  When
    the system is "up and running", the system default policy will
    use "local allocation" described below.  However, during boot
    up, the system default policy will be set to interleave
    allocations across all nodes with "sufficient" memory, so as
    not to overload the initial boot node with boot-time
    allocations.

Task/Process Policy
    this is an optional, per-task policy.  When defined for a
    specific task, this policy controls all page allocations made
    by or on behalf of the task that aren't controlled by a more
    specific scope.  If a task does not define a task policy, then
    all page allocations that would have been controlled by the
    task policy "fall back" to the System Default Policy.

    The task policy applies to the entire address space of a task.
    Thus, it is inheritable, and indeed is inherited, across both fork()
    [clone() w/o the CLONE_VM flag] and exec*().  This allows a parent task
    to establish the task policy for a child task exec()'d from an
    executable image that has no awareness of memory policy.  See the
    :ref:`Memory Policy APIs <memory_policy_apis>` section,
    below, for an overview of the system call
    that a task may use to set/change its task/process policy.

    In a multi-threaded task, task policies apply only to the thread
    [Linux kernel task] that installs the policy and any threads
    subsequently created by that thread.  Any sibling threads existing
    at the time a new task policy is installed retain their current
    policy.

    A task policy applies only to pages allocated after the policy is
    installed.  Any pages already faulted in by the task when the task
    changes its task policy remain where they were allocated based on
    the policy at the time they were allocated.

.. _vma_policy:

VMA Policy
    A "VMA" or "Virtual Memory Area" refers to a range of a task's
    virtual address space.  A task may define a specific policy for a range
    of its virtual address space.  See the
    :ref:`Memory Policy APIs <memory_policy_apis>` section,
    below, for an overview of the mbind() system call used to set a VMA
    policy.

    A VMA policy will govern the allocation of pages that back
    this region of the address space.  Any regions of the task's
    address space that don't have an explicit VMA policy will fall
    back to the task policy, which may itself fall back to the
    System Default Policy.

    VMA policies have a few complicating details:

    * VMA policy applies ONLY to anonymous pages.  These include
      pages allocated for anonymous segments, such as the task
      stack and heap, and any regions of the address space
      mmap()ed with the MAP_ANONYMOUS flag.  If a VMA policy is
      applied to a file mapping, it will be ignored if the mapping
      used the MAP_SHARED flag.  If the file mapping used the
      MAP_PRIVATE flag, the VMA policy will only be applied when
      an anonymous page is allocated on an attempt to write to the
      mapping--i.e., at Copy-On-Write.

    * VMA policies are shared between all tasks that share a
      virtual address space--a.k.a. threads--independent of when
      the policy is installed; and they are inherited across
      fork().  However, because VMA policies refer to a specific
      region of a task's address space, and because the address
      space is discarded and recreated on exec*(), VMA policies
      are NOT inheritable across exec().  Thus, only NUMA-aware
      applications may use VMA policies.

    * A task may install a new VMA policy on a sub-range of a
      previously mmap()ed region.  When this happens, Linux splits
      the existing virtual memory area into 2 or 3 VMAs, each with
      its own policy.

    * By default, VMA policy applies only to pages allocated after
      the policy is installed.  Any pages already faulted into the
      VMA range remain where they were allocated based on the
      policy at the time they were allocated.  However, since
      2.6.16, Linux supports page migration via the mbind() system
      call, so that page contents can be moved to match a newly
      installed policy.

Shared Policy
    Conceptually, shared policies apply to "memory objects" mapped
    shared into one or more tasks' distinct address spaces.  An
    application installs shared policies the same way as VMA
    policies--using the mbind() system call specifying a range of
    virtual addresses that map the shared object.  However, unlike
    VMA policies, which can be considered to be an attribute of a
    range of a task's address space, shared policies apply
    directly to the shared object.  Thus, all tasks that attach to
    the object share the policy, and all pages allocated for the
    shared object, by any task, will obey the shared policy.

    As of 2.6.22, only shared memory segments, created by shmget() or
    mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
    policy support was added to Linux, the associated data structures were
    added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
    support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
    shmem segments were never "hooked up" to the shared policy support.
    Although hugetlbfs segments now support lazy allocation, their support
    for shared policy has not been completed.

    As mentioned above in the :ref:`VMA policies <vma_policy>` section,
    allocations of page cache pages for regular files mmap()ed
    with MAP_SHARED ignore any VMA policy installed on the virtual
    address range backed by the shared file mapping.  Rather,
    shared page cache pages, including pages backing private
    mappings that have not yet been written by the task, follow
    task policy, if any, else System Default Policy.

    The shared policy infrastructure supports different policies on subset
    ranges of the shared object.  However, Linux still splits the VMA of
    the task that installs the policy for each range of distinct policy.
    Thus, different tasks that attach to a shared memory segment can have
    different VMA configurations mapping that one shared object.
    This can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
    a shared memory region, when one task has installed shared policy on
    one or more ranges of the region.

Components of Memory Policies
-----------------------------

A NUMA memory policy consists of a "mode", optional mode flags, and
an optional set of nodes.  The mode determines the behavior of the
policy, the optional mode flags determine the behavior of the mode,
and the optional set of nodes can be viewed as the arguments to the
policy behavior.

Internally, memory policies are implemented by a reference counted
structure, struct mempolicy.  Details of this structure will be
discussed in context, below, as required to explain the behavior.

NUMA memory policy supports the following behavioral modes:

Default Mode--MPOL_DEFAULT
    This mode is only used in the memory policy APIs.  Internally,
    MPOL_DEFAULT is converted to the NULL memory policy in all
    policy scopes.  Any existing non-default policy will simply be
    removed when MPOL_DEFAULT is specified.  As a result,
    MPOL_DEFAULT means "fall back to the next most specific policy
    scope."

    For example, a NULL or default task policy will fall back to the
    system default policy.
    A NULL or default vma policy will fall back to the task policy.

    When specified in one of the memory policy APIs, the Default mode
    does not use the optional set of nodes.

    It is an error for the set of nodes specified for this policy to
    be non-empty.

MPOL_BIND
    This mode specifies that memory must come from the set of
    nodes specified by the policy.  Memory will be allocated from
    the node in the set with sufficient free memory that is
    closest to the node where the allocation takes place.

MPOL_PREFERRED
    This mode specifies that the allocation should be attempted
    from the single node specified in the policy.  If that
    allocation fails, the kernel will search other nodes, in order
    of increasing distance from the preferred node based on
    information provided by the platform firmware.

    Internally, the Preferred policy uses a single node--the
    preferred_node member of struct mempolicy.  When the internal
    mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
    and the policy is interpreted as local allocation.  "Local"
    allocation policy can be viewed as a Preferred policy that
    starts at the node containing the cpu where the allocation
    takes place.

    It is possible for the user to specify that local allocation
    is always preferred by passing an empty nodemask with this
    mode.  If an empty nodemask is passed, the policy cannot use
    the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
    described below.

MPOL_INTERLEAVE
    This mode specifies that page allocations be interleaved, on a
    page granularity, across the nodes specified in the policy.
    This mode also behaves slightly differently, based on the
    context where it is used:

    For allocation of anonymous pages and shared memory pages,
    Interleave mode indexes the set of nodes specified by the
    policy using the page offset of the faulting address into the
    segment [VMA] containing the address modulo the number of
    nodes specified by the policy.  It then attempts to allocate a
    page, starting at the selected node, as if the node had been
    specified by a Preferred policy or had been selected by a
    local allocation.  That is, allocation will follow the per
    node zonelist.

    For allocation of page cache pages, Interleave mode indexes
    the set of nodes specified by the policy using a node counter
    maintained per task.  This counter wraps around to the lowest
    specified node after it reaches the highest specified node.
    This will tend to spread the pages out over the nodes
    specified by the policy based on the order in which they are
    allocated, rather than based on any page offset into an
    address range or file.  During system boot up, the temporary
    interleaved system default policy works in this mode.

MPOL_PREFERRED_MANY
    This mode specifies that the allocation should be preferably
    satisfied from the nodemask specified in the policy.  If there is
    memory pressure on all nodes in the nodemask, the allocation
    can fall back to all existing numa nodes.  This is effectively
    MPOL_PREFERRED allowed for a mask rather than a single node.

NUMA memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES
    This flag specifies that the nodemask passed by
    the user should not be remapped if the task or VMA's set of allowed
    nodes changes after the memory policy has been defined.

    Without this flag, any time a mempolicy is rebound because of a
    change in the set of allowed nodes, the preferred nodemask (Preferred
    Many), preferred node (Preferred) or nodemask (Bind, Interleave) is
    remapped to the new set of allowed nodes.  This may result in nodes
    being used that were previously undesired.

    With this flag, if the user-specified nodes overlap with the
    nodes allowed by the task's cpuset, then the memory policy is
    applied to their intersection.  If the two sets of nodes do not
    overlap, the Default policy is used.

    For example, consider a task that is attached to a cpuset with
    mems 1-3 that sets an Interleave policy over the same set.  Without
    this flag, if the cpuset's mems change to 3-5, the Interleave will
    now occur over nodes 3, 4, and 5.  With this flag, however, since
    only node 3 is allowed from the user's nodemask, the "interleave"
    only occurs over that node.  If no nodes from the user's nodemask are
    now allowed, the Default behavior is used.

    MPOL_F_STATIC_NODES cannot be combined with the
    MPOL_F_RELATIVE_NODES flag.  It also cannot be used for
    MPOL_PREFERRED policies that were created with an empty nodemask
    (local allocation).

MPOL_F_RELATIVE_NODES
    This flag specifies that the nodemask passed
    by the user will be mapped relative to the task or VMA's
    set of allowed nodes.  The kernel stores the user-passed nodemask,
    and if the set of allowed nodes changes, then that original nodemask
    will be remapped relative to the new set of allowed nodes.

    Without this flag (and without MPOL_F_STATIC_NODES), anytime a
    mempolicy is rebound because of a change in the set of allowed
    nodes, the node (Preferred) or nodemask (Bind, Interleave) is
    remapped to the new set of allowed nodes.  That remap may not
    preserve the relative nature of the user's passed nodemask to its
    set of allowed nodes upon successive rebinds: a nodemask of
    1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
    allowed nodes is restored to its original state.

    With this flag, the remap is done so that the node numbers from
    the user's passed nodemask are relative to the set of allowed
    nodes.  In other words, if nodes 0, 2, and 4 are set in the user's
    nodemask, the policy will be effected over the first (and in the
    Bind or Interleave case, the third and fifth) nodes in the set of
    allowed nodes.  The nodemask passed by the user represents nodes
    relative to the task or VMA's set of allowed nodes.

    If the user's nodemask includes nodes that are outside the range
    of the new set of allowed nodes (for example, node 5 is set in
    the user's nodemask when the set of allowed nodes is only 0-3),
    then the remap wraps around to the beginning of the nodemask and,
    if not already set, sets the node in the mempolicy nodemask.

    For example, consider a task that is attached to a cpuset with
    mems 2-5 that sets an Interleave policy over the same set with
    MPOL_F_RELATIVE_NODES.  If the cpuset's mems change to 3-7, the
    interleave now occurs over nodes 3,5-7.  If the cpuset's mems
    then change to 0,2-3,5, then the interleave occurs over nodes
    0,2-3,5.

    Thanks to the consistent remapping, applications preparing
    nodemasks to specify memory policies using this flag should
    disregard their current, actual cpuset imposed memory placement
    and prepare the nodemask as if they were always located on
    memory nodes 0 to N-1, where N is the number of memory nodes the
    policy is intended to manage.  Let the kernel then remap to the
    set of memory nodes allowed by the task's cpuset, as that may
    change over time.

    MPOL_F_RELATIVE_NODES cannot be combined with the
    MPOL_F_STATIC_NODES flag.  It also cannot be used for
    MPOL_PREFERRED policies that were created with an empty nodemask
    (local allocation).

Memory Policy Reference Counting
================================

To resolve use/free races, struct mempolicy contains an atomic reference
count field.  Internal interfaces, mpol_get()/mpol_put() increment and
decrement this reference count, respectively.
mpol_put() will only free
the structure back to the mempolicy kmem cache when the reference count
goes to zero.

When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy.  When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
on completion of the policy installation.

During run-time "usage" of the policy, we attempt to minimize atomic operations
on the reference count, as this can lead to cache lines bouncing between cpus
and NUMA nodes.  "Usage" here means one of the following:

1) querying of the policy, either by the task itself [using the get_mempolicy()
   API discussed below] or by another task using the /proc/<pid>/numa_maps
   interface.

2) examination of the policy to determine the policy mode and associated node
   or node lists, if any, for page allocation.  This is considered a "hot
   path".  Note that for MPOL_BIND, the "usage" extends across the entire
   allocation process, which may sleep during page reclamation, because the
   BIND policy nodemask is used, by reference, to filter ineligible nodes.

We can avoid taking an extra reference during the usages listed above as
follows:

1) we never need to get/free the system default policy as this is never
   changed nor freed, once the system is up and running.

2) for querying the policy, we do not need to take an extra reference on the
   target task's task policy nor vma policies because we always acquire the
   task's mm's mmap_lock for read during the query.  The set_mempolicy() and
   mbind() APIs [see below] always acquire the mmap_lock for write when
   installing or replacing task or vma policies.  Thus, there is no possibility
   of a task or thread freeing a policy while another task or thread is
   querying it.

3) Page allocation usage of task or vma policy occurs in the fault path where
   we hold the mmap_lock for read.  Again, because replacing the task or vma
   policy requires that the mmap_lock be held for write, the policy can't be
   freed out from under us while we're using it for page allocation.

4) Shared policies require special consideration.  One task can replace a
   shared memory policy while another task, with a distinct mmap_lock, is
   querying or allocating a page based on the policy.
   To resolve this
   potential race, the shared policy infrastructure adds an extra reference
   to the shared policy during lookup while holding a spin lock on the shared
   policy management structure.  This requires that we drop this extra
   reference when we're finished "using" the policy.  We must drop the
   extra reference on shared policies in the same query/allocation paths
   used for non-shared policies.  For this reason, shared policies are marked
   as such, and the extra reference is dropped "conditionally"--i.e., only
   for shared policies.

   Because of this extra reference counting, and because we must look up
   shared policies in a tree structure under spinlock, shared policies are
   more expensive to use in the page allocation path.  This is especially
   true for shared policies on shared memory regions shared by tasks running
   on different NUMA nodes.  This extra overhead can be avoided by always
   falling back to task or system default policy for shared memory regions,
   or by prefaulting the entire shared memory region into memory and locking
   it down.  However, this might not be appropriate for all applications.

.. _memory_policy_apis:

Memory Policy APIs
==================

Linux supports 4 system calls for controlling memory policy.
These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

.. note::
   the headers that define these APIs and the parameter data types for
   user space applications reside in a package that is not part of the
   Linux kernel.  The kernel system call interfaces, with the 'sys\_'
   prefix, are defined in <linux/syscalls.h>; the mode and flag
   definitions are defined in <linux/mempolicy.h>.

Set [Task] Memory Policy::

    long set_mempolicy(int mode, const unsigned long *nmask,
                       unsigned long maxnode);

Sets the calling task's "task/process memory policy" to the mode
specified by the 'mode' argument and the set of nodes defined by
'nmask'.  'nmask' points to a bit mask of node ids containing at least
'maxnode' ids.  Optional mode flags may be passed by combining the
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
MPOL_F_STATIC_NODES).
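
As an illustration of the call described above, the following minimal sketch
builds a one-word nodemask and requests an interleave policy combined with
MPOL_F_STATIC_NODES.  It assumes that the kernel UAPI header
<linux/mempolicy.h> is installed and it invokes the system call through
syscall(2) instead of a library wrapper.  The helper names
nodemask_set_node() and set_interleave_static() are hypothetical and exist
only for this example; consult set_mempolicy(2) for the exact meaning of
'maxnode'.

.. code-block:: c

    #include <limits.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/mempolicy.h>

    /* Hypothetical helper: set bit 'node' in a one-word node mask. */
    static void nodemask_set_node(unsigned long *mask, unsigned int node)
    {
        if (node < sizeof(*mask) * CHAR_BIT)
            *mask |= 1UL << node;
    }

    /*
     * Hypothetical helper: request interleaving over the nodes in 'mask'
     * for the calling task, with MPOL_F_STATIC_NODES set.
     * Returns 0 on success, -1 with errno set on failure.
     */
    static long set_interleave_static(unsigned long mask)
    {
        /* Size of the one-word mask in bits. */
        unsigned long maxnode = sizeof(mask) * CHAR_BIT;

        return syscall(SYS_set_mempolicy,
                       (long)(MPOL_INTERLEAVE | MPOL_F_STATIC_NODES),
                       &mask, maxnode);
    }

A caller would typically start from an empty mask, add nodes 0 and 1 with
nodemask_set_node(), pass the result to set_interleave_static() and report
errno if it returns -1.  The call can fail, for example, on kernels built
without NUMA support or when the nodes are not allowed for the task.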

See the set_mempolicy(2) man page for more details.


Get [Task] Memory Policy or Related Information::

    long get_mempolicy(int *mode,
                       unsigned long *nmask, unsigned long maxnode,
                       void *addr, int flags);

Queries the "task/process memory policy" of the calling task, or the
policy or location of a specified virtual address, depending on the
'flags' argument.

See the get_mempolicy(2) man page for more details.


Install VMA/Shared Policy for a Range of Task's Address Space::

    long mbind(void *start, unsigned long len, int mode,
               const unsigned long *nmask, unsigned long maxnode,
               unsigned flags);

mbind() installs the policy specified by (mode, nmask, maxnode) as a
VMA policy for the range of the calling task's address space specified
by the 'start' and 'len' arguments.  Additional actions may be
requested via the 'flags' argument.

See the mbind(2) man page for more details.
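
To illustrate mbind(), the sketch below maps an anonymous private region and
tries to install a VMA policy on it before any page has been touched, so that
later page faults in the range are allocated according to that policy.  It
assumes that <linux/mempolicy.h> is available and calls the system call
through syscall(2).  The helper name map_with_policy() and its error
reporting convention are hypothetical, chosen only for this example.

.. code-block:: c

    #include <errno.h>
    #include <limits.h>
    #include <stddef.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <linux/mempolicy.h>

    /*
     * Hypothetical helper: map 'len' bytes of anonymous memory and try to
     * install 'mode' over the nodes in 'mask' as the VMA policy of the
     * new range.  Returns the mapping, or NULL if mmap() fails.
     * '*policy_errno' is 0 if mbind() succeeded, otherwise the errno it
     * reported; in that case the mapping is still usable but has no VMA
     * policy of its own.
     */
    static void *map_with_policy(size_t len, int mode, unsigned long mask,
                                 int *policy_errno)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return NULL;

        /* Nothing is faulted in yet, so no pages predate the policy. */
        *policy_errno = 0;
        if (syscall(SYS_mbind, p, len, (unsigned long)mode, &mask,
                    sizeof(mask) * CHAR_BIT, 0UL) != 0)
            *policy_errno = errno;

        return p;
    }

For instance, map_with_policy(len, MPOL_PREFERRED, 1UL, &err) asks for pages
in the new range to be allocated preferably from node 0.  When 'err' is
non-zero, the range simply falls back to the task policy as described in the
scope discussion earlier in this document.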

Set home node for a Range of Task's Address Space::

        long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
                                         unsigned long home_node,
                                         unsigned long flags);

sys_set_mempolicy_home_node sets the home node for a VMA policy present in the
task's address range. The system call updates the home node only for the existing
mempolicy range. Other address ranges are ignored. A home node is the NUMA node
from which, or closest to which, pages will be allocated. Specifying the home node
overrides the default allocation policy, which allocates memory close to the local
node of the executing CPU.


Memory Policy Command Line Interface
====================================

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
  exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers.  Some distributions
package the headers and compile-time libraries in a separate development
package.

.. _mem_pol_and_cpusets:

Memory Policies and cpusets
===========================

Memory policies work within cpusets as described above.  For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints.  If the nodemask
specified for the policy contains nodes that are not allowed by the cpuset and
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
specified for the policy and the set of nodes with memory is used.  If the
result is the empty set, the policy is considered invalid and cannot be
installed.  If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
onto and folded into the task's set of allowed nodes as previously described.

The interaction of memory policies and cpusets can be problematic when tasks
in two cpusets share access to a memory region, such as shared memory segments
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags.  If
any of the tasks install shared policy on the region, only nodes whose
memories are allowed in both cpusets may be used in the policies.  Obtaining
this information requires "stepping outside" the memory policy APIs to use the
cpuset information and requires that one know in what cpusets other tasks might
be attaching to the shared region.
Furthermore, if the cpusets' allowed
memory sets are disjoint, "local" allocation is the only valid policy.
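
The node-set rules in this section reduce to simple mask arithmetic.  The
sketch below is a user-space model for illustration only, not kernel code:
all function names are hypothetical, node sets are limited to a single
``unsigned long``, and the MPOL_F_RELATIVE_NODES remapping is deliberately
left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: mask bit for one node id (one-word masks only). */
static unsigned long node_bit(unsigned int node)
{
        assert(node < 8 * sizeof(unsigned long));
        return 1UL << node;
}

/*
 * Nodes a policy keeps when MPOL_F_RELATIVE_NODES is not used: the
 * requested nodes that are also in the allowed set.
 */
static unsigned long effective_nodes(unsigned long requested,
                                     unsigned long allowed)
{
        return requested & allowed;
}

/* An empty intersection means the policy cannot be installed. */
static bool policy_is_installable(unsigned long requested,
                                  unsigned long allowed)
{
        return effective_nodes(requested, allowed) != 0;
}

/*
 * Nodes that may appear in a shared policy on a region used by tasks from
 * two cpusets: only nodes allowed in both.  A result of 0 models the
 * disjoint case, where only "local" allocation remains valid.
 */
static unsigned long shared_policy_nodes(unsigned long allowed_a,
                                         unsigned long allowed_b)
{
        return allowed_a & allowed_b;
}
```

For example, with one cpuset allowing nodes 0-1 and another allowing nodes
1-2, ``shared_policy_nodes()`` yields only node 1; with cpusets allowing
node 0 and node 2 respectively, it yields the empty set.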