==================
NUMA Memory Policy
==================

What is NUMA Memory Policy?
============================

In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system.  Linux has
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
The current memory policy support was added to Linux 2.6 around May 2004.  This
document attempts to describe the concepts and APIs of the 2.6 memory policy
support.

Memory policies should not be confused with cpusets
(``Documentation/admin-guide/cgroup-v1/cpusets.rst``),
which are an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes.  Memory policies are a
programming interface that a NUMA-aware application can take advantage of.  When
both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority.  See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
below for more details.

Memory Policy Concepts
======================

Scope of Memory Policies
------------------------

The Linux kernel supports *scopes* of memory policy, described here from
most general to most specific:

System Default Policy
	This policy is "hard coded" into the kernel.  It is the policy
	that governs all page allocations that aren't controlled by
	one of the more specific policy scopes discussed below.  When
	the system is "up and running", the system default policy will
	use "local allocation" described below.  However, during boot
	up, the system default policy will be set to interleave
	allocations across all nodes with "sufficient" memory, so as
	not to overload the initial boot node with boot-time
	allocations.

Task/Process Policy
	This is an optional, per-task policy.  When defined for a
	specific task, this policy controls all page allocations made
	by or on behalf of the task that aren't controlled by a more
	specific scope.  If a task does not define a task policy, then
	all page allocations that would have been controlled by the
	task policy "fall back" to the System Default Policy.

	The task policy applies to the entire address space of a task.  Thus,
	it is inheritable, and indeed is inherited, across both fork()
	[clone() without the CLONE_VM flag] and exec*().  This allows a parent
	task to establish the task policy for a child task exec()'d from an
	executable image that has no awareness of memory policy.  See the
	:ref:`Memory Policy APIs <memory_policy_apis>` section,
	below, for an overview of the system call
	that a task may use to set/change its task/process policy.

	In a multi-threaded task, task policies apply only to the thread
	[Linux kernel task] that installs the policy and any threads
	subsequently created by that thread.  Any sibling threads existing
	at the time a new task policy is installed retain their current
	policy.

	A task policy applies only to pages allocated after the policy is
	installed.  Any pages already faulted in by the task when the task
	changes its task policy remain where they were allocated based on
	the policy at the time they were allocated.

.. _vma_policy:

VMA Policy
	A "VMA" or "Virtual Memory Area" refers to a range of a task's
	virtual address space.  A task may define a specific policy for a range
	of its virtual address space.  See the
	:ref:`Memory Policy APIs <memory_policy_apis>` section,
	below, for an overview of the mbind() system call used to set a VMA
	policy.

	A VMA policy will govern the allocation of pages that back
	this region of the address space.  Any regions of the task's
	address space that don't have an explicit VMA policy will fall
	back to the task policy, which may itself fall back to the
	System Default Policy.

	VMA policies have a few complicating details:

	* VMA policy applies ONLY to anonymous pages.  These include
	  pages allocated for anonymous segments, such as the task
	  stack and heap, and any regions of the address space
	  mmap()ed with the MAP_ANONYMOUS flag.  If a VMA policy is
	  applied to a file mapping, it will be ignored if the mapping
	  used the MAP_SHARED flag.  If the file mapping used the
	  MAP_PRIVATE flag, the VMA policy will only be applied when
	  an anonymous page is allocated on an attempt to write to the
	  mapping--i.e., at Copy-On-Write.

	* VMA policies are shared between all tasks that share a
	  virtual address space--a.k.a. threads--independent of when
	  the policy is installed; and they are inherited across
	  fork().  However, because VMA policies refer to a specific
	  region of a task's address space, and because the address
	  space is discarded and recreated on exec*(), VMA policies
	  are NOT inheritable across exec().  Thus, only NUMA-aware
	  applications may use VMA policies.

	* A task may install a new VMA policy on a sub-range of a
	  previously mmap()ed region.  When this happens, Linux splits
	  the existing virtual memory area into 2 or 3 VMAs, each with
	  its own policy.

	* By default, VMA policy applies only to pages allocated after
	  the policy is installed.  Any pages already faulted into the
	  VMA range remain where they were allocated based on the
	  policy at the time they were allocated.  However, since
	  2.6.16, Linux supports page migration via the mbind() system
	  call, so that page contents can be moved to match a newly
	  installed policy.

Shared Policy
	Conceptually, shared policies apply to "memory objects" mapped
	shared into one or more tasks' distinct address spaces.  An
	application installs shared policies the same way as VMA
	policies--using the mbind() system call specifying a range of
	virtual addresses that map the shared object.  However, unlike
	VMA policies, which can be considered to be an attribute of a
	range of a task's address space, shared policies apply
	directly to the shared object.  Thus, all tasks that attach to
	the object share the policy, and all pages allocated for the
	shared object, by any task, will obey the shared policy.

	As of 2.6.22, only shared memory segments, created by shmget() or
	mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
	policy support was added to Linux, the associated data structures were
	added to hugetlbfs shmem segments.  At the time, hugetlbfs did not
	support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
	shmem segments were never "hooked up" to the shared policy support.
	Although hugetlbfs segments now support lazy allocation, their support
	for shared policy has not been completed.

	As mentioned above in the :ref:`VMA policies <vma_policy>` section,
	allocations of page cache pages for regular files mmap()ed
	with MAP_SHARED ignore any VMA policy installed on the virtual
	address range backed by the shared file mapping.  Rather,
	shared page cache pages, including pages backing private
	mappings that have not yet been written by the task, follow
	task policy, if any, else System Default Policy.

	The shared policy infrastructure supports different policies on subset
	ranges of the shared object.  However, Linux still splits the VMA of
	the task that installs the policy for each range of distinct policy.
	Thus, different tasks that attach to a shared memory segment can have
	different VMA configurations mapping that one shared object.  This
	can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
	a shared memory region, when one task has installed shared policy on
	one or more ranges of the region.
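
The fallback between the VMA, task and system default scopes can be modeled
in a few lines of user-space C.  This is only an illustrative sketch: struct
policy and effective_policy() are hypothetical names invented for the example
and do not mirror the kernel's internal data structures.  A NULL pointer
stands for "no policy defined at this scope".

.. code-block:: c

	#include <stddef.h>

	/* Hypothetical stand-in for a policy; not the kernel's struct mempolicy. */
	struct policy {
		const char *name;
	};

	/* The system default policy always exists. */
	static const struct policy system_default = {
		"system default (local allocation)"
	};

	/*
	 * Pick the policy that governs an allocation: the most specific scope
	 * that defines a policy wins.  NULL means "no policy at this scope".
	 */
	static const struct policy *effective_policy(const struct policy *vma_pol,
						     const struct policy *task_pol)
	{
		if (vma_pol)
			return vma_pol;
		if (task_pol)
			return task_pol;
		return &system_default;
	}

In this model, a range with its own VMA policy uses that policy, a range
without one uses the task policy, and a task without a task policy ends up
with the system default.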

Components of Memory Policies
-----------------------------

A NUMA memory policy consists of a "mode", optional mode flags, and
an optional set of nodes.  The mode determines the behavior of the
policy, the optional mode flags determine the behavior of the mode,
and the optional set of nodes can be viewed as the arguments to the
policy behavior.

Internally, memory policies are implemented by a reference counted
structure, struct mempolicy.  Details of this structure will be
discussed in context, below, as required to explain the behavior.

NUMA memory policy supports the following 5 behavioral modes:

Default Mode--MPOL_DEFAULT
	This mode is only used in the memory policy APIs.  Internally,
	MPOL_DEFAULT is converted to the NULL memory policy in all
	policy scopes.  Any existing non-default policy will simply be
	removed when MPOL_DEFAULT is specified.  As a result,
	MPOL_DEFAULT means "fall back to the next most specific policy
	scope."

	For example, a NULL or default task policy will fall back to the
	system default policy.  A NULL or default vma policy will fall
	back to the task policy.

	When specified in one of the memory policy APIs, the Default mode
	does not use the optional set of nodes.

	It is an error for the set of nodes specified for this policy to
	be non-empty.

MPOL_BIND
	This mode specifies that memory must come from the set of
	nodes specified by the policy.  Memory will be allocated from
	the node in the set with sufficient free memory that is
	closest to the node where the allocation takes place.

MPOL_PREFERRED
	This mode specifies that the allocation should be attempted
	from the single node specified in the policy.  If that
	allocation fails, the kernel will search other nodes, in order
	of increasing distance from the preferred node based on
	information provided by the platform firmware.

	Internally, the Preferred policy uses a single node--the
	preferred_node member of struct mempolicy.  When the internal
	mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
	and the policy is interpreted as local allocation.  "Local"
	allocation policy can be viewed as a Preferred policy that
	starts at the node containing the cpu where the allocation
	takes place.

	It is possible for the user to specify that local allocation
	is always preferred by passing an empty nodemask with this
	mode.  If an empty nodemask is passed, the policy cannot use
	the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
	described below.

MPOL_INTERLEAVE
	This mode specifies that page allocations be interleaved, on a
	page granularity, across the nodes specified in the policy.
	This mode also behaves slightly differently, based on the
	context where it is used:

	For allocation of anonymous pages and shared memory pages,
	Interleave mode indexes the set of nodes specified by the
	policy using the page offset of the faulting address into the
	segment [VMA] containing the address modulo the number of
	nodes specified by the policy.  It then attempts to allocate a
	page, starting at the selected node, as if the node had been
	specified by a Preferred policy or had been selected by a
	local allocation.  That is, allocation will follow the per
	node zonelist.

	For allocation of page cache pages, Interleave mode indexes
	the set of nodes specified by the policy using a node counter
	maintained per task.  This counter wraps around to the lowest
	specified node after it reaches the highest specified node.
	This will tend to spread the pages out over the nodes
	specified by the policy based on the order in which they are
	allocated, rather than based on any page offset into an
	address range or file.  During system boot up, the temporary
	interleaved system default policy works in this mode.

MPOL_PREFERRED_MANY
	This mode specifies that the allocation should preferably be
	satisfied from the nodemask specified in the policy.  If there is
	memory pressure on all nodes in the nodemask, the allocation
	can fall back to all existing NUMA nodes.  This is effectively
	MPOL_PREFERRED allowed for a mask rather than a single node.
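
The node selection that Interleave mode performs for anonymous and shared
memory pages, as described above, can be sketched in user space as follows.
This is a simplified model, not the kernel's code: interleave_node() and
count_nodes() are hypothetical helpers, and the nodemask is assumed to fit in
a single unsigned long with bit n standing for node n.

.. code-block:: c

	#include <assert.h>

	/* Number of nodes set in 'nodemask'. */
	static unsigned int count_nodes(unsigned long nodemask)
	{
		unsigned int n = 0;

		for (; nodemask; nodemask &= nodemask - 1)
			n++;
		return n;
	}

	/*
	 * Return the node at which allocation starts for the page located
	 * 'page_offset' pages into its VMA: the page offset, modulo the number
	 * of nodes in the policy, indexes the set bits of the nodemask.
	 * Returns -1 if the nodemask is empty.
	 */
	static int interleave_node(unsigned long nodemask, unsigned long page_offset)
	{
		unsigned int nnodes = count_nodes(nodemask);
		unsigned int target, node;

		if (nnodes == 0)
			return -1;

		target = (unsigned int)(page_offset % nnodes);
		for (node = 0; node < 8 * sizeof(nodemask); node++) {
			if (!(nodemask & (1UL << node)))
				continue;
			if (target-- == 0)
				return (int)node;
		}
		assert(0);	/* not reached for a non-empty mask */
		return -1;
	}

With a policy over nodes 1, 3 and 5 (mask 0x2a), pages at offsets 0, 1, 2 and
3 start their allocation at nodes 1, 3, 5 and 1 respectively.  The real
allocation may still fall back along the selected node's zonelist.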

NUMA memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES
	This flag specifies that the nodemask passed by
	the user should not be remapped if the task or VMA's set of allowed
	nodes changes after the memory policy has been defined.

	Without this flag, any time a mempolicy is rebound because of a
	change in the set of allowed nodes, the preferred nodemask (Preferred
	Many), preferred node (Preferred) or nodemask (Bind, Interleave) is
	remapped to the new set of allowed nodes.  This may result in nodes
	being used that were previously undesired.

	With this flag, if the user-specified nodes overlap with the
	nodes allowed by the task's cpuset, then the memory policy is
	applied to their intersection.  If the two sets of nodes do not
	overlap, the Default policy is used.

	For example, consider a task that is attached to a cpuset with
	mems 1-3 that sets an Interleave policy over the same set.  Without
	this flag, if the cpuset's mems change to 3-5, the Interleave will
	now occur over nodes 3, 4, and 5.  With this flag, however, since
	only node 3 is allowed from the user's nodemask, the "interleave"
	only occurs over that node.  If no nodes from the user's nodemask are
	now allowed, the Default behavior is used.

	MPOL_F_STATIC_NODES cannot be combined with the
	MPOL_F_RELATIVE_NODES flag.  It also cannot be used for
	MPOL_PREFERRED policies that were created with an empty nodemask
	(local allocation).

MPOL_F_RELATIVE_NODES
	This flag specifies that the nodemask passed
	by the user will be mapped relative to the task or VMA's
	set of allowed nodes.  The kernel stores the user-passed nodemask,
	and if the set of allowed nodes changes, then that original nodemask
	will be remapped relative to the new set of allowed nodes.

	Without this flag (and without MPOL_F_STATIC_NODES), any time a
	mempolicy is rebound because of a change in the set of allowed
	nodes, the node (Preferred) or nodemask (Bind, Interleave) is
	remapped to the new set of allowed nodes.  That remap may not
	preserve the relative nature of the user's passed nodemask to its
	set of allowed nodes upon successive rebinds: a nodemask of
	1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
	allowed nodes is restored to its original state.

	With this flag, the remap is done so that the node numbers from
	the user's passed nodemask are relative to the set of allowed
	nodes.  In other words, if nodes 0, 2, and 4 are set in the user's
	nodemask, the policy will be effected over the first (and in the
	Bind or Interleave case, the third and fifth) nodes in the set of
	allowed nodes.  The nodemask passed by the user represents nodes
	relative to the task or VMA's set of allowed nodes.

	If the user's nodemask includes nodes that are outside the range
	of the new set of allowed nodes (for example, node 5 is set in
	the user's nodemask when the set of allowed nodes is only 0-3),
	then the remap wraps around to the beginning of the nodemask and,
	if not already set, sets the node in the mempolicy nodemask.

	For example, consider a task that is attached to a cpuset with
	mems 2-5 that sets an Interleave policy over the same set with
	MPOL_F_RELATIVE_NODES.  If the cpuset's mems change to 3-7, the
	interleave now occurs over nodes 3,5-7.  If the cpuset's mems
	then change to 0,2-3,5, then the interleave occurs over nodes
	0,2-3,5.

	Thanks to the consistent remapping, applications preparing
	nodemasks to specify memory policies using this flag should
	disregard their current, actual cpuset imposed memory placement
	and prepare the nodemask as if they were always located on
	memory nodes 0 to N-1, where N is the number of memory nodes the
	policy is intended to manage.  Let the kernel then remap to the
	set of memory nodes allowed by the task's cpuset, as that may
	change over time.

	MPOL_F_RELATIVE_NODES cannot be combined with the
	MPOL_F_STATIC_NODES flag.  It also cannot be used for
	MPOL_PREFERRED policies that were created with an empty nodemask
	(local allocation).
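
The three rebinding behaviors described for these flags can be compared with
a small user-space model.  All function names below are hypothetical, and
nodemasks are assumed to fit in one unsigned long (bit n stands for node n).
In particular, rebind_default() assumes that the flag-less remap is
positional (the n-th node of the old allowed set maps to the n-th node of the
new one); this is an assumption made for illustration, and the kernel's exact
algorithm may differ.

.. code-block:: c

	#include <assert.h>

	#define NODE_BITS	(8 * sizeof(unsigned long))

	/* Number of nodes set in 'mask'. */
	static unsigned int weight(unsigned long mask)
	{
		unsigned int n = 0;

		for (; mask; mask &= mask - 1)
			n++;
		return n;
	}

	/* Node number of the n-th (0-based) node set in 'mask'. */
	static unsigned int nth_node(unsigned long mask, unsigned int n)
	{
		unsigned int node;

		assert(n < weight(mask));
		for (node = 0; node < NODE_BITS; node++)
			if ((mask & (1UL << node)) && n-- == 0)
				break;
		return node;
	}

	/* No flag: move each policy node to the same position in the new set. */
	static unsigned long rebind_default(unsigned long pol,
					    unsigned long old_allowed,
					    unsigned long new_allowed)
	{
		unsigned int w = weight(new_allowed), node, pos;
		unsigned long out = 0;

		if (w == 0)
			return pol;
		for (node = 0; node < NODE_BITS; node++) {
			unsigned long bit = 1UL << node;

			if (!(pol & bit))
				continue;
			if (!(old_allowed & bit)) {
				out |= bit;	/* not in the old set: kept as is */
				continue;
			}
			/* position of 'node' within the old allowed set */
			pos = weight(old_allowed & (bit - 1));
			out |= 1UL << nth_node(new_allowed, pos % w);
		}
		return out;
	}

	/* MPOL_F_STATIC_NODES: intersection; 0 means "Default is used". */
	static unsigned long rebind_static(unsigned long user,
					   unsigned long new_allowed)
	{
		return user & new_allowed;
	}

	/* MPOL_F_RELATIVE_NODES: user node n selects the (n mod w)-th allowed node. */
	static unsigned long rebind_relative(unsigned long user,
					     unsigned long new_allowed)
	{
		unsigned int w = weight(new_allowed), node;
		unsigned long out = 0;

		if (w == 0)
			return 0;
		for (node = 0; node < NODE_BITS; node++)
			if (user & (1UL << node))
				out |= 1UL << nth_node(new_allowed, node % w);
		return out;
	}

Using the examples from the text: a policy over nodes 1-3 (0x0e) whose cpuset
changes from 1-3 to 3-5 (0x38) becomes 0x38 without flags and 0x08 (node 3
only) with MPOL_F_STATIC_NODES.  A relative nodemask of 2-5 (0x3c) becomes
0xe8 (nodes 3,5-7) when the allowed set is 3-7, and 0x2d (nodes 0,2-3,5) when
the allowed set is 0,2-3,5.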

Memory Policy Reference Counting
================================

To resolve use/free races, struct mempolicy contains an atomic reference
count field.  Internal interfaces, mpol_get()/mpol_put(), increment and
decrement this reference count, respectively.  mpol_put() will only free
the structure back to the mempolicy kmem cache when the reference count
goes to zero.

When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy.  When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
on completion of the policy installation.

During run-time "usage" of the policy, we attempt to minimize atomic operations
on the reference count, as this can lead to cache lines bouncing between cpus
and NUMA nodes.  "Usage" here means one of the following:

1) querying of the policy, either by the task itself [using the get_mempolicy()
   API discussed below] or by another task using the /proc/<pid>/numa_maps
   interface.

2) examination of the policy to determine the policy mode and associated node
   or node lists, if any, for page allocation.  This is considered a "hot
   path".  Note that for MPOL_BIND, the "usage" extends across the entire
   allocation process, which may sleep during page reclamation, because the
   BIND policy nodemask is used, by reference, to filter ineligible nodes.

We can avoid taking an extra reference during the usages listed above as
follows:

1) We never need to get/free the system default policy as this is never
   changed nor freed, once the system is up and running.

2) For querying the policy, we do not need to take an extra reference on the
   target task's task policy nor vma policies because we always acquire the
   task's mm's mmap_lock for read during the query.  The set_mempolicy() and
   mbind() APIs [see below] always acquire the mmap_lock for write when
   installing or replacing task or vma policies.  Thus, there is no possibility
   of a task or thread freeing a policy while another task or thread is
   querying it.

3) Page allocation usage of task or vma policy occurs in the fault path where
   we hold the mmap_lock for read.  Again, because replacing the task or vma
   policy requires that the mmap_lock be held for write, the policy can't be
   freed out from under us while we're using it for page allocation.

4) Shared policies require special consideration.  One task can replace a
   shared memory policy while another task, with a distinct mmap_lock, is
   querying or allocating a page based on the policy.  To resolve this
   potential race, the shared policy infrastructure adds an extra reference
   to the shared policy during lookup while holding a spin lock on the shared
   policy management structure.  This requires that we drop this extra
   reference when we're finished "using" the policy.  We must drop the
   extra reference on shared policies in the same query/allocation paths
   used for non-shared policies.  For this reason, shared policies are marked
   as such, and the extra reference is dropped "conditionally"--i.e., only
   for shared policies.

   Because of this extra reference counting, and because we must look up
   shared policies in a tree structure under spinlock, shared policies are
   more expensive to use in the page allocation path.  This is especially
   true for shared policies on shared memory regions shared by tasks running
   on different NUMA nodes.  This extra overhead can be avoided by always
   falling back to task or system default policy for shared memory regions,
   or by prefaulting the entire shared memory region into memory and locking
   it down.  However, this might not be appropriate for all applications.
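
The "conditional" drop of the extra reference can be illustrated with a
user-space model built on C11 atomics.  This is a sketch only: struct policy
and the policy_*() helpers are hypothetical names loosely modeled on the
description above, not the kernel's struct mempolicy or its mpol_get() and
mpol_put() interfaces.

.. code-block:: c

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdlib.h>

	/* Hypothetical user-space model of a reference counted policy. */
	struct policy {
		atomic_int refcnt;
		bool shared;		/* marks a shared policy */
	};

	/* A new policy starts with one reference, held by the installer. */
	static struct policy *policy_new(bool shared)
	{
		struct policy *pol = malloc(sizeof(*pol));

		if (!pol)
			return NULL;
		atomic_init(&pol->refcnt, 1);
		pol->shared = shared;
		return pol;
	}

	/* Take an extra reference, as a shared policy lookup would. */
	static void policy_get(struct policy *pol)
	{
		atomic_fetch_add(&pol->refcnt, 1);
	}

	/* Drop one reference; free on the last one.  Returns true if freed. */
	static bool policy_put(struct policy *pol)
	{
		if (atomic_fetch_sub(&pol->refcnt, 1) != 1)
			return false;
		free(pol);
		return true;
	}

	/*
	 * Called when a query or allocation is done with the policy.  Only
	 * shared policies took an extra reference at lookup time, so only
	 * they drop one here; other policies avoid the atomic operation.
	 */
	static bool policy_cond_put(struct policy *pol)
	{
		return pol->shared ? policy_put(pol) : false;
	}

In this model the common, non-shared path touches the reference count only
when a policy is installed or removed, which is the point of items 1 to 3
above.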

.. _memory_policy_apis:

Memory Policy APIs
==================

Linux supports 4 system calls for controlling memory policy.  These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

.. note::
   The headers that define these APIs and the parameter data types for
   user space applications reside in a package that is not part of the
   Linux kernel.  The kernel system call interfaces, with the 'sys\_'
   prefix, are defined in <linux/syscalls.h>; the mode and flag
   definitions are defined in <linux/mempolicy.h>.

Set [Task] Memory Policy::

	long set_mempolicy(int mode, const unsigned long *nmask,
			   unsigned long maxnode);

Sets the calling task's "task/process memory policy" to the mode
specified by the 'mode' argument and the set of nodes defined by
'nmask'.  'nmask' points to a bit mask of node ids containing at least
'maxnode' ids.  Optional mode flags may be passed by combining the
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
MPOL_F_STATIC_NODES).

See the set_mempolicy(2) man page for more details.
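
As a hedged illustration of the call, the sketch below invokes the system
call directly through syscall(2), so that it does not depend on the separate
user-space package mentioned in the note above.  The helper names
nodemask_set() and interleave_on_nodes_0_and_1(), the MASK_LONGS size and the
choice of nodes 0 and 1 are assumptions made for this example.

.. code-block:: c

	#define _GNU_SOURCE
	#include <assert.h>
	#include <linux/mempolicy.h>	/* MPOL_* modes and flags */
	#include <sys/syscall.h>	/* SYS_set_mempolicy */
	#include <unistd.h>		/* syscall() */

	#define BITS_PER_LONG	(8 * sizeof(unsigned long))
	#define MASK_LONGS	16	/* example size of the node bit mask */

	/* Set the bit for 'node' in a nodemask of MASK_LONGS longs. */
	static void nodemask_set(unsigned long *mask, unsigned int node)
	{
		assert(node < MASK_LONGS * BITS_PER_LONG);
		mask[node / BITS_PER_LONG] |= 1UL << (node % BITS_PER_LONG);
	}

	/*
	 * Request that the calling task's future allocations be interleaved
	 * over nodes 0 and 1, without remapping if the cpuset changes.
	 * Returns 0 on success, or -1 with errno set, for example when the
	 * kernel lacks NUMA support or the caller is not permitted to use
	 * the system call.
	 */
	static long interleave_on_nodes_0_and_1(void)
	{
		unsigned long mask[MASK_LONGS] = { 0 };

		nodemask_set(mask, 0);
		nodemask_set(mask, 1);
		return syscall(SYS_set_mempolicy,
			       MPOL_INTERLEAVE | MPOL_F_STATIC_NODES,
			       mask, MASK_LONGS * BITS_PER_LONG);
	}

Only pages allocated after the call are affected; pages the task has already
faulted in stay where they are.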


Get [Task] Memory Policy or Related Information::

	long get_mempolicy(int *mode,
			   unsigned long *nmask, unsigned long maxnode,
			   void *addr, int flags);

Queries the "task/process memory policy" of the calling task, or the
policy or location of a specified virtual address, depending on the
'flags' argument.

See the get_mempolicy(2) man page for more details.
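
The two kinds of query mentioned above can be sketched as follows, again
calling the system call directly.  The helper names and the ``MASK_WORDS``
size are assumptions made for this example; a real program would size the
mask from the number of possible nodes on the system.

.. code-block:: c

   #define _GNU_SOURCE
   #include <stddef.h>
   #include <sys/syscall.h>
   #include <unistd.h>
   #include <linux/mempolicy.h>    /* MPOL_F_NODE, MPOL_F_ADDR */

   #define MASK_WORDS 16   /* assumed large enough for this system's nodes */

   /*
    * Hypothetical helper: fetch the calling task's policy.  With a NULL
    * 'addr' and zero 'flags' the kernel reports the task policy mode in
    * *mode and its nodes in nodes[].  Returns 0 or -1 with errno set.
    */
   static int query_task_policy(int *mode, unsigned long nodes[MASK_WORDS])
   {
       return (int)syscall(SYS_get_mempolicy, mode, nodes,
                           (unsigned long)(8 * sizeof(unsigned long) * MASK_WORDS),
                           NULL, 0UL);
   }

   /*
    * Hypothetical helper: ask on which node the page backing 'addr'
    * resides.  With MPOL_F_NODE | MPOL_F_ADDR the node id is returned
    * through the 'mode' pointer.  Returns the node id or -1 on failure.
    */
   static int node_of_address(void *addr)
   {
       int node = -1;

       if (syscall(SYS_get_mempolicy, &node, NULL, 0UL, addr,
                   (unsigned long)(MPOL_F_NODE | MPOL_F_ADDR)) != 0)
           return -1;
       return node;
   }

Both helpers report failure rather than aborting, since the call can be
unavailable or refused in some environments.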


Install VMA/Shared Policy for a Range of Task's Address Space::

	long mbind(void *start, unsigned long len, int mode,
		   const unsigned long *nmask, unsigned long maxnode,
		   unsigned flags);

mbind() installs the policy specified by (mode, nmask, maxnode) as a
VMA policy for the range of the calling task's address space specified
by the 'start' and 'len' arguments.  Additional actions may be
requested via the 'flags' argument.

See the mbind(2) man page for more details.
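
A minimal sketch of installing a VMA policy on a freshly mapped range might
look like the following.  The helper name ``map_preferring_node()`` and its
error-reporting convention are hypothetical; MPOL_PREFERRED is used only as
an example mode, and the node id is assumed to fit in a one-word mask.

.. code-block:: c

   #define _GNU_SOURCE
   #include <stddef.h>
   #include <sys/mman.h>
   #include <sys/syscall.h>
   #include <unistd.h>
   #include <linux/mempolicy.h>    /* MPOL_PREFERRED */

   /*
    * Hypothetical helper: map 'len' bytes of anonymous memory and install
    * a VMA policy preferring 'node' for it.  Returns the mapping, or NULL
    * if 'node' does not fit in the one-word mask or mmap() fails.  The
    * result of mbind() (0, or -1 with errno set) is stored in *bind_ret
    * so the caller can decide whether running with the default policy
    * is acceptable.
    */
   static void *map_preferring_node(size_t len, unsigned int node, int *bind_ret)
   {
       unsigned long nodes;
       void *p;

       if (node >= 8 * sizeof(nodes))
           return NULL;
       nodes = 1UL << node;

       p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
       if (p == MAP_FAILED)
           return NULL;

       /* flags == 0: only set the policy; no additional actions requested. */
       *bind_ret = (int)syscall(SYS_mbind, p, (unsigned long)len,
                                (unsigned long)MPOL_PREFERRED, &nodes,
                                (unsigned long)(8 * sizeof(nodes)), 0UL);
       return p;
   }

Because the policy is installed before the pages are first touched, it
governs where they are allocated when the task faults them in.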

Set home node for a Range of Task's Address Space::

	long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
					 unsigned long home_node,
					 unsigned long flags);

sys_set_mempolicy_home_node sets the home node for a VMA policy present in the
task's address range. The system call updates the home node only for the existing
mempolicy range. Other address ranges are ignored. A home node is the NUMA node
at, or closest to, which pages will be allocated. Specifying the home node overrides
the default allocation policy to allocate memory close to the local node for an
executing CPU.
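
The call can be sketched as below.  The helper name ``set_home_node()`` is
hypothetical, and the sketch assumes the range already carries a VMA policy
installed with mbind().  Older user space headers may not define the system
call number, in which case the sketch simply reports ENOSYS instead of
guessing a number.

.. code-block:: c

   #define _GNU_SOURCE
   #include <errno.h>
   #include <stddef.h>
   #include <sys/syscall.h>
   #include <unistd.h>

   /*
    * Hypothetical helper: set 'home_node' for the VMA policies found in
    * [start, start + len).  'flags' is passed as 0.  Returns 0 or -1 with
    * errno set; ENOSYS is reported when the installed headers do not know
    * the system call number.
    */
   static int set_home_node(void *start, size_t len, unsigned long home_node)
   {
   #ifdef SYS_set_mempolicy_home_node
       return (int)syscall(SYS_set_mempolicy_home_node,
                           (unsigned long)start, (unsigned long)len,
                           home_node, 0UL);
   #else
       (void)start;
       (void)len;
       (void)home_node;
       errno = ENOSYS;
       return -1;
   #endif
   }

A caller would typically invoke this right after a successful mbind() on the
same range and treat a failure as non-fatal.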


Memory Policy Command Line Interface
====================================

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
  exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers.  Some distributions
package the headers and compile-time libraries in a separate development
package.

.. _mem_pol_and_cpusets:

Memory Policies and cpusets
===========================

Memory policies work within cpusets as described above.  For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints.  If the nodemask
specified for the policy contains nodes that are not allowed by the cpuset and
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
specified for the policy and the set of nodes with memory is used.  If the
result is the empty set, the policy is considered invalid and cannot be
installed.  If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
onto and folded into the task's set of allowed nodes as previously described.

The interaction of memory policies and cpusets can be problematic when tasks
in two cpusets share access to a memory region, such as shared memory segments
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags.  If
any of the tasks installs shared policy on the region, only nodes whose
memories are allowed in both cpusets may be used in the policies.  Obtaining
this information requires "stepping outside" the memory policy APIs to use the
cpuset information and requires that one know in what cpusets other tasks might
be attaching to the shared region.  Furthermore, if the cpusets' allowed
memory sets are disjoint, "local" allocation is the only valid policy.
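
The node set usable for such a shared policy can be sketched as the
intersection of the two cpusets' allowed node lists.  The sketch below
assumes the caller has already obtained each list as text in the "0-1,3"
format, for instance from the ``Mems_allowed_list`` field of
``/proc/<pid>/status``; the function names are hypothetical.

.. code-block:: c

   #include <stdlib.h>

   /*
    * Hypothetical helper: turn a node list such as "0-1,3" into a mask
    * with bit n set for node n.  Nodes beyond the width of the mask are
    * ignored; parsing stops at the first malformed entry.
    */
   static unsigned long parse_node_list(const char *s)
   {
       const unsigned long nbits = 8 * sizeof(unsigned long);
       unsigned long mask = 0;

       for (;;) {
           char *end;
           unsigned long lo = strtoul(s, &end, 10);
           unsigned long hi = lo;

           if (end == s)
               break;              /* no number here */
           if (*end == '-') {
               s = end + 1;
               hi = strtoul(s, &end, 10);
               if (end == s)
                   break;          /* range with no upper bound */
           }
           for (unsigned long n = lo; n <= hi && n < nbits; n++)
               mask |= 1UL << n;
           if (*end != ',')
               break;              /* end of list */
           s = end + 1;
       }
       return mask;
   }

   /* Nodes usable in a shared policy: those allowed in both cpusets. */
   static unsigned long shared_policy_nodes(const char *list_a,
                                            const char *list_b)
   {
       return parse_node_list(list_a) & parse_node_list(list_b);
   }

A result of zero corresponds to the disjoint case described above, where
only "local" allocation remains valid.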