=======================
NUMA Memory Performance
=======================

NUMA Locality
=============

Some platforms may have multiple types of memory attached to a compute
node. These disparate memory ranges may share some characteristics, such
as CPU cache coherence, but may have different performance. For example,
different media types and buses affect bandwidth and latency.

A system supports such heterogeneous memory by grouping each memory type
under different domains, or "nodes", based on locality and performance
characteristics. Some memory may share the same node as a CPU, while other
memory is provided in memory-only nodes. While memory-only nodes do not
provide CPUs, they may still be local to one or more compute nodes relative
to other nodes. The following diagram shows one such example of two compute
nodes with local memory and a memory-only node for each compute node::

 +------------------+     +------------------+
 | Compute Node 0   +-----+ Compute Node 1   |
 | Local Node0 Mem  |     | Local Node1 Mem  |
 +--------+---------+     +--------+---------+
          |                        |
 +--------+---------+     +--------+---------+
 | Slower Node2 Mem |     | Slower Node3 Mem |
 +------------------+     +------------------+

A "memory initiator" is a node containing one or more devices such as
CPUs or separate memory I/O devices that can initiate memory requests.
A "memory target" is a node containing one or more physical address
ranges accessible from one or more memory initiators.

When multiple memory initiators exist, they may not all have the same
performance when accessing a given memory target. Each initiator-target
pair may be organized into different ranked access classes to represent
this relationship. The highest performing initiator to a given target
is considered to be one of that target's local initiators and is given
the highest access class, 0. Any given target may have one or more
local initiators, and any given initiator may have multiple local
memory targets.

To help applications match memory targets with their initiators, the
kernel provides symlinks from each to the other. The following example
lists the relationship for the access class "0" memory initiators and
targets::

	# symlinks -v /sys/devices/system/node/nodeX/access0/targets/
	relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY

	# symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
	relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
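
The same relationship can be read programmatically. The following is a
minimal Python sketch, not part of any kernel interface, that lists the
node IDs linked under an access class directory. The helper name
``linked_nodes`` and its ``root`` parameter are hypothetical; ``root``
exists only so the function can be pointed at a test directory instead
of the real sysfs tree.

```python
import re
from pathlib import Path

# Sysfs location quoted in the text above.
NODE_ROOT = Path("/sys/devices/system/node")


def linked_nodes(node, kind, access_class=0, root=NODE_ROOT):
    """Return the sorted node IDs linked under nodeN/access<class>/<kind>/.

    `kind` is "targets" or "initiators". `root` is a hypothetical
    parameter added so the function can be used on a test directory.
    """
    if kind not in ("targets", "initiators"):
        raise ValueError("kind must be 'targets' or 'initiators'")
    directory = Path(root) / f"node{node}" / f"access{access_class}" / kind
    if not directory.is_dir():
        # Unknown node, or the platform does not export this access class.
        return []
    ids = []
    for entry in directory.iterdir():
        match = re.fullmatch(r"node(\d+)", entry.name)
        if match:  # ignore anything that is not a nodeN link
            ids.append(int(match.group(1)))
    return sorted(ids)


if __name__ == "__main__":
    # Walk from node0 to its class 0 targets and back to their initiators.
    for target in linked_nodes(0, "targets"):
        print(f"node0 -> node{target}, initiators:",
              linked_nodes(target, "initiators"))
```

An empty list means either that the node does not exist or that the
platform did not provide the information needed to build the links.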

A memory initiator may have multiple memory targets in the same access
class. The target memory's initiators in a given class indicate that the
nodes' access characteristics share the same performance relative to other
linked initiator nodes. Each target within an initiator's access class,
though, does not necessarily perform the same as the others.

The access class "1" is used to allow differentiation between initiators
that are CPUs and hence suitable for generic task scheduling, and
I/O initiators such as GPUs and NICs. Unlike access class 0, only
nodes containing CPUs are considered.
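
As an illustration of how the two classes might be used together, the
Python sketch below collects a target's initiators per access class and
prefers class 1 when choosing where to run a task. The function names
are hypothetical, and the fallback to class 0 when no ``access1``
directory exists is an assumption of this sketch, not documented kernel
behavior.

```python
import re
from pathlib import Path

NODE_ROOT = Path("/sys/devices/system/node")


def initiators_by_class(target, root=NODE_ROOT):
    """Map each exported access class of a target node to its initiator IDs."""
    node_dir = Path(root) / f"node{target}"
    result = {}
    if not node_dir.is_dir():
        return result
    for access_dir in node_dir.iterdir():
        class_match = re.fullmatch(r"access(\d+)", access_dir.name)
        if not class_match:
            continue
        initiators_dir = access_dir / "initiators"
        if not initiators_dir.is_dir():
            continue
        ids = []
        for entry in initiators_dir.iterdir():
            node_match = re.fullmatch(r"node(\d+)", entry.name)
            if node_match:
                ids.append(int(node_match.group(1)))
        result[int(class_match.group(1))] = sorted(ids)
    return result


def cpu_initiators(target, root=NODE_ROOT):
    """Return class 1 initiators if exported, else class 0 (an assumption)."""
    classes = initiators_by_class(target, root)
    return classes.get(1, classes.get(0, []))
```

For a target whose best initiator is a GPU node, ``initiators_by_class``
would list that node under class 0 and the best CPU node under class 1.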

NUMA Performance
================

Applications may wish to consider which node they want their memory to
be allocated from based on the node's performance characteristics. If
the system provides these attributes, the kernel exports them under the
node sysfs hierarchy by appending the attributes directory under the
memory node's access class 0 initiators as follows::

	/sys/devices/system/node/nodeY/access0/initiators/

These attributes apply only when the memory is accessed from nodes that
are linked under this access class's initiators.

The performance characteristics that the kernel provides for the local
initiators are exported as follows::

	# tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/
	/sys/devices/system/node/nodeY/access0/initiators/
	|-- read_bandwidth
	|-- read_latency
	|-- write_bandwidth
	`-- write_latency

The bandwidth attributes are provided in MiB/second.

The latency attributes are provided in nanoseconds.

The values reported here correspond to the rated latency and bandwidth
for the platform.

Access class 1 takes the same form but only includes values for CPU to
memory activity.
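
A minimal Python sketch of how software might read these attributes is
shown here. The helper names and the ``root`` parameter are
hypothetical, and the sketch assumes each attribute file holds a single
decimal integer. Because the values describe access from a target's
linked initiators, comparing targets this way is only meaningful for a
task running on a node that is an initiator of each candidate.

```python
from pathlib import Path

NODE_ROOT = Path("/sys/devices/system/node")
PERF_ATTRS = ("read_bandwidth", "read_latency",
              "write_bandwidth", "write_latency")


def read_perf(node, access_class=0, root=NODE_ROOT):
    """Return {attribute: int} for a target node; missing ones are omitted.

    Bandwidth values are in MiB/second and latency values in nanoseconds.
    """
    base = Path(root) / f"node{node}" / f"access{access_class}" / "initiators"
    perf = {}
    for name in PERF_ATTRS:
        try:
            perf[name] = int((base / name).read_text().strip())
        except (OSError, ValueError):
            continue  # attribute not exported, or not a plain integer
    return perf


def best_read_bandwidth(nodes, access_class=0, root=NODE_ROOT):
    """Return the node with the highest rated read bandwidth, or None."""
    rated = {}
    for node in nodes:
        perf = read_perf(node, access_class, root)
        if "read_bandwidth" in perf:
            rated[node] = perf["read_bandwidth"]
    return max(rated, key=rated.get) if rated else None
```

Passing ``access_class=1`` reads the CPU-only values instead, when the
platform exports them.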

NUMA Cache
==========

System memory may be constructed in a hierarchy of elements with various
performance characteristics in order to provide a large address space of
slower performing memory cached by a smaller, higher performing memory.
The system physical addresses that memory initiators are aware of are
provided by the last memory level in the hierarchy. The system meanwhile
uses higher performing memory to transparently cache access to
progressively slower levels.

The term "far memory" is used to denote the last level memory in the
hierarchy. Each increasing cache level provides higher performing
initiator access, and the term "near memory" represents the fastest
cache provided by the system.

This numbering is different from CPU caches, where the cache level (e.g.
L1, L2, L3) uses the CPU-side view in which each increased level is lower
performing. In contrast, the memory cache level is centric to the last
level memory, so the higher numbered cache level corresponds to memory
nearer to the CPU, and further from far memory.

The memory-side caches are not directly addressable by software. When
software accesses a system address, the system will return it from the
near memory cache if it is present. If it is not present, the system
accesses the next level of memory until there is either a hit in that
cache level, or it reaches far memory.
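
The lookup order described above can be illustrated with a toy Python
model. This is purely a sketch of the numbering convention: real
memory-side caches are managed by hardware and are invisible to
software, and the ``lookup`` function, the level numbers and the
addresses below are all made up for illustration.

```python
def lookup(address, cache_levels):
    """Return the memory-side cache level that serves `address`.

    `cache_levels` maps a cache level number to the set of addresses it
    currently holds. Higher numbers are nearer to the initiator, so they
    are searched first. None means the access reached far memory.
    """
    for level in sorted(cache_levels, reverse=True):
        if address in cache_levels[level]:
            return level
    return None


# Two hypothetical cache levels in front of far memory.
levels = {2: {0x1000}, 1: {0x1000, 0x2000}}
print(lookup(0x1000, levels))  # hit in near memory, the highest level
print(lookup(0x2000, levels))  # miss in level 2, hit in level 1
print(lookup(0x3000, levels))  # miss in every cache: served by far memory
```

Note that the search starts at the highest numbered level, which is the
opposite of a CPU cache lookup that starts at L1.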

An application does not need to know about caching attributes in order
to use the system. Software may optionally query the memory cache
attributes in order to maximize the performance out of such a setup.
If the system provides a way for the kernel to discover this information,
for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
the kernel will append these attributes to the NUMA node memory target.

When the kernel first registers a memory cache with a node, the kernel
will create the following directory::

	/sys/devices/system/node/nodeX/memory_side_cache/

If that directory is not present, the system either does not provide
a memory-side cache, or that information is not accessible to the kernel.

The attributes for each level of cache are provided under its cache
level index::

	/sys/devices/system/node/nodeX/memory_side_cache/indexA/
	/sys/devices/system/node/nodeX/memory_side_cache/indexB/
	/sys/devices/system/node/nodeX/memory_side_cache/indexC/

Each cache level's directory provides its attributes. For example, the
following shows a single cache level and the attributes available for
software to query::

	# tree /sys/devices/system/node/node0/memory_side_cache/
	/sys/devices/system/node/node0/memory_side_cache/
	|-- index1
	|   |-- indexing
	|   |-- line_size
	|   |-- size
	|   `-- write_policy

The "indexing" will be 0 if it is a direct-mapped cache, and non-zero
for any other index-based, multi-way associativity.

The "line_size" is the number of bytes accessed from the next cache
level on a miss.

The "size" is the number of bytes provided by this cache level.

The "write_policy" will be 0 for write-back, and non-zero for
write-through caching.
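
The following Python sketch shows one way software might query and
decode these attributes. The helper names and the ``root`` parameter
are hypothetical, and the sketch assumes each attribute file holds a
single decimal integer. The decoding follows only the encodings stated
above.

```python
import re
from pathlib import Path

NODE_ROOT = Path("/sys/devices/system/node")
CACHE_ATTRS = ("indexing", "line_size", "size", "write_policy")


def memory_side_caches(node, root=NODE_ROOT):
    """Return {cache level: {attribute: int}} for a node, or {} if none."""
    cache_dir = Path(root) / f"node{node}" / "memory_side_cache"
    if not cache_dir.is_dir():
        # No memory-side cache, or the kernel could not discover it.
        return {}
    caches = {}
    for index_dir in cache_dir.iterdir():
        match = re.fullmatch(r"index(\d+)", index_dir.name)
        if not match:
            continue
        attrs = {}
        for name in CACHE_ATTRS:
            try:
                attrs[name] = int((index_dir / name).read_text().strip())
            except (OSError, ValueError):
                continue  # attribute missing, or not a plain integer
        caches[int(match.group(1))] = attrs
    return caches


def describe(attrs):
    """Decode the attribute encodings described above into text."""
    parts = []
    if "indexing" in attrs:
        parts.append("direct-mapped" if attrs["indexing"] == 0
                     else "not direct-mapped")
    if "write_policy" in attrs:
        parts.append("write-back" if attrs["write_policy"] == 0
                     else "write-through")
    if "size" in attrs:
        parts.append(f"{attrs['size']} bytes")
    if "line_size" in attrs:
        parts.append(f"{attrs['line_size']}-byte lines")
    return ", ".join(parts)
```

Iterating over ``sorted(memory_side_caches(0).items(), reverse=True)``
would visit the levels from near memory toward far memory.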

See Also
========

[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
- Section 5.2.27