.. _numaperf:

=============
NUMA Locality
=============

Some platforms may have multiple types of memory attached to a compute
node. These disparate memory ranges may share some characteristics, such
as CPU cache coherence, but may have different performance. For example,
different media types and buses affect bandwidth and latency.

A system supports such heterogeneous memory by grouping each memory type
under different domains, or "nodes", based on locality and performance
characteristics. Some memory may share the same node as a CPU, and other
memory is provided as memory-only nodes. While memory-only nodes do not
provide CPUs, they may still be local to one or more compute nodes relative
to other nodes. The following diagram shows one such example of two compute
nodes with local memory and a memory-only node for each compute node::

 +------------------+     +------------------+
 | Compute Node 0   +-----+ Compute Node 1   |
 | Local Node0 Mem  |     | Local Node1 Mem  |
 +--------+---------+     +--------+---------+
          |                        |
 +--------+---------+     +--------+---------+
 | Slower Node2 Mem |     | Slower Node3 Mem |
 +------------------+     +------------------+

A "memory initiator" is a node containing one or more devices such as
CPUs or separate memory I/O devices that can initiate memory requests.
A "memory target" is a node containing one or more physical address
ranges accessible from one or more memory initiators.

When multiple memory initiators exist, they may not all have the same
performance when accessing a given memory target. Each initiator-target
pair may be organized into different ranked access classes to represent
this relationship. The highest performing initiator to a given target
is considered to be one of that target's local initiators, and is given
the highest access class, 0. Any given target may have one or more
local initiators, and any given initiator may have multiple local
memory targets.

To help applications match memory targets with their initiators, the
kernel provides symlinks from each to the other. The following example
lists the relationship for the access class "0" memory initiators and
targets::

	# symlinks -v /sys/devices/system/node/nodeX/access0/targets/
	relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY

	# symlinks -v /sys/devices/system/node/nodeY/access0/initiators/
	relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
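
The symlinks above can also be walked programmatically. The following is a
minimal Python sketch, not a kernel-provided interface: the helper name
``linked_nodes`` and its ``root`` parameter are hypothetical, and it assumes
the directory layout shown in the example above::

	import os
	import re

	def linked_nodes(node, kind, access="access0",
	                 root="/sys/devices/system/node"):
	    """Return the sorted node IDs linked under nodeN/<access>/<kind>/.

	    kind is "initiators" or "targets". An empty list is returned when
	    the directory does not exist.
	    """
	    path = os.path.join(root, f"node{node}", access, kind)
	    try:
	        entries = os.listdir(path)
	    except FileNotFoundError:
	        return []
	    ids = []
	    for name in entries:
	        # Keep only the nodeN links; skip any attribute files.
	        match = re.fullmatch(r"node(\d+)", name)
	        if match:
	            ids.append(int(match.group(1)))
	    return sorted(ids)

On a system that exposes access class 0, ``linked_nodes(0, "targets")``
would list the memory targets local to node 0, and
``linked_nodes(2, "initiators")`` would list the local initiators of node 2.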

A memory initiator may have multiple memory targets in the same access
class. The initiators linked under a target's access class indicate that
those nodes share the same access performance relative to the other
linked initiator nodes. The targets within an initiator's access class,
though, do not necessarily perform the same as each other.

================
NUMA Performance
================

Applications may wish to consider which node they want their memory to
be allocated from based on the node's performance characteristics. If
the system provides these attributes, the kernel exports them under the
node sysfs hierarchy by adding the attributes to the memory node's
access class 0 initiators directory as follows::

	/sys/devices/system/node/nodeY/access0/initiators/

These attributes apply only when the memory is accessed from nodes that
are linked under this access class's initiators.

The performance characteristics the kernel provides for the local
initiators are exported as follows::

	# tree -P "read*|write*" /sys/devices/system/node/nodeY/access0/initiators/
	/sys/devices/system/node/nodeY/access0/initiators/
	|-- read_bandwidth
	|-- read_latency
	|-- write_bandwidth
	`-- write_latency

The bandwidth attributes are provided in MiB/second.

The latency attributes are provided in nanoseconds.

The values reported here correspond to the rated latency and bandwidth
for the platform.
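
As an illustration, the attributes can be collected into a dictionary.
This is a minimal Python sketch under stated assumptions: the helper name
``read_perf`` and its ``root`` parameter are hypothetical, and each
attribute file is assumed to hold a single decimal integer::

	import os

	ATTRS = ("read_bandwidth", "read_latency",
	         "write_bandwidth", "write_latency")

	def read_perf(node, access="access0", root="/sys/devices/system/node"):
	    """Return {attribute: value} for the attributes that are present.

	    Bandwidth values are in MiB/second and latency values are in
	    nanoseconds, as described above.
	    """
	    base = os.path.join(root, f"node{node}", access, "initiators")
	    perf = {}
	    for attr in ATTRS:
	        try:
	            with open(os.path.join(base, attr)) as f:
	                perf[attr] = int(f.read().strip())
	        except (FileNotFoundError, ValueError):
	            # Attribute missing or not a plain integer: skip it.
	            continue
	    return perf

Given a list of candidate nodes ``nodes``, an expression such as
``max(nodes, key=lambda n: read_perf(n).get("read_bandwidth", 0))`` would
pick the node with the highest rated read bandwidth.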

==========
NUMA Cache
==========

System memory may be constructed in a hierarchy of elements with various
performance characteristics in order to provide a large address space of
slower performing memory cached by a smaller, higher performing memory. The
system physical addresses that memory initiators are aware of are provided
by the last memory level in the hierarchy. The system meanwhile uses
higher performing memory to transparently cache access to progressively
slower levels.

The term "far memory" is used to denote the last level memory in the
hierarchy. Each increasing cache level provides higher performing
initiator access, and the term "near memory" represents the fastest
cache provided by the system.

This numbering differs from that of CPU caches, where the cache level
(e.g. L1, L2, L3) uses the CPU-side view and each increased level is lower
performing. In contrast, the memory cache level is centric to the last
level memory, so the higher numbered cache level corresponds to memory
nearer to the CPU, and further from far memory.

The memory-side caches are not directly addressable by software. When
software accesses a system address, the system will return it from the
near memory cache if it is present. If it is not present, the system
accesses the next level of memory until there is either a hit in that
cache level, or it reaches far memory.
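
The lookup order described above can be modelled in a few lines. The
following Python sketch is purely illustrative: the hardware performs this
walk transparently, and the names ``lookup``, ``cache_levels`` and
``far_memory`` are hypothetical, with each level modelled as a dictionary
that maps an address to its data::

	def lookup(address, cache_levels, far_memory):
	    """Return (data, depth) for an address.

	    cache_levels is ordered from near memory to the level closest to
	    far memory. depth is the number of cache levels that missed before
	    the data was found.
	    """
	    for depth, level in enumerate(cache_levels):
	        if address in level:
	            return level[address], depth
	    # Every cache level missed, so far memory supplies the data.
	    return far_memory[address], len(cache_levels)

A hit in near memory returns a depth of 0, while an address held only in
far memory returns a depth equal to the number of cache levels.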

An application does not need to know about caching attributes in order
to use the system. Software may optionally query the memory cache
attributes in order to maximize the performance out of such a setup.
If the system provides a way for the kernel to discover this information,
for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
the kernel will append these attributes to the NUMA node memory target.

When the kernel first registers a memory cache with a node, the kernel
will create the following directory::

	/sys/devices/system/node/nodeX/memory_side_cache/

If that directory is not present, the system either does not provide
a memory-side cache, or that information is not accessible to the kernel.

The attributes for each level of cache are provided under its cache
level index::

	/sys/devices/system/node/nodeX/memory_side_cache/indexA/
	/sys/devices/system/node/nodeX/memory_side_cache/indexB/
	/sys/devices/system/node/nodeX/memory_side_cache/indexC/

Each cache level's directory provides its attributes. For example, the
following shows a single cache level and the attributes available for
software to query::

	# tree /sys/devices/system/node/node0/memory_side_cache/
	/sys/devices/system/node/node0/memory_side_cache/
	|-- index1
	|   |-- indexing
	|   |-- line_size
	|   |-- size
	|   `-- write_policy

The "indexing" will be 0 if it is a direct-mapped cache, and non-zero
for any other index-based, multi-way associativity.

The "line_size" is the number of bytes accessed from the next cache
level on a miss.

The "size" is the number of bytes provided by this cache level.

The "write_policy" will be 0 for write-back, and non-zero for
write-through caching.
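
The attribute encodings above can be decoded as follows. This is a minimal
Python sketch under stated assumptions: the helper name
``read_memory_side_caches`` and its ``root`` parameter are hypothetical,
each attribute file is assumed to hold a single decimal integer, and all
four attributes are assumed to be present in every ``indexN`` directory::

	import os
	import re

	def read_memory_side_caches(node, root="/sys/devices/system/node"):
	    """Return {level: attributes} for each indexN directory of a node."""
	    base = os.path.join(root, f"node{node}", "memory_side_cache")
	    try:
	        entries = os.listdir(base)
	    except FileNotFoundError:
	        # No memory-side cache, or the kernel cannot see it.
	        return {}
	    caches = {}
	    for name in entries:
	        match = re.fullmatch(r"index(\d+)", name)
	        if not match:
	            continue
	        raw = {}
	        for attr in ("indexing", "line_size", "size", "write_policy"):
	            with open(os.path.join(base, name, attr)) as f:
	                raw[attr] = int(f.read().strip())
	        caches[int(match.group(1))] = {
	            "direct_mapped": raw["indexing"] == 0,   # 0: direct-mapped
	            "line_size": raw["line_size"],           # bytes
	            "size": raw["size"],                     # bytes
	            "write_back": raw["write_policy"] == 0,  # 0: write-back
	        }
	    return caches

Because higher numbered levels are nearer to the CPU, ``max(caches)`` gives
the near memory level whenever the returned dictionary is not empty.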

========
See Also
========

[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
- Section 5.2.27