NUMA mechanics for sPAPR (pseries machines)
============================================

NUMA in sPAPR works differently from the System Locality Distance
Information Table (SLIT) in ACPI. The logic is explained in the LOPAPR
1.1 chapter 15, "Non Uniform Memory Access (NUMA) Option". This
document aims to complement that specification, providing details
of the elements that impact how QEMU views NUMA in pseries.

Associativity and ibm,associativity property
--------------------------------------------

Associativity is defined as a group of platform resources that have
similar mean performance (or, in our context here, distance) relative to
everyone else outside of the group.

The format of the ibm,associativity property varies with the value of
bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with
bit 0 equal to zero is deprecated. The current format, with bit 0 set
to one, makes the ibm,associativity property represent the physical
hierarchy of the platform, as one or more lists that start with the
highest level grouping and go down to the smallest. Consider the
following topology:

::

    Mem M1 ---- Proc P1    |
    -----------------      | Socket S1  ---|
          chip C1          |               |
                                           | HW module 1 (MOD1)
    Mem M2 ---- Proc P2    |               |
    -----------------      | Socket S2  ---|
          chip C2          |

The ibm,associativity property for the processors would be:

* P1: {MOD1, S1, C1, P1}
* P2: {MOD1, S2, C2, P2}

Each allocable resource has an ibm,associativity property. The LOPAPR
specification allows multiple lists to be present in this property,
considering that the same resource can have multiple connections to the
platform.

Relative Performance Distance and ibm,associativity-reference-points
--------------------------------------------------------------------

The ibm,associativity-reference-points property is an array that is used
to define the relevant performance/distance related boundaries, defining
the NUMA levels for the platform.

The definition of its elements also varies with the value of bit 0 of byte 5
of the ibm,architecture-vec-5 property. The format with bit 0 equal to zero
is also deprecated. With the current format, each integer of the
ibm,associativity-reference-points represents a 1-based ordinal index (i.e.
the first element is 1) of the ibm,associativity array. The first
boundary is the most significant to application performance, followed by
less significant boundaries. Allocated resources that belong to the
same performance boundaries are expected to have relative NUMA distance
that matches the relevancy of the boundary itself. Resources that belong
to the same first boundary will have the shortest distance from each
other. Subsequent boundaries represent greater distances and degraded
performance.

Using the previous example, the following reference points setting defines
three NUMA levels:

* ibm,associativity-reference-points = {0x3, 0x2, 0x1}

The first NUMA level (0x3) is interpreted as the third element of each
ibm,associativity array, the second level is the second element and
the third level is the first element. Let's also consider that elements
belonging to the first NUMA level have distance equal to 10 from each
other, and each NUMA level doubles the distance from the previous. This
means that the second would be 20 and the third level 40. For the P1 and
P2 processors, we would have the following NUMA levels:

::

  * ibm,associativity-reference-points = {0x3, 0x2, 0x1}

  * P1: associativity{MOD1, S1, C1, P1}

  First NUMA level (0x3) => associativity[2] = C1
  Second NUMA level (0x2) => associativity[1] = S1
  Third NUMA level (0x1) => associativity[0] = MOD1

  * P2: associativity{MOD1, S2, C2, P2}

  First NUMA level (0x3) => associativity[2] = C2
  Second NUMA level (0x2) => associativity[1] = S2
  Third NUMA level (0x1) => associativity[0] = MOD1

  P1 and P2 have the same third NUMA level, MOD1: Distance between them = 40

Changing the ibm,associativity-reference-points array changes the performance
distance attributes for the same associativity arrays, as the following
example illustrates:

::

  * ibm,associativity-reference-points = {0x2}

  * P1: associativity{MOD1, S1, C1, P1}

  First NUMA level (0x2) => associativity[1] = S1

  * P2: associativity{MOD1, S2, C2, P2}

  First NUMA level (0x2) => associativity[1] = S2

  P1 and P2 do not have a common performance boundary. Since this is a one level
  NUMA configuration, distance between them is one boundary above the first
  level, 20.


In a hypothetical platform where all resources inside the same hardware module
are considered to be on the same performance boundary:

::

  * ibm,associativity-reference-points = {0x1}

  * P1: associativity{MOD1, S1, C1, P1}

  First NUMA level (0x1) => associativity[0] = MOD1

  * P2: associativity{MOD1, S2, C2, P2}

  First NUMA level (0x1) => associativity[0] = MOD1

  P1 and P2 belong to the same first order boundary. The distance between them
  is 10.


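The three examples above follow the same procedure, which can be sketched in
Python. This is only an illustration under the assumptions already made in the
text (a first-level distance of 10, doubled for every reference point that does
not match). The name numa_distance and the string labels are made up for this
sketch and are not part of LOPAPR or QEMU.

```python
LOCAL_DISTANCE = 10  # assumed distance inside the first NUMA level


def numa_distance(assoc_a, assoc_b, reference_points):
    """Distance between two resources, doubling per unmatched NUMA level.

    reference_points holds 1-based indexes into the associativity arrays,
    ordered from the most significant boundary to the least significant.
    """
    distance = LOCAL_DISTANCE
    for point in reference_points:
        # Reference points are 1-based, Python lists are 0-based.
        if assoc_a[point - 1] == assoc_b[point - 1]:
            break
        distance *= 2
    return distance


p1 = ["MOD1", "S1", "C1", "P1"]
p2 = ["MOD1", "S2", "C2", "P2"]

print(numa_distance(p1, p2, [0x3, 0x2, 0x1]))  # 40
print(numa_distance(p1, p2, [0x2]))            # 20
print(numa_distance(p1, p2, [0x1]))            # 10
```

The same associativity arrays give three different distances, depending only
on which reference points are in use.
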
How the pseries Linux guest calculates NUMA distances
=====================================================

Another key difference between ACPI SLIT and the LOPAPR regarding NUMA is
how the distances are expressed. The SLIT table provides the NUMA distance
value between the relevant resources. LOPAPR does not provide a standard
way to calculate it. We have the ibm,associativity for each resource, which
provides a common-performance hierarchy, and the ibm,associativity-reference-points
array that tells which levels of associativity are considered relevant.

The result is that each OS is free to implement and to interpret the distance
as it sees fit. For the pseries Linux guest, each level of NUMA doubles
the distance of the previous level, and the maximum number of levels is
limited to MAX_DISTANCE_REF_POINTS = 4 (from arch/powerpc/mm/numa.c in the
kernel tree). This results in the following distances:

* both resources in the first NUMA level: 10
* resources one NUMA level apart: 20
* resources two NUMA levels apart: 40
* resources three NUMA levels apart: 80
* resources four NUMA levels apart: 160


pseries NUMA mechanics
======================

Starting in QEMU 5.2, the pseries machine considers user input when setting the
NUMA topology of the guest. The overall design is:

* ibm,associativity-reference-points is set to {0x4, 0x3, 0x2, 0x1}, allowing
  for 4 distinct NUMA distance values based on the NUMA levels

* ibm,max-associativity-domains supports multiple associativity domains in all
  NUMA levels, granting user flexibility

* ibm,associativity for all resources varies with user input

These changes are only effective for pseries-5.2 and newer machines that are
created with more than one NUMA node (not counting NUMA nodes created by
the machine itself, e.g. NVLink 2 GPUs). The now legacy support has been
around for a long time, with users seeing NUMA distances 10 and 40
(and 80 if using NVLink2 GPUs), and there is no need to disrupt the
existing experience of those guests.

To bring to pseries the experience x86 users have when tuning NUMA, we had
to operate under the current pseries Linux kernel logic described in
`How the pseries Linux guest calculates NUMA distances`_. The result
is that we needed to translate NUMA distance user input to pseries
Linux kernel input.

Translating user distance to kernel distance
--------------------------------------------

User input for NUMA distance can vary from 10 to 254. We need to translate
that to the values that the Linux kernel operates on (10, 20, 40, 80, 160).
This is how it is being done:

* user distance 11 to 30 will be interpreted as 20
* user distance 31 to 60 will be interpreted as 40
* user distance 61 to 120 will be interpreted as 80
* user distance 121 and beyond will be interpreted as 160
* user distance 10 stays 10

The reasoning behind this approximation is to avoid rounding any other value
to the local distance (10), keeping it exclusive to the 4th NUMA level (which
is still exclusive to the node_id). All other ranges were chosen at the
developer's discretion of what would be (somewhat) sensible considering the
user input. Any other strategy can be used here, but in the end the reality is
that we'll have to accept that a wide range of values will be translated to the
same NUMA topology in the guest, e.g. this user input:

::

      0   1   2
  0  10  31 120
  1  31  10  30
  2 120  30  10

And this other user input:

::

      0   1   2
  0  10  60  61
  1  60  10  11
  2  61  11  10

will both be translated to the same values internally:

::

      0   1   2
  0  10  40  80
  1  40  10  20
  2  80  20  10

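The rounding rules listed above can be written down as a short Python sketch.
It is not the QEMU implementation; the function names kernel_distance and
translate are hypothetical, and the input is assumed to already be in the valid
10 to 254 range.

```python
def kernel_distance(user_distance):
    """Round a user-supplied NUMA distance to a value the guest kernel uses.

    Assumes 10 <= user_distance <= 254, as stated in the text.
    """
    if user_distance == 10:
        return 10
    if user_distance <= 30:
        return 20
    if user_distance <= 60:
        return 40
    if user_distance <= 120:
        return 80
    return 160


def translate(matrix):
    """Apply kernel_distance() to every cell of a distance matrix."""
    return [[kernel_distance(d) for d in row] for row in matrix]


first = [[10, 31, 120], [31, 10, 30], [120, 30, 10]]
second = [[10, 60, 61], [60, 10, 11], [61, 11, 10]]

print(translate(first))   # [[10, 40, 80], [40, 10, 20], [80, 20, 10]]
print(translate(first) == translate(second))  # True
```

Both user matrices collapse to the same kernel values, which is the effect
described above.
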
Users are encouraged to use only the kernel values in the NUMA definition to
avoid being taken by surprise with what the guest is actually seeing in the
topology. There are enough potential surprises that are inherent to the
associativity domain assignment process, discussed below.


How associativity domains are assigned
--------------------------------------

LOPAPR allows more than one associativity array (or 'string') per allocated
resource. This would be used to represent that the resource has multiple
connections with the board, and then the operating system, when deciding
NUMA distancing, should consider the associativity information that provides
the shortest distance.

The spapr implementation does not support multiple associativity arrays per
resource, nor does the pseries Linux kernel. We'll have to represent the
NUMA topology using one associativity per resource, which means that choices
and compromises are going to be made.

Consider the following NUMA topology entered by user input:

::

      0   1   2   3
  0  10  40  20  40
  1  40  10  80  40
  2  20  80  10  20
  3  40  40  20  10

All the associativity arrays are initialized with NUMA id in all associativity
domains:

* node 0: 0 0 0 0
* node 1: 1 1 1 1
* node 2: 2 2 2 2
* node 3: 3 3 3 3


Honoring just the relative distances of node 0 to every other node, we find the
NUMA level matches (considering the reference points {0x4, 0x3, 0x2, 0x1}) for
each distance:

* distance from 0 to 1 is 40 (no match at 0x4 and 0x3, will match
  at 0x2)
* distance from 0 to 2 is 20 (no match at 0x4, will match at 0x3)
* distance from 0 to 3 is 40 (no match at 0x4 and 0x3, will match
  at 0x2)

We'll copy the associativity domains of node 0 to all other nodes, based on
the NUMA level matches. Between 0 and 1 the match is at 0x2, so we copy
the domains 0x2 and 0x1 from 0 to 1. This will give us:

* node 0: 0 0 0 0
* node 1: 0 0 1 1

Doing the same to node 2 and node 3, these are the associativity arrays
after considering all matches with node 0:

* node 0: 0 0 0 0
* node 1: 0 0 1 1
* node 2: 0 0 0 2
* node 3: 0 0 3 3

The distances related to node 0 are accounted for. For node 1, and keeping
in mind that we don't need to revisit node 0 again, the distance from
node 1 to 2 is 80, matching at 0x1, and the distance from 1 to 3 is 40,
matching at 0x2. Repeating the same logic of copying all domains up to
the NUMA level match:

* node 0: 0 0 0 0
* node 1: 1 0 1 1
* node 2: 1 0 0 2
* node 3: 1 0 3 3

In the last step we will analyze just nodes 2 and 3. The desired distance
between 2 and 3 is 20, i.e. a match at 0x3:

* node 0: 0 0 0 0
* node 1: 1 0 1 1
* node 2: 1 0 0 2
* node 3: 1 0 0 3


The kernel will read these arrays and will calculate the following NUMA topology for
the guest:

::

      0   1   2   3
  0  10  40  20  20
  1  40  10  40  40
  2  20  40  10  20
  3  20  40  20  10

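The walk-through above can be approximated with the following Python sketch.
It assumes that domains are always copied from the lower node id to the higher
one, from the matching level up to 0x1, and that the guest doubles 10 for every
reference point that does not match. The names assign_domains and guest_matrix
are hypothetical. The arrays this sketch produces need not match the listings
above value for value; what it is meant to reproduce is the resulting distance
matrix.

```python
REFERENCE_POINTS = [4, 3, 2, 1]             # 1-based, most significant first
LEVEL_FOR_DISTANCE = {20: 3, 40: 2, 80: 1}  # kernel distance -> matching domain


def assign_domains(distances):
    """Derive one associativity array per node from a symmetric matrix."""
    n = len(distances)
    # Every domain of node i starts out as i.
    assoc = [[node] * 4 for node in range(n)]
    for src in range(n):
        for dst in range(src + 1, n):
            level = LEVEL_FOR_DISTANCE.get(distances[src][dst])
            if level is None:
                # 160 (no common domain) or an unexpected value: skip.
                continue
            # Copy src's domains from the matching level up to 0x1.
            for point in range(level, 0, -1):
                assoc[dst][point - 1] = assoc[src][point - 1]
    return assoc


def guest_matrix(assoc):
    """Distances a guest would compute by doubling 10 per unmatched level."""
    def distance(a, b):
        d = 10
        for point in REFERENCE_POINTS:
            if a[point - 1] == b[point - 1]:
                break
            d *= 2
        return d
    return [[distance(a, b) for b in assoc] for a in assoc]


user = [[10, 40, 20, 40],
        [40, 10, 80, 40],
        [20, 80, 10, 20],
        [40, 40, 20, 10]]

for row in guest_matrix(assign_domains(user)):
    print(row)

# The final arrays listed in the text lead to the same matrix.
listed = [[0, 0, 0, 0], [1, 0, 1, 1], [1, 0, 0, 2], [1, 0, 0, 3]]
print(guest_matrix(listed) == guest_matrix(assign_domains(user)))  # True
```

In both cases the distance between nodes 0 and 3 comes out as 20 and the
distance between nodes 1 and 2 as 40, which is the approximation discussed
next.
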
Note that this is not what the user wanted - the desired distance between
0 and 3 is 40, we calculated it as 20. This is what the current logic and
implementation constraints of the kernel and QEMU will provide inside the
LOPAPR specification.

Users are welcome to use this knowledge and experiment with the input to get
the NUMA topology they want, or as close as they want. The important thing
is to keep expectations up to par with what we are capable of providing at this
moment: an approximation.

Limitations of the implementation
---------------------------------

As mentioned above, the pSeries NUMA distance logic is, in fact, a way to approximate
user choice. The Linux kernel, and PAPR itself, do not provide QEMU with the means
to fully map user input to the actual NUMA distance the guest will use. These
constraints create two notable limitations in our support:

* Asymmetrical topologies aren't supported. We only support NUMA topologies where
  the distance from node A to B is always the same as B to A. We do not support
  any A-B pair where the distance back and forth is asymmetric. For example, the
  following topology isn't supported and the pSeries guest will not boot with this
  user input:

::

      0   1
  0  10  40
  1  20  10


* 'non-transitive' topologies will be poorly translated to the guest. This is the
  kind of topology where the distance from a node A to B is X, B to C is X, but
  the distance A to C is not X. E.g.:

::

      0   1   2   3
  0  10  20  20  40
  1  20  10  80  40
  2  20  80  10  20
  3  40  40  20  10

  In the example above, distance 0 to 2 is 20, 2 to 3 is 20, but 0 to 3 is 40.
  The kernel will always match with the shortest associativity domain possible,
  and we're attempting to retain the previously established relations between the
  nodes. This means that a distance equal to 20 between nodes 0 and 2 and the
  same distance 20 between nodes 2 and 3 will cause the distance between 0 and 3
  to also be 20.


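For the first limitation, a distance matrix can be checked for asymmetric
pairs before handing it to QEMU. The helper below is a hypothetical
pre-flight check written for this document, not the validation QEMU itself
performs.

```python
def asymmetric_pairs(distances):
    """Return the (a, b) node pairs whose two directions disagree."""
    n = len(distances)
    return [
        (a, b)
        for a in range(n)
        for b in range(a + 1, n)
        if distances[a][b] != distances[b][a]
    ]


unsupported = [[10, 40],
               [20, 10]]
supported = [[10, 40],
             [40, 10]]

print(asymmetric_pairs(unsupported))  # [(0, 1)]
print(asymmetric_pairs(supported))    # []
```

An empty result means every A-B pair has the same distance in both
directions.
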
Legacy (5.1 and older) pseries NUMA mechanics
=============================================

In short, we can summarize the NUMA distances seen in pseries Linux guests, using
QEMU up to 5.1, as follows:

* local distance, i.e. the distance of the resource to its own NUMA node: 10
* if it's an NVLink GPU device, distance: 80
* every other resource, distance: 40

The way the pseries Linux guest calculates NUMA distances has a direct effect
on what QEMU users can expect when doing NUMA tuning. As of QEMU 5.1, this is
the default ibm,associativity-reference-points being used in the pseries
machine:

ibm,associativity-reference-points = {0x4, 0x4, 0x2}

The first and second level are equal, 0x4, and a third one was added in
commit a6030d7e0b35 exclusively for NVLink GPUs support. This means that
regardless of how the ibm,associativity properties are being created in
the device tree, the pseries Linux guest will only recognize three scenarios
as far as NUMA distance goes:

* if the resources belong to the same first NUMA level, distance = 10
* the second level is skipped since it's equal to the first
* all resources that aren't an NVLink GPU are guaranteed to belong
  to the same third NUMA level, having distance = 40
* for NVLink GPUs, distance = 80 from everything else

This also means that user input in QEMU command line does not change the
NUMA distancing inside the guest for the pseries machine.

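The effect of the repeated 0x4 reference point can be seen in a small Python
sketch. It assumes the guest doubles 10 for every reference point that does
not match, including the duplicate. The associativity arrays below are
invented for illustration and are not the values QEMU writes in the device
tree.

```python
LEGACY_REFERENCE_POINTS = [0x4, 0x4, 0x2]


def legacy_distance(assoc_a, assoc_b):
    """Double 10 once per reference point that does not match."""
    distance = 10
    for point in LEGACY_REFERENCE_POINTS:
        if assoc_a[point - 1] == assoc_b[point - 1]:
            break
        distance *= 2
    return distance


# Hypothetical arrays: regular nodes share domain 0x2, the GPU shares nothing.
node0 = [0, 0, 0, 0]
node1 = [0, 0, 0, 1]
gpu = [9, 9, 9, 9]

print(legacy_distance(node0, node0))  # 10
print(legacy_distance(node0, node1))  # 40
print(legacy_distance(node0, gpu))    # 80
```

Because 0x4 is listed twice, a mismatch there doubles the distance twice, so
20 never shows up and the only possible values are 10, 40 and 80.
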