NUMA mechanics for sPAPR (pseries machines)
============================================

NUMA in sPAPR works differently from the System Locality Distance
Information Table (SLIT) in ACPI. The logic is explained in LOPAPR 1.1,
chapter 15, "Non Uniform Memory Access (NUMA) Option". This document
complements that specification, detailing the elements that affect how
QEMU views NUMA in pseries.

Associativity and ibm,associativity property
--------------------------------------------

Associativity is defined as a group of platform resources that have
similar mean performance (or, in our context, distance) relative to
everything outside of the group.

The format of the ibm,associativity property varies with the value of
bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with
bit 0 equal to zero is deprecated. In the current format, with bit 0
set to one, the ibm,associativity property represents the physical
hierarchy of the platform as one or more lists that start with the
highest level grouping and go down to the smallest.
Considering the following topology:

::

 Mem M1 ---- Proc P1    |
 -----------------      | Socket S1  ---|
       chip C1          |               |
                                        | HW module 1 (MOD1)
 Mem M2 ---- Proc P2    |               |
 -----------------      | Socket S2  ---|
       chip C2          |

The ibm,associativity property for the processors would be:

* P1: {MOD1, S1, C1, P1}
* P2: {MOD1, S2, C2, P2}

Each allocable resource has an ibm,associativity property. The LOPAPR
specification allows multiple lists to be present in this property,
considering that the same resource can have multiple connections to the
platform.

Relative Performance Distance and ibm,associativity-reference-points
--------------------------------------------------------------------

The ibm,associativity-reference-points property is an array that is used
to define the relevant performance/distance related boundaries, defining
the NUMA levels for the platform.

The definition of its elements also varies with the value of bit 0 of byte 5
of the ibm,architecture-vec-5 property. The format with bit 0 equal to zero
is also deprecated. With the current format, each integer of the
ibm,associativity-reference-points array is a 1-based ordinal index (i.e.
the first element is 1) into the ibm,associativity array. The first
boundary is the most significant to application performance, followed by
less significant boundaries. Allocated resources that belong to the
same performance boundary are expected to have a relative NUMA distance
that matches the relevance of the boundary itself. Resources that belong
to the same first boundary will have the shortest distance from each
other. Subsequent boundaries represent greater distances and degraded
performance.

Using the previous example, the following reference points setting defines
three NUMA levels:

* ibm,associativity-reference-points = {0x3, 0x2, 0x1}

The first NUMA level (0x3) is interpreted as the third element of each
ibm,associativity array, the second level is the second element and
the third level is the first element.
Let's also consider that elements
belonging to the first NUMA level have distance equal to 10 from each
other, and each NUMA level doubles the distance from the previous. This
means that the second would be 20 and the third level 40. For the P1 and
P2 processors, we would have the following NUMA levels:

::

  * ibm,associativity-reference-points = {0x3, 0x2, 0x1}

  * P1: associativity{MOD1, S1, C1, P1}

    First NUMA level (0x3) => associativity[2] = C1
    Second NUMA level (0x2) => associativity[1] = S1
    Third NUMA level (0x1) => associativity[0] = MOD1

  * P2: associativity{MOD1, S2, C2, P2}

    First NUMA level (0x3) => associativity[2] = C2
    Second NUMA level (0x2) => associativity[1] = S2
    Third NUMA level (0x1) => associativity[0] = MOD1

  P1 and P2 have the same third NUMA level, MOD1: Distance between them = 40

Changing the ibm,associativity-reference-points array changes the performance
distance attributes for the same associativity arrays, as the following
example illustrates:

::

  * ibm,associativity-reference-points = {0x2}

  * P1: associativity{MOD1, S1, C1, P1}

    First NUMA level (0x2) => associativity[1] = S1

  * P2: associativity{MOD1, S2, C2, P2}

    First NUMA level (0x2) => associativity[1] = S2

  P1 and P2 do not have a common performance boundary. Since this is a one level
  NUMA configuration, distance between them is one boundary above the first
  level, 20.


In a hypothetical platform where all resources inside the same hardware module
are considered to be on the same performance boundary:

::

  * ibm,associativity-reference-points = {0x1}

  * P1: associativity{MOD1, S1, C1, P1}

    First NUMA level (0x1) => associativity[0] = MOD1

  * P2: associativity{MOD1, S2, C2, P2}

    First NUMA level (0x1) => associativity[0] = MOD1

  P1 and P2 belong to the same first order boundary. The distance between them
  is 10.
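
The lookups above can be condensed into a few lines. The following Python
sketch is illustrative only: the ``numa_distance`` helper and the string labels
are hypothetical names, and the doubling rule is the assumption made in this
section rather than something LOPAPR mandates.

.. code-block:: python

    def numa_distance(assoc_a, assoc_b, ref_points, local_distance=10):
        """Return the distance between two resources.

        ref_points holds 1-based indexes into the associativity arrays,
        most significant boundary first. The distance starts at the local
        value and doubles for every boundary the resources do not share.
        """
        distance = local_distance
        for ref in ref_points:
            if assoc_a[ref - 1] == assoc_b[ref - 1]:
                break  # first shared boundary found, stop doubling
            distance *= 2
        return distance


    # Associativity arrays of the two processors in the example topology.
    P1 = ["MOD1", "S1", "C1", "P1"]
    P2 = ["MOD1", "S2", "C2", "P2"]

    print(numa_distance(P1, P2, [3, 2, 1]))  # 40
    print(numa_distance(P1, P2, [2]))        # 20
    print(numa_distance(P1, P2, [1]))        # 10

The same associativity arrays give three different distances, which is the
point of the examples: the reference points, not the arrays alone, decide how
far apart two resources appear.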


How the pseries Linux guest calculates NUMA distances
=====================================================

Another key difference between ACPI SLIT and LOPAPR regarding NUMA is
how the distances are expressed. The SLIT table provides the NUMA distance
value between the relevant resources. LOPAPR does not provide a standard
way to calculate it. We have the ibm,associativity for each resource, which
provides a common-performance hierarchy, and the ibm,associativity-reference-points
array, which tells which levels of associativity are considered relevant.

The result is that each OS is free to implement and to interpret the distance
as it sees fit. For the pseries Linux guest, each level of NUMA doubles
the distance of the previous level, and the maximum number of levels is
limited to MAX_DISTANCE_REF_POINTS = 4 (from arch/powerpc/mm/numa.c in the
kernel tree).
This results in the following distances:

* both resources in the first NUMA level: 10
* resources one NUMA level apart: 20
* resources two NUMA levels apart: 40
* resources three NUMA levels apart: 80
* resources four NUMA levels apart: 160


pseries NUMA mechanics
======================

Starting in QEMU 5.2, the pseries machine considers user input when setting the
NUMA topology of the guest. The overall design is:

* ibm,associativity-reference-points is set to {0x4, 0x3, 0x2, 0x1}, allowing
  for 4 distinct NUMA distance values based on the NUMA levels

* ibm,max-associativity-domains supports multiple associativity domains in all
  NUMA levels, granting user flexibility

* ibm,associativity for all resources varies with user input

These changes are only effective for pseries-5.2 and newer machines that are
created with more than one NUMA node (not counting NUMA nodes created by
the machine itself, e.g. NVLink 2 GPUs).
The now legacy support has been
around for a long time, with users seeing NUMA distances 10 and 40
(and 80 if using NVLink2 GPUs), and there is no need to disrupt the
existing experience of those guests.

To bring pseries users the experience x86 users have when tuning NUMA, we had
to operate within the current pseries Linux kernel logic described in
`How the pseries Linux guest calculates NUMA distances`_. The result
is that we needed to translate NUMA distance user input to pseries
Linux kernel input.

Translating user distance to kernel distance
--------------------------------------------

User input for NUMA distance can vary from 10 to 254. We need to translate
that to the values that the Linux kernel operates on (10, 20, 40, 80, 160).
This is how it is being done:

* user distance 11 to 30 will be interpreted as 20
* user distance 31 to 60 will be interpreted as 40
* user distance 61 to 120 will be interpreted as 80
* user distance 121 and beyond will be interpreted as 160
* user distance 10 stays 10

The reasoning behind this approximation is to avoid rounding any value to the
local distance (10), keeping it exclusive to the 4th NUMA level (which is still
exclusive to the node_id). All other ranges were chosen at the developer's
discretion as what would be (somewhat) sensible considering the user input.
Any other strategy can be used here, but in the end the reality is that we'll
have to accept that a wide range of values will be translated to the same
NUMA topology in the guest, e.g.
this user input:

::

      0   1   2
  0  10  31 120
  1  31  10  30
  2 120  30  10

And this other user input:

::

      0   1   2
  0  10  60  61
  1  60  10  11
  2  61  11  10

Will both be translated to the same values internally:

::

      0   1   2
  0  10  40  80
  1  40  10  20
  2  80  20  10

Users are encouraged to use only the kernel values in the NUMA definition to
avoid being surprised by what the guest is actually seeing in the
topology. There are enough potential surprises inherent to the
associativity domain assignment process, discussed below.


How associativity domains are assigned
--------------------------------------

LOPAPR allows more than one associativity array (or 'string') per allocated
resource.
This would be used to represent that the resource has multiple
connections with the board, and then the operating system, when deciding
NUMA distancing, should consider the associativity information that provides
the shortest distance.

The spapr implementation does not support multiple associativity arrays per
resource, and neither does the pseries Linux kernel. We'll have to represent the
NUMA topology using one associativity array per resource, which means that
choices and compromises are going to be made.

Consider the following NUMA topology entered by user input:

::

      0   1   2   3
  0  10  40  20  40
  1  40  10  80  40
  2  20  80  10  20
  3  40  40  20  10

All the associativity arrays are initialized with the NUMA id in all
associativity domains:

* node 0: 0 0 0 0
* node 1: 1 1 1 1
* node 2: 2 2 2 2
* node 3: 3 3 3 3


Honoring just the relative distances of node 0 to every other node, we find the
NUMA level matches (considering the reference points {0x4, 0x3,
0x2, 0x1}) for each distance:

* distance from 0 to 1 is 40 (no match at 0x4 and 0x3, will match
  at 0x2)
* distance from 0 to 2 is 20 (no match at 0x4, will match at 0x3)
* distance from 0 to 3 is 40 (no match at 0x4 and 0x3, will match
  at 0x2)

We'll copy the associativity domains of node 0 to all other nodes, based on
the NUMA level matches. Between 0 and 1, which match at 0x2, we'll copy
the domains 0x2 and 0x1 from 0 to 1. This will give us:

* node 0: 0 0 0 0
* node 1: 0 0 1 1

Doing the same to node 2 and node 3, these are the associativity arrays
after considering all matches with node 0:

* node 0: 0 0 0 0
* node 1: 0 0 1 1
* node 2: 0 0 0 2
* node 3: 0 0 3 3

The distances related to node 0 are accounted for. For node 1, and keeping
in mind that we don't need to revisit node 0, the distance from
node 1 to 2 is 80, matching at 0x1, and the distance from 1 to 3 is 40,
matching at 0x2.
Repeating the same logic of copying all domains up to
the NUMA level match:

* node 0: 0 0 0 0
* node 1: 1 0 1 1
* node 2: 1 0 0 2
* node 3: 1 0 3 3

In the last step we will analyze just nodes 2 and 3. The desired distance
between 2 and 3 is 20, i.e. a match in 0x3:

* node 0: 0 0 0 0
* node 1: 1 0 1 1
* node 2: 1 0 0 2
* node 3: 1 0 0 3


The kernel will read these arrays and will calculate the following NUMA
topology for the guest:

::

      0   1   2   3
  0  10  40  20  20
  1  40  10  40  40
  2  20  40  10  20
  3  20  40  20  10

Note that this is not what the user wanted: the desired distance between
0 and 3 is 40, but we calculated it as 20. Likewise, the desired distance
between 1 and 2 is 80, but the guest sees 40. This is what the current logic
and implementation constraints of the kernel and QEMU will provide inside the
LOPAPR specification.
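
The walkthrough above can be approximated by the following Python sketch. It is
a simplified model, not QEMU's actual code: the helper names (``numa_level``,
``assign_domains``, ``guest_distance``) are hypothetical, the level thresholds
are taken from the translation ranges listed earlier, and the domain values it
produces may differ from the listings above in positions that do not affect
where two nodes first match. Fed with the user topology from this section, it
arrives at the same distance matrix shown for the guest.

.. code-block:: python

    REF_POINTS = (4, 3, 2, 1)  # ibm,associativity-reference-points


    def numa_level(distance):
        """Reference point at which two nodes with this distance should match."""
        if distance == 10:
            return 4
        if distance <= 30:
            return 3
        if distance <= 60:
            return 2
        if distance <= 120:
            return 1
        return 0  # no common domain at all


    def assign_domains(user):
        """Build one four-domain associativity array per node."""
        n = len(user)
        assoc = [[node] * 4 for node in range(n)]  # initialized with the node id
        for src in range(n):
            for dst in range(src + 1, n):  # earlier nodes are not revisited
                level = numa_level(user[src][dst])
                # Copy domains 0x1 up to the matching level from src to dst.
                for pos in range(level):
                    assoc[dst][pos] = assoc[src][pos]
        return assoc


    def guest_distance(a, b):
        """Kernel-style distance: double until a reference point matches."""
        distance = 10
        for ref in REF_POINTS:
            if a[ref - 1] == b[ref - 1]:
                break
            distance *= 2
        return distance


    USER = [[10, 40, 20, 40],
            [40, 10, 80, 40],
            [20, 80, 10, 20],
            [40, 40, 20, 10]]

    assoc = assign_domains(USER)
    guest = [[guest_distance(a, b) for b in assoc] for a in assoc]
    for row in guest:
        print(row)

Because later copies overwrite domains that earlier steps already settled, a
distance requested for one pair can silently change the distance of another,
which is how 0 to 3 ends up as 20 in this example.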

Users are welcome to use this knowledge and experiment with the input to get
the NUMA topology they want, or as close to it as possible. The important thing
is to keep expectations in line with what we are capable of providing at this
moment: an approximation.

Limitations of the implementation
---------------------------------

As mentioned above, the pSeries NUMA distance logic is, in fact, a way to approximate
user choice. Neither the Linux kernel nor PAPR itself provides QEMU with the means
to fully map user input to the actual NUMA distance the guest will use. These
constraints create two notable limitations in our support:

* Asymmetrical topologies aren't supported. We only support NUMA topologies where
  the distance from node A to B is always the same as B to A. We do not support
  any A-B pair where the distance back and forth is asymmetric.
  For example, the following topology isn't supported and the pSeries guest
  will not boot with this user input:

::

      0   1
  0  10  40
  1  20  10


* 'non-transitive' topologies will be poorly translated to the guest. This is the
  kind of topology where the distance from a node A to B is X, B to C is X, but
  the distance A to C is not X. E.g.:

::

      0   1   2   3
  0  10  20  20  40
  1  20  10  80  40
  2  20  80  10  20
  3  40  40  20  10

  In the example above, distance 0 to 2 is 20, 2 to 3 is 20, but 0 to 3 is 40.
  The kernel will always match with the shortest associativity domain possible,
  and we're attempting to retain the previously established relations between
  the nodes. This means that a distance equal to 20 between nodes 0 and 2 and the
  same distance 20 between nodes 2 and 3 will cause the distance between 0 and 3
  to also be 20.
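
Both limitations can be checked before starting a guest. The Python sketch
below is a hypothetical pre-flight check, not something QEMU provides:
``is_symmetric`` and ``non_transitive_triples`` are illustrative names, and the
second helper simply applies the definition given above (A to B equals B to C,
but A to C differs).

.. code-block:: python

    from itertools import permutations


    def is_symmetric(dist):
        """True if the distance from A to B equals B to A for every pair."""
        n = len(dist)
        return all(dist[a][b] == dist[b][a]
                   for a in range(n) for b in range(a + 1, n))


    def non_transitive_triples(dist):
        """List (a, b, c) where a-b equals b-c but a-c is different."""
        triples = []
        for a, b, c in permutations(range(len(dist)), 3):
            # In a symmetric matrix (c, b, a) is the same finding, so only
            # the ordering with a < c is reported.
            if a < c and dist[a][b] == dist[b][c] != dist[a][c]:
                triples.append((a, b, c))
        return triples


    asymmetric = [[10, 40],
                  [20, 10]]

    non_transitive = [[10, 20, 20, 40],
                      [20, 10, 80, 40],
                      [20, 80, 10, 20],
                      [40, 40, 20, 10]]

    print(is_symmetric(asymmetric))                             # False
    print((0, 2, 3) in non_transitive_triples(non_transitive))  # True

A matrix that is symmetric and yields no triples is the kind of input that the
approximation has the best chance of reproducing as requested.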


Legacy (5.1 and older) pseries NUMA mechanics
=============================================

In short, we can summarize the NUMA distances seen in pseries Linux guests, using
QEMU up to 5.1, as follows:

* local distance, i.e. the distance of the resource to its own NUMA node: 10
* if it's an NVLink GPU device, distance: 80
* every other resource, distance: 40

The way the pseries Linux guest calculates NUMA distances has a direct effect
on what QEMU users can expect when doing NUMA tuning. As of QEMU 5.1, this is
the default ibm,associativity-reference-points being used in the pseries
machine:

ibm,associativity-reference-points = {0x4, 0x4, 0x2}

The first and second levels are equal, 0x4, and a third one was added in
commit a6030d7e0b35 exclusively for NVLink GPU support.
This means that
regardless of how the ibm,associativity properties are being created in
the device tree, the pseries Linux guest will only recognize three scenarios
as far as NUMA distance goes:

* resources that belong to the same first NUMA level have distance = 10
* the second level is skipped since it's equal to the first
* all resources that aren't an NVLink GPU are guaranteed to belong
  to the same third NUMA level, having distance = 40
* for NVLink GPUs, distance = 80 from everything else

This also means that user input in the QEMU command line does not change the
NUMA distancing inside the guest for the pseries machine.
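
To see why only 10, 40 and 80 show up, the doubling rule can be applied to the
legacy reference points. The Python sketch below uses made-up associativity
arrays (assumptions for illustration, not the exact values QEMU writes to the
device tree): regular nodes share every domain except the last one, while the
NVLink GPU shares none.

.. code-block:: python

    LEGACY_REF_POINTS = (4, 4, 2)


    def numa_distance(assoc_a, assoc_b, ref_points=LEGACY_REF_POINTS):
        """Start at the local distance, double per unmatched reference point."""
        distance = 10
        for ref in ref_points:
            if assoc_a[ref - 1] == assoc_b[ref - 1]:
                break
            distance *= 2
        return distance


    # Hypothetical arrays: only the last domain (0x4) tells regular nodes apart.
    node0 = [0, 0, 0, 0]
    node1 = [0, 0, 0, 1]
    gpu = [1, 1, 1, 2]

    print(numa_distance(node0, node0))  # 10
    print(numa_distance(node0, node1))  # 40
    print(numa_distance(node0, gpu))    # 80

Since 0x4 appears twice, a mismatch there doubles the distance twice, so 20 is
never produced; whatever node ids the user picks, two distinct regular nodes
always end up at 40.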