NUMA mechanics for sPAPR (pseries machines)
============================================

NUMA in sPAPR works differently from the System Locality Distance
Information Table (SLIT) in ACPI. The logic is explained in the LOPAPR
1.1 chapter 15, "Non Uniform Memory Access (NUMA) Option". This
document aims to complement that specification, providing details
of the elements that impact how QEMU views NUMA in pseries.

Associativity and ibm,associativity property
--------------------------------------------

Associativity is defined as a group of platform resources that has
similar mean performance (or, in our context here, distance) relative to
everything else outside of the group.

The format of the ibm,associativity property varies with the value of
bit 0 of byte 5 of the ibm,architecture-vec-5 property. The format with
bit 0 equal to zero is deprecated. The current format, with bit 0
equal to one, makes the ibm,associativity property represent the
physical hierarchy of the platform, as one or more lists that start
with the highest level grouping and go down to the smallest. Consider the
following topology:

::

    Mem M1 ---- Proc P1    |
    -----------------      | Socket S1 ---|
          chip C1          |              |
                                          | HW module 1 (MOD1)
    Mem M2 ---- Proc P2    |              |
    -----------------      | Socket S2 ---|
          chip C2          |

The ibm,associativity property for the processors would be:

* P1: {MOD1, S1, C1, P1}
* P2: {MOD1, S2, C2, P2}

Each allocatable resource has an ibm,associativity property. The LOPAPR
specification allows multiple lists to be present in this property,
considering that the same resource can have multiple connections to the
platform.

Relative Performance Distance and ibm,associativity-reference-points
---------------------------------------------------------------------

The ibm,associativity-reference-points property is an array that is used
to define the relevant performance/distance related boundaries, defining
the NUMA levels for the platform.

The definition of its elements also varies with the value of bit 0 of byte 5
of the ibm,architecture-vec-5 property. The format with bit 0 equal to zero
is also deprecated. With the current format, each integer of
ibm,associativity-reference-points represents a 1-based ordinal index (i.e.
the first element is 1) into the ibm,associativity array. The first
boundary is the most significant to application performance, followed by
less significant boundaries. Allocated resources that belong to the
same performance boundary are expected to have relative NUMA distances
that match the relevance of the boundary itself. Resources that belong
to the same first boundary will have the shortest distance from each
other. Subsequent boundaries represent greater distances and degraded
performance.

Using the previous example, the following reference-points setting defines
three NUMA levels:

* ibm,associativity-reference-points = {0x3, 0x2, 0x1}

The first NUMA level (0x3) is interpreted as the third element of each
ibm,associativity array, the second level is the second element and
the third level is the first element. Let's also consider that elements
belonging to the same first NUMA level have distance equal to 10 from each
other, and that each NUMA level doubles the distance of the previous one. This
means that the second level would be 20 and the third level 40.
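
The lookup and doubling rule just described can be sketched in Python as
follows. This is purely an illustration of the rules stated in this document,
not QEMU or Linux kernel code, and the function name and string-valued arrays
are made up for this example:

::

    def lopapr_distance(assoc_a, assoc_b, ref_points, local_distance=10):
        """Distance between two resources, per the rules above: reference
        points are 1-based indexes into the ibm,associativity arrays, ordered
        from most to least significant, and the distance doubles for every
        boundary the two resources do not share."""
        distance = local_distance
        for ref in ref_points:
            # ref is 1-based: 0x3 selects the third element of the array
            if assoc_a[ref - 1] == assoc_b[ref - 1]:
                return distance
            distance *= 2
        return distance

    p1 = ["MOD1", "S1", "C1", "P1"]
    p2 = ["MOD1", "S2", "C2", "P2"]

    # ibm,associativity-reference-points = {0x3, 0x2, 0x1}
    print(lopapr_distance(p1, p2, [0x3, 0x2, 0x1]))   # 40: only MOD1 is shared
    print(lopapr_distance(p1, p1, [0x3, 0x2, 0x1]))   # 10: same resource

The listings below walk through the same lookup by hand.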
For the P1 and P2 processors, we would have the following NUMA levels:

::

    * ibm,associativity-reference-points = {0x3, 0x2, 0x1}

    * P1: associativity{MOD1, S1, C1, P1}

      First NUMA level (0x3)  => associativity[2] = C1
      Second NUMA level (0x2) => associativity[1] = S1
      Third NUMA level (0x1)  => associativity[0] = MOD1

    * P2: associativity{MOD1, S2, C2, P2}

      First NUMA level (0x3)  => associativity[2] = C2
      Second NUMA level (0x2) => associativity[1] = S2
      Third NUMA level (0x1)  => associativity[0] = MOD1

    P1 and P2 have the same third NUMA level, MOD1: distance between them = 40

Changing the ibm,associativity-reference-points array changes the performance
distance attributes for the same associativity arrays, as the following
example illustrates:

::

    * ibm,associativity-reference-points = {0x2}

    * P1: associativity{MOD1, S1, C1, P1}

      First NUMA level (0x2) => associativity[1] = S1

    * P2: associativity{MOD1, S2, C2, P2}

      First NUMA level (0x2) => associativity[1] = S2

    P1 and P2 do not have a common performance boundary. Since this is a one
    level NUMA configuration, the distance between them is one boundary above
    the first level, 20.

In a hypothetical platform where all resources inside the same hardware module
are considered to be on the same performance boundary:

::

    * ibm,associativity-reference-points = {0x1}

    * P1: associativity{MOD1, S1, C1, P1}

      First NUMA level (0x1) => associativity[0] = MOD1

    * P2: associativity{MOD1, S2, C2, P2}

      First NUMA level (0x1) => associativity[0] = MOD1

    P1 and P2 belong to the same first order boundary. The distance between
    them is 10.


How the pseries Linux guest calculates NUMA distances
======================================================

Another key difference between ACPI SLIT and LOPAPR regarding NUMA is
how the distances are expressed. The SLIT table provides the NUMA distance
value between the relevant resources. LOPAPR does not provide a standard
way to calculate it. We have the ibm,associativity for each resource, which
provides a common-performance hierarchy, and the ibm,associativity-reference-points
array that tells which levels of associativity are considered relevant.

The result is that each OS is free to implement and interpret the distance
as it sees fit. For the pseries Linux guest, each NUMA level doubles
the distance of the previous level, and the maximum number of levels is
limited to MAX_DISTANCE_REF_POINTS = 4 (from arch/powerpc/mm/numa.c in the
kernel tree). This results in the following distances:

* both resources in the first NUMA level: 10
* resources one NUMA level apart: 20
* resources two NUMA levels apart: 40
* resources three NUMA levels apart: 80
* resources four NUMA levels apart: 160


Consequences for QEMU NUMA tuning
---------------------------------

The way the pseries Linux guest calculates NUMA distances has a direct effect
on what QEMU users can expect when doing NUMA tuning. As of QEMU 5.1, this is
the default ibm,associativity-reference-points being used in the pseries
machine:

ibm,associativity-reference-points = {0x4, 0x4, 0x2}

The first and second levels are equal, 0x4, and a third one was added in
commit a6030d7e0b35 exclusively for NVLink GPU support.
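
Plugging these defaults into the same doubling rule sketched earlier shows why
the outcome is so limited. The associativity arrays below are hypothetical,
assuming only that the element addressed by 0x4 carries the NUMA node id and
that an NVLink GPU does not share the element addressed by 0x2 with any other
resource; they are not the exact arrays QEMU emits:

::

    def lopapr_distance(assoc_a, assoc_b, ref_points, local_distance=10):
        # Same rule as before: walk the reference points (1-based indexes)
        # and double the distance for every boundary that is not shared.
        distance = local_distance
        for ref in ref_points:
            if assoc_a[ref - 1] == assoc_b[ref - 1]:
                return distance
            distance *= 2
        return distance

    ref_points = [0x4, 0x4, 0x2]     # QEMU 5.1 pseries default

    cpu_node0  = [0, 0, 0, 0, 0]     # hypothetical resource in NUMA node 0
    cpu_node1  = [0, 0, 0, 1, 1]     # hypothetical resource in NUMA node 1
    nvlink_gpu = [0, 2, 2, 2, 2]     # hypothetical NVLink GPU node

    print(lopapr_distance(cpu_node0, cpu_node1, ref_points))    # 40
    print(lopapr_distance(cpu_node0, nvlink_gpu, ref_points))   # 80
    print(lopapr_distance(cpu_node0, cpu_node0, ref_points))    # 10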
With these reference points, regardless of how the ibm,associativity properties
are created in the device tree, the pseries Linux guest will only recognize
three scenarios as far as NUMA distance goes:

* if the resources belong to the same first NUMA level, distance = 10
* the second level is skipped, since it is equal to the first
* all resources that aren't an NVLink GPU are guaranteed to belong to the
  same third NUMA level, with distance = 40
* for NVLink GPUs, distance = 80 from everything else

In short, we can summarize the NUMA distances seen in pseries Linux guests,
using QEMU up to 5.1, as follows:

* local distance, i.e. the distance of the resource to its own NUMA node: 10
* if it's an NVLink GPU device, distance: 80
* every other resource, distance: 40

This also means that user input in the QEMU command line does not change the
NUMA distances inside the guest for the pseries machine.
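
As a closing illustration, the guest-visible outcome summarized above can be
condensed into a small sketch. The node layout is hypothetical (two regular
NUMA nodes plus one NVLink GPU node) and, per the summary, any distance values
a user might pass on the QEMU command line would not alter this result on
pseries as of QEMU 5.1:

::

    # Hypothetical layout: two regular NUMA nodes and one NVLink GPU node.
    nodes = {0: "cpu/mem", 1: "cpu/mem", 2: "nvlink-gpu"}

    def guest_distance(a, b):
        """Distance as seen by a pseries Linux guest on QEMU up to 5.1,
        per the summary above."""
        if a == b:
            return 10                              # local distance
        if "nvlink-gpu" in (nodes[a], nodes[b]):   # an NVLink GPU is involved
            return 80
        return 40                                  # every other remote resource

    for a in sorted(nodes):
        print([guest_distance(a, b) for b in sorted(nodes)])
    # [10, 40, 80]
    # [40, 10, 80]
    # [80, 80, 10]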