=============================
sPAPR Dynamic Reconfiguration
=============================

sPAPR or pSeries guests make use of a facility called dynamic reconfiguration
to handle hot plugging of dynamic "physical" resources like PCI cards, or
"logical"/para-virtual resources like memory, CPUs, and "physical"
host-bridges, which are generally managed by the host/hypervisor and provided
to guests as virtualized resources. The specifics of dynamic reconfiguration
are documented extensively in section 13 of the Linux on Power Architecture
Reference document ([LoPAR]_). This document provides a summary of that
information as it applies to the implementation within QEMU.

Dynamic-reconfiguration Connectors
==================================

To manage hot plug/unplug of these resources, a firmware abstraction known as
a Dynamic Resource Connector (DRC) is used to assign a particular dynamic
resource to the guest, and provide an interface for the guest to manage
configuration/removal of the resource associated with it.

Device tree description of DRCs
===============================

A set of four Open Firmware device tree array properties are used to describe
the name/index/power-domain/type of each DRC allocated to a guest at
boot time. There may be multiple sets of these arrays, rooted at different
paths in the device tree depending on the type of resource the DRCs manage.

In some cases, the DRCs themselves may be provided by a dynamic resource,
such as the DRCs managing PCI slots on a hot plugged PHB. In this case the
arrays would be fetched as part of the device tree retrieval interfaces
for hot plugged resources described under :ref:`guest-host-interface`.

The array properties are described below.
Each entry/element in an array
describes the DRC identified by the element in the corresponding position
of ``ibm,drc-indexes``:

``ibm,drc-names``
-----------------

  First 4 bytes: big-endian (BE) encoded integer denoting the number of entries.

  Each entry: a NULL-terminated ``<name>`` string encoded as a byte array.

    ``<name>`` values for logical/virtual resources are defined in the Linux on
    Power Architecture Reference ([LoPAR]_) section 13.5.2.4, and basically
    consist of the type of the resource followed by a space and a numerical
    value that's unique across resources of that type.

    ``<name>`` values for "physical" resources such as PCI or VIO devices are
    defined as being "location codes", which are the "location labels" of each
    encapsulating device, starting from the chassis down to the individual slot
    for the device, concatenated by a hyphen. This provides a mapping of
    resources to a physical location in a chassis for debugging purposes. For
    QEMU, this mapping is less important, so we assign a location code that
    conforms to naming specifications, but is simply a location label for the
    slot by itself to simplify the implementation. The naming convention for
    location labels is documented in detail in the [LoPAR]_ section 12.3.1.5,
    and in our case amounts to using ``C<n>`` for PCI/VIO device slots, where
    ``<n>`` is unique across all PCI/VIO device slots.

``ibm,drc-indexes``
-------------------

  First 4 bytes: BE-encoded integer denoting the number of entries.

  Each 4-byte entry: BE-encoded ``<index>`` integer that is unique across all
  DRCs in the machine.

    ``<index>`` is arbitrary, but in the case of QEMU we try to maintain the
    convention used to assign them to pSeries guests on pHyp (the hypervisor
    portion of PowerVM):

      ``bit[31:28]``: integer encoding of ``<type>``, where ``<type>`` is:

        ``1`` for CPU resource.

        ``2`` for PHB resource.

        ``3`` for VIO resource.

        ``4`` for PCI resource.

        ``8`` for memory resource.

      ``bit[27:0]``: integer encoding of ``<id>``, where ``<id>`` is unique
      across all resources of specified type.

``ibm,drc-power-domains``
-------------------------

  First 4 bytes: BE-encoded integer denoting the number of entries.

  Each 4-byte entry: 32-bit, BE-encoded ``<index>`` integer that specifies the
  power domain the resource will be assigned to. In the case of QEMU we
  associate all resources with a "live insertion" domain, where the power is
  assumed to be managed automatically. The integer value for this domain is a
  special value of ``-1``.

``ibm,drc-types``
-----------------

  First 4 bytes: BE-encoded integer denoting the number of entries.

  Each entry: a NULL-terminated ``<type>`` string encoded as a byte array.
  ``<type>`` is assigned as follows:

    "CPU" for a CPU.

    "PHB" for a physical host-bridge.

    "SLOT" for a VIO slot.

    "28" for a PCI slot.

    "MEM" for memory resource.

.. _guest-host-interface:

Guest->Host interface to manage dynamic resources
=================================================

Each DRC is given a globally unique DRC index, and resources associated with a
particular DRC are configured/managed by the guest via a number of RTAS calls
which reference individual DRCs based on the DRC index. This can be considered
the guest->host interface.

``rtas-set-power-level``
------------------------

Set the power level for a specified power domain.

  ``arg[0]``: integer identifying power domain.

  ``arg[1]``: new power level for the domain, ``0-100``.

  ``output[0]``: status, ``0`` on success.

  ``output[1]``: power level after command.

``rtas-get-power-level``
------------------------

Get the power level for a specified power domain.

  ``arg[0]``: integer identifying power domain.

  ``output[0]``: status, ``0`` on success.

  ``output[1]``: current power level.

``rtas-set-indicator``
----------------------

Set the state of an indicator or sensor.

  ``arg[0]``: integer identifying sensor/indicator type.

  ``arg[1]``: index of sensor, for DR-related sensors this is generally the DRC
  index.

  ``arg[2]``: desired sensor value.

  ``output[0]``: status, ``0`` on success.

For the purpose of this document we focus on the indicator/sensor types
associated with a DRC. The types are:

* ``9001``: ``isolation-state``, controls/indicates whether a device has been
  made accessible to a guest. Supported sensor values:

    ``0``: ``isolate``, device is made inaccessible by guest OS.

    ``1``: ``unisolate``, device is made available to guest OS.

* ``9002``: ``dr-indicator``, controls "visual" indicator associated with
  device. Supported sensor values:

    ``0``: ``inactive``, resource may be safely removed.

    ``1``: ``active``, resource is in use and cannot be safely removed.

    ``2``: ``identify``, used to visually identify slot for interactive hot plug.

    ``3``: ``action``, in most cases, used in the same manner as identify.

* ``9003``: ``allocation-state``, generally only used for "logical" DR resources
  to request the allocation/deallocation of a resource prior to acquiring it via
  ``isolation-state->unisolate``, or after releasing it via
  ``isolation-state->isolate``, respectively. For "physical" DR (like PCI
  hot plug/unplug) the pre-allocation of the resource is implied and this sensor
  is unused. Supported sensor values:

    ``0``: ``unusable``, tell firmware/system the resource can be
    unallocated/reclaimed and added back to the system resource pool.

    ``1``: ``usable``, request the resource be allocated/reserved for use by
    guest OS.

    ``2``: ``exchange``, used to allocate a spare resource to use for fail-over
    in certain situations. Unused in QEMU.

    ``3``: ``recover``, used to reclaim a previously allocated resource that's
    not currently allocated to the guest OS. Unused in QEMU.

``rtas-get-sensor-state``
-------------------------

Used to read an indicator or sensor value.

  ``arg[0]``: integer identifying sensor/indicator type.

  ``arg[1]``: index of sensor, for DR-related sensors this is generally the DRC
  index.

  ``output[0]``: status, ``0`` on success.

For DR-related operations, the only noteworthy sensor is ``dr-entity-sense``,
which has a type value of ``9003``, as ``allocation-state`` does in the case of
``rtas-set-indicator``. The semantics/encodings of the sensor values are
distinct, however.

Supported sensor values for ``dr-entity-sense`` (``9003``) sensor:

  ``0``: empty.

    For physical resources: DRC/slot is empty.

    For logical resources: unused.

  ``1``: present.

    For physical resources: DRC/slot is populated with a device/resource.

    For logical resources: resource has been allocated to the DRC.

  ``2``: unusable.

    For physical resources: unused.

    For logical resources: DRC has no resource allocated to it.

  ``3``: exchange.

    For physical resources: unused.

    For logical resources: resource available for exchange (see
    ``allocation-state`` sensor semantics above).

  ``4``: recovery.

    For physical resources: unused.

    For logical resources: resource available for recovery (see
    ``allocation-state`` sensor semantics above).

``rtas-ibm-configure-connector``
--------------------------------

Used to fetch an Open Firmware device tree description of the resource
associated with a particular DRC.

  ``arg[0]``: guest physical address of 4096-byte work area buffer.

  ``arg[1]``: 0, or address of additional 4096-byte work area buffer; only
  non-zero if a prior RTAS response indicated a need for additional memory.

  ``output[0]``: status:

    ``0``: completed transmittal of device tree node.

    ``1``: instruct guest to prepare for next device tree sibling node.

    ``2``: instruct guest to prepare for next device tree child node.

    ``3``: instruct guest to prepare for next device tree property.

    ``4``: instruct guest to ascend to parent device tree node.

    ``5``: instruct guest to provide additional work-area buffer via ``arg[1]``.

    ``990x``: instruct guest that operation took too long and to try again
    later.

The DRC index is encoded in the first 4 bytes of the first work area buffer.
Work area (``wa``) layout, using 4-byte offsets:

  ``wa[0]``: DRC index of the DRC to fetch device tree nodes from.

  ``wa[1]``: ``0`` (hard-coded).

  ``wa[2]``:

    For next-sibling/next-child response:

      ``wa`` offset of null-terminated string denoting the new node's name.

    For next-property response:

      ``wa`` offset of null-terminated string denoting new property's name.

  ``wa[3]``: for next-property response (unused otherwise):

    Byte-length of new property's value.

  ``wa[4]``: for next-property response (unused otherwise):

    New property's value, encoded as an OFDT-compatible byte array.

Hot plug/unplug events
======================

For most DR operations, the hypervisor will issue host->guest add/remove events
using the EPOW/check-exception notification framework, where the host issues a
check-exception interrupt, then provides an RTAS event log via an
rtas-check-exception call issued by the guest in response. This framework is
documented by PAPR+ v2.7, and is already in use by QEMU for generating powerdown
requests via EPOW events.

For DR, this framework has been extended to include hotplug events, which were
previously unneeded due to direct manipulation of DR-related guest userspace
tools by host-level management such as an HMC. This level of management is not
applicable to KVM on Power, hence the reason for extending the notification
framework to support hotplug events.

The format for these EPOW-signalled events is described below under
:ref:`hot-plug-unplug-event-structure`. Note that these events are not formally
part of the PAPR+ specification, and have been superseded by a newer format,
also described below under :ref:`hot-plug-unplug-event-structure`, and so are
now deemed a "legacy" format. The formats are similar, but the "modern" format
contains additional fields/flags, which are denoted for the purposes of this
documentation with ``#ifdef GUEST_SUPPORTS_MODERN`` guards.

QEMU should assume support only for "legacy" fields/flags unless the guest
advertises support for the "modern" format via
``ibm,client-architecture-support`` hcall by setting byte 5, bit 6 of its
``ibm,architecture-vec-5`` option vector structure (as described by [LoPAR]_,
section B.5.2.3). As with "legacy" format events, "modern" format events are
surfaced to the guest via check-exception RTAS calls, but use a dedicated event
source to signal the guest. This event source is advertised to the guest by the
addition of a ``hot-plug-events`` node under the ``/event-sources`` node of the
guest's device tree using the standard format described in [LoPAR]_,
section B.5.12.2.

.. _hot-plug-unplug-event-structure:

Hot plug/unplug event structure
===============================

The hot plug specific payload in QEMU is implemented as follows (with all values
encoded in big-endian format):

.. code-block:: c

   struct rtas_event_log_v6_hp {
   #define SECTION_ID_HOTPLUG              0x4850 /* HP */
       struct section_header {
           uint16_t section_id;            /* set to SECTION_ID_HOTPLUG */
           uint16_t section_length;        /* sizeof(rtas_event_log_v6_hp),
                                            * plus the length of the DRC name
                                            * if a DRC name identifier is
                                            * specified for hotplug_identifier
                                            */
           uint8_t section_version;        /* version 1 */
           uint8_t section_subtype;        /* unused */
           uint16_t creator_component_id;  /* unused */
       } hdr;
   #define RTAS_LOG_V6_HP_TYPE_CPU         1
   #define RTAS_LOG_V6_HP_TYPE_MEMORY      2
   #define RTAS_LOG_V6_HP_TYPE_SLOT        3
   #define RTAS_LOG_V6_HP_TYPE_PHB         4
   #define RTAS_LOG_V6_HP_TYPE_PCI         5
       uint8_t hotplug_type;               /* type of resource/device */
   #define RTAS_LOG_V6_HP_ACTION_ADD       1
   #define RTAS_LOG_V6_HP_ACTION_REMOVE    2
       uint8_t hotplug_action;             /* action (add/remove) */
   #define RTAS_LOG_V6_HP_ID_DRC_NAME          1
   #define RTAS_LOG_V6_HP_ID_DRC_INDEX         2
   #define RTAS_LOG_V6_HP_ID_DRC_COUNT         3
   #ifdef GUEST_SUPPORTS_MODERN
   #define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4
   #endif
       uint8_t hotplug_identifier;         /* type of the resource identifier,
                                            * which serves as the discriminator
                                            * for the 'drc' union field below
                                            */
   #ifdef GUEST_SUPPORTS_MODERN
       uint8_t capabilities;               /* capability flags, currently unused
                                            * by QEMU
                                            */
   #else
       uint8_t reserved;
   #endif
       union {
           uint32_t index;                 /* DRC index of resource to take action
                                            * on
                                            */
           uint32_t count;                 /* number of DR resources to take
                                            * action on (guest chooses which)
                                            */
   #ifdef GUEST_SUPPORTS_MODERN
           struct {
               uint32_t count;             /* number of DR resources to take
                                            * action on
                                            */
               uint32_t index;             /* DRC index of first resource to take
                                            * action on. Guest will take action
                                            * on DRC index <index> through
                                            * DRC index <index + count - 1> in
                                            * sequential order
                                            */
           } count_indexed;
   #endif
           char name[1];                   /* string representing the name of the
                                            * DRC to take action on
                                            */
       } drc;
   } QEMU_PACKED;

``ibm,lrdr-capacity``
=====================

``ibm,lrdr-capacity`` is a property in the ``/rtas`` device tree node that
identifies the dynamic reconfiguration capabilities of the guest. It is a
triple consisting of ``<phys>``, ``<size>`` and ``<maxcpus>``.

  ``<phys>``, encoded in BE format, represents the maximum address in bytes and
  hence the maximum memory that can be allocated to the guest.

  ``<size>``, encoded in BE format, represents the size increments in which
  memory can be hot-plugged to the guest.

  ``<maxcpus>``, a BE-encoded integer, represents the maximum number of
  processors that the guest can have.

``pseries`` guests use this property to note the maximum allowed CPUs for the
guest.

``ibm,dynamic-reconfiguration-memory``
======================================

``ibm,dynamic-reconfiguration-memory`` is a device tree node that represents
dynamically reconfigurable logical memory blocks (LMB). This node is generated
only when the guest advertises support for it via the
``ibm,client-architecture-support`` call. Memory that is not dynamically
reconfigurable is represented by ``/memory`` nodes. The properties of this node
that are of interest to the sPAPR memory hotplug implementation in QEMU are
described here.

``ibm,lmb-size``
----------------

This 64-bit integer defines the size of each dynamically reconfigurable LMB.

``ibm,associativity-lookup-arrays``
-----------------------------------

This property defines a lookup array in which the NUMA associativity
information for each LMB can be found.
It is a property-encoded array
that begins with an integer M, the number of associativity lists, followed
by an integer N, the number of entries per associativity list, and then
M associativity lists, each N integers long.

This property provides the same information as given by the
``ibm,associativity`` property in a ``/memory`` node. Each assigned LMB has an
index value between 0 and M-1 which is used as an index into this table to
select which associativity list to use for the LMB. This index value for each
LMB is defined in the ``ibm,dynamic-memory`` property.

``ibm,dynamic-memory``
----------------------

This property describes the dynamically reconfigurable memory. It is a
property-encoded array that has an integer N, the number of LMBs, followed
by N LMB list entries.

Each LMB list entry consists of the following elements:

- Logical address of the start of the LMB encoded as a 64-bit integer. This
  corresponds to the ``reg`` property in a ``/memory`` node.
- DRC index of the LMB, which corresponds to the ``ibm,my-drc-index`` property
  in a ``/memory`` node.
- Four bytes reserved for expansion.
- Associativity list index for the LMB that is used as an index into the
  ``ibm,associativity-lookup-arrays`` property described earlier. This is used
  to retrieve the right associativity list to be used for this LMB.
- A 32-bit flags word. The bit at position ``0x00000008`` defines whether
  the LMB is assigned to the partition as of boot time.

``ibm,dynamic-memory-v2``
-------------------------

This property describes the dynamically reconfigurable memory. This is
an alternate and newer way to describe dynamically reconfigurable memory.
It is a property-encoded array that has an integer N (the number of
LMB set entries) followed by N LMB set entries. There is an LMB set entry
for each sequential group of LMBs that share common attributes.

Each LMB set entry consists of the following elements:

- Number of sequential LMBs in the entry represented by a 32-bit integer.
- Logical address of the first LMB in the set encoded as a 64-bit integer.
- DRC index of the first LMB in the set.
- Associativity list index that is used as an index into the
  ``ibm,associativity-lookup-arrays`` property described earlier. This
  is used to retrieve the right associativity list to be used for all
  the LMBs in this set.
- A 32-bit flags word that applies to all the LMBs in the set.
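
The LMB set layout listed above can be decoded with a few lines of C. The
following is only a rough sketch and not QEMU's or any guest's actual parser:
the names ``struct lmb_set``, ``read_be32``, ``read_be64`` and
``parse_dynamic_memory_v2`` are hypothetical. It also assumes that the leading
count N, the DRC index and the associativity list index are each 32-bit
integers, that all values are big-endian, and that the fields are packed back
to back with no padding, none of which is spelled out in the text above.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical in-memory form of one LMB set entry (host byte order). */
   struct lmb_set {
       uint32_t seq_lmbs;   /* number of sequential LMBs in the set */
       uint64_t base_addr;  /* logical address of the first LMB */
       uint32_t drc_index;  /* DRC index of the first LMB (assumed 32-bit) */
       uint32_t aa_index;   /* associativity list index (assumed 32-bit) */
       uint32_t flags;      /* flags applying to every LMB in the set */
   };

   /* Assumed packed size of one entry in the property: 4 + 8 + 4 + 4 + 4. */
   #define LMB_SET_ENTRY_BYTES 24

   static uint32_t read_be32(const uint8_t *p)
   {
       return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
              ((uint32_t)p[2] << 8) | (uint32_t)p[3];
   }

   static uint64_t read_be64(const uint8_t *p)
   {
       return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
   }

   /*
    * Decode the raw property bytes into out[]. Returns the number of LMB set
    * entries decoded, or -1 if the buffer is too short for the advertised
    * count or out[] cannot hold all entries.
    */
   static int parse_dynamic_memory_v2(const uint8_t *prop, size_t len,
                                      struct lmb_set *out, size_t max_sets)
   {
       uint32_t n, i;

       if (len < 4) {
           return -1;
       }
       n = read_be32(prop);
       if ((len - 4) / LMB_SET_ENTRY_BYTES < n || n > max_sets) {
           return -1;
       }
       for (i = 0; i < n; i++) {
           const uint8_t *e = prop + 4 + (size_t)i * LMB_SET_ENTRY_BYTES;

           out[i].seq_lmbs = read_be32(e);
           out[i].base_addr = read_be64(e + 4);
           out[i].drc_index = read_be32(e + 12);
           out[i].aa_index = read_be32(e + 16);
           out[i].flags = read_be32(e + 20);
       }
       return (int)n;
   }

Under the same assumptions, a caller that wants per-LMB information would
expand each set by stepping the base address in units of ``ibm,lmb-size``.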