=============================
sPAPR Dynamic Reconfiguration
=============================

sPAPR or pSeries guests make use of a facility called dynamic reconfiguration
to handle hot plugging of dynamic "physical" resources like PCI cards, or
"logical"/para-virtual resources like memory, CPUs, and "physical"
host-bridges, which are generally managed by the host/hypervisor and provided
to guests as virtualized resources. The specifics of dynamic reconfiguration
are documented extensively in section 13 of the Linux on Power Architecture
Reference document ([LoPAR]_). This document provides a summary of that
information as it applies to the implementation within QEMU.

Dynamic-reconfiguration Connectors
==================================

To manage hot plug/unplug of these resources, a firmware abstraction known as
a Dynamic Resource Connector (DRC) is used to assign a particular dynamic
resource to the guest, and provide an interface for the guest to manage
configuration/removal of the resource associated with it.

Device tree description of DRCs
===============================

A set of four Open Firmware device tree array properties is used to describe
the name/index/power-domain/type of each DRC allocated to a guest at
boot time. There may be multiple sets of these arrays, rooted at different
paths in the device tree depending on the type of resource the DRCs manage.

In some cases, the DRCs themselves may be provided by a dynamic resource,
such as the DRCs managing PCI slots on a hot plugged PHB. In this case the
arrays would be fetched as part of the device tree retrieval interfaces
for hot plugged resources described under :ref:`guest-host-interface`.

The array properties are described below. Each entry/element in an array
describes the DRC identified by the element in the corresponding position
of ``ibm,drc-indexes``:

``ibm,drc-names``
-----------------

  First 4-bytes: big-endian (BE) encoded integer denoting the number of entries.

  Each entry: a NUL-terminated ``<name>`` string encoded as a byte array.

    ``<name>`` values for logical/virtual resources are defined in the Linux on
    Power Architecture Reference ([LoPAR]_) section 13.5.2.4, and basically
    consist of the type of the resource followed by a space and a numerical
    value that's unique across resources of that type.

    ``<name>`` values for "physical" resources such as PCI or VIO devices are
    defined as being "location codes", which are the "location labels" of each
    encapsulating device, starting from the chassis down to the individual slot
    for the device, joined by hyphens. This provides a mapping of
    resources to a physical location in a chassis for debugging purposes. For
    QEMU, this mapping is less important, so we assign a location code that
    conforms to naming specifications, but is simply a location label for the
    slot by itself to simplify the implementation. The naming convention for
    location labels is documented in detail in the [LoPAR]_ section 12.3.1.5,
    and in our case amounts to using ``C<n>`` for PCI/VIO device slots, where
    ``<n>`` is unique across all PCI/VIO device slots.
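
The layout above can be walked with a few lines of C. This is a minimal
sketch, not QEMU code: ``be32_read`` and ``drc_name_at`` are hypothetical
helper names, and the caller is assumed to have already read the raw
``ibm,drc-names`` property bytes from the device tree.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>
   #include <string.h>

   /* Read a 32-bit big-endian integer from a byte buffer. */
   static uint32_t be32_read(const uint8_t *p)
   {
       return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
              ((uint32_t)p[2] << 8) | (uint32_t)p[3];
   }

   /*
    * Return a pointer to the n-th (0-based) name inside an ibm,drc-names
    * property value, or NULL if n is out of range or the data is truncated.
    */
   static const char *drc_name_at(const uint8_t *prop, size_t len, uint32_t n)
   {
       if (len < 4 || n >= be32_read(prop)) {
           return NULL;
       }
       size_t off = 4;                     /* skip the entry count */
       for (;;) {
           /* Each name must be NUL-terminated within the remaining bytes. */
           const uint8_t *end = memchr(prop + off, '\0', len - off);
           if (end == NULL) {
               return NULL;
           }
           if (n-- == 0) {
               return (const char *)(prop + off);
           }
           off = (size_t)(end - prop) + 1; /* start of the next name */
       }
   }

For a property holding the two names ``C1`` and ``C2``,
``drc_name_at(prop, len, 1)`` would return a pointer to ``C2``.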

``ibm,drc-indexes``
-------------------

  First 4-bytes: BE-encoded integer denoting the number of entries.

  Each 4-byte entry: BE-encoded ``<index>`` integer that is unique across all
  DRCs in the machine.

    ``<index>`` is arbitrary, but in the case of QEMU we try to maintain the
    convention used to assign them to pSeries guests on pHyp (the hypervisor
    portion of PowerVM):

      ``bit[31:28]``: integer encoding of ``<type>``, where ``<type>`` is:

        ``1`` for CPU resource.

        ``2`` for PHB resource.

        ``3`` for VIO resource.

        ``4`` for PCI resource.

        ``8`` for memory resource.

      ``bit[27:0]``: integer encoding of ``<id>``, where ``<id>`` is unique
      across all resources of specified type.
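
The bit layout above can be expressed as a pair of shifts and masks. The
following is an illustrative sketch only; the enum and function names are
hypothetical and do not correspond to identifiers in the QEMU source.

.. code-block:: c

   #include <stdint.h>

   /* <type> values from the list above (names are illustrative). */
   enum drc_type {
       DRC_TYPE_CPU = 1,
       DRC_TYPE_PHB = 2,
       DRC_TYPE_VIO = 3,
       DRC_TYPE_PCI = 4,
       DRC_TYPE_MEMORY = 8,
   };

   /* Pack <type> into bits 31:28 and <id> into bits 27:0. */
   static uint32_t drc_index_make(enum drc_type type, uint32_t id)
   {
       return ((uint32_t)type << 28) | (id & 0x0fffffffu);
   }

   /* Extract the <type> field from a DRC index. */
   static uint32_t drc_index_type(uint32_t index)
   {
       return index >> 28;
   }

   /* Extract the <id> field from a DRC index. */
   static uint32_t drc_index_id(uint32_t index)
   {
       return index & 0x0fffffffu;
   }

Under this convention a PCI slot with ``<id>`` 5 gets the index
``0x40000005``, and a memory resource with ``<id>`` 0x10 gets ``0x80000010``.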

``ibm,drc-power-domains``
-------------------------

  First 4-bytes: BE-encoded integer denoting the number of entries.

  Each 4-byte entry: 32-bit, BE-encoded ``<index>`` integer that specifies the
  power domain the resource will be assigned to. In the case of QEMU we
  associate all resources with a "live insertion" domain, where the power is
  assumed to be managed automatically. The integer value for this domain is a
  special value of ``-1``.

``ibm,drc-types``
-----------------

  First 4-bytes: BE-encoded integer denoting the number of entries.

  Each entry: a NUL-terminated ``<type>`` string encoded as a byte array.
  ``<type>`` is assigned as follows:

    "CPU" for a CPU.

    "PHB" for a physical host-bridge.

    "SLOT" for a VIO slot.

    "28" for a PCI slot.

    "MEM" for memory resource.

.. _guest-host-interface:

Guest->Host interface to manage dynamic resources
=================================================

Each DRC is given a globally unique DRC index, and resources associated with a
particular DRC are configured/managed by the guest via a number of RTAS calls
which reference individual DRCs based on the DRC index. This can be considered
the guest->host interface.

``rtas-set-power-level``
------------------------

Set the power level for a specified power domain.

  ``arg[0]``: integer identifying power domain.

  ``arg[1]``: new power level for the domain, ``0-100``.

  ``output[0]``: status, ``0`` on success.

  ``output[1]``: power level after command.

``rtas-get-power-level``
------------------------

Get the power level for a specified power domain.

  ``arg[0]``: integer identifying power domain.

  ``output[0]``: status, ``0`` on success.

  ``output[1]``: current power level.

``rtas-set-indicator``
----------------------

Set the state of an indicator or sensor.

  ``arg[0]``: integer identifying sensor/indicator type.

  ``arg[1]``: index of sensor; for DR-related sensors this is generally the DRC
  index.

  ``arg[2]``: desired sensor value.

  ``output[0]``: status, ``0`` on success.

For the purpose of this document we focus on the indicator/sensor types
associated with a DRC. The types are:

* ``9001``: ``isolation-state``, controls/indicates whether a device has been
  made accessible to a guest. Supported sensor values:

    ``0``: ``isolate``, device is made inaccessible by guest OS.

    ``1``: ``unisolate``, device is made available to guest OS.

* ``9002``: ``dr-indicator``, controls "visual" indicator associated with
  device. Supported sensor values:

    ``0``: ``inactive``, resource may be safely removed.

    ``1``: ``active``, resource is in use and cannot be safely removed.

    ``2``: ``identify``, used to visually identify slot for interactive hot plug.

    ``3``: ``action``, in most cases, used in the same manner as identify.

* ``9003``: ``allocation-state``, generally only used for "logical" DR resources
  to request the allocation/deallocation of a resource prior to acquiring it via
  ``isolation-state->unisolate``, or after releasing it via
  ``isolation-state->isolate``, respectively. For "physical" DR (like PCI
  hot plug/unplug) the pre-allocation of the resource is implied and this sensor
  is unused. Supported sensor values:

    ``0``: ``unusable``, tell firmware/system the resource can be
    unallocated/reclaimed and added back to the system resource pool.

    ``1``: ``usable``, request the resource be allocated/reserved for use by
    guest OS.

    ``2``: ``exchange``, used to allocate a spare resource to use for fail-over
    in certain situations. Unused in QEMU.

    ``3``: ``recover``, used to reclaim a previously allocated resource that's
    not currently allocated to the guest OS. Unused in QEMU.
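
The ordering described for ``allocation-state`` relative to
``isolation-state`` can be sketched as follows. This is an illustration from
the guest's point of view, not code taken from QEMU or any guest OS: the
function builds the ordered list of ``rtas-set-indicator`` arguments for
acquiring or releasing a "logical" resource, and all identifiers are
hypothetical.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Indicator types and values taken from the lists above. */
   enum { IND_ISOLATION_STATE = 9001, IND_ALLOCATION_STATE = 9003 };
   enum { ISOLATE = 0, UNISOLATE = 1 };
   enum { ALLOC_UNUSABLE = 0, ALLOC_USABLE = 1 };

   /* One rtas-set-indicator invocation: arg[0], arg[1], arg[2]. */
   struct set_indicator_args {
       uint32_t type;
       uint32_t drc_index;
       uint32_t value;
   };

   /*
    * Fill plan[] with the ordered calls for acquiring (acquire == true) or
    * releasing a logical resource. Returns the number of calls (always 2).
    */
   static size_t drc_logical_plan(uint32_t drc_index, bool acquire,
                                  struct set_indicator_args plan[2])
   {
       struct set_indicator_args alloc = {
           IND_ALLOCATION_STATE, drc_index,
           acquire ? ALLOC_USABLE : ALLOC_UNUSABLE
       };
       struct set_indicator_args isol = {
           IND_ISOLATION_STATE, drc_index,
           acquire ? UNISOLATE : ISOLATE
       };
       /* Acquire: allocate, then unisolate. Release: isolate, then free. */
       plan[0] = acquire ? alloc : isol;
       plan[1] = acquire ? isol : alloc;
       return 2;
   }

A caller would issue the two planned calls in order and stop at the first
non-zero status; how a partial failure is rolled back is left open here.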

``rtas-get-sensor-state``
-------------------------

Used to read an indicator or sensor value.

  ``arg[0]``: integer identifying sensor/indicator type.

  ``arg[1]``: index of sensor; for DR-related sensors this is generally the DRC
  index.

  ``output[0]``: status, ``0`` on success.

For DR-related operations, the only noteworthy sensor is ``dr-entity-sense``,
which has a type value of ``9003``, as ``allocation-state`` does in the case of
``rtas-set-indicator``. The semantics/encodings of the sensor values are
distinct, however.

Supported sensor values for ``dr-entity-sense`` (``9003``) sensor:

  ``0``: empty.

    For physical resources: DRC/slot is empty.

    For logical resources: unused.

  ``1``: present.

    For physical resources: DRC/slot is populated with a device/resource.

    For logical resources: resource has been allocated to the DRC.

  ``2``: unusable.

    For physical resources: unused.

    For logical resources: DRC has no resource allocated to it.

  ``3``: exchange.

    For physical resources: unused.

    For logical resources: resource available for exchange (see
    ``allocation-state`` sensor semantics above).

  ``4``: recovery.

    For physical resources: unused.

    For logical resources: resource available for recovery (see
    ``allocation-state`` sensor semantics above).

``rtas-ibm-configure-connector``
--------------------------------

Used to fetch an Open Firmware device tree description of the resource
associated with a particular DRC.

  ``arg[0]``: guest physical address of 4096-byte work area buffer.

  ``arg[1]``: 0, or address of additional 4096-byte work area buffer; only
  non-zero if a prior RTAS response indicated a need for additional memory.

  ``output[0]``: status:

    ``0``: completed transmittal of device tree node.

    ``1``: instruct guest to prepare for next device tree sibling node.

    ``2``: instruct guest to prepare for next device tree child node.

    ``3``: instruct guest to prepare for next device tree property.

    ``4``: instruct guest to ascend to parent device tree node.

    ``5``: instruct guest to provide additional work-area buffer via ``arg[1]``.

    ``990x``: instruct guest that operation took too long and to try again
    later.

The DRC index is encoded in the first 4-bytes of the first work area buffer.
Work area (``wa``) layout, using 4-byte offsets:

  ``wa[0]``: DRC index of the DRC to fetch device tree nodes from.

  ``wa[1]``: ``0`` (hard-coded).

  ``wa[2]``:

    For next-sibling/next-child response:

      ``wa`` offset of null-terminated string denoting the new node's name.

    For next-property response:

      ``wa`` offset of null-terminated string denoting new property's name.

  ``wa[3]``: for next-property response (unused otherwise):

      Byte-length of new property's value.

  ``wa[4]``: for next-property response (unused otherwise):

      New property's value, encoded as an OFDT-compatible byte array.
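
A guest-side caller typically prepares the work area once and then reacts to
each status value in a loop. The fragment below is a hedged sketch of the
pieces that follow directly from the description above; it assumes the work
area cells are big-endian like the device tree properties, and every
identifier in it is hypothetical rather than taken from QEMU.

.. code-block:: c

   #include <stdint.h>
   #include <string.h>

   #define CC_WORK_AREA_SIZE 4096

   /* Status values returned in output[0], as listed above. */
   enum cc_status {
       CC_DONE = 0,
       CC_NEXT_SIBLING = 1,
       CC_NEXT_CHILD = 2,
       CC_NEXT_PROPERTY = 3,
       CC_PREV_PARENT = 4,
       CC_MORE_MEMORY = 5,
   };

   /* Store a 32-bit value in big-endian byte order (assumed encoding). */
   static void be32_write(uint8_t *p, uint32_t v)
   {
       p[0] = (uint8_t)(v >> 24);
       p[1] = (uint8_t)(v >> 16);
       p[2] = (uint8_t)(v >> 8);
       p[3] = (uint8_t)v;
   }

   /* Prepare the first work area: wa[0] = DRC index, wa[1] = 0. */
   static void cc_init_work_area(uint8_t wa[CC_WORK_AREA_SIZE],
                                 uint32_t drc_index)
   {
       memset(wa, 0, CC_WORK_AREA_SIZE);
       be32_write(wa, drc_index);
   }

   /* True for the "990x" family of try-again-later status values. */
   static int cc_status_is_busy(int status)
   {
       return status / 10 == 990;
   }

   /* Track the guest's depth in the tree being received. */
   static int cc_next_depth(int depth, int status)
   {
       switch (status) {
       case CC_NEXT_CHILD:
           return depth + 1;
       case CC_PREV_PARENT:
           return depth - 1;
       default:
           return depth; /* sibling, property, done, or retry */
       }
   }

The loop itself would repeat the RTAS call with the same work area until
``CC_DONE`` is returned, creating nodes and properties as directed.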

Hot plug/unplug events
======================

For most DR operations, the hypervisor will issue host->guest add/remove events
using the EPOW/check-exception notification framework, where the host issues a
check-exception interrupt, then provides an RTAS event log via an
rtas-check-exception call issued by the guest in response. This framework is
documented by PAPR+ v2.7, and is already used by QEMU for generating powerdown
requests via EPOW events.

For DR, this framework has been extended to include hotplug events, which were
previously unneeded due to direct manipulation of DR-related guest userspace
tools by host-level management such as an HMC. This level of management is not
applicable to KVM on Power, hence the reason for extending the notification
framework to support hotplug events.

The format for these EPOW-signalled events is described below under
:ref:`hot-plug-unplug-event-structure`. Note that these events are not formally
part of the PAPR+ specification, and have been superseded by a newer format,
also described below under :ref:`hot-plug-unplug-event-structure`, and so are
now deemed a "legacy" format. The formats are similar, but the "modern" format
contains additional fields/flags, which are denoted for the purposes of this
documentation with ``#ifdef GUEST_SUPPORTS_MODERN`` guards.

QEMU should assume support only for "legacy" fields/flags unless the guest
advertises support for the "modern" format via the
``ibm,client-architecture-support`` hcall by setting byte 5, bit 6 of its
``ibm,architecture-vec-5`` option vector structure (as described by [LoPAR]_,
section B.5.2.3). As with "legacy" format events, "modern" format events are
surfaced to the guest via check-exception RTAS calls, but use a dedicated event
source to signal the guest. This event source is advertised to the guest by the
addition of a ``hot-plug-events`` node under the ``/event-sources`` node of the
guest's device tree using the standard format described in [LoPAR]_,
section B.5.12.2.

.. _hot-plug-unplug-event-structure:

Hot plug/unplug event structure
===============================

The hot plug specific payload in QEMU is implemented as follows (with all values
encoded in big-endian format):

.. code-block:: c

   struct rtas_event_log_v6_hp {
   #define SECTION_ID_HOTPLUG              0x4850 /* HP */
       struct section_header {
           uint16_t section_id;            /* set to SECTION_ID_HOTPLUG */
           uint16_t section_length;        /* sizeof(rtas_event_log_v6_hp),
                                            * plus the length of the DRC name
                                            * if a DRC name identifier is
                                            * specified for hotplug_identifier
                                            */
           uint8_t section_version;        /* version 1 */
           uint8_t section_subtype;        /* unused */
           uint16_t creator_component_id;  /* unused */
       } hdr;
   #define RTAS_LOG_V6_HP_TYPE_CPU         1
   #define RTAS_LOG_V6_HP_TYPE_MEMORY      2
   #define RTAS_LOG_V6_HP_TYPE_SLOT        3
   #define RTAS_LOG_V6_HP_TYPE_PHB         4
   #define RTAS_LOG_V6_HP_TYPE_PCI         5
       uint8_t hotplug_type;               /* type of resource/device */
   #define RTAS_LOG_V6_HP_ACTION_ADD       1
   #define RTAS_LOG_V6_HP_ACTION_REMOVE    2
       uint8_t hotplug_action;             /* action (add/remove) */
   #define RTAS_LOG_V6_HP_ID_DRC_NAME          1
   #define RTAS_LOG_V6_HP_ID_DRC_INDEX         2
   #define RTAS_LOG_V6_HP_ID_DRC_COUNT         3
   #ifdef GUEST_SUPPORTS_MODERN
   #define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4
   #endif
       uint8_t hotplug_identifier;         /* type of the resource identifier,
                                            * which serves as the discriminator
                                            * for the 'drc' union field below
                                            */
   #ifdef GUEST_SUPPORTS_MODERN
       uint8_t capabilities;               /* capability flags, currently unused
                                            * by QEMU
                                            */
   #else
       uint8_t reserved;
   #endif
       union {
           uint32_t index;                 /* DRC index of resource to take action
                                            * on
                                            */
           uint32_t count;                 /* number of DR resources to take
                                            * action on (guest chooses which)
                                            */
   #ifdef GUEST_SUPPORTS_MODERN
           struct {
               uint32_t count;             /* number of DR resources to take
                                            * action on
                                            */
               uint32_t index;             /* DRC index of first resource to take
                                            * action on. guest will take action
                                            * on DRC index <index> through
                                            * DRC index <index + count - 1> in
                                            * sequential order
                                            */
           } count_indexed;
   #endif
           char name[1];                   /* string representing the name of the
                                            * DRC to take action on
                                            */
       } drc;
   } QEMU_PACKED;
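
The ``count_indexed`` identifier names a contiguous run of DRC indexes. As an
illustration of the range it covers, the sketch below expands such an
identifier into the individual indexes a guest would act on. The function
name is hypothetical, and both inputs are assumed to have been converted from
big-endian to host byte order already.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /*
    * Write the DRC indexes <index> .. <index + count - 1> into out[], in
    * sequential order, stopping after max entries. Returns the number of
    * indexes written.
    */
   static size_t hp_count_indexed_expand(uint32_t index, uint32_t count,
                                         uint32_t *out, size_t max)
   {
       size_t n = 0;

       for (uint32_t i = 0; i < count && n < max; i++) {
           out[n++] = index + i;
       }
       return n;
   }

For ``index = 0x80000010`` and ``count = 3`` this yields ``0x80000010``,
``0x80000011`` and ``0x80000012``.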

``ibm,lrdr-capacity``
=====================

``ibm,lrdr-capacity`` is a property in the ``/rtas`` device tree node that
identifies the dynamic reconfiguration capabilities of the guest. It is a
triple consisting of ``<phys>``, ``<size>`` and ``<maxcpus>``.

  ``<phys>``, encoded in BE format, represents the maximum address in bytes and
  hence the maximum memory that can be allocated to the guest.

  ``<size>``, encoded in BE format, represents the size increments in which
  memory can be hot-plugged to the guest.

  ``<maxcpus>``, a BE-encoded integer, represents the maximum number of
  processors that the guest can have.

``pseries`` guests use this property to note the maximum allowed CPUs for the
guest.

``ibm,dynamic-reconfiguration-memory``
======================================

``ibm,dynamic-reconfiguration-memory`` is a device tree node that represents
dynamically reconfigurable logical memory blocks (LMB). This node is generated
only when the guest advertises support for it via the
``ibm,client-architecture-support`` call. Memory that is not dynamically
reconfigurable is represented by ``/memory`` nodes. The properties of this node
that are of interest to the sPAPR memory hotplug implementation in QEMU are
described here.

``ibm,lmb-size``
----------------

This 64-bit integer defines the size of each dynamically reconfigurable LMB.

``ibm,associativity-lookup-arrays``
-----------------------------------

This property defines a lookup array in which the NUMA associativity
information for each LMB can be found. It is a property encoded array
that begins with an integer M, the number of associativity lists, followed
by an integer N, the number of entries per associativity list, and then
M associativity lists, each of length N integers.

This property provides the same information as given by the
``ibm,associativity`` property in a ``/memory`` node. Each assigned LMB has an
index value between 0 and M-1 which is used as an index into this table to
select which associativity list to use for the LMB. This index value for each
LMB is defined in the ``ibm,dynamic-memory`` property.
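
Selecting an associativity list is then a bounds-checked array offset. The
following sketch assumes the property has already been decoded into an array
of host-order 32-bit integers (the cell width is an assumption here), and
``assoc_list_lookup`` is a hypothetical name.

.. code-block:: c

   #include <stddef.h>
   #include <stdint.h>

   /*
    * cells[0] = M (number of lists), cells[1] = N (entries per list),
    * followed by M lists of N integers each. Returns a pointer to list
    * number idx and stores N in *list_len, or returns NULL if idx is out
    * of range or the array is too short.
    */
   static const uint32_t *assoc_list_lookup(const uint32_t *cells,
                                            size_t ncells, uint32_t idx,
                                            uint32_t *list_len)
   {
       if (ncells < 2) {
           return NULL;
       }
       uint32_t m = cells[0];
       uint32_t n = cells[1];
       if (idx >= m || (uint64_t)m * n > ncells - 2) {
           return NULL;
       }
       *list_len = n;
       return cells + 2 + (size_t)idx * n;
   }

With M = 2 and N = 4, index 1 selects the four integers that start at
position 6 of the decoded array.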

``ibm,dynamic-memory``
----------------------

This property describes the dynamically reconfigurable memory. It is a
property encoded array that has an integer N, the number of LMBs, followed
by N LMB list entries.

Each LMB list entry consists of the following elements:

- Logical address of the start of the LMB encoded as a 64-bit integer. This
  corresponds to the ``reg`` property in a ``/memory`` node.
- DRC index of the LMB that corresponds to the ``ibm,my-drc-index`` property
  in a ``/memory`` node.
- Four bytes reserved for expansion.
- Associativity list index for the LMB that is used as an index into the
  ``ibm,associativity-lookup-arrays`` property described earlier. This is used
  to retrieve the right associativity list to be used for this LMB.
- A 32-bit flags word. The bit with mask ``0x00000008`` defines whether
  the LMB is assigned to the partition as of boot time.
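
The entry layout above can be decoded field by field. This is a sketch under
stated assumptions rather than QEMU's implementation: fields are taken to be
big-endian, the DRC index and associativity list index are taken to be
32 bits wide (giving a 24-byte entry), and all identifiers are hypothetical.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   #define LMB_ENTRY_SIZE    24          /* assumed: 8 + 4 + 4 + 4 + 4 bytes */
   #define LMB_FLAG_ASSIGNED 0x00000008u

   struct lmb_entry {
       uint64_t addr;
       uint32_t drc_index;
       uint32_t assoc_index;
       uint32_t flags;
   };

   /* Read a 32-bit big-endian integer from a byte buffer. */
   static uint32_t be32_read(const uint8_t *p)
   {
       return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
              ((uint32_t)p[2] << 8) | (uint32_t)p[3];
   }

   /* Decode one LMB list entry starting at p (LMB_ENTRY_SIZE bytes). */
   static struct lmb_entry lmb_entry_parse(const uint8_t *p)
   {
       struct lmb_entry e;

       e.addr = ((uint64_t)be32_read(p) << 32) | be32_read(p + 4);
       e.drc_index = be32_read(p + 8);
       /* p + 12: four bytes reserved for expansion, skipped */
       e.assoc_index = be32_read(p + 16);
       e.flags = be32_read(p + 20);
       return e;
   }

   /* True if the LMB is assigned to the partition as of boot time. */
   static bool lmb_is_assigned(const struct lmb_entry *e)
   {
       return (e->flags & LMB_FLAG_ASSIGNED) != 0;
   }

Entry ``i`` of the property would start at byte offset
``4 + i * LMB_ENTRY_SIZE``, after the leading count, if the count is a
single 32-bit integer.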

``ibm,dynamic-memory-v2``
-------------------------

This property is an alternate, newer way to describe the dynamically
reconfigurable memory. It is a property encoded array that has an integer N
(the number of LMB set entries) followed by N LMB set entries. There is an LMB
set entry for each sequential group of LMBs that share common attributes.

Each LMB set entry consists of the following elements:

- Number of sequential LMBs in the entry represented by a 32-bit integer.
- Logical address of the first LMB in the set encoded as a 64-bit integer.
- DRC index of the first LMB in the set.
- Associativity list index that is used as an index into the
  ``ibm,associativity-lookup-arrays`` property described earlier. This
  is used to retrieve the right associativity list to be used for all
  the LMBs in this set.
- A 32-bit flags word that applies to all the LMBs in the set.
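
An LMB set entry is a compressed form of several per-LMB entries. The sketch
below shows one way to recover an individual LMB from a set. It assumes that
consecutive LMBs in a set are ``ibm,lmb-size`` bytes apart and have
consecutive DRC indexes; the struct and function names are hypothetical.

.. code-block:: c

   #include <stdint.h>

   /* One decoded LMB set entry (fields in host byte order). */
   struct lmb_set {
       uint32_t count;            /* number of sequential LMBs */
       uint64_t base_addr;        /* logical address of the first LMB */
       uint32_t first_drc_index;  /* DRC index of the first LMB */
       uint32_t assoc_index;      /* shared associativity list index */
       uint32_t flags;            /* shared flags word */
   };

   struct lmb {
       uint64_t addr;
       uint32_t drc_index;
   };

   /*
    * Return the k-th LMB (0-based, k < set->count) of a set, given the
    * LMB size from ibm,lmb-size.
    */
   static struct lmb lmb_set_member(const struct lmb_set *set, uint32_t k,
                                    uint64_t lmb_size)
   {
       struct lmb l;

       l.addr = set->base_addr + (uint64_t)k * lmb_size;
       l.drc_index = set->first_drc_index + k;
       return l;
   }

The associativity list index and flags need no per-LMB computation, since
they are shared by every LMB in the set.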
511