=============================
sPAPR Dynamic Reconfiguration
=============================

sPAPR or pSeries guests make use of a facility called dynamic reconfiguration
to handle hot plugging of dynamic "physical" resources like PCI cards, or
"logical"/para-virtual resources like memory, CPUs, and "physical"
host-bridges, which are generally managed by the host/hypervisor and provided
to guests as virtualized resources. The specifics of dynamic reconfiguration
are documented extensively in section 13 of the Linux on Power Architecture
Reference document ([LoPAR]_). This document provides a summary of that
information as it applies to the implementation within QEMU.

Dynamic-reconfiguration Connectors
==================================

To manage hot plug/unplug of these resources, a firmware abstraction known as
a Dynamic Resource Connector (DRC) is used to assign a particular dynamic
resource to the guest, and provide an interface for the guest to manage
configuration/removal of the resource associated with it.

Device tree description of DRCs
===============================

A set of four Open Firmware device tree array properties is used to describe
the name/index/power-domain/type of each DRC allocated to a guest at
boot time. There may be multiple sets of these arrays, rooted at different
paths in the device tree depending on the type of resource the DRCs manage.

In some cases, the DRCs themselves may be provided by a dynamic resource,
such as the DRCs managing PCI slots on a hot plugged PHB. In this case the
arrays would be fetched as part of the device tree retrieval interfaces
for hot plugged resources described under :ref:`guest-host-interface`.

The array properties are described below. Each entry/element in an array
describes the DRC identified by the element in the corresponding position
of ``ibm,drc-indexes``:

``ibm,drc-names``
-----------------

  First 4-bytes: big-endian (BE) encoded integer denoting the number of entries.

  Each entry: a NULL-terminated ``<name>`` string encoded as a byte array.

    ``<name>`` values for logical/virtual resources are defined in the Linux on
    Power Architecture Reference ([LoPAR]_) section 13.5.2.4, and basically
    consist of the type of the resource followed by a space and a numerical
    value that's unique across resources of that type.

    ``<name>`` values for "physical" resources such as PCI or VIO devices are
    defined as being "location codes", which are the "location labels" of each
    encapsulating device, starting from the chassis down to the individual slot
    for the device, concatenated by a hyphen. This provides a mapping of
    resources to a physical location in a chassis for debugging purposes. For
    QEMU, this mapping is less important, so we assign a location code that
    conforms to naming specifications, but is simply a location label for the
    slot by itself to simplify the implementation. The naming convention for
    location labels is documented in detail in the [LoPAR]_ section 12.3.1.5,
    and in our case amounts to using ``C<n>`` for PCI/VIO device slots, where
    ``<n>`` is unique across all PCI/VIO device slots.
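
The count-prefixed string layout above (which ``ibm,drc-types`` shares) can be
walked with a small helper. This is only a sketch: ``load_be32`` and
``drc_string_at`` are hypothetical names chosen for illustration, not QEMU or
guest kernel functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit big-endian integer from an unaligned byte pointer. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Return the n-th (0-based) string of a property laid out as
 * <BE32 count><string 0>\0<string 1>\0..., or NULL if n is out of
 * range or the property is truncated. Hypothetical helper.
 */
static const char *drc_string_at(const uint8_t *prop, size_t len, uint32_t n)
{
    if (len < 4) {
        return NULL;
    }
    uint32_t count = load_be32(prop);
    if (n >= count) {
        return NULL;
    }
    size_t off = 4;
    for (uint32_t i = 0; ; i++) {
        /* Each entry must be terminated inside the property. */
        const uint8_t *end = memchr(prop + off, '\0', len - off);
        if (end == NULL) {
            return NULL;
        }
        if (i == n) {
            return (const char *)(prop + off);
        }
        off = (size_t)(end - prop) + 1;
    }
}
```

Because the entries are variable length, finding entry ``n`` requires
scanning past the ``n`` strings that precede it; the position found here is
the same position to use in ``ibm,drc-indexes``.
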

``ibm,drc-indexes``
-------------------

  First 4-bytes: BE-encoded integer denoting the number of entries.

  Each 4-byte entry: BE-encoded ``<index>`` integer that is unique across all
  DRCs in the machine.

    ``<index>`` is arbitrary, but in the case of QEMU we try to maintain the
    convention used to assign them to pSeries guests on pHyp (the hypervisor
    portion of PowerVM):

      ``bit[31:28]``: integer encoding of ``<type>``, where ``<type>`` is:

        ``1`` for CPU resource.

        ``2`` for PHB resource.

        ``3`` for VIO resource.

        ``4`` for PCI resource.

        ``8`` for memory resource.

      ``bit[27:0]``: integer encoding of ``<id>``, where ``<id>`` is unique
      across all resources of specified type.
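
The bit layout above can be expressed as a pair of pack/unpack helpers. The
names below (``drc_index_make`` and friends) are invented for this sketch and
are not taken from the QEMU sources.

```c
#include <stdint.h>

/* Type codes from the list above (hypothetical enum names). */
enum drc_type_code {
    DRC_TYPE_CODE_CPU    = 1,
    DRC_TYPE_CODE_PHB    = 2,
    DRC_TYPE_CODE_VIO    = 3,
    DRC_TYPE_CODE_PCI    = 4,
    DRC_TYPE_CODE_MEMORY = 8,
};

#define DRC_INDEX_TYPE_SHIFT 28
#define DRC_INDEX_ID_MASK    0x0FFFFFFFu

/* Combine a type code (bits 31:28) and an id (bits 27:0). */
static uint32_t drc_index_make(uint32_t type, uint32_t id)
{
    return (type << DRC_INDEX_TYPE_SHIFT) | (id & DRC_INDEX_ID_MASK);
}

/* Extract the type code from bits 31:28. */
static uint32_t drc_index_type(uint32_t index)
{
    return index >> DRC_INDEX_TYPE_SHIFT;
}

/* Extract the per-type id from bits 27:0. */
static uint32_t drc_index_id(uint32_t index)
{
    return index & DRC_INDEX_ID_MASK;
}
```

For example, a PCI slot with id 5 would get index ``0x40000005`` under this
convention, and a memory resource with id 16 would get ``0x80000010``.
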

``ibm,drc-power-domains``
-------------------------

  First 4-bytes: BE-encoded integer denoting the number of entries.

  Each 4-byte entry: 32-bit, BE-encoded ``<index>`` integer that specifies the
  power domain the resource will be assigned to. In the case of QEMU we
  associate all resources with a "live insertion" domain, where the power is
  assumed to be managed automatically. The integer value for this domain is a
  special value of ``-1``.
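
As an illustration of the encoding, the sketch below fills a buffer with a
count followed by that many "live insertion" entries. The function and macro
names are hypothetical; ``-1`` is written as the 32-bit pattern
``0xFFFFFFFF``.

```c
#include <stddef.h>
#include <stdint.h>

/* -1 as a 32-bit value: the "live insertion" power domain. */
#define DRC_POWER_DOMAIN_LIVE_INSERTION 0xFFFFFFFFu

/* Write a 32-bit value in big-endian byte order. */
static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/*
 * Encode <BE32 count> followed by count BE32 entries, all set to the
 * live insertion domain. Returns the number of bytes written, or 0 if
 * the output buffer is too small. Hypothetical helper.
 */
static size_t encode_power_domains(uint8_t *out, size_t out_len,
                                   uint32_t count)
{
    size_t need = 4 + (size_t)count * 4;

    if (out_len < need) {
        return 0;
    }
    store_be32(out, count);
    for (uint32_t i = 0; i < count; i++) {
        store_be32(out + 4 + (size_t)i * 4, DRC_POWER_DOMAIN_LIVE_INSERTION);
    }
    return need;
}
```
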

``ibm,drc-types``
-----------------

  First 4-bytes: BE-encoded integer denoting the number of entries.

  Each entry: a NULL-terminated ``<type>`` string encoded as a byte array.
  ``<type>`` is assigned as follows:

    "CPU" for a CPU.

    "PHB" for a physical host-bridge.

    "SLOT" for a VIO slot.

    "28" for a PCI slot.

    "MEM" for memory resource.

.. _guest-host-interface:

Guest->Host interface to manage dynamic resources
=================================================

Each DRC is given a globally unique DRC index, and resources associated with a
particular DRC are configured/managed by the guest via a number of RTAS calls
which reference individual DRCs based on the DRC index. This can be considered
the guest->host interface.

``rtas-set-power-level``
------------------------

Set the power level for a specified power domain.

  ``arg[0]``: integer identifying power domain.

  ``arg[1]``: new power level for the domain, ``0-100``.

  ``output[0]``: status, ``0`` on success.

  ``output[1]``: power level after command.

``rtas-get-power-level``
------------------------

Get the power level for a specified power domain.

  ``arg[0]``: integer identifying power domain.

  ``output[0]``: status, ``0`` on success.

  ``output[1]``: current power level.

``rtas-set-indicator``
----------------------

Set the state of an indicator or sensor.

  ``arg[0]``: integer identifying sensor/indicator type.

  ``arg[1]``: index of sensor, for DR-related sensors this is generally the DRC
  index.

  ``arg[2]``: desired sensor value.

  ``output[0]``: status, ``0`` on success.

For the purpose of this document we focus on the indicator/sensor types
associated with a DRC. The types are:

* ``9001``: ``isolation-state``, controls/indicates whether a device has been
  made accessible to a guest. Supported sensor values:

    ``0``: ``isolate``, device is made inaccessible by guest OS.

    ``1``: ``unisolate``, device is made available to guest OS.

* ``9002``: ``dr-indicator``, controls "visual" indicator associated with
  device. Supported sensor values:

    ``0``: ``inactive``, resource may be safely removed.

    ``1``: ``active``, resource is in use and cannot be safely removed.

    ``2``: ``identify``, used to visually identify slot for interactive hot plug.

    ``3``: ``action``, in most cases, used in the same manner as identify.

* ``9003``: ``allocation-state``, generally only used for "logical" DR resources
  to request the allocation/deallocation of a resource prior to acquiring it via
  ``isolation-state->unisolate``, or after releasing it via
  ``isolation-state->isolate``, respectively. For "physical" DR (like PCI
  hot plug/unplug) the pre-allocation of the resource is implied and this sensor
  is unused. Supported sensor values:

    ``0``: ``unusable``, tell firmware/system the resource can be
    unallocated/reclaimed and added back to the system resource pool.

    ``1``: ``usable``, request the resource be allocated/reserved for use by
    guest OS.

    ``2``: ``exchange``, used to allocate a spare resource to use for fail-over
    in certain situations. Unused in QEMU.

    ``3``: ``recover``, used to reclaim a previously allocated resource that's
    not currently allocated to the guest OS. Unused in QEMU.
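
The indicator types and values listed above could be captured as constants,
with a helper that checks whether a type/value pair is one of the documented
combinations. The identifiers are invented for this sketch and need not match
the names used in QEMU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Indicator types accepted by rtas-set-indicator for DRCs. */
enum dr_indicator_type {
    DR_IND_ISOLATION_STATE  = 9001,
    DR_IND_DR_INDICATOR     = 9002,
    DR_IND_ALLOCATION_STATE = 9003,
};

/* Values for isolation-state (9001). */
enum { DR_ISOLATE = 0, DR_UNISOLATE = 1 };

/* Values for dr-indicator (9002). */
enum { DR_INACTIVE = 0, DR_ACTIVE = 1, DR_IDENTIFY = 2, DR_ACTION = 3 };

/* Values for allocation-state (9003). */
enum { DR_UNUSABLE = 0, DR_USABLE = 1, DR_EXCHANGE = 2, DR_RECOVER = 3 };

/*
 * Return true if (type, value) is one of the combinations documented
 * above. This says nothing about whether a given DRC accepts the
 * transition, only that the encoding is defined.
 */
static bool dr_indicator_value_is_defined(uint32_t type, uint32_t value)
{
    switch (type) {
    case DR_IND_ISOLATION_STATE:
        return value <= DR_UNISOLATE;
    case DR_IND_DR_INDICATOR:
        return value <= DR_ACTION;
    case DR_IND_ALLOCATION_STATE:
        return value <= DR_RECOVER;
    default:
        return false;
    }
}
```
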

``rtas-get-sensor-state``
-------------------------

Used to read an indicator or sensor value.

  ``arg[0]``: integer identifying sensor/indicator type.

  ``arg[1]``: index of sensor, for DR-related sensors this is generally the DRC
  index.

  ``output[0]``: status, ``0`` on success.

  ``output[1]``: sensor value.

For DR-related operations, the only noteworthy sensor is ``dr-entity-sense``,
which has a type value of ``9003``, as ``allocation-state`` does in the case of
``rtas-set-indicator``. The semantics/encodings of the sensor values are
distinct, however.

Supported sensor values for ``dr-entity-sense`` (``9003``) sensor:

  ``0``: empty.

    For physical resources: DRC/slot is empty.

    For logical resources: unused.

  ``1``: present.

    For physical resources: DRC/slot is populated with a device/resource.

    For logical resources: resource has been allocated to the DRC.

  ``2``: unusable.

    For physical resources: unused.

    For logical resources: DRC has no resource allocated to it.

  ``3``: exchange.

    For physical resources: unused.

    For logical resources: resource available for exchange (see
    ``allocation-state`` sensor semantics above).

  ``4``: recovery.

    For physical resources: unused.

    For logical resources: resource available for recovery (see
    ``allocation-state`` sensor semantics above).
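
The table above can be summarised in code. The sketch below uses made-up
identifiers and simply records which ``dr-entity-sense`` values are meaningful
for physical versus logical resources.

```c
#include <stdbool.h>
#include <stdint.h>

/* dr-entity-sense (9003) values as listed above. */
enum dr_entity_sense {
    DR_ENTITY_SENSE_EMPTY    = 0,
    DR_ENTITY_SENSE_PRESENT  = 1,
    DR_ENTITY_SENSE_UNUSABLE = 2,
    DR_ENTITY_SENSE_EXCHANGE = 3,
    DR_ENTITY_SENSE_RECOVERY = 4,
};

/* Human-readable name for a sensor value, or "invalid". */
static const char *dr_entity_sense_name(uint32_t value)
{
    static const char *const names[] = {
        "empty", "present", "unusable", "exchange", "recovery",
    };

    return value <= DR_ENTITY_SENSE_RECOVERY ? names[value] : "invalid";
}

/*
 * Return true if the value is used for the given kind of resource:
 * "empty" only applies to physical DRCs, "present" applies to both,
 * and the remaining values only apply to logical DRCs.
 */
static bool dr_entity_sense_applies(uint32_t value, bool logical)
{
    switch (value) {
    case DR_ENTITY_SENSE_EMPTY:
        return !logical;
    case DR_ENTITY_SENSE_PRESENT:
        return true;
    case DR_ENTITY_SENSE_UNUSABLE:
    case DR_ENTITY_SENSE_EXCHANGE:
    case DR_ENTITY_SENSE_RECOVERY:
        return logical;
    default:
        return false;
    }
}
```
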

``rtas-ibm-configure-connector``
--------------------------------

Used to fetch an OpenFirmware device tree description of the resource associated
with a particular DRC.

  ``arg[0]``: guest physical address of 4096-byte work area buffer.

  ``arg[1]``: 0, or address of additional 4096-byte work area buffer; only
  non-zero if a prior RTAS response indicated a need for additional memory.

  ``output[0]``: status:

    ``0``: completed transmittal of device tree node.

    ``1``: instruct guest to prepare for next device tree sibling node.

    ``2``: instruct guest to prepare for next device tree child node.

    ``3``: instruct guest to prepare for next device tree property.

    ``4``: instruct guest to ascend to parent device tree node.

    ``5``: instruct guest to provide additional work-area buffer via ``arg[1]``.

    ``990x``: instruct guest that operation took too long and to try again
    later.

The DRC index is encoded in the first 4-bytes of the first work area buffer.
Work area (``wa``) layout, using 4-byte offsets:

  ``wa[0]``: DRC index of the DRC to fetch device tree nodes from.

  ``wa[1]``: ``0`` (hard-coded).

  ``wa[2]``:

    For next-sibling/next-child response:

      ``wa`` offset of null-terminated string denoting the new node's name.

    For next-property response:

      ``wa`` offset of null-terminated string denoting new property's name.

  ``wa[3]``: for next-property response (unused otherwise):

      Byte-length of new property's value.

  ``wa[4]``: for next-property response (unused otherwise):

      New property's value, encoded as an OFDT-compatible byte array.
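
A guest typically calls ``rtas-ibm-configure-connector`` in a loop and
dispatches on the returned status. The sketch below only classifies the
status codes listed above; the enum and function names are hypothetical, and
``990x`` is read literally here as the range 9900 to 9909, which is an
assumption of this example.

```c
#include <stdint.h>

/* Status values returned in output[0], as listed above. */
enum cc_status {
    CC_STATUS_COMPLETE      = 0,
    CC_STATUS_NEXT_SIBLING  = 1,
    CC_STATUS_NEXT_CHILD    = 2,
    CC_STATUS_NEXT_PROPERTY = 3,
    CC_STATUS_PREV_PARENT   = 4,
    CC_STATUS_MORE_MEMORY   = 5,
};

/* What the caller should do next (names invented for this sketch). */
enum cc_action {
    CC_ACT_DONE,          /* node fully transmitted, stop calling */
    CC_ACT_NEW_NODE,      /* wa[2] holds the offset of a node name */
    CC_ACT_NEW_PROPERTY,  /* wa[2..4] describe a property */
    CC_ACT_ASCEND,        /* move back up to the parent node */
    CC_ACT_MORE_MEMORY,   /* supply a second work area in arg[1] */
    CC_ACT_RETRY_LATER,   /* 990x: busy, call again later */
    CC_ACT_ERROR,         /* anything else is treated as a failure */
};

static enum cc_action cc_classify(int32_t status)
{
    if (status >= 9900 && status <= 9909) {
        return CC_ACT_RETRY_LATER;
    }
    switch (status) {
    case CC_STATUS_COMPLETE:
        return CC_ACT_DONE;
    case CC_STATUS_NEXT_SIBLING:
    case CC_STATUS_NEXT_CHILD:
        return CC_ACT_NEW_NODE;
    case CC_STATUS_NEXT_PROPERTY:
        return CC_ACT_NEW_PROPERTY;
    case CC_STATUS_PREV_PARENT:
        return CC_ACT_ASCEND;
    case CC_STATUS_MORE_MEMORY:
        return CC_ACT_MORE_MEMORY;
    default:
        return CC_ACT_ERROR;
    }
}
```
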

Hot plug/unplug events
======================

For most DR operations, the hypervisor will issue host->guest add/remove events
using the EPOW/check-exception notification framework, where the host issues a
check-exception interrupt, then provides an RTAS event log via an
rtas-check-exception call issued by the guest in response. This framework is
documented by PAPR+ v2.7, and is already in use by QEMU for generating powerdown
requests via EPOW events.

For DR, this framework has been extended to include hotplug events, which were
previously unneeded due to direct manipulation of DR-related guest userspace
tools by host-level management such as an HMC. This level of management is not
applicable to KVM on Power, hence the reason for extending the notification
framework to support hotplug events.

The format for these EPOW-signalled events is described below under
:ref:`hot-plug-unplug-event-structure`. Note that these events are not formally
part of the PAPR+ specification. They have been superseded by a newer format,
described in the same section, and so are now deemed a "legacy" format. The
formats are similar, but the "modern" format contains additional fields/flags,
which are denoted for the purposes of this documentation with
``#ifdef GUEST_SUPPORTS_MODERN`` guards.

QEMU should assume support only for "legacy" fields/flags unless the guest
advertises support for the "modern" format via the
``ibm,client-architecture-support`` hcall by setting byte 5, bit 6 of its
``ibm,architecture-vec-5`` option vector structure (as described by [LoPAR]_,
section B.5.2.3). As with "legacy" format events, "modern" format events are
surfaced to the guest via check-exception RTAS calls, but use a dedicated event
source to signal the guest. This event source is advertised to the guest by the
addition of a ``hot-plug-events`` node under the ``/event-sources`` node of the
guest's device tree using the standard format described in [LoPAR]_,
section B.5.12.2.

.. _hot-plug-unplug-event-structure:

Hot plug/unplug event structure
===============================

The hot plug specific payload in QEMU is implemented as follows (with all values
encoded in big-endian format):

.. code-block:: c

   struct rtas_event_log_v6_hp {
   #define SECTION_ID_HOTPLUG              0x4850 /* HP */
       struct section_header {
           uint16_t section_id;            /* set to SECTION_ID_HOTPLUG */
           uint16_t section_length;        /* sizeof(rtas_event_log_v6_hp),
                                            * plus the length of the DRC name
                                            * if a DRC name identifier is
                                            * specified for hotplug_identifier
                                            */
           uint8_t section_version;        /* version 1 */
           uint8_t section_subtype;        /* unused */
           uint16_t creator_component_id;  /* unused */
       } hdr;
   #define RTAS_LOG_V6_HP_TYPE_CPU         1
   #define RTAS_LOG_V6_HP_TYPE_MEMORY      2
   #define RTAS_LOG_V6_HP_TYPE_SLOT        3
   #define RTAS_LOG_V6_HP_TYPE_PHB         4
   #define RTAS_LOG_V6_HP_TYPE_PCI         5
       uint8_t hotplug_type;               /* type of resource/device */
   #define RTAS_LOG_V6_HP_ACTION_ADD       1
   #define RTAS_LOG_V6_HP_ACTION_REMOVE    2
       uint8_t hotplug_action;             /* action (add/remove) */
   #define RTAS_LOG_V6_HP_ID_DRC_NAME          1
   #define RTAS_LOG_V6_HP_ID_DRC_INDEX         2
   #define RTAS_LOG_V6_HP_ID_DRC_COUNT         3
   #ifdef GUEST_SUPPORTS_MODERN
   #define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4
   #endif
       uint8_t hotplug_identifier;         /* type of the resource identifier,
                                            * which serves as the discriminator
                                            * for the 'drc' union field below
                                            */
   #ifdef GUEST_SUPPORTS_MODERN
       uint8_t capabilities;               /* capability flags, currently unused
                                            * by QEMU
                                            */
   #else
       uint8_t reserved;
   #endif
       union {
           uint32_t index;                 /* DRC index of resource to take action
                                            * on
                                            */
           uint32_t count;                 /* number of DR resources to take
                                            * action on (guest chooses which)
                                            */
   #ifdef GUEST_SUPPORTS_MODERN
           struct {
               uint32_t count;             /* number of DR resources to take
                                            * action on
                                            */
               uint32_t index;             /* DRC index of first resource to take
                                            * action on. guest will take action
                                            * on DRC index <index> through
                                            * DRC index <index + count - 1> in
                                            * sequential order
                                            */
           } count_indexed;
   #endif
           char name[1];                   /* string representing the name of the
                                            * DRC to take action on
                                            */
       } drc;
   } QEMU_PACKED;
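
To make the byte layout concrete, the following sketch serialises a hot plug
section that identifies its target by DRC index. It assumes the "legacy"
layout, in which the ``drc`` union holds a single 32-bit value and the section
is therefore 16 bytes long; the helper names are invented for this example and
this is not how QEMU builds the event internally.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTION_ID_HOTPLUG          0x4850 /* "HP" */
#define RTAS_LOG_V6_HP_ID_DRC_INDEX 2
#define HP_LEGACY_INDEX_SECTION_LEN 16     /* assumption: legacy layout */

static void store_be16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v >> 8);
    p[1] = (uint8_t)v;
}

static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/*
 * Write a hot plug section identified by DRC index into out.
 * Returns the section length, or 0 if out is too small.
 */
static size_t hp_section_by_index(uint8_t *out, size_t out_len,
                                  uint8_t type, uint8_t action,
                                  uint32_t drc_index)
{
    if (out_len < HP_LEGACY_INDEX_SECTION_LEN) {
        return 0;
    }
    store_be16(out + 0, SECTION_ID_HOTPLUG);          /* section_id */
    store_be16(out + 2, HP_LEGACY_INDEX_SECTION_LEN); /* section_length */
    out[4] = 1;                                       /* section_version */
    out[5] = 0;                                       /* section_subtype */
    store_be16(out + 6, 0);                           /* creator_component_id */
    out[8] = type;                                    /* hotplug_type */
    out[9] = action;                                  /* hotplug_action */
    out[10] = RTAS_LOG_V6_HP_ID_DRC_INDEX;            /* hotplug_identifier */
    out[11] = 0;                                      /* reserved */
    store_be32(out + 12, drc_index);                  /* drc.index */
    return HP_LEGACY_INDEX_SECTION_LEN;
}
```

Writing the fields byte by byte avoids depending on host endianness or
structure packing, at the cost of repeating the offsets by hand.
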

``ibm,lrdr-capacity``
=====================

``ibm,lrdr-capacity`` is a property in the ``/rtas`` device tree node that
identifies the dynamic reconfiguration capabilities of the guest. It is a
triple of ``<phys>``, ``<size>`` and ``<maxcpus>``.

  ``<phys>``, encoded in BE format, represents the maximum address in bytes and
  hence the maximum memory that can be allocated to the guest.

  ``<size>``, encoded in BE format, represents the size increments in which
  memory can be hot-plugged to the guest.

  ``<maxcpus>``, a BE-encoded integer, represents the maximum number of
  processors that the guest can have.

``pseries`` guests use this property to note the maximum allowed CPUs for the
guest.
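
The text above does not state the width of each element. The sketch below
assumes ``<phys>`` and ``<size>`` are 64-bit values and ``<maxcpus>`` is a
32-bit value, giving five 32-bit cells (20 bytes) in total; both the widths
and the function name are assumptions of this example.

```c
#include <stddef.h>
#include <stdint.h>

#define LRDR_CAPACITY_BYTES 20 /* assumption: 64 + 64 + 32 bits */

static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Store a 64-bit value as two big-endian 32-bit cells, high cell first. */
static void store_be64(uint8_t *p, uint64_t v)
{
    store_be32(p, (uint32_t)(v >> 32));
    store_be32(p + 4, (uint32_t)v);
}

/*
 * Encode the <phys, size, maxcpus> triple. Returns the number of bytes
 * written, or 0 if the output buffer is too small.
 */
static size_t encode_lrdr_capacity(uint8_t *out, size_t out_len,
                                   uint64_t max_addr, uint64_t increment,
                                   uint32_t max_cpus)
{
    if (out_len < LRDR_CAPACITY_BYTES) {
        return 0;
    }
    store_be64(out, max_addr);       /* <phys> */
    store_be64(out + 8, increment);  /* <size> */
    store_be32(out + 16, max_cpus);  /* <maxcpus> */
    return LRDR_CAPACITY_BYTES;
}
```
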

``ibm,dynamic-reconfiguration-memory``
======================================

``ibm,dynamic-reconfiguration-memory`` is a device tree node that represents
dynamically reconfigurable logical memory blocks (LMB). This node is generated
only when the guest advertises support for it via the
``ibm,client-architecture-support`` call. Memory that is not dynamically
reconfigurable is represented by ``/memory`` nodes. The properties of this node
that are of interest to the sPAPR memory hotplug implementation in QEMU are
described here.

``ibm,lmb-size``
----------------

This 64-bit integer defines the size of each dynamically reconfigurable LMB.

``ibm,associativity-lookup-arrays``
-----------------------------------

This property defines a lookup array in which the NUMA associativity
information for each LMB can be found. It is a property-encoded array
that begins with an integer M, the number of associativity lists, followed
by an integer N, the number of entries per associativity list, and then
by M associativity lists, each N integers long.

This property provides the same information as given by the ``ibm,associativity``
property in a ``/memory`` node. Each assigned LMB has an index value between
0 and M-1 which is used as an index into this table to select which
associativity list to use for the LMB. This index value for each LMB is defined
in the ``ibm,dynamic-memory`` property.
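
The lookup described above amounts to indexing a flattened M-by-N table. The
sketch below assumes that M, N and every list entry are 32-bit big-endian
cells; ``assoc_list_for`` is a hypothetical name used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Return a pointer to the first cell of associativity list `index`
 * and store the list length (in cells) in *n_out. Returns NULL if the
 * index is not below M or the property is too short to hold the list.
 */
static const uint8_t *assoc_list_for(const uint8_t *prop, size_t len,
                                     uint32_t index, uint32_t *n_out)
{
    if (len < 8) {
        return NULL;
    }
    uint32_t m = load_be32(prop);      /* number of lists */
    uint32_t n = load_be32(prop + 4);  /* entries per list */
    size_t cells = (len - 8) / 4;      /* cells available after M and N */

    if (index >= m) {
        return NULL;
    }
    /* Require (index + 1) * n cells to be present, without overflow. */
    if (n != 0 && (uint64_t)index + 1 > cells / n) {
        return NULL;
    }
    *n_out = n;
    return prop + 8 + (size_t)index * n * 4;
}
```
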

``ibm,dynamic-memory``
----------------------

This property describes the dynamically reconfigurable memory. It is a
property-encoded array that begins with an integer N, the number of LMBs,
followed by N LMB list entries.

Each LMB list entry consists of the following elements:

- Logical address of the start of the LMB encoded as a 64-bit integer. This
  corresponds to the ``reg`` property in a ``/memory`` node.
- DRC index of the LMB that corresponds to the ``ibm,my-drc-index`` property
  in a ``/memory`` node.
- Four bytes reserved for expansion.
- Associativity list index for the LMB that is used as an index into the
  ``ibm,associativity-lookup-arrays`` property described earlier. This is used
  to retrieve the right associativity list to be used for this LMB.
- A 32-bit flags word. The bit with mask ``0x00000008`` indicates whether
  the LMB is assigned to the partition as of boot time.
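
Read in order, the elements above give a 24-byte entry. The sketch below
decodes one entry under the assumption that N, the DRC index and the
associativity list index are each 32-bit big-endian cells; the struct and
function names are invented for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LMB_ENTRY_BYTES   24          /* 8 + 4 + 4 + 4 + 4 */
#define LMB_FLAG_ASSIGNED 0x00000008u /* assigned at boot time */

struct lmb_entry {
    uint64_t addr;        /* logical start address */
    uint32_t drc_index;   /* DRC index of the LMB */
    uint32_t assoc_index; /* index into ibm,associativity-lookup-arrays */
    uint32_t flags;       /* flags word */
};

static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode entry i (0-based); returns false if i is out of range. */
static bool lmb_entry_at(const uint8_t *prop, size_t len, uint32_t i,
                         struct lmb_entry *e)
{
    if (len < 4) {
        return false;
    }
    uint32_t n = load_be32(prop);

    if (i >= n || (len - 4) / LMB_ENTRY_BYTES <= i) {
        return false;
    }
    const uint8_t *p = prop + 4 + (size_t)i * LMB_ENTRY_BYTES;

    e->addr = ((uint64_t)load_be32(p) << 32) | load_be32(p + 4);
    e->drc_index = load_be32(p + 8);
    /* p + 12: four bytes reserved for expansion, skipped */
    e->assoc_index = load_be32(p + 16);
    e->flags = load_be32(p + 20);
    return true;
}
```
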

``ibm,dynamic-memory-v2``
-------------------------

This property describes the dynamically reconfigurable memory. This is
an alternate and newer way to describe dynamically reconfigurable memory.
It is a property-encoded array that has an integer N (the number of
LMB set entries) followed by N LMB set entries. There is an LMB set entry
for each sequential group of LMBs that share common attributes.

Each LMB set entry consists of the following elements:

- Number of sequential LMBs in the entry represented by a 32-bit integer.
- Logical address of the first LMB in the set encoded as a 64-bit integer.
- DRC index of the first LMB in the set.
- Associativity list index that is used as an index into the
  ``ibm,associativity-lookup-arrays`` property described earlier. This
  is used to retrieve the right associativity list to be used for all
  the LMBs in this set.
- A 32-bit flags word that applies to all the LMBs in the set.
    510 - A 32-bit flags word that applies to all the LMBs in the set.