      1 XIVE for sPAPR (pseries machines)
      2 =================================
      3 
      4 The POWER9 processor comes with a new interrupt controller
      5 architecture called XIVE, for "eXternal Interrupt Virtualization
      6 Engine". It supports a larger number of interrupt sources and offers
      7 virtualization features which enable the HW to deliver interrupts
      8 directly to virtual processors without hypervisor assistance.
      9 
     10 A QEMU ``pseries`` machine (which is PAPR compliant) using POWER9
     11 processors can run under two interrupt modes:
     12 
     13 - *Legacy Compatibility Mode*
     14 
     15   the hypervisor provides identical interfaces and similar
     16   functionality to PAPR+ Version 2.7. This is the default mode.
     17 
     18   It is also referred to as *XICS* in QEMU.
     19 
     20 - *XIVE native exploitation mode*
     21 
     22   the hypervisor provides new interfaces to manage the XIVE control
     23   structures, and provides direct control for interrupt management
     24   through MMIO pages.
     25 
     26 Which interrupt modes can be used by the machine is negotiated with
     27 the guest O/S during the Client Architecture Support negotiation
     28 sequence. The two modes are mutually exclusive.
     29 
     30 Both interrupt modes share the same IRQ number space. See below for the
     31 layout.
     32 
     33 CAS Negotiation
     34 ---------------
     35 
     36 QEMU advertises the supported interrupt modes in byte 23 of the
     37 device tree property ``ibm,arch-vec-5-platform-support``, and the OS
     38 selection for XIVE is indicated in byte 23 of the
     39 ``ibm,architecture-vec-5`` property.
     40 
     41 The interrupt modes supported by the machine depend on the CPU type
     42 (POWER9 is required for XIVE) but also on the machine property
     43 ``ic-mode`` which can be set on the command line. It can take the
     44 following values: ``xics``, ``xive``, and ``dual`` which is the
     45 default mode. ``dual`` means that both modes, XICS **and** XIVE, are
     46 supported; if the guest OS supports XIVE, XIVE will be
     47 selected.
     48 
     49 The chosen interrupt mode is activated after a reconfiguration done
     50 in a machine reset.
     51 
     52 KVM negotiation
     53 ---------------
     54 
     55 When the guest starts under KVM, the capabilities of the host kernel
     56 and QEMU are also negotiated. Depending on the version of the host
     57 kernel, KVM will advertise the XIVE capability to QEMU or not.
     58 
     59 Nevertheless, the available interrupt modes in the machine should not
     60 depend on the XIVE KVM capability of the host. On older kernels
     61 without XIVE KVM support, QEMU will use the emulated XIVE device as a
     62 fallback and on newer kernels (>=5.2), the KVM XIVE device.
     63 
     64 XIVE native exploitation mode is not supported for KVM nested guests,
     65 i.e. VMs running under an L1 hypervisor (KVM on pSeries). In that case,
     66 the hypervisor will not advertise the KVM capability and QEMU will use
     67 the emulated XIVE device, same as for older versions of KVM.
     68 
     69 As a final refinement, the user can also enable or disable the use of
     70 the KVM device with the machine option ``kernel_irqchip``.
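
As an illustration of how the two machine properties combine on the
command line, here is a minimal Python sketch that assembles the value
passed to ``-machine``. The helper name ``machine_option`` is hypothetical
and not part of QEMU; leaving an argument as ``None`` means the property is
not set, so QEMU applies its default (``dual`` for ``ic-mode``, "allowed"
for ``kernel_irqchip``).

```python
IC_MODES = ("xics", "xive", "dual")
IRQCHIP_VALUES = ("on", "off")


def machine_option(ic_mode=None, kernel_irqchip=None):
    """Return the ``-machine`` value for a pseries guest.

    Hypothetical helper: ``None`` leaves a property unset so that QEMU
    falls back to its default behaviour.
    """
    if ic_mode is not None and ic_mode not in IC_MODES:
        raise ValueError(f"unknown ic-mode: {ic_mode!r}")
    if kernel_irqchip is not None and kernel_irqchip not in IRQCHIP_VALUES:
        raise ValueError(f"unknown kernel_irqchip value: {kernel_irqchip!r}")

    parts = ["pseries"]
    if ic_mode is not None:
        parts.append(f"ic-mode={ic_mode}")
    if kernel_irqchip is not None:
        parts.append(f"kernel_irqchip={kernel_irqchip}")
    return ",".join(parts)


if __name__ == "__main__":
    # Force XIVE and the emulated device, regardless of KVM support.
    argv = ["qemu-system-ppc64", "-machine", machine_option("xive", "off")]
    print(" ".join(argv))
```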
     71 
     72 
     73 XIVE support in KVM
     74 ~~~~~~~~~~~~~~~~~~~
     75 
     76 For guest OSes supporting XIVE, the resulting interrupt modes on host
     77 kernels with XIVE KVM support are the following:
     78 
     79 ==============  =============  =============  ================
     80 ic-mode                            kernel_irqchip
     81 --------------  ----------------------------------------------
     82 /               allowed        off            on
     83                 (default)
     84 ==============  =============  =============  ================
     85 dual (default)  XIVE KVM       XIVE emul.     XIVE KVM
     86 xive            XIVE KVM       XIVE emul.     XIVE KVM
     87 xics            XICS KVM       XICS emul.     XICS KVM
     88 ==============  =============  =============  ================
     89 
     90 For legacy guest OSes without XIVE support, the resulting interrupt
     91 modes are the following:
     92 
     93 ==============  =============  =============  ================
     94 ic-mode                            kernel_irqchip
     95 --------------  ----------------------------------------------
     96 /               allowed        off            on
     97                 (default)
     98 ==============  =============  =============  ================
     99 dual (default)  XICS KVM       XICS emul.     XICS KVM
    100 xive            QEMU error(3)  QEMU error(3)  QEMU error(3)
    101 xics            XICS KVM       XICS emul.     XICS KVM
    102 ==============  =============  =============  ================
    103 
    104 (3) QEMU fails at CAS with ``Guest requested unavailable interrupt
    105     mode (XICS), either don't set the ic-mode machine property or try
    106     ic-mode=xics or ic-mode=dual``
    107 
    108 
    109 No XIVE support in KVM
    110 ~~~~~~~~~~~~~~~~~~~~~~
    111 
    112 For guest OSes supporting XIVE, the resulting interrupt modes on host
    113 kernels without XIVE KVM support are the following:
    114 
    115 ==============  =============  =============  ================
    116 ic-mode                            kernel_irqchip
    117 --------------  ----------------------------------------------
    118 /               allowed        off            on
    119                 (default)
    120 ==============  =============  =============  ================
    121 dual (default)  XIVE emul.(1)  XIVE emul.     QEMU error (2)
    122 xive            XIVE emul.(1)  XIVE emul.     QEMU error (2)
    123 xics            XICS KVM       XICS emul.     XICS KVM
    124 ==============  =============  =============  ================
    125 
    126 
    127 (1) QEMU warns with ``warning: kernel_irqchip requested but unavailable:
    128     IRQ_XIVE capability must be present for KVM``
    129     In some cases (old host kernels or KVM nested guests), one may hit a
    130     QEMU/KVM incompatibility due to device destruction in reset. QEMU fails
    131     with ``KVM is incompatible with ic-mode=dual,kernel-irqchip=on``
    132 (2) QEMU fails with ``kernel_irqchip requested but unavailable:
    133     IRQ_XIVE capability must be present for KVM``
    134 
    135 
    136 For legacy guest OSes without XIVE support, the resulting interrupt
    137 modes are the following:
    138 
    139 ==============  =============  =============  ================
    140 ic-mode                            kernel_irqchip
    141 --------------  ----------------------------------------------
    142 /               allowed        off            on
    143                 (default)
    144 ==============  =============  =============  ================
    145 dual (default)  QEMU error(4)  XICS emul.     QEMU error(4)
    146 xive            QEMU error(3)  QEMU error(3)  QEMU error(3)
    147 xics            XICS KVM       XICS emul.     XICS KVM
    148 ==============  =============  =============  ================
    149 
    150 (3) QEMU fails at CAS with ``Guest requested unavailable interrupt
    151     mode (XICS), either don't set the ic-mode machine property or try
    152     ic-mode=xics or ic-mode=dual``
    153 (4) QEMU/KVM incompatibility due to device destruction in reset. QEMU fails
    154     with ``KVM is incompatible with ic-mode=dual,kernel-irqchip=on``
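
The four tables above can be condensed into a small decision function.
The following Python sketch is only a restatement of the table cells, not
QEMU's actual implementation; the function name, argument names and the
returned strings are chosen for illustration. The warning of note (1) is
not modelled: that case simply returns ``"XIVE emul."``.

```python
def resolve_interrupt_mode(ic_mode, kernel_irqchip, guest_xive, kvm_xive):
    """Return the table cell for one combination of settings.

    ic_mode:        "dual", "xive" or "xics"
    kernel_irqchip: "allowed" (default), "off" or "on"
    guest_xive:     True if the guest OS supports XIVE
    kvm_xive:       True if the host kernel has XIVE KVM support
    """
    # Step 1: which interrupt mode does CAS end up selecting?
    if ic_mode == "xics":
        mode = "XICS"
    elif ic_mode == "xive":
        if not guest_xive:
            return "QEMU error (3)"  # guest requested XICS, unavailable
        mode = "XIVE"
    elif ic_mode == "dual":
        mode = "XIVE" if guest_xive else "XICS"
    else:
        raise ValueError(f"unknown ic-mode: {ic_mode!r}")

    # Step 2: which backend serves that mode?
    if kernel_irqchip == "off":
        return f"{mode} emul."
    if mode == "XICS":
        if ic_mode == "dual" and not kvm_xive:
            return "QEMU error (4)"  # device destruction in reset
        return "XICS KVM"
    if kvm_xive:
        return "XIVE KVM"
    # XIVE selected but no KVM capability: fall back or fail.
    return "XIVE emul." if kernel_irqchip == "allowed" else "QEMU error (2)"


if __name__ == "__main__":
    print(resolve_interrupt_mode("dual", "allowed", True, True))
```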
    155 
    156 
    157 XIVE Device tree properties
    158 ---------------------------
    159 
    160 The properties for the PAPR interrupt controller node when the *XIVE
    161 native exploitation mode* is selected should contain:
    162 
    163 - ``device_type``
    164 
    165   value should be "power-ivpe".
    166 
    167 - ``compatible``
    168 
    169   value should be "ibm,power-ivpe".
    170 
    171 - ``reg``
    172 
    173   contains the base address and size of the thread interrupt
    174   management areas (TIMA), for the User level and for the Guest OS
    175   level. Only the Guest OS level is taken into account today.
    176 
    177 - ``ibm,xive-eq-sizes``
    178 
    179   the size of the event queues. One cell per size supported, contains
    180   log2 of size, in ascending order.
    181 
    182 - ``ibm,xive-lisn-ranges``
    183 
    184   the IRQ interrupt number ranges assigned to the guest for the IPIs.
    185 
    186 The root node also exports:
    187 
    188 - ``ibm,plat-res-int-priorities``
    189 
    190   contains a list of priorities that the hypervisor has reserved for
    191   its own use.
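
Since ``ibm,xive-eq-sizes`` stores log2 values, a consumer has to shift
them back into sizes. The short Python sketch below shows that conversion.
The cell values in the example are hypothetical, and the result is assumed
to be a size in bytes, which the text above does not state explicitly.

```python
def eq_sizes(cells):
    """Convert ``ibm,xive-eq-sizes`` cells (log2 of size) to sizes.

    The property lists the cells in ascending order, so anything else
    is rejected as malformed.
    """
    cells = list(cells)
    if cells != sorted(cells):
        raise ValueError("cells must be in ascending order")
    return [1 << log2_size for log2_size in cells]


if __name__ == "__main__":
    # Hypothetical property content, for illustration only.
    for size in eq_sizes([12, 16, 21, 24]):
        print(size)
```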
    192 
    193 IRQ number space
    194 ----------------
    195 
    196 The IRQ number space of the ``pseries`` machine is 8K wide and is the
    197 same for both interrupt modes. The ranges are defined as follows:
    198 
    199 - ``0x0000 .. 0x0FFF`` 4K CPU IPIs (only used under XIVE)
    200 - ``0x1000 .. 0x1000`` 1 EPOW
    201 - ``0x1001 .. 0x1001`` 1 HOTPLUG
    202 - ``0x1002 .. 0x10FF`` unused
    203 - ``0x1100 .. 0x11FF`` 256 VIO devices
    204 - ``0x1200 .. 0x127F`` 32x4 LSIs for PHB devices
    205 - ``0x1280 .. 0x12FF`` unused
    206 - ``0x1300 .. 0x1FFF`` PHB MSIs (dynamically allocated)
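
The layout above lends itself to a simple lookup. The Python sketch below
copies the ranges from the list and classifies an IRQ number; the table
name, the function name and the short labels are illustrative and do not
come from QEMU.

```python
# (first, last, label), copied from the list above.
IRQ_RANGES = [
    (0x0000, 0x0FFF, "CPU IPI"),
    (0x1000, 0x1000, "EPOW"),
    (0x1001, 0x1001, "HOTPLUG"),
    (0x1002, 0x10FF, "unused"),
    (0x1100, 0x11FF, "VIO"),
    (0x1200, 0x127F, "PHB LSI"),
    (0x1280, 0x12FF, "unused"),
    (0x1300, 0x1FFF, "PHB MSI"),
]


def classify_irq(irq):
    """Return the label of the range that contains ``irq``."""
    for first, last, label in IRQ_RANGES:
        if first <= irq <= last:
            return label
    raise ValueError(f"IRQ {irq:#x} is outside the 8K number space")


if __name__ == "__main__":
    for irq in (0x0003, 0x1000, 0x1101, 0x1203, 0x1302):
        print(f"{irq:#06x}: {classify_irq(irq)}")
```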
    207 
    208 Monitoring XIVE
    209 ---------------
    210 
    211 The state of the XIVE interrupt controller can be queried through the
    212 monitor command ``info pic``. The output comes in two parts.
    213 
    214 First, the state of the thread interrupt context registers is dumped
    215 for each CPU:
    216 
    217 ::
    218 
    219    (qemu) info pic
    220    CPU[0000]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2
    221    CPU[0000]: USER    00   00  00    00   00  00  00   00  00000000
    222    CPU[0000]:   OS    00   ff  00    00   ff  00  ff   ff  80000400
    223    CPU[0000]: POOL    00   00  00    00   00  00  00   00  00000000
    224    CPU[0000]: PHYS    00   00  00    00   00  00  00   ff  00000000
    225    ...
    226 
    227 In the case of a ``pseries`` machine, QEMU acts as the hypervisor and only
    228 the O/S and USER register rings make sense. ``W2`` contains the vCPU CAM
    229 line which is set to the VP identifier.
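
To make the ``W2`` value in the dump easier to read, here is a small
Python sketch that splits it into a valid flag and a CAM value. The field
layout is an assumption made for this example (most significant bit as
the valid flag, low 24 bits as the CAM value) and is not described in
this document; check it against the XIVE register definitions before
relying on it.

```python
W2_VALID = 0x80000000     # assumed: most significant bit is the valid flag
W2_CAM_MASK = 0x00FFFFFF  # assumed: low 24 bits hold the CAM value


def decode_w2(w2):
    """Split a 32-bit W2 word into (valid, cam) under the assumed layout."""
    return bool(w2 & W2_VALID), w2 & W2_CAM_MASK


if __name__ == "__main__":
    # W2 of the OS ring of CPU 0 in the dump above.
    valid, cam = decode_w2(0x80000400)
    print(valid, hex(cam))
```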
    230 
    231 Then comes the routing information which aggregates the EAS and the
    232 END configuration:
    233 
    234 ::
    235 
    236    ...
    237    LISN         PQ    EISN     CPU/PRIO EQ
    238    00000000 MSI --    00000010   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
    239    00000001 MSI --    00000010   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
    240    00000002 MSI --    00000010   2/6    220/16384 @1fc2f0000 ^1 [ 80000010 ... ]
    241    00000003 MSI --    00000010   3/6    201/16384 @1fc390000 ^1 [ 80000010 ... ]
    242    00000004 MSI -Q  M 00000000
    243    00000005 MSI -Q  M 00000000
    244    00000006 MSI -Q  M 00000000
    245    00000007 MSI -Q  M 00000000
    246    00001000 MSI --    00000012   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
    247    00001001 MSI --    00000013   0/6    380/16384 @1fe3e0000 ^1 [ 80000010 ... ]
    248    00001100 MSI --    00000100   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
    249    00001101 MSI -Q  M 00000000
    250    00001200 LSI -Q  M 00000000
    251    00001201 LSI -Q  M 00000000
    252    00001202 LSI -Q  M 00000000
    253    00001203 LSI -Q  M 00000000
    254    00001300 MSI --    00000102   1/6    305/16384 @1fc230000 ^1 [ 80000010 ... ]
    255    00001301 MSI --    00000103   2/6    220/16384 @1fc2f0000 ^1 [ 80000010 ... ]
    256    00001302 MSI --    00000104   3/6    201/16384 @1fc390000 ^1 [ 80000010 ... ]
    257 
    258 The source information and configuration:
    259 
    260 - The ``LISN`` column outputs the interrupt number of the source in
    261   range ``[ 0x0 ... 0x1FFF ]`` and its type: ``MSI`` or ``LSI``.
    262 - The ``PQ`` column reflects the state of the PQ bits of the source:
    263 
    264   - ``--`` source is ready to take events
    265   - ``P-`` an event was sent and an EOI is PENDING
    266   - ``PQ`` an event was QUEUED
    267   - ``-Q`` source is OFF
    268 
    269   An ``M`` indicates that the source is *MASKED* at the EAS level.
    270 
    271 The targeting configuration:
    272 
    273 - The ``EISN`` column is the event data that will be queued in the event
    274   queue of the O/S.
    275 - The ``CPU/PRIO`` column is the tuple defining the CPU number and
    276   priority queue serving the source.
    277 - The ``EQ`` column outputs:
    278 
    279   - the current index of the event queue / the maximum number of entries
    280   - the O/S event queue address
    281   - the toggle bit
    282   - the last entries that were pushed in the event queue.
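
The column descriptions above can be turned into a small parser for the
routing lines of ``info pic``. The Python sketch below is written against
the sample output shown in this document only; the regular expression,
the function name and the dictionary keys are illustrative, and the real
output may differ between QEMU versions.

```python
import re

# Meaning of the PQ bits, as listed above.
PQ_STATES = {"--": "ready", "P-": "pending", "PQ": "queued", "-Q": "off"}

LINE_RE = re.compile(
    r"^\s*(?P<lisn>[0-9a-f]{8})\s+(?P<type>MSI|LSI)\s+(?P<pq>[P-][Q-])\s+"
    r"(?P<masked>M\s+)?(?P<eisn>[0-9a-f]{8})"
    # The targeting columns are absent for masked sources.
    r"(?:\s+(?P<cpu>\d+)/(?P<prio>\d+)\s+(?P<index>\d+)/(?P<entries>\d+)"
    r"\s+@(?P<addr>[0-9a-f]+)\s+\^(?P<toggle>[01]))?"
)


def parse_routing_line(line):
    """Parse one routing line; return a dict, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    entry = {
        "lisn": int(m["lisn"], 16),
        "type": m["type"],
        "state": PQ_STATES[m["pq"]],
        "masked": m["masked"] is not None,
        "eisn": int(m["eisn"], 16),
        "cpu": None,
        "prio": None,
        "eq": None,
    }
    if m["cpu"] is not None:
        entry["cpu"] = int(m["cpu"])
        entry["prio"] = int(m["prio"])
        entry["eq"] = {
            "index": int(m["index"]),
            "entries": int(m["entries"]),
            "address": int(m["addr"], 16),
            "toggle": int(m["toggle"]),
        }
    return entry


if __name__ == "__main__":
    sample = ("00001300 MSI --    00000102   1/6    305/16384 "
              "@1fc230000 ^1 [ 80000010 ... ]")
    print(parse_routing_line(sample))
```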