ivshmem-spec.txt (10242B)


= Device Specification for Inter-VM shared memory device =

The Inter-VM shared memory device (ivshmem) is designed to share a
memory region between multiple QEMU processes running different guests
and the host.  In order for all guests to be able to pick up the
shared memory area, it is modeled by QEMU as a PCI device exposing
said memory to the guest as a PCI BAR.

The device can use a shared memory object on the host directly, or it
can obtain one from an ivshmem server.

In the latter case, the device can additionally interrupt its peers, and
get interrupted by its peers.


== Configuring the ivshmem PCI device ==

There are two basic configurations:

- Just shared memory:

      -device ivshmem-plain,memdev=HMB,...

  This uses host memory backend HMB.  It should have option "share"
  set.

- Shared memory plus interrupts:

      -device ivshmem-doorbell,chardev=CHR,vectors=N,...

  An ivshmem server must already be running on the host.  The device
  connects to the server's UNIX domain socket via character device
  CHR.

  Each peer gets assigned a unique ID by the server.  IDs must be
  between 0 and 65535.

  Interrupts are message-signaled (MSI-X).  vectors=N configures the
  number of vectors to use.

For more details on ivshmem device properties, see the QEMU Emulator
user documentation.
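
As a rough illustration of the two configurations, the following Python
sketch assembles the QEMU arguments for each.  The backend type
memory-backend-file, the paths /dev/shm/ivshmem and /tmp/ivshmem_socket,
the size 1M, and the IDs hostmem and ivshm_chr are assumptions picked
for the example; this specification does not prescribe them.

```python
def plain_args(memdev_id="hostmem", mem_path="/dev/shm/ivshmem", size="1M"):
    """Arguments for 'just shared memory' (ivshmem-plain).

    The memory backend is created with share=on, as the device expects.
    All default values are illustrative assumptions.
    """
    backend = (
        f"memory-backend-file,id={memdev_id},"
        f"mem-path={mem_path},size={size},share=on"
    )
    return ["-object", backend, "-device", f"ivshmem-plain,memdev={memdev_id}"]


def doorbell_args(chardev_id="ivshm_chr", socket_path="/tmp/ivshmem_socket",
                  vectors=1):
    """Arguments for 'shared memory plus interrupts' (ivshmem-doorbell).

    An ivshmem server is assumed to be listening on socket_path already.
    """
    if vectors < 1:
        raise ValueError("at least one MSI-X vector is needed")
    return [
        "-chardev", f"socket,id={chardev_id},path={socket_path}",
        "-device", f"ivshmem-doorbell,chardev={chardev_id},vectors={vectors}",
    ]
```

The returned lists can be appended to an existing QEMU argument list,
for example when launching the emulator through subprocess.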


== The ivshmem PCI device's guest interface ==

The device has vendor ID 1af4, device ID 1110, revision 1.  Before
QEMU 2.6.0, it had revision 0.

=== PCI BARs ===

The ivshmem PCI device has two or three BARs:

- BAR0 holds device registers (256-byte MMIO)
- BAR1 holds MSI-X table and PBA (only ivshmem-doorbell)
- BAR2 maps the shared memory object

There are two ways to use this device:

- If you only need the shared memory part, BAR2 suffices.  This way,
  you have access to the shared memory in the guest and can use it as
  you see fit.  Memnic, for example, uses ivshmem this way from guest
  user space (see http://dpdk.org/browse/memnic).

- If you additionally need the capability for peers to interrupt each
  other, you need BAR0 and BAR1.  You will most likely want to write a
  kernel driver to handle interrupts.  This requires the device to be
  configured for interrupts.
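
For the shared-memory-only case, a Linux guest can reach BAR2 from user
space by mapping the device's sysfs resource file.  The sketch below
shows one way to do that in Python.  The helper name map_bar2 and the
PCI address in the usage note are hypothetical, and the approach assumes
a Linux guest with sufficient privileges.

```python
import mmap
import os


def map_bar2(resource_path):
    """Map a BAR2 resource file read/write and return the mmap object.

    resource_path is expected to name a file whose size equals the BAR
    size, as the sysfs resourceN files of a PCI device do on Linux.
    """
    fd = os.open(resource_path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        return mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        # The mapping remains valid after the descriptor is closed.
        os.close(fd)
```

A guest could then call, for instance,
map_bar2("/sys/bus/pci/devices/0000:00:04.0/resource2"), where the PCI
address depends on where the device sits in that particular guest.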

Before QEMU 2.6.0, BAR2 could initially be invalid if the device was
configured for interrupts.  It became safely accessible only after
the ivshmem server provided the shared memory.  These devices have PCI
revision 0 rather than 1.  Guest software should wait for the
IVPosition register (described below) to become non-negative before
accessing BAR2.

Revision 0 of the device cannot tell guest software whether it is
configured for interrupts.

=== PCI device registers ===

BAR 0 contains the following registers:

    Offset  Size  Access      On reset  Function
        0     4   read/write        0   Interrupt Mask
                                        bit 0: peer interrupt (rev 0)
                                               reserved       (rev 1)
                                        bit 1..31: reserved
        4     4   read/write        0   Interrupt Status
                                        bit 0: peer interrupt (rev 0)
                                               reserved       (rev 1)
                                        bit 1..31: reserved
        8     4   read-only   0 or ID   IVPosition
       12     4   write-only      N/A   Doorbell
                                        bit 0..15: vector
                                        bit 16..31: peer ID
       16   240   none            N/A   reserved

Software should only access the registers as specified in column
"Access".  Reserved bits should be ignored on read, and preserved on
write.

In revision 0 of the device, the Interrupt Status and Interrupt Mask
registers together control the legacy INTx interrupt when the device
has no MSI-X capability: INTx is asserted when the bit-wise AND of
Status and Mask is non-zero.  Interrupt Status register bit 0 becomes
1 when an interrupt request from a peer is received.  Reading the
register clears it.
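
The revision 0 INTx behaviour described above can be modelled in a few
lines of Python.  This is only a toy model for reasoning about the
register semantics; the class and method names are invented for the
example and do not correspond to QEMU code.

```python
class Rev0Interrupt:
    """Toy model of the revision 0 Interrupt Mask / Status registers."""

    def __init__(self, has_msix=False):
        self.has_msix = has_msix
        self.mask = 0      # Interrupt Mask register, 0 on reset
        self.status = 0    # Interrupt Status register, 0 on reset

    def peer_interrupt(self):
        """A peer requested an interrupt: Status bit 0 becomes 1."""
        self.status |= 1

    def intx_asserted(self):
        """INTx is asserted when (Status AND Mask) != 0 and there is no MSI-X."""
        return not self.has_msix and (self.status & self.mask) != 0

    def read_status(self):
        """Reading the Status register returns its value and clears it."""
        value, self.status = self.status, 0
        return value
```

With mask left at 0, a peer interrupt sets the status bit but does not
assert INTx; setting mask bit 0 lets the interrupt through, and reading
the status register deasserts it again.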

IVPosition Register: if the device is not configured for interrupts,
this is zero.  Else, it is the device's ID (between 0 and 65535).

Before QEMU 2.6.0, the register may read -1 for a short while after
reset.  These devices have PCI revision 0 rather than 1.

There is no good way for software to find out whether the device is
configured for interrupts.  A positive IVPosition means interrupts,
but zero could be either.
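
The cases above can be summarised as a small decision function.  The
sketch assumes the guest has read the register as an unsigned 32-bit
value, so that -1 shows up as 0xFFFFFFFF; the function name and the
returned labels are made up for the example.

```python
def interpret_ivposition(raw):
    """Classify a raw 32-bit IVPosition value read from offset 8 of BAR0."""
    if raw == 0xFFFFFFFF:
        # -1: seen only on revision 0 devices shortly after reset.
        return "not ready"
    if raw == 0:
        # Either no interrupts, or interrupts with ID 0.
        return "ambiguous"
    if 1 <= raw <= 0xFFFF:
        # A positive ID implies the device is configured for interrupts.
        return "interrupts"
    raise ValueError(f"unexpected IVPosition value {raw:#x}")
```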

Doorbell Register: writing this register requests to interrupt a peer.
The written value's high 16 bits are the ID of the peer to interrupt,
and its low 16 bits select an interrupt vector.
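
As a small worked example, the value to write to the Doorbell register
can be computed as follows.  The helper name doorbell_value is
hypothetical; only the bit layout comes from this specification.

```python
def doorbell_value(peer_id, vector):
    """Build the 32-bit Doorbell value: peer ID in bits 16..31, vector in bits 0..15."""
    if not 0 <= peer_id <= 0xFFFF:
        raise ValueError("peer ID must fit in 16 bits")
    if not 0 <= vector <= 0xFFFF:
        raise ValueError("vector must fit in 16 bits")
    return (peer_id << 16) | vector
```

For instance, interrupting peer 3 on vector 1 means writing 0x00030001
to offset 12 of BAR0.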

If the device is not configured for interrupts, the write is ignored.

If the interrupt hasn't completed setup, the write is ignored.  The
device cannot tell guest software whether setup is complete.
Interrupts can regress to this state on migration.

If the peer with the requested ID isn't connected, or it has fewer
interrupt vectors connected, the write is ignored.  The device cannot
tell guest software what peers are connected, or how many interrupt
vectors are connected.

The peer's interrupt for this vector then becomes pending.  There is
no way for software to clear the pending bit, and a polling mode of
operation is therefore impossible.

If the peer is a revision 0 device without MSI-X capability, its
Interrupt Status register is set to 1.  This asserts INTx unless
masked by the Interrupt Mask register.  The device cannot communicate
the interrupt vector to guest software then.

With multiple MSI-X vectors, different vectors can be used to indicate
different events have occurred.  The semantics of interrupt vectors
are left to the application.


== Interrupt infrastructure ==

When configured for interrupts, the peers share eventfd objects in
addition to shared memory.  The shared resources are managed by an
ivshmem server.

=== The ivshmem server ===

The server listens on a UNIX domain socket.

For each new client that connects to the server, the server
- picks an ID,
- creates eventfd file descriptors for the interrupt vectors,
- sends the ID and the file descriptor for the shared memory to the
  new client,
- sends connect notifications for the new client to the other clients
  (these contain file descriptors for sending interrupts),
- sends connect notifications for the other clients to the new client,
  and
- sends interrupt setup messages to the new client (these contain file
  descriptors for receiving interrupts).

The first client to connect to the server receives ID zero.

When a client disconnects from the server, the server sends disconnect
notifications to the other clients.

The next section describes the protocol in detail.

If the server terminates without sending disconnect notifications for
its connected clients, the clients can elect to continue.  They can
communicate with each other normally, but won't receive disconnect
notification on disconnect, and no new clients can connect.  There is
no way for the clients to connect to a restarted server.  The device
cannot tell guest software whether the server is still up.

Example server code is in contrib/ivshmem-server/.  It is not meant
for production use.  It assumes all clients use the same number of
interrupt vectors.

A standalone client is in contrib/ivshmem-client/.  It can be useful
for debugging.

=== The ivshmem Client-Server Protocol ===

An ivshmem device configured for interrupts connects to an ivshmem
server.  This section details the protocol between the two.

The connection is one-way: the server sends messages to the client.
Each message consists of a single 8-byte little-endian signed number,
and may be accompanied by a file descriptor via SCM_RIGHTS.  Both
client and server close the connection on error.
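
The message framing can be illustrated with Python's struct module.
The sketch covers only the 8-byte payload; passing the accompanying
file descriptor via SCM_RIGHTS is left out, and the function names are
invented for the example.

```python
import struct

# One protocol message: a single 8-byte little-endian signed integer.
MESSAGE = struct.Struct("<q")


def encode_message(value):
    """Encode one message payload as the server would send it."""
    return MESSAGE.pack(value)


def decode_message(data):
    """Decode one message payload; reject anything that is not 8 bytes."""
    if len(data) != MESSAGE.size:
        raise ValueError("a message is exactly 8 bytes long")
    return MESSAGE.unpack(data)[0]
```

For example, the number -1 used in the shared memory message is
encoded as eight 0xff bytes.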

Note: QEMU currently doesn't close the connection immediately on
error, but only when the character device is destroyed.

On connect, the server sends the following messages in order:

1. The protocol version number, currently zero.  The client should
   close the connection on receipt of versions it can't handle.

2. The client's ID.  This is unique among all clients of this server.
   IDs must be between 0 and 65535, because the Doorbell register
   provides only 16 bits for them.

3. The number -1, accompanied by the file descriptor for the shared
   memory.

4. Connect notifications for existing other clients, if any.  This is
   a peer ID (number between 0 and 65535 other than the client's ID),
   repeated N times.  Each repetition is accompanied by one file
   descriptor.  These are for interrupting the peer with that ID using
   vector 0,..,N-1, in order.  If the client is configured for fewer
   vectors, it closes the extra file descriptors.  If it is configured
   for more, the extra vectors remain unconnected.

5. Interrupt setup.  This is the client's own ID, repeated N times.
   Each repetition is accompanied by one file descriptor.  These are
   for receiving interrupts from peers using vector 0,..,N-1, in
   order.  If the client is configured for fewer vectors, it closes
   the extra file descriptors.  If it is configured for more, the
   extra vectors remain unconnected.
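
To make the ordering of steps 1 to 5 concrete, the sketch below lists
the messages a server would send to a newly connected client.  Each
entry is a (value, carries_fd) pair; real file descriptors are left
out.  The function name and the tuple representation are assumptions
made for the example.

```python
PROTOCOL_VERSION = 0


def initial_messages(client_id, peer_ids, vectors):
    """Return the on-connect message sequence as (value, carries_fd) pairs.

    peer_ids lists the clients that were already connected, and vectors
    is N, the number of interrupt vectors the server hands out per client.
    """
    messages = [
        (PROTOCOL_VERSION, False),  # step 1: protocol version
        (client_id, False),         # step 2: the client's own ID
        (-1, True),                 # step 3: shared memory descriptor
    ]
    for peer in peer_ids:           # step 4: one message per peer vector
        messages += [(peer, True)] * vectors
    messages += [(client_id, True)] * vectors  # step 5: interrupt setup
    return messages
```

For a client with ID 1 joining a server that already has client 0 and
uses two vectors, this yields seven messages: the version, the ID, the
shared memory message, two for peer 0, and two for the client itself.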

From then on, the server sends these kinds of messages:

6. Connection / disconnection notification.  This is a peer ID.

  - If the number comes with a file descriptor, it's a connection
    notification, exactly like in step 4.

  - Else, it's a disconnection notification for the peer with that ID.
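
One way a client might keep track of these notifications is sketched
below.  The PeerTable class is hypothetical and does not mirror QEMU's
implementation; file descriptors are treated as plain integers, and
closing them is left to the caller.

```python
class PeerTable:
    """Tracks the interrupt descriptors received for each connected peer."""

    def __init__(self):
        # peer ID -> descriptors, where the list index is the vector number
        self.peers = {}

    def handle(self, peer_id, fd=None):
        """Process one notification and return descriptors the caller should close."""
        if not 0 <= peer_id <= 0xFFFF:
            raise ValueError("peer ID must be between 0 and 65535")
        if fd is None:
            # No descriptor attached: the peer has disconnected.
            return self.peers.pop(peer_id, [])
        # Descriptor attached: next vector of a (possibly new) peer.
        self.peers.setdefault(peer_id, []).append(fd)
        return []
```

Descriptors arrive in vector order, so the position of a descriptor in
the per-peer list is the vector it triggers.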

Known bugs:

* The protocol changed incompatibly in QEMU 2.5.  Before that,
  messages were native-endian longs, and there was no version number.

* The protocol is poorly designed.

=== The ivshmem Client-Client Protocol ===

An ivshmem device configured for interrupts receives eventfd file
descriptors for interrupting peers and getting interrupted by peers
from the server, as explained in the previous section.

To interrupt a peer, the device writes the 8-byte integer 1 in native
byte order to the respective file descriptor.

To receive an interrupt, the device reads and discards as many 8-byte
integers as it can.
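
Both operations can be sketched in Python as follows.  The function
names are invented for the example, and the receiving descriptor is
assumed to be in non-blocking mode.  Note that a real eventfd
accumulates written values in a counter, so a single read typically
returns the sum of all pending writes.

```python
import os
import struct

# An 8-byte unsigned integer in native byte order.
EVENT = struct.Struct("=Q")


def interrupt_peer(fd):
    """Signal a peer by writing the 8-byte integer 1 to its descriptor."""
    os.write(fd, EVENT.pack(1))


def drain(fd):
    """Read and discard 8-byte integers until the non-blocking fd runs dry.

    Returns the number of integers that were read.
    """
    count = 0
    while True:
        try:
            data = os.read(fd, EVENT.size)
        except BlockingIOError:
            return count
        if len(data) < EVENT.size:
            return count
        count += 1
```

To try this without an eventfd, a non-blocking pipe can stand in for
the descriptor pair: two calls to interrupt_peer on the write end are
followed by drain on the read end returning 2.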