Paravirtualized RDMA Device (PVRDMA)
====================================


1. Description
==============
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux kernel driver as is; no special guest modifications
are needed.

While it is compatible with the VMware device, it can also communicate with
bare-metal RDMA-enabled machines as peers.

It does not require an RDMA HCA in the host; it can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, allowing memory
over-commit, and, even if not implemented yet, migration support will be
possible with some HW assistance.

A project presentation accompanies this document:
- https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4730/original/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf


2. Setup
========


2.1 Guest setup
===============
Fedora 27+ kernels work out of the box; older distributions
require updating the kernel to 4.14 to get the pvrdma driver.
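
The kernel requirement can be checked with a small version comparison (a
sketch; kernel_ok is a hypothetical helper, not part of any distribution):

```shell
# Hypothetical helper: succeeds when the given kernel version is >= 4.14,
# the first mainline release that ships the pvrdma guest driver.
kernel_ok() {
    printf '%s\n4.14\n' "$1" | sort -V | head -n1 | grep -qx '4.14'
}

# Check the running kernel (strip the local suffix, e.g. "-generic").
if kernel_ok "$(uname -r | cut -d- -f1)"; then
    echo "pvrdma guest driver available"
else
    echo "kernel update required"
fi
```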

However, the libpvrdma library needed by user-level software is still
not available as part of the distributions, so the rdma-core library
needs to be compiled and optionally installed.

Please follow the instructions at:
  https://github.com/linux-rdma/rdma-core.git

2.2 Host Setup
==============
The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SR-IOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices;
each one requires a separate instance (rxe or SR-IOV VF).

2.2.1 Soft-RoCE backend (rxe)
=============================
A stable version of rxe is required; Fedora 27+ or a Linux
kernel 4.14+ is preferred.

The rdma_rxe module is part of the Linux kernel but not loaded by default.
Install the user-level library (librxe) following the instructions at:
https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

Associate an Ethernet interface with rxe by running:
   rxe_cfg add eth0
An rxe0 ibdevice interface will be created and can be used as a pvrdma backend.

2.2.2 RDMA device Virtual Function backend
==========================================
Nothing special is required; the pvrdma device can work not only with
Ethernet links, but also with InfiniBand links.
All that is needed is an ibdevice with an active port; for Mellanox cards
it will be something like mlx5_6, which can be used as the backend.

2.2.3 QEMU setup
================
Configure QEMU with the --enable-rdma flag after installing
the required RDMA libraries.


3. Usage
========


3.1 VM Memory settings
======================
Currently the device works only with memory-backend RAM,
and it must be marked as "shared":
   -m 1G \
   -object memory-backend-ram,id=mb1,size=1G,share \
   -numa node,memdev=mb1 \
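
The memory options above can be combined with the device and MAD chardev
options into a complete invocation (a sketch only; rxe0, eth0, the tap
netdev and the socket path are illustrative and must match the host setup
described in the following sections):

```shell
qemu-system-x86_64 -m 1G \
    -object memory-backend-ram,id=mb1,size=1G,share \
    -numa node,memdev=mb1 \
    -chardev socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads \
    -netdev tap,id=net0 \
    -device vmxnet3,netdev=net0,addr=10.0,multifunction=on \
    -device pvrdma,addr=10.1,ibdev=rxe0,netdev=eth0,mad-chardev=mads
```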


3.2 MAD Multiplexer
===================
The MAD multiplexer is a service that exposes a MAD-like interface to VMs
in order to overcome the limitation that only a single entity can register
with the MAD layer to send and receive RDMA-CM MAD packets.

To build rdmacm-mux run:
# make rdmacm-mux

Before running rdmacm-mux, make sure that neither the ib_cm nor the rdma_cm
kernel module is loaded, otherwise the rdmacm-mux service will fail to start.

The application accepts 3 command-line arguments and exposes a UNIX socket
to pass control and data to it.
-d rdma-device-name  Name of RDMA device to register with
-s unix-socket-path  Path of unix socket to listen on (default /var/run/rdmacm-mux)
-p rdma-device-port  Port number of RDMA device to register with (default 1)
The final UNIX socket file name is a concatenation of the 3 arguments, so
for example for device mlx5_0 on port 2 the socket
/var/run/rdmacm-mux-mlx5_0-2 will be created.
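
The naming rule can be sketched as a simple concatenation:

```shell
# Build the rdmacm-mux socket path from its three arguments:
# socket prefix, device name and port number.
sock_prefix=/var/run/rdmacm-mux
dev=mlx5_0
port=2
echo "${sock_prefix}-${dev}-${port}"
# -> /var/run/rdmacm-mux-mlx5_0-2
```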

pvrdma requires this service.

Please refer to contrib/rdmacm-mux for more details.

3.3 Service exposed by libvirt daemon
=====================================
Control over the RDMA device's GID table is exercised by updating the
device's Ethernet function addresses.
Usually the first GID entry is determined by the MAC address, the second by
the first IPv6 address and the third by the IPv4 address. Other entries can
be added by adding more IP addresses. The reverse also applies: whenever an
address is removed, the corresponding GID entry is removed.
This is handled by the network and RDMA stacks. Whenever an address is
added, the ib_core driver is notified and calls the device driver's add_gid
function, which in turn updates the device.
To support this, the pvrdma device hooks into the create_bind and
destroy_bind HW commands triggered by the pvrdma driver in the guest.

Whenever a change is made to the pvrdma port's GID table, a special QMP
message is sent to be processed by libvirt, which updates the address of
the backend Ethernet device.

pvrdma requires the libvirt daemon to be running.

3.4 PCI devices settings
========================
A RoCE device exposes two functions - an Ethernet one and an RDMA one.
To support this, the pvrdma device is composed of two PCI functions: an
Ethernet device of type vmxnet3 on PCI slot 0 and a PVRDMA device on PCI
slot 1. The Ethernet function can be used for other Ethernet purposes such
as IP.

3.5 Device parameters
=====================
- netdev: Specifies the Ethernet device function name on the host, for
  example enp175s0f0. For a Soft-RoCE device (rxe) this would be the
  Ethernet device used to create it.
- ibdev: The IB device name on the host, for example rxe0, mlx5_0 etc.
- mad-chardev: The name of the MAD multiplexer char device.
- ibport: In case of a multi-port device (such as Mellanox's HCA) this
  specifies the port to use. If not set, port 1 will be used.
- dev-caps-max-mr-size: The maximum size of MR.
- dev-caps-max-qp:      Maximum number of QPs.
- dev-caps-max-cq:      Maximum number of CQs.
- dev-caps-max-mr:      Maximum number of MRs.
- dev-caps-max-pd:      Maximum number of PDs.
- dev-caps-max-ah:      Maximum number of AHs.

Notes:
- The first 3 parameters are mandatory settings; the rest have
  defaults.
- The dev-caps-* parameters define upper limits; the final values are
  adjusted according to the backend device's limitations.
- netdev can be extracted from ibdev's sysfs
  (/sys/class/infiniband/<ibdev>/device/net/)
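
The sysfs lookup in the last note can be scripted; the snippet below
demonstrates it on a mock sysfs tree (on a real host, drop the mktemp part
and use /sys directly; rxe0 and enp175s0f0 are illustrative names):

```shell
# Mock sysfs layout: <root>/class/infiniband/<ibdev>/device/net/<netdev>
sysroot=$(mktemp -d)
mkdir -p "$sysroot/class/infiniband/rxe0/device/net/enp175s0f0"

# The netdev backing a given ibdev is the directory name under .../net/
ls "$sysroot/class/infiniband/rxe0/device/net/"
# -> enp175s0f0
```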


3.6 Example
===========
Define bridge device with vmxnet3 network backend:
<interface type='bridge'>
  <mac address='56:b4:44:e9:62:dc'/>
  <source bridge='bridge1'/>
  <model type='vmxnet3'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
</interface>

Define pvrdma device:
<qemu:commandline>
  <qemu:arg value='-object'/>
  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
  <qemu:arg value='-numa'/>
  <qemu:arg value='node,memdev=mb1'/>
  <qemu:arg value='-chardev'/>
  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
  <qemu:arg value='-device'/>
  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
</qemu:commandline>



4. Implementation details
=========================


4.1 Overview
============
The device acts like a proxy between the guest driver and the host
ibdevice interface.
On the configuration path:
 - For every hardware resource request (PD/QP/CQ/...) pvrdma requests
   a resource from the backend interface, maintaining a 1-1 mapping
   between the guest and the host.
On the data path:
 - Every post_send/receive received from the guest is converted into
   a post_send/receive for the backend. The buffers' data is not touched
   or copied, resulting in near bare-metal performance for large enough
   buffers.
 - Completions from the backend interface result in completions for
   the pvrdma device.

4.2 PCI BARs
============
PCI BARs:
	BAR 0 - MSI-X
        MSI-X vectors:
		(0) Command - used when execution of a command is completed.
		(1) Async - not in use.
		(2) Completion - used when a completion event is placed in
		  the device's CQ ring.
	BAR 1 - Registers
        --------------------------------------------------------
        | VERSION |  DSR | CTL | REQ | ERR |  ICR | IMR  | MAC |
        --------------------------------------------------------
		DSR - Address of driver/device shared memory used
              for the command channel, used for passing:
			    - General info such as driver version
			    - Address of 'command' and 'response'
			    - Address of async ring
			    - Address of device's CQ ring
			    - Device capabilities
		CTL - Device control operations (activate, reset etc.)
		IMR - Interrupt mask
		REQ - Command execution register
		ERR - Operation status

	BAR 2 - UAR
        ---------------------------------------------------------
        | QP_NUM  | SEND/RECV Flag ||  CQ_NUM |   ARM/POLL Flag |
        ---------------------------------------------------------
		- Offset 0 used for QP operations (send and recv)
		- Offset 4 used for CQ operations (arm and poll)

4.3 Major flows
===============

4.3.1 Create CQ
===============
    - Guest driver
        - Allocates pages for the CQ ring
        - Creates a page directory (pdir) to hold the CQ ring's pages
        - Initializes the CQ ring
        - Initializes a 'Create CQ' command object (cqe, pdir etc.)
        - Copies the command to the 'command' address
        - Writes 0 into the REQ register
    - Device
        - Reads the request object from the 'command' address
        - Allocates a CQ object and initializes the CQ ring based on pdir
        - Creates the backend CQ
        - Writes the operation status to the ERR register
        - Posts a command-interrupt to the guest
    - Guest driver
        - Reads the HW response code from the ERR register

4.3.2 Create QP
===============
    - Guest driver
        - Allocates pages for the send and receive rings
        - Creates a page directory (pdir) to hold the rings' pages
        - Initializes a 'Create QP' command object (max_send_wr,
          send_cq_handle, recv_cq_handle, pdir etc.)
        - Copies the object to the 'command' address
        - Writes 0 into the REQ register
    - Device
        - Reads the request object from the 'command' address
        - Allocates the QP object and initializes
            - Send and recv rings based on pdir
            - Send and recv ring state
        - Creates the backend QP
        - Writes the operation status to the ERR register
        - Posts a command-interrupt to the guest
    - Guest driver
        - Reads the HW response code from the ERR register

4.3.3 Post receive
==================
    - Guest driver
        - Initializes a wqe and places it on the recv ring
        - Writes qpn|qp_recv_bit (31) to the QP offset in the UAR
    - Device
        - Extracts the qpn from the UAR
        - Walks through the ring and does the following for each wqe
            - Prepares the backend CQE context to be used when
              receiving a completion from the backend (wr_id, op_code,
              emu_cq_num)
            - For each sge, prepares a backend sge
            - Calls the backend's post_recv

4.3.4 Process backend events
============================
    - Done by a dedicated thread used to process backend events;
      at initialization it is attached to the device and creates
      the communication channel.
    - Thread main loop:
        - Polls for completions
        - Extracts emu_cq_num, wr_id and op_code from the context
        - Writes a CQE to the CQ ring
        - Writes the CQ number to the device CQ
        - Sends a completion-interrupt to the guest
        - Deallocates the context
        - Acks the event to the backend



5. Limitations
==============
- The device is obviously limited by the guest Linux driver's implementation
  of the VMware device API.
- The memory registration mechanism requires mremap for every page in the
  buffer in order to map it to a contiguous virtual address range. Since
  this is not on the data path it should not matter much. If the default
  max MR size is increased, be aware that memory registration can take up
  to 0.5 seconds for 1GB of memory.
- The device requires the target page size to be the same as the host page
  size, otherwise it will fail to init.
- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
  attached, so it can't work with huge pages. This limitation will be
  addressed in the future; however, QEMU allocates guest RAM with
  MADV_HUGEPAGE, so if there are enough huge pages available, QEMU will
  use them. QEMU will fail to init if the requirements are not met.



6. Performance
==============
By design the pvrdma device exits on each post-send/receive, so for small
buffers the performance is affected; however, for medium buffers it becomes
close to bare metal, and from 1MB buffers and up it reaches bare-metal
performance.
(Tested with 2 VMs, with the pvrdma devices connected to 2 VFs of the same
device.)

All the above assumes no memory registration is done on the data path.