pvrdma.txt (12768B)
Paravirtualized RDMA Device (PVRDMA)
====================================


1. Description
===============
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with its Linux kernel driver as is, with no need for any special
guest modifications.

While it complies with the VMware device, it can also communicate with bare
metal RDMA-enabled machines as peers.

It does not require an RDMA HCA in the host; it can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, allowing memory
over-commit and, even if not implemented yet, migration support will be
possible with some HW assistance.

A project presentation accompanies this document:
- https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4730/original/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf



2. Setup
========


2.1 Guest setup
===============
Fedora 27+ kernels work out of the box; older distributions
require updating the kernel to 4.14 to include the pvrdma driver.

However, the libpvrdma library needed by user-level software is still
not available as part of the distributions, so the rdma-core library
needs to be compiled and optionally installed.

Please follow the instructions at:
  https://github.com/linux-rdma/rdma-core.git


2.2 Host Setup
==============
The pvrdma backend is an ibdevice interface that can be exposed
either by a Soft-RoCE (rxe) device on machines with no RDMA device,
or by an HCA SR-IOV function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices;
each one requires a separate instance (rxe or SR-IOV VF).


2.2.1 Soft-RoCE backend (rxe)
=============================
A stable version of rxe is required; Fedora 27+ or a Linux
kernel 4.14+ is preferred.

The rdma_rxe module is part of the Linux kernel but not loaded by default.
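On such a kernel the module can typically be loaded by hand before
continuing. A minimal sketch (the module name comes from the text above; the
verification step is an assumption about a typical setup):

```shell
# Load the Soft-RoCE kernel module (it is not loaded by default)
sudo modprobe rdma_rxe

# Confirm the module is now present
lsmod | grep rdma_rxe
```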
Install the user-level library (librxe) following the instructions at:
  https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home

Associate an ETH interface with rxe by running:
   rxe_cfg add eth0
An rxe0 ibdevice interface will be created that can be used as the pvrdma
backend.


2.2.2 RDMA device Virtual Function backend
==========================================
Nothing special is required; the pvrdma device can work not only with
Ethernet links, but also with InfiniBand links.
All that is needed is an ibdevice with an active port; for Mellanox cards
this will be something like mlx5_6, which can serve as the backend.


2.2.3 QEMU setup
================
Configure QEMU with the --enable-rdma flag, after installing
the required RDMA libraries.



3. Usage
========


3.1 VM Memory settings
======================
Currently the device works only with memory-backed RAM,
and it must be marked as "shared":
   -m 1G \
   -object memory-backend-ram,id=mb1,size=1G,share \
   -numa node,memdev=mb1 \


3.2 MAD Multiplexer
===================
MAD Multiplexer is a service that exposes a MAD-like interface to VMs in
order to overcome the limitation that only a single entity can register with
the MAD layer to send and receive RDMA-CM MAD packets.

To build rdmacm-mux run:
# make rdmacm-mux

Before running rdmacm-mux, make sure that neither the ib_cm nor the rdma_cm
kernel module is loaded, otherwise the rdmacm-mux service will fail to start.

The application accepts 3 command line arguments and exposes a UNIX socket
to pass control and data to it.
-d rdma-device-name  Name of RDMA device to register with
-s unix-socket-path  Path to unix socket to listen on (default /var/run/rdmacm-mux)
-p rdma-device-port  Port number of RDMA device to register with (default 1)
The final UNIX socket file name is a concatenation of the 3 arguments, so,
for example, for device mlx5_0 on port 2 the socket
/var/run/rdmacm-mux-mlx5_0-2 will be created.

pvrdma requires this service.

Please refer to contrib/rdmacm-mux for more details.


3.3 Service exposed by libvirt daemon
=====================================
The control over the RDMA device's GID table is done by updating the
device's Ethernet function addresses.
Usually the first GID entry is determined by the MAC address, the second by
the first IPv6 address and the third by the IPv4 address. Other entries can
be added by adding more IP addresses. The reverse also applies: whenever an
address is removed, the corresponding GID entry is removed.
The process is done by the network and RDMA stacks. Whenever an address is
added, the ib_core driver is notified and calls the device driver's add_gid
function, which in turn updates the device.
To support this, the pvrdma device hooks into the create_bind and
destroy_bind HW commands triggered by the pvrdma driver in the guest.

Whenever a change is made to the pvrdma port's GID table, a special QMP
message is sent to be processed by libvirt, which updates the address of the
backend Ethernet device.

pvrdma requires the libvirt service to be up.


3.4 PCI devices settings
========================
A RoCE device exposes two functions - an Ethernet function and an RDMA
function.
To support this, the pvrdma device is composed of two PCI functions: an
Ethernet device of type vmxnet3 on PCI slot 0 and a PVRDMA device on PCI
slot 1. The Ethernet function can be used for other Ethernet purposes such
as IP.
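The two-function layout can also be sketched directly on a plain QEMU
command line (an illustration only; the slot number, backend names and
chardev id are assumptions, chosen to match the libvirt example in section
3.6):

```shell
# vmxnet3 Ethernet on function 0, pvrdma on function 1 of the same slot
qemu-system-x86_64 ... \
    -device vmxnet3,addr=10.0,multifunction=on \
    -device pvrdma,addr=10.1,ibdev=rxe0,netdev=eth0,mad-chardev=mads
```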

3.5 Device parameters
=====================
- netdev: Specifies the Ethernet device function name on the host, for
  example enp175s0f0. For a Soft-RoCE device (rxe) this would be the
  Ethernet device used to create it.
- ibdev: The IB device name on the host, for example rxe0, mlx5_0 etc.
- mad-chardev: The name of the MAD multiplexer char device.
- ibport: In the case of a multi-port device (such as Mellanox's HCA) this
  specifies the port to use. If not set, 1 will be used.
- dev-caps-max-mr-size: The maximum size of MR.
- dev-caps-max-qp: Maximum number of QPs.
- dev-caps-max-cq: Maximum number of CQs.
- dev-caps-max-mr: Maximum number of MRs.
- dev-caps-max-pd: Maximum number of PDs.
- dev-caps-max-ah: Maximum number of AHs.

Notes:
- The first 3 parameters are mandatory settings; the rest have
  defaults.
- The dev-caps parameters define upper limits; the final values are
  adjusted by the backend device limitations.
- netdev can be extracted from ibdev's sysfs
  (/sys/class/infiniband/<ibdev>/device/net/)


3.6 Example
===========
Define a bridge device with the vmxnet3 network backend:
<interface type='bridge'>
  <mac address='56:b4:44:e9:62:dc'/>
  <source bridge='bridge1'/>
  <model type='vmxnet3'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
</interface>

Define the pvrdma device:
<qemu:commandline>
  <qemu:arg value='-object'/>
  <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
  <qemu:arg value='-numa'/>
  <qemu:arg value='node,memdev=mb1'/>
  <qemu:arg value='-chardev'/>
  <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
  <qemu:arg value='-device'/>
  <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
</qemu:commandline>



4. Implementation details
=========================


4.1 Overview
============
The device acts as a proxy between the guest driver and the host
ibdevice interface.
On the configuration path:
- For every hardware resource request (PD/QP/CQ/...) the pvrdma device
  requests a resource from the backend interface, maintaining a 1-1 mapping
  between the guest and the host.
On the data path:
- Every post_send/receive received from the guest is converted into
  a post_send/receive for the backend. The buffer data is not touched
  or copied, resulting in near bare-metal performance for large enough
  buffers.
- Completions from the backend interface result in completions for
  the pvrdma device.


4.2 PCI BARs
============
PCI BARs:
BAR 0 - MSI-X
        MSI-X vectors:
        (0) Command - used when execution of a command is completed.
        (1) Async - not in use.
        (2) Completion - used when a completion event is placed in the
            device's CQ ring.
BAR 1 - Registers
        --------------------------------------------------------
        | VERSION | DSR | CTL | REQ | ERR | ICR | IMR | MAC |
        --------------------------------------------------------
        DSR - Address of driver/device shared memory used
              for the command channel, used for passing:
              - General info such as driver version
              - Address of 'command' and 'response'
              - Address of async ring
              - Address of device's CQ ring
              - Device capabilities
        CTL - Device control operations (activate, reset etc)
        IMR - Set interrupt mask
        REQ - Command execution register
        ERR - Operation status

BAR 2 - UAR
        ---------------------------------------------------------
        | QP_NUM | SEND/RECV Flag ||  CQ_NUM |  ARM/POLL Flag   |
        ---------------------------------------------------------
        - Offset 0 used for QP operations (send and recv)
        - Offset 4 used for CQ operations (arm and poll)


4.3 Major flows
===============

4.3.1 Create CQ
===============
- Guest driver
  - Allocates pages for the CQ ring
  - Creates a page directory (pdir) to hold the CQ ring's pages
  - Initializes the CQ ring
  - Initializes the 'Create CQ' command object (cqe, pdir etc)
  - Copies the command to the 'command' address
  - Writes 0 into the REQ register
- Device
  - Reads the request object from the 'command' address
  - Allocates the CQ object and initializes the CQ ring based on the pdir
  - Creates the backend CQ
  - Writes the operation status to the ERR register
  - Posts a command-interrupt to the guest
- Guest driver
  - Reads the HW response code from the ERR register

4.3.2 Create QP
===============
- Guest driver
  - Allocates pages for the send and receive rings
  - Creates a page directory (pdir) to hold the rings' pages
  - Initializes the 'Create QP' command object (max_send_wr,
    send_cq_handle, recv_cq_handle, pdir etc)
  - Copies the object to the 'command' address
  - Writes 0 into the REQ register
- Device
  - Reads the request object from the 'command' address
  - Allocates the QP object and initializes
    - Send and recv rings based on the pdir
    - Send and recv ring state
  - Creates the backend QP
  - Writes the operation status to the ERR register
  - Posts a command-interrupt to the guest
- Guest driver
  - Reads the HW response code from the ERR register

4.3.3 Post receive
==================
- Guest driver
  - Initializes a wqe and places it on the recv ring
  - Writes qpn|qp_recv_bit (31) to the QP offset in the UAR
- Device
  - Extracts the qpn from the UAR
  - Walks through the ring and does the following for each wqe
    - Prepares the backend CQE context to be used when
      receiving a completion from the backend (wr_id, op_code, emu_cq_num)
    - For each sge, prepares a backend sge
    - Calls the backend's post_recv

4.3.4 Process backend events
============================
- Done by a dedicated thread used to process backend events;
  at initialization it is attached to the device and creates
  the communication channel.
- Thread main loop:
  - Polls for completions
  - Extracts the QEMU _cq_num, wr_id and op_code from the context
  - Writes the CQE to the CQ ring
  - Writes the CQ number to the device CQ
  - Sends a completion-interrupt to the guest
  - Deallocates the context
  - Acks the event to the backend



5. Limitations
==============
- The device is obviously limited by the features of the VMware device API
  that the guest Linux driver implements.
- The memory registration mechanism requires an mremap for every page in the
  buffer in order to map it to a contiguous virtual address range. Since
  this is not on the data path it should not matter much. If the default max
  MR size is increased, be aware that memory registration can take up to 0.5
  seconds per 1GB of memory.
- The device requires the target page size to be the same as the host page
  size, otherwise it will fail to init.
- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is
  attached, so it can't work with huge pages. This limitation will be
  addressed in the future; however, QEMU allocates guest RAM with
  MADV_HUGEPAGE, so if there are enough huge pages available, QEMU will use
  them. QEMU will fail to init if these requirements are not met.



6. Performance
==============
By design the pvrdma device exits on each post-send/receive, so for small
buffers the performance is affected; however, for medium buffers it gets
close to bare metal, and from 1MB buffers and up it reaches bare-metal
performance.
(Tested with 2 VMs, with the pvrdma devices connected to 2 VFs of the same
device.)

All the above assumes no memory registration is done on the data path.
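One way to reproduce such measurements is with the perftest bandwidth tools
run between the two VMs. A sketch only; the device name, message size and
exact invocation are assumptions about a typical perftest installation, not
the setup used for the numbers above:

```shell
# Server VM: wait for a connection on ibdevice rxe0, 1MB messages
ib_write_bw -d rxe0 -s 1048576

# Client VM: connect to the server VM and run the same test
ib_write_bw -d rxe0 -s 1048576 <server-vm-ip>
```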