Multi-process QEMU
===================

.. note::

   This is the design document for multi-process QEMU. It does not
   necessarily reflect the status of the current implementation, which
   may lack features or be considerably different from what is described
   in this document. This document is still useful as a description of
   the goals and general direction of this feature.

   Please refer to the following wiki for the latest details:
   https://wiki.qemu.org/Features/MultiProcessQEMU

QEMU is often used as the hypervisor for virtual machines running in the
Oracle cloud. Since one of the advantages of cloud computing is the
ability to run many VMs from different tenants in the same cloud
infrastructure, a guest that compromised its hypervisor could
potentially use the hypervisor's access privileges to access data it is
not authorized for.

QEMU can be susceptible to security attacks because it is a large,
monolithic program that provides many features to the VMs it services.
Many of these features can be configured out of QEMU, but even a reduced
configuration QEMU has a large amount of code a guest can potentially
attack. Separating QEMU helps reduce the attack surface by limiting each
component in the system to accessing only the resources it needs to
perform its job.

QEMU services
-------------

QEMU can be broadly described as providing three main services. One is a
VM control point, where VMs can be created, migrated, re-configured, and
destroyed. A second is to emulate the CPU instructions within the VM,
often accelerated by HW virtualization features such as Intel's VT
extensions. Finally, it provides IO services to the VM by emulating HW
IO devices, such as disk and network devices.

A multi-process QEMU
~~~~~~~~~~~~~~~~~~~~

A multi-process QEMU involves separating QEMU services into separate
host processes.
Each of these processes can be given only the privileges
it needs to provide its service, e.g., a disk service could be given
access only to the disk images it provides, and not be allowed to
access other files, or any network devices. An attacker who compromised
this service would not be able to use this exploit to access files or
devices beyond what the disk service was given access to.

A QEMU control process would remain, but in multi-process mode, it would
have no direct interfaces to the VM. During VM execution, it would still
provide the user interface to hot-plug devices or live migrate the VM.

A first step in creating a multi-process QEMU is to separate IO services
from the main QEMU program, which would continue to provide CPU
emulation, i.e., the control process would also be the CPU emulation
process. In a later phase, CPU emulation could be separated from the
control process.

Separating IO services
----------------------

Separating IO services into individual host processes is a good place to
begin for a couple of reasons. One is that the sheer number of IO devices
QEMU can emulate provides a large surface of interfaces which could
potentially be exploited, and, indeed, have been a source of exploits in
the past. Another is that the modular nature of QEMU device emulation
code provides interface points where the QEMU functions that perform
device emulation can be separated from the QEMU functions that manage
the emulation of guest CPU instructions. The devices emulated in the
separate process are referred to as remote devices.

QEMU device emulation
~~~~~~~~~~~~~~~~~~~~~

QEMU uses an object-oriented SW architecture for device emulation code.
Configured objects are all compiled into the QEMU binary, then objects
are instantiated by name when used by the guest VM.
For example, the
code to emulate a device named "foo" is always present in QEMU, but its
instantiation code is only run when the device is included in the target
VM (e.g., via the QEMU command line as *-device foo*).

The object model is hierarchical, so device emulation code names its
parent object (such as "pci-device" for a PCI device) and QEMU will
instantiate a parent object before calling the device's instantiation
code.

Current separation models
~~~~~~~~~~~~~~~~~~~~~~~~~

In order to separate the device emulation code from the CPU emulation
code, the device object code must run in a different process. There are
a couple of existing QEMU features that can run emulation code
separately from the main QEMU process. These are examined below.

vhost user model
^^^^^^^^^^^^^^^^

Virtio guest device drivers can be connected to vhost user applications
in order to perform their IO operations. This model uses special virtio
device drivers in the guest and vhost user device objects in QEMU, but
once the QEMU vhost user code has configured the vhost user application,
mission-mode IO is performed by the application. The vhost user
application is a daemon process that can be contacted via a known UNIX
domain socket.

vhost socket
''''''''''''

As mentioned above, one of the tasks of the vhost device object within
QEMU is to contact the vhost application and send it configuration
information about this device instance. As part of the configuration
process, the application can also be sent other file descriptors over
the socket, which then can be used by the vhost user application in
various ways, some of which are described below.

vhost MMIO store acceleration
'''''''''''''''''''''''''''''

VMs are often run using HW virtualization features via the KVM kernel
driver.
This driver allows QEMU to accelerate the emulation of guest CPU
instructions by running the guest in a virtual HW mode. When the guest
executes instructions that cannot be executed by virtual HW mode,
execution returns to the KVM driver so it can inform QEMU to emulate the
instructions in SW.

One of the events that can cause a return to QEMU is when a guest device
driver accesses an IO location. QEMU then dispatches the memory
operation to the corresponding QEMU device object. In the case of a
vhost user device, the memory operation would need to be sent over a
socket to the vhost application. This path is accelerated by the QEMU
virtio code by setting up an eventfd file descriptor from which the vhost
application can directly receive MMIO store notifications from the KVM
driver, instead of needing them to be sent to the QEMU process first.

vhost interrupt acceleration
''''''''''''''''''''''''''''

Another optimization used by the vhost application is the ability to
directly inject interrupts into the VM via the KVM driver, again
bypassing the need to send the interrupt back to the QEMU process first.
The QEMU virtio setup code configures the KVM driver with an eventfd
that triggers the device interrupt in the guest when the eventfd is
written. This irqfd file descriptor is then passed to the vhost user
application program.

vhost access to guest memory
''''''''''''''''''''''''''''

The vhost application is also allowed to directly access guest memory,
instead of needing to send the data as messages to QEMU. This is also
done with file descriptors sent to the vhost user application by QEMU.
These descriptors can be passed to ``mmap()`` by the vhost application
to map the guest address space into the vhost application.
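The file-descriptor-backed sharing described above can be illustrated with a short, hypothetical sketch (not QEMU code). Here a plain temporary file stands in for the guest RAM backing, and two ``mmap()`` views in one process stand in for the QEMU and vhost processes; in reality the descriptor would cross a UNIX domain socket between processes:

.. code:: python

    import mmap
    import tempfile

    GUEST_MEM_SIZE = 4096

    # "QEMU" side: guest RAM backed by a file descriptor.
    backing = tempfile.TemporaryFile()
    backing.truncate(GUEST_MEM_SIZE)
    qemu_view = mmap.mmap(backing.fileno(), GUEST_MEM_SIZE)

    # "vhost" side: maps the same descriptor to see guest memory directly.
    vhost_view = mmap.mmap(backing.fileno(), GUEST_MEM_SIZE)

    # A guest store becomes immediately visible to the vhost application,
    # with no message exchange needed.
    qemu_view[0:4] = b"DATA"
    print(vhost_view[0:4])  # b'DATA'

Because both mappings are backed by the same file, no copy or message round trip is needed for the second view to observe the store.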
IOMMUs introduce another level of complexity, since the address given to
the guest virtio device to DMA to or from is not a guest physical
address. This case is handled by having vhost code within QEMU register
as a listener for IOMMU mapping changes. The vhost application maintains
a cache of IOMMU translations: sending translation requests back to
QEMU on cache misses, and in turn receiving flush requests from QEMU
when mappings are purged.

applicability to device separation
''''''''''''''''''''''''''''''''''

Much of the vhost model can be re-used by separated device emulation. In
particular, the ideas of using a socket between QEMU and the device
emulation application, using a file descriptor to inject interrupts into
the VM via KVM, and allowing the application to ``mmap()`` the guest
should be re-used.

There are, however, some notable differences between how a vhost
application works and the needs of separated device emulation. The most
basic is that vhost uses custom virtio device drivers which always
trigger IO with MMIO stores. A separated device emulation model must
work with existing IO device models and guest device drivers. MMIO loads
break vhost store acceleration since they are synchronous - guest
progress cannot continue until the load has been emulated. By contrast,
stores are asynchronous; the guest can continue after the store event
has been sent to the vhost application.

Another difference is that in the vhost user model, a single daemon can
support multiple QEMU instances. This is contrary to the security regime
desired, in which the emulation application should only be allowed to
access the files or devices the VM it's running on behalf of can access.
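The vhost-style IOTLB scheme described earlier in this section (cache locally, ask QEMU on a miss, purge on a flush) can be modeled with a minimal sketch. The class and the identity-plus-offset translation function are purely illustrative assumptions, not QEMU or vhost code:

.. code:: python

    class IOTLBCache:
        """Hypothetical model of a vhost-style IOTLB translation cache."""

        def __init__(self, translate_in_qemu, page=4096):
            self.cache = {}            # IOVA page -> guest PA page
            self.translate_in_qemu = translate_in_qemu
            self.page = page
            self.misses = 0

        def translate(self, iova):
            key = iova & ~(self.page - 1)
            if key not in self.cache:  # miss: round trip to "QEMU"
                self.misses += 1
                self.cache[key] = self.translate_in_qemu(key)
            return self.cache[key] | (iova & (self.page - 1))

        def flush(self, iova):
            # QEMU purged a mapping; drop the stale entry.
            self.cache.pop(iova & ~(self.page - 1), None)

    # Stand-in for the QEMU side of the translation protocol.
    iommu = IOTLBCache(lambda iova_page: iova_page + 0x100000)
    assert iommu.translate(0x2004) == 0x102004  # miss, then cached
    assert iommu.translate(0x2008) == 0x102008  # served from the cache
    assert iommu.misses == 1
    iommu.flush(0x2000)
    iommu.translate(0x2004)
    assert iommu.misses == 2  # flushed entry must be re-fetched

The same structure reappears below for the separated-device IOTLB: only the transport (a socket to QEMU) differs.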
qemu-io model
^^^^^^^^^^^^^

``qemu-io`` is a test harness used to test changes to the QEMU block backend
object code (e.g., the code that implements disk images for disk driver
emulation). ``qemu-io`` is not a device emulation application per se, but it
does compile the QEMU block objects into a separate binary from the main
QEMU one. This could be useful for disk device emulation, since its
emulation applications will need to include the QEMU block objects.

New separation model based on proxy objects
-------------------------------------------

A different model based on proxy objects in the QEMU program
communicating with remote emulation programs could provide separation
while minimizing the changes needed to the device emulation code. The
rest of this section is a discussion of how a proxy object model would
work.

Remote emulation processes
~~~~~~~~~~~~~~~~~~~~~~~~~~

The remote emulation process will run the QEMU object hierarchy without
modification. The device emulation objects will also be based on the
QEMU code, because for anything but the simplest device, it would not be
tractable to re-implement both the object model and the many device
backends that QEMU has.

The processes will communicate with the QEMU process over UNIX domain
sockets. The processes can be executed either as standalone processes,
or be executed by QEMU. In both cases, the host backends the emulation
processes will provide are specified on their command lines, as they
would be for QEMU. For example:

::

    disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0 \
              -blockdev driver=qcow2,node-name=drive0,file=file0

would indicate process *disk-proc* uses a qcow2 emulated disk named
*file0* as its backend.

Emulation processes may emulate more than one guest controller.
A common
configuration might be to put all controllers of the same device class
(e.g., disk, network, etc.) in a single process, so that all backends of
the same type can be managed by a single QMP monitor.

communication with QEMU
^^^^^^^^^^^^^^^^^^^^^^^

The first argument to the remote emulation process will be a Unix domain
socket that connects with the Proxy object. This is a required argument.

::

    disk-proc <socket number> <backend list>

remote process QMP monitor
^^^^^^^^^^^^^^^^^^^^^^^^^^

Remote emulation processes can be monitored via QMP, similar to QEMU
itself. The QMP monitor socket is specified the same as for a QEMU
process:

::

    disk-proc -qmp unix:/tmp/disk-mon,server

can be monitored over the UNIX socket path */tmp/disk-mon*.

QEMU command line
~~~~~~~~~~~~~~~~~

Each remote device emulated in a remote process on the host is
represented as a *-device* of type *pci-proxy-dev*. A socket
sub-option to this option specifies the Unix socket that connects
to the remote process. An *id* sub-option is required, and it should
be the same id as used in the remote process.

::

    qemu-system-x86_64 ... -device pci-proxy-dev,id=lsi0,socket=3

can be used to add a device emulated in a remote process.

QEMU management of remote processes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU is not aware of the type of the remote PCI device. It is
a pass-through device as far as QEMU is concerned.

communication with emulation process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

primary channel
'''''''''''''''

The primary channel (referred to as com in the code) is used to bootstrap
the remote process. It is also used to pass on device-agnostic commands
like reset.
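A device-agnostic command on a channel like this needs some framing. Purely as an illustration (the message names and layout below are assumptions, not the actual multi-process QEMU wire protocol), a command could be a length-prefixed JSON record:

.. code:: python

    import json
    import struct

    def encode_command(cmd, **args):
        """Frame a command as a 4-byte little-endian length + JSON payload."""
        payload = json.dumps({"cmd": cmd, "args": args}).encode()
        return struct.pack("<I", len(payload)) + payload

    def decode_command(buf):
        """Reverse of encode_command(); returns the decoded command dict."""
        (length,) = struct.unpack_from("<I", buf)
        return json.loads(buf[4:4 + length])

    # A hypothetical device-agnostic reset sent over the primary channel.
    msg = encode_command("device_reset", id="lsi0")
    decoded = decode_command(msg)
    assert decoded == {"cmd": "device_reset", "args": {"id": "lsi0"}}

The length prefix lets the receiver read exactly one message from a stream socket before parsing it.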
per-device channels
'''''''''''''''''''

Each remote device communicates with QEMU using a dedicated communication
channel. The proxy object sets up this channel using the primary
channel during its initialization.

QEMU device proxy objects
~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU has an object model based on sub-classes inherited from the
"object" super-class. The sub-classes that are of interest here are the
"device" and "bus" sub-classes, whose child sub-classes make up the
device tree of a QEMU emulated system.

The proxy object model will use device proxy objects to replace the
device emulation code within the QEMU process. These objects will live
in the same place in the object and bus hierarchies as the objects they
replace, i.e., the proxy object for an LSI SCSI controller will be a
sub-class of the "pci-device" class, and will have the same PCI bus
parent and the same SCSI bus child objects as the LSI controller object
it replaces.

It is worth noting that the same proxy object is used to mediate with
all types of remote PCI devices.

object initialization
^^^^^^^^^^^^^^^^^^^^^

The Proxy device objects are initialized in the exact same manner in
which any other QEMU device would be initialized.

In addition, the Proxy objects perform the following two tasks:

- Parse the "socket" sub-option and connect to the remote process
  using this channel
- Use the "id" sub-option to connect to the emulated device on the
  separate process

class\_init
'''''''''''

The ``class_init()`` method of a proxy object will, in general, behave
similarly to the object it replaces, including setting any static
properties and methods needed by the proxy.
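The placement of a proxy in the same class hierarchy as the device it replaces can be sketched in miniature. The class names and the list standing in for a channel are hypothetical, chosen only to mirror the "socket" and "id" sub-options described above:

.. code:: python

    class PCIDevice:
        """Stand-in for the QEMU "pci-device" parent class."""
        def realize(self):
            raise NotImplementedError

    class PCIProxyDevice(PCIDevice):
        """Proxy living where the real device object would live."""

        def __init__(self, dev_id, channel):
            self.dev_id = dev_id    # from the "id" sub-option
            self.channel = channel  # from the "socket" sub-option

        def realize(self):
            # Connect to the emulated device in the remote process
            # instead of emulating it locally.
            self.channel.append(("connect", self.dev_id))

    channel = []  # a list stands in for the per-device channel
    proxy = PCIProxyDevice("lsi0", channel)
    proxy.realize()
    assert channel == [("connect", "lsi0")]
    assert isinstance(proxy, PCIDevice)  # same place in the hierarchy

Because the proxy subclasses the same parent, its bus parent and child buses are wired up exactly as they would be for the device it replaces.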
instance\_init / realize
''''''''''''''''''''''''

The ``instance_init()`` and ``realize()`` functions would only need to
perform tasks related to being a proxy, such as registering its own
MMIO handlers, or creating a child bus that other proxy devices can be
attached to later.

Other tasks will be device-specific. For example, PCI device objects
will initialize the PCI config space in order to make a valid PCI device
tree within the QEMU process.

address space registration
^^^^^^^^^^^^^^^^^^^^^^^^^^

Most devices are driven by guest device driver accesses to IO addresses
or ports. The QEMU device emulation code uses QEMU's memory region
function calls (such as ``memory_region_init_io()``) to add callback
functions that QEMU will invoke when the guest accesses the device's
areas of the IO address space. When a guest driver does access the
device, the VM will exit HW virtualization mode and return to QEMU,
which will then look up and execute the corresponding callback function.

A proxy object would need to mirror the memory region calls the actual
device emulator would perform in its initialization code, but with its
own callbacks. When invoked by QEMU as a result of a guest IO operation,
they will forward the operation to the device emulation process.

PCI config space
^^^^^^^^^^^^^^^^

PCI devices also have a configuration space that can be accessed by the
guest driver. Guest accesses to this space are not handled by the device
emulation object, but by its PCI parent object. Much of this space is
read-only, but certain registers (especially BAR and MSI-related ones)
need to be propagated to the emulation process.

PCI parent proxy
''''''''''''''''

One way to propagate guest PCI config accesses is to create a
"pci-device-proxy" class that can serve as the parent of a PCI device
proxy object.
This class's parent would be "pci-device" and it would
override the PCI parent's ``config_read()`` and ``config_write()``
methods with ones that forward these operations to the emulation
program.

interrupt receipt
^^^^^^^^^^^^^^^^^

A proxy for a device that generates interrupts will need to create a
socket to receive interrupt indications from the emulation process. An
incoming interrupt indication would then be sent up to its bus parent to
be injected into the guest. For example, a PCI device object may use
``pci_set_irq()``.

live migration
^^^^^^^^^^^^^^

The proxy will register to save and restore any *vmstate* it needs over
a live migration event. The device proxy does not need to manage the
remote device's *vmstate*; that will be handled by the remote process
proxy (see below).

QEMU remote device operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Generic device operations, such as DMA, will be performed by the remote
process proxy by sending messages to the remote process.

DMA operations
^^^^^^^^^^^^^^

DMA operations would be handled much like vhost applications do. One of
the initial messages sent to the emulation process is a guest memory
table. Each entry in this table consists of a file descriptor and size
that the emulation process can ``mmap()`` to directly access guest
memory, similar to ``vhost_user_set_mem_table()``. Note that guest memory
must be backed by file descriptors, such as when QEMU is given the
*-mem-path* command line option.

IOMMU operations
^^^^^^^^^^^^^^^^

When the emulated system includes an IOMMU, the remote process proxy in
QEMU will need to create a socket for IOMMU requests from the emulation
process. It will handle those requests with an
``address_space_get_iotlb_entry()`` call.
In order to handle IOMMU
unmaps, the remote process proxy will also register as a listener on the
device's DMA address space. When an IOMMU memory region is created
within the DMA address space, an IOMMU notifier for unmaps will be added
to the memory region that will forward unmaps to the emulation process
over the IOMMU socket.

device hot-plug via QMP
^^^^^^^^^^^^^^^^^^^^^^^

A QMP "device\_add" command can add a device emulated by a remote
process. It will also have an "rid" option, just as the *-device*
command line option does. The remote process may either be one
started at QEMU startup, or be one added by the "add-process" QMP
command described above. In either case, the remote process proxy will
forward the new device's JSON description to the corresponding emulation
process.

live migration
^^^^^^^^^^^^^^

The remote process proxy will also register for live migration
notifications with ``vmstate_register()``. When called to save state,
the proxy will send the remote process a secondary socket file
descriptor to save the remote process's device *vmstate* over. The
incoming byte stream length and data will be saved as the proxy's
*vmstate*. When the proxy is resumed on its new host, this *vmstate*
will be extracted, and a secondary socket file descriptor will be sent
to the new remote process through which it receives the *vmstate* in
order to restore the devices there.

device emulation in remote process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The parts of QEMU that the emulation program will need include the
object model; the memory emulation objects; the device emulation objects
of the targeted device, and any dependent devices; and the device's
backends.
It will also need code to set up the machine environment,
handle requests from the QEMU process, and route machine-level requests
(such as interrupts or IOMMU mappings) back to the QEMU process.

initialization
^^^^^^^^^^^^^^

The process initialization will follow the same sequence QEMU follows.
It will first initialize the backend objects, then the device emulation
objects. The JSON descriptions sent by the QEMU process will drive which
objects need to be created.

- address spaces

Before the device objects are created, the initial address spaces and
memory regions must be configured with ``memory_map_init()``. This
creates a RAM memory region object (*system\_memory*) and an IO memory
region object (*system\_io*).

- RAM

RAM memory region creation will follow how ``pc_memory_init()`` creates
them, but must use ``memory_region_init_ram_from_fd()`` instead of
``memory_region_allocate_system_memory()``. The file descriptors needed
will be supplied by the guest memory table from above. Those RAM regions
would then be added to the *system\_memory* memory region with
``memory_region_add_subregion()``.

- PCI

IO initialization will be driven by the JSON descriptions sent from the
QEMU process. For a PCI device, a PCI bus will need to be created with
``pci_root_bus_new()``, and a PCI memory region will need to be created
and added to the *system\_memory* memory region with
``memory_region_add_subregion_overlap()``. The overlap version is
required for architectures where PCI memory overlaps with RAM memory.

MMIO handling
^^^^^^^^^^^^^

The device emulation objects will use ``memory_region_init_io()`` to
install their MMIO handlers, and ``pci_register_bar()`` to associate
those handlers with a PCI BAR, as they do within QEMU currently.
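The BAR-to-handler association described above amounts to address range dispatch. A minimal sketch (the class and handler below are hypothetical, only mirroring what ``memory_region_init_io()`` and ``pci_register_bar()`` arrange inside QEMU):

.. code:: python

    class MMIODispatcher:
        """Map a guest physical address to the BAR handler that owns it."""

        def __init__(self):
            self.ranges = []  # (base, length, handler)

        def register_bar(self, base, length, handler):
            self.ranges.append((base, length, handler))

        def access(self, addr, is_write, value=None):
            for base, length, handler in self.ranges:
                if base <= addr < base + length:
                    return handler(addr - base, is_write, value)
            raise ValueError(f"unclaimed MMIO address {addr:#x}")

    regs = {}

    def bar0_handler(offset, is_write, value):
        if is_write:
            regs[offset] = value     # MMIO store
            return None
        return regs.get(offset, 0)   # MMIO load

    d = MMIODispatcher()
    d.register_bar(0xfe000000, 0x1000, bar0_handler)
    d.access(0xfe000010, True, 0xabcd)
    assert d.access(0xfe000010, False) == 0xabcd

Note that this dispatch only works if both sides agree on the BAR base addresses, which is why guest BAR programming must be forwarded to the emulation process.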
In order to use ``address_space_rw()`` in the emulation process to
handle MMIO requests from QEMU, the PCI physical addresses must be the
same in the QEMU process and the device emulation process. In order to
accomplish that, guest BAR programming must also be forwarded from QEMU
to the emulation process.

interrupt injection
^^^^^^^^^^^^^^^^^^^

When device emulation wants to inject an interrupt into the VM, the
request climbs the device's bus object hierarchy until the point where a
bus object knows how to signal the interrupt to the guest. The details
depend on the type of interrupt being raised.

- PCI pin interrupts

On x86 systems, there is an emulated IOAPIC object attached to the root
PCI bus object, and the root PCI object forwards interrupt requests to
it. The IOAPIC object, in turn, calls the KVM driver to inject the
corresponding interrupt into the VM. The simplest way to handle this in
an emulation process would be to set up the root PCI bus driver (via
``pci_bus_irqs()``) to send an interrupt request back to the QEMU
process, and have the device proxy object reflect it up the PCI tree
there.

- PCI MSI/X interrupts

PCI MSI/X interrupts are implemented in HW as DMA writes to a
CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives
these DMA writes, then calls into the KVM driver to inject the interrupt
into the VM. A simple emulation process implementation would be to send
the MSI DMA address from QEMU as a message at initialization, then
install an address space handler at that address which forwards the MSI
message back to QEMU.

DMA operations
^^^^^^^^^^^^^^

When an emulation object wants to DMA into or out of guest memory, it
first must use ``dma_memory_map()`` to convert the DMA address to a local
virtual address.
The emulation process memory region objects set up above
will be used to translate the DMA address to a local virtual address the
device emulation code can access.

IOMMU
^^^^^

When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory
regions to translate the DMA address to a guest physical address before
that physical address can be translated to a local virtual address. The
emulation process will need similar functionality.

- IOTLB cache

The emulation process will maintain a cache of recent IOMMU translations
(the IOTLB). When the ``translate()`` callback of an IOMMU memory region
is invoked, the IOTLB cache will be searched for an entry that will map
the DMA address to a guest PA. On a cache miss, a message will be sent
back to QEMU requesting the corresponding translation entry, which will
both be used to return a guest address and be added to the cache.

- IOTLB purge

The IOMMU emulation will also need to act on unmap requests from QEMU.
These happen when the guest IOMMU driver purges an entry from the
guest's translation table.

live migration
^^^^^^^^^^^^^^

When a remote process receives a live migration indication from QEMU, it
will set up a channel using the received file descriptor with
``qio_channel_socket_new_fd()``. This channel will be used to create a
*QEMUFile* that can be passed to ``qemu_save_device_state()`` to send
the process's device state back to QEMU. This method will be reversed on
restore - the channel will be passed to ``qemu_loadvm_state()`` to
restore the device state.

Accelerating device emulation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The messages that are required to be sent between QEMU and the emulation
process can add considerable latency to IO operations.
The optimizations
described below attempt to ameliorate this effect by allowing the
emulation process to communicate directly with the kernel KVM driver.
The KVM file descriptors created would be passed to the emulation process
via initialization messages, much like the guest memory table is.

MMIO acceleration
^^^^^^^^^^^^^^^^^

Vhost user applications can receive guest virtio driver stores directly
from KVM. The issue with the eventfd mechanism used by vhost user is
that it does not pass any data with the event indication, so it cannot
handle guest loads or guest stores that carry store data. This concept
could, however, be expanded to cover more cases.

The expanded idea would require a new type of KVM device:
*KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master
descriptor that QEMU can use for configuration, and a slave descriptor
that the emulation process can use to receive MMIO notifications. QEMU
would create both descriptors using the KVM driver, and pass the slave
descriptor to the emulation process via an initialization message.

data structures
^^^^^^^^^^^^^^^

- guest physical range

The guest physical range structure describes the address range that a
device will respond to. It includes the base and length of the range, as
well as which bus the range resides on (e.g., on an x86 machine, it can
specify whether the range refers to memory or IO addresses).

A device can have multiple physical address ranges it responds to (e.g.,
a PCI device can have multiple BARs), so the structure will also include
an enumerated identifier to specify which of the device's ranges is
being referred to.
+--------+----------------------------+
| Name   | Description                |
+========+============================+
| addr   | range base address         |
+--------+----------------------------+
| len    | range length               |
+--------+----------------------------+
| bus    | addr type (memory or IO)   |
+--------+----------------------------+
| id     | range ID (e.g., PCI BAR)   |
+--------+----------------------------+

- MMIO request structure

This structure describes an MMIO operation. It includes which guest
physical range the MMIO was within, the offset within that range, the
MMIO type (e.g., load or store), and its length and data. It also
includes a sequence number that can be used to reply to the MMIO, and
the CPU that issued the MMIO.

+----------+------------------------+
| Name     | Description            |
+==========+========================+
| rid      | range MMIO is within   |
+----------+------------------------+
| offset   | offset within *rid*    |
+----------+------------------------+
| type     | e.g., load or store    |
+----------+------------------------+
| len      | MMIO length            |
+----------+------------------------+
| data     | store data             |
+----------+------------------------+
| seq      | sequence ID            |
+----------+------------------------+

- MMIO request queues

MMIO request queues are FIFO arrays of MMIO request structures. There
are two queues: the pending queue is for MMIOs that haven't been read by
the emulation program, and the sent queue is for MMIOs that haven't been
acknowledged. The main use of the second queue is to validate MMIO
replies from the emulation program.

- scoreboard

Each CPU in the VM is emulated in QEMU by a separate thread, so multiple
MMIOs may be waiting to be consumed by an emulation program and multiple
threads may be waiting for MMIO replies.
The scoreboard would contain a
wait queue and sequence number for the per-CPU threads, allowing them to
be individually woken when the MMIO reply is received from the emulation
program. It also tracks the number of posted MMIO stores to the device
that haven't been replied to, in order to satisfy the PCI constraint
that a load to a device will not complete until all previous stores to
that device have been completed.

- device shadow memory

Some MMIO loads do not have device side-effects. These MMIOs can be
completed without sending an MMIO request to the emulation program if
the emulation program shares a shadow image of the device's memory image
with the KVM driver.

The emulation program will ask the KVM driver to allocate memory for the
shadow image, and will then use ``mmap()`` to directly access it. The
emulation program can control KVM access to the shadow image by sending
KVM an access map telling it which areas of the image have no
side-effects (and can be completed immediately), and which require an
MMIO request to the emulation program. The access map can also inform
the KVM driver which size accesses are allowed to the image.

master descriptor
^^^^^^^^^^^^^^^^^

The master descriptor is used by QEMU to configure the new KVM device.
The descriptor would be returned by the KVM driver when QEMU issues a
*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type.

KVM\_DEV\_TYPE\_USER device ops
'''''''''''''''''''''''''''''''

The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a
``kvm_register_device_ops()`` call when the KVM system is initialized by
``kvm_init()``. These device ops are called by the KVM driver when QEMU
executes certain ``ioctl()`` operations on its KVM file descriptor.
They include:

- create

This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE*
``ioctl()`` on its per-VM file descriptor. It will allocate and
initialize a KVM user-device-specific data structure, and assign the
*kvm\_device* private field to it.

- ioctl

This routine is invoked when QEMU issues an ``ioctl()`` on the master
descriptor. The ``ioctl()`` commands supported are defined by the KVM
device type. *KVM\_DEV\_TYPE\_USER* ones will need several commands:

*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will
be passed to the device emulation program. Only one slave can be created
by each master descriptor. The file operations performed by this
descriptor are described below.

The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical
address range that the slave descriptor will receive MMIO notifications
for. The range is specified by a guest physical range structure
argument. For buses that assign addresses to devices dynamically, this
command can be executed while the guest is running, such as the case
when a guest changes a device's PCI BAR registers.

*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to
register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
performs an MMIO operation within the range. When a range is changed,
``kvm_io_bus_unregister_dev()`` is used to remove the previous
instantiation.

*KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies
how long KVM will wait for the emulation process to respond to an MMIO
indication.

- destroy

This routine is called when the VM instance is destroyed. It will need
to destroy the slave descriptor, and free any memory allocated by the
driver, as well as the *kvm\_device* structure itself.

slave descriptor
^^^^^^^^^^^^^^^^

The slave descriptor will have its own file operations vector, which
responds to system calls on the descriptor performed by the device
emulation program.

- read

A read returns any pending MMIO requests from the KVM driver as MMIO
request structures. Multiple structures can be returned if there are
multiple MMIO operations pending. The MMIO requests are moved from the
pending queue to the sent queue, and if there are threads waiting for
space in the pending queue to add new MMIO operations, they will be
woken here.

- write

A write also consists of a set of MMIO requests. They are compared to
the MMIO requests in the sent queue. Matches are removed from the sent
queue, and any threads waiting for the reply are woken. If a store is
removed, then the number of posted stores in the per-CPU scoreboard is
decremented. When the number is zero, and a non side-effect load was
waiting for posted stores to complete, the load is continued.

- ioctl

There are several ``ioctl()``\ s that can be performed on the slave
descriptor.

A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to
allocate memory for the shadow image. This memory can later be
``mmap()``\ ed by the emulation process to share the emulation's view of
device memory with the KVM driver.

A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the
shadow image. It will send the KVM driver a shadow control map, which
specifies which areas of the image can complete guest loads without
sending the load request to the emulation program. It will also specify
the size of load operations that are allowed.

- poll

An emulation program will use the ``poll()`` call with a *POLLIN* flag
to determine if there are MMIO requests waiting to be read.
It will
return when the pending MMIO request queue is not empty.

- mmap

This call allows the emulation program to directly access the shadow
image allocated by the KVM driver. As device emulation updates device
memory, changes with no side-effects will be reflected in the shadow,
and the KVM driver can satisfy guest loads from the shadow image without
needing to wait for the emulation program.

kvm\_io\_device ops
^^^^^^^^^^^^^^^^^^^

Each KVM per-CPU thread can handle MMIO operations on behalf of the
guest VM. KVM will use the MMIO's guest physical address to search for a
matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM
driver instead of exiting back to QEMU. If a match is found, the
corresponding callback will be invoked.

- read

This callback is invoked when the guest performs a load to the device.
Loads with side-effects must be handled synchronously, with the KVM
driver putting the QEMU thread to sleep waiting for the emulation
process reply before re-starting the guest. Loads that do not have
side-effects may be optimized by satisfying them from the shadow image,
if there are no outstanding stores to the device by this CPU. PCI memory
ordering demands that a load cannot complete before all older stores to
the same device have been completed.

- write

Stores can be handled asynchronously unless the pending MMIO request
queue is full. In this case, the QEMU thread must sleep waiting for
space in the queue. Stores will increment the number of posted stores in
the per-CPU scoreboard, in order to implement the PCI ordering
constraint above.

interrupt acceleration
^^^^^^^^^^^^^^^^^^^^^^

This performance optimization would work much like a vhost user
application does, where the QEMU process sets up *eventfds* that cause
the device's corresponding interrupt to be triggered by the KVM driver.
These irq file descriptors are sent to the emulation process at
initialization, and are used when the emulation code raises a device
interrupt.

intx acceleration
'''''''''''''''''

Traditional PCI pin interrupts are level based, so, in addition to an
irq file descriptor, a re-sampling file descriptor needs to be sent to
the emulation program. This second file descriptor allows multiple
devices sharing an irq to be notified when the interrupt has been
acknowledged by the guest, so they can re-trigger the interrupt if their
device has not de-asserted its interrupt.

intx irq descriptor
"""""""""""""""""""

The irq descriptors are created by the proxy object using
``event_notifier_init()`` to create the irq and re-sampling *eventfds*,
and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt. The
interrupt route can be found with ``pci_device_route_intx_to_irq()``.

intx routing changes
""""""""""""""""""""

Intx routing can be changed when the guest programs the APIC the device
pin is connected to. The proxy object in QEMU will use
``pci_device_set_intx_routing_notifier()`` to be informed of any guest
changes to the route. This handler will broadly follow the VFIO
interrupt logic to change the route: de-assigning the existing irq
descriptor from its route, then assigning it the new route (see
``vfio_intx_update()``).

MSI/X acceleration
''''''''''''''''''

MSI/X interrupts are sent as DMA transactions to the host.
The interrupt
data contains a vector that is programmed by the guest. A device may
have multiple MSI interrupts associated with it, so multiple irq
descriptors may need to be sent to the emulation program.

MSI/X irq descriptor
""""""""""""""""""""

This case will also follow the VFIO example. For each MSI/X interrupt,
an *eventfd* is created, a virtual interrupt is allocated by
``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
the *eventfd* with ``kvm_irqchip_add_irqfd_notifier()``.

MSI/X config space changes
""""""""""""""""""""""""""

The guest may dynamically update several MSI-related tables in the
device's PCI config space. These include per-MSI interrupt enables and
vector data. Additionally, MSI-X tables exist in device memory space,
not config space. Much like the BAR case above, the proxy object must
look at guest config space programming to keep the MSI interrupt state
consistent between QEMU and the emulation program.

--------------

Disaggregated CPU emulation
---------------------------

After IO services have been disaggregated, a second phase would be to
separate a process to handle CPU instruction emulation from the main
QEMU control function. There are no object separation points for this
code, so the first task would be to create one.

Host access controls
--------------------

Separating QEMU relies on the host OS's access restriction mechanisms to
enforce that the differing processes can only access the objects they
are entitled to. There are a couple of types of mechanisms usually
provided by general purpose OSs.

Discretionary access control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Discretionary access control allows each user to control who can access
their files.
In Linux, this type of control is usually too coarse for
QEMU separation, since it only provides three separate access controls:
one for the same user ID, the second for user IDs with the same group
ID, and the third for all other user IDs. Each device instance would
need a separate user ID to provide access control, which is likely to be
unwieldy for dynamically created VMs.

Mandatory access control
~~~~~~~~~~~~~~~~~~~~~~~~

Mandatory access control allows the OS to impose an additional set of
controls on top of discretionary access control. It also adds other
attributes to processes and files such as types, roles, and categories,
and can establish rules for how processes and files can interact.

Type enforcement
^^^^^^^^^^^^^^^^

Type enforcement assigns a *type* attribute to processes and files, and
allows rules to be written on what operations a process with a given
type can perform on a file with a given type. QEMU separation could take
advantage of type enforcement by running the emulation processes with
different types, both from the main QEMU process, and from the emulation
processes of different classes of devices.

For example, guest disk images and disk emulation processes could have
types separate from the main QEMU process and non-disk emulation
processes, and the type rules could prevent processes other than disk
emulation ones from accessing guest disk images. Similarly, network
emulation processes can have a type separate from the main QEMU process
and non-network emulation processes, and only that type can access the
host tun/tap device used to provide guest networking.

Category enforcement
^^^^^^^^^^^^^^^^^^^^

Category enforcement assigns a set of numbers within a given range to
the process or file.
The process is granted access to the file if the
process's set is a superset of the file's set. This enforcement can be
used to separate multiple instances of devices in the same class.

For example, if there are multiple disk devices provided to a guest,
each device emulation process could be provisioned with a separate
category. The different device emulation processes would not be able to
access each other's backing disk images.

Alternatively, categories could be used in lieu of the type enforcement
scheme described above. In this scenario, different categories would be
used to prevent device emulation processes in different classes from
accessing resources assigned to other classes.