Block replication
----------------------------------------
Copyright Fujitsu, Corp. 2016
Copyright (c) 2016 Intel Corporation
Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Block replication is used for continuous checkpoints. It is designed
for COLO (COarse-grain LOck-stepping), where the Secondary VM is running.
It can also be applied to FT/HA (Fault Tolerance/High Availability)
scenarios, where the Secondary VM is not running.

This document gives an overview of block replication's design.

== Background ==
High availability solutions such as micro checkpointing and COLO take
consecutive checkpoints. The VM state of the Primary and Secondary VM is
identical right after a VM checkpoint, but diverges as the VM executes
until the next checkpoint. To support checkpointing of disk contents,
the modified disk contents in the Secondary VM must be buffered, and are
dropped only at the next checkpoint. To reduce the network traffic
during a vmstate checkpoint, the disk modification operations of
the Primary disk are asynchronously forwarded to the Secondary node.

== Workflow ==
The following diagram shows the block replication workflow:

        +----------------------+            +------------------------+
        |Primary Write Requests|            |Secondary Write Requests|
        +----------------------+            +------------------------+
                  |                                       |
                  |                                      (4)
                  |                                       V
                  |                              /-------------\
                  |      Copy and Forward        |             |
                  |---------(1)----------+       | Disk Buffer |
                  |                      |       |             |
                  |                     (3)      \-------------/
                  |                 speculative      ^
                  |                write through    (2)
                  |                      |           |
                  V                      V           |
           +--------------+           +----------------+
           | Primary Disk |           | Secondary Disk |
           +--------------+           +----------------+

    1) Primary write requests are copied and forwarded to the Secondary
       QEMU.
    2) Before a Primary write request is written to the Secondary disk, the
       original sector content is read from the Secondary disk and
       buffered in the Disk buffer, but it does not overwrite existing
       sector content in the Disk buffer (which could come from either
       "Secondary Write Requests" or a previous COW of "Primary Write
       Requests").
    3) Primary write requests are written to the Secondary disk.
    4) Secondary write requests are buffered in the Disk buffer and
       overwrite the existing sector content in the buffer.
     60 == Architecture ==
     61 We are going to implement block replication from many basic
     62 blocks that are already in QEMU.
     63 
     64          virtio-blk       ||
     65              ^            ||                            .----------
     66              |            ||                            | Secondary
     67         1 Quorum          ||                            '----------
     68          /      \         ||                                                           virtio-blk
     69         /        \        ||                                                               ^
     70    Primary    2 filter                                                                     |
     71      disk         ^                                                                   7 Quorum
     72                   |                                                                    /
     73                 3 NBD  ------->  3 NBD                                                /
     74                 client    ||     server                                          2 filter
     75                           ||        ^                                                ^
     76 --------.                 ||        |                                                |
     77 Primary |                 ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
     78 --------'                 ||        |          backing        ^       backing
     79                           ||        |                         |
     80                           ||        |                         |
     81                           ||        '-------------------------'
     82                           ||         blockdev-backup sync=none 6
     83 
     84 1) The disk on the primary is represented by a block device with two
     85 children, providing replication between a primary disk and the host that
     86 runs the secondary VM. The read pattern (fifo) for quorum can be extended
     87 to make the primary always read from the local disk instead of going through
     88 NBD.
     89 
     90 2) The new block filter (the name is replication) will control the block
     91 replication.
     92 
     93 3) The secondary disk receives writes from the primary VM through QEMU's
     94 embedded NBD server (speculative write-through).
     95 
     96 4) The disk on the secondary is represented by a custom block device
     97 (called active-disk). It should start as an empty disk, and the format
     98 should support bdrv_make_empty() and backing file.
     99 
    100 5) The hidden-disk is created automatically. It buffers the original content
    101 that is modified by the primary VM. It should also start as an empty disk,
    102 and the driver supports bdrv_make_empty() and backing file.
    103 
    104 6) The blockdev-backup job (sync=none) is run to allow hidden-disk to buffer
    105 any state that would otherwise be lost by the speculative write-through
    106 of the NBD server into the secondary disk. So before block replication,
    107 the primary disk and secondary disk should contain the same data.
    108 
    109 7) The secondary also has a quorum node, so after secondary failover it
    110 can become the new primary and continue replication.
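
The backing chain on the secondary (active-disk over hidden-disk over the
secondary disk) behaves like a layered lookup. The sketch below models it
with Python's collections.ChainMap. It is an illustration under stated
assumptions: dictionaries stand in for disk images, the function names are
hypothetical, checkpoint() stands in for emptying both overlays with
bdrv_make_empty(), and failover() models the flush into the secondary disk
as a simple flattening of the chain.

```python
from collections import ChainMap

secondary_disk = {0: b"base0", 1: b"base1"}
hidden_disk = {}  # item 5: original content saved by the backup job
active_disk = {}  # item 4: writes issued by the Secondary VM

# Lookups try active-disk, then hidden-disk, then the secondary disk.
# Writes through the ChainMap land in the first mapping (active-disk).
guest_view = ChainMap(active_disk, hidden_disk, secondary_disk)


def nbd_write(sector, data):
    """Primary write arriving through the NBD server (items 3 and 6)."""
    # sync=none backup: keep the old content before it is overwritten,
    # unless an older copy is already held.
    hidden_disk.setdefault(sector, secondary_disk.get(sector, b""))
    secondary_disk[sector] = data


def checkpoint():
    """Empty both overlays so the guest view equals the secondary disk."""
    active_disk.clear()
    hidden_disk.clear()


def failover():
    """Flatten what the Secondary VM currently sees into the secondary disk."""
    merged = dict(guest_view)
    secondary_disk.clear()
    secondary_disk.update(merged)
    active_disk.clear()
    hidden_disk.clear()
```

A Primary write that arrives between checkpoints changes secondary_disk but
not guest_view, because hidden_disk still holds the old content. Only
checkpoint() makes it visible to the Secondary VM.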

== Failure Handling ==
There are 7 internal errors that can occur while block replication is running:
1. I/O error on primary disk
2. Forwarding primary write requests failed
3. Backup failed
4. I/O error on secondary disk
5. I/O error on active disk
6. Making active disk or hidden disk empty failed
7. Doing failover failed
In cases 1 and 5, we just report the error to the disk layer. In cases 2, 3,
4 and 6, we just report block replication's error to the FT/HA manager (which
decides when to do a new checkpoint and when to do failover).
In case 7, if active commit failed, we use the replication failover failed
state in the Secondary's write operation to decide which target to write to.
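
The routing of the seven cases can be summarized in a few lines of code.
This is a sketch only: the enum and function names are invented for the
example and do not correspond to identifiers in QEMU.

```python
from enum import Enum


class ReplicationError(Enum):
    PRIMARY_DISK_IO = 1
    FORWARD_WRITE_FAILED = 2
    BACKUP_FAILED = 3
    SECONDARY_DISK_IO = 4
    ACTIVE_DISK_IO = 5
    MAKE_EMPTY_FAILED = 6
    FAILOVER_FAILED = 7


def route_error(error):
    """Return where the given error case is handled, per the list above."""
    if error in (ReplicationError.PRIMARY_DISK_IO,
                 ReplicationError.ACTIVE_DISK_IO):
        return "disk layer"
    if error is ReplicationError.FAILOVER_FAILED:
        return "failover failed state"
    # Cases 2, 3, 4 and 6 are reported to the FT/HA manager.
    return "FT/HA manager"
```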

== New block driver interface ==
We add four block driver interfaces to control block replication:
a. replication_start_all()
   Start block replication, called in the migration/checkpoint thread.
   We must call replication_start_all() in the secondary QEMU before
   calling replication_start_all() in the primary QEMU. The caller
   must hold the I/O mutex lock if it is in the migration/checkpoint
   thread.
b. replication_do_checkpoint_all()
   This interface is called after all VM state is transferred to the
   Secondary QEMU. The Disk buffer will be dropped in this interface.
   The caller must hold the I/O mutex lock if it is in the
   migration/checkpoint thread.
c. replication_get_error_all()
   This interface is called to check whether an error happened in
   replication. The caller must hold the I/O mutex lock if it is in the
   migration/checkpoint thread.
d. replication_stop_all()
   It is called on failover. We will flush the Disk buffer into the
   Secondary Disk and stop block replication. If this API is used for
   anything other than failover, such as shutting down the guest, the VM
   should be stopped before calling it. The caller must hold the I/O
   mutex lock if it is in the migration/checkpoint thread.
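
The call ordering described above can be sketched with a toy stand-in for
the four interfaces. Everything here is hypothetical: the real interfaces
are C functions inside QEMU, the ReplicationNode class and the helper
functions exist only for this example, a threading.Lock stands in for the
I/O mutex lock, and failing over on any reported error is just one possible
policy of an FT/HA manager.

```python
import threading

io_lock = threading.Lock()  # stand-in for the I/O mutex lock


class ReplicationNode:
    """Hypothetical stand-in for the replication state of one QEMU."""

    def __init__(self, name):
        self.name = name
        self.started = False
        self.error = None
        self.checkpoints = 0

    def start_all(self):
        self.started = True

    def do_checkpoint_all(self):
        if not self.started:
            raise RuntimeError("replication not started on " + self.name)
        self.checkpoints += 1  # the Disk buffer would be dropped here

    def get_error_all(self):
        return self.error

    def stop_all(self):
        self.started = False  # the Disk buffer would be flushed here


def start_pair(primary, secondary):
    # The secondary must be started before the primary.
    with io_lock:
        secondary.start_all()
        primary.start_all()


def checkpoint_step(node):
    """Check for errors, then either take a checkpoint or stop (failover)."""
    with io_lock:
        if node.get_error_all() is not None:
            node.stop_all()
            return "failover"
        node.do_checkpoint_all()
        return "checkpoint"
```

A manager loop would call start_pair() once and then checkpoint_step()
after each VM state transfer until it returns "failover".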

== Usage ==
Primary:
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
         children.0.file.filename=1.raw,\
         children.0.driver=raw

  Run the following QMP commands in the primary QEMU:
    { "execute": "human-monitor-command",
      "arguments": {
          "command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1"
      }
    }
    { "execute": "x-blockdev-change",
      "arguments": {
          "parent": "colo1",
          "node": "nbd_client1"
      }
    }
  Note:
  1. There should be only one NBD client for each primary disk.
  2. host is the secondary physical machine's hostname or IP address.
  3. Each disk must have its own export name.
  4. It is all a single argument to -drive, and you should ignore the
     leading whitespace.
  5. These QMP commands must be run after the QMP commands in the
     secondary QEMU have been run.
  6. After primary failover we need to remove children.1 (the replication
     driver).
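
The two primary-side QMP commands can also be generated instead of typed by
hand. The helper below is a minimal sketch using only the json module. The
function name is made up for this example, and the host and port values are
placeholders to be replaced with those of the secondary machine. It only
builds the command objects and does not talk to a QMP socket.

```python
import json


def primary_qmp_commands(host, port, export="colo1", parent="colo1",
                         node_name="nbd_client1"):
    """Build the two QMP commands shown above for one primary disk."""
    drive_add = (
        "drive_add -n buddy driver=replication,mode=primary,"
        f"file.driver=nbd,file.host={host},file.port={port},"
        f"file.export={export},node-name={node_name}"
    )
    return [
        {"execute": "human-monitor-command",
         "arguments": {"command-line": drive_add}},
        {"execute": "x-blockdev-change",
         "arguments": {"parent": parent, "node": node_name}},
    ]


# Placeholder address and port, chosen only for illustration.
commands = primary_qmp_commands("192.0.2.10", 8889)
serialized = [json.dumps(command) for command in commands]
```

Each string in serialized is one JSON command that can be sent over the
QMP connection of the primary QEMU, in order.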

Secondary:
  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
  -drive if=none,id=childs1,driver=replication,mode=secondary,top-id=top-disk1,\
         file.file.filename=active_disk.qcow2,\
         file.driver=qcow2,\
         file.backing.file.filename=hidden_disk.qcow2,\
         file.backing.driver=qcow2,\
         file.backing.backing=colo1 \
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=top-disk1,\
         vote-threshold=1,children.0=childs1

  Then run the following QMP commands in the secondary QEMU:
    { "execute": "nbd-server-start",
      "arguments": {
          "addr": {
              "type": "inet",
              "data": {
                  "host": "xxx",
                  "port": "xxx"
              }
          }
      }
    }
    { "execute": "nbd-server-add",
      "arguments": {
          "device": "colo1",
          "writable": true
      }
    }

  Note:
  1. The export name in the secondary QEMU command line is the secondary
     disk's id.
  2. The export name for the same disk must be the same on both sides.
  3. The QMP commands nbd-server-start and nbd-server-add must be run
     before running the QMP command migrate on the primary QEMU.
  4. The active disk, hidden disk and NBD target should have the same
     length.
  5. It is better to put the active disk and hidden disk on a ramdisk.
  6. It is all a single argument to -drive, and you should ignore
     the leading whitespace.
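
Note 6 means that the continuation lines of one -drive option are joined
into a single comma-separated value, with the trailing backslashes and the
leading whitespace removed. The helper below shows that joining for the
replication drive. It is a sketch with a hypothetical function name, and the
input lines restate the options of the secondary command line above.

```python
def join_drive_argument(lines):
    """Join backslash-continued lines into one -drive argument value."""
    parts = []
    for line in lines:
        part = line.strip()            # ignore the leading whitespace
        if part.endswith("\\"):
            part = part[:-1].rstrip()  # drop the continuation backslash
        parts.append(part)
    return "".join(parts)


replication_drive = join_drive_argument([
    "if=none,id=childs1,driver=replication,mode=secondary,top-id=top-disk1,\\",
    "       file.file.filename=active_disk.qcow2,\\",
    "       file.driver=qcow2,\\",
    "       file.backing.file.filename=hidden_disk.qcow2,\\",
    "       file.backing.driver=qcow2,\\",
    "       file.backing.backing=colo1",
])
```

The resulting string contains no whitespace and can be passed as the single
value following -drive.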

After Failover:
Primary:
  The secondary host is down, so we should run the following QMP commands
  to remove the NBD child from the quorum:
  { "execute": "x-blockdev-change",
    "arguments": {
        "parent": "colo1",
        "child": "children.1"
    }
  }
  { "execute": "human-monitor-command",
    "arguments": {
        "command-line": "drive_del xxxx"
    }
  }
  Note: there is currently no QMP command to remove the blockdev.

Secondary:
  The primary host is down, so we should run the following QMP command:
  { "execute": "nbd-server-stop" }

Promote Secondary to Primary:
  see COLO-FT.txt

TODO:
1. Shared disk