Block replication
----------------------------------------
Copyright Fujitsu, Corp. 2016
Copyright (c) 2016 Intel Corporation
Copyright (c) 2016 HUAWEI TECHNOLOGIES CO., LTD.

This work is licensed under the terms of the GNU GPL, version 2 or later.
See the COPYING file in the top-level directory.

Block replication is used for continuous checkpoints. It is designed
for COLO (COarse-grain LOck-stepping), where the Secondary VM is running.
It can also be applied to the FT/HA (Fault-tolerance/High Assurance)
scenario, where the Secondary VM is not running.

This document gives an overview of block replication's design.

== Background ==
High availability solutions such as micro checkpoint and COLO take
consecutive checkpoints. The VM state of the Primary and Secondary VM is
identical right after a VM checkpoint, but diverges as the VMs execute
until the next checkpoint. To support disk content checkpoints, the
modified disk contents in the Secondary VM must be buffered, and are
only dropped at the next checkpoint. To reduce the network traffic
during a vmstate checkpoint, the disk modification operations of the
Primary disk are asynchronously forwarded to the Secondary node.

== Workflow ==
The following image shows the block replication workflow:

        +----------------------+            +------------------------+
        |Primary Write Requests|            |Secondary Write Requests|
        +----------------------+            +------------------------+
                  |                                       |
                  |                                      (4)
                  |                                       V
                  |                              /-------------\
                  |      Copy and Forward        |             |
                  |---------(1)----------+       | Disk Buffer |
                  |                      |       |             |
                  |                     (3)      \-------------/
                  |                 speculative      ^
                  |                write through    (2)
                  |                      |           |
                  V                      V           |
           +--------------+           +----------------+
           | Primary Disk |           | Secondary Disk |
           +--------------+           +----------------+

    1) Primary write requests are copied and forwarded to the Secondary
       QEMU.
    2) Before Primary write requests are written to the Secondary disk,
       the original sector content is read from the Secondary disk and
       buffered in the Disk buffer; it does not overwrite existing sector
       content (which could come from either "Secondary Write Requests"
       or a previous COW of "Primary Write Requests") in the Disk buffer.
    3) Primary write requests are written to the Secondary disk.
    4) Secondary write requests are buffered in the Disk buffer,
       overwriting any existing sector content in the buffer.

== Architecture ==
Block replication is built from basic blocks that already exist in QEMU.

         virtio-blk                   ||
             ^                        ||                          .----------
             |                        ||                          | Secondary
        1 Quorum                      ||                          '----------
         /      \                     ||
        /        \                    ||
   Primary      2 filter
     disk           ^                                   virtio-blk
                    |                                        ^
                  3 NBD  ------->  3 NBD                     |
                  client    ||     server                 2 filter
                            ||        ^                      ^
--------.                   ||        |                      |
Primary |                   ||  Secondary disk <--------- hidden-disk 5 <--------- active-disk 4
--------'                   ||        |          backing        ^       backing
                            ||        |                         |
                            ||        |                         |
                            ||        '-------------------------'
                            ||           drive-backup sync=none 6

1) The disk on the primary is represented by a block device with two
children, providing replication between the primary disk and the host
that runs the secondary VM. The read pattern (fifo) for quorum can be
extended so that the primary always reads from the local disk instead of
going through NBD.

2) A new block filter (named replication) controls the block
replication.

3) The secondary disk receives writes from the primary VM through QEMU's
embedded NBD server (speculative write-through).

4) The disk on the secondary is represented by a custom block device
(called active-disk). It should start as an empty disk, and its format
must support bdrv_make_empty() and backing files.

5) The hidden-disk is created automatically. It buffers the original
content that is modified by the primary VM.
Like the active-disk, it should start as an empty disk, and its driver
must support bdrv_make_empty() and backing files.

6) The drive-backup job (sync=none) is run so that the hidden-disk
buffers any state that would otherwise be lost by the speculative
write-through of the NBD server into the secondary disk. Therefore,
before block replication starts, the primary disk and secondary disk
must contain the same data.

== Failure Handling ==
Seven internal errors can occur while block replication is running:
1. I/O error on the primary disk
2. Forwarding primary write requests failed
3. Backup failed
4. I/O error on the secondary disk
5. I/O error on the active disk
6. Making the active disk or hidden disk empty failed
7. Failover failed
In cases 1 and 5, the error is simply reported to the disk layer. In
cases 2, 3, 4 and 6, the error is reported to the FT/HA manager (which
decides when to take a new checkpoint and when to fail over).
In case 7, if the active commit failed, the Secondary enters the
"replication failover failed" state, which determines which target its
write operations go to.

== New block driver interface ==
Four block driver interfaces control block replication:
a. replication_start_all()
   Starts block replication; called in the migration/checkpoint thread.
   replication_start_all() must be called in the secondary QEMU before
   it is called in the primary QEMU. The caller must hold the I/O mutex
   lock if it is in the migration/checkpoint thread.
b. replication_do_checkpoint_all()
   Called after all VM state has been transferred to the Secondary QEMU.
   The Disk buffer is dropped in this interface. The caller must hold
   the I/O mutex lock if it is in the migration/checkpoint thread.
c. replication_get_error_all()
   Called to check whether an error has occurred during replication.
   The caller must hold the I/O mutex lock if it is in the
   migration/checkpoint thread.
d. replication_stop_all()
   Called on failover: the Disk buffer is flushed into the Secondary
   disk and block replication stops. If this API is used for anything
   other than failover (e.g. to shut down the guest), the VM must be
   stopped before calling it. The caller must hold the I/O mutex lock
   if it is in the migration/checkpoint thread.

== Usage ==
Primary:
  -drive if=xxx,driver=quorum,read-pattern=fifo,id=colo1,vote-threshold=1,\
         children.0.file.filename=1.raw,\
         children.0.driver=raw

  Run the following qmp commands in the primary qemu:
  { 'execute': 'human-monitor-command',
    'arguments': {
        'command-line': 'drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1'
    }
  }
  { 'execute': 'x-blockdev-change',
    'arguments': {
        'parent': 'colo1',
        'node': 'nbd_client1'
    }
  }
  Note:
  1. There should be only one NBD client for each primary disk.
  2. host is the secondary physical machine's hostname or IP address.
  3. Each disk must have its own export name.
  4. It is all a single argument to -drive; the leading whitespace
     should be ignored.
  5. These qmp commands must be run after the qmp commands in the
     secondary qemu.
  6. After failover, children.1 (the replication driver) needs to be
     removed.
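
  The two primary-side qmp commands above can be generated by an FT/HA
  management script instead of being typed by hand. A minimal sketch
  (the helper names and the host/port values are hypothetical, not part
  of QEMU; only the generated qmp payloads follow the document):

```python
import json

# Hypothetical helpers that build the primary-side QMP commands shown
# above, so an FT/HA manager can send them over a QMP socket.

def nbd_client_drive_add(export, node_name, host, port):
    # drive_add is an HMP command, so it is wrapped in
    # human-monitor-command, as in the example above.
    cmdline = ("drive_add -n buddy driver=replication,mode=primary,"
               "file.driver=nbd,file.host={host},file.port={port},"
               "file.export={export},node-name={node}").format(
                   host=host, port=port, export=export, node=node_name)
    return {"execute": "human-monitor-command",
            "arguments": {"command-line": cmdline}}

def attach_to_quorum(quorum_id, node_name):
    # Attach the NBD client node as a new child of the quorum.
    return {"execute": "x-blockdev-change",
            "arguments": {"parent": quorum_id, "node": node_name}}

if __name__ == "__main__":
    # Placeholder address of the secondary host.
    for cmd in (nbd_client_drive_add("colo1", "nbd_client1",
                                     "192.0.2.1", "8889"),
                attach_to_quorum("colo1", "nbd_client1")):
        print(json.dumps(cmd))
```

  Each returned dict serializes to exactly one of the qmp commands shown
  above; sending them over a QMP connection is left to the caller.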

Secondary:
  -drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
  -drive if=xxx,id=topxxx,driver=replication,mode=secondary,top-id=topxxx,\
         file.file.filename=active_disk.qcow2,\
         file.driver=qcow2,\
         file.backing.file.filename=hidden_disk.qcow2,\
         file.backing.driver=qcow2,\
         file.backing.backing=colo1

  Then run the following qmp commands in the secondary qemu:
  { 'execute': 'nbd-server-start',
    'arguments': {
        'addr': {
            'type': 'inet',
            'data': {
                'host': 'xxx',
                'port': 'xxx'
            }
        }
    }
  }
  { 'execute': 'nbd-server-add',
    'arguments': {
        'device': 'colo1',
        'writable': true
    }
  }

  Note:
  1. The export name in the secondary QEMU command line is the secondary
     disk's id.
  2. The primary and secondary must use the same export name for the
     same disk.
  3. The qmp commands nbd-server-start and nbd-server-add must be run
     before running the qmp command migrate on the primary QEMU.
  4. The active disk, hidden disk and NBD target must all have the same
     length.
  5. It is better to put the active disk and hidden disk on a ramdisk.
  6. It is all a single argument to -drive; the leading whitespace
     should be ignored.

After Failover:
Primary:
  The secondary host is down, so we should run the following qmp
  commands to remove the NBD child from the quorum:
  { 'execute': 'x-blockdev-change',
    'arguments': {
        'parent': 'colo1',
        'child': 'children.1'
    }
  }
  { 'execute': 'human-monitor-command',
    'arguments': {
        'command-line': 'drive_del xxxx'
    }
  }
  Note: there is no qmp command to remove the blockdev now.

Secondary:
  The primary host is down, so we should run the following qmp command:
  { 'execute': 'nbd-server-stop' }

TODO:
1. Continuous block replication
2. Shared disk
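
For illustration, the Disk buffer rules from the Workflow section and
the checkpoint/failover behaviour described above can be modeled in a
few lines. This is a hypothetical sketch, not QEMU code: sectors map to
string contents, and the backing chain is collapsed into one dict plus
the buffer:

```python
# Model of the secondary side: "disk" stands for the Secondary disk,
# "buffer" for the Disk buffer (active-disk + hidden-disk combined).

class SecondaryReplication:
    def __init__(self, disk):
        self.disk = disk     # Secondary disk: sector -> content
        self.buffer = {}     # Disk buffer: sector -> content

    def primary_write(self, sector, data):
        # Workflow steps 2-3: COW the original sector content into the
        # buffer without overwriting an existing entry, then write the
        # primary data through to the secondary disk.
        self.buffer.setdefault(sector, self.disk.get(sector, ""))
        self.disk[sector] = data

    def secondary_write(self, sector, data):
        # Workflow step 4: secondary writes land in the buffer and
        # overwrite any existing buffered content.
        self.buffer[sector] = data

    def secondary_read(self, sector):
        # The secondary VM sees buffered content first, falling back
        # to the secondary disk (the backing chain).
        return self.buffer.get(sector, self.disk.get(sector, ""))

    def checkpoint(self):
        # replication_do_checkpoint_all(): drop the Disk buffer; the
        # secondary disk now matches the primary again.
        self.buffer.clear()

    def failover(self):
        # replication_stop_all(): flush the Disk buffer into the
        # secondary disk, undoing the speculative primary writes.
        self.disk.update(self.buffer)
        self.buffer.clear()
```

Running the rules in order shows why the COW in step 2 must not
overwrite: the buffer always preserves the secondary VM's view of each
sector until the next checkpoint or failover.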