========
Postcopy
========

.. contents::

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge). Its plus side is that there is an upper bound
on the amount of migration traffic and the time it takes; the downside is that
during the postcopy phase, a failure of *either* side causes the guest to be
lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if
precopy doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
=================

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode. Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on. Issuing it after the end of a migration is harmless.

Blocktime is a postcopy live migration metric, intended to show how
long a vCPU was in a state of interruptible sleep due to a page fault.
That metric is calculated both as an overlapped value across all vCPUs
and separately for each vCPU; the values are calculated on the
destination side. To enable postcopy blocktime calculation, enter the
following command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP
command: its ``postcopy-blocktime`` field shows the overlapped blocking
time across all vCPUs, and ``postcopy-vcpu-blocktime`` shows the list of
blocking times per vCPU.
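
For example, a minimal QMP sequence to enable both capabilities and later
read the blocktime values back might look like the following (abridged, and
with purely illustrative values in the reply)::

  {"execute": "migrate-set-capabilities",
   "arguments": {"capabilities": [
     {"capability": "postcopy-ram", "state": true},
     {"capability": "postcopy-blocktime", "state": true}]}}

  {"execute": "query-migrate"}
  {"return": {"status": "postcopy-active",
              "postcopy-blocktime": 3459,
              "postcopy-vcpu-blocktime": [3118, 2252]}}
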
.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages
  that the destination is waiting for).

Postcopy internals
==================

State machine
-------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END

 - Advise

   Set at the start of migration if postcopy is enabled, even
   if it hasn't had the start command; here the destination
   checks that its OS has the support needed for postcopy, and performs
   setup to ensure the RAM mappings are suitable for later postcopy.
   The destination will fail early in migration at this point if the
   required OS support is not present.
   (Triggered by reception of POSTCOPY_ADVISE command)

 - Discard

   Entered on receipt of the first 'discard' command; prior to
   the first Discard being performed, hugepages are switched off
   (using madvise) to ensure that no new huge pages are created
   during the postcopy phase, and to cause any huge pages that
   have discards on them to be broken.

 - Listen

   The first command in the package, POSTCOPY_LISTEN, switches
   the destination state to Listen, and starts a new thread
   (the 'listen thread') which takes over the job of receiving
   pages off the migration stream, while the main thread carries
   on processing the blob.  With this thread able to process page
   reception, the destination now 'sensitises' the RAM to detect
   any access to missing pages (on Linux using the 'userfault'
   system).

 - Running

   POSTCOPY_RUN causes the destination to synchronise all
   state and start the CPUs and IO devices running.  The main
   thread now finishes processing the migration package and
   carries on as it would for normal precopy migration
   (although it can't do the cleanup it would do as it
   finishes a normal migration).

 - End

   The listen thread can now quit and perform the cleanup of migration
   state; the migration is now complete.

Device transfer
---------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source. As such,
the migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream
completely before the device load begins, to free the stream up. This is
achieved by 'packaging' the device data into a blob that's read in one go.

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

 - Command: 'postcopy listen'
 - The device state

   A series of sections, identical to the precopy stream's device state
   stream, containing everything except postcopiable devices (i.e. RAM)
 - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy);
however, when a page request is received from the destination, the dirty page
scanning restarts from the requested location. This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly, in the hope that those pages are likely to be used
by the destination soon.
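
The following self-contained toy model illustrates that scan-restart policy;
the bitmap, request queue and helper function are invented for this sketch
and are not QEMU's actual implementation::

  #include <stdbool.h>
  #include <stdio.h>

  #define NR_PAGES 16

  static bool dirty[NR_PAGES];        /* toy migration bitmap */
  static int pending_request = -1;    /* toy queue: one request, -1 = none */

  /* Find the next dirty page at or after 'pos', wrapping at the end. */
  static int find_next_dirty(int pos)
  {
      for (int i = 0; i < NR_PAGES; i++) {
          int p = (pos + i) % NR_PAGES;
          if (dirty[p]) {
              return p;
          }
      }
      return -1;                      /* nothing left to send */
  }

  int main(void)
  {
      for (int i = 0; i < NR_PAGES; i++) {
          dirty[i] = true;            /* everything starts out unsent */
      }
      pending_request = 9;            /* pretend the destination faulted on 9 */

      int scan = 0;
      for (;;) {
          if (pending_request >= 0) {
              scan = pending_request; /* restart the scan at the request */
              pending_request = -1;
          }
          scan = find_next_dirty(scan);
          if (scan < 0) {
              break;
          }
          printf("send page %d\n", scan); /* stand-in for the real send */
          dirty[scan] = false;        /* clear the dirty bit once sent */
          scan++;                     /* pages right after the request go next */
      }
      return 0;
  }

Run as-is it sends page 9 first and then continues linearly from there,
which is the 'requested page and its neighbours first' behaviour described
above.
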
Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                                         1      2   3     4 5      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE DEVICE DEVICE     RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

  All the data associated with the package - the ( ... ) section in the
  diagram - is read into memory, and the main thread recurses into
  ``qemu_loadvm_state_main`` to process the contents of the package (2),
  which contains commands (3,6) and devices (4...)

- On receipt of 'postcopy listen' (3) (i.e. the first command in the package)

  a new thread (a) is started that takes over servicing the migration stream,
  while the main thread carries on loading the package.  It loads normal
  background page data (b), but if a fault happens during a device load (5),
  the returned page (c) is loaded by the listen thread, allowing the main
  thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

  letting the destination CPUs start running.  At the end of the
  ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
  is no longer used by migration, while the listen thread carries on servicing
  page data until the end of migration.

Source side page bitmap
-----------------------

The 'migration bitmap' in postcopy is basically the same as in precopy:
each bit indicates that a page is 'dirty' - i.e. needs sending.  During
the precopy phase this is updated as the CPU dirties pages; however,
during postcopy the CPUs are stopped and nothing should dirty anything
any more.  Instead, dirty bits are cleared when the relevant pages are
sent during postcopy.

Postcopy features
=================

Postcopy recovery
-----------------

Compared to precopy, postcopy is special in its error handling.  When an
error happens (mostly network errors), QEMU cannot easily fail the
migration, because VM data resides in both the source and destination
QEMU instances.  Instead, when an issue happens, QEMU on both sides goes
into a paused state, and a recovery phase is needed to continue the
paused postcopy migration.

The recovery phase normally contains a few steps (an example QMP sequence
is sketched after this list):

  - When a network issue occurs, both QEMU instances go into the
    **POSTCOPY_PAUSED** migration state.

  - When the network is recovered (or a new network is provided), the admin
    can set up the new channel for migration using the QMP command
    ``migrate-recover`` on the destination node, preparing for a resume.

  - On the source host, the admin can continue the interrupted postcopy
    migration using the QMP command ``migrate`` with the ``resume=true``
    flag set.  The source QEMU will go into the **POSTCOPY_RECOVER_SETUP**
    state while trying to re-establish the channels.

  - When both QEMU instances successfully reconnect using a new or fixed-up
    channel, they will go into the **POSTCOPY_RECOVER** state, where a
    handshake procedure synchronizes the VM state between the two QEMUs so
    that the postcopy migration can continue.  For example, pages may have
    been sent right in the window when the network was interrupted; the
    handshake guarantees that pages lost in flight will be resent.

  - After a proper handshake synchronization, QEMU will continue the
    postcopy migration on both sides and go back to the **POSTCOPY_ACTIVE**
    state.  The postcopy migration then continues.
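
As a sketch, the QMP traffic for such a recovery might look like the
following; the URI is only an example.

On the destination::

  {"execute": "migrate-recover",
   "arguments": {"uri": "tcp:192.168.1.2:4444"}}

On the source::

  {"execute": "migrate",
   "arguments": {"uri": "tcp:192.168.1.2:4444", "resume": true}}
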
During a paused postcopy migration, the VM can logically still continue
running: it is not impacted by accesses to pages that were already migrated
to the destination before the interruption happened.  However, if any of the
missing pages is accessed on the destination, the vCPU thread will be halted
waiting for the page to be migrated, which means it can stay halted until
the recovery is complete.

The impact of accessing missing pages depends on the configuration of the
guest.  For example, with async page fault enabled, the guest can
proactively schedule out the threads accessing missing pages.

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating
     normal RAM if it doesn't have enough hugepages, triggering (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU.  There are restrictions on the
types of memory that userfault can support shared.

The Linux kernel userfault support works on ``/dev/shm`` memory and on
``hugetlbfs`` (although the kernel doesn't provide an equivalent to
``madvise(MADV_DONTNEED)`` for hugetlbfs, which may be a problem in some
configurations).

The vhost-user code in QEMU supports clients that have postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have
changes to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault.  The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.
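
As background for the registration and wake operations described above, here
is a stand-alone sketch of the Linux userfaultfd flow: register a region,
receive the fault, and resolve it with ``UFFDIO_COPY`` (which also wakes the
faulting thread).  It is an illustration of the mechanism, not QEMU or
vhost-user code; error handling is omitted, it needs ``-pthread`` to build,
and depending on the kernel it may require privileges (or
``vm.unprivileged_userfaultfd=1``) to run::

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static char *area;                  /* stands in for not-yet-arrived RAM */

  static void *toucher(void *arg)     /* plays the role of a vCPU */
  {
      (void)arg;
      return (void *)(long)area[0];   /* faults, sleeps until the page comes */
  }

  int main(void)
  {
      long page = sysconf(_SC_PAGESIZE);

      int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
      struct uffdio_api api = { .api = UFFD_API };
      ioctl(uffd, UFFDIO_API, &api);

      area = mmap(NULL, page, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

      /* 'Sensitise' the area: faults on missing pages go to uffd. */
      struct uffdio_register reg = {
          .range = { .start = (unsigned long)area, .len = page },
          .mode = UFFDIO_REGISTER_MODE_MISSING,
      };
      ioctl(uffd, UFFDIO_REGISTER, &reg);

      pthread_t th;
      pthread_create(&th, NULL, toucher, NULL);

      struct uffd_msg msg;            /* below: the fault-thread's job */
      read(uffd, &msg, sizeof(msg));  /* blocks until the access faults */
      if (msg.event == UFFD_EVENT_PAGEFAULT) {
          char *incoming = calloc(1, page);  /* 'page from the source' */
          struct uffdio_copy copy = {
              .dst = msg.arg.pagefault.address & ~(__u64)(page - 1),
              .src = (unsigned long)incoming,
              .len = page,
              .mode = 0,              /* place the page and wake the sleeper */
          };
          ioctl(uffd, UFFDIO_COPY, &copy);
      }

      pthread_join(th, NULL);
      printf("fault resolved\n");
      return 0;
  }
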
.. note::
  There are two future improvements that would be nice:

  a) Some way to make QEMU ignorant of the addresses in the client's
     address space
  b) Avoiding the need for QEMU to perform ufd-wake calls after the
     pages have arrived

Retro-fitting postcopy to existing clients is possible:

  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy.  In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implications understood; for example, if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.

Postcopy preemption mode
------------------------

Postcopy preempt is a new capability introduced in the 8.0 QEMU release.
It allows urgent pages (those for which a page fault was explicitly
requested by the destination QEMU) to be sent in a separate preempt
channel, rather than queued in the background migration channel.  Anyone
who cares about latencies of page faults during a postcopy migration
should enable this feature.  By default, it is not enabled.
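
The corresponding capability is ``postcopy-preempt``; as with the other
capabilities above, enable it on the monitor of both source and destination
prior to the start of migration:

``migrate_set_capability postcopy-preempt on``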