========
Postcopy
========

.. contents::

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge). Its plus side is that there is an upper bound on
the amount of migration traffic and the time it takes; the down side is that
during the postcopy phase, a failure of *either* side causes the guest to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
=================

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode.  Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on.  Issuing it after the end of a migration is harmless.
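
The two monitor commands above also have QMP counterparts,
``migrate-set-capabilities`` and ``migrate-start-postcopy``.  The sketch below
only builds the JSON messages; the helper names are hypothetical and the
code that connects to a QMP socket and sends them is deliberately left out.

.. code-block:: python

   import json


   def set_capability(name, enabled=True):
       """Build a QMP 'migrate-set-capabilities' message for one capability."""
       return {
           "execute": "migrate-set-capabilities",
           "arguments": {
               "capabilities": [{"capability": name, "state": enabled}],
           },
       }


   def start_postcopy():
       """Build the QMP message that switches a running migration to postcopy."""
       return {"execute": "migrate-start-postcopy"}


   if __name__ == "__main__":
       # Send the first message to both source and destination before the
       # migration starts; send the second to the source once it is running.
       for message in (set_capability("postcopy-ram"), start_postcopy()):
           print(json.dumps(message))
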

Blocktime is a postcopy live migration metric, intended to show how
long a vCPU spent in interruptible sleep due to page faults.
The metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU. These values are calculated on the destination
side.  To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command.
Its ``postcopy-blocktime`` value shows the overlapped blocking
time for all vCPUs; ``postcopy-vcpu-blocktime`` shows a list of blocking
times per vCPU.

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).

Postcopy internals
==================

State machine
-------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.
    (Triggered by reception of POSTCOPY_ADVISE command)

 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob.  With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running.  The main
    thread now finishes processing the migration package and
    now carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - Paused

    Postcopy can run into a paused state (normally on both sides when it
    happens), where all threads are temporarily halted, mostly due to
    network errors.  When the paused state is reached, migration makes sure
    the QEMU binaries on both sides maintain the data without corrupting
    the VM.  To continue the migration, the admin needs to fix the
    migration channel using the QMP command 'migrate-recover' on the
    destination node, then resume the migration using the QMP command 'migrate'
    again on the source node, with the resume=true flag set.

 - End

    The listen thread can now quit and perform the cleanup of migration
    state; the migration is now complete.

Device transfer
---------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source.  As such,
the migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream completely
before the device load begins, to free the stream up.  This is achieved by
'packaging' the device data into a blob that's read in one go.
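
The idea behind the package can be sketched in a few lines of Python.  This is
a toy model, not QEMU's loader: the 4-byte big-endian length prefix and the
``read_package`` helper are assumptions made for illustration.  The point is
that the whole blob is pulled off the stream before any of it is processed, so
the stream itself is free to carry page data while the blob is being loaded.

.. code-block:: python

   import io
   import struct


   def read_package(stream):
       """Read one length-prefixed blob completely and return it as its own stream.

       Assumed toy framing: a 4-byte big-endian length followed by the payload.
       """
       header = stream.read(4)
       if len(header) != 4:
           raise EOFError("truncated package header")
       (length,) = struct.unpack(">I", header)
       blob = stream.read(length)
       if len(blob) != length:
           raise EOFError("truncated package body")
       # The device data now lives in memory; the caller can parse it from
       # this in-memory stream while the original stream carries pages.
       return io.BytesIO(blob)


   if __name__ == "__main__":
       wire = io.BytesIO(struct.pack(">I", 6) + b"DEVICE" + b"page-data")
       package = read_package(wire)
       print(package.read())  # the packaged device data
       print(wire.read())     # what is left on the stream for the page reader
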

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'
   - The device state

     A series of sections, identical to the precopy stream's device state stream
     containing everything except postcopiable devices (i.e. RAM)
   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy);
however, when a page request is received from the destination, the dirty page
scanning restarts from the requested location.  This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly in the hope that those pages are likely to be used
by the destination soon.
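
The scanning behaviour described above can be modelled as a cursor over a dirty
bitmap.  The sketch below is a simplified illustration, not QEMU's code: it
assumes a single flat list of pages (no RAMBlocks), and ``DirtyScanner`` and
its methods are hypothetical names.

.. code-block:: python

   class DirtyScanner:
       """Toy model of the source's background scan over a dirty bitmap."""

       def __init__(self, dirty):
           self.dirty = list(dirty)  # one boolean per page, True = needs sending
           self.cursor = 0           # where the next scan starts

       def request(self, page):
           """A page request from the destination moves the scan position."""
           self.cursor = page

       def next_page(self):
           """Return the next dirty page at or after the cursor, wrapping around.

           The page's dirty bit is cleared as it is sent.  Returns None once
           no dirty pages remain.
           """
           count = len(self.dirty)
           for step in range(count):
               index = (self.cursor + step) % count
               if self.dirty[index]:
                   self.dirty[index] = False
                   self.cursor = (index + 1) % count
                   return index
           return None


   if __name__ == "__main__":
       scanner = DirtyScanner([True] * 8)
       order = [scanner.next_page(), scanner.next_page()]
       scanner.request(5)  # the destination faulted on page 5
       page = scanner.next_page()
       while page is not None:
           order.append(page)
           page = scanner.next_page()
       print(order)

In this model the requested page and its neighbours are sent immediately after
the request, and the scan later wraps around to pick up the pages it skipped.
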

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2   3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

   All the data associated with the package - the ( ... ) section in the diagram -
   is read into memory, and the main thread recurses into qemu_loadvm_state_main
   to process the contents of the package (2) which contains commands (3,6) and
   devices (4...)

- On receipt of 'postcopy listen' (3) (i.e. the 1st command in the package)

   a new thread (a) is started that takes over servicing the migration stream,
   while the main thread carries on loading the package.
   The listen thread loads normal
   background page data (b), but if a fault happens during a device load (5),
   the returned page (c) is loaded by the listen thread, allowing the main
   thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

   letting the destination CPUs start running.  At the end of the
   ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
   is no longer used by migration, while the listen thread carries on servicing
   page data until the end of migration.

Source side page bitmap
-----------------------

The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty', i.e. needs
sending.  During the precopy phase this is updated as the CPU dirties
pages; however, during postcopy the source CPUs are stopped and nothing should
dirty anything any more.  Instead, dirty bits are cleared when the relevant
pages are sent during postcopy.

Postcopy features
=================

Postcopy recovery
-----------------

Compared to precopy, postcopy is special in its error handling.  When any
error happens (in this case, mostly network errors), QEMU cannot easily
fail a migration because VM data resides in both source and destination
QEMU instances.
Instead, when an issue occurs, QEMU on both sides
will go into a paused state.  A recovery phase is then needed to continue a
paused postcopy migration.

The recovery phase normally contains a few steps:

  - When a network issue occurs, both QEMU instances will go into the PAUSED state

  - When the network is recovered (or a new network is provided), the admin
    can set up the new channel for migration using the QMP command
    'migrate-recover' on the destination node, preparing for a resume.

  - On the source host, the admin can continue the interrupted postcopy
    migration using the QMP command 'migrate' with the resume=true flag set.

  - After the connection is re-established, QEMU will continue the postcopy
    migration on both sides.

During a paused postcopy migration, the VM can logically still continue
running, and it will not be impacted by any access to pages that
were already migrated to the destination VM before the interruption happened.
However, if any of the missing pages is accessed on the destination VM, the VM
thread will be halted waiting for the page to be migrated, which means it can
be halted until the recovery is complete.

The impact of accessing missing pages can depend on the
configuration of the guest.  For example, with async page fault
enabled, the guest can logically proactively schedule out the threads
accessing missing pages.
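
The two admin steps of the recovery phase map onto two QMP messages.  The
sketch below only builds them; the helper names are hypothetical, the URIs
are placeholders, and transport over the QMP socket is not shown.

.. code-block:: python

   import json


   def migrate_recover(uri):
       """Destination side: provide a new channel for the paused migration."""
       return {"execute": "migrate-recover", "arguments": {"uri": uri}}


   def migrate_resume(uri):
       """Source side: resume the paused postcopy migration over the new channel."""
       return {"execute": "migrate", "arguments": {"uri": uri, "resume": True}}


   if __name__ == "__main__":
       # Placeholder addresses; substitute the real listening and target URIs.
       print(json.dumps(migrate_recover("tcp:0.0.0.0:4444")))
       print(json.dumps(migrate_resume("tcp:destination.example:4444")))

The destination message is sent first, so that the new channel is listening
before the source tries to reconnect.
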

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, triggering (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU.  There are restrictions on the types
of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault.  The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.

.. note::
  There are two future improvements that would be nice:
    a) Some way to make QEMU ignorant of the addresses in the client's
       address space
    b) Avoiding the need for QEMU to perform ufd-wake calls after the
       pages have arrived

Retro-fitting postcopy to existing clients is possible:
  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy.  In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.

Postcopy preemption mode
------------------------

Postcopy preempt is a capability introduced in the QEMU 8.0 release.  It
allows urgent pages (those explicitly requested by the destination QEMU
through page faults) to be sent in a separate preempt channel, rather than
queued in the background migration channel.  Anyone who cares about page
fault latencies during a postcopy migration should enable this feature.  By
default, it's not enabled.
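
As a hedged illustration, preempt mode can be switched on with the same
capability mechanism used for ``postcopy-ram``.  The sketch below assumes the
capability is named ``postcopy-preempt`` and enables it together with
``postcopy-ram`` in a single QMP ``migrate-set-capabilities`` message; the
``capabilities_message`` helper is a hypothetical name, and the message is
only built here, not sent.

.. code-block:: python

   import json


   def capabilities_message(states):
       """Build one QMP message that sets several migration capabilities.

       ``states`` maps capability names to booleans.
       """
       return {
           "execute": "migrate-set-capabilities",
           "arguments": {
               "capabilities": [
                   {"capability": name, "state": enabled}
                   for name, enabled in states.items()
               ],
           },
       }


   if __name__ == "__main__":
       message = capabilities_message(
           {"postcopy-ram": True, "postcopy-preempt": True}
       )
       print(json.dumps(message, indent=2))
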