========
Postcopy
========

.. contents::

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge).  Its plus side is that there is an upper bound
on the amount of migration traffic and the time it takes; the down side is that
during the postcopy phase, a failure of *either* side causes the guest to be
lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
=================

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode.  Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on.  Issuing it after the end of a migration is harmless.
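
Management software typically drives the same sequence over QMP rather than
the human monitor.  The following is a minimal sketch that only builds the
JSON messages; it assumes the QMP commands ``migrate-set-capabilities``,
``migrate`` and ``migrate-start-postcopy``.  The helper names and the URI are
hypothetical placeholders, not part of QEMU.

```python
import json


def qmp(command, **arguments):
    """Build one QMP message as a JSON string (hypothetical helper)."""
    message = {"execute": command}
    if arguments:
        message["arguments"] = arguments
    return json.dumps(message)


def enable_postcopy():
    """Message to send to BOTH source and destination before migrating."""
    return qmp("migrate-set-capabilities",
               capabilities=[{"capability": "postcopy-ram", "state": True}])


def source_sequence(uri):
    """Messages for the source: enable, start in precopy, switch to postcopy."""
    return [
        enable_postcopy(),
        qmp("migrate", uri=uri),          # migration still starts in precopy
        qmp("migrate-start-postcopy"),    # may be sent at any later time
    ]


if __name__ == "__main__":
    # "tcp:dest.example:4444" is a made-up address for illustration only.
    for line in source_sequence("tcp:dest.example:4444"):
        print(line)
```

The destination needs the ``enable_postcopy()`` message as well, sent before
the incoming migration begins.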

Blocktime is a postcopy live migration metric, intended to show how
long a vCPU spent in interruptible sleep because of page faults.
The metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU.  These values are calculated on the destination
side.  To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command.
Its ``postcopy-blocktime`` value shows the overlapped blocking time for all
vCPUs, and ``postcopy-vcpu-blocktime`` shows the list of blocking times per
vCPU.
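
To make the difference between the two values concrete, here is a small
sketch.  It assumes that 'overlapped' means the time during which every vCPU
was blocked at once, and it works on made-up ``(start, end)`` intervals; it is
an illustration of the idea, not QEMU's accounting code.

```python
def per_vcpu_blocktime(intervals):
    """Sum of blocked time for each vCPU.

    intervals: one list per vCPU of (start, end) pairs, non-overlapping
    within a single vCPU (assumption of this sketch).
    """
    return [sum(end - start for start, end in cpu) for cpu in intervals]


def overlapped_blocktime(intervals):
    """Time during which all vCPUs were blocked simultaneously."""
    events = []
    for cpu in intervals:
        for start, end in cpu:
            events.append((start, 1))    # this vCPU becomes blocked
            events.append((end, -1))     # this vCPU resumes
    # At equal timestamps a resume (-1) sorts before a block (+1).
    events.sort()

    ncpus = len(intervals)
    blocked = 0
    total = 0
    all_blocked_since = None
    for when, delta in events:
        if delta == -1 and blocked == ncpus:
            total += when - all_blocked_since
        blocked += delta
        if blocked == ncpus:
            all_blocked_since = when
    return total


if __name__ == "__main__":
    sample = [[(0, 10), (20, 30)],   # vCPU 0
              [(5, 25)]]             # vCPU 1
    print(per_vcpu_blocktime(sample), overlapped_blocktime(sample))
```

With the sample data each vCPU is blocked for 20 time units, but both are
blocked together only during 5..10 and 20..25.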

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).

Postcopy internals
==================

State machine
-------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.
    (Triggered by reception of POSTCOPY_ADVISE command)

 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob.  With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running.  The main
    thread then finishes processing the migration package and
    carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - End

    The listen thread can now quit and perform the cleanup of migration
    state; the migration is now complete.
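
The sequence above can be sketched as a small transition table.  This is a
simplified model of the text only: the enum, the ``step`` helper and the
``"migration-finished"`` event are hypothetical names chosen for illustration,
and the table deliberately allows nothing but the linear order described here.

```python
from enum import Enum


class PostcopyState(Enum):
    ADVISE = 1
    DISCARD = 2
    LISTEN = 3
    RUNNING = 4
    END = 5


# (current state, received event) -> next state.  None means "not started".
TRANSITIONS = {
    (None, "POSTCOPY_ADVISE"): PostcopyState.ADVISE,
    (PostcopyState.ADVISE, "discard"): PostcopyState.DISCARD,
    (PostcopyState.DISCARD, "discard"): PostcopyState.DISCARD,  # later discards
    (PostcopyState.DISCARD, "POSTCOPY_LISTEN"): PostcopyState.LISTEN,
    (PostcopyState.LISTEN, "POSTCOPY_RUN"): PostcopyState.RUNNING,
    # Placeholder event name; the text does not name a command for this step.
    (PostcopyState.RUNNING, "migration-finished"): PostcopyState.END,
}


def step(state, event):
    """Return the next state, or raise ValueError for an unexpected event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state}") from None


def run(events):
    """Feed a list of events through the table, starting from 'not started'."""
    state = None
    for event in events:
        state = step(state, event)
    return state


if __name__ == "__main__":
    print(run(["POSTCOPY_ADVISE", "discard", "discard",
               "POSTCOPY_LISTEN", "POSTCOPY_RUN", "migration-finished"]))
```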

Device transfer
---------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source.  The
migration stream therefore has to be able to respond with page data *during*
the device load, and hence the device data has to be read from the stream
completely before the device load begins, to free the stream up.  This is
achieved by 'packaging' the device data into a blob that's read in one go.
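
The idea of reading the blob in one go can be sketched as follows.  The
framing used here (a 4-byte big-endian length followed by the payload) and the
function names are assumptions made for this example, not a description of the
actual wire format.

```python
import io
import struct


def write_package(stream, device_blob):
    """Append a length-prefixed package to the stream (assumed framing)."""
    stream.write(struct.pack(">I", len(device_blob)))
    stream.write(device_blob)


def read_package(stream):
    """Read the whole package into memory and return it as its own stream.

    After this returns, the original stream is positioned just past the
    package, so it is free to carry page data while the package is parsed.
    """
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("truncated package header")
    (length,) = struct.unpack(">I", header)
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError("truncated package payload")
    return io.BytesIO(payload)


if __name__ == "__main__":
    wire = io.BytesIO()
    write_package(wire, b"device-state-sections")
    wire.write(b"page-data-follows")
    wire.seek(0)

    package = read_package(wire)
    print(package.read())   # device data, parsed from memory
    print(wire.read())      # the stream itself is already at the page data
```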

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'
   - The device state

     A series of sections, identical to the precopy stream's device state stream,
     containing everything except postcopiable devices (i.e. RAM)
   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy).
However, when a page request is received from the destination, the dirty page
scanning restarts from the requested location.  This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly in the hope that those pages are likely to be used
by the destination soon.
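
A toy model of this scanning policy is shown below.  The class and method
names are hypothetical, and the bitmap is a plain Python list; the sketch only
illustrates how a request moves the scan position and how a dirty bit is
cleared once its page has been sent.

```python
class DirtyScanner:
    """Toy background scanner over a dirty bitmap (one bool per page)."""

    def __init__(self, dirty):
        self.dirty = list(dirty)
        self.position = 0

    def request(self, page):
        """A page request from the destination: restart scanning there."""
        self.position = page

    def next_page(self):
        """Return the next dirty page at or after the scan position.

        The scan wraps around once; the page's dirty bit is cleared because
        the page is now considered sent.  Returns None when nothing is dirty.
        """
        count = len(self.dirty)
        for offset in range(count):
            page = (self.position + offset) % count
            if self.dirty[page]:
                self.dirty[page] = False
                self.position = (page + 1) % count
                return page
        return None


if __name__ == "__main__":
    scanner = DirtyScanner([True] * 8)
    order = [scanner.next_page()]        # background scan starts at page 0
    scanner.request(5)                   # destination faults on page 5
    while (page := scanner.next_page()) is not None:
        order.append(page)
    print(order)
```

In the demo the requested page 5 and its neighbours 6 and 7 are sent before
the scan wraps around to the remaining pages.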

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2   3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

   All the data associated with the package - the ( ... ) section in the diagram -
   is read into memory, and the main thread recurses into qemu_loadvm_state_main
   to process the contents of the package (2) which contains commands (3,6) and
   devices (4...)

- On receipt of 'postcopy listen' (3), i.e. the 1st command in the package,

   a new thread (a) is started that takes over servicing the migration stream,
   while the main thread carries on loading the package.  The new thread loads
   normal background page data (b) but if during a device load a fault happens
   (5) the returned page (c) is loaded by the listen thread, allowing the main
   thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

   letting the destination CPUs start running.  At the end of the
   ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
   is no longer used by migration, while the listen thread carries on servicing
   page data until the end of migration.
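
The interplay between the two threads can be modelled with ordinary Python
threads.  This is only a sketch under simplifying assumptions: queues stand in
for the migration stream and the return path, a condition variable stands in
for the fault mechanism, and all class and function names are hypothetical.

```python
import queue
import threading


class Destination:
    """Toy destination: a page store shared by the main and listen threads."""

    def __init__(self):
        self.pages = {}                     # page number -> data received
        self.cond = threading.Condition()   # signalled when a page arrives
        self.requests = queue.Queue()       # page requests sent to the source
        self.stream = queue.Queue()         # migration stream from the source

    def listen_thread(self):
        """Receive pages off the stream until a None sentinel arrives."""
        while True:
            item = self.stream.get()
            if item is None:
                return
            page, data = item
            with self.cond:
                self.pages[page] = data
                self.cond.notify_all()

    def access(self, page):
        """Called during the main thread's device load; blocks if missing."""
        with self.cond:
            if page not in self.pages:
                self.requests.put(page)     # the page request, (5) in the diagram
                self.cond.wait_for(lambda: page in self.pages)
            return self.pages[page]


def source_thread(dest, memory):
    """Toy source: answer each page request with that page's contents."""
    while True:
        page = dest.requests.get()
        if page is None:
            return
        dest.stream.put((page, memory[page]))


def demo():
    memory = {number: f"page-{number}" for number in range(4)}
    dest = Destination()
    workers = [
        threading.Thread(target=dest.listen_thread, daemon=True),
        threading.Thread(target=source_thread, args=(dest, memory), daemon=True),
    ]
    for worker in workers:
        worker.start()
    # The "device load": two faults, then an access to an already-loaded page.
    loaded = [dest.access(2), dest.access(0), dest.access(2)]
    dest.requests.put(None)   # stop the toy source
    dest.stream.put(None)     # stop the listen thread
    for worker in workers:
        worker.join()
    return loaded


if __name__ == "__main__":
    print(demo())
```

The point of the model is that ``access()`` never reads the stream itself; the
main thread only waits, and it is the listen thread that delivers the page.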

Source side page bitmap
-----------------------

The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty' - i.e. needs
sending.  During the precopy phase this is updated as the CPU dirties
pages, however during postcopy the source CPUs are stopped and nothing should
dirty anything any more.  Instead, dirty bits are cleared when the relevant
pages are sent during postcopy.

Postcopy features
=================

Postcopy recovery
-----------------

Compared to precopy, postcopy is special in its error handling.  When any
error happens (in this case, mostly network errors), QEMU cannot easily
fail a migration because VM data resides in both source and destination
QEMU instances.  Instead, when an issue happens, QEMU on both sides
will go into a paused state.  A recovery phase is then needed to continue a
paused postcopy migration.

The recovery phase normally contains a few steps:

  - When a network issue occurs, both QEMUs will go into **POSTCOPY_PAUSED**
    migration state.

  - When the network is recovered (or a new network is provided), the admin
    can set up the new channel for migration using the QMP command
    'migrate-recover' on the destination node, preparing for a resume.

  - On the source host, the admin can continue the interrupted postcopy
    migration using the QMP command 'migrate' with the resume=true flag set.
    Source QEMU will go into **POSTCOPY_RECOVER_SETUP** state trying to
    re-establish the channels.

  - When both sides of QEMU successfully reconnect using a new or fixed up
    channel, they will go into **POSTCOPY_RECOVER** state.  A handshake
    procedure is needed to properly synchronize the VM states between
    the two QEMUs to continue the postcopy migration.  For example, there
    can be pages sent right during the window when the network is
    interrupted; the handshake guarantees that pages lost in-flight
    are resent.

  - After a proper handshake synchronization, QEMU will continue the
    postcopy migration on both sides and go back to **POSTCOPY_ACTIVE**
    state.  Postcopy migration will continue.

During a paused postcopy migration, the VM can logically still continue
running, and it is not affected by accesses to pages that
were already migrated to the destination VM before the interruption happened.
However, if any of the missing pages is accessed on the destination VM, the VM
thread will be halted waiting for the page to be migrated, which means it can
be halted until the recovery is complete.

The impact of accessing missing pages depends on the
configuration of the guest.  For example, with async page fault
enabled, the guest can logically schedule out the threads
accessing missing pages proactively.
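
The two admin actions from the recovery steps can be sketched as QMP
messages.  The sketch assumes that 'migrate-recover' takes a ``uri`` argument
and that the resume flag of 'migrate' is spelled ``resume``; the function name
and both addresses are made-up placeholders.

```python
import json


def recovery_commands(listen_uri, connect_uri):
    """Return the (destination, source) QMP messages for a recovery attempt."""
    on_destination = {
        "execute": "migrate-recover",
        "arguments": {"uri": listen_uri},      # new channel to listen on
    }
    on_source = {
        "execute": "migrate",
        "arguments": {"uri": connect_uri, "resume": True},
    }
    return on_destination, on_source


if __name__ == "__main__":
    # Placeholder addresses for illustration only.
    for message in recovery_commands("tcp:0:4445", "tcp:dest.example:4445"):
        print(json.dumps(message))
```

The destination message has to be sent first, so that the new channel exists
by the time the source tries to reconnect.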

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, triggering (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.
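
The figure in (d) follows from simple arithmetic, sketched below.  The
calculation ignores protocol overhead and assumes the link is otherwise idle,
so real transfers will be somewhat slower; the function name is illustrative.

```python
GIB = 1 << 30
MIB = 1 << 20


def transfer_seconds(page_bytes, link_bits_per_second):
    """Ideal time to push one page across the link, ignoring all overhead."""
    return page_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    link = 10e9  # 10 Gbps
    print(f"1GB hugepage: {transfer_seconds(1 * GIB, link):.3f} s")
    print(f"2MB hugepage: {transfer_seconds(2 * MIB, link) * 1000:.2f} ms")
```

A 1GB page therefore blocks the faulting thread for roughly 0.86 seconds at
best, while a 2MB page takes under 2 milliseconds.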

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU.  There are restrictions on the types
of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault.  The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.
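
The role of the mapping table can be illustrated with a short lookup sketch.
The ``Region`` layout, the ``translate`` function and the addresses are
assumptions made for this example and do not reflect the actual vhost-user
message format.

```python
from bisect import bisect_right
from typing import NamedTuple


class Region(NamedTuple):
    client_start: int      # start address in the client's address space
    length: int            # size of the mapping in bytes
    ramblock: str          # name of the RAMBlock backing the mapping
    ramblock_offset: int   # offset of the mapping within that RAMBlock


def translate(regions, fault_addr):
    """Convert a client fault address into a (RAMBlock, offset) pair.

    regions must be sorted by client_start and must not overlap.
    """
    starts = [region.client_start for region in regions]
    index = bisect_right(starts, fault_addr) - 1
    if index < 0:
        raise LookupError(f"address {fault_addr:#x} is below every region")
    region = regions[index]
    delta = fault_addr - region.client_start
    if delta >= region.length:
        raise LookupError(f"address {fault_addr:#x} is not in any region")
    return region.ramblock, region.ramblock_offset + delta


if __name__ == "__main__":
    # Made-up table: two client mappings of the same RAMBlock.
    table = [
        Region(0x7F0000000000, 0x10000, "pc.ram", 0x0),
        Region(0x7F0000100000, 0x2000, "pc.ram", 0x10000),
    ]
    name, offset = translate(table, 0x7F0000100010)
    print(name, hex(offset))
```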

.. note::
  There are two future improvements that would be nice:
    a) Some way to make QEMU ignorant of the addresses in the client's
       address space
    b) Avoiding the need for QEMU to perform ufd-wake calls after the
       pages have arrived

Retro-fitting postcopy to existing clients is possible:
  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy.  In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.

Postcopy preemption mode
------------------------

Postcopy preempt is a capability introduced in the QEMU 8.0 release.  It
allows urgent pages (those explicitly requested by the destination QEMU
because of a page fault) to be sent in a separate preempt channel, rather
than queued in the background migration channel.  Anyone who cares about the
latency of page faults during a postcopy migration should enable this
feature.  By default, it's not enabled.