========
Postcopy
========

.. contents::

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge).  Its plus side is that there is an upper bound
on the amount of migration traffic and the time it takes; the down side is that
during the postcopy phase, a failure of *either* side causes the guest to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.
 
Enabling postcopy
=================

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode.  Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on.  Issuing it after the end of a migration is harmless.
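 
The monitor commands above have QMP counterparts.  As a rough sketch of the
same sequence, the snippet below only builds the JSON messages and does not
talk to a QEMU instance; the helper ``qmp_command`` and the URI
``tcp:dst-host:4444`` are placeholders invented for this example.

.. code-block:: python

  import json


  def qmp_command(name, **arguments):
      """Build one QMP command as a JSON string (hypothetical helper)."""
      message = {"execute": name}
      if arguments:
          message["arguments"] = arguments
      return json.dumps(message)


  # 1. On both source and destination, before the migration starts.
  enable_postcopy = qmp_command(
      "migrate-set-capabilities",
      capabilities=[{"capability": "postcopy-ram", "state": True}],
  )

  # 2. On the source: start the migration (it begins in precopy mode).
  #    The URI is a placeholder.
  start_migration = qmp_command("migrate", uri="tcp:dst-host:4444")

  # 3. On the source, any time after step 2: switch to postcopy.
  switch_to_postcopy = qmp_command("migrate-start-postcopy")

  for line in (enable_postcopy, start_migration, switch_to_postcopy):
      print(line)

Each printed line would be sent over the QMP socket of the relevant QEMU
instance once the usual capabilities negotiation has been done.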
 
Blocktime is a postcopy live migration metric that shows how long a vCPU
spent in interruptible sleep because of page faults.  The metric is
calculated both for all vCPUs as an overlapped value and separately for
each vCPU.  These values are calculated on the destination side.  To
enable postcopy blocktime calculation, enter the following command on
the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command.
The ``postcopy-blocktime`` value in the reply shows the overlapped blocking
time for all vCPUs; ``postcopy-vcpu-blocktime`` shows a list of blocking
times, one per vCPU.
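 
A minimal sketch of reading those two fields out of a ``query-migrate``
reply follows.  The reply is a hand-written, heavily trimmed stand-in with
made-up numbers, and ``summarize_blocktime`` is a name chosen for this
example only.

.. code-block:: python

  import json

  # Hypothetical reply; a real one carries many more fields.
  sample_reply = json.loads("""
  {
    "return": {
      "status": "completed",
      "postcopy-blocktime": 12,
      "postcopy-vcpu-blocktime": [12, 3, 0, 7]
    }
  }
  """)


  def summarize_blocktime(reply):
      """Return (overlapped blocktime, per-vCPU list, index of worst vCPU)."""
      info = reply["return"]
      overlapped = info.get("postcopy-blocktime")
      per_vcpu = info.get("postcopy-vcpu-blocktime", [])
      # Index of the vCPU that was blocked the longest, if any were reported.
      worst = None
      if per_vcpu:
          worst = max(range(len(per_vcpu)), key=per_vcpu.__getitem__)
      return overlapped, per_vcpu, worst


  overlapped, per_vcpu, worst = summarize_blocktime(sample_reply)
  print(f"overlapped={overlapped} per-vCPU={per_vcpu} worst vCPU index={worst}")

Both fields are looked up with ``get`` because they are only expected when
the blocktime capability was enabled on the destination.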
 
.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).
 
Postcopy internals
==================

State machine
-------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.
    (Triggered by reception of POSTCOPY_ADVISE command)

 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob.  With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running.  The main
    thread now finishes processing the migration package and
    then carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - Paused

    Postcopy can run into a paused state (normally on both sides when it
    happens), where all threads are temporarily halted, mostly due to
    network errors.  When the paused state is reached, migration makes sure
    the QEMU binaries on both sides maintain the data without corrupting
    the VM.  To continue the migration, the admin needs to fix the
    migration channel using the QMP command 'migrate-recover' on the
    destination node, then resume the migration using the QMP command
    'migrate' again on the source node, with the resume=true flag set.

 - End

    The listen thread can now quit and perform the cleanup of migration
    state; the migration is now complete.
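 
The sequence of states above can be modelled as a small transition table.
This is a toy model for illustration, not QEMU's implementation: the state
names follow the list above, and allowing Paused only to and from Running is
an assumption made for this sketch.

.. code-block:: python

  from enum import Enum


  class PostcopyState(Enum):
      ADVISE = "advise"
      DISCARD = "discard"
      LISTEN = "listen"
      RUNNING = "running"
      PAUSED = "paused"
      END = "end"


  # Allowed transitions in this toy model.  The main sequence follows the
  # text; routing PAUSED only to and from RUNNING is an assumption.
  TRANSITIONS = {
      PostcopyState.ADVISE: {PostcopyState.DISCARD},
      PostcopyState.DISCARD: {PostcopyState.LISTEN},
      PostcopyState.LISTEN: {PostcopyState.RUNNING},
      PostcopyState.RUNNING: {PostcopyState.PAUSED, PostcopyState.END},
      PostcopyState.PAUSED: {PostcopyState.RUNNING},
      PostcopyState.END: set(),
  }


  def advance(current, target):
      """Return the new state, or raise if the transition is not allowed."""
      if target not in TRANSITIONS[current]:
          raise ValueError(f"illegal transition {current.name} -> {target.name}")
      return target


  # Walk through one migration that is paused once and then recovered.
  state = PostcopyState.ADVISE
  for target in (
      PostcopyState.DISCARD,
      PostcopyState.LISTEN,
      PostcopyState.RUNNING,
      PostcopyState.PAUSED,
      PostcopyState.RUNNING,
      PostcopyState.END,
  ):
      state = advance(state, target)
  print(state.name)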
 
Device transfer
---------------

Loading device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source.  The
migration stream therefore has to be able to respond with page data *during*
the device load, and hence the device data has to be read from the stream
completely before the device load begins, to free the stream up.  This is
achieved by 'packaging' the device data into a blob that's read in one go.
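 
To illustrate why reading the blob in one go frees the stream, here is a
small sketch.  The framing (a 4-byte big-endian length followed by the blob)
and the byte strings are assumptions made for this example; they are not
taken from the text above.

.. code-block:: python

  import io
  import struct


  def read_package(stream):
      """Read one length-prefixed package completely into memory.

      The 4-byte big-endian length prefix is an assumed framing.
      """
      header = stream.read(4)
      if len(header) != 4:
          raise EOFError("truncated package header")
      (length,) = struct.unpack(">I", header)
      blob = stream.read(length)
      if len(blob) != length:
          raise EOFError("truncated package body")
      # From here on the device state is parsed from memory, so the
      # stream itself is free to carry page data again.
      return io.BytesIO(blob)


  # Demo stream: one package followed by page data (made-up contents).
  device_state = b"LISTEN|dev0|dev1|RUN"
  stream = io.BytesIO(
      struct.pack(">I", len(device_state)) + device_state + b"PAGE-DATA"
  )

  package = read_package(stream)
  remaining = stream.read()
  print(package.getvalue(), remaining)

After ``read_package`` returns, whatever follows the package on the stream
can be consumed independently of how long the device load takes.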
 
Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'
   - The device state

     A series of sections, identical to the precopy stream's device state stream,
     containing everything except postcopiable devices (i.e. RAM)
   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy);
however, when a page request is received from the destination, the dirty page
scanning restarts from the requested location.  This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly, in the hope that those pages are likely to be used
by the destination soon.
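 
The scanning behaviour described above can be sketched as follows.  This is
a simplified model rather than QEMU's code: pages are plain integers, the
dirty bitmap is a Python set, and page requests are supplied up front as a
dict keyed by the number of pages already sent (a stand-in for requests
arriving asynchronously).

.. code-block:: python

  def send_order(dirty, requests):
      """Return the order in which dirty pages would be sent.

      dirty    -- iterable of dirty page numbers
      requests -- dict: number of pages sent so far -> requested page
                  (hypothetical input format for this sketch)
      """
      dirty = set(dirty)
      if not dirty:
          return []
      npages = max(dirty) + 1
      order = []
      cursor = 0
      while dirty:
          # A page request restarts the scan at the requested page.
          requested = requests.get(len(order))
          if requested is not None and requested in dirty:
              cursor = requested
          # Find the next dirty page at or after the cursor, wrapping around.
          while cursor not in dirty:
              cursor = (cursor + 1) % npages
          order.append(cursor)
          dirty.discard(cursor)  # the dirty bit is cleared once the page is sent
          cursor = (cursor + 1) % npages
      return order


  # Eight dirty pages; page 5 is requested after two pages have been sent.
  print(send_order(range(8), {2: 5}))  # [0, 1, 5, 6, 7, 2, 3, 4]

In the example, page 5 jumps the queue and pages 6 and 7 follow it
immediately, before the scan wraps around to the pages it skipped.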
 
Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2   3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

   All the data associated with the package - the ( ... ) section in the diagram -
   is read into memory, and the main thread recurses into qemu_loadvm_state_main
   to process the contents of the package (2) which contains commands (3,6) and
   devices (4...)

- On receipt of 'postcopy listen' (3) (i.e. the 1st command in the package)

   a new thread (a) is started that takes over servicing the migration stream,
   while the main thread carries on loading the package.  It loads normal
   background page data (b) but if during a device load a fault happens (5)
   the returned page (c) is loaded by the listen thread, allowing the main
   thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

   letting the destination CPUs start running.  At the end of the
   ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
   is no longer used by migration, while the listen thread carries on servicing
   page data until the end of migration.
 
Source side page bitmap
-----------------------

The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty' - i.e. needs
sending.  During the precopy phase this is updated as the CPU dirties
pages; however, during postcopy the source CPUs are stopped and nothing
should dirty anything any more.  Instead, dirty bits are cleared when the
relevant pages are sent during postcopy.
 
Postcopy features
=================

Postcopy recovery
-----------------

Compared to precopy, postcopy is special in its error handling.  When any
error happens (in this case, mostly network errors), QEMU cannot easily
fail a migration because VM data resides in both source and destination
QEMU instances.  Instead, when an issue happens QEMU on both sides
will go into a paused state.  A recovery phase is needed to continue a
paused postcopy migration.

The recovery phase normally contains a few steps:

  - When a network issue occurs, both QEMUs will go into the PAUSED state

  - When the network is recovered (or a new network is provided), the admin
    can set up the new channel for migration using the QMP command
    'migrate-recover' on the destination node, preparing for a resume.

  - On the source host, the admin can continue the interrupted postcopy
    migration using the QMP command 'migrate' with the resume=true flag set.

  - After the connection is re-established, QEMU will continue the postcopy
    migration on both sides.
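 
The two admin-driven steps above might translate into QMP messages like the
ones built below.  This is only a sketch: both URIs are placeholders chosen
for this example, and nothing is sent to a real QEMU instance.

.. code-block:: python

  import json

  # Placeholder addresses for the replacement migration channel.
  LISTEN_URI = "tcp:0.0.0.0:5555"    # hypothetical, used on the destination
  CONNECT_URI = "tcp:dst-host:5555"  # hypothetical, used on the source

  # Destination node: prepare the new channel for the resume.
  recover_on_destination = {
      "execute": "migrate-recover",
      "arguments": {"uri": LISTEN_URI},
  }

  # Source node: continue the interrupted postcopy migration.
  resume_on_source = {
      "execute": "migrate",
      "arguments": {"uri": CONNECT_URI, "resume": True},
  }

  for node, message in (
      ("destination", recover_on_destination),
      ("source", resume_on_source),
  ):
      print(f"{node}: {json.dumps(message)}")

The destination message has to be issued first, so that the new channel is
ready by the time the source tries to resume.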
 
During a paused postcopy migration, the VM can logically still continue
running, and it will not be impacted by any access to pages that
were already migrated to the destination VM before the interruption happened.
However, if any of the missing pages is accessed on the destination VM, the VM
thread will be halted waiting for the page to be migrated, which means it can
be halted until the recovery is complete.

The impact of accessing missing pages can depend on the configuration of
the guest.  For example, with async page fault enabled, the guest can
logically schedule out the threads accessing missing pages proactively.
 
Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, causing (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.
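 
The figure in (d) can be checked with a quick back-of-the-envelope
calculation.  The sketch assumes binary page sizes (1GB = 2^30 bytes) and a
link that delivers its full nominal rate with no protocol overhead, so real
transfers would be somewhat slower.

.. code-block:: python

  LINK_BITS_PER_SECOND = 10e9  # the 10Gbps link from the text


  def transfer_seconds(page_bytes, link_bps=LINK_BITS_PER_SECOND):
      """Ideal time to push one page across the link, ignoring overhead."""
      return page_bytes * 8 / link_bps


  two_mb = transfer_seconds(2 * 1024**2)
  one_gb = transfer_seconds(1024**3)

  print(f"2MB hugepage: {two_mb * 1e3:.2f} ms")  # 2MB hugepage: 1.68 ms
  print(f"1GB hugepage: {one_gb:.2f} s")         # 1GB hugepage: 0.86 s

A faulting thread therefore waits on the order of milliseconds for a 2MB
page but most of a second for a 1GB page, even under ideal conditions.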
 
Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU.  There are restrictions on the types
of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault.  The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.
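 
The role of the mapping table can be illustrated with a small lookup
sketch.  The table layout, the ``Mapping`` type, the addresses and the
RAMBlock names are all invented for this example; the real table format is
defined by the vhost-user protocol and is not reproduced here.

.. code-block:: python

  from bisect import bisect_right
  from typing import NamedTuple


  class Mapping(NamedTuple):
      client_start: int  # start of the region in the client's address space
      size: int          # length of the region in bytes
      ramblock: str      # RAMBlock name (example values only)
      offset: int        # offset of the region's start within the RAMBlock


  def build_table(mappings):
      """Sort the regions by client address so lookups can use bisection."""
      return sorted(mappings, key=lambda m: m.client_start)


  def translate(table, fault_addr):
      """Convert a client fault address to (RAMBlock name, offset)."""
      starts = [m.client_start for m in table]
      index = bisect_right(starts, fault_addr) - 1
      if index < 0:
          raise LookupError(f"address {fault_addr:#x} is below every region")
      region = table[index]
      if fault_addr >= region.client_start + region.size:
          raise LookupError(f"address {fault_addr:#x} is not in any region")
      return region.ramblock, region.offset + (fault_addr - region.client_start)


  table = build_table([
      Mapping(0x7F8000000000, 0x200000, "example.vram", 0),
      Mapping(0x7F0000000000, 0x40000000, "example.ram", 0),
  ])
  print(translate(table, 0x7F0000001000))  # ('example.ram', 4096)

With such a table, a fault reported on the client's userfaultfd can be
turned into an ordinary page request for the matching RAMBlock and offset.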
 
.. note::
  There are two future improvements that would be nice:
    a) Some way to make QEMU ignorant of the addresses in the client's
       address space
    b) Avoiding the need for QEMU to perform ufd-wake calls after the
       pages have arrived

Retro-fitting postcopy to existing clients is possible:
  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy.  In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implications understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.
 
Postcopy preemption mode
------------------------

Postcopy preempt is a new capability introduced in the QEMU 8.0 release.  It
allows urgent pages (those explicitly requested by the destination QEMU
because of a page fault) to be sent in a separate preempt channel, rather
than queued in the background migration channel.  Anyone who cares about
page fault latency during a postcopy migration should enable this feature.
It is not enabled by default.