========
Postcopy
========

.. contents::

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge).  Its plus side is that there is an upper bound on
the amount of migration traffic and time it takes; the down side is that during
the postcopy phase, a failure of *either* side causes the guest to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
=================

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode.  Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on.  Issuing it after the end of a migration is harmless.
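
The commands above are typed on the human monitor.  As a minimal sketch,
assuming the QMP counterparts ``migrate-set-capabilities``, ``migrate`` and
``migrate-start-postcopy``, the same sequence could be built as JSON messages
as follows.  The helper ``qmp_command`` and the destination URI are made up for
illustration, and nothing here connects to a running QEMU.

```python
import json


def qmp_command(name, arguments=None):
    """Serialise one QMP command as a JSON string (illustrative helper)."""
    message = {"execute": name}
    if arguments is not None:
        message["arguments"] = arguments
    return json.dumps(message)


# Enable the capability; this has to be sent to both source and destination.
enable_postcopy = qmp_command(
    "migrate-set-capabilities",
    {"capabilities": [{"capability": "postcopy-ram", "state": True}]},
)

# Start the migration on the source; it still begins in precopy mode.
# The URI is a placeholder, not a value taken from this document.
start_migration = qmp_command("migrate", {"uri": "tcp:destination.example:4444"})

# Ask the source to switch from precopy to postcopy.
switch_to_postcopy = qmp_command("migrate-start-postcopy")

for line in (enable_postcopy, start_migration, switch_to_postcopy):
    print(line)
```

Each printed line is one message that would be written to the QMP socket of
the relevant QEMU instance.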

Postcopy internals
==================

State machine
-------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END.

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.
    (Triggered by reception of POSTCOPY_ADVISE command)

 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob.  With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running.  The main
    thread now finishes processing the migration package and
    carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - End

    The listen thread can now quit and perform the cleanup of migration
    state; the migration is now complete.
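
The state sequence above can be modelled as a small transition table.  This is
only a simplified sketch, not QEMU's implementation: the ``NONE`` starting
state, the event strings other than the command names mentioned above, and the
``advance`` helper are all assumptions made for illustration.

```python
from enum import Enum


class PostcopyState(Enum):
    NONE = 0       # assumed initial state, before any postcopy command
    ADVISE = 1
    DISCARD = 2
    LISTEN = 3
    RUNNING = 4
    END = 5


# (current state, event) -> next state.  Event names are illustrative.
TRANSITIONS = {
    (PostcopyState.NONE, "POSTCOPY_ADVISE"): PostcopyState.ADVISE,
    (PostcopyState.ADVISE, "DISCARD"): PostcopyState.DISCARD,
    # Further discard commands leave the state unchanged.
    (PostcopyState.DISCARD, "DISCARD"): PostcopyState.DISCARD,
    (PostcopyState.DISCARD, "POSTCOPY_LISTEN"): PostcopyState.LISTEN,
    (PostcopyState.LISTEN, "POSTCOPY_RUN"): PostcopyState.RUNNING,
    (PostcopyState.RUNNING, "END_OF_MIGRATION"): PostcopyState.END,
}


def advance(state, event):
    """Return the next state, or raise if the event is not valid here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event} is not valid in state {state.name}") from None


state = PostcopyState.NONE
for event in ["POSTCOPY_ADVISE", "DISCARD", "DISCARD", "POSTCOPY_LISTEN",
              "POSTCOPY_RUN", "END_OF_MIGRATION"]:
    state = advance(state, event)
    print(event, "->", state.name)
```

Rejecting out-of-order events, as ``advance`` does here, makes it obvious when
a command arrives in a state that the sequence above does not allow.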

Device transfer
---------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source.  As such,
the migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream completely
before the device load begins, to free the stream up.  This is achieved by
'packaging' the device data into a blob that's read in one go.

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'
   - The device state

     A series of sections, identical to the precopy stream's device state stream,
     containing everything except postcopiable devices (i.e. RAM)
   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.
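
To make the packaging idea concrete, here is a toy sketch under stated
assumptions.  The byte layout (tag, name, length-prefixed payload), the tag
values and the device names are all invented for this example and do not
reflect the real migration stream encoding.  What it illustrates is that the
inner items use the same format as the outer stream, and that the destination
reads the whole package before parsing it.

```python
import io
import struct

# Hypothetical one-byte tags; the real stream uses its own encoding.
TAG_COMMAND = 1
TAG_DEVICE = 2


def encode_item(tag, name, payload=b""):
    """Encode one item as: tag, name length, name, payload length, payload."""
    raw_name = name.encode()
    return (struct.pack(">BH", tag, len(raw_name)) + raw_name
            + struct.pack(">I", len(payload)) + payload)


def decode_items(stream):
    """Yield (tag, name, payload) tuples until the stream is exhausted."""
    while True:
        header = stream.read(3)
        if not header:
            return
        tag, name_len = struct.unpack(">BH", header)
        name = stream.read(name_len).decode()
        (payload_len,) = struct.unpack(">I", stream.read(4))
        yield tag, name, stream.read(payload_len)


def build_package(devices):
    """Source side: listen command, device state, run command, in one blob."""
    inner = encode_item(TAG_COMMAND, "postcopy listen")
    for name, state in devices:
        inner += encode_item(TAG_DEVICE, name, state)
    inner += encode_item(TAG_COMMAND, "postcopy run")
    return encode_item(TAG_COMMAND, "CMD_PACKAGED", inner)


def load_package(stream):
    """Destination side: read the whole package first, then parse it."""
    tag, name, blob = next(decode_items(stream))
    assert (tag, name) == (TAG_COMMAND, "CMD_PACKAGED")
    # The blob is now in memory, so the stream is free to carry page data
    # while the items below are processed.
    return [(t, n) for t, n, _ in decode_items(io.BytesIO(blob))]


wire = build_package([("virtio-net", b"\x01\x02"), ("serial", b"\x03")])
loaded = load_package(io.BytesIO(wire))
print(loaded)
```

Because the package is length-prefixed, the reader can pull it off the stream
in one go without understanding its contents.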

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy).
However, when a page request is received from the destination, the dirty page
scanning restarts from the requested location.  This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly, in the hope that those pages are likely to be used
by the destination soon.
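
The scan-restart behaviour can be illustrated with a toy model.  This is a
sketch only: the class and method names are hypothetical, pages are plain
integers, and a Python set stands in for the dirty bitmap.

```python
class DirtyPageScanner:
    """Toy model of the source's dirty-page scan during postcopy."""

    def __init__(self, dirty_pages, total_pages):
        self.dirty = set(dirty_pages)   # pages that still need sending
        self.total = total_pages
        self.cursor = 0                 # where the background scan resumes

    def request(self, page):
        """A page request from the destination moves the scan position."""
        self.cursor = page

    def next_page(self):
        """Return the next dirty page at or after the cursor, wrapping round."""
        for step in range(self.total):
            page = (self.cursor + step) % self.total
            if page in self.dirty:
                self.dirty.discard(page)        # cleared once sent
                self.cursor = (page + 1) % self.total
                return page
        return None                             # nothing left to send


scanner = DirtyPageScanner(dirty_pages=[1, 2, 7, 8, 9], total_pages=10)
sent = [scanner.next_page(), scanner.next_page()]   # background scan: 1, 2
scanner.request(8)                                  # destination faults on 8
sent += [scanner.next_page() for _ in range(3)]     # 8, 9, then wraps to 7
print(sent)
```

After the request, page 8 and its neighbour 9 jump ahead of page 7, which the
background scan would otherwise have reached first.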

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2   3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

   All the data associated with the package - the ( ... ) section in the diagram -
   is read into memory, and the main thread recurses into qemu_loadvm_state_main
   to process the contents of the package (2) which contains commands (3,6) and
   devices (4...)

- On receipt of 'postcopy listen' (3), i.e. the first command in the package,

   a new thread (a) is started that takes over servicing the migration stream,
   while the main thread carries on loading the package.  It loads normal
   background page data (b) but if during a device load a fault happens (5)
   the returned page (c) is loaded by the listen thread, allowing the main
   thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

   letting the destination CPUs start running.  At the end of the
   ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
   is no longer used by migration, while the listen thread carries on servicing
   page data until the end of migration.
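
The interplay between the two threads can be sketched with ordinary Python
threading primitives.  This is a loose analogy rather than QEMU code: a queue
stands in for the migration stream, the source's reply to a page request is
simulated by putting the page straight onto that queue, and the device names
and page numbers are invented.

```python
import queue
import threading

page_stream = queue.Queue()      # stands in for the migration stream
received = {}                    # page number -> data, filled by listen thread
arrived = threading.Condition()
log = []


def listen_thread():
    """Takes over the stream: places every page that arrives."""
    while True:
        item = page_stream.get()
        if item is None:                    # end of migration
            return
        page, data = item
        with arrived:
            received[page] = data
            arrived.notify_all()            # wake anyone waiting on a page


def access_page(page):
    """Device load touching guest RAM; blocks until the page is present."""
    with arrived:
        if page not in received:
            log.append(f"fault on page {page}")
            # Simulated round trip: the source answers the page request.
            page_stream.put((page, b"requested"))
            arrived.wait_for(lambda: page in received)
        return received[page]


listener = threading.Thread(target=listen_thread)
listener.start()                                 # (a) started by LISTEN
page_stream.put((0, b"background"))              # (b) background page
for device, pages in [("dev0", []), ("dev1", [5])]:
    for page in pages:                           # (5) dev1 faults on page 5
        access_page(page)                        # (c) placed by the listener
    log.append(f"{device} loaded")
page_stream.put(None)
listener.join()
print(log)
```

The main thread never reads from ``page_stream`` itself, which mirrors the
reason the package has to be read into memory before device loading starts.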

Source side page bitmap
-----------------------

The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty' - i.e. needs
sending.  During the precopy phase this is updated as the CPU dirties
pages; however, during postcopy the source CPUs are stopped and nothing should
dirty anything any more.  Instead, dirty bits are cleared when the relevant
pages are sent during postcopy.

Postcopy features
=================

Postcopy recovery
-----------------

Compared to precopy, postcopy is special in its error handling.  When any
error happens (in this case, mostly network errors), QEMU cannot easily
fail a migration because VM data resides in both source and destination
QEMU instances.  Instead, when an issue happens QEMU on both sides
will go into a paused state.  A recovery phase is needed to continue a
paused postcopy migration.

The recovery phase normally contains a few steps:

  - When a network issue occurs, both QEMUs will go into the **POSTCOPY_PAUSED**
    migration state.

  - When the network is recovered (or a new network is provided), the admin
    can set up the new channel for migration using the QMP command
    'migrate-recover' on the destination node, preparing for a resume.

  - On the source host, the admin can continue the interrupted postcopy
    migration using the QMP command 'migrate' with the resume=true flag set.
    The source QEMU will go into the **POSTCOPY_RECOVER_SETUP** state, trying to
    re-establish the channels.

  - When both sides of QEMU successfully reconnect using a new or fixed-up
    channel, they will go into the **POSTCOPY_RECOVER** state.  A handshake
    procedure is then needed to properly synchronize the VM states between
    the two QEMUs to continue the postcopy migration.  For example, there
    can be pages sent right during the window when the network is
    interrupted; the handshake guarantees that pages lost in flight
    are resent.

  - After a proper handshake synchronization, QEMU will continue the
    postcopy migration on both sides and go back to the **POSTCOPY_ACTIVE**
    state.  Postcopy migration will continue.
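
The steps above can be sketched as two QMP messages plus the source-side state
path.  This is an illustration under assumptions: the URIs are placeholders,
the event strings in the transition table are invented labels for the steps in
the list, and the code only builds JSON, it does not talk to QEMU.

```python
import json

# Destination: provide the new channel.  The URI is a placeholder.
recover_on_destination = json.dumps(
    {"execute": "migrate-recover", "arguments": {"uri": "tcp:0.0.0.0:4445"}}
)

# Source: resume the paused migration over that channel.
resume_on_source = json.dumps(
    {"execute": "migrate",
     "arguments": {"uri": "tcp:destination.example:4445", "resume": True}}
)

# Source-side state path for a successful recovery, as described in the list.
TRANSITIONS = {
    ("POSTCOPY_ACTIVE", "network failure"): "POSTCOPY_PAUSED",
    ("POSTCOPY_PAUSED", "migrate resume=true"): "POSTCOPY_RECOVER_SETUP",
    ("POSTCOPY_RECOVER_SETUP", "channels reconnected"): "POSTCOPY_RECOVER",
    ("POSTCOPY_RECOVER", "handshake done"): "POSTCOPY_ACTIVE",
}

state = "POSTCOPY_ACTIVE"
path = [state]
for event in ["network failure", "migrate resume=true",
              "channels reconnected", "handshake done"]:
    state = TRANSITIONS[(state, event)]
    path.append(state)

print(recover_on_destination)
print(resume_on_source)
print(" -> ".join(path))
```

Note the ordering: the destination has to be given its new channel before the
source is told to resume.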

During a paused postcopy migration, the VM can logically still continue
running, and it is not affected by accesses to pages that
were already migrated to the destination VM before the interruption happened.
However, if any of the missing pages is accessed on the destination VM, the
accessing VM thread is halted waiting for the page to be migrated, which means
it can be halted until the recovery is complete.

The impact of accessing missing pages can depend on the configuration
of the guest.  For example, with async page fault
enabled, logically the guest can proactively schedule out the threads
accessing missing pages.

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, causing (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.
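
The figure in (d) follows from simple arithmetic.  The sketch below ignores
protocol overhead and link contention, and treats 2MB and 1GB as binary sizes
(2 MiB and 1 GiB); both are simplifying assumptions.

```python
def transfer_seconds(page_bytes, link_bits_per_second):
    """Time to push one page over the link, ignoring protocol overhead."""
    return page_bytes * 8 / link_bits_per_second


LINK = 10e9   # 10 Gbps
for label, size in [("2MB", 2 * 1024**2), ("1GB", 1024**3)]:
    print(f"{label} hugepage: {transfer_seconds(size, LINK) * 1000:.2f} ms")
```

This gives roughly 1.7 ms for a 2MB page and roughly 0.86 s for a 1GB page,
which is the source of the "~1 second" stall mentioned above.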

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU.  There are restrictions on the type
of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault.  The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.
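
The role of the mapping table can be shown with a small lookup sketch.  The
``Region`` layout, the ``translate`` helper, the addresses and the RAMBlock
name are all hypothetical; the real table format is defined by the vhost-user
protocol and QEMU, not by this example.

```python
from bisect import bisect_right
from typing import NamedTuple


class Region(NamedTuple):
    client_start: int    # start address in the client's address space
    size: int
    ramblock: str        # name of the RAMBlock backing this region
    offset: int          # offset of the region within that RAMBlock


def translate(table, fault_addr):
    """Map a client fault address to (ramblock, offset), or None if unmapped."""
    starts = [r.client_start for r in table]     # table is sorted by start
    index = bisect_right(starts, fault_addr) - 1
    if index < 0:
        return None
    region = table[index]
    if fault_addr >= region.client_start + region.size:
        return None
    return region.ramblock, region.offset + (fault_addr - region.client_start)


table = [
    Region(0x7F0000000000, 0x40000000, "pc.ram", 0x0),
    Region(0x7F0080000000, 0x10000000, "pc.ram", 0x40000000),
]
print(translate(table, 0x7F0080001000))
```

With a translation like this, a fault reported on the client's userfaultfd can
be turned into the same kind of page request that QEMU issues for its own
faults.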

.. note::
  There are two future improvements that would be nice:
    a) Some way to make QEMU ignorant of the addresses in the client's
       address space
    b) Avoiding the need for QEMU to perform ufd-wake calls after the
       pages have arrived

Retro-fitting postcopy to existing clients is possible:
  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy.  In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.

Postcopy preemption mode
------------------------

Postcopy preempt is a capability introduced in the QEMU 8.0 release.  It
allows urgent pages (those explicitly requested by the destination QEMU
after a page fault) to be sent in a separate preempt channel, rather than
queued in the background migration channel.  Anyone who cares about latencies
of page faults during a postcopy migration should enable this feature.  By
default, it's not enabled.

Postcopy blocktime statistics
-----------------------------

Blocktime is a postcopy live migration metric, intended to show how
long a vCPU was in a state of interruptible sleep due to a page fault.
The metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU.  These values are calculated on the destination
side.  To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command.
The ``postcopy-blocktime`` value in the reply shows the overlapped blocking
time for all vCPUs; ``postcopy-vcpu-blocktime`` shows a list of blocking
times, one per vCPU.
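
As a minimal sketch of how the two fields might be consumed, the snippet below
parses a hand-written ``query-migrate`` reply.  The reply is hypothetical: the
numbers are invented for illustration and the other fields of a real reply are
omitted.

```python
import json

# Hypothetical reply; the numbers are invented for illustration only.
reply = json.loads("""
{
  "return": {
    "status": "completed",
    "postcopy-blocktime": 20,
    "postcopy-vcpu-blocktime": [80, 45, 30, 130]
  }
}
""")

info = reply["return"]
overlapped = info.get("postcopy-blocktime")
per_vcpu = info.get("postcopy-vcpu-blocktime", [])

print("overlapped blocktime:", overlapped)
for index, blocked in enumerate(per_vcpu):
    print(f"vCPU {index}: {blocked}")

# Index of the vCPU that spent the longest time blocked.
worst = max(range(len(per_vcpu)), key=per_vcpu.__getitem__)
print("most affected vCPU:", worst)
```

Using ``get`` keeps the code working when the capability was not enabled and
the fields are absent from the reply.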