Postcopy
========

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge).  Its plus side is that there is an upper bound
on the amount of migration traffic and the time it takes; the down side is
that during the postcopy phase, a failure of *either* side causes the guest to
be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
-----------------

To enable postcopy, issue this command on the monitor (on both the source and
the destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``
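
The equivalent over QMP, likewise on both sides, uses
``migrate-set-capabilities``::

  -> { "execute": "migrate-set-capabilities",
       "arguments": { "capabilities": [
           { "capability": "postcopy-ram", "state": true } ] } }
  <- { "return": {} }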

The normal commands are then used to start the migration, which begins in
precopy mode.  Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started, or any
time later on.  Issuing it after the end of a migration is harmless.

Blocktime is a postcopy live migration metric, intended to show how
long a vCPU was in a state of interruptible sleep due to a page fault.
This metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU.  These values are calculated on the destination
side.  To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command:
its postcopy-blocktime field shows the overlapped blocking time for all
vCPUs, and postcopy-vcpu-blocktime shows the list of blocking times per
vCPU.
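
For example (the field values are purely illustrative, and other reply
fields are elided)::

  -> { "execute": "query-migrate" }
  <- { "return": {
         "status": "completed",
         "postcopy-blocktime": 3,
         "postcopy-vcpu-blocktime": [ 3, 1, 2, 1 ],
         ...
     } }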

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).

Postcopy device transfer
------------------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source; as such
the migration stream has to be able to respond with page data *during* the
device load.  Hence the device data has to be read from the stream completely
before the device load begins, to free the stream up.  This is achieved by
'packaging' the device data into a blob that's read in one go.

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'
   - The device state

     A series of sections, identical to the precopy stream's device state
     stream, containing everything except postcopiable devices (i.e. RAM)
   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy);
however, when a page request is received from the destination, the dirty page
scanning restarts from the requested location.  This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly, in the hope that those pages are likely to be used
by the destination soon.
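
The scan-restart behaviour can be modelled with the toy program below; the
bitmap, page count and helper names here are invented for illustration and
are not QEMU's actual implementation::

  /* Toy model: all of RAM starts dirty; pages are sent in scan order, but a
   * page request from the destination restarts the scan at that page. */
  #include <stdbool.h>
  #include <stdio.h>

  #define NPAGES 16
  static bool dirty[NPAGES];

  /* Find the next dirty page at or after 'from', wrapping at the end of
   * RAM; returns -1 once everything has been sent. */
  static int find_next_dirty(int from)
  {
      for (int i = 0; i < NPAGES; i++) {
          int p = (from + i) % NPAGES;
          if (dirty[p]) {
              return p;
          }
      }
      return -1;
  }

  static void send_page(int p)
  {
      printf("sent page %d\n", p);
      dirty[p] = false;                /* clearing the bit marks it sent */
  }

  int main(void)
  {
      for (int p = 0; p < NPAGES; p++) {
          dirty[p] = true;
      }

      int cur = 0;
      bool faulted = false;

      while ((cur = find_next_dirty(cur)) != -1) {
          if (!faulted && cur == 4) {
              /* Pretend the destination just faulted on page 12: jump the
               * scan there, so page 12 and its neighbours go out next. */
              cur = 12;
              faulted = true;
              continue;
          }
          send_page(cur);
          cur++;
      }
      return 0;
  }

Running it sends pages 0-3, then 12-15 (the requested page and its
neighbours), and only then goes back to finish 4-11.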

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2   3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

   All the data associated with the package - the ( ... ) section in the diagram -
   is read into memory, and the main thread recurses into qemu_loadvm_state_main
   to process the contents of the package (2), which contains commands (3,6) and
   devices (4...)

- On receipt of 'postcopy listen' - 3 - (i.e. the 1st command in the package)

   a new thread (a) is started that takes over servicing the migration stream,
   while the main thread carries on loading the package.   It loads normal
   background page data (b), but if during a device load a fault happens (5)
   the returned page (c) is loaded by the listen thread, allowing the main
   thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

   letting the destination CPUs start running.  At the end of the
   ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
   is no longer used by migration, while the listen thread carries on servicing
   page data until the end of migration.

Postcopy Recovery
-----------------

Compared to precopy, postcopy is special in its error handling.  When an
error happens (in this case, mostly network errors), QEMU cannot easily
fail the migration, because VM data resides in both the source and
destination QEMU instances.  Instead, when an issue happens, QEMU on both
sides will go into a paused state.  A recovery phase is then needed to
continue the paused postcopy migration.

The recovery phase normally contains a few steps (an example QMP exchange
follows the list):

  - When a network issue occurs, both QEMU instances will go into the
    PAUSED state

  - When the network is recovered (or a new network is provided), the admin
    can set up the new channel for migration using the QMP command
    'migrate-recover' on the destination node, preparing for a resume.

  - On the source host, the admin can continue the interrupted postcopy
    migration using the QMP command 'migrate' with the resume=true flag set.

  - After the connection is re-established, QEMU will continue the postcopy
    migration on both sides.
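
For example, with an illustrative URI, the exchange could look like this.
On the destination::

  -> { "execute": "migrate-recover",
       "arguments": { "uri": "tcp:192.168.1.2:4444" } }
  <- { "return": {} }

Then on the source::

  -> { "execute": "migrate",
       "arguments": { "uri": "tcp:192.168.1.2:4444", "resume": true } }
  <- { "return": {} }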

During a paused postcopy migration, the VM can logically still continue
running, and it will not be impacted by accesses to pages that were
already migrated to the destination VM before the interruption happened.
However, if any of the missing pages is accessed on the destination VM,
the vCPU thread will be halted waiting for the page to be migrated, which
means it can stay halted until the recovery is complete.

The impact of accessing missing pages depends on the configuration of the
guest.  For example, with async page fault enabled, the guest can
proactively schedule out the threads accessing missing pages.

Postcopy states
---------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.
    (Triggered by reception of the POSTCOPY_ADVISE command)

 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob.  With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system); a sketch of this registration appears after this list.

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running.  The main
    thread now finishes processing the migration package and
    carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - Paused

    Postcopy can run into a paused state (normally on both sides when it
    happens), where all threads will be temporarily halted, mostly due to
    network errors.  When reaching the paused state, migration will make
    sure the QEMU binaries on both sides maintain the data without
    corrupting the VM.  To continue the migration, the admin needs to fix
    the migration channel using the QMP command 'migrate-recover' on the
    destination node, then resume the migration using the QMP command
    'migrate' again on the source node, with the resume=true flag set.

 - End

    The listen thread can now quit and perform the cleanup of migration
    state; the migration is now complete.
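
The 'sensitise' step in the Listen state relies on the Linux userfaultfd
API.  A minimal sketch of registering one region (error paths shortened;
this is not QEMU's actual code) looks like::

  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Arm a region of guest RAM so that any access to a not-yet-populated
   * page generates a fault event instead of being zero-filled. */
  int sensitise_region(void *host_addr, size_t length)
  {
      int ufd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
      if (ufd < 0) {
          return -1;
      }

      /* Handshake: declare which API version/features we expect. */
      struct uffdio_api api = { .api = UFFD_API, .features = 0 };
      if (ioctl(ufd, UFFDIO_API, &api) < 0) {
          close(ufd);
          return -1;
      }

      /* Register the region; MODE_MISSING asks for events on accesses to
       * missing pages. */
      struct uffdio_register reg = {
          .range = { .start = (unsigned long)host_addr, .len = length },
          .mode  = UFFDIO_REGISTER_MODE_MISSING,
      };
      if (ioctl(ufd, UFFDIO_REGISTER, &reg) < 0) {
          close(ufd);
          return -1;
      }

      /* A fault thread would now read struct uffd_msg events from ufd and
       * resolve each one with UFFDIO_COPY as the page arrives. */
      return ufd;
  }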

Source side page map
--------------------

The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty' - i.e. needs
sending.  During the precopy phase this is updated as the CPU dirties
pages, however during postcopy the CPUs are stopped and nothing should
dirty anything any more.  Instead, dirty bits are cleared when the relevant
pages are sent during postcopy.

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, triggering (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.
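
The ~1 second figure in (d) is straightforward link arithmetic::

  1GB hugepage = 8 x 2^30 bits ~= 8.6 Gbit
  8.6 Gbit / 10 Gbit/s ~= 0.86 s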

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU. There are restrictions on the
types of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs, which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault.  The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
fault-thread, and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.
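
The wake operation is a plain userfaultfd ioctl.  As a sketch (the function
name and parameters are invented for illustration; the address must come
from the mapping table described above)::

  #include <linux/userfaultfd.h>
  #include <stddef.h>
  #include <sys/ioctl.h>

  /* After the page covering 'client_addr' has been placed into the shared
   * memory, let the client's faulting threads continue. */
  int wake_client(int client_ufd, unsigned long client_addr, size_t page_size)
  {
      struct uffdio_range range = {
          .start = client_addr & ~(page_size - 1),   /* page-align */
          .len   = page_size,
      };
      return ioctl(client_ufd, UFFDIO_WAKE, &range);
  }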

.. note::
  There are two future improvements that would be nice:
    a) Some way to make QEMU ignorant of the addresses in the client's
       address space
    b) Avoiding the need for QEMU to perform ufd-wake calls after the
       pages have arrived

Retro-fitting postcopy to existing clients is possible:
  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy.  In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.

Postcopy Preemption Mode
------------------------

Postcopy preempt is a capability introduced in the QEMU 8.0 release.  It
allows urgent pages (those whose page faults were explicitly requested by
the destination QEMU) to be sent over a separate preempt channel, rather
than queued in the background migration channel.  Anyone who cares about
the latency of page faults during a postcopy migration should enable this
feature.  By default, it is not enabled.
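
Like the other postcopy capabilities, it is set on both sides before
migration starts, e.g. over QMP::

  -> { "execute": "migrate-set-capabilities",
       "arguments": { "capabilities": [
           { "capability": "postcopy-preempt", "state": true } ] } }
  <- { "return": {} }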
305