=========
Migration
=========

QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations.  Saving the state does just
that: it saves the state of each device that the guest is running.
Restoring a guest is the opposite operation: we need to load the
state of each device.

For this to work, QEMU has to be launched with the same arguments
both times.  I.e. it can only restore the state into a guest that has
the same devices as the one it was saved from (this last requirement can
be relaxed a bit, but for now we can consider that the configuration has
to be exactly the same).

Once we are able to save/restore a guest, a new piece of functionality is
requested: migration.  This means that QEMU is able to start on one
machine and be "migrated" to another machine, i.e. moved to
another machine.

Next was the "live migration" functionality.  This is important
because some guests run with a lot of state (especially RAM), and it
can take a while to move all the state from one machine to another.  Live
migration allows the guest to continue running while the state is
transferred.  The guest has to be stopped only while the last part of
the state is transferred.  Typically the time that the guest is
unresponsive during live migration is in the low hundreds of milliseconds
(notice that this depends on a lot of things).

.. contents::

Transports
==========

The migration stream is normally just a byte stream that can be passed
over any transport.

- tcp migration: do the migration using tcp sockets
- unix migration: do the migration using unix sockets
- exec migration: do the migration using the stdin/stdout through a process.
- fd migration: do the migration using a file descriptor that is
  passed to QEMU.  QEMU doesn't care how this file descriptor is opened.

In addition, support is included for migration using RDMA, which
transports the page data using ``RDMA``, where the hardware takes care of
transporting the pages, and the load on the CPU is much lower.  While the
internals of RDMA migration are a bit different, this isn't really visible
outside the RAM migration code.

All these migration protocols use the same infrastructure to
save/restore device state.  This infrastructure is shared with the
savevm/loadvm functionality.

Debugging
=========

The migration stream can be analyzed thanks to ``scripts/analyze-migration.py``.

Example usage:

.. code-block:: shell

  $ qemu-system-x86_64 -display none -monitor stdio
  (qemu) migrate "exec:cat > mig"
  (qemu) q
  $ ./scripts/analyze-migration.py -f mig
  {
    "ram (3)": {
        "section sizes": {
            "pc.ram": "0x0000000008000000",
  ...

See also ``analyze-migration.py -h`` help for more options.

Common infrastructure
=====================

The files, sockets or fd's that carry the migration stream are abstracted by
the ``QEMUFile`` type (see ``migration/qemu-file.h``).  In most cases this
is connected to a subtype of ``QIOChannel`` (see ``io/``).

Saving the state of one device
==============================

For most devices, the state is saved in a single call to the migration
infrastructure; these are *non-iterative* devices.  The data for these
devices is sent at the end of precopy migration, when the CPUs are paused.
There are also *iterative* devices, which contain a very large amount of
data (e.g. RAM or large tables).  See the iterative device section below.

General advice for device developers
------------------------------------

- The migration state saved should reflect the device being modelled rather
  than the way your implementation works.  That way if you change the implementation
  later the migration stream will stay compatible.  That model may include
  internal state that's not directly visible in a register.

- When saving a migration stream the device code may walk and check
  the state of the device.  These checks might fail in various ways (e.g.
  discovering internal state is corrupt or that the guest has done something bad).
  Consider carefully before asserting/aborting at this point, since the
  normal response from users is that *migration broke their VM* since it had
  apparently been running fine until then.  In these error cases, the device
  should log a message indicating the cause of error, and should consider
  putting the device into an error state, allowing the rest of the VM to
  continue execution.

- The migration might happen at an inconvenient point,
  e.g. right in the middle of the guest reprogramming the device, during
  guest reboot or shutdown or while the device is waiting for external IO.
  It's strongly preferred that migrations do not fail in this situation,
  since in the cloud environment migrations might happen automatically to
  VMs that the administrator doesn't directly control.

- If you do need to fail a migration, ensure that sufficient information
  is logged to identify what went wrong.

- The destination should treat an incoming migration stream as hostile
  (which we do to varying degrees in the existing code).  Check that offsets
  into buffers and the like can't cause overruns.  Fail the incoming migration
  in the case of a corrupted stream like this.

- Take care with internal device state or behaviour that might become
  migration version dependent.  For example, the order of PCI capabilities
  is required to stay constant across migration.  Another example would
  be that a special case handled by subsections (see below) might become
  much more common if a default behaviour is changed.

- The state of the source should not be changed or destroyed by the
  outgoing migration.  Migrations timing out or being failed by
  higher levels of management, or failures of the destination host are
  not unusual, and in that case the VM is restarted on the source.
  Note that the management layer can validly revert the migration
  even though the QEMU level of migration has succeeded as long as it
  does it before starting execution on the destination.

- Buses and devices should be able to explicitly specify addresses when
  instantiated, and management tools should use those.  For example,
  when hot adding USB devices it's important to specify the ports
  and addresses, since implicit ordering based on the command line order
  may be different on the destination.  This can result in the
  device state being loaded into the wrong device.

VMState
-------

Most device data can be described using the ``VMSTATE`` macros (mostly defined
in ``include/migration/vmstate.h``).

An example (from hw/input/pckbd.c):

.. code:: c

  static const VMStateDescription vmstate_kbd = {
      .name = "pckbd",
      .version_id = 3,
      .minimum_version_id = 3,
      .fields = (const VMStateField[]) {
          VMSTATE_UINT8(write_cmd, KBDState),
          VMSTATE_UINT8(status, KBDState),
          VMSTATE_UINT8(mode, KBDState),
          VMSTATE_UINT8(pending, KBDState),
          VMSTATE_END_OF_LIST()
      }
  };

We are declaring the state with name "pckbd".  The ``version_id`` is
3, and there are 4 uint8_t fields in the KBDState structure.  We
register this ``VMStateDescription`` with one of the following
functions.  The first one will generate a device ``instance_id``
different for each registration.  Use the second one if you already
have an id that is different for each instance of the device:

.. code:: c

    vmstate_register_any(NULL, &vmstate_kbd, s);
    vmstate_register(NULL, instance_id, &vmstate_kbd, s);

For devices that are ``qdev`` based, we can register the device in the class
init function:

.. code:: c

    dc->vmsd = &vmstate_kbd_isa;

The VMState macros take care of ensuring that the device data section
is formatted portably (normally big endian) and make some compile time checks
against the types of the fields in the structures.

VMState macros can include other VMStateDescriptions to store substructures
(see ``VMSTATE_STRUCT_``), arrays (``VMSTATE_ARRAY_``) and variable length
arrays (``VMSTATE_VARRAY_``).  Various other macros exist for special
cases.

Note that the format on the wire is still very raw; i.e. a VMSTATE_UINT32
ends up with a 4 byte big-endian representation on the wire; in the future
it might be possible to use a more structured format.
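
To make the raw wire format concrete, here is a minimal standalone sketch of
a 4 byte big-endian encoding.  The helper names ``put_be32`` and ``get_be32``
are hypothetical and are not QEMU functions; the sketch only illustrates why
the bytes on the wire do not depend on the endianness of the host.

.. code:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical helper: store v as 4 bytes, most significant byte first. */
  static size_t put_be32(uint8_t *buf, uint32_t v)
  {
      assert(buf != NULL);
      buf[0] = (uint8_t)(v >> 24);
      buf[1] = (uint8_t)(v >> 16);
      buf[2] = (uint8_t)(v >> 8);
      buf[3] = (uint8_t)v;
      return 4;   /* number of bytes written */
  }

  /* Hypothetical helper: rebuild the value from its 4 byte wire form. */
  static uint32_t get_be32(const uint8_t *buf)
  {
      assert(buf != NULL);
      return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
             ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
  }

Because both helpers work with shifts rather than by copying memory, a value
written on a little-endian host reads back unchanged on a big-endian one.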

Legacy way
----------

This way is going to disappear as soon as all current users are ported to
VMState, although converting existing code can be tricky, so 'soon' is relative.

Each device has to register two functions, one to save the state and
another to load the state back.

.. code:: c

  int register_savevm_live(const char *idstr,
                           int instance_id,
                           int version_id,
                           SaveVMHandlers *ops,
                           void *opaque);

Two functions in the ``ops`` structure are the ``save_state``
and ``load_state`` functions.  Notice that ``load_state`` receives a version_id
parameter to know which state format it is receiving.  ``save_state`` doesn't
have a version_id parameter because it always uses the latest version.

Note that because the VMState macros still save the data in a raw
format, in many cases it's possible to replace legacy code
with a carefully constructed VMState description that matches the
byte layout of the existing code.

Changing migration data structures
----------------------------------

When we migrate a device, we save/load the state as a series
of fields.  Sometimes, due to bugs or new functionality, we need to
change the state to store more/different information.  Changing the migration
state saved for a device can break migration compatibility unless
care is taken to use the appropriate techniques.  In general QEMU tries
to maintain forward migration compatibility (i.e. migrating from
QEMU n->n+1) and there are users who benefit from backward compatibility
as well.

Subsections
-----------

The most common structure change is adding new data, e.g. when adding
a newer form of device, or adding state that you previously
forgot to migrate.  This is best solved using a subsection.

A subsection is "like" a device vmstate, but with one particularity: it
has a Boolean function that tells whether its values need to be sent
or not.  If this function returns false, the subsection is not sent.
Subsections have a unique name, which is looked for on the receiving
side.

On the receiving side, if we find a subsection for a device that we
don't understand, we just fail the migration.  If we understand all
the subsections, then we load the state with success.  There's no check
that a subsection is loaded, so a newer QEMU that knows about a subsection
can (with care) load a stream from an older QEMU that didn't send
the subsection.

If the new data is only needed in a rare case, then the subsection
can be made conditional on that case and the migration will still
succeed to older QEMUs in most cases.  This is OK for data that's
critical, but in some use cases it's preferred that the migration
should succeed even with the data missing.  To support this the
subsection can be connected to a device property and from there
to a versioned machine type.

The 'pre_load' and 'post_load' functions on subsections are only
called if the subsection is loaded.

One important note is that the outer post_load() function is called "after"
loading all subsections, because a newer subsection could change the same
value that it uses.  A flag, and the combination of outer pre_load and
post_load can be used to detect whether a subsection was loaded, and to
fall back on default behaviour when the subsection isn't present.
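
The flag technique can be sketched in standalone C as follows.  The
``DemoState`` structure, the ``DEMO_DEFAULT`` value and all function names
are hypothetical; only the callback signatures follow the ``pre_load`` and
``post_load`` prototypes used by VMState.  This is an illustration of the
call ordering, not code taken from a QEMU device.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  #define DEMO_DEFAULT 7u   /* hypothetical fallback value */

  /* Hypothetical device state; none of these names come from QEMU. */
  typedef struct DemoState {
      uint32_t extra;       /* value carried by the optional subsection */
      bool extra_loaded;    /* set only if the subsection was in the stream */
  } DemoState;

  /* Outer pre_load: runs before any field is loaded, so clear the flag. */
  static int demo_pre_load(void *opaque)
  {
      DemoState *s = opaque;
      assert(s != NULL);
      s->extra_loaded = false;
      return 0;
  }

  /* Subsection post_load: only runs when the subsection was present. */
  static int demo_sub_post_load(void *opaque, int version_id)
  {
      DemoState *s = opaque;
      (void)version_id;
      s->extra_loaded = true;
      return 0;
  }

  /* Outer post_load: runs after all subsections, so the flag is final. */
  static int demo_post_load(void *opaque, int version_id)
  {
      DemoState *s = opaque;
      (void)version_id;
      if (!s->extra_loaded) {
          s->extra = DEMO_DEFAULT;   /* fall back on default behaviour */
      }
      return 0;
  }

The sketch relies on the ordering described above: the outer ``pre_load``
runs first, the subsection ``post_load`` runs only if the subsection was
sent, and the outer ``post_load`` runs last.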

Example:

.. code:: c

  static bool ide_drive_pio_state_needed(void *opaque)
  {
      IDEState *s = opaque;

      return ((s->status & DRQ_STAT) != 0)
          || (s->bus->error_status & BM_STATUS_PIO_RETRY);
  }

  const VMStateDescription vmstate_ide_drive_pio_state = {
      .name = "ide_drive/pio_state",
      .version_id = 1,
      .minimum_version_id = 1,
      .pre_save = ide_drive_pio_pre_save,
      .post_load = ide_drive_pio_post_load,
      .needed = ide_drive_pio_state_needed,
      .fields = (const VMStateField[]) {
          VMSTATE_INT32(req_nb_sectors, IDEState),
          VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
                               vmstate_info_uint8, uint8_t),
          VMSTATE_INT32(cur_io_buffer_offset, IDEState),
          VMSTATE_INT32(cur_io_buffer_len, IDEState),
          VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
          VMSTATE_INT32(elementary_transfer_size, IDEState),
          VMSTATE_INT32(packet_transfer_size, IDEState),
          VMSTATE_END_OF_LIST()
      }
  };

  const VMStateDescription vmstate_ide_drive = {
      .name = "ide_drive",
      .version_id = 3,
      .minimum_version_id = 0,
      .post_load = ide_drive_post_load,
      .fields = (const VMStateField[]) {
          /* ... several fields ... */
          VMSTATE_END_OF_LIST()
      },
      .subsections = (const VMStateDescription * const []) {
          &vmstate_ide_drive_pio_state,
          NULL
      }
  };

Here we have a subsection for the pio state.  We only need to
save/send this state when we are in the middle of a pio operation
(that is what ``ide_drive_pio_state_needed()`` checks).  If DRQ_STAT is
not enabled, the values in those fields are garbage and don't need to
be sent.

Connecting subsections to properties
------------------------------------

Using a condition function that checks a 'property' to determine whether
to send a subsection allows backward migration compatibility when
new subsections are added, especially when combined with versioned
machine types.

For example:

   a) Add a new property using ``DEFINE_PROP_BOOL`` - e.g. support-foo and
      default it to true.
   b) Add an entry to the ``hw_compat_`` for the previous version that sets
      the property to false.
   c) Add a static bool support_foo function that tests the property.
   d) Add a subsection with a .needed set to the support_foo function.
   e) (potentially) Add an outer pre_load that sets up a default value
      for 'foo' to be used if the subsection isn't loaded.

Now that subsection will not be generated when using an older
machine type and the migration stream will be accepted by older
QEMU versions.
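
As a rough standalone model of steps (c) and (d), the sketch below uses a
plain ``bool`` member in place of the real property and counts the sections
a save would produce.  ``FooState``, ``support_foo_needed`` and
``count_sections`` are hypothetical names chosen for illustration, and the
property and ``hw_compat_`` plumbing of steps (a) and (b) is deliberately
left out.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>

  /* Hypothetical device state; support_foo stands in for the property. */
  typedef struct FooState {
      bool support_foo;   /* true by default, false on older machine types */
  } FooState;

  /* Step (c): predicate with the shape of a subsection .needed callback. */
  static bool support_foo_needed(void *opaque)
  {
      FooState *s = opaque;
      assert(s != NULL);
      return s->support_foo;
  }

  /* Toy stand-in for the save path: count the sections that are emitted. */
  static int count_sections(FooState *s)
  {
      int sections = 1;               /* the main device section */
      if (support_foo_needed(s)) {
          sections++;                 /* the optional subsection */
      }
      return sections;
  }

With ``support_foo`` forced to false, as an older machine type would do, only
the main section is produced, which is what an older QEMU expects to receive.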

Not sending existing elements
-----------------------------

Sometimes members of the VMState are no longer needed:

  - removing them will break migration compatibility

  - making them version dependent and bumping the version will break backward migration
    compatibility.

Adding a dummy field into the migration stream is normally the best way to preserve
compatibility.

If the field really does need to be removed then:

  a) Add a new property/compatibility/function in the same way as for subsections above.
  b) Replace the VMSTATE macro with the _TEST version of the macro, e.g.:

   ``VMSTATE_UINT32(foo, barstruct)``

   becomes

   ``VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)``

   Sometime in the future when we no longer care about the ancient versions these can be killed off.
   Note that for backward compatibility it's important to fill in the structure with
   data that the destination will understand.

Any difference in the predicates on the source and destination will end up
with different fields being enabled and data being loaded into the wrong
fields; for this reason conditional fields like this are very fragile.

Versions
--------

Version numbers are intended for major incompatible changes to the
migration of a device, and using them breaks backward-migration
compatibility; in general most changes can be made by adding Subsections
(see above) or _TEST macros (see above) which won't break compatibility.

Each version is associated with a series of fields saved.  ``save_state`` always saves
the state as the newest version.  But ``load_state`` is sometimes able to
load state from an older version.

You can see that there are two version fields:

- ``version_id``: the maximum version_id supported by VMState for that device.
- ``minimum_version_id``: the minimum version_id that VMState is able to understand
  for that device.

VMState is able to read versions from minimum_version_id to version_id.

There are *_V* forms of many ``VMSTATE_`` macros for version dependent fields,
e.g.

.. code:: c

   VMSTATE_UINT16_V(ip_id, Slirp, 2),

only loads that field for versions 2 and newer.

Saving state will always create a section with the 'version_id' value
and thus can't be loaded by any older QEMU.
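
The two rules above (the accepted version range, and version dependent
fields) can be modelled in a few lines of standalone C.  The
``VersionRange`` type and both functions are hypothetical illustrations and
do not reproduce QEMU's loader.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>

  /* Hypothetical mirror of the two version fields described above. */
  typedef struct VersionRange {
      int version_id;           /* newest layout, always used when saving */
      int minimum_version_id;   /* oldest layout still accepted on load */
  } VersionRange;

  /* An incoming section is loadable only if its version is in range. */
  static bool section_loadable(const VersionRange *r, int stream_version)
  {
      assert(r != NULL);
      assert(r->minimum_version_id <= r->version_id);
      return stream_version >= r->minimum_version_id &&
             stream_version <= r->version_id;
  }

  /* A _V style field is read only if the stream is new enough to have it. */
  static bool field_present(int stream_version, int field_version)
  {
      return stream_version >= field_version;
  }

A stream whose version is above ``version_id`` is rejected, which is why
bumping the version breaks migration to any older QEMU.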

Massaging functions
-------------------

Sometimes it is not enough to be able to save the state directly
from one structure; we first need to fill in the correct values there.  One
example is when we are using KVM.  Before saving the cpu state, we
need to ask KVM to copy to QEMU the state that it is using.  And the
opposite when we are loading the state: we need a way to tell KVM to
load the state for the cpu that we have just loaded from the QEMUFile.

The functions to do that are inside a vmstate definition, and are called:

- ``int (*pre_load)(void *opaque);``

  This function is called before we load the state of one device.

- ``int (*post_load)(void *opaque, int version_id);``

  This function is called after we load the state of one device.

- ``int (*pre_save)(void *opaque);``

  This function is called before we save the state of one device.

- ``int (*post_save)(void *opaque);``

  This function is called after we save the state of one device
  (even upon failure, unless the call to pre_save returned an error).

Example: You can look at hpet.c, which uses the first three functions
to massage the state that is transferred.

The ``VMSTATE_WITH_TMP`` macro may be useful when the migration
data doesn't match the stored device data well; it allows an
intermediate temporary structure to be populated with migration
data and then transferred to the main structure.

If you use memory API functions that update memory layout outside
initialization (i.e., in response to a guest action), this is a strong
indication that you need to call these functions in a ``post_load`` callback.
Examples of such memory API functions are:

  - memory_region_add_subregion()
  - memory_region_del_subregion()
  - memory_region_set_readonly()
  - memory_region_set_nonvolatile()
  - memory_region_set_enabled()
  - memory_region_set_address()
  - memory_region_set_alias_offset()

Iterative device migration
--------------------------

Some devices, such as RAM, Block storage or certain platform devices,
have large amounts of data that would mean that the CPUs would be
paused for too long if they were sent in one section.  For these
devices an *iterative* approach is taken.

The iterative devices generally don't use VMState macros
(although it may be possible in some cases) and instead use
``qemu_put_*``/``qemu_get_*`` macros to read/write data to the stream.  Specialist
versions exist for high bandwidth IO.


An iterative device must provide:

  - A ``save_setup`` function that initialises the data structures and
    transmits a first section containing information on the device.  In the
    case of RAM this transmits a list of RAMBlocks and sizes.

  - A ``load_setup`` function that initialises the data structures on the
    destination.

  - A ``state_pending_exact`` function that indicates how much more
    data we must save.  The core migration code will use this to
    determine when to pause the CPUs and complete the migration.

  - A ``state_pending_estimate`` function that indicates how much more
    data we must save.  When the estimated amount is smaller than the
    threshold, we call ``state_pending_exact``.

  - A ``save_live_iterate`` function that sends a chunk of data until
    the point that stream bandwidth limits tell it to stop.  Each call
    generates one section.

  - A ``save_live_complete_precopy`` function that must transmit the
    last section for the device containing any remaining data.

  - A ``load_state`` function used to load sections generated by
    any of the save functions that generate sections.

  - ``cleanup`` functions for both save and load that are called
    at the end of migration.

Note that the contents of the sections for iterative migration tend
to be open-coded by the devices; care should be taken in parsing
the results and structuring the stream to make them easy to validate.
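
The interplay of the save-side callbacks listed above can be pictured with a
toy standalone loop.  ``ToyDevice`` and the ``toy_*`` functions are
hypothetical stand-ins, not QEMU code, and the sketch assumes that no new
data becomes pending while the loop runs, which is not true for a real guest
that keeps dirtying RAM.

.. code:: c

  #include <assert.h>
  #include <stddef.h>

  /* Hypothetical iterative device: only tracks how much is left to send. */
  typedef struct ToyDevice {
      size_t pending;     /* bytes not yet transmitted */
      size_t sections;    /* sections emitted so far */
  } ToyDevice;

  /* Stand-in for save_live_iterate: send up to 'budget' bytes, 1 section. */
  static void toy_iterate(ToyDevice *d, size_t budget)
  {
      size_t n = d->pending < budget ? d->pending : budget;
      d->pending -= n;
      d->sections++;
  }

  /* Stand-in for save_live_complete_precopy: flush whatever remains. */
  static void toy_complete(ToyDevice *d)
  {
      d->pending = 0;
      d->sections++;
  }

  /*
   * Toy precopy loop: iterate while more than 'threshold' bytes remain,
   * then (with the CPUs notionally paused) send the last section.
   * Returns the total number of sections emitted.
   */
  static size_t toy_migrate(ToyDevice *d, size_t budget, size_t threshold)
  {
      assert(d != NULL);
      assert(budget > 0);
      d->sections = 1;    /* the setup step emits the first section */
      while (d->pending > threshold) {
          toy_iterate(d, budget);
      }
      toy_complete(d);
      return d->sections;
  }

The comparison against ``threshold`` plays the role that the pending-state
callbacks play in the real code: it decides when the remaining data is small
enough to pause the CPUs and complete the migration.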

Device ordering
---------------

There are cases in which the ordering of device loading matters.  For
example, on some systems a device may assert an interrupt during loading;
if the interrupt controller is loaded later, it might lose that state.

Some ordering is implicitly provided by the order in which the machine
definition creates devices; however, this is somewhat fragile.

The ``MigrationPriority`` enum provides a means of explicitly enforcing
ordering.  Numerically higher priorities are loaded earlier.
The priority is set through the ``priority`` field of the top level
``VMStateDescription`` for the device.
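
The effect of the priority rule can be sketched in Python as a stable sort.
The device names and priority values below are made up for illustration;
only the rule "numerically higher priority loads earlier" comes from the
text above.

```python
from typing import NamedTuple


class ToyDevice(NamedTuple):
    name: str
    priority: int  # numerically higher values load earlier


def load_order(devices):
    """Return device names in load order.

    sorted() is stable, so devices with equal priority keep the order in
    which they were created (the implicit ordering mentioned above).
    """
    return [d.name for d in sorted(devices, key=lambda d: -d.priority)]
```

For example, an interrupt controller given priority 10 is loaded before two
priority-0 devices, which in turn keep their creation order.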

Stream structure
================

The stream tries to be word-size and endian agnostic, allowing migration
between hosts of different characteristics running the same VM.

  - Header

    - Magic
    - Version
    - VM configuration section

       - Machine type
       - Target page bits
  - List of sections
    Each section contains a device, or one iteration of a device save.

    - section type
    - section id
    - ID string (First section of each device)
    - instance id (First section of each device)
    - version id (First section of each device)
    - <device data>
    - Footer mark
  - EOF mark
  - VM Description structure
    Consisting of a JSON description of the contents for analysis only

The ``device data`` in each section consists of the data produced
by the code described above.  Non-iterative devices have a single
section; iterative devices have an initial and last section and a set
of parts in between.
Note that there is very little checking by the common code of the integrity
of the ``device data`` contents; that is up to the devices themselves.
The ``footer mark`` provides a little bit of protection for the case where
the receiving side reads more or less data than expected.
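
To make the framing of a device's first section concrete, here is a toy
encoder in Python.  It is a sketch under assumptions, not the wire format:
the field widths, the numeric values of ``SECTION_START`` and
``SECTION_FOOTER``, and the idea that the footer repeats the section id are
all assumptions made for the example; the authoritative definitions live in
the QEMU source.

```python
import struct

# Assumed marker values, for illustration only.
SECTION_START = 0x01
SECTION_FOOTER = 0x7E


def encode_first_section(section_id, idstr, instance_id, version_id, payload):
    """Frame the first section of a device: header, device data, footer."""
    name = idstr.encode("ascii")
    # Big-endian fixed-width fields keep the layout host independent.
    header = struct.pack(">BIB", SECTION_START, section_id, len(name)) + name
    header += struct.pack(">II", instance_id, version_id)
    footer = struct.pack(">BI", SECTION_FOOTER, section_id)
    return header + payload + footer


def check_footer(buf, section_id):
    """Cheap sanity check that the reader stopped where the writer did."""
    mark, sid = struct.unpack(">BI", buf[-5:])
    return mark == SECTION_FOOTER and sid == section_id
```

If a device's load code consumes one byte too many or too few, the bytes
found where the footer is expected no longer match, which is the limited
protection the footer mark offers.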

The ``ID string`` is normally unique, having been formed from a bus name
and device address; PCI devices and storage devices hung off PCI controllers
fit this pattern well.  Some devices are fixed single instances (e.g. "pc-ram").
Others (especially either older devices or system devices which for
some reason don't have a bus concept) make use of the ``instance id``
for otherwise identically named devices.

Return path
-----------

Only a unidirectional stream is required for normal migration; however, a
``return path`` can be created when bidirectional communication is desired.
This is primarily used by postcopy, but is also used to return a success
flag to the source at the end of migration.

``qemu_file_get_return_path(QEMUFile* fwdpath)`` gives the ``QEMUFile*`` for the
return path.

  Source side

     Forward path - written by migration thread
     Return path  - opened by main thread, read by return-path thread

  Destination side

     Forward path - read by main thread
     Return path  - opened by main thread, written by main thread AND postcopy
     thread (protected by rp_mutex)

Dirty limit
===========

The dirty limit, short for dirty page rate upper limit, is a new capability
introduced in the 8.1 QEMU release that uses a new algorithm based on the KVM
dirty ring to throttle down the guest during live migration.

The algorithm framework is as follows:

::

  ------------------------------------------------------------------------------
  main   --------------> throttle thread ------------> PREPARE(1) <--------
  thread  \                                                |              |
           \                                               |              |
            \                                              V              |
             -\                                        CALCULATE(2)       |
               \                                           |              |
                \                                          |              |
                 \                                         V              |
                  \                                    SET PENALTY(3) -----
                   -\                                      |
                     \                                     |
                      \                                    V
                       -> virtual CPU thread -------> ACCEPT PENALTY(4)
  ------------------------------------------------------------------------------

When the QMP command handler ``qmp_set_vcpu_dirty_limit`` is called for the
first time, the QEMU main thread starts the throttle thread.  The throttle
thread, once launched, executes the loop, which consists of three steps:

  - PREPARE (1)

     As the name implies, PREPARE (1) is preparation for the second stage,
     CALCULATE (2).  It involves preparing the dirty page rate value and
     the corresponding upper limit of the VM.  The dirty page rate is
     calculated via the KVM dirty ring mechanism, which tells QEMU how many
     pages a virtual CPU has dirtied since the last
     ``KVM_EXIT_DIRTY_RING_FULL`` exit.  The dirty page rate upper limit is
     specified by the caller, so it is fetched directly.

  - CALCULATE (2)

     Calculate a suitable sleep period for each virtual CPU, which will be
     used to determine the penalty for the target virtual CPU. The
     computation must be done carefully in order to reduce the dirty page
     rate progressively down to the upper limit without oscillation. To
     achieve this, two strategies are provided: the first is to add or
     subtract sleep time based on the ratio of the current dirty page rate
     to the limit, which is used when the current dirty page rate is far
     from the limit; the second is to add or subtract a fixed time when
     the current dirty page rate is close to the limit.

  - SET PENALTY (3)

     Set the sleep time for each virtual CPU that should be penalized based
     on the results of the calculation supplied by step CALCULATE (2).

After completing the three stages above, the throttle thread loops back
to step PREPARE (1) until the dirty limit is reached.
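
The two strategies of CALCULATE (2) can be sketched as a single adjustment
function.  This is not QEMU's implementation: the function name, the
``far_ratio`` cut-off, the fixed step and the clamp are hypothetical
parameters chosen only to show the shape of the idea (a proportional step
far from the limit, a fixed step near it).

```python
def next_sleep_us(sleep_us, dirty_rate, limit,
                  far_ratio=0.5, fixed_step_us=1000, max_sleep_us=100_000):
    """Return the adjusted per-vCPU sleep time in microseconds."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    # Relative distance between the measured rate and the limit.
    gap = abs(dirty_rate - limit) / limit
    if gap > far_ratio:
        # Far from the limit: step scales with the distance.
        step = int(max(sleep_us, fixed_step_us) * gap)
    else:
        # Close to the limit: small fixed step to avoid oscillation.
        step = fixed_step_us
    if dirty_rate > limit:
        sleep_us += step
    elif dirty_rate < limit:
        sleep_us -= step
    # Keep the penalty within sane bounds.
    return max(0, min(sleep_us, max_sleep_us))
```

A vCPU dirtying at four times the limit gets a large increase in one step,
while a vCPU that is 10% over the limit is nudged by the fixed step only.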

On the other hand, each virtual CPU thread reads the sleep duration and
sleeps in the path of the ``KVM_EXIT_DIRTY_RING_FULL`` exit handler; that
is ACCEPT PENALTY (4).  Virtual CPUs running write-heavy processes will
take this exit path and get penalized, whereas virtual CPUs running
read-mostly processes will not.

In summary, thanks to the KVM dirty ring technology, the dirty limit
algorithm will restrict virtual CPUs as needed to keep their dirty page
rate inside the limit. This leads to more steady reading performance during
live migration and can aid in improving large guest responsiveness.

Postcopy
========

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge).  Its plus side is that there is an upper bound
on the amount of migration traffic and time it takes; the down side is that
during the postcopy phase, a failure of *either* side causes the guest to be
lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
-----------------

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode.  Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on.  Issuing it after the end of a migration is harmless.
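
The commands above are shown in human monitor (HMP) form.  When driving QEMU
over QMP, the same sequence on the source would look roughly like the
messages built below.  The helper name and the example URI are hypothetical,
and the sketch only constructs the JSON; it does not open a QMP connection.

```python
import json


def postcopy_commands(uri):
    """Build the QMP messages for the source side, in the order sent.

    The destination only needs the first message (the capability).
    """
    return [
        {"execute": "migrate-set-capabilities",
         "arguments": {"capabilities": [
             {"capability": "postcopy-ram", "state": True}]}},
        {"execute": "migrate", "arguments": {"uri": uri}},
        # May be sent right after 'migrate' or at any later time.
        {"execute": "migrate-start-postcopy"},
    ]


if __name__ == "__main__":
    for cmd in postcopy_commands("tcp:dst.example:4444"):
        print(json.dumps(cmd))
```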

Blocktime is a postcopy live migration metric, intended to show how
long a vCPU was in a state of interruptible sleep due to page faults.
The metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU.  These values are calculated on the destination
side.  To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command.
The ``postcopy-blocktime`` value in the reply shows the overlapped blocking
time for all vCPUs; ``postcopy-vcpu-blocktime`` shows a list of blocking
times per vCPU.

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).

Postcopy device transfer
------------------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source.  The
migration stream therefore has to be able to respond with page data *during*
the device load, and hence the device data has to be read from the stream
completely before the device load begins, to free the stream up.  This is
achieved by 'packaging' the device data into a blob that's read in one go.

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'
   - The device state

     A series of sections, identical to the precopy stream's device state stream
     containing everything except postcopiable devices (i.e. RAM)
   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy);
however, when a page request is received from the destination, the dirty page
scanning restarts from the requested location.  This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly in the hope that those pages are likely to be used
by the destination soon.
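
The scanning behaviour can be modelled in a few lines of Python.  This is a
simplified sketch, not the QEMU code: the function name is hypothetical, the
dirty bitmap is a plain list, and each request is given as a pair
``(pages_sent_when_it_arrives, page_index)`` so the example stays
deterministic.

```python
from collections import deque


def postcopy_send_order(dirty, requests):
    """Return the order in which page indexes are sent."""
    dirty = list(dirty)              # work on a copy of the bitmap
    pending = deque(sorted(requests))
    order, pos, n = [], 0, len(dirty)
    while any(dirty):
        if pending and pending[0][0] <= len(order):
            # A page request arrived: restart scanning at that page.
            _, pos = pending.popleft()
        # Advance to the next dirty page, wrapping around the bitmap.
        while not dirty[pos % n]:
            pos += 1
        page = pos % n
        dirty[page] = False          # the dirty bit is cleared once sent
        order.append(page)
        pos = page + 1
    return order
```

With eight dirty pages and a request for page 5 arriving after two pages
have been sent, the order becomes 0, 1, 5, 6, 7, 2, 3, 4: the requested page
and its neighbours jump the queue, and the scan later wraps around to pick
up what was skipped.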

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2   3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

   All the data associated with the package - the ( ... ) section in the diagram -
   is read into memory, and the main thread recurses into ``qemu_loadvm_state_main``
   to process the contents of the package (2) which contains commands (3,6) and
   devices (4...)

- On receipt of 'postcopy listen' (3) (i.e. the 1st command in the package)

   a new thread (a) is started that takes over servicing the migration stream,
   while the main thread carries on loading the package.   It loads normal
   background page data (b) but if during a device load a fault happens (5)
   the returned page (c) is loaded by the listen thread allowing the main
   thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

   letting the destination CPUs start running.  At the end of the
   ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
   is no longer used by migration, while the listen thread carries on servicing
   page data until the end of migration.

Postcopy Recovery
-----------------

Compared to precopy, postcopy is special in its error handling.  When any
error happens (in this case, mostly network errors), QEMU cannot easily
fail a migration because VM data resides in both source and destination
QEMU instances.  Instead, when an issue happens, QEMU on both sides
will go into a paused state.  A recovery phase is needed to continue a
paused postcopy migration.

The recovery phase normally contains a few steps:

  - When a network issue occurs, both QEMU instances go into the PAUSED state.

  - When the network is recovered (or a new network is provided), the admin
    can set up the new channel for migration using the QMP command
    'migrate-recover' on the destination node, preparing for a resume.

  - On the source host, the admin can continue the interrupted postcopy
    migration using the QMP command 'migrate' with the resume=true flag set.

  - After the connection is re-established, QEMU will continue the postcopy
    migration on both sides.
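
The two admin actions in the steps above correspond to QMP messages along
these lines.  The helper name and both URIs are placeholders; the sketch
only builds the messages and assumes each one is then sent to the right
QEMU instance.

```python
def recovery_commands(listen_uri, connect_uri):
    """Return (destination_message, source_message) for a postcopy resume."""
    # Destination first: set up the new channel to listen on.
    on_destination = {"execute": "migrate-recover",
                      "arguments": {"uri": listen_uri}}
    # Then the source reconnects and resumes the paused migration.
    on_source = {"execute": "migrate",
                 "arguments": {"uri": connect_uri, "resume": True}}
    return on_destination, on_source
```

The order matters: the destination has to be ready on the new channel
before the source issues the resuming ``migrate``.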

During a paused postcopy migration, the VM can logically still continue
running, and it will not be impacted by accesses to pages that
were already migrated to the destination VM before the interruption happened.
However, if any of the missing pages is accessed on the destination VM, the
VM thread will be halted waiting for the page to be migrated, which means it
can be halted until the recovery is complete.

The impact of accessing missing pages depends on the configuration of the
guest.  For example, with async page fault enabled, logically the guest can
proactively schedule out the threads accessing missing pages.

Postcopy states
---------------

Postcopy moves through a series of states (see ``postcopy_state``) from
ADVISE->DISCARD->LISTEN->RUNNING->END

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.
    (Triggered by reception of POSTCOPY_ADVISE command)

 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob.  With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running.  The main
    thread now finishes processing the migration package and
    now carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - Paused

    Postcopy can run into a paused state (normally on both sides when it
    happens), where all threads are temporarily halted, mostly due to
    network errors.  When reaching the paused state, migration makes sure
    the QEMU binary on both sides maintains the data without corrupting
    the VM.  To continue the migration, the admin needs to fix the
    migration channel using the QMP command 'migrate-recover' on the
    destination node, then resume the migration using the QMP command
    'migrate' again on the source node, with the resume=true flag set.

 - End

    The listen thread can now quit and perform the cleanup of migration
    state; the migration is now complete.
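
One way to read the state list above is as a small transition table.  The
Python model below is an interpretation of this section, not a copy of
QEMU's ``postcopy_state`` handling: the enum, the table and ``advance`` are
hypothetical, and the Paused transitions reflect only the pause/resume
behaviour described in the text.

```python
from enum import Enum


class PostcopyState(Enum):
    ADVISE = "advise"
    DISCARD = "discard"
    LISTEN = "listen"
    RUNNING = "running"
    PAUSED = "paused"
    END = "end"


S = PostcopyState
# Allowed next states, following ADVISE->DISCARD->LISTEN->RUNNING->END
# plus the pause/resume loop around RUNNING.
ALLOWED = {
    S.ADVISE: {S.DISCARD},
    S.DISCARD: {S.LISTEN},
    S.LISTEN: {S.RUNNING},
    S.RUNNING: {S.PAUSED, S.END},
    S.PAUSED: {S.RUNNING},
    S.END: set(),
}


def advance(current, new):
    """Return `new` if the transition is allowed, else raise ValueError."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new
```

Encoding the transitions as data makes it easy to reject an out-of-order
command, for example a 'run' that arrives before 'listen'.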

Source side page map
--------------------

The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty' - i.e. needs
sending.  During the precopy phase this is updated as the CPU dirties
pages; however, during postcopy the source CPUs are stopped and nothing
should dirty anything any more.  Instead, dirty bits are cleared when the
relevant pages are sent during postcopy.

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, triggering (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.
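
The estimate in (d) follows from simple arithmetic, reproduced here as a
back-of-the-envelope calculation.  It ignores protocol overhead and link
contention, so real stalls will be somewhat longer; the function name is
made up for the example.

```python
MIB = 1 << 20
GIB = 1 << 30


def transfer_seconds(page_bytes, link_bits_per_second):
    """Ideal time to push one page over the link, ignoring all overhead."""
    return page_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    for label, size in (("2MB", 2 * MIB), ("1GB", GIB)):
        ms = transfer_seconds(size, 10e9) * 1000
        print(f"{label} hugepage over 10Gbps: {ms:.2f} ms")
```

A 2MB page needs under 2 ms, whereas a 1GB page needs roughly 0.86 s, during
which the faulting destination thread cannot make progress.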

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU. There are restrictions on the types
of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault.  The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.
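
The role of the mapping table can be illustrated with a small lookup
routine.  The ``Region`` layout, the function name and the addresses are
hypothetical; the sketch shows only the idea of translating a fault address
in the client's address space into a RAMBlock name and offset.

```python
import bisect
from typing import NamedTuple


class Region(NamedTuple):
    client_start: int      # start address in the client's address space
    size: int              # length of the mapping in bytes
    ramblock: str          # name of the RAMBlock backing the mapping
    ramblock_offset: int   # offset of the mapping inside that RAMBlock


def translate(regions, client_addr):
    """Map a client fault address to (ramblock, offset).

    `regions` must be sorted by client_start and must not overlap.
    """
    starts = [r.client_start for r in regions]
    i = bisect.bisect_right(starts, client_addr) - 1
    if i < 0:
        raise LookupError("address below every registered region")
    r = regions[i]
    if client_addr >= r.client_start + r.size:
        raise LookupError("address not inside a registered region")
    return r.ramblock, r.ramblock_offset + (client_addr - r.client_start)
```

Once the fault has been translated, the page request can be expressed in the
same RAMBlock/offset terms used for faults raised by QEMU itself.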

.. note::
  There are two future improvements that would be nice:
    a) Some way to make QEMU ignorant of the addresses in the client's
       address space
    b) Avoiding the need for QEMU to perform ufd-wake calls after the
       pages have arrived

Retro-fitting postcopy to existing clients is possible:
  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy.  In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.

Postcopy Preemption Mode
------------------------

Postcopy preempt is a capability introduced in the 8.0 QEMU release.  It
allows urgent pages (those explicitly requested by the destination QEMU
because of a page fault) to be sent in a separate preempt channel, rather
than queued in the background migration channel.  Anyone who cares about
latencies of page faults during a postcopy migration should enable this
feature.  By default, it's not enabled.

Firmware
========

Migration migrates the copies of RAM and ROM, and thus when running
on the destination it includes the firmware from the source. Even after
resetting a VM, the old firmware is used.  Only once QEMU has been restarted
is the new firmware in use.

- Changes in firmware size can cause changes in the required RAMBlock size
  to hold the firmware and thus migration can fail.  In practice it's best
  to pad firmware images to convenient powers of 2 with plenty of space
  for growth.

- Care should be taken with device emulation code so that newer
  emulation code can work with older firmware to allow forward migration.

- Care should be taken with newer firmware so that backward migration
  to older systems with older device emulation code will work.

In some cases it may be best to tie specific firmware versions to specific
versioned machine types to cut down on the combinations that will need
support.  This is also useful when newer versions of firmware outgrow
the padding.
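
As a small illustration of the padding advice, the helper below rounds an
image size up to a power of two after reserving some headroom.  The function
name and the ``headroom`` parameter are hypothetical; how much growth space
is "plenty" is a judgment call for the firmware maintainer.

```python
MIB = 1 << 20


def padded_size(image_bytes, headroom=0):
    """Smallest power of two that fits the image plus `headroom` bytes."""
    needed = image_bytes + headroom
    if needed <= 0:
        raise ValueError("size must be positive")
    return 1 << (needed - 1).bit_length()
```

For example, a 3 MiB image with 2 MiB reserved for growth is padded to
8 MiB, so later firmware releases can grow without changing the RAMBlock
size.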
996*8cb2f8b1SPeter Xu
997*8cb2f8b1SPeter Xu
998*8cb2f8b1SPeter XuBackwards compatibility
999*8cb2f8b1SPeter Xu=======================
1000*8cb2f8b1SPeter Xu
1001*8cb2f8b1SPeter XuHow backwards compatibility works
1002*8cb2f8b1SPeter Xu---------------------------------
1003*8cb2f8b1SPeter Xu
1004*8cb2f8b1SPeter XuWhen we do migration, we have two QEMU processes: the source and the
1005*8cb2f8b1SPeter Xutarget.  There are two cases, they are the same version or they are
1006*8cb2f8b1SPeter Xudifferent versions.  The easy case is when they are the same version.
1007*8cb2f8b1SPeter XuThe difficult one is when they are different versions.
1008*8cb2f8b1SPeter Xu
1009*8cb2f8b1SPeter XuThere are two things that are different, but they have very similar
1010*8cb2f8b1SPeter Xunames and sometimes get confused:
1011*8cb2f8b1SPeter Xu
1012*8cb2f8b1SPeter Xu- QEMU version
1013*8cb2f8b1SPeter Xu- machine type version
1014*8cb2f8b1SPeter Xu
1015*8cb2f8b1SPeter XuLet's start with a practical example, we start with:
1016*8cb2f8b1SPeter Xu
1017*8cb2f8b1SPeter Xu- qemu-system-x86_64 (v5.2), from now on qemu-5.2.
1018*8cb2f8b1SPeter Xu- qemu-system-x86_64 (v5.1), from now on qemu-5.1.
1019*8cb2f8b1SPeter Xu
1020*8cb2f8b1SPeter XuRelated to this are the "latest" machine types defined on each of
1021*8cb2f8b1SPeter Xuthem:
1022*8cb2f8b1SPeter Xu
1023*8cb2f8b1SPeter Xu- pc-q35-5.2 (newer one in qemu-5.2) from now on pc-5.2
1024*8cb2f8b1SPeter Xu- pc-q35-5.1 (newer one in qemu-5.1) from now on pc-5.1
1025*8cb2f8b1SPeter Xu
First of all, migration is only supposed to work if you use the same
machine type on both source and destination.  The QEMU hardware
configuration also needs to be the same on source and destination.
Most aspects of the backend configuration can be changed at will,
except for a few cases where the backend features influence frontend
device feature exposure.  But that is not relevant for this section.

Let's list the combinations that we can have.  We start with the
trivial ones, where QEMU is the same on source and destination:

1 - qemu-5.2 -M pc-5.2  -> migrates to -> qemu-5.2 -M pc-5.2

  This is the latest QEMU with the latest machine type.
  This has to work, and if it doesn't work it is a bug.

2 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1

  Exactly the same case as the previous one, but for 5.1.
  Nothing to see here either.

These are the easiest ones, so we will not talk more about them in this
section.

Now we start with the more interesting cases.  Consider the case where
we have the same QEMU version on both sides (qemu-5.2), but we are not
using the latest machine type for that version (pc-5.2); instead we use
one from an older QEMU version, in this case pc-5.1.

3 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

  It needs to use the definition of pc-5.1 and the devices as they
  were configured on 5.1, but this should be easy in the sense that
  both sides are the same QEMU and both sides have exactly the same
  idea of what the pc-5.1 machine is.

4 - qemu-5.1 -M pc-5.2  -> migrates to -> qemu-5.1 -M pc-5.2

  This combination is not possible, as qemu-5.1 doesn't understand
  the pc-5.2 machine type.  So there is nothing to worry about here.

Now come the interesting ones, when the two QEMU processes are
different.  Notice also that the machine type needs to be pc-5.1,
because we have the limitation that qemu-5.1 doesn't know pc-5.2.  So
the possible cases are:

5 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1

  This migration is known as newer to older.  When we are developing
  5.2, we need to take care not to break migration to qemu-5.1.
  Notice that we can't update qemu-5.1 to understand whatever
  qemu-5.2 decides to change, so it is up to the qemu-5.2 side to
  make the relevant changes.

6 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

  This migration is known as older to newer.  We need to make sure
  that we are able to receive migrations from qemu-5.1.  The problem is
  similar to the previous one.

If qemu-5.1 and qemu-5.2 were the same, there would not be any
compatibility problems.  But the reason that we create qemu-5.2 is to
get new features, devices, defaults, etc.

If we get a device that has a new feature, or change a default value,
we have a problem when we try to migrate between different QEMU
versions.

So we need a way to tell qemu-5.2 that when we are using machine type
pc-5.1, it needs to **not** use the feature, to be able to migrate to
a real qemu-5.1.

The equivalent applies when migrating from qemu-5.1 to qemu-5.2:
qemu-5.2 has to expect that it is not going to get data for the new
feature, because qemu-5.1 doesn't know about it.

How do we tell QEMU about these device feature changes?  In the
hw_compat_X_Y arrays in hw/core/machine.c.

If we change a default value, we need to put the old value back in
that array.  The device, during initialization, needs to look at
that array to see what value it needs to use for that feature.  What
we put in that array is the value of a property.

To create a property for a device, we need to use one of the
DEFINE_PROP_*() macros.  See include/hw/qdev-properties.h to find the
macros that exist.  With it, we set the default value for that
property, and that is what it is going to get in the latest released
version.  But if we want a different value for a previous version, we
can change that in the hw_compat_X_Y arrays.

hw_compat_X_Y is an array of entries that have the format:

- name_device
- name_property
- value

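The way such a table overrides a default can be sketched in plain C.
This is only an illustration under our own assumptions: ``CompatProp``,
``resolve_prop()``, ``example-device`` and ``new-feature`` are
hypothetical names, and real QEMU applies its ``GlobalProperty`` entries
through its own property machinery rather than through a lookup helper
like this one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one compat entry: device, property, value. */
typedef struct {
    const char *driver;
    const char *property;
    const char *value;
} CompatProp;

/* Illustrative compat table for an older machine type. */
static const CompatProp compat_old[] = {
    { "example-device", "new-feature", "off" },
};
static const size_t compat_old_len =
    sizeof(compat_old) / sizeof(compat_old[0]);

/*
 * Return the value a property should take: the compat override if the
 * table has an entry for (driver, property), else the built-in default.
 */
static const char *resolve_prop(const CompatProp *table, size_t len,
                                const char *driver, const char *property,
                                const char *builtin_default)
{
    assert(driver && property && builtin_default);
    for (size_t i = 0; i < len; i++) {
        if (strcmp(table[i].driver, driver) == 0 &&
            strcmp(table[i].property, property) == 0) {
            return table[i].value;      /* old machine type: old value */
        }
    }
    return builtin_default;             /* latest machine type */
}
```

With the table, ``resolve_prop(compat_old, compat_old_len,
"example-device", "new-feature", "on")`` yields ``"off"``; with an empty
table the built-in default ``"on"`` is returned.
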
Let's see a practical example.

In qemu-5.2 virtio-blk-device got multi queue support.  This is a
change that is not backward compatible.  In qemu-5.1 it has one
queue.  In qemu-5.2 it has the same number of queues as the number of
cpus in the system.

When we are doing migration, if we migrate from a device that has 4
queues to a device that has only one queue, we don't know where to
put the extra information for the other 3 queues, and we fail
migration.

There is a similar problem when we migrate from qemu-5.1, which has
only one queue, to qemu-5.2: we only send information for one queue,
but the destination has 4, so we have 3 queues that are not properly
initialized and anything can happen.

So, how can we address this problem?  Easy, just convince qemu-5.2
that when it is running pc-5.1, it needs to set the number of queues
for virtio-blk devices to 1.

That way we fix cases 5 and 6.

5 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1

    qemu-5.2 -M pc-5.1 sets number of queues to be 1.
    qemu-5.1 -M pc-5.1 expects number of queues to be 1.

    Correct.  Migration works.

6 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

    qemu-5.1 -M pc-5.1 sets number of queues to be 1.
    qemu-5.2 -M pc-5.1 expects number of queues to be 1.

    Correct.  Migration works.

And now the other interesting case, case 3.  In this case we have:

3 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

    Here we have the same QEMU on both sides.  So it doesn't matter a
    lot if we have set the number of queues to 1 or not, because
    they are the same.

    WRONG!

    Think what happens if we do one of these double migrations:

    A -> migrates -> B -> migrates -> C

    where:

    A: qemu-5.1 -M pc-5.1
    B: qemu-5.2 -M pc-5.1
    C: qemu-5.2 -M pc-5.1

    Migration A -> B is case 6, so the number of queues needs to be 1.

    Migration B -> C is case 3, so we might think we don't care.  But
    we do care, because the guest was not started in qemu-5.2; it was
    migrated from qemu-5.1.  So to be on the safe side, we need to
    always use 1 queue when we are using pc-5.1.

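The rule can be restated as a tiny C sketch.  It is a toy model, not
QEMU code: ``ToyMachine`` and ``toy_num_queues()`` are hypothetical
names, and the "one queue per cpu" default simply follows the
description above.  The point is that the queue count is a function of
the machine type only, so A, B and C all agree.

```c
#include <assert.h>

/* The two machine types from the example. */
typedef enum { TOY_PC_5_1, TOY_PC_5_2 } ToyMachine;

/*
 * Queue count for virtio-blk in this toy model.  Note what is NOT a
 * parameter: the version of the QEMU binary running the machine.
 */
static unsigned toy_num_queues(ToyMachine machine, unsigned smp_cpus)
{
    assert(smp_cpus > 0);
    return machine == TOY_PC_5_1 ? 1u : smp_cpus;
}
```

For a 4-cpu guest, ``toy_num_queues(TOY_PC_5_1, 4)`` is 1 whether the
guest runs on qemu-5.1 or on qemu-5.2, while
``toy_num_queues(TOY_PC_5_2, 4)`` is 4.
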
Now, how was this done in reality?  The following commit shows how it
was done::

  commit 9445e1e15e66c19e42bea942ba810db28052cd05
  Author: Stefan Hajnoczi <stefanha@redhat.com>
  Date:   Tue Aug 18 15:33:47 2020 +0100

  virtio-blk-pci: default num_queues to -smp N

The relevant parts for migration are::

    @@ -1281,7 +1284,8 @@ static Property virtio_blk_properties[] = {
     #endif
         DEFINE_PROP_BIT("request-merging", VirtIOBlock, conf.request_merging, 0,
                         true),
    -    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues, 1),
    +    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues,
    +                       VIRTIO_BLK_AUTO_NUM_QUEUES),
         DEFINE_PROP_UINT16("queue-size", VirtIOBlock, conf.queue_size, 256),

It changes the default value of num_queues.  But it fixes it up for old
machine types so that they keep the right value::

    @@ -31,6 +31,7 @@
     GlobalProperty hw_compat_5_1[] = {
         ...
    +    { "virtio-blk-device", "num-queues", "1"},
         ...
     };

A device with different features on both sides
----------------------------------------------

Let's assume that we are using the same QEMU binary on both sides,
just to make things easier.  But we have a device that has
different features on each side of the migration.  That can be
because the devices are different, because the kernel drivers of the
two devices have different features, or for any other reason.

How can we get this to work with migration?  The way to do that is
"theoretically" easy.  You take the features that the device has on
the source of the migration and the features that the device has on
the target of the migration, you compute the intersection of the
features of both sides, and that is the configuration with which you
should launch QEMU.

Notice that this is not completely related to QEMU.  The most
important thing here is that this should be handled by the managing
application that launches QEMU.  If QEMU is configured correctly, the
migration will succeed.

That said, actually doing it is complicated.  Almost all devices are
bad at being launched with only some features enabled.
With one big exception: cpus.

You can read the documentation for QEMU x86 cpu models here:

https://qemu-project.gitlab.io/qemu/system/qemu-cpu-models.html

Notice that, when it talks about migration, it recommends choosing the
newest cpu model that is supported by all the host cpus.

Let's say that we have:

Host A:

Device X has the feature Y

Host B:

Device X does not have the feature Y

If we try to migrate without any care from host A to host B, it will
fail because when migration tries to load the feature Y on the
destination, it will find that the hardware is not there.

Doing this would be the equivalent of doing the following with cpus:

Host A::

    $ qemu-system-x86_64 -cpu host

Host B::

    $ qemu-system-x86_64 -cpu host

When the two hosts have different cpu features, this is very likely to
fail, especially if host B has fewer features than host A.  If host A
has fewer features than host B, sometimes it works.  The important word
in the last sentence is "sometimes".

So, forgetting about cpu models and continuing with the -cpu host
example, let's say that the difference between the cpus is that hosts
A and B have the following features::

    Features:   'pcid'  'stibp' 'taa-no'
    Host A:        X       X
    Host B:                        X

If we want to migrate between them, the way to configure the QEMU cpu
on each host will be:

Host A::

    $ qemu-system-x86_64 -cpu host,pcid=off,stibp=off

Host B::

    $ qemu-system-x86_64 -cpu host,taa-no=off

And you would be able to migrate between them.  It is the responsibility
of the management application or of the user to make sure that the
configuration is correct.  QEMU doesn't know how to look at this kind
of feature in general.

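What a management application has to compute here can be sketched in
C.  This is a minimal illustration, not something QEMU provides:
``build_cpu_arg()`` and the bitmask encoding of the feature table are
our own assumptions.  Each host turns off every feature that it has
but the other host lacks, which leaves both with the intersection.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Feature names from the table above; bit i of a mask is names[i]. */
static const char *const feature_names[] = { "pcid", "stibp", "taa-no" };
enum { NUM_FEATURES = sizeof(feature_names) / sizeof(feature_names[0]) };

/*
 * Write "host" followed by ",<feature>=off" for every feature that this
 * host has (mine) but the other host lacks (theirs).  Returns the
 * string length, or -1 if buf is too small.
 */
static int build_cpu_arg(unsigned mine, unsigned theirs,
                         char *buf, size_t buflen)
{
    assert(buf && buflen > 0);
    unsigned to_disable = mine & ~theirs;   /* present here, absent there */
    int used = snprintf(buf, buflen, "host");
    if (used < 0 || (size_t)used >= buflen) {
        return -1;
    }
    for (unsigned i = 0; i < NUM_FEATURES; i++) {
        if (!(to_disable & (1u << i))) {
            continue;
        }
        size_t room = buflen - (size_t)used;
        int n = snprintf(buf + used, room, ",%s=off", feature_names[i]);
        if (n < 0 || (size_t)n >= room) {
            return -1;
        }
        used += n;
    }
    return used;
}
```

With host A encoded as ``0x3`` (pcid, stibp) and host B as ``0x4``
(taa-no), the helper produces the two ``-cpu`` arguments shown above.
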
Notice that we don't recommend using -cpu host for migration.  It is
used in this example because it makes the example simpler.

Other devices have worse control over individual features.  If they
want to be able to migrate between hosts that show different features,
the device needs a way to configure which ones it is going to use.

In this section we have assumed that we are using the same QEMU
binary on both sides of the migration.  If we use different QEMU
versions, then we need to take into account all the other
differences, and the examples become even more complicated.

How to mitigate when we have a backward compatibility error
-----------------------------------------------------------

We break migration for old machine types continuously during
development.  But as soon as we find that there is a problem, we fix
it.  The problem is what happens when we detect, after we have done a
release, that something has gone wrong.

Let's see how it worked with one example.

After the release of qemu-8.0 we found a problem when doing migration
of the machine type pc-7.2.

- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

  This migration works

- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2

  This migration works

- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

  This migration fails

- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2

  This migration fails

So clearly something fails when migrating between qemu-7.2 and
qemu-8.0 with machine type pc-7.2.  The error messages and git bisect
pointed to the following commit, which went into qemu-8.0::

    commit 010746ae1db7f52700cb2e2c46eb94f299cfa0d2
    Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Date:   Thu Mar 2 13:37:02 2023 +0000

    hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register

The relevant bits of the commit for our example are these::

    --- a/hw/pci/pcie_aer.c
    +++ b/hw/pci/pcie_aer.c
    @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,

         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
                      PCI_ERR_UNC_SUPPORTED);
    +    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
    +                 PCI_ERR_UNC_MASK_DEFAULT);
    +    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
    +                 PCI_ERR_UNC_SUPPORTED);

         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
                      PCI_ERR_UNC_SEVERITY_DEFAULT);

The patch changes how we configure PCI space for AER.  But QEMU fails
the migration when the PCI space configuration is different between
source and destination.

The following commit shows how this got fixed::

    commit 5ed3dabe57dd9f4c007404345e5f5bf0e347317f
    Author: Leonardo Bras <leobras@redhat.com>
    Date:   Tue May 2 21:27:02 2023 -0300

    hw/pci: Disable PCI_ERR_UNCOR_MASK register for machine type < 8.0

    [...]

The relevant parts of the fix in QEMU are as follows.

First, we create a new property for the device to be able to configure
the old behaviour or the new behaviour::

    diff --git a/hw/pci/pci.c b/hw/pci/pci.c
    index 8a87ccc8b0..5153ad63d6 100644
    --- a/hw/pci/pci.c
    +++ b/hw/pci/pci.c
    @@ -79,6 +79,8 @@ static Property pci_props[] = {
         DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
                            failover_pair_id),
         DEFINE_PROP_UINT32("acpi-index",  PCIDevice, acpi_index, 0),
    +    DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
    +                    QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
         DEFINE_PROP_END_OF_LIST()
     };

Notice that the default is true, so the feature is enabled for new
machine types.

Now we see how the fix is done.  This is going to depend on what kind
of breakage happens, but in this case it is quite simple::

    diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
    index 103667c368..374d593ead 100644
    --- a/hw/pci/pcie_aer.c
    +++ b/hw/pci/pcie_aer.c
    @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver, uint16_t offset,

         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
                      PCI_ERR_UNC_SUPPORTED);
    -    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
    -                 PCI_ERR_UNC_MASK_DEFAULT);
    -    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
    -                 PCI_ERR_UNC_SUPPORTED);
    +
    +    if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
    +        pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
    +                     PCI_ERR_UNC_MASK_DEFAULT);
    +        pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
    +                     PCI_ERR_UNC_SUPPORTED);
    +    }

         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
                      PCI_ERR_UNC_SEVERITY_DEFAULT);

I.e. if the property bit is enabled, we configure it as we did for
qemu-8.0.  If the property bit is not set, we configure it as it was in
qemu-7.2.

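Why gating the register behind a property bit restores compatibility
can be seen in a small stand-alone model.  Everything in it is
hypothetical (``ToyPciDev``, the 16-byte config array, the offset and
the placeholder default value); it only mimics the idea that the
register is written when the bit is set and that the two sides must end
up with identical config space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy model: a few bytes standing in for a device's PCI config space. */
enum { TOY_CFG_SIZE = 16, TOY_UNCOR_MASK_OFF = 8 };
#define TOY_CAP_ERR_UNC_MASK (1u << 0)     /* hypothetical capability bit */
#define TOY_UNC_MASK_DEFAULT 0x00400000u   /* placeholder, not the real value */

typedef struct {
    uint32_t cap_present;
    uint8_t config[TOY_CFG_SIZE];
} ToyPciDev;

/* Initialise config space; the mask register is written only when the
 * capability bit (driven by the property) is set. */
static void toy_aer_init(ToyPciDev *dev, bool err_unc_mask_prop)
{
    assert(dev);
    memset(dev, 0, sizeof(*dev));
    if (err_unc_mask_prop) {
        dev->cap_present |= TOY_CAP_ERR_UNC_MASK;
    }
    if (dev->cap_present & TOY_CAP_ERR_UNC_MASK) {
        uint32_t v = TOY_UNC_MASK_DEFAULT;
        memcpy(dev->config + TOY_UNCOR_MASK_OFF, &v, sizeof(v));
    }
}

/* In this model, migration is accepted only if both sides built the
 * same config space. */
static bool toy_config_matches(const ToyPciDev *src, const ToyPciDev *dst)
{
    assert(src && dst);
    return memcmp(src->config, dst->config, TOY_CFG_SIZE) == 0;
}
```

A device initialised with the property off matches one that never knew
about the register, while a device initialised with the property on
does not.
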
Now, all that is missing is disabling the feature for old
machine types::

    diff --git a/hw/core/machine.c b/hw/core/machine.c
    index 47a34841a5..07f763eb2e 100644
    --- a/hw/core/machine.c
    +++ b/hw/core/machine.c
    @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
         { "e1000e", "migrate-timadj", "off" },
         { "virtio-mem", "x-early-migration", "false" },
         { "migration", "x-preempt-pre-7-2", "true" },
    +    { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
     };
     const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);

And now, when qemu-8.0.1 is released with this fix, all combinations
are going to work as expected.

- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)

So normality has been restored and everything is ok, no?

Not really, now our matrix is much bigger.  We start with the easy
cases: migration from one version to the same version always
works:

- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2

Now the interesting ones, when the versions of the QEMU processes are
different.  The first set fails and we can do nothing about it: both
versions are released and we can't change anything.

- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

These two are the ones that work.  The whole point of making the
change in the qemu-8.0.1 release was to fix this issue:

- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

But now we find that qemu-8.0 can migrate neither to qemu-7.2 nor to
qemu-8.0.1.

- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0 -M pc-7.2

So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
anything except to qemu-8.0.

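The matrix above can be reproduced with a few lines of C if we reduce
each binary to the one fact that matters for pc-7.2.  This is a sketch
under that simplifying assumption; ``Pc72Binary``, ``can_migrate()``
and ``count_other_targets()`` are hypothetical names used only for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One QEMU binary running machine type pc-7.2, reduced to a single
 * fact: does it set up PCI_ERR_UNCOR_MASK by default? */
typedef struct {
    const char *name;
    bool unc_mask_enabled;
} Pc72Binary;

static const Pc72Binary binaries[] = {
    { "qemu-7.2",   false },  /* register not implemented yet */
    { "qemu-8.0",   true  },  /* implemented, no compat entry */
    { "qemu-8.0.1", false },  /* compat entry turns it off for pc-7.2 */
};
enum { NUM_BINARIES = sizeof(binaries) / sizeof(binaries[0]) };

/* In this model a migration works when both ends agree on the register. */
static bool can_migrate(const Pc72Binary *src, const Pc72Binary *dst)
{
    assert(src && dst);
    return src->unc_mask_enabled == dst->unc_mask_enabled;
}

/* Count the binaries, other than itself, that a source can migrate to. */
static size_t count_other_targets(size_t src_idx)
{
    size_t n = 0;
    assert(src_idx < NUM_BINARIES);
    for (size_t i = 0; i < NUM_BINARIES; i++) {
        if (i != src_idx && can_migrate(&binaries[src_idx], &binaries[i])) {
            n++;
        }
    }
    return n;
}
```

In this model ``count_other_targets(1)``, the entry for qemu-8.0, is 0:
a pc-7.2 guest started there has no other binary to go to.
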
Can we do better?

Yes.  If we know that we are going to do this migration:

- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2

We can launch the appropriate devices on the qemu-8.0.1 side with::

  --device...,x-pcie-err-unc-mask=on

And now we can receive a migration from 8.0.  From now on, we can
migrate that guest to newer QEMU versions if we remember to enable
that property for pc-7.2.  Notice that we need to remember; it is not
enough to know that the source of the migration is qemu-8.0.  Think of
this example::

    $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2

In the second migration, the source is not qemu-8.0, but we still have
that "problem" and need to have that property enabled.  Notice that we
need to keep this mark/property until the machine is rebooted.  But a
normal reboot (which doesn't reload QEMU) is not enough: we need the
machine to be powered off and powered on again on a fixed QEMU.  From
then on we can use the proper real machine.