=========
Migration
=========

QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations.  Saving the state does just
that: it saves the state of each device that the guest is running.
Restoring a guest is the opposite operation: we need to load the
state of each device.

For this to work, QEMU has to be launched with the same arguments both
times.  I.e. it can only restore the state into a guest that has
the same devices as the one it was saved from (this last requirement can
be relaxed a bit, but for now we can consider that the configuration has
to be exactly the same).

Once we are able to save/restore a guest, a new piece of functionality is
requested: migration.  This means that QEMU is able to start on one
machine and be "migrated", i.e. moved, to another machine.

Next came the "live migration" functionality.  This is important
because some guests run with a lot of state (especially RAM), and it
can take a while to move all of the state from one machine to another.  Live
migration allows the guest to continue running while the state is
transferred.  The guest only has to be stopped while the last part of
the state is transferred.  Typically the time that the guest is
unresponsive during live migration is in the low hundreds of milliseconds
(notice that this depends on a lot of things).

.. contents::

Transports
==========

The migration stream is normally just a byte stream that can be passed
over any transport.

- tcp migration: do the migration using tcp sockets
- unix migration: do the migration using unix sockets
- exec migration: do the migration using the stdin/stdout of a process.
- fd migration: do the migration using a file descriptor that is
  passed to QEMU.  QEMU doesn't care how this file descriptor is opened.

In addition, support is included for migration using RDMA, which
transports the page data using ``RDMA``, where the hardware takes care of
transporting the pages, and the load on the CPU is much lower.  While the
internals of RDMA migration are a bit different, this isn't really visible
outside the RAM migration code.

All these migration protocols use the same infrastructure to
save/restore device state.  This infrastructure is shared with the
savevm/loadvm functionality.

Debugging
=========

The migration stream can be analyzed thanks to ``scripts/analyze-migration.py``.

Example usage:

.. code-block:: shell

  $ qemu-system-x86_64 -display none -monitor stdio
  (qemu) migrate "exec:cat > mig"
  (qemu) q
  $ ./scripts/analyze-migration.py -f mig
  {
    "ram (3)": {
        "section sizes": {
            "pc.ram": "0x0000000008000000",
  ...

See also ``analyze-migration.py -h`` help for more options.

Common infrastructure
=====================

The files, sockets or fds that carry the migration stream are abstracted by
the ``QEMUFile`` type (see ``migration/qemu-file.h``).  In most cases this
is connected to a subtype of ``QIOChannel`` (see ``io/``).


Saving the state of one device
==============================

For most devices, the state is saved in a single call to the migration
infrastructure; these are *non-iterative* devices.  The data for these
devices is sent at the end of precopy migration, when the CPUs are paused.
There are also *iterative* devices, which contain a very large amount of
data (e.g. RAM or large tables).  See the iterative device section below.

General advice for device developers
------------------------------------

- The migration state saved should reflect the device being modelled rather
  than the way your implementation works.  That way if you change the implementation
  later the migration stream will stay compatible.
  That model may include
  internal state that's not directly visible in a register.

- When saving a migration stream the device code may walk and check
  the state of the device.  These checks might fail in various ways (e.g.
  discovering internal state is corrupt or that the guest has done something bad).
  Consider carefully before asserting/aborting at this point, since the
  normal response from users is that *migration broke their VM* since it had
  apparently been running fine until then.  In these error cases, the device
  should log a message indicating the cause of error, and should consider
  putting the device into an error state, allowing the rest of the VM to
  continue execution.

- The migration might happen at an inconvenient point,
  e.g. right in the middle of the guest reprogramming the device, during
  guest reboot or shutdown or while the device is waiting for external IO.
  It's strongly preferred that migrations do not fail in this situation,
  since in the cloud environment migrations might happen automatically to
  VMs that the administrator doesn't directly control.

- If you do need to fail a migration, ensure that sufficient information
  is logged to identify what went wrong.

- The destination should treat an incoming migration stream as hostile
  (which we do to varying degrees in the existing code).  Check that offsets
  into buffers and the like can't cause overruns.
  Fail the incoming migration
  in the case of a corrupted stream like this.

- Take care with internal device state or behaviour that might become
  migration version dependent.  For example, the order of PCI capabilities
  is required to stay constant across migration.  Another example would
  be that a special case handled by subsections (see below) might become
  much more common if a default behaviour is changed.

- The state of the source should not be changed or destroyed by the
  outgoing migration.  Migrations timing out or being failed by
  higher levels of management, or failures of the destination host are
  not unusual, and in that case the VM is restarted on the source.
  Note that the management layer can validly revert the migration
  even though the QEMU level of migration has succeeded as long as it
  does it before starting execution on the destination.

- Buses and devices should be able to explicitly specify addresses when
  instantiated, and management tools should use those.  For example,
  when hot adding USB devices it's important to specify the ports
  and addresses, since implicit ordering based on the command line order
  may be different on the destination.  This can result in the
  device state being loaded into the wrong device.

VMState
-------

Most device data can be described using the ``VMSTATE`` macros (mostly defined
in ``include/migration/vmstate.h``).

An example (from ``hw/input/pckbd.c``):

.. code:: c

  static const VMStateDescription vmstate_kbd = {
      .name = "pckbd",
      .version_id = 3,
      .minimum_version_id = 3,
      .fields = (const VMStateField[]) {
          VMSTATE_UINT8(write_cmd, KBDState),
          VMSTATE_UINT8(status, KBDState),
          VMSTATE_UINT8(mode, KBDState),
          VMSTATE_UINT8(pending, KBDState),
          VMSTATE_END_OF_LIST()
      }
  };

We are declaring the state with name "pckbd".  The ``version_id`` is
3, and there are 4 uint8_t fields in the KBDState structure.  We
register this ``VMStateDescription`` with one of the following
functions.  The first one will generate a device ``instance_id``
different for each registration.  Use the second one if you already
have an id that is different for each instance of the device:

.. code:: c

    vmstate_register_any(NULL, &vmstate_kbd, s);
    vmstate_register(NULL, instance_id, &vmstate_kbd, s);

For devices that are ``qdev`` based, we can register the device in the class
init function:

.. code:: c

    dc->vmsd = &vmstate_kbd_isa;

The VMState macros take care of ensuring that the device data section
is formatted portably (normally big endian) and make some compile time checks
against the types of the fields in the structures.

VMState macros can include other VMStateDescriptions to store substructures
(see ``VMSTATE_STRUCT_``), arrays (``VMSTATE_ARRAY_``) and variable length
arrays (``VMSTATE_VARRAY_``).  Various other macros exist for special
cases.

Note that the format on the wire is still very raw; i.e. a VMSTATE_UINT32
ends up with a 4 byte bigendian representation on the wire; in the future
it might be possible to use a more structured format.

Legacy way
----------

This way is going to disappear as soon as all current users are ported to VMSTATE;
although converting existing code can be tricky, and thus 'soon' is relative.

Each device has to register two functions, one to save the state and
another to load the state back.

.. code:: c

  int register_savevm_live(const char *idstr,
                           int instance_id,
                           int version_id,
                           SaveVMHandlers *ops,
                           void *opaque);

Two functions in the ``ops`` structure are the ``save_state``
and ``load_state`` functions.
Notice that ``load_state`` receives a version_id
parameter to know which state format it is receiving.  ``save_state`` doesn't
have a version_id parameter because it always uses the latest version.

Note that because the VMState macros still save the data in a raw
format, in many cases it's possible to replace legacy code
with a carefully constructed VMState description that matches the
byte layout of the existing code.

Changing migration data structures
----------------------------------

When we migrate a device, we save/load the state as a series
of fields.  Sometimes, due to bugs or new functionality, we need to
change the state to store more/different information.  Changing the migration
state saved for a device can break migration compatibility unless
care is taken to use the appropriate techniques.  In general QEMU tries
to maintain forward migration compatibility (i.e. migrating from
QEMU n->n+1) and there are users who benefit from backward compatibility
as well.

Subsections
-----------

The most common structure change is adding new data, e.g. when adding
a newer form of device, or adding state that you previously
forgot to migrate.  This is best solved using a subsection.

A subsection is "like" a device vmstate, but with one particularity: it
has a Boolean function that tells whether its values need to be sent
or not.
If this function returns false, the subsection is not sent.
Subsections have a unique name, which is looked for on the receiving
side.

On the receiving side, if we find a subsection for a device that we
don't understand, we just fail the migration.  If we understand all
the subsections, then we load the state with success.  There's no check
that a subsection is loaded, so a newer QEMU that knows about a subsection
can (with care) load a stream from an older QEMU that didn't send
the subsection.

If the new data is only needed in a rare case, then the subsection
can be made conditional on that case and the migration will still
succeed to older QEMUs in most cases.  Failing the migration in the
remaining cases is acceptable for data that's critical, but in some use
cases it's preferred that the migration should succeed even with the
data missing.  To support this the subsection can be connected to a
device property and from there to a versioned machine type.

The 'pre_load' and 'post_load' functions on subsections are only
called if the subsection is loaded.

One important note is that the outer post_load() function is called "after"
loading all subsections, because a newer subsection could change the same
value that it uses.  A flag, and the combination of outer pre_load and
post_load can be used to detect whether a subsection was loaded, and to
fall back on default behaviour when the subsection isn't present.
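
The flag technique can be sketched as a small standalone model.  This is
not QEMU code: ``FooState``, ``FOO_EXTRA_DEFAULT`` and ``foo_model_load()``
are hypothetical names, and ``foo_model_load()`` only mimics the order in
which the callbacks are described as running (outer pre_load, then the
subsection and its post_load if present, then the outer post_load).

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  #define FOO_EXTRA_DEFAULT 0x10u   /* hypothetical fallback value */

  /* Hypothetical device state; not a real QEMU device. */
  typedef struct FooState {
      uint32_t extra;        /* carried only by the optional subsection */
      bool extra_loaded;     /* set when the subsection was in the stream */
  } FooState;

  /* Outer pre_load: clear the flag before anything is loaded. */
  static int foo_pre_load(void *opaque)
  {
      FooState *s = opaque;

      s->extra_loaded = false;
      return 0;
  }

  /* Subsection post_load: only runs if the subsection was received. */
  static int foo_extra_post_load(void *opaque, int version_id)
  {
      FooState *s = opaque;

      (void)version_id;
      s->extra_loaded = true;
      return 0;
  }

  /* Outer post_load: runs after all subsections, so the flag is final. */
  static int foo_post_load(void *opaque, int version_id)
  {
      FooState *s = opaque;

      (void)version_id;
      if (!s->extra_loaded) {
          s->extra = FOO_EXTRA_DEFAULT;   /* fall back to the default */
      }
      return 0;
  }

  /* Stand-in for the core load path; NULL means "subsection not sent". */
  static int foo_model_load(FooState *s, const uint32_t *extra_in_stream)
  {
      assert(s != NULL);
      foo_pre_load(s);
      if (extra_in_stream) {
          s->extra = *extra_in_stream;
          foo_extra_post_load(s, 1);
      }
      return foo_post_load(s, 1);
  }

Calling ``foo_model_load(&s, NULL)`` leaves ``s.extra`` at the default,
while passing a value models a stream that carried the subsection.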

Example:

.. code:: c

  static bool ide_drive_pio_state_needed(void *opaque)
  {
      IDEState *s = opaque;

      return ((s->status & DRQ_STAT) != 0)
          || (s->bus->error_status & BM_STATUS_PIO_RETRY);
  }

  const VMStateDescription vmstate_ide_drive_pio_state = {
      .name = "ide_drive/pio_state",
      .version_id = 1,
      .minimum_version_id = 1,
      .pre_save = ide_drive_pio_pre_save,
      .post_load = ide_drive_pio_post_load,
      .needed = ide_drive_pio_state_needed,
      .fields = (const VMStateField[]) {
          VMSTATE_INT32(req_nb_sectors, IDEState),
          VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
                               vmstate_info_uint8, uint8_t),
          VMSTATE_INT32(cur_io_buffer_offset, IDEState),
          VMSTATE_INT32(cur_io_buffer_len, IDEState),
          VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
          VMSTATE_INT32(elementary_transfer_size, IDEState),
          VMSTATE_INT32(packet_transfer_size, IDEState),
          VMSTATE_END_OF_LIST()
      }
  };

  const VMStateDescription vmstate_ide_drive = {
      .name = "ide_drive",
      .version_id = 3,
      .minimum_version_id = 0,
      .post_load = ide_drive_post_load,
      .fields = (const VMStateField[]) {
          .... several fields ....
          VMSTATE_END_OF_LIST()
      },
      .subsections = (const VMStateDescription * const []) {
          &vmstate_ide_drive_pio_state,
          NULL
      }
  };

Here we have a subsection for the pio state.  We only need to
save/send this state when we are in the middle of a pio operation
(that is what ``ide_drive_pio_state_needed()`` checks).  If DRQ_STAT is
not enabled, the values in those fields are garbage and don't need to
be sent.

Connecting subsections to properties
------------------------------------

Using a condition function that checks a 'property' to determine whether
to send a subsection allows backward migration compatibility when
new subsections are added, especially when combined with versioned
machine types.

For example:

   a) Add a new property using ``DEFINE_PROP_BOOL`` - e.g. support-foo and
      default it to true.
   b) Add an entry to the ``hw_compat_`` for the previous version that sets
      the property to false.
   c) Add a static bool support_foo function that tests the property.
   d) Add a subsection with a .needed set to the support_foo function
   e) (potentially) Add an outer pre_load that sets up a default value
      for 'foo' to be used if the subsection isn't loaded.
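
As a rough standalone model of steps a) to d), assuming a hypothetical
``BarState`` device and a simplified ``ModelSubsection`` type in place of
the real ``VMStateDescription``, the property simply becomes the answer
of the ``.needed`` predicate.  The property and ``hw_compat_`` plumbing
is reduced here to a plain ``bool`` field that the caller sets.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>

  /* Hypothetical device with a "support-foo" property (steps a and b). */
  typedef struct BarState {
      bool support_foo;   /* true by default, false on older machine types */
  } BarState;

  /* Step c: predicate testing the property, shaped like a .needed hook. */
  static bool bar_support_foo(void *opaque)
  {
      BarState *s = opaque;

      return s->support_foo;
  }

  /* Minimal stand-in for a subsection description (step d). */
  typedef struct ModelSubsection {
      const char *name;
      bool (*needed)(void *opaque);
  } ModelSubsection;

  static const ModelSubsection bar_subsections[] = {
      { .name = "bar/foo", .needed = bar_support_foo },
  };

  /* Count the subsections a sender would emit for this device. */
  static size_t bar_count_sent_subsections(BarState *s)
  {
      size_t i, sent = 0;

      assert(s != NULL);
      for (i = 0; i < sizeof(bar_subsections) / sizeof(bar_subsections[0]); i++) {
          if (bar_subsections[i].needed(s)) {
              sent++;
          }
      }
      return sent;
  }

With ``support_foo`` set to false, as a compat entry for an older machine
type would do, the model emits no subsection at all.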

Now that subsection will not be generated when using an older
machine type and the migration stream will be accepted by older
QEMU versions.

Not sending existing elements
-----------------------------

Sometimes members of the VMState are no longer needed:

  - removing them will break migration compatibility

  - making them version dependent and bumping the version will break backward migration
    compatibility.

Adding a dummy field into the migration stream is normally the best way to preserve
compatibility.

If the field really does need to be removed then:

  a) Add a new property/compatibility/function in the same way as for subsections above.
  b) Replace the VMSTATE macro with the _TEST version of the macro, e.g.:

     ``VMSTATE_UINT32(foo, barstruct)``

     becomes

     ``VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)``

     Sometime in the future when we no longer care about the ancient versions these can be killed off.
     Note that for backward compatibility it's important to fill in the structure with
     data that the destination will understand.
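
The effect of a ``_TEST`` predicate can be illustrated with a small
standalone model; it does not use the real VMState machinery.
``BarStruct``, ``send_foo`` and the ``bar_model_*`` helpers are
hypothetical, and the predicate signature is an assumption made for the
sketch.  The field is written and read only when the predicate holds, and
each value is encoded as 4 big endian bytes, as described for
``VMSTATE_UINT32``.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical device: 'foo' is the field being phased out. */
  typedef struct BarStruct {
      uint32_t a;
      uint32_t foo;
      uint32_t b;
      bool send_foo;      /* stands in for a compat property */
  } BarStruct;

  /* Predicate; must give the same answer on source and destination. */
  static bool pre_version_baz(void *opaque, int version_id)
  {
      BarStruct *s = opaque;

      (void)version_id;
      return s->send_foo;
  }

  static size_t put_be32(uint8_t *buf, size_t pos, uint32_t v)
  {
      buf[pos] = (uint8_t)(v >> 24);
      buf[pos + 1] = (uint8_t)(v >> 16);
      buf[pos + 2] = (uint8_t)(v >> 8);
      buf[pos + 3] = (uint8_t)v;
      return pos + 4;
  }

  static size_t get_be32(const uint8_t *buf, size_t pos, uint32_t *v)
  {
      *v = ((uint32_t)buf[pos] << 24) | ((uint32_t)buf[pos + 1] << 16) |
           ((uint32_t)buf[pos + 2] << 8) | buf[pos + 3];
      return pos + 4;
  }

  /* Save: 'foo' is written only when the predicate holds (buf >= 12 bytes). */
  static size_t bar_model_save(BarStruct *s, uint8_t *buf)
  {
      size_t pos = 0;

      assert(s != NULL && buf != NULL);
      pos = put_be32(buf, pos, s->a);
      if (pre_version_baz(s, 1)) {
          pos = put_be32(buf, pos, s->foo);
      }
      return put_be32(buf, pos, s->b);
  }

  /* Load: mirrors the save path field by field. */
  static size_t bar_model_load(BarStruct *s, const uint8_t *buf)
  {
      size_t pos = 0;

      assert(s != NULL && buf != NULL);
      pos = get_be32(buf, pos, &s->a);
      if (pre_version_baz(s, 1)) {
          pos = get_be32(buf, pos, &s->foo);
      }
      return get_be32(buf, pos, &s->b);
  }

In this model the stream is 8 bytes when the predicate is false and 12
bytes when it is true, so both sides have to agree on the predicate for
``b`` to land in the right place.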

Any difference in the predicates on the source and destination will end up
with different fields being enabled and data being loaded into the wrong
fields; for this reason conditional fields like this are very fragile.

Versions
--------

Version numbers are intended for major incompatible changes to the
migration of a device, and using them breaks backward-migration
compatibility; in general most changes can be made by adding Subsections
(see above) or _TEST macros (see above) which won't break compatibility.

Each version is associated with a series of fields saved.  ``save_state`` always saves
the state in the newest version format, but ``load_state`` is sometimes able to
load state from an older version.

You can see that there are two version fields:

- ``version_id``: the maximum version_id supported by VMState for that device.
- ``minimum_version_id``: the minimum version_id that VMState is able to understand
  for that device.

VMState is able to read versions from minimum_version_id to version_id.

There are *_V* forms of many ``VMSTATE_`` macros to handle version dependent fields,
e.g.

.. code:: c

   VMSTATE_UINT16_V(ip_id, Slirp, 2),

only loads that field for versions 2 and newer.

Saving state will always create a section with the 'version_id' value
and thus can't be loaded by any older QEMU.

Massaging functions
-------------------

Sometimes it is not enough to save the state directly
from one structure; we first need to fill in the correct values there.  One
example is when we are using kvm.  Before saving the cpu state, we
need to ask kvm to copy the state that it is using to QEMU.  Conversely,
when we are loading the state, we need a way to tell kvm to
load the state for the cpu that we have just loaded from the QEMUFile.

The functions to do that are inside a vmstate definition, and are called:

- ``int (*pre_load)(void *opaque);``

  This function is called before we load the state of one device.

- ``int (*post_load)(void *opaque, int version_id);``

  This function is called after we load the state of one device.

- ``int (*pre_save)(void *opaque);``

  This function is called before we save the state of one device.

- ``int (*post_save)(void *opaque);``

  This function is called after we save the state of one device
  (even upon failure, unless the call to pre_save returned an error).

Example: You can look at hpet.c, which uses the first three functions
to massage the state that is transferred.
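
As a minimal standalone sketch of such massaging (not taken from hpet.c),
consider a hypothetical ``BazState`` that holds a function pointer at
runtime.  A pointer is meaningless on the destination, so ``pre_save``
turns it into a table index and ``post_load`` turns the index back,
rejecting values outside the table.  All ``baz_*`` names are invented for
the example, and ``baz_model_migrate()`` merely stands in for the stream.

.. code:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  typedef struct BazState BazState;
  typedef void BazHandler(BazState *s);

  struct BazState {
      BazHandler *handler;   /* runtime-only; pointers cannot be migrated */
      uint8_t handler_idx;   /* the value that would go on the wire */
      int counter;
  };

  static void baz_idle(BazState *s) { (void)s; }
  static void baz_busy(BazState *s) { s->counter++; }

  static BazHandler *const baz_handlers[] = { baz_idle, baz_busy };
  #define BAZ_NUM_HANDLERS (sizeof(baz_handlers) / sizeof(baz_handlers[0]))

  /* pre_save: derive the migratable index from the runtime pointer. */
  static int baz_pre_save(void *opaque)
  {
      BazState *s = opaque;
      size_t i;

      for (i = 0; i < BAZ_NUM_HANDLERS; i++) {
          if (baz_handlers[i] == s->handler) {
              s->handler_idx = (uint8_t)i;
              return 0;
          }
      }
      return -1;   /* unknown handler: fail the save rather than abort */
  }

  /* post_load: rebuild the pointer, treating the index as untrusted. */
  static int baz_post_load(void *opaque, int version_id)
  {
      BazState *s = opaque;

      (void)version_id;
      if (s->handler_idx >= BAZ_NUM_HANDLERS) {
          return -1;   /* out-of-range index: fail the incoming migration */
      }
      s->handler = baz_handlers[s->handler_idx];
      return 0;
  }

  /* Model of a migration: only handler_idx travels to the destination. */
  static int baz_model_migrate(BazState *src, BazState *dst)
  {
      assert(src != NULL && dst != NULL);
      if (baz_pre_save(src) < 0) {
          return -1;
      }
      dst->handler_idx = src->handler_idx;
      return baz_post_load(dst, 1);
  }

Returning an error from either callback, instead of asserting, follows
the general advice given earlier for device developers.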

The ``VMSTATE_WITH_TMP`` macro may be useful when the migration
data doesn't match the stored device data well; it allows an
intermediate temporary structure to be populated with migration
data and then transferred to the main structure.

If you use memory API functions that update memory layout outside
initialization (i.e., in response to a guest action), this is a strong
indication that you need to call these functions in a ``post_load`` callback.
Examples of such memory API functions are:

  - memory_region_add_subregion()
  - memory_region_del_subregion()
  - memory_region_set_readonly()
  - memory_region_set_nonvolatile()
  - memory_region_set_enabled()
  - memory_region_set_address()
  - memory_region_set_alias_offset()

Iterative device migration
--------------------------

Some devices, such as RAM, Block storage or certain platform devices,
have such large amounts of data that the CPUs would be
paused for too long if it were all sent in one section.  For these
devices an *iterative* approach is taken.

The iterative devices generally don't use VMState macros
(although it may be possible in some cases) and instead use the
``qemu_put_*``/``qemu_get_*`` helpers to read/write data to the stream.  Specialist
versions exist for high bandwidth IO.


An iterative device must provide:

  - A ``save_setup`` function that initialises the data structures and
    transmits a first section containing information on the device.  In the
    case of RAM this transmits a list of RAMBlocks and sizes.

  - A ``load_setup`` function that initialises the data structures on the
    destination.

  - A ``state_pending_exact`` function that indicates how much more
    data we must save.  The core migration code will use this to
    determine when to pause the CPUs and complete the migration.

  - A ``state_pending_estimate`` function that indicates how much more
    data we must save.  When the estimated amount is smaller than the
    threshold, we call ``state_pending_exact``.

  - A ``save_live_iterate`` function should send a chunk of data until
    the point that stream bandwidth limits tell it to stop.  Each call
    generates one section.

  - A ``save_live_complete_precopy`` function that must transmit the
    last section for the device containing any remaining data.

  - A ``load_state`` function used to load sections generated by
    any of the save functions that generate sections.

  - ``cleanup`` functions for both save and load that are called
    at the end of migration.
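
How these hooks interact can be sketched with a toy model.  The function
names below only echo the hooks listed above; their signatures are
simplified for illustration and do not match the real handler types.
``BigState`` is a hypothetical device reduced to a byte counter, and the
model ignores the fact that a running guest keeps dirtying data.

.. code:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical iterative device: just a count of bytes left to send. */
  typedef struct BigState {
      uint64_t remaining;
      uint64_t sent;
      unsigned sections;     /* sections generated so far */
  } BigState;

  /* Shaped after state_pending_exact: how much data is still to be saved. */
  static uint64_t big_state_pending(const BigState *s)
  {
      return s->remaining;
  }

  /* Emit one section carrying n bytes. */
  static void big_send(BigState *s, uint64_t n)
  {
      s->remaining -= n;
      s->sent += n;
      s->sections++;
  }

  /* Shaped after save_live_iterate: send at most 'budget' bytes. */
  static void big_save_iterate(BigState *s, uint64_t budget)
  {
      big_send(s, s->remaining < budget ? s->remaining : budget);
  }

  /* Shaped after save_live_complete_precopy: flush whatever is left. */
  static void big_save_complete(BigState *s)
  {
      big_send(s, s->remaining);
  }

  /* Core loop model: iterate while too much is pending, then complete. */
  static void big_model_migrate(BigState *s, uint64_t budget, uint64_t threshold)
  {
      assert(s != NULL && budget > 0);
      while (big_state_pending(s) > threshold) {
          big_save_iterate(s, budget);   /* guest would still be running */
      }
      big_save_complete(s);              /* guest would be paused here */
  }

With 1000 bytes to send, a budget of 300 per iteration and a threshold of
200, the model produces three iterative sections followed by one final
section.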
511*8cb2f8b1SPeter Xu 512*8cb2f8b1SPeter XuNote that the contents of the sections for iterative migration tend 513*8cb2f8b1SPeter Xuto be open-coded by the devices; care should be taken in parsing 514*8cb2f8b1SPeter Xuthe results and structuring the stream to make them easy to validate. 515*8cb2f8b1SPeter Xu 516*8cb2f8b1SPeter XuDevice ordering 517*8cb2f8b1SPeter Xu--------------- 518*8cb2f8b1SPeter Xu 519*8cb2f8b1SPeter XuThere are cases in which the ordering of device loading matters; for 520*8cb2f8b1SPeter Xuexample in some systems where a device may assert an interrupt during loading, 521*8cb2f8b1SPeter Xuif the interrupt controller is loaded later then it might lose the state. 522*8cb2f8b1SPeter Xu 523*8cb2f8b1SPeter XuSome ordering is implicitly provided by the order in which the machine 524*8cb2f8b1SPeter Xudefinition creates devices, however this is somewhat fragile. 525*8cb2f8b1SPeter Xu 526*8cb2f8b1SPeter XuThe ``MigrationPriority`` enum provides a means of explicitly enforcing 527*8cb2f8b1SPeter Xuordering. Numerically higher priorities are loaded earlier. 528*8cb2f8b1SPeter XuThe priority is set by setting the ``priority`` field of the top level 529*8cb2f8b1SPeter Xu``VMStateDescription`` for the device. 530*8cb2f8b1SPeter Xu 531*8cb2f8b1SPeter XuStream structure 532*8cb2f8b1SPeter Xu================ 533*8cb2f8b1SPeter Xu 534*8cb2f8b1SPeter XuThe stream tries to be word and endian agnostic, allowing migration between hosts 535*8cb2f8b1SPeter Xuof different characteristics running the same VM. 536*8cb2f8b1SPeter Xu 537*8cb2f8b1SPeter Xu - Header 538*8cb2f8b1SPeter Xu 539*8cb2f8b1SPeter Xu - Magic 540*8cb2f8b1SPeter Xu - Version 541*8cb2f8b1SPeter Xu - VM configuration section 542*8cb2f8b1SPeter Xu 543*8cb2f8b1SPeter Xu - Machine type 544*8cb2f8b1SPeter Xu - Target page bits 545*8cb2f8b1SPeter Xu - List of sections 546*8cb2f8b1SPeter Xu Each section contains a device, or one iteration of a device save. 

    - section type
    - section id
    - ID string (First section of each device)
    - instance id (First section of each device)
    - version id (First section of each device)
    - <device data>
    - Footer mark
  - EOF mark
  - VM Description structure
    Consisting of a JSON description of the contents for analysis only

The ``device data`` in each section consists of the data produced
by the code described above.  Non-iterative devices have a single
section; iterative devices have an initial and last section and a set
of parts in between.
Note that there is very little checking by the common code of the integrity
of the ``device data`` contents; that's up to the devices themselves.
The ``footer mark`` provides a little bit of protection for the case where
the receiving side reads more or less data than expected.

The ``ID string`` is normally unique, having been formed from a bus name
and device address; PCI devices and storage devices hung off PCI controllers
fit this pattern well.  Some devices are fixed single instances (e.g. "pc-ram").
Others (especially either older devices or system devices which for
some reason don't have a bus concept) make use of the ``instance id``
to distinguish otherwise identically named devices.
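
To make the role of the footer mark concrete, here is a small Python sketch
that packs and parses the first section of a device.  The field widths, the
marker values ``0x01`` and ``0x7E``, the assumption that the footer repeats
the section id, and the example ID string are assumptions made for this
illustration and should be checked against ``migration/savevm.c`` before
being relied on.

.. code-block:: python

  import struct

  # Assumed marker values, for illustration only.
  QEMU_VM_SECTION_START = 0x01
  QEMU_VM_SECTION_FOOTER = 0x7E


  def pack_first_section(section_id, idstr, instance_id, version_id, data):
      """Build header + opaque device data + footer, all big-endian."""
      name = idstr.encode("ascii")
      header = struct.pack(">BIB", QEMU_VM_SECTION_START, section_id, len(name))
      header += name + struct.pack(">II", instance_id, version_id)
      footer = struct.pack(">BI", QEMU_VM_SECTION_FOOTER, section_id)
      return header + data + footer


  def parse_first_section(buf, data_len):
      """Parse one section; the device alone decides how much data to read."""
      section_type, section_id, name_len = struct.unpack_from(">BIB", buf, 0)
      offset = struct.calcsize(">BIB")
      idstr = buf[offset:offset + name_len].decode("ascii")
      offset += name_len
      instance_id, version_id = struct.unpack_from(">II", buf, offset)
      offset += struct.calcsize(">II")
      data = buf[offset:offset + data_len]
      offset += data_len
      # The common code cannot validate the device data, but it can notice
      # that the footer is not where it should be.
      mark, footer_id = struct.unpack_from(">BI", buf, offset)
      if mark != QEMU_VM_SECTION_FOOTER or footer_id != section_id:
          raise ValueError("footer mismatch: wrong amount of device data read")
      return {
          "section_type": section_type,
          "section_id": section_id,
          "idstr": idstr,
          "instance_id": instance_id,
          "version_id": version_id,
          "data": data,
      }

Because the length of the device data is not carried in the header, a device
that consumes too few bytes leaves the parser looking at device data where it
expects the footer, which is the kind of error the footer mark is meant to
catch.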

Return path
-----------

Only a unidirectional stream is required for normal migration, however a
``return path`` can be created when bidirectional communication is desired.
This is primarily used by postcopy, but is also used to return a success
flag to the source at the end of migration.

``qemu_file_get_return_path(QEMUFile* fwdpath)`` gives the QEMUFile* for the return
path.

  Source side

     Forward path - written by migration thread
     Return path  - opened by main thread, read by return-path thread

  Destination side

     Forward path - read by main thread
     Return path  - opened by main thread, written by main thread AND postcopy
     thread (protected by rp_mutex)

Dirty limit
===========

The dirty limit, short for dirty page rate upper limit, is a new capability
introduced in the 8.1 QEMU release that uses a new algorithm based on the KVM
dirty ring to throttle down the guest during live migration.

The algorithm framework is as follows:

::

  ------------------------------------------------------------------------------
  main   --------------> throttle thread ------------> PREPARE(1) <--------
  thread  \                                                |              |
           \                                               |              |
            \                                              V              |
             -\                                        CALCULATE(2)       |
               \                                           |              |
                \                                          |              |
                 \                                         V              |
                  \                                    SET PENALTY(3) -----
                   -\                                      |
                     \                                     |
                      \                                    V
                       -> virtual CPU thread -------> ACCEPT PENALTY(4)
  ------------------------------------------------------------------------------

When the QMP command handler ``qmp_set_vcpu_dirty_limit`` is called for the
first time, the QEMU main thread starts the throttle thread. The throttle
thread, once launched, executes the loop, which consists of three steps:

  - PREPARE (1)

     The entire work of PREPARE (1) is preparation for the second stage,
     CALCULATE (2), as the name implies. It involves preparing the dirty
     page rate value and the corresponding upper limit of the VM.
     The dirty page rate is calculated via the KVM dirty ring mechanism,
     which tells QEMU how many dirty pages a virtual CPU has had since the
     last KVM_EXIT_DIRTY_RING_FULL exception. The dirty page rate upper
     limit is specified by the caller, so it is fetched directly.

  - CALCULATE (2)

     Calculate a suitable sleep period for each virtual CPU, which will be
     used to determine the penalty for the target virtual CPU. The
     computation must be done carefully in order to reduce the dirty page
     rate progressively down to the upper limit without oscillation. To
     achieve this, two strategies are provided: the first is to add or
     subtract sleep time based on the ratio of the current dirty page rate
     to the limit, which is used when the current dirty page rate is far
     from the limit; the second is to add or subtract a fixed time when
     the current dirty page rate is close to the limit.

  - SET PENALTY (3)

     Set the sleep time for each virtual CPU that should be penalized based
     on the results of the calculation supplied by step CALCULATE (2).

After completing the three above stages, the throttle thread loops back
to step PREPARE (1) until the dirty limit is reached.

On the other hand, each virtual CPU thread reads the sleep duration and
sleeps in the path of the KVM_EXIT_DIRTY_RING_FULL exception handler, that
is ACCEPT PENALTY (4). Virtual CPUs tied with writing processes will
obviously exit to the path and get penalized, whereas virtual CPUs involved
with read processes will not.
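
The two CALCULATE strategies can be sketched in Python as follows.  This is
not the code QEMU uses: ``next_sleep_us``, the ``far_threshold`` ratio, the
fixed step and the clamp are hypothetical choices made only to show the shape
of the computation (proportional steps when far from the limit, fixed steps
when close to it).

.. code-block:: python

  def next_sleep_us(sleep_us, rate, limit,
                    far_threshold=0.5, fixed_step_us=1000,
                    max_sleep_us=1_000_000):
      """Return the next per-vCPU sleep time in microseconds.

      rate and limit are dirty page rates in the same unit (e.g. MB/s).
      All tuning values are illustrative assumptions.
      """
      if limit <= 0:
          raise ValueError("limit must be positive")
      diff = rate - limit
      if abs(diff) / limit > far_threshold:
          # Far from the limit: scale the step with how far off we are.
          step = int(sleep_us * abs(diff) / max(rate, limit)) or fixed_step_us
      else:
          # Close to the limit: nudge by a fixed amount to avoid oscillation.
          step = fixed_step_us
      if diff > 0:
          sleep_us += step          # dirtying too fast, sleep longer
      elif diff < 0:
          sleep_us -= step          # under the limit, relax the penalty
      return min(max(sleep_us, 0), max_sleep_us)

A vCPU dirtying at three times its limit gets a large proportional increase,
while one that is within ten percent of the limit is only adjusted by the
fixed step.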

In summary, thanks to the KVM dirty ring technology, the dirty limit
algorithm will restrict virtual CPUs as needed to keep their dirty page
rate inside the limit. This leads to more steady reading performance during
live migration and can aid in improving large guest responsiveness.

Postcopy
========

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge). Its plus side is that there is an upper bound on
the amount of migration traffic and time it takes; the down side is that during
the postcopy phase, a failure of *either* side causes the guest to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
-----------------

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode.
Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on.  Issuing it after the end of a migration is harmless.

Blocktime is a postcopy live migration metric, intended to show how
long the vCPU was in a state of interruptible sleep due to a page fault.
That metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU. These values are calculated on the destination
side.  To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command.
The ``postcopy-blocktime`` value in its reply shows the overlapped blocking
time for all vCPUs; ``postcopy-vcpu-blocktime`` shows a list of blocking
times per vCPU.

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).
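
For management software that talks QMP rather than the human monitor, the
same sequence can be expressed with the QMP equivalents of the commands
above.  The Python sketch below only builds the JSON messages; the helper
names and the example URI are hypothetical, and sending the messages over a
QMP socket is left out.

.. code-block:: python

  import json


  def set_capability(name, enabled=True):
      """QMP equivalent of 'migrate_set_capability <name> on|off'."""
      return {
          "execute": "migrate-set-capabilities",
          "arguments": {
              "capabilities": [{"capability": name, "state": enabled}],
          },
      }


  def postcopy_plan(uri, want_blocktime=False):
      """Return (side, command) pairs in the order described in the text."""
      plan = [
          ("source", set_capability("postcopy-ram")),
          ("destination", set_capability("postcopy-ram")),
      ]
      if want_blocktime:
          plan.append(("destination", set_capability("postcopy-blocktime")))
      # Migration still starts in precopy mode ...
      plan.append(("source", {"execute": "migrate", "arguments": {"uri": uri}}))
      # ... and is switched to postcopy at any later point.
      plan.append(("source", {"execute": "migrate-start-postcopy"}))
      return plan


  if __name__ == "__main__":
      for side, cmd in postcopy_plan("tcp:dest-host:4444", want_blocktime=True):
          print(side, json.dumps(cmd))

Both capabilities have to be set before the ``migrate`` command is issued,
which is why they come first in the plan.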

Postcopy device transfer
------------------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source. The
migration stream therefore has to be able to respond with page data *during*
the device load, and hence the device data has to be read from the stream
completely before the device load begins, to free the stream up.  This is
achieved by 'packaging' the device data into a blob that's read in one go.

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'
   - The device state

     A series of sections, identical to the precopy stream's device state stream
     containing everything except postcopiable devices (i.e. RAM)
   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy),
however when a page request is received from the destination, the dirty page
scanning restarts from the requested location.  This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly in the hope that those pages are likely to be used
by the destination soon.

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2   3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

   All the data associated with the package - the ( ... ) section in the diagram -
   is read into memory, and the main thread recurses into qemu_loadvm_state_main
   to process the contents of the package (2) which contains commands (3,6) and
   devices (4...)

- On receipt of 'postcopy listen' - 3 - (i.e. the 1st command in the package)

   a new thread (a) is started that takes over servicing the migration stream,
   while the main thread carries on loading the package.   It loads normal
   background page data (b) but if during a device load a fault happens (5)
   the returned page (c) is loaded by the listen thread allowing the main
   thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

   letting the destination CPUs start running.  At the end of the
   ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
   is no longer used by migration, while the listen thread carries on servicing
   page data until the end of migration.

Postcopy Recovery
-----------------

Compared to precopy, postcopy is special in its error handling.  When an
error happens (in this case, mostly network errors), QEMU cannot easily
fail a migration because VM data resides in both source and destination
QEMU instances.  Instead, when an issue happens, QEMU on both sides
will go into a paused state.  A recovery phase is needed to continue a
paused postcopy migration.

The recovery phase normally contains a few steps:

  - When a network issue occurs, both QEMU instances go into the PAUSED state.

  - When the network is recovered (or a new network is provided), the admin
    can set up the new channel for migration using the QMP command
    'migrate-recover' on the destination node, preparing for a resume.

  - On the source host, the admin can continue the interrupted postcopy
    migration using the QMP command 'migrate' with the resume=true flag set.

  - After the connection is re-established, QEMU will continue the postcopy
    migration on both sides.

During a paused postcopy migration, the VM can logically still continue
running, and it will not be impacted by any access to pages that
were already migrated to the destination VM before the interruption happened.
However, if any of the missing pages is accessed on the destination VM, the VM
thread will be halted waiting for the page to be migrated, which means it can
be halted until the recovery is complete.

The impact of accessing missing pages can depend on the
configuration of the guest.  For example, with async page fault
enabled, logically the guest can proactively schedule out the threads
accessing missing pages.
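
The ordering of the two recovery commands matters, so a short Python sketch
may help.  It only constructs the QMP messages named above; the function name
and the example URI are hypothetical, and the transport used to deliver the
messages is not shown.

.. code-block:: python

  import json


  def recovery_commands(uri):
      """Return (node, QMP command) pairs in the order they should be sent."""
      # 1. The destination prepares the new channel.
      on_destination = {"execute": "migrate-recover", "arguments": {"uri": uri}}
      # 2. The source reconnects and resumes the paused postcopy migration.
      on_source = {
          "execute": "migrate",
          "arguments": {"uri": uri, "resume": True},
      }
      return [("destination", on_destination), ("source", on_source)]


  if __name__ == "__main__":
      for node, cmd in recovery_commands("tcp:dest-host:4445"):
          print(node, json.dumps(cmd))

The destination has to be ready on the new channel before the source issues
``migrate`` with ``resume`` set, hence the fixed order of the returned list.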

Postcopy states
---------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.
    (Triggered by reception of POSTCOPY_ADVISE command)

 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob.
    With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running.  The main
    thread now finishes processing the migration package and
    now carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - Paused

    Postcopy can run into a paused state (normally on both sides when it
    happens), where all threads are temporarily halted, mostly due to
    network errors.  When reaching the paused state, migration will make sure
    the QEMU binary on both sides maintains the data without corrupting
    the VM.  To continue the migration, the admin needs to fix the
    migration channel using the QMP command 'migrate-recover' on the
    destination node, then resume the migration using the QMP command 'migrate'
    again on the source node, with the resume=true flag set.

 - End

    The listen thread can now quit and perform the cleanup of migration
    state; the migration is now complete.
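
The state list above can be read as a small state machine.  The Python
sketch below encodes only the transitions that this section describes; the
``PostcopyState`` enum, the ``ALLOWED`` table and ``advance`` are hypothetical
names, and the real ``postcopy_state`` handling in QEMU may permit other
transitions that are not modelled here.

.. code-block:: python

  from enum import Enum


  class PostcopyState(Enum):
      ADVISE = "advise"
      DISCARD = "discard"
      LISTEN = "listen"
      RUNNING = "running"
      PAUSED = "paused"
      END = "end"


  # Simplified transition table derived from the prose, not from the code.
  ALLOWED = {
      PostcopyState.ADVISE: {PostcopyState.DISCARD},
      PostcopyState.DISCARD: {PostcopyState.LISTEN},
      PostcopyState.LISTEN: {PostcopyState.RUNNING},
      PostcopyState.RUNNING: {PostcopyState.PAUSED, PostcopyState.END},
      PostcopyState.PAUSED: {PostcopyState.RUNNING},   # after recovery
      PostcopyState.END: set(),
  }


  def advance(current, target):
      """Return target if the transition is allowed, else raise ValueError."""
      if target not in ALLOWED[current]:
          raise ValueError(f"{current.name} -> {target.name} is not allowed")
      return target

Walking ADVISE, DISCARD, LISTEN, RUNNING, END succeeds, as does a detour
through PAUSED and back to RUNNING, while skipping straight from ADVISE to
RUNNING is rejected by this simplified model.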

Source side page map
--------------------

The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty' - i.e. needs
sending.  During the precopy phase this is updated as the CPU dirties
pages, however during postcopy the CPUs are stopped and nothing should
dirty anything any more. Instead, dirty bits are cleared when the relevant
pages are sent during postcopy.

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages``  will fall back to allocating normal
     RAM if it doesn't have enough hugepages, triggering (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.
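
The figure in (d) follows from simple arithmetic, which the Python sketch
below reproduces.  It assumes an ideal link with no protocol overhead, and the
function name is made up for this example.

.. code-block:: python

  def page_transfer_seconds(page_bytes, link_bits_per_second):
      """Time to push one page over an ideal link, ignoring all overhead."""
      return page_bytes * 8 / link_bits_per_second


  TEN_GBPS = 10 * 10**9

  one_gib = page_transfer_seconds(2**30, TEN_GBPS)   # about 0.86 seconds
  two_mib = page_transfer_seconds(2**21, TEN_GBPS)   # about 1.7 milliseconds

A faulting vCPU therefore waits roughly 500 times longer for a 1GB hugepage
than for a 2MB one, before any network latency or overhead is added.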

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU. There are restrictions on the types
of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault.  The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.

.. note::
  There are two future improvements that would be nice:
    a) Some way to make QEMU ignorant of the addresses in the client's
       address space
    b) Avoiding the need for QEMU to perform ufd-wake calls after the
       pages have arrived

Retro-fitting postcopy to existing clients is possible:
  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy.  In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.

Postcopy Preemption Mode
------------------------

Postcopy preempt is a new capability introduced in the 8.0 QEMU release. It
allows urgent pages (those explicitly requested by the destination QEMU
after a page fault) to be sent in a separate preempt channel, rather than
queued in the background migration channel.  Anyone who cares about latencies
of page faults during a postcopy migration should enable this feature.  By
default, it's not enabled.

Firmware
========

Migration migrates the copies of RAM and ROM, and thus when running
on the destination it includes the firmware from the source.
Even after
resetting a VM, the old firmware is used.  Only once QEMU has been restarted
is the new firmware in use.

- Changes in firmware size can cause changes in the required RAMBlock size
  to hold the firmware and thus migration can fail.  In practice it's best
  to pad firmware images to convenient powers of 2 with plenty of space
  for growth.

- Care should be taken with device emulation code so that newer
  emulation code can work with older firmware to allow forward migration.

- Care should be taken with newer firmware so that backward migration
  to older systems with older device emulation code will work.

In some cases it may be best to tie specific firmware versions to specific
versioned machine types to cut down on the combinations that will need
support.  This is also useful when newer versions of firmware outgrow
the padding.


Backwards compatibility
=======================

How backwards compatibility works
---------------------------------

When we do migration, we have two QEMU processes: the source and the
target.  There are two cases, they are the same version or they are
different versions.  The easy case is when they are the same version.
The difficult one is when they are different versions.

There are two things that are different, but they have very similar
names and sometimes get confused:

- QEMU version
- machine type version

Let's start with a practical example, we start with:

- qemu-system-x86_64 (v5.2), from now on qemu-5.2.
- qemu-system-x86_64 (v5.1), from now on qemu-5.1.

Related to this are the "latest" machine types defined on each of
them:

- pc-q35-5.2 (newest one in qemu-5.2) from now on pc-5.2
- pc-q35-5.1 (newest one in qemu-5.1) from now on pc-5.1

First of all, migration is only supposed to work if you use the same
machine type in both source and destination. The QEMU hardware
configuration needs to be the same also on source and destination.
Most aspects of the backend configuration can be changed at will,
except for a few cases where the backend features influence frontend
device feature exposure.  But that is not relevant for this section.

I am going to list the combinations that we can have.  Let's
start with the trivial ones, where QEMU is the same on source and
destination:

1 - qemu-5.2 -M pc-5.2  -> migrates to -> qemu-5.2 -M pc-5.2

  This is the latest QEMU with the latest machine type.
  This has to work, and if it doesn't work it is a bug.

2 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1

  Exactly the same case as the previous one, but for 5.1.
  Nothing to see here either.

These are the easiest ones, we will not talk more about them in this
section.

Now we start with the more interesting cases.  Consider the case where
we have the same QEMU version on both sides (qemu-5.2) but we are not using
the latest machine type for that version (pc-5.2) but one of an older
QEMU version, in this case pc-5.1.

3 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

  It needs to use the definition of pc-5.1 and the devices as they
  were configured on 5.1, but this should be easy in the sense that
  both sides are the same QEMU and both sides have exactly the same
  idea of what the pc-5.1 machine is.

4 - qemu-5.1 -M pc-5.2  -> migrates to -> qemu-5.1 -M pc-5.2

  This combination is not possible as qemu-5.1 doesn't understand the
  pc-5.2 machine type.  So nothing to worry about here.

Now come the interesting ones, when both QEMU processes are
different.  Notice also that the machine type needs to be pc-5.1,
because we have the limitation that qemu-5.1 doesn't know pc-5.2.
So the possible cases are:

5 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1

   This migration is known as newer to older.  While developing 5.2 we
   need to take care not to break migration to qemu-5.1.  Notice that
   we can't update qemu-5.1 to understand whatever qemu-5.2 decides to
   change, so it is up to the qemu-5.2 side to make the relevant
   changes.

6 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

   This migration is known as older to newer.  We need to make sure
   that we are able to receive migrations from qemu-5.1.  The problem is
   similar to the previous one.

If qemu-5.1 and qemu-5.2 were the same, there would not be any
compatibility problems.  But the reason that we create qemu-5.2 is to
get new features, devices, defaults, etc.

If a device gets a new feature, or we change a default value,
we have a problem when we try to migrate between different QEMU
versions.

So we need a way to tell qemu-5.2 that when we are using machine type
pc-5.1, it needs to **not** use the feature, to be able to migrate to
a real qemu-5.1.

The equivalent applies when migrating from qemu-5.1 to qemu-5.2:
qemu-5.2 has to expect that it is not going to get data for the new
feature, because qemu-5.1 doesn't know about it.
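
As a rough sketch of the enumeration above, the combinations can be
listed mechanically.  This is a toy model in Python, not QEMU code; the
``KNOWN_MACHINES`` table and ``migration_cases()`` are names invented
for this illustration.

.. code-block:: python

    from itertools import product

    # Toy model, not QEMU code: machine types known to each QEMU binary.
    KNOWN_MACHINES = {
        "qemu-5.1": {"pc-5.1"},
        "qemu-5.2": {"pc-5.1", "pc-5.2"},
    }


    def migration_cases():
        """Return every (source, destination, machine) combination that
        can exist: the machine type is the same on both sides, and both
        binaries must know it."""
        cases = []
        for src, dst in product(sorted(KNOWN_MACHINES), repeat=2):
            shared = KNOWN_MACHINES[src] & KNOWN_MACHINES[dst]
            for machine in sorted(shared):
                cases.append((src, dst, machine))
        return cases


    for src, dst, machine in migration_cases():
        kind = "same QEMU" if src == dst else "cross-version"
        print(f"{src} -M {machine} -> {dst} -M {machine} ({kind})")

Under these assumptions the model yields five combinations, which are
cases 1, 2, 3, 5 and 6 above.  Case 4 never shows up because qemu-5.1
does not know pc-5.2.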

How do we tell QEMU about these device feature changes?  In the
hw_compat_X_Y arrays in hw/core/machine.c.

If we change a default value, we need to put the old value back in
that array.  During initialization, the device needs to look at
that array to see what value it needs to use for that feature.  And
what we put in that array is the value of a property.

To create a property for a device, we need to use one of the
DEFINE_PROP_*() macros.  See include/hw/qdev-properties.h to find the
macros that exist.  With it, we set the default value for that
property, and that is what it is going to get in the latest released
version.  But if we want a different value for a previous version, we
can change that in the hw_compat_X_Y arrays.

hw_compat_X_Y is an array of entries that have the format:

- name_device
- name_property
- value

Let's see a practical example.

In qemu-5.2 virtio-blk-device got multi queue support.  This is a
change that is not backward compatible.  In qemu-5.1 it has one
queue.  In qemu-5.2 it has the same number of queues as the number of
cpus in the system.
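
To make the mechanism a bit more concrete, here is a toy model of the
lookup in Python.  It is not QEMU code: ``DEVICE_DEFAULTS``,
``HW_COMPAT`` and ``resolve_property()`` are names invented for this
sketch, and the string ``"auto"`` is only a placeholder for "as many
queues as cpus".

.. code-block:: python

    # Toy model, not QEMU code.  Every name below is illustrative.

    # Default values as a device would declare them for the newest
    # release (the role played by the DEFINE_PROP_*() macros).
    DEVICE_DEFAULTS = {
        "virtio-blk-device": {"num-queues": "auto"},
    }

    # One compat table per machine type version.  Each entry has the
    # (name_device, name_property, value) format described above.
    HW_COMPAT = {
        "5.1": [("virtio-blk-device", "num-queues", "1")],
    }


    def resolve_property(machine_version, device, prop):
        """Return the value a device sees for prop under a machine type."""
        for name_device, name_property, value in HW_COMPAT.get(machine_version, []):
            if (name_device, name_property) == (device, prop):
                return value  # old machine type: the old value is put back
        return DEVICE_DEFAULTS[device][prop]  # latest machine type


    print(resolve_property("5.2", "virtio-blk-device", "num-queues"))
    print(resolve_property("5.1", "virtio-blk-device", "num-queues"))

With these assumptions the lookup gives ``auto`` for machine version 5.2
and ``1`` for machine version 5.1, whatever QEMU binary is running it.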

When we are doing migration, if we migrate from a device that has 4
queues to a device that has only one queue, we don't know where to
put the extra information for the other 3 queues, and migration
fails.

There is a similar problem when we migrate from qemu-5.1, which has
only one queue, to qemu-5.2: we only send information for one queue,
but the destination has 4, so 3 queues are not properly initialized
and anything can happen.

So, how can we address this problem?  Easy: just convince qemu-5.2
that when it is running pc-5.1, it needs to set the number of queues
for virtio-blk-devices to 1.

That way we fix cases 5 and 6.

5 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1

   qemu-5.2 -M pc-5.1 sets number of queues to be 1.
   qemu-5.1 -M pc-5.1 expects number of queues to be 1.

   correct.  migration works.

6 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

   qemu-5.1 -M pc-5.1 sets number of queues to be 1.
   qemu-5.2 -M pc-5.1 expects number of queues to be 1.

   correct.  migration works.

And now the other interesting case, case 3.  In this case we have:

3 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

   Here we have the same QEMU on both sides.
   So it doesn't matter a
   lot if we have set the number of queues to 1 or not, because
   they are the same.

   WRONG!

   Think what happens if we do one of these double migrations:

   A -> migrates -> B -> migrates -> C

   where:

   A: qemu-5.1 -M pc-5.1
   B: qemu-5.2 -M pc-5.1
   C: qemu-5.2 -M pc-5.1

   migration A -> B is case 6, so number of queues needs to be 1.

   migration B -> C is case 3, so we don't care.  But actually we
   care because we haven't started the guest in qemu-5.2, it came
   migrated from qemu-5.1.  So to be on the safe side, we need to
   always use number of queues 1 when we are using pc-5.1.

Now, how was this done in reality?
The following commit shows how it
was done::

    commit 9445e1e15e66c19e42bea942ba810db28052cd05
    Author: Stefan Hajnoczi <stefanha@redhat.com>
    Date:   Tue Aug 18 15:33:47 2020 +0100

        virtio-blk-pci: default num_queues to -smp N

The relevant parts for migration are::

    @@ -1281,7 +1284,8 @@ static Property virtio_blk_properties[] = {
     #endif
         DEFINE_PROP_BIT("request-merging", VirtIOBlock, conf.request_merging, 0,
                         true),
    -    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues, 1),
    +    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues,
    +                       VIRTIO_BLK_AUTO_NUM_QUEUES),
         DEFINE_PROP_UINT16("queue-size", VirtIOBlock, conf.queue_size, 256),

It changes the default value of num_queues.  But it fixes it for old
machine types to have the right value::

    @@ -31,6 +31,7 @@
     GlobalProperty hw_compat_5_1[] = {
         ...
    +    { "virtio-blk-device", "num-queues", "1"},
         ...
     };

A device with different features on both sides
----------------------------------------------

Let's assume that we are using the same QEMU binary on both sides,
just to make things easier.  But we have a device that has
different features on both sides of the migration.
That can be
because the devices are different, because the kernel drivers of the
two devices have different features, or for any other reason.

How can we get this to work with migration?  The way to do that is
"theoretically" easy.  You take the features that the device has on
the source of the migration and the features that the device has on
the target of the migration, you compute the intersection of the
features of both sides, and that is the set of features with which
you should launch QEMU.

Notice that this is not completely related to QEMU.  The most
important thing here is that this should be handled by the managing
application that launches QEMU.  If QEMU is configured correctly, the
migration will succeed.

That said, actually doing it is complicated.  Almost all devices are
bad at being launched with only some features enabled.
With one big exception: cpus.

You can read the documentation for QEMU x86 cpu models here:

https://qemu-project.gitlab.io/qemu/system/qemu-cpu-models.html

When they talk about migration, they recommend choosing the
newest cpu model that is supported by all cpus.
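
The intersection rule can be sketched as plain set arithmetic.  This is
only an illustration of what a management application might compute,
not something QEMU does for you: ``cpu_arguments()`` is a name invented
here, the feature names ``feat-1`` to ``feat-4`` are hypothetical, and
the ``name=off`` syntax is the one accepted by ``-cpu``.

.. code-block:: python

    def cpu_arguments(features_by_host):
        """Map each host to a -cpu argument that exposes only the
        features every host has."""
        common = set.intersection(*features_by_host.values())
        arguments = {}
        for host, features in features_by_host.items():
            # Anything this host has beyond the common set is turned off.
            extra = sorted(features - common)
            arguments[host] = ",".join(["host"] + [f"{name}=off" for name in extra])
        return arguments


    # Hypothetical feature sets, chosen only for illustration.
    hosts = {
        "A": {"feat-1", "feat-2", "feat-3"},
        "B": {"feat-1", "feat-3", "feat-4"},
    }
    for host, argument in cpu_arguments(hosts).items():
        print(f"Host {host}: -cpu {argument}")

In this sketch host A ends up with ``host,feat-2=off`` and host B with
``host,feat-4=off``, so both guests see only ``feat-1`` and ``feat-3``.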

Let's say that we have:

Host A:

Device X has the feature Y

Host B:

Device X does not have the feature Y

If we try to migrate without any care from host A to host B, it will
fail because when migration tries to load the feature Y on the
destination, it will find that the hardware is not there.

Doing this would be the equivalent of doing the following with cpus:

Host A::

    $ qemu-system-x86_64 -cpu host

Host B::

    $ qemu-system-x86_64 -cpu host

When the two hosts have different cpu features, this is prone to
fail, especially if host B has fewer features than host A.  If host A
has fewer features than host B, sometimes it works.  The important word
in the last sentence is "sometimes".

So, forgetting about cpu models and continuing with the -cpu host
example, let's say that the difference between the cpus is that hosts
A and B have the following features:

========  ========  =========  ==========
Features  ``pcid``  ``stibp``  ``taa-no``
========  ========  =========  ==========
Host A    X         X
Host B                         X
========  ========  =========  ==========

If we want to migrate between them, the way to configure both QEMU
cpus will be:

Host A::

    $ qemu-system-x86_64 -cpu host,pcid=off,stibp=off

Host B::

    $ qemu-system-x86_64 -cpu host,taa-no=off

And you would be able to migrate between them.  It is the responsibility
of the management application or of the user to make sure that the
configuration is correct.  QEMU doesn't know how to look at this kind
of features in general.

Notice that we don't recommend using -cpu host for migration.  It is
used in this example because it makes the example simpler.

Other devices have worse control over individual features.  If they
want to be able to migrate between hosts that show different features,
the device needs a way to configure which ones it is going to use.

In this section we have considered that we are using the same QEMU
binary on both sides of the migration.
If we use different QEMU
versions, then we need to take into account all the other
differences, and the examples become even more complicated.

How to mitigate when we have a backward compatibility error
-----------------------------------------------------------

We break migration for old machine types continuously during
development.  But as soon as we find that there is a problem, we fix
it.  The problem is what happens when we detect, after we have done a
release, that something has gone wrong.

Let's see how it worked with one example.

After the release of qemu-8.0 we found a problem when doing migration
of the machine type pc-7.2.

- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

  This migration works

- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2

  This migration works

- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

  This migration fails

- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2

  This migration fails

So clearly something fails when migrating between qemu-7.2 and
qemu-8.0 with machine type pc-7.2.  The error messages and git bisect
pointed to this commit.

In qemu-8.0 we got this commit::

    commit 010746ae1db7f52700cb2e2c46eb94f299cfa0d2
    Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Date:   Thu Mar 2 13:37:02 2023 +0000

    hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register


The relevant bits of the commit for our example are these::

    --- a/hw/pci/pcie_aer.c
    +++ b/hw/pci/pcie_aer.c
    @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,

         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
                      PCI_ERR_UNC_SUPPORTED);
    +    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
    +                 PCI_ERR_UNC_MASK_DEFAULT);
    +    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
    +                 PCI_ERR_UNC_SUPPORTED);

         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
                      PCI_ERR_UNC_SEVERITY_DEFAULT);

The patch changes how we configure PCI space for AER.  But migration
fails when the PCI space configuration is different between source and
destination.

The following commit shows how this got fixed::

    commit 5ed3dabe57dd9f4c007404345e5f5bf0e347317f
    Author: Leonardo Bras <leobras@redhat.com>
    Date:   Tue May 2 21:27:02 2023 -0300

    hw/pci: Disable PCI_ERR_UNCOR_MASK register for machine type < 8.0

    [...]

The relevant parts of the fix in QEMU are as follows:

First, we create a new property for the device to be able to configure
the old behaviour or the new behaviour::

    diff --git a/hw/pci/pci.c b/hw/pci/pci.c
    index 8a87ccc8b0..5153ad63d6 100644
    --- a/hw/pci/pci.c
    +++ b/hw/pci/pci.c
    @@ -79,6 +79,8 @@ static Property pci_props[] = {
         DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
                            failover_pair_id),
         DEFINE_PROP_UINT32("acpi-index",  PCIDevice, acpi_index, 0),
    +    DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
    +                    QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
         DEFINE_PROP_END_OF_LIST()
     };

Notice that we enable the feature for new machine types.

Now we see how the fix is done.
This is going to depend on what kind
of breakage happens, but in this case it is quite simple::

    diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
    index 103667c368..374d593ead 100644
    --- a/hw/pci/pcie_aer.c
    +++ b/hw/pci/pcie_aer.c
    @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver,
    uint16_t offset,

         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
                      PCI_ERR_UNC_SUPPORTED);
    -    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
    -                 PCI_ERR_UNC_MASK_DEFAULT);
    -    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
    -                 PCI_ERR_UNC_SUPPORTED);
    +
    +    if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
    +        pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
    +                     PCI_ERR_UNC_MASK_DEFAULT);
    +        pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
    +                     PCI_ERR_UNC_SUPPORTED);
    +    }

         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
                      PCI_ERR_UNC_SEVERITY_DEFAULT);

I.e. if the property bit is enabled, we configure it as we did for
qemu-8.0.  If the property bit is not set, we configure it as it was in 7.2.
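
The effect of that conditional can be pictured with a toy model in
Python.  It is not QEMU code: the register names are reused from the
diff above, but ``aer_config()``, ``can_migrate()`` and the placeholder
values are invented for this sketch, and "the layouts must match" is a
simplification of the real check.

.. code-block:: python

    # Toy model, not QEMU code.  Register names come from the diff
    # above; the values are placeholders.
    def aer_config(unc_mask_enabled):
        """Return the AER registers a device would set up."""
        config = {"PCI_ERR_UNCOR_STATUS": "supported"}
        if unc_mask_enabled:  # new behaviour, as in qemu-8.0
            config["PCI_ERR_UNCOR_MASK"] = "default"
        config["PCI_ERR_UNCOR_SEVER"] = "default"
        return config


    def can_migrate(source_enabled, destination_enabled):
        """Model migration as working only when both layouts match."""
        return aer_config(source_enabled) == aer_config(destination_enabled)


    print(can_migrate(True, True))
    print(can_migrate(True, False))

In this model the first call reports a match and the second a mismatch,
which is why the property has to end up with the same value on both
sides of a migration.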

And now, all that is missing is disabling the feature for old
machine types::

    diff --git a/hw/core/machine.c b/hw/core/machine.c
    index 47a34841a5..07f763eb2e 100644
    --- a/hw/core/machine.c
    +++ b/hw/core/machine.c
    @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
         { "e1000e", "migrate-timadj", "off" },
         { "virtio-mem", "x-early-migration", "false" },
         { "migration", "x-preempt-pre-7-2", "true" },
    +    { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
     };
     const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);

And now, when qemu-8.0.1 is released with this fix, all combinations
are going to work as expected.

- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)

So normality has been restored and everything is ok, no?

Not really, now our matrix is much bigger.  We started with the easy
cases, migration from the same version to the same version always
works:

- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2

Now the interesting ones.
These are the cases where the QEMU versions are
different.  For the first set, they fail and we can do nothing: both
versions are released and we can't change anything.

- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

These two are the ones that work.  The whole point of making the
change in the qemu-8.0.1 release was to fix this issue:

- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

But now we find that qemu-8.0 can migrate neither to qemu-7.2 nor to
qemu-8.0.1.

- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0 -M pc-7.2

So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
anything except to qemu-8.0.

Can we do better?

Yes.  If we know that we are going to do this migration:

- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2

We can launch the appropriate devices with::

    --device...,x-pcie-err-unc-mask=on

And now we can receive a migration from 8.0.  And from now on, we can
do that migration to newer QEMU versions if we remember to enable that
property for pc-7.2.  Notice that we need to remember, it is not
enough to know that the source of the migration is qemu-8.0.
Think of
this example::

    $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2

In the second migration, the source is not qemu-8.0, but we still have
that "problem" and have that property enabled.  Notice that we need to
keep this mark/property until the machine is rebooted.  But a normal
reboot is not enough (it doesn't reload QEMU): we need the machine to
poweroff/poweron on a fixed QEMU.  And from then on we can use the
proper real machine.
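
The whole matrix for pc-7.2 can be summarised with a toy model in
Python.  It is not QEMU code: ``UNC_MASK_DEFAULT``, ``HAS_PROPERTY``,
``unc_mask()`` and ``works()`` are names invented for this sketch, and
it reduces "migration works" to "both sides agree on the
``x-pcie-err-unc-mask`` setting", which is an assumption that ignores
every other source of incompatibility.

.. code-block:: python

    # Toy model, not QEMU code.  For a pc-7.2 guest, record whether each
    # release sets up the PCI_ERR_UNCOR_MASK register by default.
    UNC_MASK_DEFAULT = {
        "qemu-7.2": False,    # register not implemented
        "qemu-8.0": True,     # always set up, even for pc-7.2
        "qemu-8.0.1": False,  # turned off for pc-7.2 by hw_compat_7_2
        "qemu-8.2": False,    # assumed to behave like qemu-8.0.1
    }

    # Releases that have the x-pcie-err-unc-mask property.
    HAS_PROPERTY = {"qemu-8.0.1", "qemu-8.2"}


    def unc_mask(release, override=None):
        """Return the effective setting for a pc-7.2 guest on release."""
        if override is not None:
            if release not in HAS_PROPERTY:
                raise ValueError(f"{release} has no x-pcie-err-unc-mask property")
            return override
        return UNC_MASK_DEFAULT[release]


    def works(source, destination, source_override=None, destination_override=None):
        """Model a migration as working when both settings agree."""
        return unc_mask(source, source_override) == unc_mask(
            destination, destination_override
        )


    print(works("qemu-8.0", "qemu-8.0.1"))
    print(works("qemu-8.0", "qemu-8.0.1", destination_override=True))

In this model the plain qemu-8.0 to qemu-8.0.1 migration is reported as
failing, and it is reported as working once the destination is started
with the property turned on.  The second hop of the chain above,
qemu-8.0.1 to qemu-8.2, only works in the model when both sides carry
the override, which is the "we need to remember" part.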