=====================
VFIO device migration
=====================

Migration of a virtual machine involves saving the state of each device that
the guest is running on the source host and restoring this saved state on the
destination host. This document details how saving and restoring of VFIO
devices is done in QEMU.

Migration of VFIO devices consists of two phases: the optional pre-copy phase
and the stop-and-copy phase. The pre-copy phase is iterative and accommodates
VFIO devices that have a large amount of data to transfer. It allows the guest
to continue running whilst the VFIO device state is transferred to the
destination, which helps to reduce the total downtime of the VM. VFIO devices
opt in to pre-copy support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
VFIO_DEVICE_FEATURE_MIGRATION ioctl.

When pre-copy is supported, it's possible to further reduce downtime by
enabling the "switchover-ack" migration capability.
VFIO migration uAPI defines "initial bytes" as part of its pre-copy data stream
and recommends that the initial bytes are sent and loaded in the destination
before stopping the source VM. Enabling this migration capability guarantees
that, and thus can potentially reduce downtime even further.

To support migration of multiple devices that might do P2P transactions between
themselves, VFIO migration uAPI defines an intermediate P2P quiescent state.
While in the P2P quiescent state, P2P DMA transactions cannot be initiated by
the device, but the device can respond to incoming ones. Additionally, all
outstanding P2P transactions are guaranteed to have been completed by the time
the device enters this state.

All the devices that support P2P migration are first transitioned to the P2P
quiescent state and only then are they stopped or started. This makes migration
safe P2P-wise, since starting and stopping the devices is not done atomically
for all the devices together.

Thus, migration of multiple VFIO devices is allowed only if all the devices
support P2P migration. Migration of a single VFIO device is allowed regardless
of P2P migration support.

A detailed description of the uAPI for VFIO device migration can be found in
the comment for the ``vfio_device_mig_state`` structure in the header file
linux-headers/linux/vfio.h.

VFIO implements the device hooks for the iterative approach as follows:

* A ``save_setup`` function that sets up migration on the source.

* A ``load_setup`` function that sets the VFIO device on the destination in
  _RESUMING state.

* A ``state_pending_estimate`` function that reports an estimate of the
  remaining pre-copy data that the vendor driver has yet to save for the VFIO
  device.

* A ``state_pending_exact`` function that reads pending_bytes from the vendor
  driver, which indicates the amount of data that the vendor driver has yet to
  save for the VFIO device.

* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
  active only when the VFIO device is in pre-copy states.

* A ``save_live_iterate`` function that reads the VFIO device's data from the
  vendor driver during iterative pre-copy phase.

* A ``switchover_ack_needed`` function that checks if the VFIO device uses
  "switchover-ack" migration capability when this capability is enabled.

* A ``save_state`` function to save the device config space if it is present.

* A ``save_live_complete_precopy`` function that sets the VFIO device in
  _STOP_COPY state and iteratively copies the data for the VFIO device until
  the vendor driver indicates that no data remains.

* A ``load_state`` function that loads the config section and the data
  sections that are generated by the save functions above.

* ``cleanup`` functions for both save and load that perform any migration
  related cleanup.

The VFIO migration code uses a VM state change handler to change the VFIO
device state when the VM state changes from running to not-running, and
vice versa.
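
How such a handler picks the target device state can be sketched as below.
This is a minimal, self-contained model and not QEMU's actual code: the
``dev_state`` enum, the helper names, and the ``finishing_migration`` and
``p2p`` parameters are hypothetical and only mirror the state names used in
this document. It assumes that a device in a pre-copy state moves to
_STOP_COPY when the VM is stopped to complete a migration, and to _STOP on any
other stop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device states; the names mirror the ones in this document. */
enum dev_state {
    DEV_RUNNING,
    DEV_RUNNING_P2P,
    DEV_PRE_COPY,
    DEV_PRE_COPY_P2P,
    DEV_STOP_COPY,
    DEV_STOP,
    DEV_RESUMING,
};

static bool in_precopy(enum dev_state s)
{
    return s == DEV_PRE_COPY || s == DEV_PRE_COPY_P2P;
}

/* Step 1 (P2P-capable devices only): enter the P2P quiescent state. */
static enum dev_state p2p_quiesce_state(enum dev_state cur)
{
    return in_precopy(cur) ? DEV_PRE_COPY_P2P : DEV_RUNNING_P2P;
}

/* Step 2: pick the final state for the new VM run state. */
static enum dev_state final_state(enum dev_state cur, bool vm_running,
                                  bool finishing_migration)
{
    if (vm_running) {
        return DEV_RUNNING;
    }
    /*
     * Assumption: only a pre-copy device that is stopped to finish a
     * migration continues into _STOP_COPY; any other stop ends in _STOP.
     */
    if (in_precopy(cur) && finishing_migration) {
        return DEV_STOP_COPY;
    }
    return DEV_STOP;
}

/* Model of the handler: optional P2P quiesce step, then the final state. */
static enum dev_state vm_state_change(enum dev_state cur, bool vm_running,
                                      bool finishing_migration, bool p2p)
{
    enum dev_state next;

    if (p2p) {
        cur = p2p_quiesce_state(cur);
    }
    next = final_state(cur, vm_running, finishing_migration);
    /* The P2P quiescent states are only ever intermediate. */
    assert(next != DEV_RUNNING_P2P && next != DEV_PRE_COPY_P2P);
    return next;
}
```

For example, ``vm_state_change(DEV_PRE_COPY, false, true, true)`` passes
through ``DEV_PRE_COPY_P2P`` and returns ``DEV_STOP_COPY``, while
``vm_state_change(DEV_RESUMING, true, false, true)`` passes through
``DEV_RUNNING_P2P`` and returns ``DEV_RUNNING``.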

Similarly, a migration state change handler is used to trigger a transition of
the VFIO device state when certain changes of the migration state occur. For
example, the VFIO device state is transitioned back to _RUNNING in case a
migration failed or was canceled.

System memory dirty pages tracking
----------------------------------

The ``log_global_start`` and ``log_global_stop`` memory listener callbacks
inform the VFIO dirty tracking module to start and stop dirty page tracking,
respectively. A ``log_sync`` memory listener callback queries the dirty page
bitmap from the dirty tracking module and marks system memory pages which were
DMA-ed by the VFIO device as dirty. The dirty page bitmap is queried per
container.

Currently there are two ways dirty page tracking can be done:
(1) Device dirty tracking:
In this method the device is responsible for logging and reporting its DMAs.
This method can be used only if the device is capable of tracking its DMAs.
Discovering device capability, starting and stopping dirty tracking, and
syncing the dirty bitmaps from the device are done using the DMA logging uAPI.
More info about the uAPI can be found in the comments of the
``vfio_device_feature_dma_logging_control`` and
``vfio_device_feature_dma_logging_report`` structures in the header file
linux-headers/linux/vfio.h.

(2) VFIO IOMMU module:
In this method dirty tracking is done by the IOMMU.
However, there is currently no
IOMMU support for dirty page tracking. For this reason, all pages are
perpetually marked dirty, unless the device driver pins pages through external
APIs, in which case only those pinned pages are perpetually marked dirty.

If neither of the above two methods is supported, all pages are perpetually
marked dirty by QEMU.

By default, dirty pages are tracked during both the pre-copy and the
stop-and-copy phase, so a page marked as dirty will be copied to the
destination in both phases. Copying dirty pages in the pre-copy phase helps
QEMU to predict whether it can achieve its downtime tolerances. If QEMU keeps
finding dirty pages continuously during the pre-copy phase, it understands that
it is likely to find dirty pages in the stop-and-copy phase as well, and can
predict the downtime accordingly.

QEMU also provides a per-device opt-out option ``pre-copy-dirty-page-tracking``
which disables querying the dirty bitmap during the pre-copy phase. If it is
set to off, all dirty pages will be copied to the destination in the
stop-and-copy phase only.

System memory dirty pages tracking when vIOMMU is enabled
---------------------------------------------------------

With vIOMMU, an IO virtual address range can get unmapped while in the pre-copy
phase of migration. In that case, the unmap ioctl returns any dirty pages in
that range and QEMU reports the corresponding guest physical pages as dirty.
During stop-and-copy phase, an IOMMU notifier is used to get a callback for
mapped pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for
those mapped ranges. If device dirty tracking is enabled with vIOMMU, live
migration will be blocked.

Flow of state changes during Live migration
===========================================

Below is the state change flow during live migration for a VFIO device that
supports both precopy and P2P migration. The flow for devices that don't
support it is similar, except that the relevant states for precopy and P2P are
skipped.
The values in the parentheses represent the VM state, the migration state, and
the VFIO device state, respectively.

Live migration save path
------------------------

::

                           QEMU normal running state
                          (RUNNING, _NONE, _RUNNING)
                                       |
                     migrate_init spawns migration_thread
            Migration thread then calls each device's .save_setup()
                         (RUNNING, _SETUP, _PRE_COPY)
                                       |
                         (RUNNING, _ACTIVE, _PRE_COPY)
  If device is active, get pending_bytes by .state_pending_{estimate,exact}()
      If total pending_bytes >= threshold_size, call .save_live_iterate()
               Data of VFIO device for pre-copy phase is copied
     Iterate till total pending bytes converge and are less than threshold
                                       |
      On migration completion, the vCPUs and the VFIO device are stopped
              The VFIO device is first put in P2P quiescent state
                   (FINISH_MIGRATE, _ACTIVE, _PRE_COPY_P2P)
                                       |
                Then the VFIO device is put in _STOP_COPY state
                     (FINISH_MIGRATE, _ACTIVE, _STOP_COPY)
        .save_live_complete_precopy() is called for each active device
      For the VFIO device, iterate in .save_live_complete_precopy() until
                               pending data is 0
                                       |
                     (POSTMIGRATE, _COMPLETED, _STOP_COPY)
           Migration thread schedules cleanup bottom half and exits
                                       |
                           .save_cleanup() is called
                       (POSTMIGRATE, _COMPLETED, _STOP)

Live migration resume path
--------------------------

::

            Incoming migration calls .load_setup() for each device
                         (RESTORE_VM, _ACTIVE, _STOP)
                                       |
     For each device, .load_state() is called for that device section data
                       (RESTORE_VM, _ACTIVE, _RESUMING)
                                       |
  At the end, .load_cleanup() is called for each device and vCPUs are started
              The VFIO device is first put in P2P quiescent state
                       (RUNNING, _ACTIVE, _RUNNING_P2P)
                                       |
                          (RUNNING, _NONE, _RUNNING)

Postcopy
========

Postcopy migration is currently not supported for VFIO devices.