=====================
VFIO device migration
=====================

Migration of a virtual machine involves saving the state of each device that
the guest is running on the source host and restoring this saved state on the
destination host. This document details how saving and restoring of VFIO
devices is done in QEMU.

Migration of VFIO devices consists of two phases: the optional pre-copy phase
and the stop-and-copy phase. The pre-copy phase is iterative and accommodates
VFIO devices that have a large amount of data to transfer. It allows the guest
to continue running while the VFIO device state is transferred to the
destination, which helps to reduce the total downtime of the VM. VFIO devices
opt in to pre-copy support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
VFIO_DEVICE_FEATURE_MIGRATION ioctl.
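
As a minimal sketch of the opt-in check, the helpers below decode a migration
flags word such as the one reported by the VFIO_DEVICE_FEATURE_MIGRATION ioctl.
The ``SKETCH_*`` constants are local stand-ins assumed to mirror the flag bits
in linux-headers/linux/vfio.h, the helper names are hypothetical, and treating
stop-and-copy support as a prerequisite for pre-copy is an assumption of this
sketch. It is not QEMU's actual code.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Local stand-ins for the migration flag bits (assumed to match
     * linux/vfio.h); redefined here only to keep the sketch self-contained.
     */
    #define SKETCH_MIG_STOP_COPY (1u << 0)
    #define SKETCH_MIG_P2P       (1u << 1)
    #define SKETCH_MIG_PRE_COPY  (1u << 2)

    /* Hypothetical helper: can the device be migrated at all? */
    static bool sketch_supports_migration(uint64_t mig_flags)
    {
        return (mig_flags & SKETCH_MIG_STOP_COPY) != 0;
    }

    /* Hypothetical helper: did the device opt in to the pre-copy phase? */
    static bool sketch_supports_precopy(uint64_t mig_flags)
    {
        return sketch_supports_migration(mig_flags) &&
               (mig_flags & SKETCH_MIG_PRE_COPY) != 0;
    }

    /* Hypothetical helper: does the device support the P2P quiescent state? */
    static bool sketch_supports_p2p(uint64_t mig_flags)
    {
        return (mig_flags & SKETCH_MIG_P2P) != 0;
    }

A device that reports only the stop-and-copy bit would skip the pre-copy phase
entirely in this model.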

When pre-copy is supported, it's possible to further reduce downtime by
enabling the "switchover-ack" migration capability.
The VFIO migration uAPI defines "initial bytes" as part of its pre-copy data
stream and recommends that the initial bytes are sent and loaded in the
destination before stopping the source VM. Enabling this migration capability
guarantees that, and thus can potentially reduce downtime even further.
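
The effect of the capability can be pictured as an extra gate on the decision
to stop the source VM. The following is a simplified sketch under stated
assumptions: the structure, its fields, and the function name are hypothetical,
and the real decision logic in QEMU's migration core may differ.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical bookkeeping for the "switchover-ack" capability. */
    struct sketch_switchover {
        bool ack_enabled;       /* capability enabled by the user */
        unsigned acks_pending;  /* devices whose initial bytes are not loaded yet */
    };

    /* Decide whether the source VM may be stopped now. */
    static bool sketch_can_switchover(const struct sketch_switchover *s,
                                      uint64_t pending_bytes,
                                      uint64_t threshold)
    {
        if (pending_bytes >= threshold) {
            return false;       /* pre-copy has not converged yet */
        }
        if (!s->ack_enabled) {
            return true;        /* no acknowledgement required */
        }
        return s->acks_pending == 0;
    }

With the capability disabled, only the pending-bytes threshold matters. With it
enabled, the switchover additionally waits until every device has acknowledged
that its initial bytes were loaded on the destination.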

To support migration of multiple devices that might do P2P transactions between
themselves, the VFIO migration uAPI defines an intermediate P2P quiescent
state. While in the P2P quiescent state, P2P DMA transactions cannot be
initiated by the device, but the device can respond to incoming ones.
Additionally, all outstanding P2P transactions are guaranteed to have been
completed by the time the device enters this state.

All the devices that support P2P migration are first transitioned to the P2P
quiescent state and only then are they stopped or started. This makes migration
safe P2P-wise, since starting and stopping the devices is not done atomically
for all the devices together.

Thus, migration of multiple VFIO devices is allowed only if all the devices
support P2P migration. Migration of a single VFIO device is allowed regardless
of P2P migration support.
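
The rule above can be written as a small predicate. This is only an
illustrative sketch: ``struct sketch_dev`` and ``sketch_migration_allowed()``
are hypothetical names, not QEMU structures or functions.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical per-device summary; not a QEMU structure. */
    struct sketch_dev {
        bool supports_p2p;
    };

    /* Apply the multi-device rule to n assigned VFIO devices. */
    static bool sketch_migration_allowed(const struct sketch_dev *devs, size_t n)
    {
        if (n <= 1) {
            return true;    /* a single device needs no P2P quiescent state */
        }
        for (size_t i = 0; i < n; i++) {
            if (!devs[i].supports_p2p) {
                return false;   /* one non-P2P device blocks migration */
            }
        }
        return true;
    }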

A detailed description of the uAPI for VFIO device migration can be found in
the comment for the ``vfio_device_mig_state`` structure in the header file
linux-headers/linux/vfio.h.

VFIO implements the device hooks for the iterative approach as follows:

* A ``save_setup`` function that sets up migration on the source.

* A ``load_setup`` function that sets the VFIO device on the destination in
  the _RESUMING state.

* A ``state_pending_estimate`` function that reports an estimate of the
  remaining pre-copy data that the vendor driver has yet to save for the VFIO
  device.

* A ``state_pending_exact`` function that reads pending_bytes from the vendor
  driver, which indicates the amount of data that the vendor driver has yet to
  save for the VFIO device.

* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
  active only when the VFIO device is in pre-copy states.

* A ``save_live_iterate`` function that reads the VFIO device's data from the
  vendor driver during the iterative pre-copy phase.

* A ``switchover_ack_needed`` function that checks if the VFIO device uses the
  "switchover-ack" migration capability when this capability is enabled.

* A ``save_state`` function to save the device config space if it is present.

* A ``save_live_complete_precopy`` function that sets the VFIO device in the
  _STOP_COPY state and iteratively copies the data for the VFIO device until
  the vendor driver indicates that no data remains.

* A ``load_state`` function that loads the config section and the data
  sections that are generated by the save functions above.

* ``cleanup`` functions for both save and load that perform any
  migration-related cleanup.


The VFIO migration code uses a VM state change handler to change the VFIO
device state when the VM state changes from running to not-running, and
vice versa.

Similarly, a migration state change handler is used to trigger a transition of
the VFIO device state when certain changes of the migration state occur. For
example, the VFIO device state is transitioned back to _RUNNING in case a
migration failed or was canceled.
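
A simplified sketch of what such a VM state change handler has to decide is
shown below. The enum is an illustrative list of the device state names used
in this document (its numeric values are not claimed to match the uAPI), the
function name is hypothetical, and the real handler takes more context into
account, so this should be read as an approximation only.

.. code-block:: c

    #include <stdbool.h>

    /* Illustrative list of the device state names used in this document. */
    enum sketch_dev_state {
        SKETCH_STOP,
        SKETCH_RUNNING,
        SKETCH_STOP_COPY,
        SKETCH_RESUMING,
        SKETCH_RUNNING_P2P,
        SKETCH_PRE_COPY,
        SKETCH_PRE_COPY_P2P,
    };

    /* Pick the device state to move to after the VM starts or stops. */
    static enum sketch_dev_state
    sketch_state_for_vm_change(bool vm_running, enum sketch_dev_state cur)
    {
        if (vm_running) {
            return SKETCH_RUNNING;
        }
        /* A device stopped during pre-copy still has data to hand over. */
        if (cur == SKETCH_PRE_COPY || cur == SKETCH_PRE_COPY_P2P) {
            return SKETCH_STOP_COPY;
        }
        return SKETCH_STOP;
    }

For devices that support P2P migration, each of these transitions would pass
through the corresponding P2P quiescent state first, as described earlier.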

System memory dirty pages tracking
----------------------------------

The ``log_global_start`` and ``log_global_stop`` memory listener callbacks
inform the VFIO dirty tracking module to start and stop dirty page tracking. A
``log_sync`` memory listener callback queries the dirty page bitmap from the
dirty tracking module and marks system memory pages which were DMA-ed by the
VFIO device as dirty. The dirty page bitmap is queried per container.

Currently there are two ways dirty page tracking can be done:

(1) Device dirty tracking:
In this method the device is responsible for logging and reporting its DMAs.
This method can be used only if the device is capable of tracking its DMAs.
Discovering device capability, starting and stopping dirty tracking, and
syncing the dirty bitmaps from the device are done using the DMA logging uAPI.
More info about the uAPI can be found in the comments of the
``vfio_device_feature_dma_logging_control`` and
``vfio_device_feature_dma_logging_report`` structures in the header file
linux-headers/linux/vfio.h.

(2) VFIO IOMMU module:
In this method dirty tracking is done by the IOMMU. However, there is currently
no IOMMU support for dirty page tracking. For this reason, all pages are
perpetually marked dirty, unless the device driver pins pages through external
APIs, in which case only those pinned pages are perpetually marked dirty.

If neither of the above two methods is supported, all pages are perpetually
marked dirty by QEMU.
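
The choice between these methods and the fallback can be sketched as a simple
selection function. The names below are hypothetical, and the priority order
(device tracking first, then the VFIO IOMMU module, then marking everything
dirty) is an assumption made for illustration rather than a statement about
QEMU's exact logic.

.. code-block:: c

    #include <stdbool.h>

    /* Hypothetical labels for the options described above. */
    enum sketch_dirty_method {
        SKETCH_DIRTY_DEVICE,    /* (1) device logs and reports its own DMAs */
        SKETCH_DIRTY_IOMMU,     /* (2) VFIO IOMMU module reports dirty pages */
        SKETCH_DIRTY_ALL,       /* fallback: QEMU marks all pages dirty */
    };

    /* Select a dirty tracking method from the available capabilities. */
    static enum sketch_dirty_method
    sketch_pick_dirty_method(bool device_can_log_dma, bool iommu_can_report)
    {
        if (device_can_log_dma) {
            return SKETCH_DIRTY_DEVICE;
        }
        if (iommu_can_report) {
            return SKETCH_DIRTY_IOMMU;
        }
        return SKETCH_DIRTY_ALL;
    }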

By default, dirty pages are tracked during the pre-copy phase as well as the
stop-and-copy phase, so a page marked as dirty will be copied to the
destination in both phases. Copying dirty pages in the pre-copy phase helps
QEMU to predict whether it can achieve its downtime tolerances. If QEMU keeps
finding dirty pages continuously during the pre-copy phase, then it understands
that it is likely to find dirty pages in the stop-and-copy phase as well, and
can predict the downtime accordingly.

QEMU also provides a per-device opt-out option ``pre-copy-dirty-page-tracking``
which disables querying the dirty bitmap during the pre-copy phase. If it is
set to off, all dirty pages will be copied to the destination in the
stop-and-copy phase only.

System memory dirty pages tracking when vIOMMU is enabled
---------------------------------------------------------

With vIOMMU, an IO virtual address range can get unmapped while in the pre-copy
phase of migration. In that case, the unmap ioctl returns any dirty pages in
that range and QEMU reports the corresponding guest physical pages as dirty.
During the stop-and-copy phase, an IOMMU notifier is used to get a callback for
mapped pages, and then the dirty page bitmap is fetched from the VFIO IOMMU
modules for those mapped ranges. If device dirty tracking is enabled with
vIOMMU, live migration will be blocked.

Flow of state changes during Live migration
===========================================

Below is the state change flow during live migration for a VFIO device that
supports both pre-copy and P2P migration. The flow for devices that don't
support them is similar, except that the relevant states for pre-copy and P2P
are skipped.
The values in parentheses represent the VM state, the migration state, and
the VFIO device state, respectively.

Live migration save path
------------------------

::

                           QEMU normal running state
                           (RUNNING, _NONE, _RUNNING)
                                      |
                     migrate_init spawns migration_thread
            Migration thread then calls each device's .save_setup()
                          (RUNNING, _SETUP, _PRE_COPY)
                                      |
                         (RUNNING, _ACTIVE, _PRE_COPY)
  If device is active, get pending_bytes by .state_pending_{estimate,exact}()
       If total pending_bytes >= threshold_size, call .save_live_iterate()
                Data of VFIO device for pre-copy phase is copied
      Iterate till total pending bytes converge and are less than threshold
                                      |
       On migration completion, the vCPUs and the VFIO device are stopped
              The VFIO device is first put in P2P quiescent state
                    (FINISH_MIGRATE, _ACTIVE, _PRE_COPY_P2P)
                                      |
                Then the VFIO device is put in _STOP_COPY state
                     (FINISH_MIGRATE, _ACTIVE, _STOP_COPY)
         .save_live_complete_precopy() is called for each active device
      For the VFIO device, iterate in .save_live_complete_precopy() until
                               pending data is 0
                                      |
                     (POSTMIGRATE, _COMPLETED, _STOP_COPY)
            Migration thread schedules cleanup bottom half and exits
                                      |
                           .save_cleanup() is called
                        (POSTMIGRATE, _COMPLETED, _STOP)
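
The iterative part of the save path can be illustrated with a toy model. This
is a sketch under stated assumptions, not QEMU code: the structure, its fields,
and both functions are hypothetical, each pre-copy iteration is assumed to save
at most ``chunk`` bytes, and the running device is assumed to produce
``redirty`` new bytes per iteration.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical model of a device's migration data. */
    struct sketch_precopy {
        uint64_t pending;   /* bytes the device has yet to save */
        uint64_t chunk;     /* bytes saved by one iterate call */
        uint64_t redirty;   /* new bytes produced while the guest runs */
    };

    /*
     * Pre-copy: keep iterating while pending >= threshold.
     * Returns the number of iterations performed (capped at max_iters).
     */
    static unsigned sketch_precopy_iterate(struct sketch_precopy *d,
                                           uint64_t threshold,
                                           unsigned max_iters)
    {
        unsigned iters = 0;

        while (d->pending >= threshold && iters < max_iters) {
            uint64_t sent = d->pending < d->chunk ? d->pending : d->chunk;

            d->pending -= sent;
            d->pending += d->redirty;   /* device is still running */
            iters++;
        }
        return iters;
    }

    /*
     * Stop-and-copy: the device is stopped, so nothing is re-dirtied and
     * the loop runs until no data remains. Returns the bytes transferred.
     */
    static uint64_t sketch_stop_copy(struct sketch_precopy *d)
    {
        uint64_t total = 0;

        while (d->pending > 0) {
            uint64_t sent = d->pending < d->chunk ? d->pending : d->chunk;

            d->pending -= sent;
            total += sent;
        }
        return total;
    }

In this model, the amount left for ``sketch_stop_copy()`` is what contributes
to downtime. If ``redirty`` is not smaller than ``chunk``, the pre-copy loop
never converges and only ends because of the ``max_iters`` cap.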

Live migration resume path
--------------------------

::

             Incoming migration calls .load_setup() for each device
                          (RESTORE_VM, _ACTIVE, _STOP)
                                      |
     For each device, .load_state() is called for that device section data
                        (RESTORE_VM, _ACTIVE, _RESUMING)
                                      |
  At the end, .load_cleanup() is called for each device and vCPUs are started
              The VFIO device is first put in P2P quiescent state
                        (RUNNING, _ACTIVE, _RUNNING_P2P)
                                      |
                           (RUNNING, _NONE, _RUNNING)

Postcopy
========

Postcopy migration is currently not supported for VFIO devices.