104d0583aSPavel Dovgalyuk.. 204d0583aSPavel Dovgalyuk Copyright (c) 2022, ISP RAS 343185f7bSPavel Dovgalyuk Written by Pavel Dovgalyuk and Alex Bennée 404d0583aSPavel Dovgalyuk 504d0583aSPavel Dovgalyuk======================= 604d0583aSPavel DovgalyukExecution Record/Replay 704d0583aSPavel Dovgalyuk======================= 804d0583aSPavel Dovgalyuk 943185f7bSPavel DovgalyukCore concepts 1043185f7bSPavel Dovgalyuk============= 1104d0583aSPavel Dovgalyuk 1243185f7bSPavel DovgalyukRecord/replay functions are used for the deterministic replay of qemu 1343185f7bSPavel Dovgalyukexecution. Execution recording writes a non-deterministic events log, which 1443185f7bSPavel Dovgalyukcan be later used for replaying the execution anywhere and for unlimited 1543185f7bSPavel Dovgalyuknumber of times. Execution replaying reads the log and replays all 1643185f7bSPavel Dovgalyuknon-deterministic events including external input, hardware clocks, 1743185f7bSPavel Dovgalyukand interrupts. 1843185f7bSPavel Dovgalyuk 1943185f7bSPavel DovgalyukSeveral parts of QEMU include function calls to make event log recording 2043185f7bSPavel Dovgalyukand replaying. 2143185f7bSPavel DovgalyukDevices' models that have non-deterministic input from external devices were 2243185f7bSPavel Dovgalyukchanged to write every external event into the execution log immediately. 2343185f7bSPavel DovgalyukE.g. network packets are written into the log when they arrive into the virtual 2443185f7bSPavel Dovgalyuknetwork adapter. 2543185f7bSPavel Dovgalyuk 2643185f7bSPavel DovgalyukAll non-deterministic events are coming from these devices. But to 2743185f7bSPavel Dovgalyukreplay them we need to know at which moments they occur. We specify 2843185f7bSPavel Dovgalyukthese moments by counting the number of instructions executed between 2943185f7bSPavel Dovgalyukevery pair of consecutive events. 3043185f7bSPavel Dovgalyuk 3143185f7bSPavel DovgalyukAcademic papers with description of deterministic replay implementation: 3243185f7bSPavel Dovgalyuk 3343185f7bSPavel Dovgalyuk* `Deterministic Replay of System's Execution with Multi-target QEMU Simulator for Dynamic Analysis and Reverse Debugging <https://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html>`_ 3443185f7bSPavel Dovgalyuk* `Don't panic: reverse debugging of kernel drivers <https://dl.acm.org/citation.cfm?id=2786805.2803179>`_ 3543185f7bSPavel Dovgalyuk 3643185f7bSPavel DovgalyukModifications of qemu include: 3743185f7bSPavel Dovgalyuk 3843185f7bSPavel Dovgalyuk * wrappers for clock and time functions to save their return values in the log 3943185f7bSPavel Dovgalyuk * saving different asynchronous events (e.g. system shutdown) into the log 4043185f7bSPavel Dovgalyuk * synchronization of the bottom halves execution 4143185f7bSPavel Dovgalyuk * synchronization of the threads from thread pool 4243185f7bSPavel Dovgalyuk * recording/replaying user input (mouse, keyboard, and microphone) 4343185f7bSPavel Dovgalyuk * adding internal checkpoints for cpu and io synchronization 4443185f7bSPavel Dovgalyuk * network filter for recording and replaying the packets 4543185f7bSPavel Dovgalyuk * block driver for making block layer deterministic 4643185f7bSPavel Dovgalyuk * serial port input record and replay 4743185f7bSPavel Dovgalyuk * recording of random numbers obtained from the external sources 4843185f7bSPavel Dovgalyuk 4943185f7bSPavel DovgalyukInstruction counting 5043185f7bSPavel Dovgalyuk-------------------- 5143185f7bSPavel Dovgalyuk 5243185f7bSPavel DovgalyukQEMU should work in icount mode to use record/replay feature. icount was 5343185f7bSPavel Dovgalyukdesigned to allow deterministic execution in absence of external inputs 5443185f7bSPavel Dovgalyukof the virtual machine. We also use icount to control the occurrence of the 5543185f7bSPavel Dovgalyuknon-deterministic events. The number of instructions elapsed from the last event 5643185f7bSPavel Dovgalyukis written to the log while recording the execution. In replay mode we 5743185f7bSPavel Dovgalyukcan predict when to inject that event using the instruction counter. 5843185f7bSPavel Dovgalyuk 5943185f7bSPavel DovgalyukLocking and thread synchronisation 6043185f7bSPavel Dovgalyuk---------------------------------- 6143185f7bSPavel Dovgalyuk 6243185f7bSPavel DovgalyukPreviously the synchronisation of the main thread and the vCPU thread 6343185f7bSPavel Dovgalyukwas ensured by the holding of the BQL. However the trend has been to 6443185f7bSPavel Dovgalyukreduce the time the BQL was held across the system including under TCG 6543185f7bSPavel Dovgalyuksystem emulation. As it is important that batches of events are kept 6643185f7bSPavel Dovgalyukin sequence (e.g. expiring timers and checkpoints in the main thread 6743185f7bSPavel Dovgalyukwhile instruction checkpoints are written by the vCPU thread) we need 6843185f7bSPavel Dovgalyukanother lock to keep things in lock-step. This role is now handled by 6943185f7bSPavel Dovgalyukthe replay_mutex_lock. It used to be held only for each event being 7043185f7bSPavel Dovgalyukwritten but now it is held for a whole execution period. This results 7143185f7bSPavel Dovgalyukin a deterministic ping-pong between the two main threads. 7243185f7bSPavel Dovgalyuk 7343185f7bSPavel DovgalyukAs the BQL is now a finer grained lock than the replay_lock it is almost 7443185f7bSPavel Dovgalyukcertainly a bug, and a source of deadlocks, to take the 7543185f7bSPavel Dovgalyukreplay_mutex_lock while the BQL is held. This is enforced by an assert. 7643185f7bSPavel DovgalyukWhile the unlocks are usually in the reverse order, this is not 7743185f7bSPavel Dovgalyuknecessary; you can drop the replay_lock while holding the BQL, without 7843185f7bSPavel Dovgalyukdoing a more complicated unlock_iothread/replay_unlock/lock_iothread 7943185f7bSPavel Dovgalyuksequence. 8043185f7bSPavel Dovgalyuk 8143185f7bSPavel DovgalyukCheckpoints 8243185f7bSPavel Dovgalyuk----------- 8343185f7bSPavel Dovgalyuk 8443185f7bSPavel DovgalyukReplaying the execution of virtual machine is bound by sources of 8543185f7bSPavel Dovgalyuknon-determinism. These are inputs from clock and peripheral devices, 8643185f7bSPavel Dovgalyukand QEMU thread scheduling. Thread scheduling affect on processing events 8743185f7bSPavel Dovgalyukfrom timers, asynchronous input-output, and bottom halves. 8843185f7bSPavel Dovgalyuk 8943185f7bSPavel DovgalyukInvocations of timers are coupled with clock reads and changing the state 9043185f7bSPavel Dovgalyukof the virtual machine. Reads produce non-deterministic data taken from 9143185f7bSPavel Dovgalyukhost clock. And VM state changes should preserve their order. Their relative 9243185f7bSPavel Dovgalyukorder in replay mode must replicate the order of callbacks in record mode. 9343185f7bSPavel DovgalyukTo preserve this order we use checkpoints. When a specific clock is processed 9443185f7bSPavel Dovgalyukin record mode we save to the log special "checkpoint" event. 9543185f7bSPavel DovgalyukCheckpoints here do not refer to virtual machine snapshots. They are just 9643185f7bSPavel Dovgalyukrecord/replay events used for synchronization. 9743185f7bSPavel Dovgalyuk 9843185f7bSPavel DovgalyukQEMU in replay mode will try to invoke timers processing in random moment 9943185f7bSPavel Dovgalyukof time. That's why we do not process a group of timers until the checkpoint 10043185f7bSPavel Dovgalyukevent will be read from the log. Such an event allows synchronizing CPU 10143185f7bSPavel Dovgalyukexecution and timer events. 10243185f7bSPavel Dovgalyuk 10343185f7bSPavel DovgalyukTwo other checkpoints govern the "warping" of the virtual clock. 10443185f7bSPavel DovgalyukWhile the virtual machine is idle, the virtual clock increments at 10543185f7bSPavel Dovgalyuk1 ns per *real time* nanosecond. This is done by setting up a timer 10643185f7bSPavel Dovgalyuk(called the warp timer) on the virtual real time clock, so that the 10743185f7bSPavel Dovgalyuktimer fires at the next deadline of the virtual clock; the virtual clock 10843185f7bSPavel Dovgalyukis then incremented (which is called "warping" the virtual clock) as 10943185f7bSPavel Dovgalyuksoon as the timer fires or the CPUs need to go out of the idle state. 11043185f7bSPavel DovgalyukTwo functions are used for this purpose; because these actions change 11143185f7bSPavel Dovgalyukvirtual machine state and must be deterministic, each of them creates a 11243185f7bSPavel Dovgalyukcheckpoint. ``icount_start_warp_timer`` checks if the CPUs are idle and if so 11343185f7bSPavel Dovgalyukstarts accounting real time to virtual clock. ``icount_account_warp_timer`` 11443185f7bSPavel Dovgalyukis called when the CPUs get an interrupt or when the warp timer fires, 11543185f7bSPavel Dovgalyukand it warps the virtual clock by the amount of real time that has passed 11643185f7bSPavel Dovgalyuksince ``icount_start_warp_timer``. 11743185f7bSPavel Dovgalyuk 11843185f7bSPavel DovgalyukVirtual devices 11943185f7bSPavel Dovgalyuk=============== 12043185f7bSPavel Dovgalyuk 12143185f7bSPavel DovgalyukRecord/replay mechanism, that could be enabled through icount mode, expects 12243185f7bSPavel Dovgalyukthe virtual devices to satisfy the following requirement: 12343185f7bSPavel Dovgalyukeverything that affects 12404d0583aSPavel Dovgalyukthe guest state during execution in icount mode should be deterministic. 12504d0583aSPavel Dovgalyuk 12604d0583aSPavel DovgalyukTimers 12704d0583aSPavel Dovgalyuk------ 12804d0583aSPavel Dovgalyuk 12943185f7bSPavel DovgalyukTimers are used to execute callbacks from different subsystems of QEMU 13043185f7bSPavel Dovgalyukat the specified moments of time. There are several kinds of timers: 13143185f7bSPavel Dovgalyuk 13243185f7bSPavel Dovgalyuk * Real time clock. Based on host time and used only for callbacks that 13343185f7bSPavel Dovgalyuk do not change the virtual machine state. For this reason real time 13443185f7bSPavel Dovgalyuk clock and timers does not affect deterministic replay at all. 13543185f7bSPavel Dovgalyuk * Virtual clock. These timers run only during the emulation. In icount 13643185f7bSPavel Dovgalyuk mode virtual clock value is calculated using executed instructions counter. 13743185f7bSPavel Dovgalyuk That is why it is completely deterministic and does not have to be recorded. 13843185f7bSPavel Dovgalyuk * Host clock. This clock is used by device models that simulate real time 13943185f7bSPavel Dovgalyuk sources (e.g. real time clock chip). Host clock is the one of the sources 14043185f7bSPavel Dovgalyuk of non-determinism. Host clock read operations should be logged to 14143185f7bSPavel Dovgalyuk make the execution deterministic. 14243185f7bSPavel Dovgalyuk * Virtual real time clock. This clock is similar to real time clock but 14343185f7bSPavel Dovgalyuk it is used only for increasing virtual clock while virtual machine is 14443185f7bSPavel Dovgalyuk sleeping. Due to its nature it is also non-deterministic as the host clock 14543185f7bSPavel Dovgalyuk and has to be logged too. 14643185f7bSPavel Dovgalyuk 14704d0583aSPavel DovgalyukAll virtual devices should use virtual clock for timers that change the guest 14804d0583aSPavel Dovgalyukstate. Virtual clock is deterministic, therefore such timers are deterministic 14904d0583aSPavel Dovgalyuktoo. 15004d0583aSPavel Dovgalyuk 15104d0583aSPavel DovgalyukVirtual devices can also use realtime clock for the events that do not change 15204d0583aSPavel Dovgalyukthe guest state directly. When the clock ticking should depend on VM execution 15304d0583aSPavel Dovgalyukspeed, use virtual clock with EXTERNAL attribute. It is not deterministic, 15404d0583aSPavel Dovgalyukbut its speed depends on the guest execution. This clock is used by 15504d0583aSPavel Dovgalyukthe virtual devices (e.g., slirp routing device) that lie outside the 15604d0583aSPavel Dovgalyukreplayed guest. 15704d0583aSPavel Dovgalyuk 15843185f7bSPavel DovgalyukBlock devices 15943185f7bSPavel Dovgalyuk------------- 16043185f7bSPavel Dovgalyuk 16143185f7bSPavel DovgalyukBlock devices record/replay module (``blkreplay``) intercepts calls of 16243185f7bSPavel Dovgalyukbdrv coroutine functions at the top of block drivers stack. 16343185f7bSPavel Dovgalyuk 16443185f7bSPavel DovgalyukAll block completion operations are added to the queue in the coroutines. 16543185f7bSPavel DovgalyukWhen the queue is flushed the information about processed requests 16643185f7bSPavel Dovgalyukis recorded to the log. In replay phase the queue is matched with 16743185f7bSPavel Dovgalyukevents read from the log. Therefore block devices requests are processed 16843185f7bSPavel Dovgalyukdeterministically. 16943185f7bSPavel Dovgalyuk 17004d0583aSPavel DovgalyukBottom halves 17104d0583aSPavel Dovgalyuk------------- 17204d0583aSPavel Dovgalyuk 17304d0583aSPavel DovgalyukBottom half callbacks, that affect the guest state, should be invoked through 17443185f7bSPavel Dovgalyuk``replay_bh_schedule_event`` or ``replay_bh_schedule_oneshot_event`` functions. 17504d0583aSPavel DovgalyukTheir invocations are saved in record mode and synchronized with the existing 17604d0583aSPavel Dovgalyuklog in replay mode. 17704d0583aSPavel Dovgalyuk 17843185f7bSPavel DovgalyukDisk I/O events are completely deterministic in our model, because 17943185f7bSPavel Dovgalyukin both record and replay modes we start virtual machine from the same 18043185f7bSPavel Dovgalyukdisk state. But callbacks that virtual disk controller uses for reading and 18143185f7bSPavel Dovgalyukwriting the disk may occur at different moments of time in record and replay 18243185f7bSPavel Dovgalyukmodes. 18343185f7bSPavel Dovgalyuk 18443185f7bSPavel DovgalyukReading and writing requests are created by CPU thread of QEMU. Later these 18543185f7bSPavel Dovgalyukrequests proceed to block layer which creates "bottom halves". Bottom 18643185f7bSPavel Dovgalyukhalves consist of callback and its parameters. They are processed when 1870b2675c4SStefan Hajnoczimain loop locks the BQL. These locks are not synchronized with 18843185f7bSPavel Dovgalyukreplaying process because main loop also processes the events that do not 18943185f7bSPavel Dovgalyukaffect the virtual machine state (like user interaction with monitor). 19043185f7bSPavel Dovgalyuk 19143185f7bSPavel DovgalyukThat is why we had to implement saving and replaying bottom halves callbacks 19243185f7bSPavel Dovgalyuksynchronously to the CPU execution. When the callback is about to execute 19343185f7bSPavel Dovgalyukit is added to the queue in the replay module. This queue is written to the 19443185f7bSPavel Dovgalyuklog when its callbacks are executed. In replay mode callbacks are not processed 19543185f7bSPavel Dovgalyukuntil the corresponding event is read from the events log file. 19643185f7bSPavel Dovgalyuk 19743185f7bSPavel DovgalyukSometimes the block layer uses asynchronous callbacks for its internal purposes 19843185f7bSPavel Dovgalyuk(like reading or writing VM snapshots or disk image cluster tables). In this 19943185f7bSPavel Dovgalyukcase bottom halves are not marked as "replayable" and do not saved 20043185f7bSPavel Dovgalyukinto the log. 20143185f7bSPavel Dovgalyuk 20204d0583aSPavel DovgalyukSaving/restoring the VM state 20304d0583aSPavel Dovgalyuk----------------------------- 20404d0583aSPavel Dovgalyuk 205*e300f4c1SPeter MaydellRecord/replay relies on VM state save and restore being complete and 206*e300f4c1SPeter Maydelldeterministic. 207*e300f4c1SPeter Maydell 20804d0583aSPavel DovgalyukAll fields in the device state structure (including virtual timers) 20904d0583aSPavel Dovgalyukshould be restored by loadvm to the same values they had before savevm. 21004d0583aSPavel Dovgalyuk 21104d0583aSPavel DovgalyukAvoid accessing other devices' state, because the order of saving/restoring 21204d0583aSPavel Dovgalyukis not defined. It means that you should not call functions like 21343185f7bSPavel Dovgalyuk``update_irq`` in ``post_load`` callback. Save everything explicitly to avoid 21404d0583aSPavel Dovgalyukthe dependencies that may make restoring the VM state non-deterministic. 21504d0583aSPavel Dovgalyuk 21604d0583aSPavel DovgalyukStopping the VM 21704d0583aSPavel Dovgalyuk--------------- 21804d0583aSPavel Dovgalyuk 21904d0583aSPavel DovgalyukStopping the guest should not interfere with its state (with the exception 22004d0583aSPavel Dovgalyukof the network connections, that could be broken by the remote timeouts). 22104d0583aSPavel DovgalyukVM can be stopped at any moment of replay by the user. Restarting the VM 22204d0583aSPavel Dovgalyukafter that stop should not break the replay by the unneeded guest state change. 22343185f7bSPavel Dovgalyuk 22443185f7bSPavel DovgalyukReplay log format 22543185f7bSPavel Dovgalyuk================= 22643185f7bSPavel Dovgalyuk 22743185f7bSPavel DovgalyukRecord/replay log consists of the header and the sequence of execution 22843185f7bSPavel Dovgalyukevents. The header includes 4-byte replay version id and 8-byte reserved 22943185f7bSPavel Dovgalyukfield. Version is updated every time replay log format changes to prevent 23043185f7bSPavel Dovgalyukusing replay log created by another build of qemu. 23143185f7bSPavel Dovgalyuk 23243185f7bSPavel DovgalyukThe sequence of the events describes virtual machine state changes. 23343185f7bSPavel DovgalyukIt includes all non-deterministic inputs of VM, synchronization marks and 23443185f7bSPavel Dovgalyukinstruction counts used to correctly inject inputs at replay. 23543185f7bSPavel Dovgalyuk 23643185f7bSPavel DovgalyukSynchronization marks (checkpoints) are used for synchronizing qemu threads 23743185f7bSPavel Dovgalyukthat perform operations with virtual hardware. These operations may change 23843185f7bSPavel Dovgalyuksystem's state (e.g., change some register or generate interrupt) and 23943185f7bSPavel Dovgalyuktherefore should execute synchronously with CPU thread. 24043185f7bSPavel Dovgalyuk 24143185f7bSPavel DovgalyukEvery event in the log includes 1-byte event id and optional arguments. 24243185f7bSPavel DovgalyukWhen argument is an array, it is stored as 4-byte array length 24343185f7bSPavel Dovgalyukand corresponding number of bytes with data. 24443185f7bSPavel DovgalyukHere is the list of events that are written into the log: 24543185f7bSPavel Dovgalyuk 24643185f7bSPavel Dovgalyuk - EVENT_INSTRUCTION. Instructions executed since last event. Followed by: 24743185f7bSPavel Dovgalyuk 24843185f7bSPavel Dovgalyuk - 4-byte number of executed instructions. 24943185f7bSPavel Dovgalyuk 25043185f7bSPavel Dovgalyuk - EVENT_INTERRUPT. Used to synchronize interrupt processing. 25143185f7bSPavel Dovgalyuk - EVENT_EXCEPTION. Used to synchronize exception handling. 25243185f7bSPavel Dovgalyuk - EVENT_ASYNC. This is a group of events. When such an event is generated, 25343185f7bSPavel Dovgalyuk it is stored in the queue and processed in icount_account_warp_timer(). 25443185f7bSPavel Dovgalyuk Every such event has it's own id from the following list: 25543185f7bSPavel Dovgalyuk 25643185f7bSPavel Dovgalyuk - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes 25743185f7bSPavel Dovgalyuk callbacks that affect virtual machine state, but normally called 25843185f7bSPavel Dovgalyuk asynchronously. Followed by: 25943185f7bSPavel Dovgalyuk 26043185f7bSPavel Dovgalyuk - 8-byte operation id. 26143185f7bSPavel Dovgalyuk 26243185f7bSPavel Dovgalyuk - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains 26343185f7bSPavel Dovgalyuk parameters of keyboard and mouse input operations 26443185f7bSPavel Dovgalyuk (key press/release, mouse pointer movement). Followed by: 26543185f7bSPavel Dovgalyuk 26643185f7bSPavel Dovgalyuk - 9-16 bytes depending of input event. 26743185f7bSPavel Dovgalyuk 26843185f7bSPavel Dovgalyuk - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event. 26943185f7bSPavel Dovgalyuk - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input 27043185f7bSPavel Dovgalyuk initiated by the sender. Followed by: 27143185f7bSPavel Dovgalyuk 27243185f7bSPavel Dovgalyuk - 1-byte character device id. 27343185f7bSPavel Dovgalyuk - Array with bytes were read. 27443185f7bSPavel Dovgalyuk 27543185f7bSPavel Dovgalyuk - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize 27643185f7bSPavel Dovgalyuk operations with disk and flash drives with CPU. Followed by: 27743185f7bSPavel Dovgalyuk 27843185f7bSPavel Dovgalyuk - 8-byte operation id. 27943185f7bSPavel Dovgalyuk 28043185f7bSPavel Dovgalyuk - REPLAY_ASYNC_EVENT_NET. Incoming network packet. Followed by: 28143185f7bSPavel Dovgalyuk 28243185f7bSPavel Dovgalyuk - 1-byte network adapter id. 28343185f7bSPavel Dovgalyuk - 4-byte packet flags. 28443185f7bSPavel Dovgalyuk - Array with packet bytes. 28543185f7bSPavel Dovgalyuk 28643185f7bSPavel Dovgalyuk - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu, 28743185f7bSPavel Dovgalyuk e.g., by closing the window. 28843185f7bSPavel Dovgalyuk - EVENT_CHAR_WRITE. Used to synchronize character output operations. Followed by: 28943185f7bSPavel Dovgalyuk 29043185f7bSPavel Dovgalyuk - 4-byte output function return value. 29143185f7bSPavel Dovgalyuk - 4-byte offset in the output array. 29243185f7bSPavel Dovgalyuk 29343185f7bSPavel Dovgalyuk - EVENT_CHAR_READ_ALL. Used to synchronize character input operations, 29443185f7bSPavel Dovgalyuk initiated by qemu. Followed by: 29543185f7bSPavel Dovgalyuk 29643185f7bSPavel Dovgalyuk - Array with bytes that were read. 29743185f7bSPavel Dovgalyuk 29843185f7bSPavel Dovgalyuk - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation, 29943185f7bSPavel Dovgalyuk initiated by qemu. Followed by: 30043185f7bSPavel Dovgalyuk 30143185f7bSPavel Dovgalyuk - 4-byte error code. 30243185f7bSPavel Dovgalyuk 30343185f7bSPavel Dovgalyuk - EVENT_CLOCK + clock_id. Group of events for host clock read operations. Followed by: 30443185f7bSPavel Dovgalyuk 30543185f7bSPavel Dovgalyuk - 8-byte clock value. 30643185f7bSPavel Dovgalyuk 30743185f7bSPavel Dovgalyuk - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of 30843185f7bSPavel Dovgalyuk CPU, internal threads, and asynchronous input events. 30943185f7bSPavel Dovgalyuk - EVENT_END. Last event in the log. 310