xref: /openbmc/qemu/docs/devel/replay.rst (revision 8d3031fa)
1..
2   Copyright (c) 2022, ISP RAS
3   Written by Pavel Dovgalyuk and Alex Bennée
4
5=======================
6Execution Record/Replay
7=======================
8
9Core concepts
10=============
11
12Record/replay functions are used for the deterministic replay of qemu
13execution. Execution recording writes a non-deterministic events log, which
14can be later used for replaying the execution anywhere and for unlimited
15number of times. Execution replaying reads the log and replays all
16non-deterministic events including external input, hardware clocks,
17and interrupts.
18
19Several parts of QEMU include function calls to make event log recording
20and replaying.
21Devices' models that have non-deterministic input from external devices were
22changed to write every external event into the execution log immediately.
23E.g. network packets are written into the log when they arrive into the virtual
24network adapter.
25
26All non-deterministic events are coming from these devices. But to
27replay them we need to know at which moments they occur. We specify
28these moments by counting the number of instructions executed between
29every pair of consecutive events.
30
31Academic papers with description of deterministic replay implementation:
32
33* `Deterministic Replay of System's Execution with Multi-target QEMU Simulator for Dynamic Analysis and Reverse Debugging <https://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html>`_
34* `Don't panic: reverse debugging of kernel drivers <https://dl.acm.org/citation.cfm?id=2786805.2803179>`_
35
36Modifications of qemu include:
37
38 * wrappers for clock and time functions to save their return values in the log
39 * saving different asynchronous events (e.g. system shutdown) into the log
40 * synchronization of the bottom halves execution
41 * synchronization of the threads from thread pool
42 * recording/replaying user input (mouse, keyboard, and microphone)
43 * adding internal checkpoints for cpu and io synchronization
44 * network filter for recording and replaying the packets
45 * block driver for making block layer deterministic
46 * serial port input record and replay
47 * recording of random numbers obtained from the external sources
48
49Instruction counting
50--------------------
51
52QEMU should work in icount mode to use record/replay feature. icount was
53designed to allow deterministic execution in absence of external inputs
54of the virtual machine. We also use icount to control the occurrence of the
55non-deterministic events. The number of instructions elapsed from the last event
56is written to the log while recording the execution. In replay mode we
57can predict when to inject that event using the instruction counter.
58
59Locking and thread synchronisation
60----------------------------------
61
62Previously the synchronisation of the main thread and the vCPU thread
63was ensured by the holding of the BQL. However the trend has been to
64reduce the time the BQL was held across the system including under TCG
65system emulation. As it is important that batches of events are kept
66in sequence (e.g. expiring timers and checkpoints in the main thread
67while instruction checkpoints are written by the vCPU thread) we need
68another lock to keep things in lock-step. This role is now handled by
69the replay_mutex_lock. It used to be held only for each event being
70written but now it is held for a whole execution period. This results
71in a deterministic ping-pong between the two main threads.
72
73As the BQL is now a finer grained lock than the replay_lock it is almost
74certainly a bug, and a source of deadlocks, to take the
75replay_mutex_lock while the BQL is held. This is enforced by an assert.
76While the unlocks are usually in the reverse order, this is not
77necessary; you can drop the replay_lock while holding the BQL, without
78doing a more complicated unlock_iothread/replay_unlock/lock_iothread
79sequence.
80
81Checkpoints
82-----------
83
84Replaying the execution of virtual machine is bound by sources of
85non-determinism. These are inputs from clock and peripheral devices,
86and QEMU thread scheduling. Thread scheduling affect on processing events
87from timers, asynchronous input-output, and bottom halves.
88
89Invocations of timers are coupled with clock reads and changing the state
90of the virtual machine. Reads produce non-deterministic data taken from
91host clock. And VM state changes should preserve their order. Their relative
92order in replay mode must replicate the order of callbacks in record mode.
93To preserve this order we use checkpoints. When a specific clock is processed
94in record mode we save to the log special "checkpoint" event.
95Checkpoints here do not refer to virtual machine snapshots. They are just
96record/replay events used for synchronization.
97
98QEMU in replay mode will try to invoke timers processing in random moment
99of time. That's why we do not process a group of timers until the checkpoint
100event will be read from the log. Such an event allows synchronizing CPU
101execution and timer events.
102
103Two other checkpoints govern the "warping" of the virtual clock.
104While the virtual machine is idle, the virtual clock increments at
1051 ns per *real time* nanosecond.  This is done by setting up a timer
106(called the warp timer) on the virtual real time clock, so that the
107timer fires at the next deadline of the virtual clock; the virtual clock
108is then incremented (which is called "warping" the virtual clock) as
109soon as the timer fires or the CPUs need to go out of the idle state.
110Two functions are used for this purpose; because these actions change
111virtual machine state and must be deterministic, each of them creates a
112checkpoint. ``icount_start_warp_timer`` checks if the CPUs are idle and if so
113starts accounting real time to virtual clock. ``icount_account_warp_timer``
114is called when the CPUs get an interrupt or when the warp timer fires,
115and it warps the virtual clock by the amount of real time that has passed
116since ``icount_start_warp_timer``.
117
118Virtual devices
119===============
120
121Record/replay mechanism, that could be enabled through icount mode, expects
122the virtual devices to satisfy the following requirement:
123everything that affects
124the guest state during execution in icount mode should be deterministic.
125
126Timers
127------
128
129Timers are used to execute callbacks from different subsystems of QEMU
130at the specified moments of time. There are several kinds of timers:
131
132 * Real time clock. Based on host time and used only for callbacks that
133   do not change the virtual machine state. For this reason real time
134   clock and timers does not affect deterministic replay at all.
135 * Virtual clock. These timers run only during the emulation. In icount
136   mode virtual clock value is calculated using executed instructions counter.
137   That is why it is completely deterministic and does not have to be recorded.
138 * Host clock. This clock is used by device models that simulate real time
139   sources (e.g. real time clock chip). Host clock is the one of the sources
140   of non-determinism. Host clock read operations should be logged to
141   make the execution deterministic.
142 * Virtual real time clock. This clock is similar to real time clock but
143   it is used only for increasing virtual clock while virtual machine is
144   sleeping. Due to its nature it is also non-deterministic as the host clock
145   and has to be logged too.
146
147All virtual devices should use virtual clock for timers that change the guest
148state. Virtual clock is deterministic, therefore such timers are deterministic
149too.
150
151Virtual devices can also use realtime clock for the events that do not change
152the guest state directly. When the clock ticking should depend on VM execution
153speed, use virtual clock with EXTERNAL attribute. It is not deterministic,
154but its speed depends on the guest execution. This clock is used by
155the virtual devices (e.g., slirp routing device) that lie outside the
156replayed guest.
157
158Block devices
159-------------
160
161Block devices record/replay module (``blkreplay``) intercepts calls of
162bdrv coroutine functions at the top of block drivers stack.
163
164All block completion operations are added to the queue in the coroutines.
165When the queue is flushed the information about processed requests
166is recorded to the log. In replay phase the queue is matched with
167events read from the log. Therefore block devices requests are processed
168deterministically.
169
170Bottom halves
171-------------
172
173Bottom half callbacks, that affect the guest state, should be invoked through
174``replay_bh_schedule_event`` or ``replay_bh_schedule_oneshot_event`` functions.
175Their invocations are saved in record mode and synchronized with the existing
176log in replay mode.
177
178Disk I/O events are completely deterministic in our model, because
179in both record and replay modes we start virtual machine from the same
180disk state. But callbacks that virtual disk controller uses for reading and
181writing the disk may occur at different moments of time in record and replay
182modes.
183
184Reading and writing requests are created by CPU thread of QEMU. Later these
185requests proceed to block layer which creates "bottom halves". Bottom
186halves consist of callback and its parameters. They are processed when
187main loop locks the BQL. These locks are not synchronized with
188replaying process because main loop also processes the events that do not
189affect the virtual machine state (like user interaction with monitor).
190
191That is why we had to implement saving and replaying bottom halves callbacks
192synchronously to the CPU execution. When the callback is about to execute
193it is added to the queue in the replay module. This queue is written to the
194log when its callbacks are executed. In replay mode callbacks are not processed
195until the corresponding event is read from the events log file.
196
197Sometimes the block layer uses asynchronous callbacks for its internal purposes
198(like reading or writing VM snapshots or disk image cluster tables). In this
199case bottom halves are not marked as "replayable" and do not saved
200into the log.
201
202Saving/restoring the VM state
203-----------------------------
204
205Record/replay relies on VM state save and restore being complete and
206deterministic.
207
208All fields in the device state structure (including virtual timers)
209should be restored by loadvm to the same values they had before savevm.
210
211Avoid accessing other devices' state, because the order of saving/restoring
212is not defined. It means that you should not call functions like
213``update_irq`` in ``post_load`` callback. Save everything explicitly to avoid
214the dependencies that may make restoring the VM state non-deterministic.
215
216Stopping the VM
217---------------
218
219Stopping the guest should not interfere with its state (with the exception
220of the network connections, that could be broken by the remote timeouts).
221VM can be stopped at any moment of replay by the user. Restarting the VM
222after that stop should not break the replay by the unneeded guest state change.
223
224Replay log format
225=================
226
227Record/replay log consists of the header and the sequence of execution
228events. The header includes 4-byte replay version id and 8-byte reserved
229field. Version is updated every time replay log format changes to prevent
230using replay log created by another build of qemu.
231
232The sequence of the events describes virtual machine state changes.
233It includes all non-deterministic inputs of VM, synchronization marks and
234instruction counts used to correctly inject inputs at replay.
235
236Synchronization marks (checkpoints) are used for synchronizing qemu threads
237that perform operations with virtual hardware. These operations may change
238system's state (e.g., change some register or generate interrupt) and
239therefore should execute synchronously with CPU thread.
240
241Every event in the log includes 1-byte event id and optional arguments.
242When argument is an array, it is stored as 4-byte array length
243and corresponding number of bytes with data.
244Here is the list of events that are written into the log:
245
246 - EVENT_INSTRUCTION. Instructions executed since last event. Followed by:
247
248   - 4-byte number of executed instructions.
249
250 - EVENT_INTERRUPT. Used to synchronize interrupt processing.
251 - EVENT_EXCEPTION. Used to synchronize exception handling.
252 - EVENT_ASYNC. This is a group of events. When such an event is generated,
253   it is stored in the queue and processed in icount_account_warp_timer().
254   Every such event has it's own id from the following list:
255
256     - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes
257       callbacks that affect virtual machine state, but normally called
258       asynchronously. Followed by:
259
260        - 8-byte operation id.
261
262     - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains
263       parameters of keyboard and mouse input operations
264       (key press/release, mouse pointer movement). Followed by:
265
266        - 9-16 bytes depending of input event.
267
268     - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event.
269     - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input
270       initiated by the sender. Followed by:
271
272        - 1-byte character device id.
273        - Array with bytes were read.
274
275     - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize
276       operations with disk and flash drives with CPU. Followed by:
277
278        - 8-byte operation id.
279
280     - REPLAY_ASYNC_EVENT_NET. Incoming network packet. Followed by:
281
282        - 1-byte network adapter id.
283        - 4-byte packet flags.
284        - Array with packet bytes.
285
286 - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu,
287   e.g., by closing the window.
288 - EVENT_CHAR_WRITE. Used to synchronize character output operations. Followed by:
289
290    - 4-byte output function return value.
291    - 4-byte offset in the output array.
292
293 - EVENT_CHAR_READ_ALL. Used to synchronize character input operations,
294   initiated by qemu. Followed by:
295
296    - Array with bytes that were read.
297
298 - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation,
299   initiated by qemu. Followed by:
300
301    - 4-byte error code.
302
303 - EVENT_CLOCK + clock_id. Group of events for host clock read operations. Followed by:
304
305    - 8-byte clock value.
306
307 - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of
308   CPU, internal threads, and asynchronous input events.
309 - EVENT_END. Last event in the log.
310