replay.rst (14117B)
1 .. 2 Copyright (c) 2022, ISP RAS 3 Written by Pavel Dovgalyuk and Alex Bennée 4 5 ======================= 6 Execution Record/Replay 7 ======================= 8 9 Core concepts 10 ============= 11 12 Record/replay functions are used for the deterministic replay of qemu 13 execution. Execution recording writes a non-deterministic events log, which 14 can be later used for replaying the execution anywhere and for unlimited 15 number of times. Execution replaying reads the log and replays all 16 non-deterministic events including external input, hardware clocks, 17 and interrupts. 18 19 Several parts of QEMU include function calls to make event log recording 20 and replaying. 21 Devices' models that have non-deterministic input from external devices were 22 changed to write every external event into the execution log immediately. 23 E.g. network packets are written into the log when they arrive into the virtual 24 network adapter. 25 26 All non-deterministic events are coming from these devices. But to 27 replay them we need to know at which moments they occur. We specify 28 these moments by counting the number of instructions executed between 29 every pair of consecutive events. 30 31 Academic papers with description of deterministic replay implementation: 32 33 * `Deterministic Replay of System's Execution with Multi-target QEMU Simulator for Dynamic Analysis and Reverse Debugging <https://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html>`_ 34 * `Don't panic: reverse debugging of kernel drivers <https://dl.acm.org/citation.cfm?id=2786805.2803179>`_ 35 36 Modifications of qemu include: 37 38 * wrappers for clock and time functions to save their return values in the log 39 * saving different asynchronous events (e.g. system shutdown) into the log 40 * synchronization of the bottom halves execution 41 * synchronization of the threads from thread pool 42 * recording/replaying user input (mouse, keyboard, and microphone) 43 * adding internal checkpoints for cpu and io synchronization 44 * network filter for recording and replaying the packets 45 * block driver for making block layer deterministic 46 * serial port input record and replay 47 * recording of random numbers obtained from the external sources 48 49 Instruction counting 50 -------------------- 51 52 QEMU should work in icount mode to use record/replay feature. icount was 53 designed to allow deterministic execution in absence of external inputs 54 of the virtual machine. We also use icount to control the occurrence of the 55 non-deterministic events. The number of instructions elapsed from the last event 56 is written to the log while recording the execution. In replay mode we 57 can predict when to inject that event using the instruction counter. 58 59 Locking and thread synchronisation 60 ---------------------------------- 61 62 Previously the synchronisation of the main thread and the vCPU thread 63 was ensured by the holding of the BQL. However the trend has been to 64 reduce the time the BQL was held across the system including under TCG 65 system emulation. As it is important that batches of events are kept 66 in sequence (e.g. expiring timers and checkpoints in the main thread 67 while instruction checkpoints are written by the vCPU thread) we need 68 another lock to keep things in lock-step. This role is now handled by 69 the replay_mutex_lock. It used to be held only for each event being 70 written but now it is held for a whole execution period. This results 71 in a deterministic ping-pong between the two main threads. 72 73 As the BQL is now a finer grained lock than the replay_lock it is almost 74 certainly a bug, and a source of deadlocks, to take the 75 replay_mutex_lock while the BQL is held. This is enforced by an assert. 76 While the unlocks are usually in the reverse order, this is not 77 necessary; you can drop the replay_lock while holding the BQL, without 78 doing a more complicated unlock_iothread/replay_unlock/lock_iothread 79 sequence. 80 81 Checkpoints 82 ----------- 83 84 Replaying the execution of virtual machine is bound by sources of 85 non-determinism. These are inputs from clock and peripheral devices, 86 and QEMU thread scheduling. Thread scheduling affect on processing events 87 from timers, asynchronous input-output, and bottom halves. 88 89 Invocations of timers are coupled with clock reads and changing the state 90 of the virtual machine. Reads produce non-deterministic data taken from 91 host clock. And VM state changes should preserve their order. Their relative 92 order in replay mode must replicate the order of callbacks in record mode. 93 To preserve this order we use checkpoints. When a specific clock is processed 94 in record mode we save to the log special "checkpoint" event. 95 Checkpoints here do not refer to virtual machine snapshots. They are just 96 record/replay events used for synchronization. 97 98 QEMU in replay mode will try to invoke timers processing in random moment 99 of time. That's why we do not process a group of timers until the checkpoint 100 event will be read from the log. Such an event allows synchronizing CPU 101 execution and timer events. 102 103 Two other checkpoints govern the "warping" of the virtual clock. 104 While the virtual machine is idle, the virtual clock increments at 105 1 ns per *real time* nanosecond. This is done by setting up a timer 106 (called the warp timer) on the virtual real time clock, so that the 107 timer fires at the next deadline of the virtual clock; the virtual clock 108 is then incremented (which is called "warping" the virtual clock) as 109 soon as the timer fires or the CPUs need to go out of the idle state. 110 Two functions are used for this purpose; because these actions change 111 virtual machine state and must be deterministic, each of them creates a 112 checkpoint. ``icount_start_warp_timer`` checks if the CPUs are idle and if so 113 starts accounting real time to virtual clock. ``icount_account_warp_timer`` 114 is called when the CPUs get an interrupt or when the warp timer fires, 115 and it warps the virtual clock by the amount of real time that has passed 116 since ``icount_start_warp_timer``. 117 118 Virtual devices 119 =============== 120 121 Record/replay mechanism, that could be enabled through icount mode, expects 122 the virtual devices to satisfy the following requirement: 123 everything that affects 124 the guest state during execution in icount mode should be deterministic. 125 126 Timers 127 ------ 128 129 Timers are used to execute callbacks from different subsystems of QEMU 130 at the specified moments of time. There are several kinds of timers: 131 132 * Real time clock. Based on host time and used only for callbacks that 133 do not change the virtual machine state. For this reason real time 134 clock and timers does not affect deterministic replay at all. 135 * Virtual clock. These timers run only during the emulation. In icount 136 mode virtual clock value is calculated using executed instructions counter. 137 That is why it is completely deterministic and does not have to be recorded. 138 * Host clock. This clock is used by device models that simulate real time 139 sources (e.g. real time clock chip). Host clock is the one of the sources 140 of non-determinism. Host clock read operations should be logged to 141 make the execution deterministic. 142 * Virtual real time clock. This clock is similar to real time clock but 143 it is used only for increasing virtual clock while virtual machine is 144 sleeping. Due to its nature it is also non-deterministic as the host clock 145 and has to be logged too. 146 147 All virtual devices should use virtual clock for timers that change the guest 148 state. Virtual clock is deterministic, therefore such timers are deterministic 149 too. 150 151 Virtual devices can also use realtime clock for the events that do not change 152 the guest state directly. When the clock ticking should depend on VM execution 153 speed, use virtual clock with EXTERNAL attribute. It is not deterministic, 154 but its speed depends on the guest execution. This clock is used by 155 the virtual devices (e.g., slirp routing device) that lie outside the 156 replayed guest. 157 158 Block devices 159 ------------- 160 161 Block devices record/replay module (``blkreplay``) intercepts calls of 162 bdrv coroutine functions at the top of block drivers stack. 163 164 All block completion operations are added to the queue in the coroutines. 165 When the queue is flushed the information about processed requests 166 is recorded to the log. In replay phase the queue is matched with 167 events read from the log. Therefore block devices requests are processed 168 deterministically. 169 170 Bottom halves 171 ------------- 172 173 Bottom half callbacks, that affect the guest state, should be invoked through 174 ``replay_bh_schedule_event`` or ``replay_bh_schedule_oneshot_event`` functions. 175 Their invocations are saved in record mode and synchronized with the existing 176 log in replay mode. 177 178 Disk I/O events are completely deterministic in our model, because 179 in both record and replay modes we start virtual machine from the same 180 disk state. But callbacks that virtual disk controller uses for reading and 181 writing the disk may occur at different moments of time in record and replay 182 modes. 183 184 Reading and writing requests are created by CPU thread of QEMU. Later these 185 requests proceed to block layer which creates "bottom halves". Bottom 186 halves consist of callback and its parameters. They are processed when 187 main loop locks the global mutex. These locks are not synchronized with 188 replaying process because main loop also processes the events that do not 189 affect the virtual machine state (like user interaction with monitor). 190 191 That is why we had to implement saving and replaying bottom halves callbacks 192 synchronously to the CPU execution. When the callback is about to execute 193 it is added to the queue in the replay module. This queue is written to the 194 log when its callbacks are executed. In replay mode callbacks are not processed 195 until the corresponding event is read from the events log file. 196 197 Sometimes the block layer uses asynchronous callbacks for its internal purposes 198 (like reading or writing VM snapshots or disk image cluster tables). In this 199 case bottom halves are not marked as "replayable" and do not saved 200 into the log. 201 202 Saving/restoring the VM state 203 ----------------------------- 204 205 All fields in the device state structure (including virtual timers) 206 should be restored by loadvm to the same values they had before savevm. 207 208 Avoid accessing other devices' state, because the order of saving/restoring 209 is not defined. It means that you should not call functions like 210 ``update_irq`` in ``post_load`` callback. Save everything explicitly to avoid 211 the dependencies that may make restoring the VM state non-deterministic. 212 213 Stopping the VM 214 --------------- 215 216 Stopping the guest should not interfere with its state (with the exception 217 of the network connections, that could be broken by the remote timeouts). 218 VM can be stopped at any moment of replay by the user. Restarting the VM 219 after that stop should not break the replay by the unneeded guest state change. 220 221 Replay log format 222 ================= 223 224 Record/replay log consists of the header and the sequence of execution 225 events. The header includes 4-byte replay version id and 8-byte reserved 226 field. Version is updated every time replay log format changes to prevent 227 using replay log created by another build of qemu. 228 229 The sequence of the events describes virtual machine state changes. 230 It includes all non-deterministic inputs of VM, synchronization marks and 231 instruction counts used to correctly inject inputs at replay. 232 233 Synchronization marks (checkpoints) are used for synchronizing qemu threads 234 that perform operations with virtual hardware. These operations may change 235 system's state (e.g., change some register or generate interrupt) and 236 therefore should execute synchronously with CPU thread. 237 238 Every event in the log includes 1-byte event id and optional arguments. 239 When argument is an array, it is stored as 4-byte array length 240 and corresponding number of bytes with data. 241 Here is the list of events that are written into the log: 242 243 - EVENT_INSTRUCTION. Instructions executed since last event. Followed by: 244 245 - 4-byte number of executed instructions. 246 247 - EVENT_INTERRUPT. Used to synchronize interrupt processing. 248 - EVENT_EXCEPTION. Used to synchronize exception handling. 249 - EVENT_ASYNC. This is a group of events. When such an event is generated, 250 it is stored in the queue and processed in icount_account_warp_timer(). 251 Every such event has it's own id from the following list: 252 253 - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes 254 callbacks that affect virtual machine state, but normally called 255 asynchronously. Followed by: 256 257 - 8-byte operation id. 258 259 - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains 260 parameters of keyboard and mouse input operations 261 (key press/release, mouse pointer movement). Followed by: 262 263 - 9-16 bytes depending of input event. 264 265 - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event. 266 - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input 267 initiated by the sender. Followed by: 268 269 - 1-byte character device id. 270 - Array with bytes were read. 271 272 - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize 273 operations with disk and flash drives with CPU. Followed by: 274 275 - 8-byte operation id. 276 277 - REPLAY_ASYNC_EVENT_NET. Incoming network packet. Followed by: 278 279 - 1-byte network adapter id. 280 - 4-byte packet flags. 281 - Array with packet bytes. 282 283 - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu, 284 e.g., by closing the window. 285 - EVENT_CHAR_WRITE. Used to synchronize character output operations. Followed by: 286 287 - 4-byte output function return value. 288 - 4-byte offset in the output array. 289 290 - EVENT_CHAR_READ_ALL. Used to synchronize character input operations, 291 initiated by qemu. Followed by: 292 293 - Array with bytes that were read. 294 295 - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation, 296 initiated by qemu. Followed by: 297 298 - 4-byte error code. 299 300 - EVENT_CLOCK + clock_id. Group of events for host clock read operations. Followed by: 301 302 - 8-byte clock value. 303 304 - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of 305 CPU, internal threads, and asynchronous input events. 306 - EVENT_END. Last event in the log.