1.. 2 Copyright (c) 2015-2020 Linaro Ltd. 3 4 This work is licensed under the terms of the GNU GPL, version 2 or 5 later. See the COPYING file in the top-level directory. 6 7Introduction 8============ 9 10This document outlines the design for multi-threaded TCG (a.k.a MTTCG) 11system-mode emulation. user-mode emulation has always mirrored the 12thread structure of the translated executable although some of the 13changes done for MTTCG system emulation have improved the stability of 14linux-user emulation. 15 16The original system-mode TCG implementation was single threaded and 17dealt with multiple CPUs with simple round-robin scheduling. This 18simplified a lot of things but became increasingly limited as systems 19being emulated gained additional cores and per-core performance gains 20for host systems started to level off. 21 22vCPU Scheduling 23=============== 24 25We introduce a new running mode where each vCPU will run on its own 26user-space thread. This is enabled by default for all FE/BE 27combinations where the host memory model is able to accommodate the 28guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the 29guest has had the required work done to support this safely 30(TARGET_SUPPORTS_MTTCG). 31 32System emulation will fall back to the original round robin approach 33if: 34 35* forced by --accel tcg,thread=single 36* enabling --icount mode 37* 64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST) 38 39In the general case of running translated code there should be no 40inter-vCPU dependencies and all vCPUs should be able to run at full 41speed. Synchronisation will only be required while accessing internal 42shared data structures or when the emulated architecture requires a 43coherent representation of the emulated machine state. 44 45Shared Data Structures 46====================== 47 48Main Run Loop 49------------- 50 51Even when there is no code being generated there are a number of 52structures associated with the hot-path through the main run-loop. 53These are associated with looking up the next translation block to 54execute. These include: 55 56 tb_jmp_cache (per-vCPU, cache of recent jumps) 57 tb_ctx.htable (global hash table, phys address->tb lookup) 58 59As TB linking only occurs when blocks are in the same page this code 60is critical to performance as looking up the next TB to execute is the 61most common reason to exit the generated code. 62 63DESIGN REQUIREMENT: Make access to lookup structures safe with 64multiple reader/writer threads. Minimise any lock contention to do it. 65 66The hot-path avoids using locks where possible. The tb_jmp_cache is 67updated with atomic accesses to ensure consistent results. The fall 68back QHT based hash table is also designed for lockless lookups. Locks 69are only taken when code generation is required or TranslationBlocks 70have their block-to-block jumps patched. 71 72Global TCG State 73---------------- 74 75User-mode emulation 76~~~~~~~~~~~~~~~~~~~ 77 78We need to protect the entire code generation cycle including any post 79generation patching of the translated code. This also implies a shared 80translation buffer which contains code running on all cores. Any 81execution path that comes to the main run loop will need to hold a 82mutex for code generation. This also includes times when we need flush 83code or entries from any shared lookups/caches. Structures held on a 84per-vCPU basis won't need locking unless other vCPUs will need to 85modify them. 86 87DESIGN REQUIREMENT: Add locking around all code generation and TB 88patching. 89 90(Current solution) 91 92Code generation is serialised with mmap_lock(). 93 94!User-mode emulation 95~~~~~~~~~~~~~~~~~~~~ 96 97Each vCPU has its own TCG context and associated TCG region, thereby 98requiring no locking during translation. 99 100Translation Blocks 101------------------ 102 103Currently the whole system shares a single code generation buffer 104which when full will force a flush of all translations and start from 105scratch again. Some operations also force a full flush of translations 106including: 107 108 - debugging operations (breakpoint insertion/removal) 109 - some CPU helper functions 110 - linux-user spawning its first thread 111 112This is done with the async_safe_run_on_cpu() mechanism to ensure all 113vCPUs are quiescent when changes are being made to shared global 114structures. 115 116More granular translation invalidation events are typically due 117to a change of the state of a physical page: 118 119 - code modification (self modify code, patching code) 120 - page changes (new page mapping in linux-user mode) 121 122While setting the invalid flag in a TranslationBlock will stop it 123being used when looked up in the hot-path there are a number of other 124book-keeping structures that need to be safely cleared. 125 126Any TranslationBlocks which have been patched to jump directly to the 127now invalid blocks need the jump patches reversing so they will return 128to the C code. 129 130There are a number of look-up caches that need to be properly updated 131including the: 132 133 - jump lookup cache 134 - the physical-to-tb lookup hash table 135 - the global page table 136 137The global page table (l1_map) which provides a multi-level look-up 138for PageDesc structures which contain pointers to the start of a 139linked list of all Translation Blocks in that page (see page_next). 140 141Both the jump patching and the page cache involve linked lists that 142the invalidated TranslationBlock needs to be removed from. 143 144DESIGN REQUIREMENT: Safely handle invalidation of TBs 145 - safely patch/revert direct jumps 146 - remove central PageDesc lookup entries 147 - ensure lookup caches/hashes are safely updated 148 149(Current solution) 150 151The direct jump themselves are updated atomically by the TCG 152tb_set_jmp_target() code. Modification to the linked lists that allow 153searching for linked pages are done under the protection of tb->jmp_lock, 154where tb is the destination block of a jump. Each origin block keeps a 155pointer to its destinations so that the appropriate lock can be acquired before 156iterating over a jump list. 157 158The global page table is a lockless radix tree; cmpxchg is used 159to atomically insert new elements. 160 161The lookup caches are updated atomically and the lookup hash uses QHT 162which is designed for concurrent safe lookup. 163 164Parallel code generation is supported. QHT is used at insertion time 165as the synchronization point across threads, thereby ensuring that we only 166keep track of a single TranslationBlock for each guest code block. 167 168Memory maps and TLBs 169-------------------- 170 171The memory handling code is fairly critical to the speed of memory 172access in the emulated system. The SoftMMU code is designed so the 173hot-path can be handled entirely within translated code. This is 174handled with a per-vCPU TLB structure which once populated will allow 175a series of accesses to the page to occur without exiting the 176translated code. It is possible to set flags in the TLB address which 177will ensure the slow-path is taken for each access. This can be done 178to support: 179 180 - Memory regions (dividing up access to PIO, MMIO and RAM) 181 - Dirty page tracking (for code gen, SMC detection, migration and display) 182 - Virtual TLB (for translating guest address->real address) 183 184When the TLB tables are updated by a vCPU thread other than their own 185we need to ensure it is done in a safe way so no inconsistent state is 186seen by the vCPU thread. 187 188Some operations require updating a number of vCPUs TLBs at the same 189time in a synchronised manner. 190 191DESIGN REQUIREMENTS: 192 193 - TLB Flush All/Page 194 - can be across-vCPUs 195 - cross vCPU TLB flush may need other vCPU brought to halt 196 - change may need to be visible to the calling vCPU immediately 197 - TLB Flag Update 198 - usually cross-vCPU 199 - want change to be visible as soon as possible 200 - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs) 201 - This is a per-vCPU table - by definition can't race 202 - updated by its own thread when the slow-path is forced 203 204(Current solution) 205 206We have updated cputlb.c to defer operations when a cross-vCPU 207operation with async_run_on_cpu() which ensures each vCPU sees a 208coherent state when it next runs its work (in a few instructions 209time). 210 211A new set up operations (tlb_flush_*_all_cpus) take an additional flag 212which when set will force synchronisation by setting the source vCPUs 213work as "safe work" and exiting the cpu run loop. This ensure by the 214time execution restarts all flush operations have completed. 215 216TLB flag updates are all done atomically and are also protected by the 217corresponding page lock. 218 219(Known limitation) 220 221Not really a limitation but the wait mechanism is overly strict for 222some architectures which only need flushes completed by a barrier 223instruction. This could be a future optimisation. 224 225Emulated hardware state 226----------------------- 227 228Currently thanks to KVM work any access to IO memory is automatically 229protected by the global iothread mutex, also known as the BQL (Big 230Qemu Lock). Any IO region that doesn't use global mutex is expected to 231do its own locking. 232 233However IO memory isn't the only way emulated hardware state can be 234modified. Some architectures have model specific registers that 235trigger hardware emulation features. Generally any translation helper 236that needs to update more than a single vCPUs of state should take the 237BQL. 238 239As the BQL, or global iothread mutex is shared across the system we 240push the use of the lock as far down into the TCG code as possible to 241minimise contention. 242 243(Current solution) 244 245MMIO access automatically serialises hardware emulation by way of the 246BQL. Currently Arm targets serialise all ARM_CP_IO register accesses 247and also defer the reset/startup of vCPUs to the vCPU context by way 248of async_run_on_cpu(). 249 250Updates to interrupt state are also protected by the BQL as they can 251often be cross vCPU. 252 253Memory Consistency 254================== 255 256Between emulated guests and host systems there are a range of memory 257consistency models. Even emulating weakly ordered systems on strongly 258ordered hosts needs to ensure things like store-after-load re-ordering 259can be prevented when the guest wants to. 260 261Memory Barriers 262--------------- 263 264Barriers (sometimes known as fences) provide a mechanism for software 265to enforce a particular ordering of memory operations from the point 266of view of external observers (e.g. another processor core). They can 267apply to any memory operations as well as just loads or stores. 268 269The Linux kernel has an excellent `write-up 270<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>` 271on the various forms of memory barrier and the guarantees they can 272provide. 273 274Barriers are often wrapped around synchronisation primitives to 275provide explicit memory ordering semantics. However they can be used 276by themselves to provide safe lockless access by ensuring for example 277a change to a signal flag will only be visible once the changes to 278payload are. 279 280DESIGN REQUIREMENT: Add a new tcg_memory_barrier op 281 282This would enforce a strong load/store ordering so all loads/stores 283complete at the memory barrier. On single-core non-SMP strongly 284ordered backends this could become a NOP. 285 286Aside from explicit standalone memory barrier instructions there are 287also implicit memory ordering semantics which comes with each guest 288memory access instruction. For example all x86 load/stores come with 289fairly strong guarantees of sequential consistency whereas Arm has 290special variants of load/store instructions that imply acquire/release 291semantics. 292 293In the case of a strongly ordered guest architecture being emulated on 294a weakly ordered host the scope for a heavy performance impact is 295quite high. 296 297DESIGN REQUIREMENTS: Be efficient with use of memory barriers 298 - host systems with stronger implied guarantees can skip some barriers 299 - merge consecutive barriers to the strongest one 300 301(Current solution) 302 303The system currently has a tcg_gen_mb() which will add memory barrier 304operations if code generation is being done in a parallel context. The 305tcg_optimize() function attempts to merge barriers up to their 306strongest form before any load/store operations. The solution was 307originally developed and tested for linux-user based systems. All 308backends have been converted to emit fences when required. So far the 309following front-ends have been updated to emit fences when required: 310 311 - target-i386 312 - target-arm 313 - target-aarch64 314 - target-alpha 315 - target-mips 316 317Memory Control and Maintenance 318------------------------------ 319 320This includes a class of instructions for controlling system cache 321behaviour. While QEMU doesn't model cache behaviour these instructions 322are often seen when code modification has taken place to ensure the 323changes take effect. 324 325Synchronisation Primitives 326-------------------------- 327 328There are two broad types of synchronisation primitives found in 329modern ISAs: atomic instructions and exclusive regions. 330 331The first type offer a simple atomic instruction which will guarantee 332some sort of test and conditional store will be truly atomic w.r.t. 333other cores sharing access to the memory. The classic example is the 334x86 cmpxchg instruction. 335 336The second type offer a pair of load/store instructions which offer a 337guarantee that a region of memory has not been touched between the 338load and store instructions. An example of this is Arm's ldrex/strex 339pair where the strex instruction will return a flag indicating a 340successful store only if no other CPU has accessed the memory region 341since the ldrex. 342 343Traditionally TCG has generated a series of operations that work 344because they are within the context of a single translation block so 345will have completed before another CPU is scheduled. However with 346the ability to have multiple threads running to emulate multiple CPUs 347we will need to explicitly expose these semantics. 348 349DESIGN REQUIREMENTS: 350 - Support classic atomic instructions 351 - Support load/store exclusive (or load link/store conditional) pairs 352 - Generic enough infrastructure to support all guest architectures 353CURRENT OPEN QUESTIONS: 354 - How problematic is the ABA problem in general? 355 356(Current solution) 357 358The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which 359can be used directly or combined to emulate other instructions like 360Arm's ldrex/strex instructions. While they are susceptible to the ABA 361problem so far common guests have not implemented patterns where 362this may be a problem - typically presenting a locking ABI which 363assumes cmpxchg like semantics. 364 365The code also includes a fall-back for cases where multi-threaded TCG 366ops can't work (e.g. guest atomic width > host atomic width). In this 367case an EXCP_ATOMIC exit occurs and the instruction is emulated with 368an exclusive lock which ensures all emulation is serialised. 369 370While the atomic helpers look good enough for now there may be a need 371to look at solutions that can more closely model the guest 372architectures semantics. 373