..
  Copyright (c) 2015-2020 Linaro Ltd.

  This work is licensed under the terms of the GNU GPL, version 2 or
  later. See the COPYING file in the top-level directory.

==================
Multi-threaded TCG
==================

This document outlines the design for multi-threaded TCG (a.k.a.
MTTCG) system-mode emulation. User-mode emulation has always mirrored
the thread structure of the translated executable, although some of
the changes done for MTTCG system emulation have improved the
stability of linux-user emulation.

The original system-mode TCG implementation was single threaded and
dealt with multiple CPUs using simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as the
systems being emulated gained additional cores and per-core
performance gains for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU runs on its own
user-space thread. This is enabled by default for all FE/BE
combinations where the host memory model is able to accommodate the
guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the
guest has had the required work done to support this safely
(TARGET_SUPPORTS_MTTCG).

System emulation will fall back to the original round-robin approach
if:

* forced by --accel tcg,thread=single
* enabling --icount mode
* 64 bit guests on 32 bit hosts (TCG_OVERSIZED_GUEST)

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation is only required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Main Run Loop
-------------

Even when there is no code being generated there are a number of
structures associated with the hot-path through the main run-loop:
those used to look up the next translation block to execute. These
include:

    tb_jmp_cache (per-vCPU, cache of recent jumps)
    tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs between blocks in the same page this code is
critical to performance: looking up the next TB to execute is the most
common reason to exit the generated code.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads. Minimise any lock contention while
doing so.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The
fall-back QHT-based hash table is also designed for lockless lookups.
Locks are only taken when code generation is required or
TranslationBlocks have their block-to-block jumps patched.

Global TCG State
----------------

User-mode emulation
~~~~~~~~~~~~~~~~~~~

We need to protect the entire code generation cycle, including any
post-generation patching of the translated code. This also implies a
shared translation buffer which contains code running on all cores.
Any execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held
on a per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Code generation is serialised with mmap_lock().
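
The pattern, in outline, is that any path which may translate or patch
code first takes the mmap lock. A minimal sketch, assuming a
simplified call site (the gen_tb_locked() wrapper is hypothetical;
real callers derive pc, cs_base, flags and cflags from the vCPU
state):

.. code-block:: c

    #include "qemu/osdep.h"
    #include "exec/exec-all.h"

    /* Sketch only: generate a TB for @pc with the mmap lock held,
     * as user-mode code generation requires. */
    static TranslationBlock *gen_tb_locked(CPUState *cpu, target_ulong pc,
                                           target_ulong cs_base,
                                           uint32_t flags, int cflags)
    {
        TranslationBlock *tb;

        mmap_lock();
        tb = tb_gen_code(cpu, pc, cs_base, flags, cflags);
        mmap_unlock();
        return tb;
    }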

!User-mode emulation
~~~~~~~~~~~~~~~~~~~~

Each vCPU has its own TCG context and associated TCG region, thereby
requiring no locking during translation.

Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which, when full, forces a flush of all translations so generation can
start from scratch. Some operations also force a full flush of
translations, including:

  - debugging operations (breakpoint insertion/removal)
  - some CPU helper functions
  - linux-user spawning its first thread
  - operations related to TCG Plugins

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.

More granular translation invalidation events are typically due to a
change of the state of a physical page:

  - code modification (self modifying code, patching code)
  - page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path, there are a number of other
book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to the
now invalid blocks need the jump patches reversing so they will return
to the C code.

There are a number of look-up caches that need to be properly updated,
including:

  - the jump lookup cache
  - the physical-to-tb lookup hash table
  - the global page table

The global page table (l1_map) provides a multi-level look-up for
PageDesc structures, which contain pointers to the start of a linked
list of all TranslationBlocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs
                    - safely patch/revert direct jumps
                    - remove central PageDesc lookup entries
                    - ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modifications to the linked lists that allow
searching for linked pages are done under the protection of
tb->jmp_lock, where tb is the destination block of a jump. Each origin
block keeps a pointer to its destinations so that the appropriate lock
can be acquired before iterating over a jump list.

The global page table is a lockless radix tree; cmpxchg is used to
atomically insert new elements.

The lookup caches are updated atomically and the lookup hash uses QHT,
which is designed for concurrent safe lookup.

Parallel code generation is supported. QHT is used at insertion time
as the synchronisation point across threads, thereby ensuring that we
only keep track of a single TranslationBlock for each guest code
block.
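
If two threads race to translate the same guest block, the QHT
insertion resolves the race: only one insert succeeds and the loser
discards its block. A minimal sketch of the idea (the link_tb()
wrapper is hypothetical; the real book-keeping lives in
tb_link_page()):

.. code-block:: c

    #include "qemu/osdep.h"
    #include "qemu/qht.h"
    #include "exec/exec-all.h"

    /* qht_insert() fails if an equivalent entry already exists and
     * hands back the winning pointer via @existing. */
    static TranslationBlock *link_tb(struct qht *htable,
                                     TranslationBlock *tb, uint32_t hash)
    {
        void *existing = NULL;

        if (!qht_insert(htable, tb, hash, &existing)) {
            /* Another thread translated the same block first: drop
             * ours and execute theirs instead. */
            return existing;
        }
        return tb;
    }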

Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which, once populated, will
allow a series of accesses to the page to occur without exiting the
translated code. It is possible to set flags in the TLB address which
will ensure the slow-path is taken for each access. This can be done
to support:

  - Memory regions (dividing up access to PIO, MMIO and RAM)
  - Dirty page tracking (for code gen, SMC detection, migration and display)
  - Virtual TLB (for translating guest address->real address)

When the TLB tables are updated by a thread other than the owning
vCPU's we need to ensure it is done in a safe way so that no
inconsistent state is seen by the owning vCPU thread.

Some operations require updating a number of vCPUs' TLBs at the same
time in a synchronised manner.

DESIGN REQUIREMENTS:

  - TLB Flush All/Page
    - can be across-vCPUs
    - cross vCPU TLB flush may need other vCPU brought to halt
    - change may need to be visible to the calling vCPU immediately
  - TLB Flag Update
    - usually cross-vCPU
    - want change to be visible as soon as possible
  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
    - This is a per-vCPU table - by definition can't race
    - updated by its own thread when the slow-path is forced

(Current solution)

A new set of tlb flush operations (tlb_flush_*_all_cpus_synced) force
synchronisation by posting the flush as "safe work" on the source vCPU
and exiting the cpu run loop. This ensures that by the time execution
restarts all flush operations have completed.

TLB flag updates are all done atomically and are also protected by the
corresponding page lock.

(Known limitation)

Not really a limitation, but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.

Emulated hardware state
-----------------------

Currently, thanks to KVM work, any access to IO memory is
automatically protected by the BQL (Big QEMU Lock). Any IO region that
doesn't use the BQL is expected to do its own locking.

However IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update more than a single vCPU's state should take the
BQL.

As the BQL, or global iothread mutex, is shared across the system we
push the use of the lock as far down into the TCG code as possible to
minimise contention.

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL as they can
often be cross-vCPU.

Memory Consistency
==================

Between emulated guests and host systems there are a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest wants to.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to all memory operations or just to loads or stores.

The Linux kernel has an excellent `write-up
<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
on the various forms of memory barrier and the guarantees they can
provide.

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access: for example, ensuring
that a change to a signal flag only becomes visible once the changes
to the payload it guards are.
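
A minimal sketch of that flag/payload idiom, using QEMU's barrier and
atomic wrappers from include/qemu/atomic.h (the flag and payload
variables here are hypothetical):

.. code-block:: c

    #include "qemu/osdep.h"
    #include "qemu/atomic.h"

    static int payload;   /* data guarded by the flag */
    static int flag;

    static void producer(void)
    {
        payload = 42;           /* write the payload... */
        smp_wmb();              /* ...and order it before the flag */
        qatomic_set(&flag, 1);
    }

    static int consumer(void)
    {
        if (qatomic_read(&flag)) {
            smp_rmb();          /* order flag read before payload read */
            return payload;     /* guaranteed to see the producer's write */
        }
        return -1;              /* flag not set yet */
    }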

DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 loads/stores come with
fairly strong ordering guarantees (total store ordering), whereas Arm
has special variants of load/store instructions that imply
acquire/release semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers
       - host systems with stronger implied guarantees can skip some barriers
       - merge consecutive barriers to the strongest one

(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context. The
tcg_optimize() function attempts to merge barriers up to their
strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required. So far the
following front-ends have been updated to do the same:

  - target-i386
  - target-arm
  - target-aarch64
  - target-alpha
  - target-mips
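
As an illustration, a front-end mapping a guest barrier instruction
onto the TCG fence op might look like the following sketch. The
gen_guest_barrier() helper and its flag are hypothetical; the
TCG_MO_*/TCG_BAR_* constants are the real ones from tcg/tcg.h:

.. code-block:: c

    #include "tcg/tcg-op.h"

    /* Emit a TCG fence for a decoded guest barrier instruction;
     * tcg_gen_mb() only generates an op in a parallel context. */
    static void gen_guest_barrier(bool store_only)
    {
        if (store_only) {
            /* order stores against stores, e.g. Arm "dmb st" */
            tcg_gen_mb(TCG_MO_ST_ST | TCG_BAR_SC);
        } else {
            /* full barrier ordering all loads and stores */
            tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);
        }
    }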

Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour, these
instructions are often seen when code modification has taken place, to
ensure the changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offers a single atomic instruction which guarantees
that some sort of test-and-conditional-store is truly atomic w.r.t.
other cores sharing access to the memory. The classic example is the
x86 cmpxchg instruction.

The second type offers a pair of load/store instructions which
guarantee that a region of memory has not been touched between the
load and the store. An example of this is Arm's ldrex/strex pair,
where the strex instruction will only signal a successful store if no
other CPU has accessed the memory region since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block, so
they will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.

DESIGN REQUIREMENTS:
  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional)
    pairs
  - Generic enough infrastructure to support all guest architectures
CURRENT OPEN QUESTIONS:
  - How problematic is the ABA problem in general?

(Current solution)

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
Arm's ldrex/strex instructions; a sketch follows at the end of this
section. While they are susceptible to the ABA problem, so far common
guests have not implemented patterns where this is an issue, typically
presenting a locking ABI which assumes cmpxchg-like semantics.

The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.

While the atomic helpers look good enough for now, there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.
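
For illustration, emulating a guest compare-and-swap with the atomic
helpers might look like the following sketch. gen_guest_cas() and its
arguments are hypothetical; a real front-end derives the memory index
and MemOp from the decoded instruction:

.. code-block:: c

    #include "tcg/tcg-op.h"

    /* Atomically: ret = *addr; if (ret == cmp) { *addr = newv; }
     * In a parallel context this maps to a host atomic operation
     * (or the EXCP_ATOMIC serialised fall-back). */
    static void gen_guest_cas(TCGv ret, TCGv addr, TCGv cmp,
                              TCGv newv, int mem_idx)
    {
        tcg_gen_atomic_cmpxchg_tl(ret, addr, cmp, newv, mem_idx,
                                  MO_TEUL);
    }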