docs/devel/multi-thread-tcg.rst

  Copyright (c) 2015-2020 Linaro Ltd.
  later. See the COPYING file in the top-level directory.
Multi-threaded TCG
This document outlines the design for multi-threaded TCG (a.k.a. MTTCG)
system-mode emulation. User-mode emulation has always mirrored the
changes done for MTTCG system emulation have improved the stability of
linux-user emulation.
The original system-mode TCG implementation was single-threaded and
dealt with multiple CPUs with simple round-robin scheduling. This
being emulated gained additional cores and per-core performance gains
user-space thread. This is enabled by default for all FE/BE
System emulation will fall back to the original round-robin approach
* forced by --accel tcg,thread=single
* enabling --icount mode
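
The fall-back conditions listed above can be pictured as a simple
predicate. This is only a sketch: the structure and function names are
hypothetical and are not taken from the QEMU sources, and the real
decision also depends on the front-end/back-end combination, which is
not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical configuration flags, for illustration only. */
struct tcg_thread_config {
    bool thread_single_forced;  /* --accel tcg,thread=single was given */
    bool icount_enabled;        /* --icount mode was requested */
};

/* Returns true when each vCPU may run in its own thread, false when
 * the emulation should fall back to round-robin scheduling. */
static bool use_multi_threaded_tcg(const struct tcg_thread_config *cfg)
{
    assert(cfg != NULL);
    if (cfg->thread_single_forced) {
        return false;
    }
    if (cfg->icount_enabled) {
        return false;
    }
    return true;
}
```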
inter-vCPU dependencies and all vCPUs should be able to run at full
-------------
structures associated with the hot-path through the main run-loop.
    tb_jmp_cache (per-vCPU, cache of recent jumps)
    tb_ctx.htable (global hash table, phys address->tb lookup)
The hot-path avoids using locks where possible. The tb_jmp_cache is
have their block-to-block jumps patched.
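
As a rough illustration of a lock-free two-level lookup, the sketch
below checks a per-vCPU cache first and only then consults a shared
table, using C11 atomics for the pointer loads and stores. The types,
sizes and function names are assumptions made for this example; in
particular the shared table is a trivial direct-mapped array standing
in for tb_ctx.htable, not the real hash table.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define JMP_CACHE_SIZE 256   /* illustrative size, power of two */
#define GLOBAL_SLOTS   1024  /* illustrative size, power of two */

struct tb { uint64_t pc; };  /* stand-in for a TranslationBlock */

struct vcpu {
    /* per-vCPU cache of recently used blocks */
    _Atomic(struct tb *) jmp_cache[JMP_CACHE_SIZE];
};

/* Shared table; a stand-in for the global physical-address hash. */
static _Atomic(struct tb *) global_table[GLOBAL_SLOTS];

static size_t slot(uint64_t pc, size_t n)
{
    return (size_t)(pc >> 2) & (n - 1);
}

/* Publish a block so that every vCPU can find it. */
static void global_insert(struct tb *tb)
{
    assert(tb != NULL);
    atomic_store_explicit(&global_table[slot(tb->pc, GLOBAL_SLOTS)],
                          tb, memory_order_release);
}

/* Lock-free lookup: per-vCPU cache first, then the shared table. */
static struct tb *tb_lookup(struct vcpu *cpu, uint64_t pc)
{
    size_t i = slot(pc, JMP_CACHE_SIZE);
    struct tb *tb = atomic_load_explicit(&cpu->jmp_cache[i],
                                         memory_order_acquire);
    if (tb && tb->pc == pc) {
        return tb;                      /* fast path hit */
    }
    tb = atomic_load_explicit(&global_table[slot(pc, GLOBAL_SLOTS)],
                              memory_order_acquire);
    if (tb && tb->pc == pc) {
        /* remember the block for the next lookup on this vCPU */
        atomic_store_explicit(&cpu->jmp_cache[i], tb,
                              memory_order_release);
        return tb;
    }
    return NULL;                        /* caller would translate */
}
```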
----------------
User-mode emulation
per-vCPU basis won't need locking unless other vCPUs will need to
!User-mode emulation
------------------
Currently the whole system shares a single code generation buffer
  - debugging operations (breakpoint insertion/removal)
  - some CPU helper functions
  - linux-user spawning its first thread
  - operations related to TCG Plugins
  - code modification (self-modifying code, patching code)
  - page changes (new page mapping in linux-user mode)
being used when looked up in the hot-path there are a number of other
book-keeping structures that need to be safely cleared.
There are a number of look-up caches that need to be properly updated
  - jump lookup cache
  - the physical-to-tb lookup hash table
  - the global page table
The global page table (l1_map) which provides a multi-level look-up
                      - safely patch/revert direct jumps
                      - remove central PageDesc lookup entries
                      - ensure lookup caches/hashes are safely updated
searching for linked pages are done under the protection of tb->jmp_lock,
keep track of a single TranslationBlock for each guest code block.
--------------------
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which, once populated, will allow
will ensure the slow-path is taken for each access. This can be done
  - Memory regions (dividing up access to PIO, MMIO and RAM)
  - Dirty page tracking (for code gen, SMC detection, migration and display)
  - Virtual TLB (for translating guest address->real address)
  - TLB Flush All/Page
    - can be across-vCPUs
    - cross-vCPU TLB flush may need other vCPUs brought to a halt
    - change may need to be visible to the calling vCPU immediately
  - TLB Flag Update
    - usually cross-vCPU
    - want change to be visible as soon as possible
  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
    - This is a per-vCPU table - by definition can't race
    - updated by its own thread when the slow-path is forced
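
One way to keep a per-vCPU table free of races while still allowing
cross-vCPU flushes is to have other threads only post a request, and
let the owning vCPU thread do the actual flush. The sketch below shows
that idea with a single atomic flag. All names here are hypothetical
and the structure is far simpler than a real TLB; it is not a
description of how QEMU implements its flushes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TLB_SIZE 8   /* illustrative size */

struct sketch_tlb_entry {
    uint64_t guest_addr;
    uint64_t host_addr;
    bool     valid;
};

struct sketch_vcpu {
    /* Only ever written by the thread that owns this vCPU. */
    struct sketch_tlb_entry tlb[SKETCH_TLB_SIZE];
    /* May be set by any thread to ask for a flush. */
    atomic_bool flush_pending;
};

/* Called from any thread: request a flush of the target's TLB. */
static void tlb_request_flush(struct sketch_vcpu *target)
{
    assert(target != NULL);
    atomic_store(&target->flush_pending, true);
}

/* Called only by the owning vCPU thread, e.g. before re-entering
 * translated code. Returns true if a flush was performed. */
static bool vcpu_service_flush(struct sketch_vcpu *self)
{
    assert(self != NULL);
    if (!atomic_exchange(&self->flush_pending, false)) {
        return false;
    }
    for (size_t i = 0; i < SKETCH_TLB_SIZE; i++) {
        self->tlb[i].valid = false;
    }
    return true;
}
```

A request posted this way only takes effect when the target next
services it, so a caller that needs the change to be visible
immediately would additionally have to wait for, or halt, the target.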
-----------------------
that needs to update more than a single vCPU's state should take the
As the BQL, or global iothread mutex, is shared across the system we
ordered hosts needs to ensure things like store-after-load re-ordering
---------------
The Linux kernel has an excellent `write-up
<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barrier…
complete at the memory barrier. On single-core non-SMP strongly
       - host systems with stronger implied guarantees can skip some barriers
       - merge consecutive barriers to the strongest one
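
Both points can be sketched by treating a barrier as a bitmask of the
orderings it enforces. Merging two consecutive barriers is then the
union of their masks, and the barrier actually emitted is whatever the
guest requires minus what the host already guarantees. The enum and
function names below are assumptions for this example only.

```c
#include <assert.h>

/* Illustrative ordering bits: "earlier access -> later access". */
enum mem_order {
    MO_LD_LD = 1 << 0,  /* loads ordered before later loads */
    MO_ST_LD = 1 << 1,  /* stores ordered before later loads */
    MO_LD_ST = 1 << 2,  /* loads ordered before later stores */
    MO_ST_ST = 1 << 3,  /* stores ordered before later stores */
    MO_ALL   = MO_LD_LD | MO_ST_LD | MO_LD_ST | MO_ST_ST
};

/* Two back-to-back barriers can be replaced by one that enforces
 * every ordering either of them enforced. */
static unsigned merge_barriers(unsigned first, unsigned second)
{
    assert(((first | second) & ~(unsigned)MO_ALL) == 0);
    return first | second;
}

/* Orderings the host provides implicitly need no explicit barrier.
 * A result of 0 means the barrier can be skipped entirely. */
static unsigned barrier_needed_on_host(unsigned guest_required,
                                       unsigned host_guaranteed)
{
    return guest_required & ~host_guaranteed;
}
```

For example, on a host that implicitly orders everything except a
store followed by a load, a full guest barrier reduces to MO_ST_LD
alone.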
The system currently has a tcg_gen_mb() which will add memory barrier
originally developed and tested for linux-user based systems. All
following front-ends have been updated to emit fences when required:
    - target-i386
    - target-arm
    - target-aarch64
    - target-alpha
    - target-mips
------------------------------
This includes a class of instructions for controlling system cache
--------------------------
because they are within the context of a single translation block so
  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional) pairs
  - Generic enough infrastructure to support all guest architectures
  - How problematic is the ABA problem in general?
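
To make the ABA question concrete, the sketch below emulates a
load-exclusive/store-exclusive pair with a compare-and-swap on the
value seen by the load. It uses C11 atomics, and every name in it is
hypothetical. The store-exclusive fails if the value has changed, but
it cannot tell when the value was changed and then changed back.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Per-vCPU record of the last load-exclusive (hypothetical). */
struct excl_state {
    _Atomic uint32_t *addr;  /* monitored address, NULL if none */
    uint32_t          val;   /* value observed by the load */
};

static uint32_t load_exclusive(struct excl_state *s,
                               _Atomic uint32_t *addr)
{
    assert(addr != NULL);
    s->addr = addr;
    s->val = atomic_load(addr);
    return s->val;
}

/* Succeeds only if memory still holds the value the load saw. */
static bool store_exclusive(struct excl_state *s,
                            _Atomic uint32_t *addr, uint32_t newval)
{
    bool ok = false;
    if (s->addr == addr) {
        uint32_t expected = s->val;
        ok = atomic_compare_exchange_strong(addr, &expected, newval);
    }
    s->addr = NULL;  /* the monitor is cleared either way */
    return ok;
}
```

If another thread writes B and then A again between the two
operations, the compare-and-swap still succeeds, whereas a hardware
exclusive monitor would typically have been cleared by the intervening
stores.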
this may be a problem - typically presenting a locking ABI which
The code also includes a fall-back for cases where multi-threaded TCG