xref: /openbmc/qemu/docs/devel/multi-thread-tcg.rst (revision 07153411cd57e8d6933ab8a43035ed097deee696)
1..
2  Copyright (c) 2015-2020 Linaro Ltd.
3
4  This work is licensed under the terms of the GNU GPL, version 2 or
5  later. See the COPYING file in the top-level directory.
6
7.. _mttcg:
8
9==================
10Multi-threaded TCG
11==================
12
13This document outlines the design for multi-threaded TCG (a.k.a MTTCG)
14system-mode emulation. user-mode emulation has always mirrored the
15thread structure of the translated executable although some of the
16changes done for MTTCG system emulation have improved the stability of
17linux-user emulation.
18
19The original system-mode TCG implementation was single threaded and
20dealt with multiple CPUs with simple round-robin scheduling. This
21simplified a lot of things but became increasingly limited as systems
22being emulated gained additional cores and per-core performance gains
23for host systems started to level off.
24
25vCPU Scheduling
26===============
27
28We introduce a new running mode where each vCPU will run on its own
29user-space thread. This is enabled by default for all FE/BE
30combinations where the host memory model is able to accommodate the
31guest (TCGCPUOps::guest_default_memory_order & ~TCG_TARGET_DEFAULT_MO is zero)
32and the guest has had the required work done to support this safely
33(TCGCPUOps::mttcg_supported).
34
35System emulation will fall back to the original round robin approach
36if:
37
38* forced by --accel tcg,thread=single
39* enabling --icount mode
40
41In the general case of running translated code there should be no
42inter-vCPU dependencies and all vCPUs should be able to run at full
43speed. Synchronisation will only be required while accessing internal
44shared data structures or when the emulated architecture requires a
45coherent representation of the emulated machine state.
46
47Shared Data Structures
48======================
49
50Main Run Loop
51-------------
52
53Even when there is no code being generated there are a number of
54structures associated with the hot-path through the main run-loop.
55These are associated with looking up the next translation block to
56execute. These include:
57
58    tb_jmp_cache (per-vCPU, cache of recent jumps)
59    tb_ctx.htable (global hash table, phys address->tb lookup)
60
61As TB linking only occurs when blocks are in the same page this code
62is critical to performance as looking up the next TB to execute is the
63most common reason to exit the generated code.
64
65DESIGN REQUIREMENT: Make access to lookup structures safe with
66multiple reader/writer threads. Minimise any lock contention to do it.
67
68The hot-path avoids using locks where possible. The tb_jmp_cache is
69updated with atomic accesses to ensure consistent results. The fall
70back QHT based hash table is also designed for lockless lookups. Locks
71are only taken when code generation is required or TranslationBlocks
72have their block-to-block jumps patched.
73
74Global TCG State
75----------------
76
77User-mode emulation
78~~~~~~~~~~~~~~~~~~~
79
80We need to protect the entire code generation cycle including any post
81generation patching of the translated code. This also implies a shared
82translation buffer which contains code running on all cores. Any
83execution path that comes to the main run loop will need to hold a
84mutex for code generation. This also includes times when we need flush
85code or entries from any shared lookups/caches. Structures held on a
86per-vCPU basis won't need locking unless other vCPUs will need to
87modify them.
88
89DESIGN REQUIREMENT: Add locking around all code generation and TB
90patching.
91
92(Current solution)
93
94Code generation is serialised with mmap_lock().
95
96!User-mode emulation
97~~~~~~~~~~~~~~~~~~~~
98
99Each vCPU has its own TCG context and associated TCG region, thereby
100requiring no locking during translation.
101
102Translation Blocks
103------------------
104
105Currently the whole system shares a single code generation buffer
106which when full will force a flush of all translations and start from
107scratch again. Some operations also force a full flush of translations
108including:
109
110  - debugging operations (breakpoint insertion/removal)
111  - some CPU helper functions
112  - linux-user spawning its first thread
113  - operations related to TCG Plugins
114
115This is done with the async_safe_run_on_cpu() mechanism to ensure all
116vCPUs are quiescent when changes are being made to shared global
117structures.
118
119More granular translation invalidation events are typically due
120to a change of the state of a physical page:
121
122  - code modification (self modify code, patching code)
123  - page changes (new page mapping in linux-user mode)
124
125While setting the invalid flag in a TranslationBlock will stop it
126being used when looked up in the hot-path there are a number of other
127book-keeping structures that need to be safely cleared.
128
129Any TranslationBlocks which have been patched to jump directly to the
130now invalid blocks need the jump patches reversing so they will return
131to the C code.
132
133There are a number of look-up caches that need to be properly updated
134including the:
135
136  - jump lookup cache
137  - the physical-to-tb lookup hash table
138  - the global page table
139
140The global page table (l1_map) which provides a multi-level look-up
141for PageDesc structures which contain pointers to the start of a
142linked list of all Translation Blocks in that page (see page_next).
143
144Both the jump patching and the page cache involve linked lists that
145the invalidated TranslationBlock needs to be removed from.
146
147DESIGN REQUIREMENT: Safely handle invalidation of TBs
148                      - safely patch/revert direct jumps
149                      - remove central PageDesc lookup entries
150                      - ensure lookup caches/hashes are safely updated
151
152(Current solution)
153
154The direct jump themselves are updated atomically by the TCG
155tb_set_jmp_target() code. Modification to the linked lists that allow
156searching for linked pages are done under the protection of tb->jmp_lock,
157where tb is the destination block of a jump. Each origin block keeps a
158pointer to its destinations so that the appropriate lock can be acquired before
159iterating over a jump list.
160
161The global page table is a lockless radix tree; cmpxchg is used
162to atomically insert new elements.
163
164The lookup caches are updated atomically and the lookup hash uses QHT
165which is designed for concurrent safe lookup.
166
167Parallel code generation is supported. QHT is used at insertion time
168as the synchronization point across threads, thereby ensuring that we only
169keep track of a single TranslationBlock for each guest code block.
170
171Memory maps and TLBs
172--------------------
173
174The memory handling code is fairly critical to the speed of memory
175access in the emulated system. The SoftMMU code is designed so the
176hot-path can be handled entirely within translated code. This is
177handled with a per-vCPU TLB structure which once populated will allow
178a series of accesses to the page to occur without exiting the
179translated code. It is possible to set flags in the TLB address which
180will ensure the slow-path is taken for each access. This can be done
181to support:
182
183  - Memory regions (dividing up access to PIO, MMIO and RAM)
184  - Dirty page tracking (for code gen, SMC detection, migration and display)
185  - Virtual TLB (for translating guest address->real address)
186
187When the TLB tables are updated by a vCPU thread other than their own
188we need to ensure it is done in a safe way so no inconsistent state is
189seen by the vCPU thread.
190
191Some operations require updating a number of vCPUs TLBs at the same
192time in a synchronised manner.
193
194DESIGN REQUIREMENTS:
195
196  - TLB Flush All/Page
197    - can be across-vCPUs
198    - cross vCPU TLB flush may need other vCPU brought to halt
199    - change may need to be visible to the calling vCPU immediately
200  - TLB Flag Update
201    - usually cross-vCPU
202    - want change to be visible as soon as possible
203  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
204    - This is a per-vCPU table - by definition can't race
205    - updated by its own thread when the slow-path is forced
206
207(Current solution)
208
209A new set of tlb flush operations (tlb_flush_*_all_cpus_synced) force
210synchronisation by setting the source vCPUs work as "safe work" and
211exiting the cpu run loop. This ensures that by the time execution
212restarts all flush operations have completed.
213
214TLB flag updates are all done atomically and are also protected by the
215corresponding page lock.
216
217(Known limitation)
218
219Not really a limitation but the wait mechanism is overly strict for
220some architectures which only need flushes completed by a barrier
221instruction. This could be a future optimisation.
222
223Emulated hardware state
224-----------------------
225
226Currently thanks to KVM work any access to IO memory is automatically protected
227by the BQL (Big QEMU Lock). Any IO region that doesn't use the BQL is expected
228to do its own locking.
229
230However IO memory isn't the only way emulated hardware state can be
231modified. Some architectures have model specific registers that
232trigger hardware emulation features. Generally any translation helper
233that needs to update more than a single vCPUs of state should take the
234BQL.
235
236As the BQL, or global iothread mutex is shared across the system we
237push the use of the lock as far down into the TCG code as possible to
238minimise contention.
239
240(Current solution)
241
242MMIO access automatically serialises hardware emulation by way of the
243BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
244and also defer the reset/startup of vCPUs to the vCPU context by way
245of async_run_on_cpu().
246
247Updates to interrupt state are also protected by the BQL as they can
248often be cross vCPU.
249
250Memory Consistency
251==================
252
253Between emulated guests and host systems there are a range of memory
254consistency models. Even emulating weakly ordered systems on strongly
255ordered hosts needs to ensure things like store-after-load re-ordering
256can be prevented when the guest wants to.
257
258Memory Barriers
259---------------
260
261Barriers (sometimes known as fences) provide a mechanism for software
262to enforce a particular ordering of memory operations from the point
263of view of external observers (e.g. another processor core). They can
264apply to any memory operations as well as just loads or stores.
265
266The Linux kernel has an excellent `write-up
267<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
268on the various forms of memory barrier and the guarantees they can
269provide.
270
271Barriers are often wrapped around synchronisation primitives to
272provide explicit memory ordering semantics. However they can be used
273by themselves to provide safe lockless access by ensuring for example
274a change to a signal flag will only be visible once the changes to
275payload are.
276
277DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
278
279This would enforce a strong load/store ordering so all loads/stores
280complete at the memory barrier. On single-core non-SMP strongly
281ordered backends this could become a NOP.
282
283Aside from explicit standalone memory barrier instructions there are
284also implicit memory ordering semantics which comes with each guest
285memory access instruction. For example all x86 load/stores come with
286fairly strong guarantees of sequential consistency whereas Arm has
287special variants of load/store instructions that imply acquire/release
288semantics.
289
290In the case of a strongly ordered guest architecture being emulated on
291a weakly ordered host the scope for a heavy performance impact is
292quite high.
293
294DESIGN REQUIREMENTS: Be efficient with use of memory barriers
295       - host systems with stronger implied guarantees can skip some barriers
296       - merge consecutive barriers to the strongest one
297
298(Current solution)
299
300The system currently has a tcg_gen_mb() which will add memory barrier
301operations if code generation is being done in a parallel context. The
302tcg_optimize() function attempts to merge barriers up to their
303strongest form before any load/store operations. The solution was
304originally developed and tested for linux-user based systems. All
305backends have been converted to emit fences when required. So far the
306following front-ends have been updated to emit fences when required:
307
308    - target-i386
309    - target-arm
310    - target-aarch64
311    - target-alpha
312    - target-mips
313
314Memory Control and Maintenance
315------------------------------
316
317This includes a class of instructions for controlling system cache
318behaviour. While QEMU doesn't model cache behaviour these instructions
319are often seen when code modification has taken place to ensure the
320changes take effect.
321
322Synchronisation Primitives
323--------------------------
324
325There are two broad types of synchronisation primitives found in
326modern ISAs: atomic instructions and exclusive regions.
327
328The first type offer a simple atomic instruction which will guarantee
329some sort of test and conditional store will be truly atomic w.r.t.
330other cores sharing access to the memory. The classic example is the
331x86 cmpxchg instruction.
332
333The second type offer a pair of load/store instructions which offer a
334guarantee that a region of memory has not been touched between the
335load and store instructions. An example of this is Arm's ldrex/strex
336pair where the strex instruction will return a flag indicating a
337successful store only if no other CPU has accessed the memory region
338since the ldrex.
339
340Traditionally TCG has generated a series of operations that work
341because they are within the context of a single translation block so
342will have completed before another CPU is scheduled. However with
343the ability to have multiple threads running to emulate multiple CPUs
344we will need to explicitly expose these semantics.
345
346DESIGN REQUIREMENTS:
347  - Support classic atomic instructions
348  - Support load/store exclusive (or load link/store conditional) pairs
349  - Generic enough infrastructure to support all guest architectures
350CURRENT OPEN QUESTIONS:
351  - How problematic is the ABA problem in general?
352
353(Current solution)
354
355The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
356can be used directly or combined to emulate other instructions like
357Arm's ldrex/strex instructions. While they are susceptible to the ABA
358problem so far common guests have not implemented patterns where
359this may be a problem - typically presenting a locking ABI which
360assumes cmpxchg like semantics.
361
362The code also includes a fall-back for cases where multi-threaded TCG
363ops can't work (e.g. guest atomic width > host atomic width). In this
364case an EXCP_ATOMIC exit occurs and the instruction is emulated with
365an exclusive lock which ensures all emulation is serialised.
366
367While the atomic helpers look good enough for now there may be a need
368to look at solutions that can more closely model the guest
369architectures semantics.
370