xref: /openbmc/qemu/docs/devel/atomics.rst (revision 3df4c288)
1.. _atomics-ref:
2
3=========================
4Atomic operations in QEMU
5=========================
6
7CPUs perform independent memory operations effectively in random order.
8but this can be a problem for CPU-CPU interaction (including interactions
9between QEMU and the guest).  Multi-threaded programs use various tools
10to instruct the compiler and the CPU to restrict the order to something
11that is consistent with the expectations of the programmer.
12
13The most basic tool is locking.  Mutexes, condition variables and
14semaphores are used in QEMU, and should be the default approach to
15synchronization.  Anything else is considerably harder, but it's
16also justified more often than one would like;
17the most performance-critical parts of QEMU in particular require
18a very low level approach to concurrency, involving memory barriers
19and atomic operations.  The semantics of concurrent memory accesses are governed
20by the C11 memory model.
21
22QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to
23provide better portability and a less verbose syntax.  ``qemu/atomic.h``
24provides macros that fall in three camps:
25
26- compiler barriers: ``barrier()``;
27
28- weak atomic access and manual memory barriers: ``qatomic_read()``,
29  ``qatomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
30  ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``,
31  ``smp_mb__before_rmw()``, ``smp_mb__after_rmw()``;
32
33- sequentially consistent atomic access: everything else.
34
35In general, use of ``qemu/atomic.h`` should be wrapped with more easily
36used data structures (e.g. the lock-free singly-linked list operations
37``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization
38primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``).  Bare use of
39atomic operations and memory barriers should be limited to inter-thread
40checking of flags and documented thoroughly.
41
42
43
44Compiler memory barrier
45=======================
46
47``barrier()`` prevents the compiler from moving the memory accesses on
48either side of it to the other side.  The compiler barrier has no direct
49effect on the CPU, which may then reorder things however it wishes.
50
51``barrier()`` is mostly used within ``qemu/atomic.h`` itself.  On some
52architectures, CPU guarantees are strong enough that blocking compiler
53optimizations already ensures the correct order of execution.  In this
54case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple
55compiler barriers.
56
57Still, ``barrier()`` can be useful when writing code that can be interrupted
58by signal handlers.
59
60
61Sequentially consistent atomic access
62=====================================
63
64Most of the operations in the ``qemu/atomic.h`` header ensure *sequential
65consistency*, where "the result of any execution is the same as if the
66operations of all the processors were executed in some sequential order,
67and the operations of each individual processor appear in this sequence
68in the order specified by its program".
69
70``qemu/atomic.h`` provides the following set of atomic read-modify-write
71operations::
72
73    void qatomic_inc(ptr)
74    void qatomic_dec(ptr)
75    void qatomic_add(ptr, val)
76    void qatomic_sub(ptr, val)
77    void qatomic_and(ptr, val)
78    void qatomic_or(ptr, val)
79
80    typeof(*ptr) qatomic_fetch_inc(ptr)
81    typeof(*ptr) qatomic_fetch_dec(ptr)
82    typeof(*ptr) qatomic_fetch_add(ptr, val)
83    typeof(*ptr) qatomic_fetch_sub(ptr, val)
84    typeof(*ptr) qatomic_fetch_and(ptr, val)
85    typeof(*ptr) qatomic_fetch_or(ptr, val)
86    typeof(*ptr) qatomic_fetch_xor(ptr, val)
87    typeof(*ptr) qatomic_fetch_inc_nonzero(ptr)
88    typeof(*ptr) qatomic_xchg(ptr, val)
89    typeof(*ptr) qatomic_cmpxchg(ptr, old, new)
90
91all of which return the old value of ``*ptr``.  These operations are
92polymorphic; they operate on any type that is as wide as a pointer or
93smaller.
94
95Similar operations return the new value of ``*ptr``::
96
97    typeof(*ptr) qatomic_inc_fetch(ptr)
98    typeof(*ptr) qatomic_dec_fetch(ptr)
99    typeof(*ptr) qatomic_add_fetch(ptr, val)
100    typeof(*ptr) qatomic_sub_fetch(ptr, val)
101    typeof(*ptr) qatomic_and_fetch(ptr, val)
102    typeof(*ptr) qatomic_or_fetch(ptr, val)
103    typeof(*ptr) qatomic_xor_fetch(ptr, val)
104
105``qemu/atomic.h`` also provides an optimized shortcut for
106``qatomic_set`` followed by ``smp_mb``::
107
108    void         qatomic_set_mb(ptr, val)
109
110
111Weak atomic access and manual memory barriers
112=============================================
113
114Compared to sequentially consistent atomic access, programming with
115weaker consistency models can be considerably more complicated.
116The only guarantees that you can rely upon in this case are:
117
118- atomic accesses will not cause data races (and hence undefined behavior);
119  ordinary accesses instead cause data races if they are concurrent with
120  other accesses of which at least one is a write.  In order to ensure this,
121  the compiler will not optimize accesses out of existence, create unsolicited
122  accesses, or perform other similar optimizations.
123
124- acquire operations will appear to happen, with respect to the other
125  components of the system, before all the LOAD or STORE operations
126  specified afterwards.
127
128- release operations will appear to happen, with respect to the other
129  components of the system, after all the LOAD or STORE operations
130  specified before.
131
132- release operations will *synchronize with* acquire operations;
133  see :ref:`acqrel` for a detailed explanation.
134
135When using this model, variables are accessed with:
136
137- ``qatomic_read()`` and ``qatomic_set()``; these prevent the compiler from
138  optimizing accesses out of existence and creating unsolicited
139  accesses, but do not otherwise impose any ordering on loads and
140  stores: both the compiler and the processor are free to reorder
141  them.
142
143- ``qatomic_load_acquire()``, which guarantees the LOAD to appear to
144  happen, with respect to the other components of the system,
145  before all the LOAD or STORE operations specified afterwards.
146  Operations coming before ``qatomic_load_acquire()`` can still be
147  reordered after it.
148
149- ``qatomic_store_release()``, which guarantees the STORE to appear to
150  happen, with respect to the other components of the system,
151  after all the LOAD or STORE operations specified before.
152  Operations coming after ``qatomic_store_release()`` can still be
153  reordered before it.
154
155Restrictions to the ordering of accesses can also be specified
156using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
157``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``.
158
159Memory barriers control the order of references to shared memory.
160They come in six kinds:
161
162- ``smp_rmb()`` guarantees that all the LOAD operations specified before
163  the barrier will appear to happen before all the LOAD operations
164  specified after the barrier with respect to the other components of
165  the system.
166
167  In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not
168  required to have any effect on stores.
169
170- ``smp_wmb()`` guarantees that all the STORE operations specified before
171  the barrier will appear to happen before all the STORE operations
172  specified after the barrier with respect to the other components of
173  the system.
174
175  In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not
176  required to have any effect on loads.
177
178- ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before
179  the barrier will appear to happen before all the LOAD or STORE operations
180  specified after the barrier with respect to the other components of
181  the system.
182
183- ``smp_mb_release()`` guarantees that all the STORE operations specified *after*
184  the barrier will appear to happen after all the LOAD or STORE operations
185  specified *before* the barrier with respect to the other components of
186  the system.
187
188- ``smp_mb()`` guarantees that all the LOAD and STORE operations specified
189  before the barrier will appear to happen before all the LOAD and
190  STORE operations specified after the barrier with respect to the other
191  components of the system.
192
193  ``smp_mb()`` puts a partial ordering on both loads and stores.  It is
194  stronger than both a read and a write memory barrier; it implies both
195  ``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs
196  coming before the barrier from overtaking LOADs coming after the
197  barrier and vice versa.
198
199- ``smp_read_barrier_depends()`` is a weaker kind of read barrier.  On
200  most processors, whenever two loads are performed such that the
201  second depends on the result of the first (e.g., the first load
202  retrieves the address to which the second load will be directed),
203  the processor will guarantee that the first LOAD will appear to happen
204  before the second with respect to the other components of the system.
205  Therefore, unlike ``smp_rmb()`` or ``qatomic_load_acquire()``,
206  ``smp_read_barrier_depends()`` can be just a compiler barrier on
207  weakly-ordered architectures such as Arm or PPC[#]_.
208
209  Note that the first load really has to have a _data_ dependency and not
210  a control dependency.  If the address for the second load is dependent
211  on the first load, but the dependency is through a conditional rather
212  than actually loading the address itself, then it's a _control_
213  dependency and a full read barrier or better is required.
214
215.. [#] The DEC Alpha is an exception, because ``smp_read_barrier_depends()``
216   needs a processor barrier.  On strongly-ordered architectures such
217   as x86 or s390, ``smp_rmb()`` and ``qatomic_load_acquire()`` can
218   also be compiler barriers only.
219
220Memory barriers and ``qatomic_load_acquire``/``qatomic_store_release`` are
221mostly used when a data structure has one thread that is always a writer
222and one thread that is always a reader:
223
224    +----------------------------------+----------------------------------+
225    | thread 1                         | thread 2                         |
226    +==================================+==================================+
227    | ::                               | ::                               |
228    |                                  |                                  |
229    |   qatomic_store_release(&a, x);  |   y = qatomic_load_acquire(&b);  |
230    |   qatomic_store_release(&b, y);  |   x = qatomic_load_acquire(&a);  |
231    +----------------------------------+----------------------------------+
232
233In this case, correctness is easy to check for using the "pairing"
234trick that is explained below.
235
236Sometimes, a thread is accessing many variables that are otherwise
237unrelated to each other (for example because, apart from the current
238thread, exactly one other thread will read or write each of these
239variables).  In this case, it is possible to "hoist" the barriers
240outside a loop.  For example:
241
242    +------------------------------------------+----------------------------------+
243    | before                                   | after                            |
244    +==========================================+==================================+
245    | ::                                       | ::                               |
246    |                                          |                                  |
247    |   n = 0;                                 |   n = 0;                         |
248    |   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       |
249    |     n += qatomic_load_acquire(&a[i]);    |     n += qatomic_read(&a[i]);    |
250    |                                          |   smp_mb_acquire();              |
251    +------------------------------------------+----------------------------------+
252    | ::                                       | ::                               |
253    |                                          |                                  |
254    |                                          |   smp_mb_release();              |
255    |   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       |
256    |     qatomic_store_release(&a[i], false); |     qatomic_set(&a[i], false);   |
257    +------------------------------------------+----------------------------------+
258
259Splitting a loop can also be useful to reduce the number of barriers:
260
261    +------------------------------------------+----------------------------------+
262    | before                                   | after                            |
263    +==========================================+==================================+
264    | ::                                       | ::                               |
265    |                                          |                                  |
266    |   n = 0;                                 |     smp_mb_release();            |
267    |   for (i = 0; i < 10; i++) {             |     for (i = 0; i < 10; i++)     |
268    |     qatomic_store_release(&a[i], false); |       qatomic_set(&a[i], false); |
269    |     smp_mb();                            |     smb_mb();                    |
270    |     n += qatomic_read(&b[i]);            |     n = 0;                       |
271    |   }                                      |     for (i = 0; i < 10; i++)     |
272    |                                          |       n += qatomic_read(&b[i]);  |
273    +------------------------------------------+----------------------------------+
274
275In this case, a ``smp_mb_release()`` is also replaced with a (possibly cheaper, and clearer
276as well) ``smp_wmb()``:
277
278    +------------------------------------------+----------------------------------+
279    | before                                   | after                            |
280    +==========================================+==================================+
281    | ::                                       | ::                               |
282    |                                          |                                  |
283    |                                          |     smp_mb_release();            |
284    |   for (i = 0; i < 10; i++) {             |     for (i = 0; i < 10; i++)     |
285    |     qatomic_store_release(&a[i], false); |       qatomic_set(&a[i], false); |
286    |     qatomic_store_release(&b[i], false); |     smb_wmb();                   |
287    |   }                                      |     for (i = 0; i < 10; i++)     |
288    |                                          |       qatomic_set(&b[i], false); |
289    +------------------------------------------+----------------------------------+
290
291
292.. _acqrel:
293
294Acquire/release pairing and the *synchronizes-with* relation
295------------------------------------------------------------
296
297Atomic operations other than ``qatomic_set()`` and ``qatomic_read()`` have
298either *acquire* or *release* semantics [#rmw]_.  This has two effects:
299
300.. [#rmw] Read-modify-write operations can have both---acquire applies to the
301          read part, and release to the write.
302
303- within a thread, they are ordered either before subsequent operations
304  (for acquire) or after previous operations (for release).
305
306- if a release operation in one thread *synchronizes with* an acquire operation
307  in another thread, the ordering constraints propagates from the first to the
308  second thread.  That is, everything before the release operation in the
309  first thread is guaranteed to *happen before* everything after the
310  acquire operation in the second thread.
311
312The concept of acquire and release semantics is not exclusive to atomic
313operations; almost all higher-level synchronization primitives also have
314acquire or release semantics.  For example:
315
316- ``pthread_mutex_lock`` has acquire semantics, ``pthread_mutex_unlock`` has
317  release semantics and synchronizes with a ``pthread_mutex_lock`` for the
318  same mutex.
319
320- ``pthread_cond_signal`` and ``pthread_cond_broadcast`` have release semantics;
321  ``pthread_cond_wait`` has both release semantics (synchronizing with
322  ``pthread_mutex_lock``) and acquire semantics (synchronizing with
323  ``pthread_mutex_unlock`` and signaling of the condition variable).
324
325- ``pthread_create`` has release semantics and synchronizes with the start
326  of the new thread; ``pthread_join`` has acquire semantics and synchronizes
327  with the exiting of the thread.
328
329- ``qemu_event_set`` has release semantics, ``qemu_event_wait`` has
330  acquire semantics.
331
332For example, in the following example there are no atomic accesses, but still
333thread 2 is relying on the *synchronizes-with* relation between ``pthread_exit``
334(release) and ``pthread_join`` (acquire):
335
336      +----------------------+-------------------------------+
337      | thread 1             | thread 2                      |
338      +======================+===============================+
339      | ::                   | ::                            |
340      |                      |                               |
341      |   *a = 1;            |                               |
342      |   pthread_exit(a);   |   pthread_join(thread1, &a);  |
343      |                      |   x = *a;                     |
344      +----------------------+-------------------------------+
345
346Synchronization between threads basically descends from this pairing of
347a release operation and an acquire operation.  Therefore, atomic operations
348other than ``qatomic_set()`` and ``qatomic_read()`` will almost always be
349paired with another operation of the opposite kind: an acquire operation
350will pair with a release operation and vice versa.  This rule of thumb is
351extremely useful; in the case of QEMU, however, note that the other
352operation may actually be in a driver that runs in the guest!
353
354``smp_read_barrier_depends()``, ``smp_rmb()``, ``smp_mb_acquire()``,
355``qatomic_load_acquire()`` and ``qatomic_rcu_read()`` all count
356as acquire operations.  ``smp_wmb()``, ``smp_mb_release()``,
357``qatomic_store_release()`` and ``qatomic_rcu_set()`` all count as release
358operations.  ``smp_mb()`` counts as both acquire and release, therefore
359it can pair with any other atomic operation.  Here is an example:
360
361      +----------------------+------------------------------+
362      | thread 1             | thread 2                     |
363      +======================+==============================+
364      | ::                   | ::                           |
365      |                      |                              |
366      |   qatomic_set(&a, 1);|                              |
367      |   smp_wmb();         |                              |
368      |   qatomic_set(&b, 2);|   x = qatomic_read(&b);      |
369      |                      |   smp_rmb();                 |
370      |                      |   y = qatomic_read(&a);      |
371      +----------------------+------------------------------+
372
373Note that a load-store pair only counts if the two operations access the
374same variable: that is, a store-release on a variable ``x`` *synchronizes
375with* a load-acquire on a variable ``x``, while a release barrier
376synchronizes with any acquire operation.  The following example shows
377correct synchronization:
378
379      +--------------------------------+--------------------------------+
380      | thread 1                       | thread 2                       |
381      +================================+================================+
382      | ::                             | ::                             |
383      |                                |                                |
384      |   qatomic_set(&a, 1);          |                                |
385      |   qatomic_store_release(&b, 2);|   x = qatomic_load_acquire(&b);|
386      |                                |   y = qatomic_read(&a);        |
387      +--------------------------------+--------------------------------+
388
389Acquire and release semantics of higher-level primitives can also be
390relied upon for the purpose of establishing the *synchronizes with*
391relation.
392
393Note that the "writing" thread is accessing the variables in the
394opposite order as the "reading" thread.  This is expected: stores
395before a release operation will normally match the loads after
396the acquire operation, and vice versa.  In fact, this happened already
397in the ``pthread_exit``/``pthread_join`` example above.
398
399Finally, this more complex example has more than two accesses and data
400dependency barriers.  It also does not use atomic accesses whenever there
401cannot be a data race:
402
403      +----------------------+------------------------------+
404      | thread 1             | thread 2                     |
405      +======================+==============================+
406      | ::                   | ::                           |
407      |                      |                              |
408      |   b[2] = 1;          |                              |
409      |   smp_wmb();         |                              |
410      |   x->i = 2;          |                              |
411      |   smp_wmb();         |                              |
412      |   qatomic_set(&a, x);|  x = qatomic_read(&a);       |
413      |                      |  smp_read_barrier_depends(); |
414      |                      |  y = x->i;                   |
415      |                      |  smp_read_barrier_depends(); |
416      |                      |  z = b[y];                   |
417      +----------------------+------------------------------+
418
419Comparison with Linux kernel primitives
420=======================================
421
422Here is a list of differences between Linux kernel atomic operations
423and memory barriers, and the equivalents in QEMU:
424
425- atomic operations in Linux are always on a 32-bit int type and
426  use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic
427  and use normal C types.
428
429- Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee
430  at all. Linux 4.1 updated them to implement volatile
431  semantics via ``ACCESS_ONCE`` (or the more recent ``READ``/``WRITE_ONCE``).
432
433  QEMU's ``qatomic_read`` and ``qatomic_set`` implement C11 atomic relaxed
434  semantics if the compiler supports it, and volatile semantics otherwise.
435  Both semantics prevent the compiler from doing certain transformations;
436  the difference is that atomic accesses are guaranteed to be atomic,
437  while volatile accesses aren't. Thus, in the volatile case we just cross
438  our fingers hoping that the compiler will generate atomic accesses,
439  since we assume the variables passed are machine-word sized and
440  properly aligned.
441
442  No barriers are implied by ``qatomic_read`` and ``qatomic_set`` in either
443  Linux or QEMU.
444
445- atomic read-modify-write operations in Linux are of three kinds:
446
447         ===================== =========================================
448         ``atomic_OP``         returns void
449         ``atomic_OP_return``  returns new value of the variable
450         ``atomic_fetch_OP``   returns the old value of the variable
451         ``atomic_cmpxchg``    returns the old value of the variable
452         ===================== =========================================
453
454  In QEMU, the second kind is named ``atomic_OP_fetch``.
455
456- different atomic read-modify-write operations in Linux imply
457  a different set of memory barriers. In QEMU, all of them enforce
458  sequential consistency: there is a single order in which the
459  program sees them happen.
460
461- however, according to the C11 memory model that QEMU uses, this order
462  does not propagate to other memory accesses on either side of the
463  read-modify-write operation.  As far as those are concerned, the
464  operation consist of just a load-acquire followed by a store-release.
465  Stores that precede the RMW operation, and loads that follow it, can
466  still be reordered and will happen *in the middle* of the read-modify-write
467  operation!
468
469  Therefore, the following example is correct in Linux but not in QEMU:
470
471      +----------------------------------+--------------------------------+
472      | Linux (correct)                  | QEMU (incorrect)               |
473      +==================================+================================+
474      | ::                               | ::                             |
475      |                                  |                                |
476      |   a = atomic_fetch_add(&x, 2);   |   a = qatomic_fetch_add(&x, 2);|
477      |   b = READ_ONCE(&y);             |   b = qatomic_read(&y);        |
478      +----------------------------------+--------------------------------+
479
480  because the read of ``y`` can be moved (by either the processor or the
481  compiler) before the write of ``x``.
482
483  Fixing this requires a full memory barrier between the write of ``x`` and
484  the read of ``y``.  QEMU provides ``smp_mb__before_rmw()`` and
485  ``smp_mb__after_rmw()``; they act both as an optimization,
486  avoiding the memory barrier on processors where it is unnecessary,
487  and as a clarification of this corner case of the C11 memory model:
488
489      +--------------------------------+
490      | QEMU (correct)                 |
491      +================================+
492      | ::                             |
493      |                                |
494      |   a = qatomic_fetch_add(&x, 2);|
495      |   smp_mb__after_rmw();         |
496      |   b = qatomic_read(&y);        |
497      +--------------------------------+
498
499  In the common case where only one thread writes ``x``, it is also possible
500  to write it like this:
501
502      +--------------------------------+
503      | QEMU (correct)                 |
504      +================================+
505      | ::                             |
506      |                                |
507      |   a = qatomic_read(&x);        |
508      |   qatomic_set_mb(&x, a + 2);   |
509      |   b = qatomic_read(&y);        |
510      +--------------------------------+
511
512Sources
513=======
514
515- ``Documentation/memory-barriers.txt`` from the Linux kernel
516