.. _atomics-ref:

=========================
Atomic operations in QEMU
=========================

CPUs perform independent memory operations effectively in random order,
but this can be a problem for CPU-CPU interaction (including interactions
between QEMU and the guest).  Multi-threaded programs use various tools
to instruct the compiler and the CPU to restrict the order to something
that is consistent with the expectations of the programmer.

The most basic tool is locking.  Mutexes, condition variables and
semaphores are used in QEMU, and should be the default approach to
synchronization.  Anything else is considerably harder, but it's
also justified more often than one would like; the most performance-critical
parts of QEMU in particular require a very low-level approach to
concurrency, involving memory barriers and atomic operations.  The
semantics of concurrent memory accesses are governed by the C11
memory model.

QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to
provide better portability and a less verbose syntax.  ``qemu/atomic.h``
provides macros that fall in three camps:

- compiler barriers: ``barrier()``;

- weak atomic access and manual memory barriers: ``qatomic_read()``,
  ``qatomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
  ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``,
  ``smp_mb__before_rmw()``, ``smp_mb__after_rmw()``;

- sequentially consistent atomic access: everything else.

In general, use of ``qemu/atomic.h`` should be wrapped with more easily
used data structures (e.g. the lock-free singly-linked list operations
``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization
primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``).  Bare use of
atomic operations and memory barriers should be limited to inter-thread
checking of flags and documented thoroughly.


Compiler memory barrier
=======================

``barrier()`` prevents the compiler from moving the memory accesses on
either side of it to the other side.  The compiler barrier has no direct
effect on the CPU, which may then reorder things however it wishes.

``barrier()`` is mostly used within ``qemu/atomic.h`` itself.  On some
architectures, CPU guarantees are strong enough that blocking compiler
optimizations already ensures the correct order of execution.  In this
case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple
compiler barriers.

Still, ``barrier()`` can be useful when writing code that can be interrupted
by signal handlers.
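
For instance, here is a minimal, hedged sketch of that pattern.  The
variable and helper names are invented for the example, and the flag uses
``volatile sig_atomic_t`` from ``<signal.h>``.  Because the signal handler
runs on the same thread that it interrupts, only compiler reordering
matters, so a compiler barrier is enough here::

  static int req_data;
  static volatile sig_atomic_t req_pending;

  static void prepare_request(int data)
  {
      req_data = data;
      /* Keep the compiler from moving the flag store before the data
       * store.  No CPU barrier is needed, because the signal handler
       * runs on the same CPU, at a single point in program order.
       */
      barrier();
      req_pending = 1;
  }

  static void request_handler(int signum)     /* signal handler */
  {
      if (req_pending) {
          req_pending = 0;
          consume(req_data);                  /* hypothetical helper */
      }
  }
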

Sequentially consistent atomic access
=====================================

Most of the operations in the ``qemu/atomic.h`` header ensure *sequential
consistency*, where "the result of any execution is the same as if the
operations of all the processors were executed in some sequential order,
and the operations of each individual processor appear in this sequence
in the order specified by its program".

``qemu/atomic.h`` provides the following set of atomic read-modify-write
operations::

  void qatomic_inc(ptr)
  void qatomic_dec(ptr)
  void qatomic_add(ptr, val)
  void qatomic_sub(ptr, val)
  void qatomic_and(ptr, val)
  void qatomic_or(ptr, val)

  typeof(*ptr) qatomic_fetch_inc(ptr)
  typeof(*ptr) qatomic_fetch_dec(ptr)
  typeof(*ptr) qatomic_fetch_add(ptr, val)
  typeof(*ptr) qatomic_fetch_sub(ptr, val)
  typeof(*ptr) qatomic_fetch_and(ptr, val)
  typeof(*ptr) qatomic_fetch_or(ptr, val)
  typeof(*ptr) qatomic_fetch_xor(ptr, val)
  typeof(*ptr) qatomic_fetch_inc_nonzero(ptr)
  typeof(*ptr) qatomic_xchg(ptr, val)
  typeof(*ptr) qatomic_cmpxchg(ptr, old, new)

where the ``typeof(*ptr)`` variants return the old value of ``*ptr``.
These operations are polymorphic; they operate on any type that is as
wide as a pointer or smaller.

Similar operations return the new value of ``*ptr``::

  typeof(*ptr) qatomic_inc_fetch(ptr)
  typeof(*ptr) qatomic_dec_fetch(ptr)
  typeof(*ptr) qatomic_add_fetch(ptr, val)
  typeof(*ptr) qatomic_sub_fetch(ptr, val)
  typeof(*ptr) qatomic_and_fetch(ptr, val)
  typeof(*ptr) qatomic_or_fetch(ptr, val)
  typeof(*ptr) qatomic_xor_fetch(ptr, val)
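
As a brief illustration of how the read-modify-write helpers are typically
used, here is a hedged reference-counting sketch; the ``Obj`` type and the
two functions are invented for the example and are not part of QEMU's API::

  typedef struct Obj {
      int refcount;               /* touched only through qatomic_* */
      /* ... payload ... */
  } Obj;

  static void obj_ref(Obj *o)
  {
      qatomic_inc(&o->refcount);
  }

  static void obj_unref(Obj *o)
  {
      /* qatomic_fetch_dec() returns the *old* value and is a single
       * atomic step, so exactly one thread can observe the counter
       * going from 1 to 0 and free the object.
       */
      if (qatomic_fetch_dec(&o->refcount) == 1) {
          g_free(o);
      }
  }

Real code usually needs more than this (for example RCU, so that no other
thread can still be dereferencing the object when it is freed); the sketch
only shows the counter manipulation.
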

``qemu/atomic.h`` also provides loads and stores that cannot be reordered
with each other::

  typeof(*ptr) qatomic_mb_read(ptr)
  void         qatomic_mb_set(ptr, val)

However these do not provide sequential consistency and, in particular,
they do not participate in the total ordering enforced by
sequentially-consistent operations.  For this reason they are deprecated.
They should instead be replaced with any of the following (ordered from
easiest to hardest):

- accesses inside a mutex or spinlock

- lightweight synchronization primitives such as ``QemuEvent``

- RCU operations (``qatomic_rcu_read``, ``qatomic_rcu_set``) when publishing
  or accessing a new version of a data structure

- other atomic accesses: ``qatomic_read`` and ``qatomic_load_acquire`` for
  loads, ``qatomic_set`` and ``qatomic_store_release`` for stores, ``smp_mb``
  to forbid reordering subsequent loads before a store.


Weak atomic access and manual memory barriers
=============================================

Compared to sequentially consistent atomic access, programming with
weaker consistency models can be considerably more complicated.
The only guarantees that you can rely upon in this case are:

- atomic accesses will not cause data races (and hence undefined behavior);
  ordinary accesses instead cause data races if they are concurrent with
  other accesses of which at least one is a write.  In order to ensure this,
  the compiler will not optimize accesses out of existence, create unsolicited
  accesses, or perform other similar optimizations.

- acquire operations will appear to happen, with respect to the other
  components of the system, before all the LOAD or STORE operations
  specified afterwards.

- release operations will appear to happen, with respect to the other
  components of the system, after all the LOAD or STORE operations
  specified before.

- release operations will *synchronize with* acquire operations;
  see :ref:`acqrel` for a detailed explanation.

When using this model, variables are accessed with the following macros
(a short sketch of their combined use follows the list):

- ``qatomic_read()`` and ``qatomic_set()``; these prevent the compiler from
  optimizing accesses out of existence and creating unsolicited
  accesses, but do not otherwise impose any ordering on loads and
  stores: both the compiler and the processor are free to reorder
  them.

- ``qatomic_load_acquire()``, which guarantees the LOAD to appear to
  happen, with respect to the other components of the system,
  before all the LOAD or STORE operations specified afterwards.
  Operations coming before ``qatomic_load_acquire()`` can still be
  reordered after it.

- ``qatomic_store_release()``, which guarantees the STORE to appear to
  happen, with respect to the other components of the system,
  after all the LOAD or STORE operations specified before.
  Operations coming after ``qatomic_store_release()`` can still be
  reordered before it.
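
Here is that sketch: a hedged single-producer/single-consumer hand-off in
which only the names are invented; everything else uses the accessors just
described::

  static int msg_payload;
  static bool msg_ready;

  /* Producer: runs once, in one thread.  */
  static void publish(int value)
  {
      msg_payload = value;                        /* ordinary store */
      qatomic_store_release(&msg_ready, true);    /* release store  */
  }

  /* Consumer: runs in another thread.  */
  static bool try_consume(int *value)
  {
      if (!qatomic_load_acquire(&msg_ready)) {
          return false;
      }
      /* The acquire load synchronizes with the release store above,
       * so the ordinary read of the payload cannot race with the
       * ordinary write in publish().
       */
      *value = msg_payload;
      return true;
  }
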

Restrictions to the ordering of accesses can also be specified
using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``.

Memory barriers control the order of references to shared memory.
They come in six kinds:

- ``smp_rmb()`` guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not
  required to have any effect on stores.

- ``smp_wmb()`` guarantees that all the STORE operations specified before
  the barrier will appear to happen before all the STORE operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not
  required to have any effect on loads.

- ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD or STORE operations
  specified after the barrier with respect to the other components of
  the system.

- ``smp_mb_release()`` guarantees that all the STORE operations specified *after*
  the barrier will appear to happen after all the LOAD or STORE operations
  specified *before* the barrier with respect to the other components of
  the system.

- ``smp_mb()`` guarantees that all the LOAD and STORE operations specified
  before the barrier will appear to happen before all the LOAD and
  STORE operations specified after the barrier with respect to the other
  components of the system.

  ``smp_mb()`` puts a partial ordering on both loads and stores.  It is
  stronger than both a read and a write memory barrier; it implies both
  ``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs
  coming before the barrier from overtaking LOADs coming after the
  barrier and vice versa.

- ``smp_read_barrier_depends()`` is a weaker kind of read barrier.  On
  most processors, whenever two loads are performed such that the
  second depends on the result of the first (e.g., the first load
  retrieves the address to which the second load will be directed),
  the processor will guarantee that the first LOAD will appear to happen
  before the second with respect to the other components of the system.
  Therefore, unlike ``smp_rmb()`` or ``qatomic_load_acquire()``,
  ``smp_read_barrier_depends()`` can be just a compiler barrier on
  weakly-ordered architectures such as Arm or PPC [#]_.

  Note that the first load really has to have a *data* dependency and not
  a control dependency.  If the address for the second load is dependent
  on the first load, but the dependency is through a conditional rather
  than actually loading the address itself, then it's a *control*
  dependency and a full read barrier or better is required.

.. [#] The DEC Alpha is an exception, because ``smp_read_barrier_depends()``
   needs a processor barrier.  On strongly-ordered architectures such
   as x86 or s390, ``smp_rmb()`` and ``qatomic_load_acquire()`` can
   also be compiler barriers only.

Memory barriers and ``qatomic_load_acquire``/``qatomic_store_release`` are
mostly used when a data structure has one thread that is always a writer
and one thread that is always a reader:

  +----------------------------------+----------------------------------+
  | thread 1                         | thread 2                         |
  +==================================+==================================+
  | ::                               | ::                               |
  |                                  |                                  |
  |   qatomic_store_release(&a, x);  |   y = qatomic_load_acquire(&b);  |
  |   qatomic_store_release(&b, y);  |   x = qatomic_load_acquire(&a);  |
  +----------------------------------+----------------------------------+

In this case, correctness is easy to check using the "pairing" trick
that is explained below.

Sometimes, a thread is accessing many variables that are otherwise
unrelated to each other (for example because, apart from the current
thread, exactly one other thread will read or write each of these
variables).  In this case, it is possible to "hoist" the barriers
outside a loop.  For example:

  +------------------------------------------+----------------------------------+
  | before                                   | after                            |
  +==========================================+==================================+
  | ::                                       | ::                               |
  |                                          |                                  |
  |   n = 0;                                 |   n = 0;                         |
  |   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       |
  |     n += qatomic_load_acquire(&a[i]);    |     n += qatomic_read(&a[i]);    |
  |                                          |   smp_mb_acquire();              |
  +------------------------------------------+----------------------------------+
  | ::                                       | ::                               |
  |                                          |                                  |
  |                                          |   smp_mb_release();              |
  |   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       |
  |     qatomic_store_release(&a[i], false); |     qatomic_set(&a[i], false);   |
  +------------------------------------------+----------------------------------+

Splitting a loop can also be useful to reduce the number of barriers:

  +------------------------------------------+----------------------------------+
  | before                                   | after                            |
  +==========================================+==================================+
  | ::                                       | ::                               |
  |                                          |                                  |
  |   n = 0;                                 |   smp_mb_release();              |
  |   for (i = 0; i < 10; i++) {             |   for (i = 0; i < 10; i++)       |
  |     qatomic_store_release(&a[i], false); |     qatomic_set(&a[i], false);   |
  |     smp_mb();                            |   smp_mb();                      |
  |     n += qatomic_read(&b[i]);            |   n = 0;                         |
  |   }                                      |   for (i = 0; i < 10; i++)       |
  |                                          |     n += qatomic_read(&b[i]);    |
  +------------------------------------------+----------------------------------+

In this case, a ``smp_mb_release()`` is also replaced with a (possibly
cheaper, and clearer as well) ``smp_wmb()``:

  +------------------------------------------+----------------------------------+
  | before                                   | after                            |
  +==========================================+==================================+
  | ::                                       | ::                               |
  |                                          |                                  |
  |                                          |   smp_mb_release();              |
  |   for (i = 0; i < 10; i++) {             |   for (i = 0; i < 10; i++)       |
  |     qatomic_store_release(&a[i], false); |     qatomic_set(&a[i], false);   |
  |     qatomic_store_release(&b[i], false); |   smp_wmb();                     |
  |   }                                      |   for (i = 0; i < 10; i++)       |
  |                                          |     qatomic_set(&b[i], false);   |
  +------------------------------------------+----------------------------------+


.. _acqrel:

Acquire/release pairing and the *synchronizes-with* relation
------------------------------------------------------------

Atomic operations other than ``qatomic_set()`` and ``qatomic_read()`` have
either *acquire* or *release* semantics [#rmw]_.  This has two effects:

.. [#rmw] Read-modify-write operations can have both; acquire applies to the
   read part, and release to the write.

- within a thread, they are ordered either before subsequent operations
  (for acquire) or after previous operations (for release).

- if a release operation in one thread *synchronizes with* an acquire operation
  in another thread, the ordering constraints propagate from the first to the
  second thread.  That is, everything before the release operation in the
  first thread is guaranteed to *happen before* everything after the
  acquire operation in the second thread.

The concept of acquire and release semantics is not exclusive to atomic
operations; almost all higher-level synchronization primitives also have
acquire or release semantics.  For example:

- ``pthread_mutex_lock`` has acquire semantics, ``pthread_mutex_unlock`` has
  release semantics and synchronizes with a ``pthread_mutex_lock`` for the
  same mutex.

- ``pthread_cond_signal`` and ``pthread_cond_broadcast`` have release semantics;
  ``pthread_cond_wait`` has both release semantics (synchronizing with
  ``pthread_mutex_lock``) and acquire semantics (synchronizing with
  ``pthread_mutex_unlock`` and signaling of the condition variable).

- ``pthread_create`` has release semantics and synchronizes with the start
  of the new thread; ``pthread_join`` has acquire semantics and synchronizes
  with the exiting of the thread.

- ``qemu_event_set`` has release semantics, ``qemu_event_wait`` has
  acquire semantics (see the sketch just below).
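
Here is that sketch: a hedged hand-off through a ``QemuEvent`` (the data and
function names are invented, and the event is assumed to have been
initialized elsewhere with ``qemu_event_init()``)::

  static QemuEvent result_ready;
  static int result;

  /* Worker thread: produce the result, then signal it.  */
  static void *worker_fn(void *opaque)
  {
      result = compute();               /* hypothetical helper */
      qemu_event_set(&result_ready);    /* release */
      return NULL;
  }

  /* Main thread: wait, then read the result with a plain access.  */
  static int wait_for_result(void)
  {
      qemu_event_wait(&result_ready);   /* acquire, pairs with the set */
      return result;
  }

No explicit barrier or atomic access is needed around ``result``, because
the release/acquire pairing of ``qemu_event_set()`` and ``qemu_event_wait()``
already orders the write before the read.
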

In the following example there are no atomic accesses, but thread 2 still
relies on the *synchronizes-with* relation between ``pthread_exit``
(release) and ``pthread_join`` (acquire):

  +----------------------+-------------------------------+
  | thread 1             | thread 2                      |
  +======================+===============================+
  | ::                   | ::                            |
  |                      |                               |
  |   *a = 1;            |                               |
  |   pthread_exit(a);   |   pthread_join(thread1, &a);  |
  |                      |   x = *a;                     |
  +----------------------+-------------------------------+

Synchronization between threads basically descends from this pairing of
a release operation and an acquire operation.  Therefore, atomic operations
other than ``qatomic_set()`` and ``qatomic_read()`` will almost always be
paired with another operation of the opposite kind: an acquire operation
will pair with a release operation and vice versa.  This rule of thumb is
extremely useful; in the case of QEMU, however, note that the other
operation may actually be in a driver that runs in the guest!

``smp_read_barrier_depends()``, ``smp_rmb()``, ``smp_mb_acquire()``,
``qatomic_load_acquire()`` and ``qatomic_rcu_read()`` all count
as acquire operations.  ``smp_wmb()``, ``smp_mb_release()``,
``qatomic_store_release()`` and ``qatomic_rcu_set()`` all count as release
operations.  ``smp_mb()`` counts as both acquire and release, therefore
it can pair with any other atomic operation.  Here is an example:

  +----------------------+------------------------------+
  | thread 1             | thread 2                     |
  +======================+==============================+
  | ::                   | ::                           |
  |                      |                              |
  |   qatomic_set(&a, 1);|                              |
  |   smp_wmb();         |                              |
  |   qatomic_set(&b, 2);|   x = qatomic_read(&b);      |
  |                      |   smp_rmb();                 |
  |                      |   y = qatomic_read(&a);      |
  +----------------------+------------------------------+

Note that a load-store pair only counts if the two operations access the
same variable: that is, a store-release on a variable ``x`` *synchronizes
with* a load-acquire on a variable ``x``, while a release barrier
synchronizes with any acquire operation.  The following example shows
correct synchronization:

  +--------------------------------+--------------------------------+
  | thread 1                       | thread 2                       |
  +================================+================================+
  | ::                             | ::                             |
  |                                |                                |
  |   qatomic_set(&a, 1);          |                                |
  |   qatomic_store_release(&b, 2);|   x = qatomic_load_acquire(&b);|
  |                                |   y = qatomic_read(&a);        |
  +--------------------------------+--------------------------------+

Acquire and release semantics of higher-level primitives can also be
relied upon for the purpose of establishing the *synchronizes-with*
relation.

Note that the "writing" thread is accessing the variables in the
opposite order from the "reading" thread.  This is expected: stores
before a release operation will normally match the loads after
the acquire operation, and vice versa.  In fact, this happened already
in the ``pthread_exit``/``pthread_join`` example above.

Finally, this more complex example has more than two accesses and data
dependency barriers.  It also does not use atomic accesses whenever there
cannot be a data race:

  +----------------------+------------------------------+
  | thread 1             | thread 2                     |
  +======================+==============================+
  | ::                   | ::                           |
  |                      |                              |
  |   b[2] = 1;          |                              |
  |   smp_wmb();         |                              |
  |   x->i = 2;          |                              |
  |   smp_wmb();         |                              |
  |   qatomic_set(&a, x);|   x = qatomic_read(&a);      |
  |                      |   smp_read_barrier_depends();|
  |                      |   y = x->i;                  |
  |                      |   smp_read_barrier_depends();|
  |                      |   z = b[y];                  |
  +----------------------+------------------------------+
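
In practice, QEMU code rarely spells out the dependency barriers by hand:
the usual idiom is to publish a pointer with ``qatomic_rcu_set()`` and read
it with ``qatomic_rcu_read()``, which include the necessary barriers.  A
hedged sketch of that idiom follows; the structure and function names are
invented, and freeing the old version (which requires RCU) is omitted::

  typedef struct Config {
      int timeout_ms;
  } Config;

  static Config *current_config;

  /* Writer: fill in the new structure, then publish the pointer.  */
  static void set_timeout(int timeout_ms)
  {
      Config *new = g_new0(Config, 1);
      new->timeout_ms = timeout_ms;
      /* Orders the initialization before the pointer store.  */
      qatomic_rcu_set(&current_config, new);
  }

  /* Reader: the load of c->timeout_ms depends on the pointer load,
   * and qatomic_rcu_read() provides the data dependency ordering.
   */
  static int get_timeout(void)
  {
      Config *c = qatomic_rcu_read(&current_config);
      return c ? c->timeout_ms : -1;
  }

Readers would also need ``rcu_read_lock()``/``rcu_read_unlock()``, and the
writer an RCU grace period before freeing the previous structure; see the
RCU documentation for those details.
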

Comparison with Linux kernel primitives
=======================================

Here is a list of differences between Linux kernel atomic operations
and memory barriers, and the equivalents in QEMU:

- atomic operations in Linux are always on a 32-bit int type and
  use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic
  and use normal C types.

- Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee
  at all.  Linux 4.1 updated them to implement volatile
  semantics via ``ACCESS_ONCE`` (or the more recent ``READ_ONCE``/``WRITE_ONCE``).

  QEMU's ``qatomic_read`` and ``qatomic_set`` implement C11 atomic relaxed
  semantics if the compiler supports it, and volatile semantics otherwise.
  Both semantics prevent the compiler from doing certain transformations;
  the difference is that atomic accesses are guaranteed to be atomic,
  while volatile accesses aren't.  Thus, in the volatile case we just cross
  our fingers hoping that the compiler will generate atomic accesses,
  since we assume the variables passed are machine-word sized and
  properly aligned.

  No barriers are implied by ``qatomic_read`` and ``qatomic_set`` in either
  Linux or QEMU.

- atomic read-modify-write operations in Linux are of three kinds:

  ===================== =========================================
  ``atomic_OP``         returns void
  ``atomic_OP_return``  returns new value of the variable
  ``atomic_fetch_OP``   returns the old value of the variable
  ``atomic_cmpxchg``    returns the old value of the variable
  ===================== =========================================

  In QEMU, the second kind is named ``qatomic_OP_fetch``.

- different atomic read-modify-write operations in Linux imply
  a different set of memory barriers.  In QEMU, all of them enforce
  sequential consistency: there is a single order in which the
  program sees them happen.

- however, according to the C11 memory model that QEMU uses, this order
  does not propagate to other memory accesses on either side of the
  read-modify-write operation.  As far as those are concerned, the
  operation consists of just a load-acquire followed by a store-release.
  Stores that precede the RMW operation, and loads that follow it, can
  still be reordered and will happen *in the middle* of the read-modify-write
  operation!

  Therefore, the following example is correct in Linux but not in QEMU:

    +----------------------------------+--------------------------------+
    | Linux (correct)                  | QEMU (incorrect)               |
    +==================================+================================+
    | ::                               | ::                             |
    |                                  |                                |
    |   a = atomic_fetch_add(&x, 2);   |   a = qatomic_fetch_add(&x, 2);|
    |   b = READ_ONCE(&y);             |   b = qatomic_read(&y);        |
    +----------------------------------+--------------------------------+

  because the read of ``y`` can be moved (by either the processor or the
  compiler) before the write of ``x``.

  Fixing this requires a full memory barrier between the write of ``x`` and
  the read of ``y``.  QEMU provides ``smp_mb__before_rmw()`` and
  ``smp_mb__after_rmw()``; they act both as an optimization,
  avoiding the memory barrier on processors where it is unnecessary,
  and as a clarification of this corner case of the C11 memory model:

    +--------------------------------+
    | QEMU (correct)                 |
    +================================+
    | ::                             |
    |                                |
    |   a = qatomic_fetch_add(&x, 2);|
    |   smp_mb__after_rmw();         |
    |   b = qatomic_read(&y);        |
    +--------------------------------+

  In the common case where only one thread writes ``x``, it is also possible
  to write it like this:

    +--------------------------------+
    | QEMU (correct)                 |
    +================================+
    | ::                             |
    |                                |
    |   a = qatomic_read(&x);        |
    |   qatomic_set(&x, a + 2);      |
    |   smp_mb();                    |
    |   b = qatomic_read(&y);        |
    +--------------------------------+

Sources
=======

- ``Documentation/memory-barriers.txt`` from the Linux kernel