.. _atomics-ref:

=========================
Atomic operations in QEMU
=========================

CPUs perform independent memory operations effectively in random order,
but this can be a problem for CPU-CPU interaction (including interactions
between QEMU and the guest).  Multi-threaded programs use various tools
to instruct the compiler and the CPU to restrict the order to something
that is consistent with the expectations of the programmer.

The most basic tool is locking.  Mutexes, condition variables and
semaphores are used in QEMU, and should be the default approach to
synchronization.  Anything else is considerably harder, but it's
also justified more often than one would like; the most
performance-critical parts of QEMU in particular require a very
low-level approach to concurrency, involving memory barriers and
atomic operations.  The semantics of concurrent memory accesses are
governed by the C11 memory model.

QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to
provide better portability and a less verbose syntax.  ``qemu/atomic.h``
provides macros that fall in three camps:

- compiler barriers: ``barrier()``;

- weak atomic access and manual memory barriers: ``qatomic_read()``,
  ``qatomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
  ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``,
  ``smp_mb__before_rmw()``, ``smp_mb__after_rmw()``;

- sequentially consistent atomic access: everything else.

In general, use of ``qemu/atomic.h`` should be wrapped with more easily
used data structures (e.g. the lock-free singly-linked list operations
``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization
primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``).  Bare use of
atomic operations and memory barriers should be limited to inter-thread
checking of flags and documented thoroughly.


Compiler memory barrier
=======================

``barrier()`` prevents the compiler from moving the memory accesses on
either side of it to the other side.  The compiler barrier has no direct
effect on the CPU, which may then reorder things however it wishes.

``barrier()`` is mostly used within ``qemu/atomic.h`` itself.  On some
architectures, CPU guarantees are strong enough that blocking compiler
optimizations already ensures the correct order of execution.  In this
case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple
compiler barriers.

Still, ``barrier()`` can be useful when writing code that can be interrupted
by signal handlers.
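
For instance, a flag tested by a signal handler can be set after the data it
guards with just a compiler barrier in between, because the handler interrupts
the very thread that set the flag.  The following is only a sketch, with
made-up variable and function names, assuming the usual QEMU includes::

    #include <signal.h>

    static int pending_request;          /* consumed by the handler */
    static volatile sig_atomic_t ready;  /* tested by the handler   */
    static int last_seen;

    static void queue_request(int req)
    {
        pending_request = req;
        barrier();   /* keep the store above from moving past the flag */
        ready = 1;
    }

    static void sigusr1_handler(int sig)
    {
        if (ready) {
            /* The store to pending_request is already visible here: the
             * handler runs in the same thread that executed barrier(),
             * so only compiler reordering had to be prevented.
             */
            last_seen = pending_request;
        }
    }
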


Sequentially consistent atomic access
=====================================

Most of the operations in the ``qemu/atomic.h`` header ensure *sequential
consistency*, where "the result of any execution is the same as if the
operations of all the processors were executed in some sequential order,
and the operations of each individual processor appear in this sequence
in the order specified by its program".

``qemu/atomic.h`` provides the following set of atomic read-modify-write
operations::

    void qatomic_inc(ptr)
    void qatomic_dec(ptr)
    void qatomic_add(ptr, val)
    void qatomic_sub(ptr, val)
    void qatomic_and(ptr, val)
    void qatomic_or(ptr, val)

    typeof(*ptr) qatomic_fetch_inc(ptr)
    typeof(*ptr) qatomic_fetch_dec(ptr)
    typeof(*ptr) qatomic_fetch_add(ptr, val)
    typeof(*ptr) qatomic_fetch_sub(ptr, val)
    typeof(*ptr) qatomic_fetch_and(ptr, val)
    typeof(*ptr) qatomic_fetch_or(ptr, val)
    typeof(*ptr) qatomic_fetch_xor(ptr, val)
    typeof(*ptr) qatomic_fetch_inc_nonzero(ptr)
    typeof(*ptr) qatomic_xchg(ptr, val)
    typeof(*ptr) qatomic_cmpxchg(ptr, old, new)

The operations in the second group return the old value of ``*ptr``.  All
of these operations are polymorphic; they operate on any type that is as
wide as a pointer or smaller.

Similar operations return the new value of ``*ptr``::

    typeof(*ptr) qatomic_inc_fetch(ptr)
    typeof(*ptr) qatomic_dec_fetch(ptr)
    typeof(*ptr) qatomic_add_fetch(ptr, val)
    typeof(*ptr) qatomic_sub_fetch(ptr, val)
    typeof(*ptr) qatomic_and_fetch(ptr, val)
    typeof(*ptr) qatomic_or_fetch(ptr, val)
    typeof(*ptr) qatomic_xor_fetch(ptr, val)
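
As an example, these primitives are enough to build a reference count that
can be manipulated from multiple threads.  This is only a sketch, assuming
the usual QEMU environment; the ``Resource`` type is made up for the purpose
of illustration and the object is freed with glib's ``g_free()``::

    typedef struct Resource {
        unsigned refcount;
        /* ... payload ... */
    } Resource;

    static void resource_ref(Resource *r)
    {
        qatomic_inc(&r->refcount);
    }

    static void resource_unref(Resource *r)
    {
        /* qatomic_dec_fetch() returns the new value of the counter;
         * whoever drops it to zero frees the object.
         */
        if (qatomic_dec_fetch(&r->refcount) == 0) {
            g_free(r);
        }
    }
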

``qemu/atomic.h`` also provides loads and stores that cannot be reordered
with each other::

    typeof(*ptr) qatomic_mb_read(ptr)
    void qatomic_mb_set(ptr, val)

However, these do not provide sequential consistency and, in particular,
they do not participate in the total ordering enforced by
sequentially-consistent operations.  For this reason they are deprecated.
They should instead be replaced with any of the following (ordered from
easiest to hardest):

- accesses inside a mutex or spinlock

- lightweight synchronization primitives such as ``QemuEvent``

- RCU operations (``qatomic_rcu_read``, ``qatomic_rcu_set``) when publishing
  or accessing a new version of a data structure

- other atomic accesses: ``qatomic_read`` and ``qatomic_load_acquire`` for
  loads, ``qatomic_set`` and ``qatomic_store_release`` for stores, ``smp_mb``
  to forbid reordering subsequent loads before a store; see the sketch
  after this list.
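
For instance, the common pattern in which one thread produces a value and
another thread polls for it maps to a load-acquire/store-release pair.  A
minimal sketch, with illustrative names and assuming the usual QEMU
includes::

    static int result;
    static bool done;

    /* producer thread */
    static void publish(int value)
    {
        result = value;                      /* plain store...              */
        qatomic_store_release(&done, true);  /* ...published by the release */
    }

    /* consumer thread */
    static int consume(void)
    {
        while (!qatomic_load_acquire(&done)) {
            /* spin; real code would rather block on a QemuEvent */
        }
        return result;   /* guaranteed to observe the store to result */
    }
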


Weak atomic access and manual memory barriers
=============================================

Compared to sequentially consistent atomic access, programming with
weaker consistency models can be considerably more complicated.
The only guarantees that you can rely upon in this case are:

- atomic accesses will not cause data races (and hence undefined behavior);
  ordinary accesses instead cause data races if they are concurrent with
  other accesses of which at least one is a write.  In order to ensure this,
  the compiler will not optimize accesses out of existence, create unsolicited
  accesses, or perform other similar optimizations.

- acquire operations will appear to happen, with respect to the other
  components of the system, before all the LOAD or STORE operations
  specified afterwards.

- release operations will appear to happen, with respect to the other
  components of the system, after all the LOAD or STORE operations
  specified before.

- release operations will *synchronize with* acquire operations;
  see :ref:`acqrel` for a detailed explanation.

When using this model, variables are accessed with:

- ``qatomic_read()`` and ``qatomic_set()``; these prevent the compiler from
  optimizing accesses out of existence and creating unsolicited
  accesses, but do not otherwise impose any ordering on loads and
  stores: both the compiler and the processor are free to reorder
  them.

- ``qatomic_load_acquire()``, which guarantees the LOAD to appear to
  happen, with respect to the other components of the system,
  before all the LOAD or STORE operations specified afterwards.
  Operations coming before ``qatomic_load_acquire()`` can still be
  reordered after it.

- ``qatomic_store_release()``, which guarantees the STORE to appear to
  happen, with respect to the other components of the system,
  after all the LOAD or STORE operations specified before.
  Operations coming after ``qatomic_store_release()`` can still be
  reordered before it.

Restrictions to the ordering of accesses can also be specified
using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``.

Memory barriers control the order of references to shared memory.
They come in six kinds:

- ``smp_rmb()`` guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not
  required to have any effect on stores.

- ``smp_wmb()`` guarantees that all the STORE operations specified before
  the barrier will appear to happen before all the STORE operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not
  required to have any effect on loads.

- ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD or STORE operations
  specified after the barrier with respect to the other components of
  the system.

- ``smp_mb_release()`` guarantees that all the STORE operations specified *after*
  the barrier will appear to happen after all the LOAD or STORE operations
  specified *before* the barrier with respect to the other components of
  the system.

- ``smp_mb()`` guarantees that all the LOAD and STORE operations specified
  before the barrier will appear to happen before all the LOAD and
  STORE operations specified after the barrier with respect to the other
  components of the system.

  ``smp_mb()`` puts a partial ordering on both loads and stores.  It is
  stronger than both a read and a write memory barrier; it implies both
  ``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs
  coming before the barrier from overtaking LOADs coming after the
  barrier and vice versa.

- ``smp_read_barrier_depends()`` is a weaker kind of read barrier.  On
  most processors, whenever two loads are performed such that the
  second depends on the result of the first (e.g., the first load
  retrieves the address to which the second load will be directed),
  the processor will guarantee that the first LOAD will appear to happen
  before the second with respect to the other components of the system.
  However, this is not always true---for example, it was not true on
  Alpha processors.  Whenever this kind of access happens to shared
  memory (that is not protected by a lock), a read barrier is needed,
  and ``smp_read_barrier_depends()`` can be used instead of ``smp_rmb()``.

  Note that the first load really has to have a *data* dependency and not
  a control dependency.  If the address for the second load is dependent
  on the first load, but the dependency is through a conditional rather
  than actually loading the address itself, then it's a *control*
  dependency and a full read barrier or better is required, as shown
  in the sketch after this list.
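
For example, in the following sketch (with illustrative variable names) the
second load is reached only through an ``if`` on the value of the first load;
the address of the second load does not come from that value, so this is a
control dependency and ``smp_rmb()`` or a stronger barrier is required::

    static int flag;
    static int data;

    static int read_if_ready(void)
    {
        if (qatomic_read(&flag)) {
            /* Control dependency only: smp_read_barrier_depends() would
             * not be enough here; a full read barrier is needed before
             * loading data.
             */
            smp_rmb();
            return qatomic_read(&data);
        }
        return -1;   /* nothing published yet */
    }
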

Memory barriers and ``qatomic_load_acquire``/``qatomic_store_release`` are
mostly used when a data structure has one thread that is always a writer
and one thread that is always a reader:

   +--------------------------------+--------------------------------+
   | thread 1                       | thread 2                       |
   +================================+================================+
   | ::                             | ::                             |
   |                                |                                |
   |   qatomic_store_release(&a, x);|   y = qatomic_load_acquire(&b);|
   |   qatomic_store_release(&b, y);|   x = qatomic_load_acquire(&a);|
   +--------------------------------+--------------------------------+

In this case, correctness is easy to check using the "pairing" trick
that is explained below.

Sometimes, a thread is accessing many variables that are otherwise
unrelated to each other (for example because, apart from the current
thread, exactly one other thread will read or write each of these
variables).  In this case, it is possible to "hoist" the barriers
outside a loop.  For example:

   +-----------------------------------------+-------------------------------+
   | before                                  | after                         |
   +=========================================+===============================+
   | ::                                      | ::                            |
   |                                         |                               |
   |   n = 0;                                |   n = 0;                      |
   |   for (i = 0; i < 10; i++)              |   for (i = 0; i < 10; i++)    |
   |     n += qatomic_load_acquire(&a[i]);   |     n += qatomic_read(&a[i]); |
   |                                         |   smp_mb_acquire();           |
   +-----------------------------------------+-------------------------------+
   | ::                                      | ::                            |
   |                                         |                               |
   |                                         |   smp_mb_release();           |
   |   for (i = 0; i < 10; i++)              |   for (i = 0; i < 10; i++)    |
   |     qatomic_store_release(&a[i], false);|     qatomic_set(&a[i], false);|
   +-----------------------------------------+-------------------------------+

Splitting a loop can also be useful to reduce the number of barriers:

   +-----------------------------------------+-------------------------------+
   | before                                  | after                         |
   +=========================================+===============================+
   | ::                                      | ::                            |
   |                                         |                               |
   |   n = 0;                                |   smp_mb_release();           |
   |   for (i = 0; i < 10; i++) {            |   for (i = 0; i < 10; i++)    |
   |     qatomic_store_release(&a[i], false);|     qatomic_set(&a[i], false);|
   |     smp_mb();                           |   smp_mb();                   |
   |     n += qatomic_read(&b[i]);           |   n = 0;                      |
   |   }                                     |   for (i = 0; i < 10; i++)    |
   |                                         |     n += qatomic_read(&b[i]); |
   +-----------------------------------------+-------------------------------+

In this case, an ``smp_mb_release()`` is also replaced with a (possibly
cheaper, and clearer as well) ``smp_wmb()``:

   +-----------------------------------------+-------------------------------+
   | before                                  | after                         |
   +=========================================+===============================+
   | ::                                      | ::                            |
   |                                         |                               |
   |                                         |   smp_mb_release();           |
   |   for (i = 0; i < 10; i++) {            |   for (i = 0; i < 10; i++)    |
   |     qatomic_store_release(&a[i], false);|     qatomic_set(&a[i], false);|
   |     qatomic_store_release(&b[i], false);|   smp_wmb();                  |
   |   }                                     |   for (i = 0; i < 10; i++)    |
   |                                         |     qatomic_set(&b[i], false);|
   +-----------------------------------------+-------------------------------+


.. _acqrel:

Acquire/release pairing and the *synchronizes-with* relation
-------------------------------------------------------------

Atomic operations other than ``qatomic_set()`` and ``qatomic_read()`` have
either *acquire* or *release* semantics [#rmw]_.  This has two effects:

.. [#rmw] Read-modify-write operations can have both---acquire applies to
          the read part, and release to the write.

- within a thread, they are ordered either before subsequent operations
  (for acquire) or after previous operations (for release).

- if a release operation in one thread *synchronizes with* an acquire operation
  in another thread, the ordering constraints propagate from the first to the
  second thread.  That is, everything before the release operation in the
  first thread is guaranteed to *happen before* everything after the
  acquire operation in the second thread.

The concept of acquire and release semantics is not exclusive to atomic
operations; almost all higher-level synchronization primitives also have
acquire or release semantics.  For example:

- ``pthread_mutex_lock`` has acquire semantics, ``pthread_mutex_unlock`` has
  release semantics and synchronizes with a ``pthread_mutex_lock`` for the
  same mutex.

- ``pthread_cond_signal`` and ``pthread_cond_broadcast`` have release semantics;
  ``pthread_cond_wait`` has both release semantics (synchronizing with
  ``pthread_mutex_lock``) and acquire semantics (synchronizing with
  ``pthread_mutex_unlock`` and signaling of the condition variable).

- ``pthread_create`` has release semantics and synchronizes with the start
  of the new thread; ``pthread_join`` has acquire semantics and synchronizes
  with the exiting of the thread.

- ``qemu_event_set`` has release semantics, ``qemu_event_wait`` has
  acquire semantics; a sketch of this pairing follows this list.
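
For instance, data written before ``qemu_event_set()`` is guaranteed to be
visible to whoever returns from ``qemu_event_wait()``, without any explicit
barrier.  A minimal sketch, with made-up names, assuming the ``QemuEvent``
API from ``qemu/thread.h`` (including ``qemu_event_init()`` for setup)::

    static QemuEvent request_done;   /* qemu_event_init(&request_done, false) */
    static int request_result;

    /* worker thread */
    static void finish_request(int value)
    {
        request_result = value;         /* plain store, published below */
        qemu_event_set(&request_done);  /* release operation            */
    }

    /* waiting thread */
    static int wait_for_request(void)
    {
        /* acquire operation; synchronizes with qemu_event_set() above */
        qemu_event_wait(&request_done);
        return request_result;          /* the store above is visible */
    }
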

In the following example there are no atomic accesses, but thread 2 still
relies on the *synchronizes-with* relation between ``pthread_exit``
(release) and ``pthread_join`` (acquire):

   +---------------------+-----------------------------+
   | thread 1            | thread 2                    |
   +=====================+=============================+
   | ::                  | ::                          |
   |                     |                             |
   |   *a = 1;           |                             |
   |   pthread_exit(a);  |   pthread_join(thread1, &a);|
   |                     |   x = *a;                   |
   +---------------------+-----------------------------+

Synchronization between threads basically descends from this pairing of
a release operation and an acquire operation.  Therefore, atomic operations
other than ``qatomic_set()`` and ``qatomic_read()`` will almost always be
paired with another operation of the opposite kind: an acquire operation
will pair with a release operation and vice versa.  This rule of thumb is
extremely useful; in the case of QEMU, however, note that the other
operation may actually be in a driver that runs in the guest!

``smp_read_barrier_depends()``, ``smp_rmb()``, ``smp_mb_acquire()``,
``qatomic_load_acquire()`` and ``qatomic_rcu_read()`` all count
as acquire operations.  ``smp_wmb()``, ``smp_mb_release()``,
``qatomic_store_release()`` and ``qatomic_rcu_set()`` all count as release
operations.  ``smp_mb()`` counts as both acquire and release, therefore
it can pair with any other atomic operation.  Here is an example:

   +----------------------+------------------------+
   | thread 1             | thread 2               |
   +======================+========================+
   | ::                   | ::                     |
   |                      |                        |
   |   qatomic_set(&a, 1);|                        |
   |   smp_wmb();         |                        |
   |   qatomic_set(&b, 2);|   x = qatomic_read(&b);|
   |                      |   smp_rmb();           |
   |                      |   y = qatomic_read(&a);|
   +----------------------+------------------------+

Note that a load-store pair only counts if the two operations access the
same variable: that is, a store-release on a variable ``x`` *synchronizes
with* a load-acquire on a variable ``x``, while a release barrier
synchronizes with any acquire operation.  The following example shows
correct synchronization:

   +--------------------------------+--------------------------------+
   | thread 1                       | thread 2                       |
   +================================+================================+
   | ::                             | ::                             |
   |                                |                                |
   |   qatomic_set(&a, 1);          |                                |
   |   qatomic_store_release(&b, 2);|   x = qatomic_load_acquire(&b);|
   |                                |   y = qatomic_read(&a);        |
   +--------------------------------+--------------------------------+

Acquire and release semantics of higher-level primitives can also be
relied upon for the purpose of establishing the *synchronizes with*
relation.

Note that the "writing" thread is accessing the variables in the
opposite order as the "reading" thread.  This is expected: stores
before a release operation will normally match the loads after
the acquire operation, and vice versa.  In fact, this happened already
in the ``pthread_exit``/``pthread_join`` example above.

Finally, this more complex example has more than two accesses and data
dependency barriers.  It also does not use atomic accesses whenever there
cannot be a data race:

   +----------------------+------------------------------+
   | thread 1             | thread 2                     |
   +======================+==============================+
   | ::                   | ::                           |
   |                      |                              |
   |   b[2] = 1;          |                              |
   |   smp_wmb();         |                              |
   |   x->i = 2;          |                              |
   |   smp_wmb();         |                              |
   |   qatomic_set(&a, x);|   x = qatomic_read(&a);      |
   |                      |   smp_read_barrier_depends();|
   |                      |   y = x->i;                  |
   |                      |   smp_read_barrier_depends();|
   |                      |   z = b[y];                  |
   +----------------------+------------------------------+


Comparison with Linux kernel primitives
=======================================

Here is a list of differences between Linux kernel atomic operations
and memory barriers, and the equivalents in QEMU:

- atomic operations in Linux are always on a 32-bit int type and
  use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic
  and use normal C types.

- Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee
  at all.  Linux 4.1 updated them to implement volatile semantics via
  ``ACCESS_ONCE`` (or the more recent ``READ_ONCE``/``WRITE_ONCE``).

  QEMU's ``qatomic_read`` and ``qatomic_set`` implement C11 atomic relaxed
  semantics if the compiler supports it, and volatile semantics otherwise.
  Both semantics prevent the compiler from doing certain transformations;
  the difference is that atomic accesses are guaranteed to be atomic,
  while volatile accesses aren't.  Thus, in the volatile case we just cross
  our fingers hoping that the compiler will generate atomic accesses,
  since we assume the variables passed are machine-word sized and
  properly aligned.

  No barriers are implied by ``qatomic_read`` and ``qatomic_set`` in either
  Linux or QEMU.

- atomic read-modify-write operations in Linux are of three kinds:

  ===================== =========================================
  ``atomic_OP``         returns void
  ``atomic_OP_return``  returns new value of the variable
  ``atomic_fetch_OP``   returns the old value of the variable
  ``atomic_cmpxchg``    returns the old value of the variable
  ===================== =========================================

  In QEMU, the second kind is named ``qatomic_OP_fetch``.

- different atomic read-modify-write operations in Linux imply
  a different set of memory barriers.  In QEMU, all of them enforce
  sequential consistency: there is a single order in which the
  program sees them happen.

- however, according to the C11 memory model that QEMU uses, this order
  does not propagate to other memory accesses on either side of the
  read-modify-write operation.  As far as those are concerned, the
  operation consists of just a load-acquire followed by a store-release.
  Stores that precede the RMW operation, and loads that follow it, can
  still be reordered and will happen *in the middle* of the read-modify-write
  operation!

  Therefore, the following example is correct in Linux but not in QEMU:

      +-------------------------------+--------------------------------+
      | Linux (correct)               | QEMU (incorrect)               |
      +===============================+================================+
      | ::                            | ::                             |
      |                               |                                |
      |   a = atomic_fetch_add(&x, 2);|   a = qatomic_fetch_add(&x, 2);|
      |   b = READ_ONCE(&y);          |   b = qatomic_read(&y);        |
      +-------------------------------+--------------------------------+

  because the read of ``y`` can be moved (by either the processor or the
  compiler) before the write of ``x``.

  Fixing this requires a full memory barrier between the write of ``x`` and
  the read of ``y``.  QEMU provides ``smp_mb__before_rmw()`` and
  ``smp_mb__after_rmw()``; they act both as an optimization,
  avoiding the memory barrier on processors where it is unnecessary,
  and as a clarification of this corner case of the C11 memory model:

      +--------------------------------+
      | QEMU (correct)                 |
      +================================+
      | ::                             |
      |                                |
      |   a = qatomic_fetch_add(&x, 2);|
      |   smp_mb__after_rmw();         |
      |   b = qatomic_read(&y);        |
      +--------------------------------+

  In the common case where only one thread writes ``x``, it is also possible
  to write it like this:

      +--------------------------------+
      | QEMU (correct)                 |
      +================================+
      | ::                             |
      |                                |
      |   a = qatomic_read(&x);        |
      |   qatomic_set(&x, a + 2);      |
      |   smp_mb();                    |
      |   b = qatomic_read(&y);        |
      +--------------------------------+
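
  Symmetrically, when a store that precedes the read-modify-write operation
  must not be reordered after it, ``smp_mb__before_rmw()`` can be placed
  before the operation.  A sketch along the same lines as the examples
  above::

      qatomic_set(&a, 1);      /* must be visible before the RMW below */
      smp_mb__before_rmw();
      b = qatomic_fetch_add(&x, 2);
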

Sources
=======

- ``Documentation/memory-barriers.txt`` from the Linux kernel