.. _atomics-ref:

=========================
Atomic operations in QEMU
=========================

CPUs perform independent memory operations effectively in random order,
but this can be a problem for CPU-CPU interaction (including interactions
between QEMU and the guest).  Multi-threaded programs use various tools
to instruct the compiler and the CPU to restrict the order to something
that is consistent with the expectations of the programmer.

The most basic tool is locking.  Mutexes, condition variables and
semaphores are used in QEMU, and should be the default approach to
synchronization.  Anything else is considerably harder, but it's
also justified more often than one would like; the most performance-critical
parts of QEMU in particular require a very low-level approach to concurrency,
involving memory barriers and atomic operations.  The semantics of concurrent
memory accesses are governed by the C11 memory model.

QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to
provide better portability and a less verbose syntax.  ``qemu/atomic.h``
provides macros that fall in three camps:

- compiler barriers: ``barrier()``;

- weak atomic access and manual memory barriers: ``qatomic_read()``,
  ``qatomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
  ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``,
  ``smp_mb__before_rmw()``, ``smp_mb__after_rmw()``;

- sequentially consistent atomic access: everything else.

In general, use of ``qemu/atomic.h`` should be wrapped with more easily
used data structures (e.g. the lock-free singly-linked list operations
``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization
primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``).  Bare use of
atomic operations and memory barriers should be limited to inter-thread
checking of flags and documented thoroughly.


Compiler memory barrier
=======================

``barrier()`` prevents the compiler from moving the memory accesses on
either side of it to the other side.  The compiler barrier has no direct
effect on the CPU, which may then reorder things however it wishes.

``barrier()`` is mostly used within ``qemu/atomic.h`` itself.  On some
architectures, CPU guarantees are strong enough that blocking compiler
optimizations already ensures the correct order of execution.  In this
case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple
compiler barriers.

Still, ``barrier()`` can be useful when writing code that can be interrupted
by signal handlers.


Sequentially consistent atomic access
=====================================

Most of the operations in the ``qemu/atomic.h`` header ensure *sequential
consistency*, where "the result of any execution is the same as if the
operations of all the processors were executed in some sequential order,
and the operations of each individual processor appear in this sequence
in the order specified by its program".
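
As an illustration of what sequential consistency buys you, consider a counter
that several threads increment concurrently.  This is a minimal sketch using
the read-modify-write operations listed below; ``request_count`` and
``record_request()`` are hypothetical names::

  static unsigned int request_count;

  void record_request(void)
  {
      /* qatomic_fetch_inc() is a sequentially consistent read-modify-write,
       * so concurrent increments are never lost and all threads agree on a
       * single order for them.
       */
      unsigned int old = qatomic_fetch_inc(&request_count);
      if (old == 0) {
          /* Runs exactly once, in whichever thread recorded the first request. */
      }
  }
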
``qemu/atomic.h`` provides the following set of atomic read-modify-write
operations::

  void qatomic_inc(ptr)
  void qatomic_dec(ptr)
  void qatomic_add(ptr, val)
  void qatomic_sub(ptr, val)
  void qatomic_and(ptr, val)
  void qatomic_or(ptr, val)

  typeof(*ptr) qatomic_fetch_inc(ptr)
  typeof(*ptr) qatomic_fetch_dec(ptr)
  typeof(*ptr) qatomic_fetch_add(ptr, val)
  typeof(*ptr) qatomic_fetch_sub(ptr, val)
  typeof(*ptr) qatomic_fetch_and(ptr, val)
  typeof(*ptr) qatomic_fetch_or(ptr, val)
  typeof(*ptr) qatomic_fetch_xor(ptr, val)
  typeof(*ptr) qatomic_fetch_inc_nonzero(ptr)
  typeof(*ptr) qatomic_xchg(ptr, val)
  typeof(*ptr) qatomic_cmpxchg(ptr, old, new)

all of which, apart from the ``void`` ones, return the old value of ``*ptr``.
These operations are polymorphic; they operate on any type that is as wide
as a pointer or smaller.

Similar operations return the new value of ``*ptr``::

  typeof(*ptr) qatomic_inc_fetch(ptr)
  typeof(*ptr) qatomic_dec_fetch(ptr)
  typeof(*ptr) qatomic_add_fetch(ptr, val)
  typeof(*ptr) qatomic_sub_fetch(ptr, val)
  typeof(*ptr) qatomic_and_fetch(ptr, val)
  typeof(*ptr) qatomic_or_fetch(ptr, val)
  typeof(*ptr) qatomic_xor_fetch(ptr, val)

``qemu/atomic.h`` also provides an optimized shortcut for
``qatomic_set`` followed by ``smp_mb``::

  void qatomic_set_mb(ptr, val)


Weak atomic access and manual memory barriers
=============================================

Compared to sequentially consistent atomic access, programming with
weaker consistency models can be considerably more complicated.
The only guarantees that you can rely upon in this case are:

- atomic accesses will not cause data races (and hence undefined behavior);
  ordinary accesses instead cause data races if they are concurrent with
  other accesses of which at least one is a write.  In order to ensure this,
  the compiler will not optimize accesses out of existence, create unsolicited
  accesses, or perform other similar optimizations.

- acquire operations will appear to happen, with respect to the other
  components of the system, before all the LOAD or STORE operations
  specified afterwards.

- release operations will appear to happen, with respect to the other
  components of the system, after all the LOAD or STORE operations
  specified before.

- release operations will *synchronize with* acquire operations;
  see :ref:`acqrel` for a detailed explanation.

When using this model, variables are accessed with:

- ``qatomic_read()`` and ``qatomic_set()``; these prevent the compiler from
  optimizing accesses out of existence and creating unsolicited
  accesses, but do not otherwise impose any ordering on loads and
  stores: both the compiler and the processor are free to reorder
  them.

- ``qatomic_load_acquire()``, which guarantees the LOAD to appear to
  happen, with respect to the other components of the system,
  before all the LOAD or STORE operations specified afterwards.
  Operations coming before ``qatomic_load_acquire()`` can still be
  reordered after it.

- ``qatomic_store_release()``, which guarantees the STORE to appear to
  happen, with respect to the other components of the system,
  after all the LOAD or STORE operations specified before.
  Operations coming after ``qatomic_store_release()`` can still be
  reordered before it.
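
For instance, a producer can fill an ordinary (non-atomic) structure and then
publish it by setting a flag with ``qatomic_store_release()``, while a consumer
polls the flag with ``qatomic_load_acquire()``.  This is a minimal sketch;
``msg``, ``ready``, ``produce()`` and ``consume()`` are hypothetical names::

  static struct { int id; int payload; } msg;
  static int ready;

  /* producer thread */
  void produce(void)
  {
      msg.id = 42;                       /* ordinary stores...             */
      msg.payload = 123;
      qatomic_store_release(&ready, 1);  /* ...made visible by the release */
  }

  /* consumer thread */
  int consume(void)
  {
      if (qatomic_load_acquire(&ready)) {  /* acquire */
          return msg.payload;              /* guaranteed to see 123 */
      }
      return -1;                           /* not published yet */
  }

A similar pattern appears later in table form, as a pair of
``qatomic_store_release()`` and ``qatomic_load_acquire()`` operations.
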
Restrictions to the ordering of accesses can also be specified
using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``.

Memory barriers control the order of references to shared memory.
They come in six kinds:

- ``smp_rmb()`` guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not
  required to have any effect on stores.

- ``smp_wmb()`` guarantees that all the STORE operations specified before
  the barrier will appear to happen before all the STORE operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not
  required to have any effect on loads.

- ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD or STORE operations
  specified after the barrier with respect to the other components of
  the system.

- ``smp_mb_release()`` guarantees that all the STORE operations specified *after*
  the barrier will appear to happen after all the LOAD or STORE operations
  specified *before* the barrier with respect to the other components of
  the system.

- ``smp_mb()`` guarantees that all the LOAD and STORE operations specified
  before the barrier will appear to happen before all the LOAD and
  STORE operations specified after the barrier with respect to the other
  components of the system.

  ``smp_mb()`` puts a partial ordering on both loads and stores.  It is
  stronger than both a read and a write memory barrier; it implies both
  ``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs
  coming before the barrier from overtaking LOADs coming after the
  barrier and vice versa.

- ``smp_read_barrier_depends()`` is a weaker kind of read barrier.  On
  most processors, whenever two loads are performed such that the
  second depends on the result of the first (e.g., the first load
  retrieves the address to which the second load will be directed),
  the processor will guarantee that the first LOAD will appear to happen
  before the second with respect to the other components of the system.
  Therefore, unlike ``smp_rmb()`` or ``qatomic_load_acquire()``,
  ``smp_read_barrier_depends()`` can be just a compiler barrier on
  weakly-ordered architectures such as Arm or PPC\ [#alpha]_.

  Note that the first load really has to have a *data* dependency and not
  a control dependency.  If the address for the second load is dependent
  on the first load, but the dependency is through a conditional rather
  than actually loading the address itself, then it's a *control*
  dependency and a full read barrier or better is required; a short
  sketch follows the footnote below.

.. [#alpha] The DEC Alpha is an exception, because ``smp_read_barrier_depends()``
   needs a processor barrier.  On strongly-ordered architectures such
   as x86 or s390, ``smp_rmb()`` and ``qatomic_load_acquire()`` can
   also be compiler barriers only.
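
To illustrate the difference between data and control dependencies, here is a
minimal sketch; the names are hypothetical, with ``current_entry`` published
by another thread, while ``default_entry`` and ``use_fallback`` are ordinary
shared variables::

  struct entry { int value; };
  static struct entry *current_entry;   /* written by another thread */
  static struct entry default_entry;
  static int use_fallback;

  int reader(void)
  {
      int v;

      /* Data dependency: the address of the second load comes from the
       * first load, so smp_read_barrier_depends() is sufficient on most
       * processors.
       */
      struct entry *e = qatomic_read(&current_entry);
      smp_read_barrier_depends();
      v = e->value;

      /* Control dependency: the first load only decides whether the
       * second load happens, not where it loads from, so a full read
       * barrier (or qatomic_load_acquire()) is required.
       */
      if (qatomic_read(&use_fallback)) {
          smp_rmb();
          v = default_entry.value;
      }
      return v;
  }
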
Memory barriers and ``qatomic_load_acquire``/``qatomic_store_release`` are
mostly used when a data structure has one thread that is always a writer
and one thread that is always a reader:

+------------------------------------+------------------------------------+
| thread 1                           | thread 2                           |
+====================================+====================================+
| ::                                 | ::                                 |
|                                    |                                    |
|   qatomic_store_release(&a, x);    |   y = qatomic_load_acquire(&b);    |
|   qatomic_store_release(&b, y);    |   x = qatomic_load_acquire(&a);    |
+------------------------------------+------------------------------------+

In this case, correctness is easy to check using the "pairing"
trick that is explained below.

Sometimes, a thread is accessing many variables that are otherwise
unrelated to each other (for example because, apart from the current
thread, exactly one other thread will read or write each of these
variables).  In this case, it is possible to "hoist" the barriers
outside a loop.  For example:

+--------------------------------------------+------------------------------------+
| before                                     | after                              |
+============================================+====================================+
| ::                                         | ::                                 |
|                                            |                                    |
|   n = 0;                                   |   n = 0;                           |
|   for (i = 0; i < 10; i++)                 |   for (i = 0; i < 10; i++)         |
|     n += qatomic_load_acquire(&a[i]);      |     n += qatomic_read(&a[i]);      |
|                                            |   smp_mb_acquire();                |
+--------------------------------------------+------------------------------------+
| ::                                         | ::                                 |
|                                            |                                    |
|                                            |   smp_mb_release();                |
|   for (i = 0; i < 10; i++)                 |   for (i = 0; i < 10; i++)         |
|     qatomic_store_release(&a[i], false);   |     qatomic_set(&a[i], false);     |
+--------------------------------------------+------------------------------------+

Splitting a loop can also be useful to reduce the number of barriers:

+--------------------------------------------+------------------------------------+
| before                                     | after                              |
+============================================+====================================+
| ::                                         | ::                                 |
|                                            |                                    |
|   n = 0;                                   |   smp_mb_release();                |
|   for (i = 0; i < 10; i++) {               |   for (i = 0; i < 10; i++)         |
|     qatomic_store_release(&a[i], false);   |     qatomic_set(&a[i], false);     |
|     smp_mb();                              |   smp_mb();                        |
|     n += qatomic_read(&b[i]);              |   n = 0;                           |
|   }                                        |   for (i = 0; i < 10; i++)         |
|                                            |     n += qatomic_read(&b[i]);      |
+--------------------------------------------+------------------------------------+

In this case, an ``smp_mb_release()`` is also replaced with a (possibly cheaper,
and clearer as well) ``smp_wmb()``:

+--------------------------------------------+------------------------------------+
| before                                     | after                              |
+============================================+====================================+
| ::                                         | ::                                 |
|                                            |                                    |
|                                            |   smp_mb_release();                |
|   for (i = 0; i < 10; i++) {               |   for (i = 0; i < 10; i++)         |
|     qatomic_store_release(&a[i], false);   |     qatomic_set(&a[i], false);     |
|     qatomic_store_release(&b[i], false);   |   smp_wmb();                       |
|   }                                        |   for (i = 0; i < 10; i++)         |
|                                            |     qatomic_set(&b[i], false);     |
+--------------------------------------------+------------------------------------+


.. _acqrel:

Acquire/release pairing and the *synchronizes-with* relation
-------------------------------------------------------------

Atomic operations other than ``qatomic_set()`` and ``qatomic_read()`` have
either *acquire* or *release* semantics\ [#rmw]_.
This has two effects:

.. [#rmw] Read-modify-write operations can have both: acquire applies to the
   read part, and release to the write.

- within a thread, they are ordered either before subsequent operations
  (for acquire) or after previous operations (for release).

- if a release operation in one thread *synchronizes with* an acquire operation
  in another thread, the ordering constraints propagate from the first to the
  second thread.  That is, everything before the release operation in the
  first thread is guaranteed to *happen before* everything after the
  acquire operation in the second thread.

The concept of acquire and release semantics is not exclusive to atomic
operations; almost all higher-level synchronization primitives also have
acquire or release semantics.  For example:

- ``pthread_mutex_lock`` has acquire semantics, ``pthread_mutex_unlock`` has
  release semantics and synchronizes with a ``pthread_mutex_lock`` for the
  same mutex.

- ``pthread_cond_signal`` and ``pthread_cond_broadcast`` have release semantics;
  ``pthread_cond_wait`` has both release semantics (synchronizing with
  ``pthread_mutex_lock``) and acquire semantics (synchronizing with
  ``pthread_mutex_unlock`` and signaling of the condition variable).

- ``pthread_create`` has release semantics and synchronizes with the start
  of the new thread; ``pthread_join`` has acquire semantics and synchronizes
  with the exiting of the thread.

- ``qemu_event_set`` has release semantics, ``qemu_event_wait`` has
  acquire semantics.

For example, in the following code there are no atomic accesses, but still
thread 2 is relying on the *synchronizes-with* relation between ``pthread_exit``
(release) and ``pthread_join`` (acquire):

+------------------------+--------------------------------+
| thread 1               | thread 2                       |
+========================+================================+
| ::                     | ::                             |
|                        |                                |
|   *a = 1;              |                                |
|   pthread_exit(a);     |   pthread_join(thread1, &a);   |
|                        |   x = *a;                      |
+------------------------+--------------------------------+

Synchronization between threads basically descends from this pairing of
a release operation and an acquire operation.  Therefore, atomic operations
other than ``qatomic_set()`` and ``qatomic_read()`` will almost always be
paired with another operation of the opposite kind: an acquire operation
will pair with a release operation and vice versa.  This rule of thumb is
extremely useful; in the case of QEMU, however, note that the other
operation may actually be in a driver that runs in the guest!

``smp_read_barrier_depends()``, ``smp_rmb()``, ``smp_mb_acquire()``,
``qatomic_load_acquire()`` and ``qatomic_rcu_read()`` all count
as acquire operations.  ``smp_wmb()``, ``smp_mb_release()``,
``qatomic_store_release()`` and ``qatomic_rcu_set()`` all count as release
operations.  ``smp_mb()`` counts as both acquire and release, therefore
it can pair with any other atomic operation.
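
As a concrete illustration with a higher-level primitive, here is a minimal
sketch relying on the release semantics of ``qemu_event_set()`` and the acquire
semantics of ``qemu_event_wait()``; the names are hypothetical, and ``done_ev``
is assumed to have been initialized elsewhere with ``qemu_event_init()``::

  static QemuEvent done_ev;
  static int result;

  /* worker thread */
  void worker(void)
  {
      result = 42;                /* ordinary store */
      qemu_event_set(&done_ev);   /* release */
  }

  /* waiting thread */
  int waiter(void)
  {
      qemu_event_wait(&done_ev);  /* acquire, synchronizes with the set */
      return result;              /* guaranteed to read 42 */
  }
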
Here is an example of pairing explicit memory barriers:

+--------------------------+------------------------------+
| thread 1                 | thread 2                     |
+==========================+==============================+
| ::                       | ::                           |
|                          |                              |
|   qatomic_set(&a, 1);    |                              |
|   smp_wmb();             |                              |
|   qatomic_set(&b, 2);    |   x = qatomic_read(&b);      |
|                          |   smp_rmb();                 |
|                          |   y = qatomic_read(&a);      |
+--------------------------+------------------------------+

Note that a load-store pair only counts if the two operations access the
same variable: that is, a store-release on a variable ``x`` *synchronizes
with* a load-acquire on a variable ``x``, while a release barrier
synchronizes with any acquire operation.  The following example shows
correct synchronization:

+----------------------------------+----------------------------------+
| thread 1                         | thread 2                         |
+==================================+==================================+
| ::                               | ::                               |
|                                  |                                  |
|   qatomic_set(&a, 1);            |                                  |
|   qatomic_store_release(&b, 2);  |   x = qatomic_load_acquire(&b);  |
|                                  |   y = qatomic_read(&a);          |
+----------------------------------+----------------------------------+

Acquire and release semantics of higher-level primitives can also be
relied upon for the purpose of establishing the *synchronizes with*
relation.

Note that the "writing" thread is accessing the variables in the
opposite order from the "reading" thread.  This is expected: stores
before a release operation will normally match the loads after
the acquire operation, and vice versa.  In fact, this happened already
in the ``pthread_exit``/``pthread_join`` example above.

Finally, this more complex example has more than two accesses and data
dependency barriers.  It also does not use atomic accesses whenever there
cannot be a data race:

+--------------------------+--------------------------------+
| thread 1                 | thread 2                       |
+==========================+================================+
| ::                       | ::                             |
|                          |                                |
|   b[2] = 1;              |                                |
|   smp_wmb();             |                                |
|   x->i = 2;              |                                |
|   smp_wmb();             |                                |
|   qatomic_set(&a, x);    |   x = qatomic_read(&a);        |
|                          |   smp_read_barrier_depends();  |
|                          |   y = x->i;                    |
|                          |   smp_read_barrier_depends();  |
|                          |   z = b[y];                    |
+--------------------------+--------------------------------+

Comparison with Linux kernel primitives
=======================================

Here is a list of differences between Linux kernel atomic operations
and memory barriers, and the equivalents in QEMU:

- atomic operations in Linux are always on a 32-bit int type and
  use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic
  and use normal C types.

- Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee
  at all.  Linux 4.1 updated them to implement volatile semantics via
  ``ACCESS_ONCE`` (or the more recent ``READ_ONCE``/``WRITE_ONCE``).

  QEMU's ``qatomic_read`` and ``qatomic_set`` implement C11 atomic relaxed
  semantics if the compiler supports it, and volatile semantics otherwise.
  Both semantics prevent the compiler from doing certain transformations;
  the difference is that atomic accesses are guaranteed to be atomic,
  while volatile accesses aren't.  Thus, in the volatile case we just cross
  our fingers hoping that the compiler will generate atomic accesses,
  since we assume the variables passed are machine-word sized and
  properly aligned.

  No barriers are implied by ``qatomic_read`` and ``qatomic_set`` in either
  Linux or QEMU.

- atomic read-modify-write operations in Linux are of three kinds:

  ===================== =========================================
  ``atomic_OP``         returns void
  ``atomic_OP_return``  returns new value of the variable
  ``atomic_fetch_OP``   returns the old value of the variable
  ``atomic_cmpxchg``    returns the old value of the variable
  ===================== =========================================

  In QEMU, the second kind is named ``qatomic_OP_fetch``.

- different atomic read-modify-write operations in Linux imply
  a different set of memory barriers.  In QEMU, all of them enforce
  sequential consistency: there is a single order in which the
  program sees them happen.

- however, according to the C11 memory model that QEMU uses, this order
  does not propagate to other memory accesses on either side of the
  read-modify-write operation.  As far as those are concerned, the
  operation consists of just a load-acquire followed by a store-release.
  Stores that precede the RMW operation, and loads that follow it, can
  still be reordered and will happen *in the middle* of the read-modify-write
  operation!

  Therefore, the following example is correct in Linux but not in QEMU:

  +----------------------------------+----------------------------------+
  | Linux (correct)                  | QEMU (incorrect)                 |
  +==================================+==================================+
  | ::                               | ::                               |
  |                                  |                                  |
  |   a = atomic_fetch_add(&x, 2);   |   a = qatomic_fetch_add(&x, 2);  |
  |   b = READ_ONCE(&y);             |   b = qatomic_read(&y);          |
  +----------------------------------+----------------------------------+

  because the read of ``y`` can be moved (by either the processor or the
  compiler) before the write of ``x``.

  Fixing this requires a full memory barrier between the write of ``x`` and
  the read of ``y``.  QEMU provides ``smp_mb__before_rmw()`` and
  ``smp_mb__after_rmw()``; they act both as an optimization,
  avoiding the memory barrier on processors where it is unnecessary,
  and as a clarification of this corner case of the C11 memory model:

  +----------------------------------+
  | QEMU (correct)                   |
  +==================================+
  | ::                               |
  |                                  |
  |   a = qatomic_fetch_add(&x, 2);  |
  |   smp_mb__after_rmw();           |
  |   b = qatomic_read(&y);          |
  +----------------------------------+

  In the common case where only one thread writes ``x``, it is also possible
  to write it like this:

  +----------------------------------+
  | QEMU (correct)                   |
  +==================================+
  | ::                               |
  |                                  |
  |   a = qatomic_read(&x);          |
  |   qatomic_set_mb(&x, a + 2);     |
  |   b = qatomic_read(&y);          |
  +----------------------------------+

Sources
=======

- ``Documentation/memory-barriers.txt`` from the Linux kernel