Explanation of the Linux-Kernel Memory Consistency Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:Author: Alan Stern <stern@rowland.harvard.edu>
:Created: October 2017

.. Contents

  1. INTRODUCTION
  2. BACKGROUND
  3. A SIMPLE EXAMPLE
  4. A SELECTION OF MEMORY MODELS
  5. ORDERING AND CYCLES
  6. EVENTS
  7. THE PROGRAM ORDER RELATION: po AND po-loc
  8. A WARNING
  9. DEPENDENCY RELATIONS: data, addr, and ctrl
  10. THE READS-FROM RELATION: rf, rfi, and rfe
  11. CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
  12. THE FROM-READS RELATION: fr, fri, and fre
  13. AN OPERATIONAL MODEL
  14. PROPAGATION ORDER RELATION: cumul-fence
  15. DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL
  16. SEQUENTIAL CONSISTENCY PER VARIABLE
  17. ATOMIC UPDATES: rmw
  18. THE PRESERVED PROGRAM ORDER RELATION: ppo
  19. AND THEN THERE WAS ALPHA
  20. THE HAPPENS-BEFORE RELATION: hb
  21. THE PROPAGATES-BEFORE RELATION: pb
  22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
  23. LOCKING
  24. PLAIN ACCESSES AND DATA RACES
  25. ODDS AND ENDS


INTRODUCTION
------------

The Linux-kernel memory consistency model (LKMM) is rather complex and
obscure.  This is particularly evident if you read through the
linux-kernel.bell and linux-kernel.cat files that make up the formal
version of the model; they are extremely terse and their meanings are
far from clear.

This document describes the ideas underlying the LKMM.  It is meant
for people who want to understand how the model was designed.  It does
not go into the details of the code in the .bell and .cat files;
rather, it explains in English what the code expresses symbolically.

Sections 2 (BACKGROUND) through 5 (ORDERING AND CYCLES) are aimed
toward beginners; they explain what memory consistency models are and
the basic notions shared by all such models.  People already familiar
with these concepts can skim or skip over them.  Sections 6 (EVENTS)
through 12 (THE FROM-READS RELATION) describe the fundamental
relations used in many models.  Starting in Section 13 (AN OPERATIONAL
MODEL), the workings of the LKMM itself are covered.

Warning: The code examples in this document are not written in the
proper format for litmus tests.  They don't include a header line, the
initializations are not enclosed in braces, the global variables are
not passed by pointers, and they don't have an "exists" clause at the
end.  Converting them to the right format is left as an exercise for
the reader.


BACKGROUND
----------

A memory consistency model (or just memory model, for short) is
something which predicts, given a piece of computer code running on a
particular kind of system, what values may be obtained by the code's
load instructions.  The LKMM makes these predictions for code running
as part of the Linux kernel.

In practice, people tend to use memory models the other way around.
That is, given a piece of code and a collection of values specified
for the loads, the model will predict whether it is possible for the
code to run in such a way that the loads will indeed obtain the
specified values.  Of course, this is just another way of expressing
the same idea.
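
As a trivial illustration of this "other way around" usage (a made-up
single-CPU sketch, using ordinary assignments), consider:

        int x = 0, r1;

        x = 1;
        r1 = x;

together with the specified value r1 = 0.  The question a memory model
answers is whether the code can run in such a way that the load really
does obtain 0; for this snippet the answer is no, as the next
paragraph explains.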

For code running on a uniprocessor system, the predictions are easy:
Each load instruction must obtain the value written by the most recent
store instruction accessing the same location (we ignore complicating
factors such as DMA and mixed-size accesses.)  But on multiprocessor
systems, with multiple CPUs making concurrent accesses to shared
memory locations, things aren't so simple.

Different architectures have differing memory models, and the Linux
kernel supports a variety of architectures.  The LKMM has to be fairly
permissive, in the sense that any behavior allowed by one of these
architectures also has to be allowed by the LKMM.


A SIMPLE EXAMPLE
----------------

Here is a simple example to illustrate the basic concepts.  Consider
some code running as part of a device driver for an input device.  The
driver might contain an interrupt handler which collects data from the
device, stores it in a buffer, and sets a flag to indicate the buffer
is full.  Running concurrently on a different CPU might be a part of
the driver code being executed by a process in the midst of a read(2)
system call.  This code tests the flag to see whether the buffer is
ready, and if it is, copies the data back to userspace.  The buffer
and the flag are memory locations shared between the two CPUs.

We can abstract out the important pieces of the driver code as follows
(the reason for using WRITE_ONCE() and READ_ONCE() instead of simple
assignment statements is discussed later):

        int buf = 0, flag = 0;

        P0()
        {
                WRITE_ONCE(buf, 1);
                WRITE_ONCE(flag, 1);
        }

        P1()
        {
                int r1;
                int r2 = 0;

                r1 = READ_ONCE(flag);
                if (r1)
                        r2 = READ_ONCE(buf);
        }

Here the P0() function represents the interrupt handler running on one
CPU and P1() represents the read() routine running on another.  The
value 1 stored in buf represents input data collected from the device.
Thus, P0 stores the data in buf and then sets flag.  Meanwhile, P1
reads flag into the private variable r1, and if it is set, reads the
data from buf into a second private variable r2 for copying to
userspace.  (Presumably if flag is not set then the driver will wait a
while and try again.)

This pattern of memory accesses, where one CPU stores values to two
shared memory locations and another CPU loads from those locations in
the opposite order, is widely known as the "Message Passing" or MP
pattern.  It is typical of memory access patterns in the kernel.

Please note that this example code is a simplified abstraction.  Real
buffers are usually larger than a single integer, real device drivers
usually use sleep and wakeup mechanisms rather than polling for I/O
completion, and real code generally doesn't bother to copy values into
private variables before using them.  All that is beside the point;
the idea here is simply to illustrate the overall pattern of memory
accesses by the CPUs.

A memory model will predict what values P1 might obtain for its loads
from flag and buf, or equivalently, what values r1 and r2 might end up
with after the code has finished running.

Some predictions are trivial.  For instance, no sane memory model would
predict that r1 = 42 or r2 = -7, because neither of those values ever
gets stored in flag or buf.

Some nontrivial predictions are nonetheless quite simple.
For instance, P1 might run entirely before P0 begins, in which case r1
and r2 will both be 0 at the end.  Or P0 might run entirely before P1
begins, in which case r1 and r2 will both be 1.

The interesting predictions concern what might happen when the two
routines run concurrently.  One possibility is that P1 runs after P0's
store to buf but before the store to flag.  In this case, r1 and r2
will again both be 0.  (If P1 had been designed to read buf
unconditionally then we would instead have r1 = 0 and r2 = 1.)

However, the most interesting possibility is where r1 = 1 and r2 = 0.
If this were to occur it would mean the driver contains a bug, because
incorrect data would get sent to the user: 0 instead of 1.  As it
happens, the LKMM does predict this outcome can occur, and the example
driver code shown above is indeed buggy.


A SELECTION OF MEMORY MODELS
----------------------------

The first widely cited memory model, and the simplest to understand,
is Sequential Consistency.  According to this model, systems behave as
if each CPU executed its instructions in order but with unspecified
timing.  In other words, the instructions from the various CPUs get
interleaved in a nondeterministic way, always according to some single
global order that agrees with the order of the instructions in the
program source for each CPU.  The model says that the value obtained
by each load is simply the value written by the most recently executed
store to the same memory location, from any CPU.

For the MP example code shown above, Sequential Consistency predicts
that the undesired result r1 = 1, r2 = 0 cannot occur.  The reasoning
goes like this:

        Since r1 = 1, P0 must store 1 to flag before P1 loads 1 from
        it, as loads can obtain values only from earlier stores.

        P1 loads from flag before loading from buf, since CPUs execute
        their instructions in order.

        P1 must load 0 from buf before P0 stores 1 to it; otherwise r2
        would be 1 since a load obtains its value from the most recent
        store to the same address.

        P0 stores 1 to buf before storing 1 to flag, since it executes
        its instructions in order.

        Since an instruction (in this case, P0's store to flag) cannot
        execute before itself, the specified outcome is impossible.

However, real computer hardware almost never follows the Sequential
Consistency memory model; doing so would rule out too many valuable
performance optimizations.  On ARM and PowerPC architectures, for
instance, the MP example code really does sometimes yield r1 = 1 and
r2 = 0.

x86 and SPARC follow yet a different memory model: TSO (Total Store
Ordering).  This model predicts that the undesired outcome for the MP
pattern cannot occur, but in other respects it differs from Sequential
Consistency.  One example is the Store Buffer (SB) pattern, in which
each CPU stores to its own shared location and then loads from the
other CPU's location:

        int x = 0, y = 0;

        P0()
        {
                int r0;

                WRITE_ONCE(x, 1);
                r0 = READ_ONCE(y);
        }

        P1()
        {
                int r1;

                WRITE_ONCE(y, 1);
                r1 = READ_ONCE(x);
        }

Sequential Consistency predicts that the outcome r0 = 0, r1 = 0 is
impossible.  (Exercise: Figure out the reasoning.)  But TSO allows
this outcome to occur, and in fact it does sometimes occur on x86 and
SPARC systems.
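
As an aside, the Warning in the INTRODUCTION pointed out that these
examples are not written in the proper litmus-test format.  For
comparison, here is a sketch of what the SB example looks like after
the conversion (the header line's test name is arbitrary); it has the
header line, the initialization in braces, the global variables passed
by pointers, and an "exists" clause asking about the interesting
outcome:

        C SB

        {}

        P0(int *x, int *y)
        {
                int r0;

                WRITE_ONCE(*x, 1);
                r0 = READ_ONCE(*y);
        }

        P1(int *x, int *y)
        {
                int r1;

                WRITE_ONCE(*y, 1);
                r1 = READ_ONCE(*x);
        }

        exists (0:r0=0 /\ 1:r1=0)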

The LKMM was inspired by the memory models followed by PowerPC, ARM,
x86, Alpha, and other architectures.  However, it is different in
detail from each of them.


ORDERING AND CYCLES
-------------------

Memory models are all about ordering.  Often this is temporal ordering
(i.e., the order in which certain events occur) but it doesn't have to
be; consider for example the order of instructions in a program's
source code.  We saw above that Sequential Consistency makes an
important assumption that CPUs execute instructions in the same order
as those instructions occur in the code, and there are many other
instances of ordering playing central roles in memory models.

The counterpart to ordering is a cycle.  Ordering rules out cycles:
It's not possible to have X ordered before Y, Y ordered before Z, and
Z ordered before X, because this would mean that X is ordered before
itself.  The analysis of the MP example under Sequential Consistency
involved just such an impossible cycle:

        W: P0 stores 1 to flag   executes before
        X: P1 loads 1 from flag  executes before
        Y: P1 loads 0 from buf   executes before
        Z: P0 stores 1 to buf    executes before
        W: P0 stores 1 to flag.

In short, if a memory model requires certain accesses to be ordered,
and a certain outcome for the loads in a piece of code can happen only
if those accesses would form a cycle, then the memory model predicts
that outcome cannot occur.

The LKMM is defined largely in terms of cycles, as we will see.


EVENTS
------

The LKMM does not work directly with the C statements that make up
kernel source code.  Instead it considers the effects of those
statements in a more abstract form, namely, events.  The model
includes three types of events:

        Read events correspond to loads from shared memory, such as
        calls to READ_ONCE(), smp_load_acquire(), or
        rcu_dereference().

        Write events correspond to stores to shared memory, such as
        calls to WRITE_ONCE(), smp_store_release(), or atomic_set().

        Fence events correspond to memory barriers (also known as
        fences), such as calls to smp_rmb() or rcu_read_lock().

These categories are not exclusive; a read or write event can also be
a fence.  This happens with functions like smp_load_acquire() or
spin_lock().  However, no single event can be both a read and a write.
Atomic read-modify-write accesses, such as atomic_inc() or xchg(),
correspond to a pair of events: a read followed by a write.  (The
write event is omitted for executions where it doesn't occur, such as
a cmpxchg() where the comparison fails.)

Other parts of the code, those which do not involve interaction with
shared memory, do not give rise to events.  Thus, arithmetic and
logical computations, control-flow instructions, or accesses to
private memory or CPU registers are not of central interest to the
memory model.  They only affect the model's predictions indirectly.
For example, an arithmetic computation might determine the value that
gets stored to a shared memory location (or in the case of an array
index, the address where the value gets stored), but the memory model
is concerned only with the store itself -- its value and its address
-- not the computation leading up to it.

Events in the LKMM can be linked by various relations, which we will
describe in the following sections.
The memory model requires certain
of these relations to be orderings, that is, it requires them not to
have any cycles.


THE PROGRAM ORDER RELATION: po AND po-loc
-----------------------------------------

The most important relation between events is program order (po).  You
can think of it as the order in which statements occur in the source
code after branches are taken into account and loops have been
unrolled.  A better description might be the order in which
instructions are presented to a CPU's execution unit.  Thus, we say
that X is po-before Y (written as "X ->po Y" in formulas) if X occurs
before Y in the instruction stream.

This is inherently a single-CPU relation; two instructions executing
on different CPUs are never linked by po.  Also, it is by definition
an ordering so it cannot have any cycles.

po-loc is a sub-relation of po.  It links two memory accesses when the
first comes before the second in program order and they access the
same memory location (the "-loc" suffix).

Although this may seem straightforward, there is one subtle aspect to
program order we need to explain.  The LKMM was inspired by low-level
architectural memory models which describe the behavior of machine
code, and it retains their outlook to a considerable extent.  The
read, write, and fence events used by the model are close in spirit to
individual machine instructions.  Nevertheless, the LKMM describes
kernel code written in C, and the mapping from C to machine code can
be extremely complex.

Optimizing compilers have great freedom in the way they translate
source code to object code.  They are allowed to apply transformations
that add memory accesses, eliminate accesses, combine them, split them
into pieces, or move them around.  The use of READ_ONCE(), WRITE_ONCE(),
or one of the other atomic or synchronization primitives prevents a
large number of compiler optimizations.  In particular, it is guaranteed
that the compiler will not remove such accesses from the generated code
(unless it can prove the accesses will never be executed), it will not
change the order in which they occur in the code (within limits imposed
by the C standard), and it will not introduce extraneous accesses.

The MP and SB examples above used READ_ONCE() and WRITE_ONCE() rather
than ordinary memory accesses.  Thanks to this usage, we can be certain
that in the MP example, the compiler won't reorder P0's write event to
buf and P0's write event to flag, and similarly for the other shared
memory accesses in the examples.

Since private variables are not shared between CPUs, they can be
accessed normally without READ_ONCE() or WRITE_ONCE().  In fact, they
need not even be stored in normal memory at all -- in principle a
private variable could be stored in a CPU register (hence the convention
that these variables have names starting with the letter 'r').


A WARNING
---------

The protections provided by READ_ONCE(), WRITE_ONCE(), and others are
not perfect; and under some circumstances it is possible for the
compiler to undermine the memory model.  Here is an example.  Suppose
both branches of an "if" statement store the same value to the same
location:

        r1 = READ_ONCE(x);
        if (r1) {
                WRITE_ONCE(y, 2);
                ...  /* do something */
        } else {
                WRITE_ONCE(y, 2);
                ...  /* do something else */
        }

For this code, the LKMM predicts that the load from x will always be
executed before either of the stores to y.  However, a compiler could
lift the stores out of the conditional, transforming the code into
something resembling:

        r1 = READ_ONCE(x);
        WRITE_ONCE(y, 2);
        if (r1) {
                ...  /* do something */
        } else {
                ...  /* do something else */
        }

Given this version of the code, the LKMM would predict that the load
from x could be executed after the store to y.  Thus, the memory
model's original prediction could be invalidated by the compiler.

Another issue arises from the fact that in C, arguments to many
operators and function calls can be evaluated in any order.  For
example:

        r1 = f(5) + g(6);

The object code might call f(5) either before or after g(6); the
memory model cannot assume there is a fixed program order relation
between them.  (In fact, if the function calls are inlined then the
compiler might even interleave their object code.)


DEPENDENCY RELATIONS: data, addr, and ctrl
------------------------------------------

We say that two events are linked by a dependency relation when the
execution of the second event depends in some way on a value obtained
from memory by the first.  The first event must be a read, and the
value it obtains must somehow affect what the second event does.
There are three kinds of dependencies: data, address (addr), and
control (ctrl).

A read and a write event are linked by a data dependency if the value
obtained by the read affects the value stored by the write.  As a very
simple example:

        int x, y;

        r1 = READ_ONCE(x);
        WRITE_ONCE(y, r1 + 5);

The value stored by the WRITE_ONCE obviously depends on the value
loaded by the READ_ONCE.  Such dependencies can wind through
arbitrarily complicated computations, and a write can depend on the
values of multiple reads.

A read event and another memory access event are linked by an address
dependency if the value obtained by the read affects the location
accessed by the other event.  The second event can be either a read or
a write.  Here's another simple example:

        int a[20];
        int i;

        r1 = READ_ONCE(i);
        r2 = READ_ONCE(a[r1]);

Here the location accessed by the second READ_ONCE() depends on the
index value loaded by the first.  Pointer indirection also gives rise
to address dependencies, since the address of a location accessed
through a pointer will depend on the value read earlier from that
pointer.

Finally, a read event and another memory access event are linked by a
control dependency if the value obtained by the read affects whether
the second event is executed at all.  Simple example:

        int x, y;

        r1 = READ_ONCE(x);
        if (r1)
                WRITE_ONCE(y, 1984);

Execution of the WRITE_ONCE() is controlled by a conditional expression
which depends on the value obtained by the READ_ONCE(); hence there is
a control dependency from the load to the store.

It should be pretty obvious that events can only depend on reads that
come earlier in program order.  Symbolically, if we have R ->data X,
R ->addr X, or R ->ctrl X (where R is a read event), then we must also
have R ->po X.  It wouldn't make sense for a computation to depend
somehow on a value that doesn't get loaded from shared memory until
later in the code!
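
To see how these kinds of dependency can occur together, and how a
single store can depend on more than one read, here is one more sketch
in the same informal style (the variables and the bounds check are
purely illustrative):

        int x, y;
        int a[20];

        r1 = READ_ONCE(x);
        r2 = READ_ONCE(y);
        if (r1 >= 0 && r1 < 20)
                WRITE_ONCE(a[r1], r1 + r2);

There is a control dependency and an address dependency from the load
of x to the store, because the value loaded from x determines both
whether the store executes and which array element it accesses, and
there is a data dependency to the store from both loads, because the
value stored is computed from r1 and r2.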

Here's a trick question: When is a dependency not a dependency?  Answer:
When it is purely syntactic rather than semantic.  We say a dependency
between two accesses is purely syntactic if the second access doesn't
actually depend on the result of the first.  Here is a trivial example:

        r1 = READ_ONCE(x);
        WRITE_ONCE(y, r1 * 0);

There appears to be a data dependency from the load of x to the store
of y, since the value to be stored is computed from the value that was
loaded.  But in fact, the value stored does not really depend on
anything since it will always be 0.  Thus the data dependency is only
syntactic (it appears to exist in the code) but not semantic (the
second access will always be the same, regardless of the value of the
first access).  Given code like this, a compiler could simply discard
the value returned by the load from x, which would certainly destroy
any dependency.  (The compiler is not permitted to eliminate entirely
the load generated for a READ_ONCE() -- that's one of the nice
properties of READ_ONCE() -- but it is allowed to ignore the load's
value.)

It's natural to object that no one in their right mind would write
code like the above.  However, macro expansions can easily give rise
to this sort of thing, in ways that often are not apparent to the
programmer.

Another mechanism that can lead to purely syntactic dependencies is
related to the notion of "undefined behavior".  Certain program
behaviors are called "undefined" in the C language specification,
which means that when they occur there are no guarantees at all about
the outcome.  Consider the following example:

        int a[1];
        int i;

        r1 = READ_ONCE(i);
        r2 = READ_ONCE(a[r1]);

Access beyond the end or before the beginning of an array is one kind
of undefined behavior.  Therefore the compiler doesn't have to worry
about what will happen if r1 is nonzero, and it can assume that r1
will always be zero regardless of the value actually loaded from i.
(If the assumption turns out to be wrong the resulting behavior will
be undefined anyway, so the compiler doesn't care!)  Thus the value
from the load can be discarded, breaking the address dependency.

The LKMM is unaware that purely syntactic dependencies are different
from semantic dependencies and therefore mistakenly predicts that the
accesses in the two examples above will be ordered.  This is another
example of how the compiler can undermine the memory model.  Be warned.


THE READS-FROM RELATION: rf, rfi, and rfe
-----------------------------------------

The reads-from relation (rf) links a write event to a read event when
the value loaded by the read is the value that was stored by the
write.  In colloquial terms, the load "reads from" the store.  We
write W ->rf R to indicate that the load R reads from the store W.  We
further distinguish the cases where the load and the store occur on
the same CPU (internal reads-from, or rfi) and where they occur on
different CPUs (external reads-from, or rfe).

For our purposes, a memory location's initial value is treated as
though it had been written there by an imaginary initial store that
executes on a separate CPU before the main program runs.

Usage of the rf relation implicitly assumes that loads will always
read from a single store.
It doesn't apply properly in the presence
of load-tearing, where a load obtains some of its bits from one store
and some of them from another store.  Fortunately, use of READ_ONCE()
and WRITE_ONCE() will prevent load-tearing; it's not possible to have:

        int x = 0;

        P0()
        {
                WRITE_ONCE(x, 0x1234);
        }

        P1()
        {
                int r1;

                r1 = READ_ONCE(x);
        }

and end up with r1 = 0x1200 (partly from x's initial value and partly
from the value stored by P0).

On the other hand, load-tearing is unavoidable when mixed-size
accesses are used.  Consider this example:

        union {
                u32 w;
                u16 h[2];
        } x;

        P0()
        {
                WRITE_ONCE(x.h[0], 0x1234);
                WRITE_ONCE(x.h[1], 0x5678);
        }

        P1()
        {
                int r1;

                r1 = READ_ONCE(x.w);
        }

If r1 = 0x56781234 (little-endian!) at the end, then P1 must have read
from both of P0's stores.  It is possible to handle mixed-size and
unaligned accesses in a memory model, but the LKMM currently does not
attempt to do so.  It requires all accesses to be properly aligned and
of the location's actual size.


CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
------------------------------------------------------------------

Cache coherence is a general principle requiring that in a
multi-processor system, the CPUs must share a consistent view of the
memory contents.  Specifically, it requires that for each location in
shared memory, the stores to that location must form a single global
ordering which all the CPUs agree on (the coherence order), and this
ordering must be consistent with the program order for accesses to
that location.

To put it another way, for any variable x, the coherence order (co) of
the stores to x is simply the order in which the stores overwrite one
another.  The imaginary store which establishes x's initial value
comes first in the coherence order; the store which directly
overwrites the initial value comes second; the store which overwrites
that value comes third, and so on.

You can think of the coherence order as being the order in which the
stores reach x's location in memory (or if you prefer a more
hardware-centric view, the order in which the stores get written to
x's cache line).  We write W ->co W' if W comes before W' in the
coherence order, that is, if the value stored by W gets overwritten,
directly or indirectly, by the value stored by W'.

Coherence order is required to be consistent with program order.  This
requirement takes the form of four coherency rules:

        Write-write coherence: If W ->po-loc W' (i.e., W comes before
        W' in program order and they access the same location), where W
        and W' are two stores, then W ->co W'.

        Write-read coherence: If W ->po-loc R, where W is a store and R
        is a load, then R must read from W or from some other store
        which comes after W in the coherence order.

        Read-write coherence: If R ->po-loc W, where R is a load and W
        is a store, then the store which R reads from must come before
        W in the coherence order.

        Read-read coherence: If R ->po-loc R', where R and R' are two
        loads, then either they read from the same store or else the
        store read by R comes before the store read by R' in the
        coherence order.
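
The examples later in this section illustrate violations of the
write-write, read-write, and read-read rules.  To round out the set,
here is a sketch (with an arbitrary value) of what a violation of the
remaining rule, write-read coherence, would look like:

        int x = 0;

        P0()
        {
                int r1;

                WRITE_ONCE(x, 88);
                r1 = READ_ONCE(x);
        }

If r1 = 0 at the end -- that is, if the load were to read from x's
initial value rather than from the po-earlier store (there is no
co-later store for it to read from) -- the write-read coherence rule
would be violated.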

This is sometimes referred to as sequential consistency per variable,
because it means that the accesses to any single memory location obey
the rules of the Sequential Consistency memory model.  (According to
Wikipedia, sequential consistency per variable and cache coherence
mean the same thing except that cache coherence includes an extra
requirement that every store eventually becomes visible to every CPU.)

Any reasonable memory model will include cache coherence.  Indeed, our
expectation of cache coherence is so deeply ingrained that violations
of its requirements look more like hardware bugs than programming
errors:

        int x;

        P0()
        {
                WRITE_ONCE(x, 17);
                WRITE_ONCE(x, 23);
        }

If the final value stored in x after this code ran was 17, you would
think your computer was broken.  It would be a violation of the
write-write coherence rule: Since the store of 23 comes later in
program order, it must also come later in x's coherence order and
thus must overwrite the store of 17.

        int x = 0;

        P0()
        {
                int r1;

                r1 = READ_ONCE(x);
                WRITE_ONCE(x, 666);
        }

If r1 = 666 at the end, this would violate the read-write coherence
rule: The READ_ONCE() load comes before the WRITE_ONCE() store in
program order, so it must not read from that store but rather from one
coming earlier in the coherence order (in this case, x's initial
value).

        int x = 0;

        P0()
        {
                WRITE_ONCE(x, 5);
        }

        P1()
        {
                int r1, r2;

                r1 = READ_ONCE(x);
                r2 = READ_ONCE(x);
        }

If r1 = 5 (reading from P0's store) and r2 = 0 (reading from the
imaginary store which establishes x's initial value) at the end, this
would violate the read-read coherence rule: The r1 load comes before
the r2 load in program order, so it must not read from a store that
comes later in the coherence order.

(As a minor curiosity, if this code had used normal loads instead of
READ_ONCE() in P1, on Itanium it sometimes could end up with r1 = 5
and r2 = 0!  This results from parallel execution of the operations
encoded in Itanium's Very-Long-Instruction-Word format, and it is yet
another motivation for using READ_ONCE() when accessing shared memory
locations.)

Just like the po relation, co is inherently an ordering -- it is not
possible for a store to directly or indirectly overwrite itself!  And
just like with the rf relation, we distinguish between stores that
occur on the same CPU (internal coherence order, or coi) and stores
that occur on different CPUs (external coherence order, or coe).

On the other hand, stores to different memory locations are never
related by co, just as instructions on different CPUs are never
related by po.  Coherence order is strictly per-location, or if you
prefer, each location has its own independent coherence order.


THE FROM-READS RELATION: fr, fri, and fre
-----------------------------------------

The from-reads relation (fr) can be a little difficult for people to
grok.  It describes the situation where a load reads a value that gets
overwritten by a store.  In other words, we have R ->fr W when the
value that R reads is overwritten (directly or indirectly) by W, or
equivalently, when R reads from a store which comes earlier than W in
the coherence order.

For example:

        int x = 0;

        P0()
        {
                int r1;

                r1 = READ_ONCE(x);
                WRITE_ONCE(x, 2);
        }

The value loaded from x will be 0 (assuming cache coherence!), and it
gets overwritten by the value 2.  Thus there is an fr link from the
READ_ONCE() to the WRITE_ONCE().  If the code contained any later
stores to x, there would also be fr links from the READ_ONCE() to
them.

As with rf, rfi, and rfe, we subdivide the fr relation into fri (when
the load and the store are on the same CPU) and fre (when they are on
different CPUs).

Note that the fr relation is determined entirely by the rf and co
relations; it is not independent.  Given a read event R and a write
event W for the same location, we will have R ->fr W if and only if
the write which R reads from is co-before W.  In symbols,

        (R ->fr W) := (there exists W' with W' ->rf R and W' ->co W).


AN OPERATIONAL MODEL
--------------------

The LKMM is based on various operational memory models, meaning that
the models arise from an abstract view of how a computer system
operates.  Here are the main ideas, as incorporated into the LKMM.

The system as a whole is divided into the CPUs and a memory subsystem.
The CPUs are responsible for executing instructions (not necessarily
in program order), and they communicate with the memory subsystem.
For the most part, executing an instruction requires a CPU to perform
only internal operations.  However, loads, stores, and fences involve
more.

When CPU C executes a store instruction, it tells the memory subsystem
to store a certain value at a certain location.  The memory subsystem
propagates the store to all the other CPUs as well as to RAM.  (As a
special case, we say that the store propagates to its own CPU at the
time it is executed.)  The memory subsystem also determines where the
store falls in the location's coherence order.  In particular, it must
arrange for the store to be co-later than (i.e., to overwrite) any
other store to the same location which has already propagated to CPU C.

When a CPU executes a load instruction R, it first checks to see
whether there are any as-yet unexecuted store instructions, for the
same location, that come before R in program order.  If there are, it
uses the value of the po-latest such store as the value obtained by R,
and we say that the store's value is forwarded to R.  Otherwise, the
CPU asks the memory subsystem for the value to load and we say that R
is satisfied from memory.  The memory subsystem hands back the value
of the co-latest store to the location in question which has already
propagated to that CPU.

(In fact, the picture needs to be a little more complicated than this.
CPUs have local caches, and propagating a store to a CPU really means
propagating it to the CPU's local cache.  A local cache can take some
time to process the stores that it receives, and a store can't be used
to satisfy one of the CPU's loads until it has been processed.  On
most architectures, the local caches process stores in
First-In-First-Out order, and consequently the processing delay
doesn't matter for the memory model.  But on Alpha, the local caches
have a partitioned design that results in non-FIFO behavior.  We will
discuss this in more detail later.)

Note that load instructions may be executed speculatively and may be
restarted under certain circumstances.
The memory model ignores these
premature executions; we simply say that the load executes at the
final time it is forwarded or satisfied.

Executing a fence (or memory barrier) instruction doesn't require a
CPU to do anything special other than informing the memory subsystem
about the fence.  However, fences do constrain the way CPUs and the
memory subsystem handle other instructions, in two respects.

First, a fence forces the CPU to execute various instructions in
program order.  Exactly which instructions are ordered depends on the
type of fence:

        Strong fences, including smp_mb() and synchronize_rcu(), force
        the CPU to execute all po-earlier instructions before any
        po-later instructions;

        smp_rmb() forces the CPU to execute all po-earlier loads
        before any po-later loads;

        smp_wmb() forces the CPU to execute all po-earlier stores
        before any po-later stores;

        Acquire fences, such as smp_load_acquire(), force the CPU to
        execute the load associated with the fence (e.g., the load
        part of an smp_load_acquire()) before any po-later
        instructions;

        Release fences, such as smp_store_release(), force the CPU to
        execute all po-earlier instructions before the store
        associated with the fence (e.g., the store part of an
        smp_store_release()).

Second, some types of fence affect the way the memory subsystem
propagates stores.  When a fence instruction is executed on CPU C:

        For each other CPU C', smp_wmb() forces all po-earlier stores
        on C to propagate to C' before any po-later stores do.

        For each other CPU C', any store which propagates to C before
        a release fence is executed (including all po-earlier
        stores executed on C) is forced to propagate to C' before the
        store associated with the release fence does.

        Any store which propagates to C before a strong fence is
        executed (including all po-earlier stores on C) is forced to
        propagate to all other CPUs before any instructions po-after
        the strong fence are executed on C.

The propagation ordering enforced by release fences and strong fences
affects stores from other CPUs that propagate to CPU C before the
fence is executed, as well as stores that are executed on C before the
fence.  We describe this property by saying that release fences and
strong fences are A-cumulative.  By contrast, smp_wmb() fences are not
A-cumulative; they only affect the propagation of stores that are
executed on C before the fence (i.e., those which precede the fence in
program order).

rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu() fences have
other properties which we discuss later.


PROPAGATION ORDER RELATION: cumul-fence
---------------------------------------

The fences which affect propagation order (i.e., strong, release, and
smp_wmb() fences) are collectively referred to as cumul-fences, even
though smp_wmb() isn't A-cumulative.  The cumul-fence relation is
defined to link memory access events E and F whenever:

        E and F are both stores on the same CPU and an smp_wmb() fence
        event occurs between them in program order; or

        F is a release fence and some X comes before F in program order,
        where either X = E or else E ->rf X; or

        A strong fence event occurs between some X and F in program
        order, where either X = E or else E ->rf X.
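
Here is a sketch, in the same informal style as the other examples, of
the second case in this definition, the one involving a release fence:

        int x = 0, y = 0;

        P0()
        {
                WRITE_ONCE(x, 1);               /* E */
        }

        P1()
        {
                int r1;

                r1 = READ_ONCE(x);              /* X */
                smp_store_release(&y, 1);       /* F */
        }

In executions where r1 = 1 we have E ->rf X, and X comes before the
release store F in P1's program order; hence E ->cumul-fence F.  As
described next, the operational model then guarantees that P0's store
to x propagates to any given CPU before P1's store to y does.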

The operational model requires that whenever W and W' are both stores
and W ->cumul-fence W', then W must propagate to any given CPU
before W' does.  However, for different CPUs C and C', it does not
require W to propagate to C before W' propagates to C'.


DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL
-------------------------------------------------

The LKMM is derived from the restrictions imposed by the design
outlined above.  These restrictions involve the necessity of
maintaining cache coherence and the fact that a CPU can't operate on a
value before it knows what that value is, among other things.

The formal version of the LKMM is defined by six requirements, or
axioms:

        Sequential consistency per variable: This requires that the
        system obey the four coherency rules.

        Atomicity: This requires that atomic read-modify-write
        operations really are atomic, that is, no other stores can
        sneak into the middle of such an update.

        Happens-before: This requires that certain instructions are
        executed in a specific order.

        Propagation: This requires that certain stores propagate to
        CPUs and to RAM in a specific order.

        Rcu: This requires that RCU read-side critical sections and
        grace periods obey the rules of RCU, in particular, the
        Grace-Period Guarantee.

        Plain-coherence: This requires that plain memory accesses
        (those not using READ_ONCE(), WRITE_ONCE(), etc.) must obey
        the operational model's rules regarding cache coherence.

The first and second are quite common; they can be found in many
memory models (such as those for C11/C++11).  The "happens-before" and
"propagation" axioms have analogs in other memory models as well.  The
"rcu" and "plain-coherence" axioms are specific to the LKMM.

Each of these axioms is discussed below.


SEQUENTIAL CONSISTENCY PER VARIABLE
-----------------------------------

According to the principle of cache coherence, the stores to any fixed
shared location in memory form a global ordering.  We can imagine
inserting the loads from that location into this ordering, by placing
each load between the store that it reads from and the following
store.  This leaves the relative positions of loads that read from the
same store unspecified; let's say they are inserted in program order,
first for CPU 0, then CPU 1, etc.

You can check that the four coherency rules imply that the rf, co, fr,
and po-loc relations agree with this global ordering; in other words,
whenever we have X ->rf Y or X ->co Y or X ->fr Y or X ->po-loc Y, the
X event comes before the Y event in the global ordering.  The LKMM's
"coherence" axiom expresses this by requiring the union of these
relations not to have any cycles.  This means it must not be possible
to find events

        X0 -> X1 -> X2 -> ... -> Xn -> X0,

where each of the links is either rf, co, fr, or po-loc.  This has to
hold if the accesses to the fixed memory location can be ordered as
cache coherence demands.

Although it is not obvious, it can be shown that the converse is also
true: This LKMM axiom implies that the four coherency rules are
obeyed.


ATOMIC UPDATES: rmw
-------------------

What does it mean to say that a read-modify-write (rmw) update, such
as atomic_inc(&x), is atomic?
It means that the memory location (x in
this case) does not get altered between the read and the write events
making up the atomic operation.  In particular, if two CPUs perform
atomic_inc(&x) concurrently, it must be guaranteed that the final
value of x will be the initial value plus two.  We should never have
the following sequence of events:

        CPU 0 loads x obtaining 13;
        CPU 1 loads x obtaining 13;
        CPU 0 stores 14 to x;
        CPU 1 stores 14 to x;

where the final value of x is wrong (14 rather than 15).

In this example, CPU 0's increment effectively gets lost because it
occurs in between CPU 1's load and store.  To put it another way, the
problem is that the position of CPU 0's store in x's coherence order
is between the store that CPU 1 reads from and the store that CPU 1
performs.

The same analysis applies to all atomic update operations.  Therefore,
to enforce atomicity the LKMM requires that atomic updates follow this
rule: Whenever R and W are the read and write events composing an
atomic read-modify-write and W' is the write event which R reads from,
there must not be any stores coming between W' and W in the coherence
order.  Equivalently,

        (R ->rmw W) implies (there is no X with R ->fr X and X ->co W),

where the rmw relation links the read and write events making up each
atomic update.  This is what the LKMM's "atomic" axiom says.


THE PRESERVED PROGRAM ORDER RELATION: ppo
-----------------------------------------

There are many situations where a CPU is obliged to execute two
instructions in program order.  We amalgamate them into the ppo (for
"preserved program order") relation, which links the po-earlier
instruction to the po-later instruction and is thus a sub-relation of
po.

The operational model already includes a description of one such
situation: Fences are a source of ppo links.  Suppose X and Y are
memory accesses with X ->po Y; then the CPU must execute X before Y if
any of the following hold:

        A strong (smp_mb() or synchronize_rcu()) fence occurs between
        X and Y;

        X and Y are both stores and an smp_wmb() fence occurs between
        them;

        X and Y are both loads and an smp_rmb() fence occurs between
        them;

        X is also an acquire fence, such as smp_load_acquire();

        Y is also a release fence, such as smp_store_release().

Another possibility, not mentioned earlier but discussed in the next
section, is:

        X and Y are both loads, X ->addr Y (i.e., there is an address
        dependency from X to Y), and X is a READ_ONCE() or an atomic
        access.

Dependencies can also cause instructions to be executed in program
order.  This is uncontroversial when the second instruction is a
store; either a data, address, or control dependency from a load R to
a store W will force the CPU to execute R before W.  This is very
simply because the CPU cannot tell the memory subsystem about W's
store before it knows what value should be stored (in the case of a
data dependency), what location it should be stored into (in the case
of an address dependency), or whether the store should actually take
place (in the case of a control dependency).

Dependencies to load instructions are more problematic.  To begin with,
there is no such thing as a data dependency to a load.
Next, a CPU
has no reason to respect a control dependency to a load, because it
can always satisfy the second load speculatively before the first, and
then ignore the result if it turns out that the second load shouldn't
be executed after all.  And lastly, the real difficulties begin when
we consider address dependencies to loads.

To be fair about it, all Linux-supported architectures do execute
loads in program order if there is an address dependency between them.
After all, a CPU cannot ask the memory subsystem to load a value from
a particular location before it knows what that location is.  However,
the split-cache design used by Alpha can cause it to behave in a way
that looks as if the loads were executed out of order (see the next
section for more details).  The kernel includes a workaround for this
problem when the loads come from READ_ONCE(), and therefore the LKMM
includes address dependencies to loads in the ppo relation.

On the other hand, dependencies can indirectly affect the ordering of
two loads.  This happens when there is a dependency from a load to a
store and a second, po-later load reads from that store:

        R ->dep W ->rfi R',

where the dep link can be either an address or a data dependency.  In
this situation we know it is possible for the CPU to execute R' before
W, because it can forward the value that W will store to R'.  But it
cannot execute R' before R, because it cannot forward the value before
it knows what that value is, or that W and R' do access the same
location.  However, if there is merely a control dependency between R
and W then the CPU can speculatively forward W to R' before executing
R; if the speculation turns out to be wrong then the CPU merely has to
restart or abandon R'.

(In theory, a CPU might forward a store to a load when it runs across
an address dependency like this:

        r1 = READ_ONCE(ptr);
        WRITE_ONCE(*r1, 17);
        r2 = READ_ONCE(*r1);

because it could tell that the store and the second load access the
same location even before it knows what the location's address is.
However, none of the architectures supported by the Linux kernel do
this.)

Two memory accesses of the same location must always be executed in
program order if the second access is a store.  Thus, if we have

        R ->po-loc W

(the po-loc link says that R comes before W in program order and they
access the same location), the CPU is obliged to execute W after R.
If it executed W first then the memory subsystem would respond to R's
read request with the value stored by W (or an even later store), in
violation of the read-write coherence rule.  Similarly, if we had

        W ->po-loc W'

and the CPU executed W' before W, then the memory subsystem would put
W' before W in the coherence order.  It would effectively cause W to
overwrite W', in violation of the write-write coherence rule.
(Interestingly, an early ARMv8 memory model, now obsolete, proposed
allowing out-of-order writes like this to occur.  The model avoided
violating the write-write coherence rule by requiring the CPU not to
send the W write to the memory subsystem at all!)


AND THEN THERE WAS ALPHA
------------------------

As mentioned above, the Alpha architecture is unique in that it does
not appear to respect address dependencies to loads.
This means that
code such as the following:

        int x = 0;
        int y = -1;
        int *ptr = &y;

        P0()
        {
                WRITE_ONCE(x, 1);
                smp_wmb();
                WRITE_ONCE(ptr, &x);
        }

        P1()
        {
                int *r1;
                int r2;

                r1 = ptr;
                r2 = READ_ONCE(*r1);
        }

can malfunction on Alpha systems (notice that P1 uses an ordinary load
to read ptr instead of READ_ONCE()).  It is quite possible that r1 = &x
and r2 = 0 at the end, in spite of the address dependency.

At first glance this doesn't seem to make sense.  We know that the
smp_wmb() forces P0's store to x to propagate to P1 before the store
to ptr does.  And since P1 can't execute its second load
until it knows what location to load from, i.e., after executing its
first load, the value x = 1 must have propagated to P1 before the
second load executed.  So why doesn't r2 end up equal to 1?

The answer lies in the Alpha's split local caches.  Although the two
stores do reach P1's local cache in the proper order, it can happen
that the first store is processed by a busy part of the cache while
the second store is processed by an idle part.  As a result, the x = 1
value may not become available for P1's CPU to read until after the
ptr = &x value does, leading to the undesirable result above.  The
final effect is that even though the two loads really are executed in
program order, it appears that they aren't.

This could not have happened if the local cache had processed the
incoming stores in FIFO order.  By contrast, other architectures
maintain at least the appearance of FIFO order.

In practice, this difficulty is solved by inserting a special fence
between P1's two loads when the kernel is compiled for the Alpha
architecture.  In fact, as of version 4.15, the kernel automatically
adds this fence after every READ_ONCE() and atomic load on Alpha.  The
effect of the fence is to cause the CPU not to execute any po-later
instructions until after the local cache has finished processing all
the stores it has already received.  Thus, if the code was changed to:

        P1()
        {
                int *r1;
                int r2;

                r1 = READ_ONCE(ptr);
                r2 = READ_ONCE(*r1);
        }

then we would never get r1 = &x and r2 = 0.  By the time P1 executed
its second load, the x = 1 store would already be fully processed by
the local cache and available for satisfying the read request.  Thus
we have yet another reason why shared data should always be read with
READ_ONCE() or another synchronization primitive rather than accessed
directly.

The LKMM requires that smp_rmb(), acquire fences, and strong fences
share this property: They do not allow the CPU to execute any po-later
instructions (or po-later loads in the case of smp_rmb()) until all
outstanding stores have been processed by the local cache.  In the
case of a strong fence, the CPU first has to wait for all of its
po-earlier stores to propagate to every other CPU in the system; then
it has to wait for the local cache to process all the stores received
as of that time -- not just the stores received when the strong fence
began.

And of course, none of this matters for any architecture other than
Alpha.


THE HAPPENS-BEFORE RELATION: hb
-------------------------------

The happens-before relation (hb) links memory accesses that have to
execute in a certain order.  hb includes the ppo relation and two
others, one of which is rfe.

W ->rfe R implies that W and R are on different CPUs.  It also means
that W's store must have propagated to R's CPU before R executed;
otherwise R could not have read the value stored by W.  Therefore W
must have executed before R, and so we have W ->hb R.

The equivalent fact need not hold if W ->rfi R (i.e., W and R are on
the same CPU).  As we have already seen, the operational model allows
W's value to be forwarded to R in such cases, meaning that R may well
execute before W does.

It's important to understand that neither coe nor fre is included in
hb, despite their similarities to rfe.  For example, suppose we have
W ->coe W'.  This means that W and W' are stores to the same location,
they execute on different CPUs, and W comes before W' in the coherence
order (i.e., W' overwrites W).  Nevertheless, it is possible for W' to
execute before W, because the decision as to which store overwrites
the other is made later by the memory subsystem.  When the stores are
nearly simultaneous, either one can come out on top.  Similarly,
R ->fre W means that W overwrites the value which R reads, but it
doesn't mean that W has to execute after R.  All that's necessary is
for the memory subsystem not to propagate W to R's CPU until after R
has executed, which is possible if W executes shortly before R.

The third relation included in hb is like ppo, in that it only links
events that are on the same CPU.  However it is more difficult to
explain, because it arises only indirectly from the requirement of
cache coherence.  The relation is called prop, and it links two events
on CPU C in situations where a store from some other CPU comes after
the first event in the coherence order and propagates to C before the
second event executes.

This is best explained with some examples.  The simplest case looks
like this:

        int x;

        P0()
        {
                int r1;

                WRITE_ONCE(x, 1);
                r1 = READ_ONCE(x);
        }

        P1()
        {
                WRITE_ONCE(x, 8);
        }

If r1 = 8 at the end then P0's accesses must have executed in program
order.  We can deduce this from the operational model; if P0's load
had executed before its store then the value of the store would have
been forwarded to the load, so r1 would have ended up equal to 1, not
8.  In this case there is a prop link from P0's write event to its read
event, because P1's store came after P0's store in x's coherence
order, and P1's store propagated to P0 before P0's load executed.

An equally simple case involves two loads of the same location that
read from different stores:

        int x = 0;

        P0()
        {
                int r1, r2;

                r1 = READ_ONCE(x);
                r2 = READ_ONCE(x);
        }

        P1()
        {
                WRITE_ONCE(x, 9);
        }

If r1 = 0 and r2 = 9 at the end then P0's accesses must have executed
in program order.  If the second load had executed before the first
then the x = 9 store must have been propagated to P0 before the first
load executed, and so r1 would have been 9 rather than 0.
In this
case there is a prop link from P0's first read event to its second,
because P1's store overwrote the value read by P0's first load, and
P1's store propagated to P0 before P0's second load executed.

Less trivial examples of prop all involve fences.  Unlike the simple
examples above, they can require that some instructions are executed
out of program order.  This next one should look familiar:

        int buf = 0, flag = 0;

        P0()
        {
                WRITE_ONCE(buf, 1);
                smp_wmb();
                WRITE_ONCE(flag, 1);
        }

        P1()
        {
                int r1;
                int r2;

                r1 = READ_ONCE(flag);
                r2 = READ_ONCE(buf);
        }

This is the MP pattern again, with an smp_wmb() fence between the two
stores.  If r1 = 1 and r2 = 0 at the end then there is a prop link
from P1's second load to its first (backwards!).  The reason is
similar to the previous examples: The value P1 loads from buf gets
overwritten by P0's store to buf, the fence guarantees that the store
to buf will propagate to P1 before the store to flag does, and the
store to flag propagates to P1 before P1 reads flag.

The prop link says that in order to obtain the r1 = 1, r2 = 0 result,
P1 must execute its second load before the first.  Indeed, if the load
from flag were executed first, then the buf = 1 store would already
have propagated to P1 by the time P1's load from buf executed, so r2
would have been 1 at the end, not 0.  (The reasoning holds even for
Alpha, although the details are more complicated and we will not go
into them.)

But what if we put an smp_rmb() fence between P1's loads?  The fence
would force the two loads to be executed in program order, and it
would generate a cycle in the hb relation: The fence would create a ppo
link (hence an hb link) from the first load to the second, and the
prop relation would give an hb link from the second load to the first.
Since an instruction can't execute before itself, we are forced to
conclude that if an smp_rmb() fence is added, the r1 = 1, r2 = 0
outcome is impossible -- as it should be.

The formal definition of the prop relation involves a coe or fre link,
followed by an arbitrary number of cumul-fence links, ending with an
rfe link.  You can concoct more exotic examples, containing more than
one fence, although this quickly leads to diminishing returns in terms
of complexity.  For instance, here's an example containing a coe link
followed by two cumul-fences and an rfe link, utilizing the fact that
release fences are A-cumulative:

        int x, y, z;

        P0()
        {
                int r0;

                WRITE_ONCE(x, 1);
                r0 = READ_ONCE(z);
        }

        P1()
        {
                WRITE_ONCE(x, 2);
                smp_wmb();
                WRITE_ONCE(y, 1);
        }

        P2()
        {
                int r2;

                r2 = READ_ONCE(y);
                smp_store_release(&z, 1);
        }

If x = 2, r0 = 1, and r2 = 1 after this code runs then there is a prop
link from P0's store to its load.
This is because P0's store gets 1383overwritten by P1's store since x = 2 at the end (a coe link), the 1384smp_wmb() ensures that P1's store to x propagates to P2 before the 1385store to y does (the first cumul-fence), the store to y propagates to P2 1386before P2's load and store execute, P2's smp_store_release() 1387guarantees that the stores to x and y both propagate to P0 before the 1388store to z does (the second cumul-fence), and P0's load executes after the 1389store to z has propagated to P0 (an rfe link). 1390 1391In summary, the fact that the hb relation links memory access events 1392in the order they execute means that it must not have cycles. This 1393requirement is the content of the LKMM's "happens-before" axiom. 1394 1395The LKMM defines yet another relation connected to times of 1396instruction execution, but it is not included in hb. It relies on the 1397particular properties of strong fences, which we cover in the next 1398section. 1399 1400 1401THE PROPAGATES-BEFORE RELATION: pb 1402---------------------------------- 1403 1404The propagates-before (pb) relation capitalizes on the special 1405features of strong fences. It links two events E and F whenever some 1406store is coherence-later than E and propagates to every CPU and to RAM 1407before F executes. The formal definition requires that E be linked to 1408F via a coe or fre link, an arbitrary number of cumul-fences, an 1409optional rfe link, a strong fence, and an arbitrary number of hb 1410links. Let's see how this definition works out. 1411 1412Consider first the case where E is a store (implying that the sequence 1413of links begins with coe). Then there are events W, X, Y, and Z such 1414that: 1415 1416 E ->coe W ->cumul-fence* X ->rfe? Y ->strong-fence Z ->hb* F, 1417 1418where the * suffix indicates an arbitrary number of links of the 1419specified type, and the ? suffix indicates the link is optional (Y may 1420be equal to X). Because of the cumul-fence links, we know that W will 1421propagate to Y's CPU before X does, hence before Y executes and hence 1422before the strong fence executes. Because this fence is strong, we 1423know that W will propagate to every CPU and to RAM before Z executes. 1424And because of the hb links, we know that Z will execute before F. 1425Thus W, which comes later than E in the coherence order, will 1426propagate to every CPU and to RAM before F executes. 1427 1428The case where E is a load is exactly the same, except that the first 1429link in the sequence is fre instead of coe. 1430 1431The existence of a pb link from E to F implies that E must execute 1432before F. To see why, suppose that F executed first. Then W would 1433have propagated to E's CPU before E executed. If E was a store, the 1434memory subsystem would then be forced to make E come after W in the 1435coherence order, contradicting the fact that E ->coe W. If E was a 1436load, the memory subsystem would then be forced to satisfy E's read 1437request with the value stored by W or an even later store, 1438contradicting the fact that E ->fre W. 

A good example illustrating how pb works is the SB pattern with strong
fences:

	int x = 0, y = 0;

	P0()
	{
		int r0;

		WRITE_ONCE(x, 1);
		smp_mb();
		r0 = READ_ONCE(y);
	}

	P1()
	{
		int r1;

		WRITE_ONCE(y, 1);
		smp_mb();
		r1 = READ_ONCE(x);
	}

If r0 = 0 at the end then there is a pb link from P0's load to P1's
load: an fre link from P0's load to P1's store (which overwrites the
value read by P0), and a strong fence between P1's store and its load.
In this example, the sequences of cumul-fence and hb links are empty.
Note that this pb link is not included in hb as an instance of prop,
because it does not start and end on the same CPU.

Similarly, if r1 = 0 at the end then there is a pb link from P1's load
to P0's.  This means that if both r0 and r1 were 0 there would be a
cycle in pb, which is not possible since an instruction cannot execute
before itself.  Thus, adding smp_mb() fences to the SB pattern
prevents the r0 = 0, r1 = 0 outcome.

In summary, the fact that the pb relation links events in the order
they execute means that it cannot have cycles.  This requirement is
the content of the LKMM's "propagation" axiom.


RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
------------------------------------------------------------------------

RCU (Read-Copy-Update) is a powerful synchronization mechanism.  It
rests on two concepts: grace periods and read-side critical sections.

A grace period is the span of time occupied by a call to
synchronize_rcu().  A read-side critical section (or just critical
section, for short) is a region of code delimited by rcu_read_lock()
at the start and rcu_read_unlock() at the end.  Critical sections can
be nested, although we won't make use of this fact.

As far as memory models are concerned, RCU's main feature is its
Grace-Period Guarantee, which states that a critical section can never
span a full grace period.  In more detail, the Guarantee says:

	For any critical section C and any grace period G, at least
	one of the following statements must hold:

(1)	C ends before G does, and in addition, every store that
	propagates to C's CPU before the end of C must propagate to
	every CPU before G ends.

(2)	G starts before C does, and in addition, every store that
	propagates to G's CPU before the start of G must propagate
	to every CPU before C starts.

In particular, it is not possible for a critical section to both start
before and end after a grace period.

Here is a simple example of RCU in action:

	int x, y;

	P0()
	{
		rcu_read_lock();
		WRITE_ONCE(x, 1);
		WRITE_ONCE(y, 1);
		rcu_read_unlock();
	}

	P1()
	{
		int r1, r2;

		r1 = READ_ONCE(x);
		synchronize_rcu();
		r2 = READ_ONCE(y);
	}

The Grace Period Guarantee tells us that when this code runs, it will
never end with r1 = 1 and r2 = 0.  The reasoning is as follows.  r1 = 1
means that P0's store to x propagated to P1 before P1 called
synchronize_rcu(), so P0's critical section must have started before
P1's grace period, contrary to part (2) of the Guarantee.
On the 1537other hand, r2 = 0 means that P0's store to y, which occurs before the 1538end of the critical section, did not propagate to P1 before the end of 1539the grace period, contrary to part (1). Together the results violate 1540the Guarantee. 1541 1542In the kernel's implementations of RCU, the requirements for stores 1543to propagate to every CPU are fulfilled by placing strong fences at 1544suitable places in the RCU-related code. Thus, if a critical section 1545starts before a grace period does then the critical section's CPU will 1546execute an smp_mb() fence after the end of the critical section and 1547some time before the grace period's synchronize_rcu() call returns. 1548And if a critical section ends after a grace period does then the 1549synchronize_rcu() routine will execute an smp_mb() fence at its start 1550and some time before the critical section's opening rcu_read_lock() 1551executes. 1552 1553What exactly do we mean by saying that a critical section "starts 1554before" or "ends after" a grace period? Some aspects of the meaning 1555are pretty obvious, as in the example above, but the details aren't 1556entirely clear. The LKMM formalizes this notion by means of the 1557rcu-link relation. rcu-link encompasses a very general notion of 1558"before": If E and F are RCU fence events (i.e., rcu_read_lock(), 1559rcu_read_unlock(), or synchronize_rcu()) then among other things, 1560E ->rcu-link F includes cases where E is po-before some memory-access 1561event X, F is po-after some memory-access event Y, and we have any of 1562X ->rfe Y, X ->co Y, or X ->fr Y. 1563 1564The formal definition of the rcu-link relation is more than a little 1565obscure, and we won't give it here. It is closely related to the pb 1566relation, and the details don't matter unless you want to comb through 1567a somewhat lengthy formal proof. Pretty much all you need to know 1568about rcu-link is the information in the preceding paragraph. 1569 1570The LKMM also defines the rcu-gp and rcu-rscsi relations. They bring 1571grace periods and read-side critical sections into the picture, in the 1572following way: 1573 1574 E ->rcu-gp F means that E and F are in fact the same event, 1575 and that event is a synchronize_rcu() fence (i.e., a grace 1576 period). 1577 1578 E ->rcu-rscsi F means that E and F are the rcu_read_unlock() 1579 and rcu_read_lock() fence events delimiting some read-side 1580 critical section. (The 'i' at the end of the name emphasizes 1581 that this relation is "inverted": It links the end of the 1582 critical section to the start.) 1583 1584If we think of the rcu-link relation as standing for an extended 1585"before", then X ->rcu-gp Y ->rcu-link Z roughly says that X is a 1586grace period which ends before Z begins. (In fact it covers more than 1587this, because it also includes cases where some store propagates to 1588Z's CPU before Z begins but doesn't propagate to some other CPU until 1589after X ends.) Similarly, X ->rcu-rscsi Y ->rcu-link Z says that X is 1590the end of a critical section which starts before Z begins. 1591 1592The LKMM goes on to define the rcu-order relation as a sequence of 1593rcu-gp and rcu-rscsi links separated by rcu-link links, in which the 1594number of rcu-gp links is >= the number of rcu-rscsi links. For 1595example: 1596 1597 X ->rcu-gp Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V 1598 1599would imply that X ->rcu-order V, because this sequence contains two 1600rcu-gp links and one rcu-rscsi link. 
(It also implies that 1601X ->rcu-order T and Z ->rcu-order V.) On the other hand: 1602 1603 X ->rcu-rscsi Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V 1604 1605does not imply X ->rcu-order V, because the sequence contains only 1606one rcu-gp link but two rcu-rscsi links. 1607 1608The rcu-order relation is important because the Grace Period Guarantee 1609means that rcu-order links act kind of like strong fences. In 1610particular, E ->rcu-order F implies not only that E begins before F 1611ends, but also that any write po-before E will propagate to every CPU 1612before any instruction po-after F can execute. (However, it does not 1613imply that E must execute before F; in fact, each synchronize_rcu() 1614fence event is linked to itself by rcu-order as a degenerate case.) 1615 1616To prove this in full generality requires some intellectual effort. 1617We'll consider just a very simple case: 1618 1619 G ->rcu-gp W ->rcu-link Z ->rcu-rscsi F. 1620 1621This formula means that G and W are the same event (a grace period), 1622and there are events X, Y and a read-side critical section C such that: 1623 1624 1. G = W is po-before or equal to X; 1625 1626 2. X comes "before" Y in some sense (including rfe, co and fr); 1627 1628 3. Y is po-before Z; 1629 1630 4. Z is the rcu_read_unlock() event marking the end of C; 1631 1632 5. F is the rcu_read_lock() event marking the start of C. 1633 1634From 1 - 4 we deduce that the grace period G ends before the critical 1635section C. Then part (2) of the Grace Period Guarantee says not only 1636that G starts before C does, but also that any write which executes on 1637G's CPU before G starts must propagate to every CPU before C starts. 1638In particular, the write propagates to every CPU before F finishes 1639executing and hence before any instruction po-after F can execute. 1640This sort of reasoning can be extended to handle all the situations 1641covered by rcu-order. 1642 1643The rcu-fence relation is a simple extension of rcu-order. While 1644rcu-order only links certain fence events (calls to synchronize_rcu(), 1645rcu_read_lock(), or rcu_read_unlock()), rcu-fence links any events 1646that are separated by an rcu-order link. This is analogous to the way 1647the strong-fence relation links events that are separated by an 1648smp_mb() fence event (as mentioned above, rcu-order links act kind of 1649like strong fences). Written symbolically, X ->rcu-fence Y means 1650there are fence events E and F such that: 1651 1652 X ->po E ->rcu-order F ->po Y. 1653 1654From the discussion above, we see this implies not only that X 1655executes before Y, but also (if X is a store) that X propagates to 1656every CPU before Y executes. Thus rcu-fence is sort of a 1657"super-strong" fence: Unlike the original strong fences (smp_mb() and 1658synchronize_rcu()), rcu-fence is able to link events on different 1659CPUs. (Perhaps this fact should lead us to say that rcu-fence isn't 1660really a fence at all!) 1661 1662Finally, the LKMM defines the RCU-before (rb) relation in terms of 1663rcu-fence. This is done in essentially the same way as the pb 1664relation was defined in terms of strong-fence. We will omit the 1665details; the end result is that E ->rb F implies E must execute 1666before F, just as E ->pb F does (and for much the same reasons). 1667 1668Putting this all together, the LKMM expresses the Grace Period 1669Guarantee by requiring that the rb relation does not contain a cycle. 
1670Equivalently, this "rcu" axiom requires that there are no events E 1671and F with E ->rcu-link F ->rcu-order E. Or to put it a third way, 1672the axiom requires that there are no cycles consisting of rcu-gp and 1673rcu-rscsi alternating with rcu-link, where the number of rcu-gp links 1674is >= the number of rcu-rscsi links. 1675 1676Justifying the axiom isn't easy, but it is in fact a valid 1677formalization of the Grace Period Guarantee. We won't attempt to go 1678through the detailed argument, but the following analysis gives a 1679taste of what is involved. Suppose both parts of the Guarantee are 1680violated: A critical section starts before a grace period, and some 1681store propagates to the critical section's CPU before the end of the 1682critical section but doesn't propagate to some other CPU until after 1683the end of the grace period. 1684 1685Putting symbols to these ideas, let L and U be the rcu_read_lock() and 1686rcu_read_unlock() fence events delimiting the critical section in 1687question, and let S be the synchronize_rcu() fence event for the grace 1688period. Saying that the critical section starts before S means there 1689are events Q and R where Q is po-after L (which marks the start of the 1690critical section), Q is "before" R in the sense used by the rcu-link 1691relation, and R is po-before the grace period S. Thus we have: 1692 1693 L ->rcu-link S. 1694 1695Let W be the store mentioned above, let Y come before the end of the 1696critical section and witness that W propagates to the critical 1697section's CPU by reading from W, and let Z on some arbitrary CPU be a 1698witness that W has not propagated to that CPU, where Z happens after 1699some event X which is po-after S. Symbolically, this amounts to: 1700 1701 S ->po X ->hb* Z ->fr W ->rf Y ->po U. 1702 1703The fr link from Z to W indicates that W has not propagated to Z's CPU 1704at the time that Z executes. From this, it can be shown (see the 1705discussion of the rcu-link relation earlier) that S and U are related 1706by rcu-link: 1707 1708 S ->rcu-link U. 1709 1710Since S is a grace period we have S ->rcu-gp S, and since L and U are 1711the start and end of the critical section C we have U ->rcu-rscsi L. 1712From this we obtain: 1713 1714 S ->rcu-gp S ->rcu-link U ->rcu-rscsi L ->rcu-link S, 1715 1716a forbidden cycle. Thus the "rcu" axiom rules out this violation of 1717the Grace Period Guarantee. 1718 1719For something a little more down-to-earth, let's see how the axiom 1720works out in practice. Consider the RCU code example from above, this 1721time with statement labels added: 1722 1723 int x, y; 1724 1725 P0() 1726 { 1727 L: rcu_read_lock(); 1728 X: WRITE_ONCE(x, 1); 1729 Y: WRITE_ONCE(y, 1); 1730 U: rcu_read_unlock(); 1731 } 1732 1733 P1() 1734 { 1735 int r1, r2; 1736 1737 Z: r1 = READ_ONCE(x); 1738 S: synchronize_rcu(); 1739 W: r2 = READ_ONCE(y); 1740 } 1741 1742 1743If r2 = 0 at the end then P0's store at Y overwrites the value that 1744P1's load at W reads from, so we have W ->fre Y. Since S ->po W and 1745also Y ->po U, we get S ->rcu-link U. In addition, S ->rcu-gp S 1746because S is a grace period. 1747 1748If r1 = 1 at the end then P1's load at Z reads from P0's store at X, 1749so we have X ->rfe Z. Together with L ->po X and Z ->po S, this 1750yields L ->rcu-link S. And since L and U are the start and end of a 1751critical section, we have U ->rcu-rscsi L. 1752 1753Then U ->rcu-rscsi L ->rcu-link S ->rcu-gp S ->rcu-link U is a 1754forbidden cycle, violating the "rcu" axiom. 
Hence the outcome is not 1755allowed by the LKMM, as we would expect. 1756 1757For contrast, let's see what can happen in a more complicated example: 1758 1759 int x, y, z; 1760 1761 P0() 1762 { 1763 int r0; 1764 1765 L0: rcu_read_lock(); 1766 r0 = READ_ONCE(x); 1767 WRITE_ONCE(y, 1); 1768 U0: rcu_read_unlock(); 1769 } 1770 1771 P1() 1772 { 1773 int r1; 1774 1775 r1 = READ_ONCE(y); 1776 S1: synchronize_rcu(); 1777 WRITE_ONCE(z, 1); 1778 } 1779 1780 P2() 1781 { 1782 int r2; 1783 1784 L2: rcu_read_lock(); 1785 r2 = READ_ONCE(z); 1786 WRITE_ONCE(x, 1); 1787 U2: rcu_read_unlock(); 1788 } 1789 1790If r0 = r1 = r2 = 1 at the end, then similar reasoning to before shows 1791that U0 ->rcu-rscsi L0 ->rcu-link S1 ->rcu-gp S1 ->rcu-link U2 ->rcu-rscsi 1792L2 ->rcu-link U0. However this cycle is not forbidden, because the 1793sequence of relations contains fewer instances of rcu-gp (one) than of 1794rcu-rscsi (two). Consequently the outcome is allowed by the LKMM. 1795The following instruction timing diagram shows how it might actually 1796occur: 1797 1798P0 P1 P2 1799-------------------- -------------------- -------------------- 1800rcu_read_lock() 1801WRITE_ONCE(y, 1) 1802 r1 = READ_ONCE(y) 1803 synchronize_rcu() starts 1804 . rcu_read_lock() 1805 . WRITE_ONCE(x, 1) 1806r0 = READ_ONCE(x) . 1807rcu_read_unlock() . 1808 synchronize_rcu() ends 1809 WRITE_ONCE(z, 1) 1810 r2 = READ_ONCE(z) 1811 rcu_read_unlock() 1812 1813This requires P0 and P2 to execute their loads and stores out of 1814program order, but of course they are allowed to do so. And as you 1815can see, the Grace Period Guarantee is not violated: The critical 1816section in P0 both starts before P1's grace period does and ends 1817before it does, and the critical section in P2 both starts after P1's 1818grace period does and ends after it does. 1819 1820Addendum: The LKMM now supports SRCU (Sleepable Read-Copy-Update) in 1821addition to normal RCU. The ideas involved are much the same as 1822above, with new relations srcu-gp and srcu-rscsi added to represent 1823SRCU grace periods and read-side critical sections. There is a 1824restriction on the srcu-gp and srcu-rscsi links that can appear in an 1825rcu-order sequence (the srcu-rscsi links must be paired with srcu-gp 1826links having the same SRCU domain with proper nesting); the details 1827are relatively unimportant. 1828 1829 1830LOCKING 1831------- 1832 1833The LKMM includes locking. In fact, there is special code for locking 1834in the formal model, added in order to make tools run faster. 1835However, this special code is intended to be more or less equivalent 1836to concepts we have already covered. A spinlock_t variable is treated 1837the same as an int, and spin_lock(&s) is treated almost the same as: 1838 1839 while (cmpxchg_acquire(&s, 0, 1) != 0) 1840 cpu_relax(); 1841 1842This waits until s is equal to 0 and then atomically sets it to 1, 1843and the read part of the cmpxchg operation acts as an acquire fence. 1844An alternate way to express the same thing would be: 1845 1846 r = xchg_acquire(&s, 1); 1847 1848along with a requirement that at the end, r = 0. Similarly, 1849spin_trylock(&s) is treated almost the same as: 1850 1851 return !cmpxchg_acquire(&s, 0, 1); 1852 1853which atomically sets s to 1 if it is currently equal to 0 and returns 1854true if it succeeds (the read part of the cmpxchg operation acts as an 1855acquire fence only if the operation is successful). 
spin_unlock(&s) 1856is treated almost the same as: 1857 1858 smp_store_release(&s, 0); 1859 1860The "almost" qualifiers above need some explanation. In the LKMM, the 1861store-release in a spin_unlock() and the load-acquire which forms the 1862first half of the atomic rmw update in a spin_lock() or a successful 1863spin_trylock() -- we can call these things lock-releases and 1864lock-acquires -- have two properties beyond those of ordinary releases 1865and acquires. 1866 1867First, when a lock-acquire reads from or is po-after a lock-release, 1868the LKMM requires that every instruction po-before the lock-release 1869must execute before any instruction po-after the lock-acquire. This 1870would naturally hold if the release and acquire operations were on 1871different CPUs and accessed the same lock variable, but the LKMM says 1872it also holds when they are on the same CPU, even if they access 1873different lock variables. For example: 1874 1875 int x, y; 1876 spinlock_t s, t; 1877 1878 P0() 1879 { 1880 int r1, r2; 1881 1882 spin_lock(&s); 1883 r1 = READ_ONCE(x); 1884 spin_unlock(&s); 1885 spin_lock(&t); 1886 r2 = READ_ONCE(y); 1887 spin_unlock(&t); 1888 } 1889 1890 P1() 1891 { 1892 WRITE_ONCE(y, 1); 1893 smp_wmb(); 1894 WRITE_ONCE(x, 1); 1895 } 1896 1897Here the second spin_lock() is po-after the first spin_unlock(), and 1898therefore the load of x must execute before the load of y, even though 1899the two locking operations use different locks. Thus we cannot have 1900r1 = 1 and r2 = 0 at the end (this is an instance of the MP pattern). 1901 1902This requirement does not apply to ordinary release and acquire 1903fences, only to lock-related operations. For instance, suppose P0() 1904in the example had been written as: 1905 1906 P0() 1907 { 1908 int r1, r2, r3; 1909 1910 r1 = READ_ONCE(x); 1911 smp_store_release(&s, 1); 1912 r3 = smp_load_acquire(&s); 1913 r2 = READ_ONCE(y); 1914 } 1915 1916Then the CPU would be allowed to forward the s = 1 value from the 1917smp_store_release() to the smp_load_acquire(), executing the 1918instructions in the following order: 1919 1920 r3 = smp_load_acquire(&s); // Obtains r3 = 1 1921 r2 = READ_ONCE(y); 1922 r1 = READ_ONCE(x); 1923 smp_store_release(&s, 1); // Value is forwarded 1924 1925and thus it could load y before x, obtaining r2 = 0 and r1 = 1. 1926 1927Second, when a lock-acquire reads from or is po-after a lock-release, 1928and some other stores W and W' occur po-before the lock-release and 1929po-after the lock-acquire respectively, the LKMM requires that W must 1930propagate to each CPU before W' does. For example, consider: 1931 1932 int x, y; 1933 spinlock_t s; 1934 1935 P0() 1936 { 1937 spin_lock(&s); 1938 WRITE_ONCE(x, 1); 1939 spin_unlock(&s); 1940 } 1941 1942 P1() 1943 { 1944 int r1; 1945 1946 spin_lock(&s); 1947 r1 = READ_ONCE(x); 1948 WRITE_ONCE(y, 1); 1949 spin_unlock(&s); 1950 } 1951 1952 P2() 1953 { 1954 int r2, r3; 1955 1956 r2 = READ_ONCE(y); 1957 smp_rmb(); 1958 r3 = READ_ONCE(x); 1959 } 1960 1961If r1 = 1 at the end then the spin_lock() in P1 must have read from 1962the spin_unlock() in P0. Hence the store to x must propagate to P2 1963before the store to y does, so we cannot have r2 = 1 and r3 = 0. But 1964if P1 had used a lock variable different from s, the writes could have 1965propagated in either order. 
(On the other hand, if the code in P0 and 1966P1 had all executed on a single CPU, as in the example before this 1967one, then the writes would have propagated in order even if the two 1968critical sections used different lock variables.) 1969 1970These two special requirements for lock-release and lock-acquire do 1971not arise from the operational model. Nevertheless, kernel developers 1972have come to expect and rely on them because they do hold on all 1973architectures supported by the Linux kernel, albeit for various 1974differing reasons. 1975 1976 1977PLAIN ACCESSES AND DATA RACES 1978----------------------------- 1979 1980In the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y), 1981smp_load_acquire(&z), and so on are collectively referred to as 1982"marked" accesses, because they are all annotated with special 1983operations of one kind or another. Ordinary C-language memory 1984accesses such as x or y = 0 are simply called "plain" accesses. 1985 1986Early versions of the LKMM had nothing to say about plain accesses. 1987The C standard allows compilers to assume that the variables affected 1988by plain accesses are not concurrently read or written by any other 1989threads or CPUs. This leaves compilers free to implement all manner 1990of transformations or optimizations of code containing plain accesses, 1991making such code very difficult for a memory model to handle. 1992 1993Here is just one example of a possible pitfall: 1994 1995 int a = 6; 1996 int *x = &a; 1997 1998 P0() 1999 { 2000 int *r1; 2001 int r2 = 0; 2002 2003 r1 = x; 2004 if (r1 != NULL) 2005 r2 = READ_ONCE(*r1); 2006 } 2007 2008 P1() 2009 { 2010 WRITE_ONCE(x, NULL); 2011 } 2012 2013On the face of it, one would expect that when this code runs, the only 2014possible final values for r2 are 6 and 0, depending on whether or not 2015P1's store to x propagates to P0 before P0's load from x executes. 2016But since P0's load from x is a plain access, the compiler may decide 2017to carry out the load twice (for the comparison against NULL, then again 2018for the READ_ONCE()) and eliminate the temporary variable r1. The 2019object code generated for P0 could therefore end up looking rather 2020like this: 2021 2022 P0() 2023 { 2024 int r2 = 0; 2025 2026 if (x != NULL) 2027 r2 = READ_ONCE(*x); 2028 } 2029 2030And now it is obvious that this code runs the risk of dereferencing a 2031NULL pointer, because P1's store to x might propagate to P0 after the 2032test against NULL has been made but before the READ_ONCE() executes. 2033If the original code had said "r1 = READ_ONCE(x)" instead of "r1 = x", 2034the compiler would not have performed this optimization and there 2035would be no possibility of a NULL-pointer dereference. 2036 2037Given the possibility of transformations like this one, the LKMM 2038doesn't try to predict all possible outcomes of code containing plain 2039accesses. It is instead content to determine whether the code 2040violates the compiler's assumptions, which would render the ultimate 2041outcome undefined. 2042 2043In technical terms, the compiler is allowed to assume that when the 2044program executes, there will not be any data races. A "data race" 2045occurs when there are two memory accesses such that: 2046 20471. they access the same location, 2048 20492. at least one of them is a store, 2050 20513. at least one of them is plain, 2052 20534. they occur on different CPUs (or in different threads on the 2054 same CPU), and 2055 20565. they execute concurrently. 
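
For example, here is about the simplest possible program containing a
data race (like the other examples in this document, it is only an
illustrative sketch, not a properly formatted litmus test):

	int x;

	P0()
	{
		x = 1;
	}

	P1()
	{
		int r1;

		r1 = x;
	}

The two accesses to x satisfy all five conditions: they access the
same location, one of them is a store, both are plain, they occur on
different CPUs, and nothing prevents them from executing at the same
time.  The LKMM therefore says this program contains a data race and
makes no prediction about the final value of r1.
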
2057 2058In the literature, two accesses are said to "conflict" if they satisfy 20591 and 2 above. We'll go a little farther and say that two accesses 2060are "race candidates" if they satisfy 1 - 4. Thus, whether or not two 2061race candidates actually do race in a given execution depends on 2062whether they are concurrent. 2063 2064The LKMM tries to determine whether a program contains race candidates 2065which may execute concurrently; if it does then the LKMM says there is 2066a potential data race and makes no predictions about the program's 2067outcome. 2068 2069Determining whether two accesses are race candidates is easy; you can 2070see that all the concepts involved in the definition above are already 2071part of the memory model. The hard part is telling whether they may 2072execute concurrently. The LKMM takes a conservative attitude, 2073assuming that accesses may be concurrent unless it can prove they 2074are not. 2075 2076If two memory accesses aren't concurrent then one must execute before 2077the other. Therefore the LKMM decides two accesses aren't concurrent 2078if they can be connected by a sequence of hb, pb, and rb links 2079(together referred to as xb, for "executes before"). However, there 2080are two complicating factors. 2081 2082If X is a load and X executes before a store Y, then indeed there is 2083no danger of X and Y being concurrent. After all, Y can't have any 2084effect on the value obtained by X until the memory subsystem has 2085propagated Y from its own CPU to X's CPU, which won't happen until 2086some time after Y executes and thus after X executes. But if X is a 2087store, then even if X executes before Y it is still possible that X 2088will propagate to Y's CPU just as Y is executing. In such a case X 2089could very well interfere somehow with Y, and we would have to 2090consider X and Y to be concurrent. 2091 2092Therefore when X is a store, for X and Y to be non-concurrent the LKMM 2093requires not only that X must execute before Y but also that X must 2094propagate to Y's CPU before Y executes. (Or vice versa, of course, if 2095Y executes before X -- then Y must propagate to X's CPU before X 2096executes if Y is a store.) This is expressed by the visibility 2097relation (vis), where X ->vis Y is defined to hold if there is an 2098intermediate event Z such that: 2099 2100 X is connected to Z by a possibly empty sequence of 2101 cumul-fence links followed by an optional rfe link (if none of 2102 these links are present, X and Z are the same event), 2103 2104and either: 2105 2106 Z is connected to Y by a strong-fence link followed by a 2107 possibly empty sequence of xb links, 2108 2109or: 2110 2111 Z is on the same CPU as Y and is connected to Y by a possibly 2112 empty sequence of xb links (again, if the sequence is empty it 2113 means Z and Y are the same event). 2114 2115The motivations behind this definition are straightforward: 2116 2117 cumul-fence memory barriers force stores that are po-before 2118 the barrier to propagate to other CPUs before stores that are 2119 po-after the barrier. 2120 2121 An rfe link from an event W to an event R says that R reads 2122 from W, which certainly means that W must have propagated to 2123 R's CPU before R executed. 2124 2125 strong-fence memory barriers force stores that are po-before 2126 the barrier, or that propagate to the barrier's CPU before the 2127 barrier executes, to propagate to all CPUs before any events 2128 po-after the barrier can execute. 
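
Written in the same arrow notation used earlier for the pb relation
(and glossing over some details of the formal definition), X ->vis Y
roughly requires events W, Z, and possibly V such that either:

	X ->cumul-fence* W ->rfe? Z ->strong-fence V ->xb* Y

or:

	X ->cumul-fence* W ->rfe? Z ->xb* Y,

where in the second form Z must be on the same CPU as Y.
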
2129 2130To see how this works out in practice, consider our old friend, the MP 2131pattern (with fences and statement labels, but without the conditional 2132test): 2133 2134 int buf = 0, flag = 0; 2135 2136 P0() 2137 { 2138 X: WRITE_ONCE(buf, 1); 2139 smp_wmb(); 2140 W: WRITE_ONCE(flag, 1); 2141 } 2142 2143 P1() 2144 { 2145 int r1; 2146 int r2 = 0; 2147 2148 Z: r1 = READ_ONCE(flag); 2149 smp_rmb(); 2150 Y: r2 = READ_ONCE(buf); 2151 } 2152 2153The smp_wmb() memory barrier gives a cumul-fence link from X to W, and 2154assuming r1 = 1 at the end, there is an rfe link from W to Z. This 2155means that the store to buf must propagate from P0 to P1 before Z 2156executes. Next, Z and Y are on the same CPU and the smp_rmb() fence 2157provides an xb link from Z to Y (i.e., it forces Z to execute before 2158Y). Therefore we have X ->vis Y: X must propagate to Y's CPU before Y 2159executes. 2160 2161The second complicating factor mentioned above arises from the fact 2162that when we are considering data races, some of the memory accesses 2163are plain. Now, although we have not said so explicitly, up to this 2164point most of the relations defined by the LKMM (ppo, hb, prop, 2165cumul-fence, pb, and so on -- including vis) apply only to marked 2166accesses. 2167 2168There are good reasons for this restriction. The compiler is not 2169allowed to apply fancy transformations to marked accesses, and 2170consequently each such access in the source code corresponds more or 2171less directly to a single machine instruction in the object code. But 2172plain accesses are a different story; the compiler may combine them, 2173split them up, duplicate them, eliminate them, invent new ones, and 2174who knows what else. Seeing a plain access in the source code tells 2175you almost nothing about what machine instructions will end up in the 2176object code. 2177 2178Fortunately, the compiler isn't completely free; it is subject to some 2179limitations. For one, it is not allowed to introduce a data race into 2180the object code if the source code does not already contain a data 2181race (if it could, memory models would be useless and no multithreaded 2182code would be safe!). For another, it cannot move a plain access past 2183a compiler barrier. 2184 2185A compiler barrier is a kind of fence, but as the name implies, it 2186only affects the compiler; it does not necessarily have any effect on 2187how instructions are executed by the CPU. In Linux kernel source 2188code, the barrier() function is a compiler barrier. It doesn't give 2189rise directly to any machine instructions in the object code; rather, 2190it affects how the compiler generates the rest of the object code. 2191Given source code like this: 2192 2193 ... some memory accesses ... 2194 barrier(); 2195 ... some other memory accesses ... 2196 2197the barrier() function ensures that the machine instructions 2198corresponding to the first group of accesses will all end po-before 2199any machine instructions corresponding to the second group of accesses 2200-- even if some of the accesses are plain. (Of course, the CPU may 2201then execute some of those accesses out of program order, but we 2202already know how to deal with such issues.) Without the barrier() 2203there would be no such guarantee; the two groups of accesses could be 2204intermingled or even reversed in the object code. 2205 2206The LKMM doesn't say much about the barrier() function, but it does 2207require that all fences are also compiler barriers. 
In addition, it 2208requires that the ordering properties of memory barriers such as 2209smp_rmb() or smp_store_release() apply to plain accesses as well as to 2210marked accesses. 2211 2212This is the key to analyzing data races. Consider the MP pattern 2213again, now using plain accesses for buf: 2214 2215 int buf = 0, flag = 0; 2216 2217 P0() 2218 { 2219 U: buf = 1; 2220 smp_wmb(); 2221 X: WRITE_ONCE(flag, 1); 2222 } 2223 2224 P1() 2225 { 2226 int r1; 2227 int r2 = 0; 2228 2229 Y: r1 = READ_ONCE(flag); 2230 if (r1) { 2231 smp_rmb(); 2232 V: r2 = buf; 2233 } 2234 } 2235 2236This program does not contain a data race. Although the U and V 2237accesses are race candidates, the LKMM can prove they are not 2238concurrent as follows: 2239 2240 The smp_wmb() fence in P0 is both a compiler barrier and a 2241 cumul-fence. It guarantees that no matter what hash of 2242 machine instructions the compiler generates for the plain 2243 access U, all those instructions will be po-before the fence. 2244 Consequently U's store to buf, no matter how it is carried out 2245 at the machine level, must propagate to P1 before X's store to 2246 flag does. 2247 2248 X and Y are both marked accesses. Hence an rfe link from X to 2249 Y is a valid indicator that X propagated to P1 before Y 2250 executed, i.e., X ->vis Y. (And if there is no rfe link then 2251 r1 will be 0, so V will not be executed and ipso facto won't 2252 race with U.) 2253 2254 The smp_rmb() fence in P1 is a compiler barrier as well as a 2255 fence. It guarantees that all the machine-level instructions 2256 corresponding to the access V will be po-after the fence, and 2257 therefore any loads among those instructions will execute 2258 after the fence does and hence after Y does. 2259 2260Thus U's store to buf is forced to propagate to P1 before V's load 2261executes (assuming V does execute), ruling out the possibility of a 2262data race between them. 2263 2264This analysis illustrates how the LKMM deals with plain accesses in 2265general. Suppose R is a plain load and we want to show that R 2266executes before some marked access E. We can do this by finding a 2267marked access X such that R and X are ordered by a suitable fence and 2268X ->xb* E. If E was also a plain access, we would also look for a 2269marked access Y such that X ->xb* Y, and Y and E are ordered by a 2270fence. We describe this arrangement by saying that R is 2271"post-bounded" by X and E is "pre-bounded" by Y. 2272 2273In fact, we go one step further: Since R is a read, we say that R is 2274"r-post-bounded" by X. Similarly, E would be "r-pre-bounded" or 2275"w-pre-bounded" by Y, depending on whether E was a store or a load. 2276This distinction is needed because some fences affect only loads 2277(i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise 2278the two types of bounds are the same. And as a degenerate case, we 2279say that a marked access pre-bounds and post-bounds itself (e.g., if R 2280above were a marked load then X could simply be taken to be R itself.) 2281 2282The need to distinguish between r- and w-bounding raises yet another 2283issue. When the source code contains a plain store, the compiler is 2284allowed to put plain loads of the same location into the object code. 2285For example, given the source code: 2286 2287 x = 1; 2288 2289the compiler is theoretically allowed to generate object code that 2290looks like: 2291 2292 if (x != 1) 2293 x = 1; 2294 2295thereby adding a load (and possibly replacing the store entirely). 
For this reason, whenever the LKMM requires a plain store to be
w-pre-bounded or w-post-bounded by a marked access, it also requires
the store to be r-pre-bounded or r-post-bounded, so as to handle cases
where the compiler adds a load.

(This may be overly cautious.  We don't know of any examples where a
compiler has augmented a store with a load in this fashion, and the
Linux kernel developers would probably fight pretty hard to change a
compiler if it ever did this.  Still, better safe than sorry.)

Incidentally, the other transformation -- augmenting a plain load by
adding in a store to the same location -- is not allowed.  This is
because the compiler cannot know whether any other CPUs might perform
a concurrent load from that location.  Two concurrent loads don't
constitute a race (they can't interfere with each other), but a store
does race with a concurrent load.  Thus adding a store might create a
data race where one was not already present in the source code,
something the compiler is forbidden to do.  Augmenting a store with a
load, on the other hand, is acceptable because doing so won't create a
data race unless one already existed.

The LKMM includes a second way to pre-bound plain accesses, in
addition to fences: an address dependency from a marked load.  That
is, in the sequence:

	p = READ_ONCE(ptr);
	r = *p;

the LKMM says that the marked load of ptr pre-bounds the plain load of
*p; the marked load must execute before any of the machine
instructions corresponding to the plain load.  This is a reasonable
stipulation, since after all, the CPU can't perform the load of *p
until it knows what value p will hold.  Furthermore, without some
assumption like this one, some usages typical of RCU would count as
data races.  For example:

	int a = 1, b;
	int *ptr = &a;

	P0()
	{
		b = 2;
		rcu_assign_pointer(ptr, &b);
	}

	P1()
	{
		int *p;
		int r;

		rcu_read_lock();
		p = rcu_dereference(ptr);
		r = *p;
		rcu_read_unlock();
	}

(In this example the rcu_read_lock() and rcu_read_unlock() calls don't
really do anything, because there aren't any grace periods.  They are
included merely for the sake of good form; typically P0 would call
synchronize_rcu() somewhere after the rcu_assign_pointer().)

rcu_assign_pointer() performs a store-release, so the plain store to b
is definitely w-post-bounded before the store to ptr, and the two
stores will propagate to P1 in that order.  However, rcu_dereference()
is only equivalent to READ_ONCE().  While it is a marked access, it is
not a fence or compiler barrier.  Hence the only guarantee we have
that the load of ptr in P1 is r-pre-bounded before the load of *p
(thus avoiding a race) is the assumption about address dependencies.

This is a situation where the compiler can undermine the memory model,
and a certain amount of care is required when programming constructs
like this one.  In particular, comparisons between the pointer and
other known addresses can cause trouble.
If you have something like: 2369 2370 p = rcu_dereference(ptr); 2371 if (p == &x) 2372 r = *p; 2373 2374then the compiler just might generate object code resembling: 2375 2376 p = rcu_dereference(ptr); 2377 if (p == &x) 2378 r = x; 2379 2380or even: 2381 2382 rtemp = x; 2383 p = rcu_dereference(ptr); 2384 if (p == &x) 2385 r = rtemp; 2386 2387which would invalidate the memory model's assumption, since the CPU 2388could now perform the load of x before the load of ptr (there might be 2389a control dependency but no address dependency at the machine level). 2390 2391Finally, it turns out there is a situation in which a plain write does 2392not need to be w-post-bounded: when it is separated from the other 2393race-candidate access by a fence. At first glance this may seem 2394impossible. After all, to be race candidates the two accesses must 2395be on different CPUs, and fences don't link events on different CPUs. 2396Well, normal fences don't -- but rcu-fence can! Here's an example: 2397 2398 int x, y; 2399 2400 P0() 2401 { 2402 WRITE_ONCE(x, 1); 2403 synchronize_rcu(); 2404 y = 3; 2405 } 2406 2407 P1() 2408 { 2409 rcu_read_lock(); 2410 if (READ_ONCE(x) == 0) 2411 y = 2; 2412 rcu_read_unlock(); 2413 } 2414 2415Do the plain stores to y race? Clearly not if P1 reads a non-zero 2416value for x, so let's assume the READ_ONCE(x) does obtain 0. This 2417means that the read-side critical section in P1 must finish executing 2418before the grace period in P0 does, because RCU's Grace-Period 2419Guarantee says that otherwise P0's store to x would have propagated to 2420P1 before the critical section started and so would have been visible 2421to the READ_ONCE(). (Another way of putting it is that the fre link 2422from the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link 2423between those two events.) 2424 2425This means there is an rcu-fence link from P1's "y = 2" store to P0's 2426"y = 3" store, and consequently the first must propagate from P1 to P0 2427before the second can execute. Therefore the two stores cannot be 2428concurrent and there is no race, even though P1's plain store to y 2429isn't w-post-bounded by any marked accesses. 2430 2431Putting all this material together yields the following picture. For 2432race-candidate stores W and W', where W ->co W', the LKMM says the 2433stores don't race if W can be linked to W' by a 2434 2435 w-post-bounded ; vis ; w-pre-bounded 2436 2437sequence. If W is plain then they also have to be linked by an 2438 2439 r-post-bounded ; xb* ; w-pre-bounded 2440 2441sequence, and if W' is plain then they also have to be linked by a 2442 2443 w-post-bounded ; vis ; r-pre-bounded 2444 2445sequence. For race-candidate load R and store W, the LKMM says the 2446two accesses don't race if R can be linked to W by an 2447 2448 r-post-bounded ; xb* ; w-pre-bounded 2449 2450sequence or if W can be linked to R by a 2451 2452 w-post-bounded ; vis ; r-pre-bounded 2453 2454sequence. For the cases involving a vis link, the LKMM also accepts 2455sequences in which W is linked to W' or R by a 2456 2457 strong-fence ; xb* ; {w and/or r}-pre-bounded 2458 2459sequence with no post-bounding, and in every case the LKMM also allows 2460the link simply to be a fence with no bounding at all. If no sequence 2461of the appropriate sort exists, the LKMM says that the accesses race. 2462 2463There is one more part of the LKMM related to plain accesses (although 2464not to data races) we should discuss. Recall that many relations such 2465as hb are limited to marked accesses only. 
As a result, the
happens-before, propagates-before, and rcu axioms (which state that
various relations must not contain cycles) don't apply to plain
accesses.  Nevertheless, we do want to rule out such cycles, because
they don't make sense even for plain accesses.

To this end, the LKMM imposes three extra restrictions, together
called the "plain-coherence" axiom because of their resemblance to the
rules used by the operational model to ensure cache coherence (that
is, the rules governing the memory subsystem's choice of a store to
satisfy a load request and its determination of where a store will
fall in the coherence order):

	If R and W are race candidates and it is possible to link R to
	W by one of the xb* sequences listed above, then W ->rfe R is
	not allowed (i.e., a load cannot read from a store that it
	executes before, even if one or both is plain).

	If W and R are race candidates and it is possible to link W to
	R by one of the vis sequences listed above, then R ->fre W is
	not allowed (i.e., if a store is visible to a load then the
	load must read from that store or one coherence-after it).

	If W and W' are race candidates and it is possible to link W
	to W' by one of the vis sequences listed above, then W' ->co W
	is not allowed (i.e., if one store is visible to a second then
	the second must come after the first in the coherence order).

This is the extent to which the LKMM deals with plain accesses.
Perhaps it could say more (for example, plain accesses might
contribute to the ppo relation), but at the moment it seems that this
minimal, conservative approach is good enough.


ODDS AND ENDS
-------------

This section covers material that didn't quite fit anywhere in the
earlier sections.

The descriptions in this document don't always match the formal
version of the LKMM exactly.  For example, the actual formal
definition of the prop relation makes the initial coe or fre part
optional, and it doesn't require the events linked by the relation to
be on the same CPU.  These differences are very unimportant; indeed,
instances where the coe/fre part of prop is missing are of no interest
because all the other parts (fences and rfe) are already included in
hb anyway, and where the formal model adds prop into hb, it includes
an explicit requirement that the events being linked are on the same
CPU.

Another minor difference has to do with events that are both memory
accesses and fences, such as those corresponding to smp_load_acquire()
calls.  In the formal model, these events aren't actually both reads
and fences; rather, they are read events with an annotation marking
them as acquires.  (Or write events annotated as releases, in the case
of smp_store_release().)  The final effect is the same.

Although we didn't mention it above, the instruction execution
ordering provided by the smp_rmb() fence doesn't apply to read events
that are part of a non-value-returning atomic update.  For instance,
given:

	atomic_inc(&x);
	smp_rmb();
	r1 = READ_ONCE(y);

it is not guaranteed that the load from y will execute after the
update to x.  This is because the ARMv8 architecture allows
non-value-returning atomic operations effectively to be executed off
the CPU.
Basically, the CPU tells the memory subsystem to increment 2536x, and then the increment is carried out by the memory hardware with 2537no further involvement from the CPU. Since the CPU doesn't ever read 2538the value of x, there is nothing for the smp_rmb() fence to act on. 2539 2540The LKMM defines a few extra synchronization operations in terms of 2541things we have already covered. In particular, rcu_dereference() is 2542treated as READ_ONCE() and rcu_assign_pointer() is treated as 2543smp_store_release() -- which is basically how the Linux kernel treats 2544them. 2545 2546Although we said that plain accesses are not linked by the ppo 2547relation, they do contribute to it indirectly. Namely, when there is 2548an address dependency from a marked load R to a plain store W, 2549followed by smp_wmb() and then a marked store W', the LKMM creates a 2550ppo link from R to W'. The reasoning behind this is perhaps a little 2551shaky, but essentially it says there is no way to generate object code 2552for this source code in which W' could execute before R. Just as with 2553pre-bounding by address dependencies, it is possible for the compiler 2554to undermine this relation if sufficient care is not taken. 2555 2556There are a few oddball fences which need special treatment: 2557smp_mb__before_atomic(), smp_mb__after_atomic(), and 2558smp_mb__after_spinlock(). The LKMM uses fence events with special 2559annotations for them; they act as strong fences just like smp_mb() 2560except for the sets of events that they order. Instead of ordering 2561all po-earlier events against all po-later events, as smp_mb() does, 2562they behave as follows: 2563 2564 smp_mb__before_atomic() orders all po-earlier events against 2565 po-later atomic updates and the events following them; 2566 2567 smp_mb__after_atomic() orders po-earlier atomic updates and 2568 the events preceding them against all po-later events; 2569 2570 smp_mb__after_spinlock() orders po-earlier lock acquisition 2571 events and the events preceding them against all po-later 2572 events. 2573 2574Interestingly, RCU and locking each introduce the possibility of 2575deadlock. When faced with code sequences such as: 2576 2577 spin_lock(&s); 2578 spin_lock(&s); 2579 spin_unlock(&s); 2580 spin_unlock(&s); 2581 2582or: 2583 2584 rcu_read_lock(); 2585 synchronize_rcu(); 2586 rcu_read_unlock(); 2587 2588what does the LKMM have to say? Answer: It says there are no allowed 2589executions at all, which makes sense. But this can also lead to 2590misleading results, because if a piece of code has multiple possible 2591executions, some of which deadlock, the model will report only on the 2592non-deadlocking executions. For example: 2593 2594 int x, y; 2595 2596 P0() 2597 { 2598 int r0; 2599 2600 WRITE_ONCE(x, 1); 2601 r0 = READ_ONCE(y); 2602 } 2603 2604 P1() 2605 { 2606 rcu_read_lock(); 2607 if (READ_ONCE(x) > 0) { 2608 WRITE_ONCE(y, 36); 2609 synchronize_rcu(); 2610 } 2611 rcu_read_unlock(); 2612 } 2613 2614Is it possible to end up with r0 = 36 at the end? The LKMM will tell 2615you it is not, but the model won't mention that this is because P1 2616will self-deadlock in the executions where it stores 36 in y. 2617
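
Finally, returning to the atomic_inc() example given in connection
with smp_rmb() above: if the load really does need to be ordered after
the atomic update, the natural fix (sketched here, not taken from any
litmus test in the formal model's test suite) is to use one of the
oddball fences instead of smp_rmb():

	atomic_inc(&x);
	smp_mb__after_atomic();
	r1 = READ_ONCE(y);

Since smp_mb__after_atomic() orders po-earlier atomic updates and the
events preceding them against all po-later events, the load from y is
now guaranteed to execute after the update to x.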