Explanation of the Linux-Kernel Memory Consistency Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:Author: Alan Stern <stern@rowland.harvard.edu>
:Created: October 2017

.. Contents

  1. INTRODUCTION
  2. BACKGROUND
  3. A SIMPLE EXAMPLE
  4. A SELECTION OF MEMORY MODELS
  5. ORDERING AND CYCLES
  6. EVENTS
  7. THE PROGRAM ORDER RELATION: po AND po-loc
  8. A WARNING
  9. DEPENDENCY RELATIONS: data, addr, and ctrl
  10. THE READS-FROM RELATION: rf, rfi, and rfe
  11. CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
  12. THE FROM-READS RELATION: fr, fri, and fre
  13. AN OPERATIONAL MODEL
  14. PROPAGATION ORDER RELATION: cumul-fence
  15. DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL
  16. SEQUENTIAL CONSISTENCY PER VARIABLE
  17. ATOMIC UPDATES: rmw
  18. THE PRESERVED PROGRAM ORDER RELATION: ppo
  19. AND THEN THERE WAS ALPHA
  20. THE HAPPENS-BEFORE RELATION: hb
  21. THE PROPAGATES-BEFORE RELATION: pb
  22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
  23. LOCKING
  24. PLAIN ACCESSES AND DATA RACES
  25. ODDS AND ENDS


INTRODUCTION
------------

The Linux-kernel memory consistency model (LKMM) is rather complex and obscure. This is particularly evident if you read through the linux-kernel.bell and linux-kernel.cat files that make up the formal version of the model; they are extremely terse and their meanings are far from clear.

This document describes the ideas underlying the LKMM. It is meant for people who want to understand how the model was designed. It does not go into the details of the code in the .bell and .cat files; rather, it explains in English what the code expresses symbolically.

Sections 2 (BACKGROUND) through 5 (ORDERING AND CYCLES) are aimed toward beginners; they explain what memory consistency models are and the basic notions shared by all such models. People already familiar with these concepts can skim or skip over them. Sections 6 (EVENTS) through 12 (THE FROM-READS RELATION) describe the fundamental relations used in many models. Starting in Section 13 (AN OPERATIONAL MODEL), the workings of the LKMM itself are covered.

Warning: The code examples in this document are not written in the proper format for litmus tests. They don't include a header line, the initializations are not enclosed in braces, the global variables are not passed by pointers, and they don't have an "exists" clause at the end. Converting them to the right format is left as an exercise for the reader.


BACKGROUND
----------

A memory consistency model (or just memory model, for short) is something which predicts, given a piece of computer code running on a particular kind of system, what values may be obtained by the code's load instructions. The LKMM makes these predictions for code running as part of the Linux kernel.

In practice, people tend to use memory models the other way around. That is, given a piece of code and a collection of values specified for the loads, the model will predict whether it is possible for the code to run in such a way that the loads will indeed obtain the specified values. Of course, this is just another way of expressing the same idea.

For code running on a uniprocessor system, the predictions are easy: Each load instruction must obtain the value written by the most recent store instruction accessing the same location (we ignore complicating factors such as DMA and mixed-size accesses.) But on multiprocessor systems, with multiple CPUs making concurrent accesses to shared memory locations, things aren't so simple.

Different architectures have differing memory models, and the Linux kernel supports a variety of architectures. The LKMM has to be fairly permissive, in the sense that any behavior allowed by one of these architectures also has to be allowed by the LKMM.


A SIMPLE EXAMPLE
----------------

Here is a simple example to illustrate the basic concepts. Consider some code running as part of a device driver for an input device. The driver might contain an interrupt handler which collects data from the device, stores it in a buffer, and sets a flag to indicate the buffer is full. Running concurrently on a different CPU might be a part of the driver code being executed by a process in the midst of a read(2) system call. This code tests the flag to see whether the buffer is ready, and if it is, copies the data back to userspace. The buffer and the flag are memory locations shared between the two CPUs.

We can abstract out the important pieces of the driver code as follows (the reason for using WRITE_ONCE() and READ_ONCE() instead of simple assignment statements is discussed later):

        int buf = 0, flag = 0;

        P0()
        {
                WRITE_ONCE(buf, 1);
                WRITE_ONCE(flag, 1);
        }

        P1()
        {
                int r1;
                int r2 = 0;

                r1 = READ_ONCE(flag);
                if (r1)
                        r2 = READ_ONCE(buf);
        }

Here the P0() function represents the interrupt handler running on one CPU and P1() represents the read() routine running on another. The value 1 stored in buf represents input data collected from the device. Thus, P0 stores the data in buf and then sets flag. Meanwhile, P1 reads flag into the private variable r1, and if it is set, reads the data from buf into a second private variable r2 for copying to userspace. (Presumably if flag is not set then the driver will wait a while and try again.)

This pattern of memory accesses, where one CPU stores values to two shared memory locations and another CPU loads from those locations in the opposite order, is widely known as the "Message Passing" or MP pattern. It is typical of memory access patterns in the kernel.

Please note that this example code is a simplified abstraction. Real buffers are usually larger than a single integer, real device drivers usually use sleep and wakeup mechanisms rather than polling for I/O completion, and real code generally doesn't bother to copy values into private variables before using them. All that is beside the point; the idea here is simply to illustrate the overall pattern of memory accesses by the CPUs.

A memory model will predict what values P1 might obtain for its loads from flag and buf, or equivalently, what values r1 and r2 might end up with after the code has finished running.

Some predictions are trivial. For instance, no sane memory model would predict that r1 = 42 or r2 = -7, because neither of those values ever gets stored in flag or buf.

Some nontrivial predictions are nonetheless quite simple.
For instance, P1 might run entirely before P0 begins, in which case r1 and r2 will both be 0 at the end. Or P0 might run entirely before P1 begins, in which case r1 and r2 will both be 1.

The interesting predictions concern what might happen when the two routines run concurrently. One possibility is that P1 runs after P0's store to buf but before the store to flag. In this case, r1 and r2 will again both be 0. (If P1 had been designed to read buf unconditionally then we would instead have r1 = 0 and r2 = 1.)

However, the most interesting possibility is where r1 = 1 and r2 = 0. If this were to occur it would mean the driver contains a bug, because incorrect data would get sent to the user: 0 instead of 1. As it happens, the LKMM does predict this outcome can occur, and the example driver code shown above is indeed buggy.


A SELECTION OF MEMORY MODELS
----------------------------

The first widely cited memory model, and the simplest to understand, is Sequential Consistency. According to this model, systems behave as if each CPU executed its instructions in order but with unspecified timing. In other words, the instructions from the various CPUs get interleaved in a nondeterministic way, always according to some single global order that agrees with the order of the instructions in the program source for each CPU. The model says that the value obtained by each load is simply the value written by the most recently executed store to the same memory location, from any CPU.

For the MP example code shown above, Sequential Consistency predicts that the undesired result r1 = 1, r2 = 0 cannot occur. The reasoning goes like this:

        Since r1 = 1, P0 must store 1 to flag before P1 loads 1 from it, as loads can obtain values only from earlier stores.

        P1 loads from flag before loading from buf, since CPUs execute their instructions in order.

        P1 must load 0 from buf before P0 stores 1 to it; otherwise r2 would be 1 since a load obtains its value from the most recent store to the same address.

        P0 stores 1 to buf before storing 1 to flag, since it executes its instructions in order.

        Since an instruction (in this case, P0's store to flag) cannot execute before itself, the specified outcome is impossible.

However, real computer hardware almost never follows the Sequential Consistency memory model; doing so would rule out too many valuable performance optimizations. On ARM and PowerPC architectures, for instance, the MP example code really does sometimes yield r1 = 1 and r2 = 0.

x86 and SPARC follow yet a different memory model: TSO (Total Store Ordering). This model predicts that the undesired outcome for the MP pattern cannot occur, but in other respects it differs from Sequential Consistency. One example is the Store Buffer (SB) pattern, in which each CPU stores to its own shared location and then loads from the other CPU's location:

        int x = 0, y = 0;

        P0()
        {
                int r0;

                WRITE_ONCE(x, 1);
                r0 = READ_ONCE(y);
        }

        P1()
        {
                int r1;

                WRITE_ONCE(y, 1);
                r1 = READ_ONCE(x);
        }

Sequential Consistency predicts that the outcome r0 = 0, r1 = 0 is impossible. (Exercise: Figure out the reasoning.) But TSO allows this outcome to occur, and in fact it does sometimes occur on x86 and SPARC systems.

The LKMM was inspired by the memory models followed by PowerPC, ARM, x86, Alpha, and other architectures. However, it is different in detail from each of them.


ORDERING AND CYCLES
-------------------

Memory models are all about ordering. Often this is temporal ordering (i.e., the order in which certain events occur) but it doesn't have to be; consider for example the order of instructions in a program's source code. We saw above that Sequential Consistency makes an important assumption that CPUs execute instructions in the same order as those instructions occur in the code, and there are many other instances of ordering playing central roles in memory models.

The counterpart to ordering is a cycle. Ordering rules out cycles: It's not possible to have X ordered before Y, Y ordered before Z, and Z ordered before X, because this would mean that X is ordered before itself. The analysis of the MP example under Sequential Consistency involved just such an impossible cycle:

        W: P0 stores 1 to flag   executes before
        X: P1 loads 1 from flag  executes before
        Y: P1 loads 0 from buf   executes before
        Z: P0 stores 1 to buf    executes before
        W: P0 stores 1 to flag.

In short, if a memory model requires certain accesses to be ordered, and a certain outcome for the loads in a piece of code can happen only if those accesses would form a cycle, then the memory model predicts that outcome cannot occur.

The LKMM is defined largely in terms of cycles, as we will see.


EVENTS
------

The LKMM does not work directly with the C statements that make up kernel source code. Instead it considers the effects of those statements in a more abstract form, namely, events. The model includes three types of events:

        Read events correspond to loads from shared memory, such as calls to READ_ONCE(), smp_load_acquire(), or rcu_dereference().

        Write events correspond to stores to shared memory, such as calls to WRITE_ONCE(), smp_store_release(), or atomic_set().

        Fence events correspond to memory barriers (also known as fences), such as calls to smp_rmb() or rcu_read_lock().

These categories are not exclusive; a read or write event can also be a fence. This happens with functions like smp_load_acquire() or spin_lock(). However, no single event can be both a read and a write. Atomic read-modify-write accesses, such as atomic_inc() or xchg(), correspond to a pair of events: a read followed by a write. (The write event is omitted for executions where it doesn't occur, such as a cmpxchg() where the comparison fails.)

Other parts of the code, those which do not involve interaction with shared memory, do not give rise to events. Thus, arithmetic and logical computations, control-flow instructions, or accesses to private memory or CPU registers are not of central interest to the memory model. They only affect the model's predictions indirectly. For example, an arithmetic computation might determine the value that gets stored to a shared memory location (or in the case of an array index, the address where the value gets stored), but the memory model is concerned only with the store itself -- its value and its address -- not the computation leading up to it.

Events in the LKMM can be linked by various relations, which we will describe in the following sections.
The memory model requires certain of these relations to be orderings, that is, it requires them not to have any cycles.


THE PROGRAM ORDER RELATION: po AND po-loc
-----------------------------------------

The most important relation between events is program order (po). You can think of it as the order in which statements occur in the source code after branches are taken into account and loops have been unrolled. A better description might be the order in which instructions are presented to a CPU's execution unit. Thus, we say that X is po-before Y (written as "X ->po Y" in formulas) if X occurs before Y in the instruction stream.

This is inherently a single-CPU relation; two instructions executing on different CPUs are never linked by po. Also, it is by definition an ordering so it cannot have any cycles.

po-loc is a sub-relation of po. It links two memory accesses when the first comes before the second in program order and they access the same memory location (the "-loc" suffix).

Although this may seem straightforward, there is one subtle aspect to program order we need to explain. The LKMM was inspired by low-level architectural memory models which describe the behavior of machine code, and it retains their outlook to a considerable extent. The read, write, and fence events used by the model are close in spirit to individual machine instructions. Nevertheless, the LKMM describes kernel code written in C, and the mapping from C to machine code can be extremely complex.

Optimizing compilers have great freedom in the way they translate source code to object code. They are allowed to apply transformations that add memory accesses, eliminate accesses, combine them, split them into pieces, or move them around. The use of READ_ONCE(), WRITE_ONCE(), or one of the other atomic or synchronization primitives prevents a large number of compiler optimizations. In particular, it is guaranteed that the compiler will not remove such accesses from the generated code (unless it can prove the accesses will never be executed), it will not change the order in which they occur in the code (within limits imposed by the C standard), and it will not introduce extraneous accesses.

The MP and SB examples above used READ_ONCE() and WRITE_ONCE() rather than ordinary memory accesses. Thanks to this usage, we can be certain that in the MP example, the compiler won't reorder P0's write event to buf and P0's write event to flag, and similarly for the other shared memory accesses in the examples.

Since private variables are not shared between CPUs, they can be accessed normally without READ_ONCE() or WRITE_ONCE(). In fact, they need not even be stored in normal memory at all -- in principle a private variable could be stored in a CPU register (hence the convention that these variables have names starting with the letter 'r').


A WARNING
---------

The protections provided by READ_ONCE(), WRITE_ONCE(), and others are not perfect; and under some circumstances it is possible for the compiler to undermine the memory model. Here is an example. Suppose both branches of an "if" statement store the same value to the same location:

        r1 = READ_ONCE(x);
        if (r1) {
                WRITE_ONCE(y, 2);
                ...  /* do something */
        } else {
                WRITE_ONCE(y, 2);
                ...  /* do something else */
        }

For this code, the LKMM predicts that the load from x will always be executed before either of the stores to y. However, a compiler could lift the stores out of the conditional, transforming the code into something resembling:

        r1 = READ_ONCE(x);
        WRITE_ONCE(y, 2);
        if (r1) {
                ...  /* do something */
        } else {
                ...  /* do something else */
        }

Given this version of the code, the LKMM would predict that the load from x could be executed after the store to y. Thus, the memory model's original prediction could be invalidated by the compiler.

Another issue arises from the fact that in C, arguments to many operators and function calls can be evaluated in any order. For example:

        r1 = f(5) + g(6);

The object code might call f(5) either before or after g(6); the memory model cannot assume there is a fixed program order relation between them. (In fact, if the function calls are inlined then the compiler might even interleave their object code.)


DEPENDENCY RELATIONS: data, addr, and ctrl
------------------------------------------

We say that two events are linked by a dependency relation when the execution of the second event depends in some way on a value obtained from memory by the first. The first event must be a read, and the value it obtains must somehow affect what the second event does. There are three kinds of dependencies: data, address (addr), and control (ctrl).

A read and a write event are linked by a data dependency if the value obtained by the read affects the value stored by the write. As a very simple example:

        int x, y;

        r1 = READ_ONCE(x);
        WRITE_ONCE(y, r1 + 5);

The value stored by the WRITE_ONCE obviously depends on the value loaded by the READ_ONCE. Such dependencies can wind through arbitrarily complicated computations, and a write can depend on the values of multiple reads.

A read event and another memory access event are linked by an address dependency if the value obtained by the read affects the location accessed by the other event. The second event can be either a read or a write. Here's another simple example:

        int a[20];
        int i;

        r1 = READ_ONCE(i);
        r2 = READ_ONCE(a[r1]);

Here the location accessed by the second READ_ONCE() depends on the index value loaded by the first. Pointer indirection also gives rise to address dependencies, since the address of a location accessed through a pointer will depend on the value read earlier from that pointer.

Finally, a read event X and a write event Y are linked by a control dependency if Y syntactically lies within an arm of an if statement and X affects the evaluation of the if condition via a data or address dependency (or similarly for a switch statement). Simple example:

        int x, y;

        r1 = READ_ONCE(x);
        if (r1)
                WRITE_ONCE(y, 1984);

Execution of the WRITE_ONCE() is controlled by a conditional expression which depends on the value obtained by the READ_ONCE(); hence there is a control dependency from the load to the store.

It should be pretty obvious that events can only depend on reads that come earlier in program order. Symbolically, if we have R ->data X, R ->addr X, or R ->ctrl X (where R is a read event), then we must also have R ->po X.
It wouldn't make sense for a computation to depend somehow on a value that doesn't get loaded from shared memory until later in the code!

Here's a trick question: When is a dependency not a dependency? Answer: When it is purely syntactic rather than semantic. We say a dependency between two accesses is purely syntactic if the second access doesn't actually depend on the result of the first. Here is a trivial example:

        r1 = READ_ONCE(x);
        WRITE_ONCE(y, r1 * 0);

There appears to be a data dependency from the load of x to the store of y, since the value to be stored is computed from the value that was loaded. But in fact, the value stored does not really depend on anything since it will always be 0. Thus the data dependency is only syntactic (it appears to exist in the code) but not semantic (the second access will always be the same, regardless of the value of the first access). Given code like this, a compiler could simply discard the value returned by the load from x, which would certainly destroy any dependency. (The compiler is not permitted to eliminate entirely the load generated for a READ_ONCE() -- that's one of the nice properties of READ_ONCE() -- but it is allowed to ignore the load's value.)

It's natural to object that no one in their right mind would write code like the above. However, macro expansions can easily give rise to this sort of thing, in ways that often are not apparent to the programmer.

Another mechanism that can lead to purely syntactic dependencies is related to the notion of "undefined behavior". Certain program behaviors are called "undefined" in the C language specification, which means that when they occur there are no guarantees at all about the outcome. Consider the following example:

        int a[1];
        int i;

        r1 = READ_ONCE(i);
        r2 = READ_ONCE(a[r1]);

Access beyond the end or before the beginning of an array is one kind of undefined behavior. Therefore the compiler doesn't have to worry about what will happen if r1 is nonzero, and it can assume that r1 will always be zero regardless of the value actually loaded from i. (If the assumption turns out to be wrong the resulting behavior will be undefined anyway, so the compiler doesn't care!) Thus the value from the load can be discarded, breaking the address dependency.

The LKMM is unaware that purely syntactic dependencies are different from semantic dependencies and therefore mistakenly predicts that the accesses in the two examples above will be ordered. This is another example of how the compiler can undermine the memory model. Be warned.


THE READS-FROM RELATION: rf, rfi, and rfe
-----------------------------------------

The reads-from relation (rf) links a write event to a read event when the value loaded by the read is the value that was stored by the write. In colloquial terms, the load "reads from" the store. We write W ->rf R to indicate that the load R reads from the store W. We further distinguish the cases where the load and the store occur on the same CPU (internal reads-from, or rfi) and where they occur on different CPUs (external reads-from, or rfe).

For our purposes, a memory location's initial value is treated as though it had been written there by an imaginary initial store that executes on a separate CPU before the main program runs.
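
As a quick illustration of these definitions, consider the following sketch, written in the same informal style (and with the same caveats) as the other examples in this document:

        int x = 0;

        P0()
        {
                int r1;

                WRITE_ONCE(x, 1);
                r1 = READ_ONCE(x);      /* same CPU as the store */
        }

        P1()
        {
                int r2;

                r2 = READ_ONCE(x);      /* different CPU from the store */
        }

If r1 = 1 then P0's store ->rfi P0's load, because the store and the load occur on the same CPU. If r2 = 1 then P0's store ->rfe P1's load, because they occur on different CPUs. And if r2 = 0 then the imaginary initial store to x ->rf P1's load.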

Usage of the rf relation implicitly assumes that loads will always read from a single store. It doesn't apply properly in the presence of load-tearing, where a load obtains some of its bits from one store and some of them from another store. Fortunately, use of READ_ONCE() and WRITE_ONCE() will prevent load-tearing; it's not possible to have:

        int x = 0;

        P0()
        {
                WRITE_ONCE(x, 0x1234);
        }

        P1()
        {
                int r1;

                r1 = READ_ONCE(x);
        }

and end up with r1 = 0x1200 (partly from x's initial value and partly from the value stored by P0).

On the other hand, load-tearing is unavoidable when mixed-size accesses are used. Consider this example:

        union {
                u32 w;
                u16 h[2];
        } x;

        P0()
        {
                WRITE_ONCE(x.h[0], 0x1234);
                WRITE_ONCE(x.h[1], 0x5678);
        }

        P1()
        {
                int r1;

                r1 = READ_ONCE(x.w);
        }

If r1 = 0x56781234 (little-endian!) at the end, then P1 must have read from both of P0's stores. It is possible to handle mixed-size and unaligned accesses in a memory model, but the LKMM currently does not attempt to do so. It requires all accesses to be properly aligned and of the location's actual size.


CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
------------------------------------------------------------------

Cache coherence is a general principle requiring that in a multi-processor system, the CPUs must share a consistent view of the memory contents. Specifically, it requires that for each location in shared memory, the stores to that location must form a single global ordering which all the CPUs agree on (the coherence order), and this ordering must be consistent with the program order for accesses to that location.

To put it another way, for any variable x, the coherence order (co) of the stores to x is simply the order in which the stores overwrite one another. The imaginary store which establishes x's initial value comes first in the coherence order; the store which directly overwrites the initial value comes second; the store which overwrites that value comes third, and so on.

You can think of the coherence order as being the order in which the stores reach x's location in memory (or if you prefer a more hardware-centric view, the order in which the stores get written to x's cache line). We write W ->co W' if W comes before W' in the coherence order, that is, if the value stored by W gets overwritten, directly or indirectly, by the value stored by W'.

Coherence order is required to be consistent with program order. This requirement takes the form of four coherency rules:

        Write-write coherence: If W ->po-loc W' (i.e., W comes before W' in program order and they access the same location), where W and W' are two stores, then W ->co W'.

        Write-read coherence: If W ->po-loc R, where W is a store and R is a load, then R must read from W or from some other store which comes after W in the coherence order.

        Read-write coherence: If R ->po-loc W, where R is a load and W is a store, then the store which R reads from must come before W in the coherence order.

        Read-read coherence: If R ->po-loc R', where R and R' are two loads, then either they read from the same store or else the store read by R comes before the store read by R' in the coherence order.
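
For instance, here is a small sketch of what a violation of the write-read rule would look like (the other three rules are illustrated with similar examples below):

        int x = 0;

        P0()
        {
                int r1;

                WRITE_ONCE(x, 7);
                r1 = READ_ONCE(x);      /* po-after the store of 7 */
        }

If r1 = 0 at the end, the write-read coherence rule has been violated: the load comes after the store of 7 in program order, so it must read either from that store or from some store coming later in the coherence order, never from x's initial value.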

This set of rules is sometimes referred to as sequential consistency per variable, because it means that the accesses to any single memory location obey the rules of the Sequential Consistency memory model. (According to Wikipedia, sequential consistency per variable and cache coherence mean the same thing except that cache coherence includes an extra requirement that every store eventually becomes visible to every CPU.)

Any reasonable memory model will include cache coherence. Indeed, our expectation of cache coherence is so deeply ingrained that violations of its requirements look more like hardware bugs than programming errors:

        int x;

        P0()
        {
                WRITE_ONCE(x, 17);
                WRITE_ONCE(x, 23);
        }

If the final value stored in x after this code ran was 17, you would think your computer was broken. It would be a violation of the write-write coherence rule: Since the store of 23 comes later in program order, it must also come later in x's coherence order and thus must overwrite the store of 17.

        int x = 0;

        P0()
        {
                int r1;

                r1 = READ_ONCE(x);
                WRITE_ONCE(x, 666);
        }

If r1 = 666 at the end, this would violate the read-write coherence rule: The READ_ONCE() load comes before the WRITE_ONCE() store in program order, so it must not read from that store but rather from one coming earlier in the coherence order (in this case, x's initial value).

        int x = 0;

        P0()
        {
                WRITE_ONCE(x, 5);
        }

        P1()
        {
                int r1, r2;

                r1 = READ_ONCE(x);
                r2 = READ_ONCE(x);
        }

If r1 = 5 (reading from P0's store) and r2 = 0 (reading from the imaginary store which establishes x's initial value) at the end, this would violate the read-read coherence rule: The r1 load comes before the r2 load in program order, so it must not read from a store that comes later in the coherence order.

(As a minor curiosity, if this code had used normal loads instead of READ_ONCE() in P1, on Itanium it sometimes could end up with r1 = 5 and r2 = 0! This results from parallel execution of the operations encoded in Itanium's Very-Long-Instruction-Word format, and it is yet another motivation for using READ_ONCE() when accessing shared memory locations.)

Just like the po relation, co is inherently an ordering -- it is not possible for a store to directly or indirectly overwrite itself! And just like with the rf relation, we distinguish between stores that occur on the same CPU (internal coherence order, or coi) and stores that occur on different CPUs (external coherence order, or coe).

On the other hand, stores to different memory locations are never related by co, just as instructions on different CPUs are never related by po. Coherence order is strictly per-location, or if you prefer, each location has its own independent coherence order.


THE FROM-READS RELATION: fr, fri, and fre
-----------------------------------------

The from-reads relation (fr) can be a little difficult for people to grok. It describes the situation where a load reads a value that gets overwritten by a store. In other words, we have R ->fr W when the value that R reads is overwritten (directly or indirectly) by W, or equivalently, when R reads from a store which comes earlier than W in the coherence order.

For example:

        int x = 0;

        P0()
        {
                int r1;

                r1 = READ_ONCE(x);
                WRITE_ONCE(x, 2);
        }

The value loaded from x will be 0 (assuming cache coherence!), and it gets overwritten by the value 2. Thus there is an fr link from the READ_ONCE() to the WRITE_ONCE(). If the code contained any later stores to x, there would also be fr links from the READ_ONCE() to them.

As with rf, rfi, and rfe, we subdivide the fr relation into fri (when the load and the store are on the same CPU) and fre (when they are on different CPUs).

Note that the fr relation is determined entirely by the rf and co relations; it is not independent. Given a read event R and a write event W for the same location, we will have R ->fr W if and only if the write which R reads from is co-before W. In symbols,

        (R ->fr W) := (there exists W' with W' ->rf R and W' ->co W).


AN OPERATIONAL MODEL
--------------------

The LKMM is based on various operational memory models, meaning that the models arise from an abstract view of how a computer system operates. Here are the main ideas, as incorporated into the LKMM.

The system as a whole is divided into the CPUs and a memory subsystem. The CPUs are responsible for executing instructions (not necessarily in program order), and they communicate with the memory subsystem. For the most part, executing an instruction requires a CPU to perform only internal operations. However, loads, stores, and fences involve more.

When CPU C executes a store instruction, it tells the memory subsystem to store a certain value at a certain location. The memory subsystem propagates the store to all the other CPUs as well as to RAM. (As a special case, we say that the store propagates to its own CPU at the time it is executed.) The memory subsystem also determines where the store falls in the location's coherence order. In particular, it must arrange for the store to be co-later than (i.e., to overwrite) any other store to the same location which has already propagated to CPU C.

When a CPU executes a load instruction R, it first checks to see whether there are any as-yet unexecuted store instructions, for the same location, that come before R in program order. If there are, it uses the value of the po-latest such store as the value obtained by R, and we say that the store's value is forwarded to R. Otherwise, the CPU asks the memory subsystem for the value to load and we say that R is satisfied from memory. The memory subsystem hands back the value of the co-latest store to the location in question which has already propagated to that CPU.

(In fact, the picture needs to be a little more complicated than this. CPUs have local caches, and propagating a store to a CPU really means propagating it to the CPU's local cache. A local cache can take some time to process the stores that it receives, and a store can't be used to satisfy one of the CPU's loads until it has been processed. On most architectures, the local caches process stores in First-In-First-Out order, and consequently the processing delay doesn't matter for the memory model. But on Alpha, the local caches have a partitioned design that results in non-FIFO behavior. We will discuss this in more detail later.)

Note that load instructions may be executed speculatively and may be restarted under certain circumstances.
The memory model ignores these premature executions; we simply say that the load executes at the final time it is forwarded or satisfied.

Executing a fence (or memory barrier) instruction doesn't require a CPU to do anything special other than informing the memory subsystem about the fence. However, fences do constrain the way CPUs and the memory subsystem handle other instructions, in two respects.

First, a fence forces the CPU to execute various instructions in program order. Exactly which instructions are ordered depends on the type of fence:

        Strong fences, including smp_mb() and synchronize_rcu(), force the CPU to execute all po-earlier instructions before any po-later instructions;

        smp_rmb() forces the CPU to execute all po-earlier loads before any po-later loads;

        smp_wmb() forces the CPU to execute all po-earlier stores before any po-later stores;

        Acquire fences, such as smp_load_acquire(), force the CPU to execute the load associated with the fence (e.g., the load part of an smp_load_acquire()) before any po-later instructions;

        Release fences, such as smp_store_release(), force the CPU to execute all po-earlier instructions before the store associated with the fence (e.g., the store part of an smp_store_release()).

Second, some types of fence affect the way the memory subsystem propagates stores. When a fence instruction is executed on CPU C:

        For each other CPU C', smp_wmb() forces all po-earlier stores on C to propagate to C' before any po-later stores do.

        For each other CPU C', any store which propagates to C before a release fence is executed (including all po-earlier stores executed on C) is forced to propagate to C' before the store associated with the release fence does.

        Any store which propagates to C before a strong fence is executed (including all po-earlier stores on C) is forced to propagate to all other CPUs before any instructions po-after the strong fence are executed on C.

The propagation ordering enforced by release fences and strong fences affects stores from other CPUs that propagate to CPU C before the fence is executed, as well as stores that are executed on C before the fence. We describe this property by saying that release fences and strong fences are A-cumulative. By contrast, smp_wmb() fences are not A-cumulative; they only affect the propagation of stores that are executed on C before the fence (i.e., those which precede the fence in program order).

rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu() fences have other properties which we discuss later.


PROPAGATION ORDER RELATION: cumul-fence
---------------------------------------

The fences which affect propagation order (i.e., strong, release, and smp_wmb() fences) are collectively referred to as cumul-fences, even though smp_wmb() isn't A-cumulative. The cumul-fence relation is defined to link memory access events E and F whenever:

        E and F are both stores on the same CPU and an smp_wmb() fence event occurs between them in program order; or

        F is a release fence and some X comes before F in program order, where either X = E or else E ->rf X; or

        A strong fence event occurs between some X and F in program order, where either X = E or else E ->rf X.
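
As a minimal sketch of the first case, take the writer side of the MP pattern from earlier, with an smp_wmb() placed between the two stores:

        int buf = 0, flag = 0;

        P0()
        {
                WRITE_ONCE(buf, 1);
                smp_wmb();              /* orders the two stores' propagation */
                WRITE_ONCE(flag, 1);
        }

Here P0's store to buf ->cumul-fence P0's store to flag, because both events are stores on the same CPU and the smp_wmb() comes between them in program order.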

The operational model requires that whenever W and W' are both stores and W ->cumul-fence W', then W must propagate to any given CPU before W' does. However, for different CPUs C and C', it does not require W to propagate to C before W' propagates to C'.


DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL
-------------------------------------------------

The LKMM is derived from the restrictions imposed by the design outlined above. These restrictions involve the necessity of maintaining cache coherence and the fact that a CPU can't operate on a value before it knows what that value is, among other things.

The formal version of the LKMM is defined by six requirements, or axioms:

        Sequential consistency per variable: This requires that the system obey the four coherency rules.

        Atomicity: This requires that atomic read-modify-write operations really are atomic, that is, no other stores can sneak into the middle of such an update.

        Happens-before: This requires that certain instructions are executed in a specific order.

        Propagation: This requires that certain stores propagate to CPUs and to RAM in a specific order.

        Rcu: This requires that RCU read-side critical sections and grace periods obey the rules of RCU, in particular, the Grace-Period Guarantee.

        Plain-coherence: This requires that plain memory accesses (those not using READ_ONCE(), WRITE_ONCE(), etc.) must obey the operational model's rules regarding cache coherence.

The first and second are quite common; they can be found in many memory models (such as those for C11/C++11). The "happens-before" and "propagation" axioms have analogs in other memory models as well. The "rcu" and "plain-coherence" axioms are specific to the LKMM.

Each of these axioms is discussed below.


SEQUENTIAL CONSISTENCY PER VARIABLE
-----------------------------------

According to the principle of cache coherence, the stores to any fixed shared location in memory form a global ordering. We can imagine inserting the loads from that location into this ordering, by placing each load between the store that it reads from and the following store. This leaves the relative positions of loads that read from the same store unspecified; let's say they are inserted in program order, first for CPU 0, then CPU 1, etc.

You can check that the four coherency rules imply that the rf, co, fr, and po-loc relations agree with this global ordering; in other words, whenever we have X ->rf Y or X ->co Y or X ->fr Y or X ->po-loc Y, the X event comes before the Y event in the global ordering. The LKMM's "coherence" axiom expresses this by requiring the union of these relations not to have any cycles. This means it must not be possible to find events

        X0 -> X1 -> X2 -> ... -> Xn -> X0,

where each of the links is either rf, co, fr, or po-loc. This has to hold if the accesses to the fixed memory location can be ordered as cache coherence demands.

Although it is not obvious, it can be shown that the converse is also true: This LKMM axiom implies that the four coherency rules are obeyed.


ATOMIC UPDATES: rmw
-------------------

What does it mean to say that a read-modify-write (rmw) update, such as atomic_inc(&x), is atomic?
It means that the memory location (x in this case) does not get altered between the read and the write events making up the atomic operation. In particular, if two CPUs perform atomic_inc(&x) concurrently, it must be guaranteed that the final value of x will be the initial value plus two. We should never have the following sequence of events:

        CPU 0 loads x obtaining 13;
        CPU 1 loads x obtaining 13;
        CPU 0 stores 14 to x;
        CPU 1 stores 14 to x;

where the final value of x is wrong (14 rather than 15).

In this example, CPU 0's increment effectively gets lost because it occurs in between CPU 1's load and store. To put it another way, the problem is that the position of CPU 0's store in x's coherence order is between the store that CPU 1 reads from and the store that CPU 1 performs.

The same analysis applies to all atomic update operations. Therefore, to enforce atomicity the LKMM requires that atomic updates follow this rule: Whenever R and W are the read and write events composing an atomic read-modify-write and W' is the write event which R reads from, there must not be any stores coming between W' and W in the coherence order. Equivalently,

        (R ->rmw W) implies (there is no X with R ->fr X and X ->co W),

where the rmw relation links the read and write events making up each atomic update. This is what the LKMM's "atomic" axiom says.


THE PRESERVED PROGRAM ORDER RELATION: ppo
-----------------------------------------

There are many situations where a CPU is obliged to execute two instructions in program order. We amalgamate them into the ppo (for "preserved program order") relation, which links the po-earlier instruction to the po-later instruction and is thus a sub-relation of po.

The operational model already includes a description of one such situation: Fences are a source of ppo links. Suppose X and Y are memory accesses with X ->po Y; then the CPU must execute X before Y if any of the following hold:

        A strong (smp_mb() or synchronize_rcu()) fence occurs between X and Y;

        X and Y are both stores and an smp_wmb() fence occurs between them;

        X and Y are both loads and an smp_rmb() fence occurs between them;

        X is also an acquire fence, such as smp_load_acquire();

        Y is also a release fence, such as smp_store_release().

Another possibility, not mentioned earlier but discussed in the next section, is:

        X and Y are both loads, X ->addr Y (i.e., there is an address dependency from X to Y), and X is a READ_ONCE() or an atomic access.

Dependencies can also cause instructions to be executed in program order. This is uncontroversial when the second instruction is a store; either a data, address, or control dependency from a load R to a store W will force the CPU to execute R before W. This is very simply because the CPU cannot tell the memory subsystem about W's store before it knows what value should be stored (in the case of a data dependency), what location it should be stored into (in the case of an address dependency), or whether the store should actually take place (in the case of a control dependency).

Dependencies to load instructions are more problematic. To begin with, there is no such thing as a data dependency to a load.
Next, a CPU has no reason to respect a control dependency to a load, because it can always satisfy the second load speculatively before the first, and then ignore the result if it turns out that the second load shouldn't be executed after all. And lastly, the real difficulties begin when we consider address dependencies to loads.

To be fair about it, all Linux-supported architectures do execute loads in program order if there is an address dependency between them. After all, a CPU cannot ask the memory subsystem to load a value from a particular location before it knows what that location is. However, the split-cache design used by Alpha can cause it to behave in a way that looks as if the loads were executed out of order (see the next section for more details). The kernel includes a workaround for this problem when the loads come from READ_ONCE(), and therefore the LKMM includes address dependencies to loads in the ppo relation.

On the other hand, dependencies can indirectly affect the ordering of two loads. This happens when there is a dependency from a load to a store and a second, po-later load reads from that store:

        R ->dep W ->rfi R',

where the dep link can be either an address or a data dependency. In this situation we know it is possible for the CPU to execute R' before W, because it can forward the value that W will store to R'. But it cannot execute R' before R, because it cannot forward the value before it knows what that value is, or that W and R' do access the same location. However, if there is merely a control dependency between R and W then the CPU can speculatively forward W to R' before executing R; if the speculation turns out to be wrong then the CPU merely has to restart or abandon R'.

(In theory, a CPU might forward a store to a load when it runs across an address dependency like this:

        r1 = READ_ONCE(ptr);
        WRITE_ONCE(*r1, 17);
        r2 = READ_ONCE(*r1);

because it could tell that the store and the second load access the same location even before it knows what the location's address is. However, none of the architectures supported by the Linux kernel do this.)

Two memory accesses of the same location must always be executed in program order if the second access is a store. Thus, if we have

        R ->po-loc W

(the po-loc link says that R comes before W in program order and they access the same location), the CPU is obliged to execute W after R. If it executed W first then the memory subsystem would respond to R's read request with the value stored by W (or an even later store), in violation of the read-write coherence rule. Similarly, if we had

        W ->po-loc W'

and the CPU executed W' before W, then the memory subsystem would put W' before W in the coherence order. It would effectively cause W to overwrite W', in violation of the write-write coherence rule. (Interestingly, an early ARMv8 memory model, now obsolete, proposed allowing out-of-order writes like this to occur. The model avoided violating the write-write coherence rule by requiring the CPU not to send the W write to the memory subsystem at all!)


AND THEN THERE WAS ALPHA
------------------------

As mentioned above, the Alpha architecture is unique in that it does not appear to respect address dependencies to loads.
This means that code such as the following:

        int x = 0;
        int y = -1;
        int *ptr = &y;

        P0()
        {
                WRITE_ONCE(x, 1);
                smp_wmb();
                WRITE_ONCE(ptr, &x);
        }

        P1()
        {
                int *r1;
                int r2;

                r1 = ptr;
                r2 = READ_ONCE(*r1);
        }

can malfunction on Alpha systems (notice that P1 uses an ordinary load to read ptr instead of READ_ONCE()). It is quite possible that r1 = &x and r2 = 0 at the end, in spite of the address dependency.

At first glance this doesn't seem to make sense. We know that the smp_wmb() forces P0's store to x to propagate to P1 before the store to ptr does. And since P1 can't execute its second load until it knows what location to load from, i.e., after executing its first load, the value x = 1 must have propagated to P1 before the second load executed. So why doesn't r2 end up equal to 1?

The answer lies in the Alpha's split local caches. Although the two stores do reach P1's local cache in the proper order, it can happen that the first store is processed by a busy part of the cache while the second store is processed by an idle part. As a result, the x = 1 value may not become available for P1's CPU to read until after the ptr = &x value does, leading to the undesirable result above. The final effect is that even though the two loads really are executed in program order, it appears that they aren't.

This could not have happened if the local cache had processed the incoming stores in FIFO order. By contrast, other architectures maintain at least the appearance of FIFO order.

In practice, this difficulty is solved by inserting a special fence between P1's two loads when the kernel is compiled for the Alpha architecture. In fact, as of version 4.15, the kernel automatically adds this fence after every READ_ONCE() and atomic load on Alpha. The effect of the fence is to cause the CPU not to execute any po-later instructions until after the local cache has finished processing all the stores it has already received. Thus, if the code was changed to:

        P1()
        {
                int *r1;
                int r2;

                r1 = READ_ONCE(ptr);
                r2 = READ_ONCE(*r1);
        }

then we would never get r1 = &x and r2 = 0. By the time P1 executed its second load, the x = 1 store would already be fully processed by the local cache and available for satisfying the read request. Thus we have yet another reason why shared data should always be read with READ_ONCE() or another synchronization primitive rather than accessed directly.

The LKMM requires that smp_rmb(), acquire fences, and strong fences share this property: They do not allow the CPU to execute any po-later instructions (or po-later loads in the case of smp_rmb()) until all outstanding stores have been processed by the local cache. In the case of a strong fence, the CPU first has to wait for all of its po-earlier stores to propagate to every other CPU in the system; then it has to wait for the local cache to process all the stores received as of that time -- not just the stores received when the strong fence began.

And of course, none of this matters for any architecture other than Alpha.


THE HAPPENS-BEFORE RELATION: hb
-------------------------------

The happens-before relation (hb) links memory accesses that have to execute in a certain order. hb includes the ppo relation and two others, one of which is rfe.

W ->rfe R implies that W and R are on different CPUs. It also means that W's store must have propagated to R's CPU before R executed; otherwise R could not have read the value stored by W. Therefore W must have executed before R, and so we have W ->hb R.

The equivalent fact need not hold if W ->rfi R (i.e., W and R are on the same CPU). As we have already seen, the operational model allows W's value to be forwarded to R in such cases, meaning that R may well execute before W does.

It's important to understand that neither coe nor fre is included in hb, despite their similarities to rfe. For example, suppose we have W ->coe W'. This means that W and W' are stores to the same location, they execute on different CPUs, and W comes before W' in the coherence order (i.e., W' overwrites W). Nevertheless, it is possible for W' to execute before W, because the decision as to which store overwrites the other is made later by the memory subsystem. When the stores are nearly simultaneous, either one can come out on top. Similarly, R ->fre W means that W overwrites the value which R reads, but it doesn't mean that W has to execute after R. All that's necessary is for the memory subsystem not to propagate W to R's CPU until after R has executed, which is possible if W executes shortly before R.

The third relation included in hb is like ppo, in that it only links events that are on the same CPU. However it is more difficult to explain, because it arises only indirectly from the requirement of cache coherence. The relation is called prop, and it links two events on CPU C in situations where a store from some other CPU comes after the first event in the coherence order and propagates to C before the second event executes.

This is best explained with some examples. The simplest case looks like this:

        int x;

        P0()
        {
                int r1;

                WRITE_ONCE(x, 1);
                r1 = READ_ONCE(x);
        }

        P1()
        {
                WRITE_ONCE(x, 8);
        }

If r1 = 8 at the end then P0's accesses must have executed in program order. We can deduce this from the operational model; if P0's load had executed before its store then the value of the store would have been forwarded to the load, so r1 would have ended up equal to 1, not 8. In this case there is a prop link from P0's write event to its read event, because P1's store came after P0's store in x's coherence order, and P1's store propagated to P0 before P0's load executed.

An equally simple case involves two loads of the same location that read from different stores:

        int x = 0;

        P0()
        {
                int r1, r2;

                r1 = READ_ONCE(x);
                r2 = READ_ONCE(x);
        }

        P1()
        {
                WRITE_ONCE(x, 9);
        }

If r1 = 0 and r2 = 9 at the end then P0's accesses must have executed in program order. If the second load had executed before the first then the x = 9 store must have been propagated to P0 before the first load executed, and so r1 would have been 9 rather than 0.
In this case there is a prop link from P0's first read event to its second, because P1's store overwrote the value read by P0's first load, and P1's store propagated to P0 before P0's second load executed.

Less trivial examples of prop all involve fences. Unlike the simple examples above, they can require that some instructions are executed out of program order. This next one should look familiar:

        int buf = 0, flag = 0;

        P0()
        {
                WRITE_ONCE(buf, 1);
                smp_wmb();
                WRITE_ONCE(flag, 1);
        }

        P1()
        {
                int r1;
                int r2;

                r1 = READ_ONCE(flag);
                r2 = READ_ONCE(buf);
        }

This is the MP pattern again, with an smp_wmb() fence between the two stores. If r1 = 1 and r2 = 0 at the end then there is a prop link from P1's second load to its first (backwards!). The reason is similar to the previous examples: The value P1 loads from buf gets overwritten by P0's store to buf, the fence guarantees that the store to buf will propagate to P1 before the store to flag does, and the store to flag propagates to P1 before P1 reads flag.

The prop link says that in order to obtain the r1 = 1, r2 = 0 result, P1 must execute its second load before the first. Indeed, if the load from flag were executed first, then the buf = 1 store would already have propagated to P1 by the time P1's load from buf executed, so r2 would have been 1 at the end, not 0. (The reasoning holds even for Alpha, although the details are more complicated and we will not go into them.)

But what if we put an smp_rmb() fence between P1's loads? The fence would force the two loads to be executed in program order, and it would generate a cycle in the hb relation: The fence would create a ppo link (hence an hb link) from the first load to the second, and the prop relation would give an hb link from the second load to the first. Since an instruction can't execute before itself, we are forced to conclude that if an smp_rmb() fence is added, the r1 = 1, r2 = 0 outcome is impossible -- as it should be.

The formal definition of the prop relation involves a coe or fre link, followed by an arbitrary number of cumul-fence links, ending with an rfe link. You can concoct more exotic examples, containing more than one fence, although this quickly leads to diminishing returns in terms of complexity. For instance, here's an example containing a coe link followed by two cumul-fences and an rfe link, utilizing the fact that release fences are A-cumulative:

        int x, y, z;

        P0()
        {
                int r0;

                WRITE_ONCE(x, 1);
                r0 = READ_ONCE(z);
        }

        P1()
        {
                WRITE_ONCE(x, 2);
                smp_wmb();
                WRITE_ONCE(y, 1);
        }

        P2()
        {
                int r2;

                r2 = READ_ONCE(y);
                smp_store_release(&z, 1);
        }

If x = 2, r0 = 1, and r2 = 1 after this code runs then there is a prop link from P0's store to its load.
This is because P0's store gets 1384overwritten by P1's store since x = 2 at the end (a coe link), the 1385smp_wmb() ensures that P1's store to x propagates to P2 before the 1386store to y does (the first cumul-fence), the store to y propagates to P2 1387before P2's load and store execute, P2's smp_store_release() 1388guarantees that the stores to x and y both propagate to P0 before the 1389store to z does (the second cumul-fence), and P0's load executes after the 1390store to z has propagated to P0 (an rfe link). 1391 1392In summary, the fact that the hb relation links memory access events 1393in the order they execute means that it must not have cycles. This 1394requirement is the content of the LKMM's "happens-before" axiom. 1395 1396The LKMM defines yet another relation connected to times of 1397instruction execution, but it is not included in hb. It relies on the 1398particular properties of strong fences, which we cover in the next 1399section. 1400 1401 1402THE PROPAGATES-BEFORE RELATION: pb 1403---------------------------------- 1404 1405The propagates-before (pb) relation capitalizes on the special 1406features of strong fences. It links two events E and F whenever some 1407store is coherence-later than E and propagates to every CPU and to RAM 1408before F executes. The formal definition requires that E be linked to 1409F via a coe or fre link, an arbitrary number of cumul-fences, an 1410optional rfe link, a strong fence, and an arbitrary number of hb 1411links. Let's see how this definition works out. 1412 1413Consider first the case where E is a store (implying that the sequence 1414of links begins with coe). Then there are events W, X, Y, and Z such 1415that: 1416 1417 E ->coe W ->cumul-fence* X ->rfe? Y ->strong-fence Z ->hb* F, 1418 1419where the * suffix indicates an arbitrary number of links of the 1420specified type, and the ? suffix indicates the link is optional (Y may 1421be equal to X). Because of the cumul-fence links, we know that W will 1422propagate to Y's CPU before X does, hence before Y executes and hence 1423before the strong fence executes. Because this fence is strong, we 1424know that W will propagate to every CPU and to RAM before Z executes. 1425And because of the hb links, we know that Z will execute before F. 1426Thus W, which comes later than E in the coherence order, will 1427propagate to every CPU and to RAM before F executes. 1428 1429The case where E is a load is exactly the same, except that the first 1430link in the sequence is fre instead of coe. 1431 1432The existence of a pb link from E to F implies that E must execute 1433before F. To see why, suppose that F executed first. Then W would 1434have propagated to E's CPU before E executed. If E was a store, the 1435memory subsystem would then be forced to make E come after W in the 1436coherence order, contradicting the fact that E ->coe W. If E was a 1437load, the memory subsystem would then be forced to satisfy E's read 1438request with the value stored by W or an even later store, 1439contradicting the fact that E ->fre W. 
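To make the store case concrete, here is a sketch in the same informal
style as the other examples in this document; the variable names and
the particular two-fence shape are merely illustrative choices, and
both flavors of pb link (coe-first and fre-first) appear in it:

        int x = 0, y = 0;

        P0()
        {
                WRITE_ONCE(x, 1);
                smp_mb();
                WRITE_ONCE(y, 1);
        }

        P1()
        {
                int r1;

                WRITE_ONCE(y, 2);
                smp_mb();
                r1 = READ_ONCE(x);
        }

Suppose y = 2 at the end, so that P0's store to y comes before P1's in
the coherence order (a coe link), and suppose r1 = 0.  The coe link
followed by P1's strong fence yields a pb link from P0's store to y to
P1's load of x.  Likewise, the fre link from P1's load of x to P0's
store to x, followed by P0's strong fence, yields a pb link from P1's
load of x back to P0's store to y.  Together these two pb links would
form a cycle, so the y = 2, r1 = 0 outcome is forbidden.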
Another good example illustrating how pb works is the SB pattern with
strong fences:

        int x = 0, y = 0;

        P0()
        {
                int r0;

                WRITE_ONCE(x, 1);
                smp_mb();
                r0 = READ_ONCE(y);
        }

        P1()
        {
                int r1;

                WRITE_ONCE(y, 1);
                smp_mb();
                r1 = READ_ONCE(x);
        }

If r0 = 0 at the end then there is a pb link from P0's load to P1's
load: an fre link from P0's load to P1's store (which overwrites the
value read by P0), and a strong fence between P1's store and its load.
In this example, the sequences of cumul-fence and hb links are empty.
Note that this pb link is not included in hb as an instance of prop,
because it does not start and end on the same CPU.

Similarly, if r1 = 0 at the end then there is a pb link from P1's load
to P0's.  This means that if both r0 and r1 were 0 there would be a
cycle in pb, which is not possible since an instruction cannot execute
before itself.  Thus, adding smp_mb() fences to the SB pattern
prevents the r0 = 0, r1 = 0 outcome.

In summary, the fact that the pb relation links events in the order
they execute means that it cannot have cycles.  This requirement is
the content of the LKMM's "propagation" axiom.


RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
------------------------------------------------------------------------

RCU (Read-Copy-Update) is a powerful synchronization mechanism.  It
rests on two concepts: grace periods and read-side critical sections.

A grace period is the span of time occupied by a call to
synchronize_rcu().  A read-side critical section (or just critical
section, for short) is a region of code delimited by rcu_read_lock()
at the start and rcu_read_unlock() at the end.  Critical sections can
be nested, although we won't make use of this fact.

As far as memory models are concerned, RCU's main feature is its
Grace-Period Guarantee, which states that a critical section can never
span a full grace period.  In more detail, the Guarantee says:

        For any critical section C and any grace period G, at least
        one of the following statements must hold:

(1)     C ends before G does, and in addition, every store that
        propagates to C's CPU before the end of C must propagate to
        every CPU before G ends.

(2)     G starts before C does, and in addition, every store that
        propagates to G's CPU before the start of G must propagate
        to every CPU before C starts.

In particular, it is not possible for a critical section to both start
before and end after a grace period.

Here is a simple example of RCU in action:

        int x, y;

        P0()
        {
                rcu_read_lock();
                WRITE_ONCE(x, 1);
                WRITE_ONCE(y, 1);
                rcu_read_unlock();
        }

        P1()
        {
                int r1, r2;

                r1 = READ_ONCE(x);
                synchronize_rcu();
                r2 = READ_ONCE(y);
        }

The Grace Period Guarantee tells us that when this code runs, it will
never end with r1 = 1 and r2 = 0.  The reasoning is as follows.  r1 = 1
means that P0's store to x propagated to P1 before P1 called
synchronize_rcu(), so P0's critical section must have started before
P1's grace period, contrary to part (2) of the Guarantee.
On the 1538other hand, r2 = 0 means that P0's store to y, which occurs before the 1539end of the critical section, did not propagate to P1 before the end of 1540the grace period, contrary to part (1). Together the results violate 1541the Guarantee. 1542 1543In the kernel's implementations of RCU, the requirements for stores 1544to propagate to every CPU are fulfilled by placing strong fences at 1545suitable places in the RCU-related code. Thus, if a critical section 1546starts before a grace period does then the critical section's CPU will 1547execute an smp_mb() fence after the end of the critical section and 1548some time before the grace period's synchronize_rcu() call returns. 1549And if a critical section ends after a grace period does then the 1550synchronize_rcu() routine will execute an smp_mb() fence at its start 1551and some time before the critical section's opening rcu_read_lock() 1552executes. 1553 1554What exactly do we mean by saying that a critical section "starts 1555before" or "ends after" a grace period? Some aspects of the meaning 1556are pretty obvious, as in the example above, but the details aren't 1557entirely clear. The LKMM formalizes this notion by means of the 1558rcu-link relation. rcu-link encompasses a very general notion of 1559"before": If E and F are RCU fence events (i.e., rcu_read_lock(), 1560rcu_read_unlock(), or synchronize_rcu()) then among other things, 1561E ->rcu-link F includes cases where E is po-before some memory-access 1562event X, F is po-after some memory-access event Y, and we have any of 1563X ->rfe Y, X ->co Y, or X ->fr Y. 1564 1565The formal definition of the rcu-link relation is more than a little 1566obscure, and we won't give it here. It is closely related to the pb 1567relation, and the details don't matter unless you want to comb through 1568a somewhat lengthy formal proof. Pretty much all you need to know 1569about rcu-link is the information in the preceding paragraph. 1570 1571The LKMM also defines the rcu-gp and rcu-rscsi relations. They bring 1572grace periods and read-side critical sections into the picture, in the 1573following way: 1574 1575 E ->rcu-gp F means that E and F are in fact the same event, 1576 and that event is a synchronize_rcu() fence (i.e., a grace 1577 period). 1578 1579 E ->rcu-rscsi F means that E and F are the rcu_read_unlock() 1580 and rcu_read_lock() fence events delimiting some read-side 1581 critical section. (The 'i' at the end of the name emphasizes 1582 that this relation is "inverted": It links the end of the 1583 critical section to the start.) 1584 1585If we think of the rcu-link relation as standing for an extended 1586"before", then X ->rcu-gp Y ->rcu-link Z roughly says that X is a 1587grace period which ends before Z begins. (In fact it covers more than 1588this, because it also includes cases where some store propagates to 1589Z's CPU before Z begins but doesn't propagate to some other CPU until 1590after X ends.) Similarly, X ->rcu-rscsi Y ->rcu-link Z says that X is 1591the end of a critical section which starts before Z begins. 1592 1593The LKMM goes on to define the rcu-order relation as a sequence of 1594rcu-gp and rcu-rscsi links separated by rcu-link links, in which the 1595number of rcu-gp links is >= the number of rcu-rscsi links. For 1596example: 1597 1598 X ->rcu-gp Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V 1599 1600would imply that X ->rcu-order V, because this sequence contains two 1601rcu-gp links and one rcu-rscsi link. 
(It also implies that 1602X ->rcu-order T and Z ->rcu-order V.) On the other hand: 1603 1604 X ->rcu-rscsi Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V 1605 1606does not imply X ->rcu-order V, because the sequence contains only 1607one rcu-gp link but two rcu-rscsi links. 1608 1609The rcu-order relation is important because the Grace Period Guarantee 1610means that rcu-order links act kind of like strong fences. In 1611particular, E ->rcu-order F implies not only that E begins before F 1612ends, but also that any write po-before E will propagate to every CPU 1613before any instruction po-after F can execute. (However, it does not 1614imply that E must execute before F; in fact, each synchronize_rcu() 1615fence event is linked to itself by rcu-order as a degenerate case.) 1616 1617To prove this in full generality requires some intellectual effort. 1618We'll consider just a very simple case: 1619 1620 G ->rcu-gp W ->rcu-link Z ->rcu-rscsi F. 1621 1622This formula means that G and W are the same event (a grace period), 1623and there are events X, Y and a read-side critical section C such that: 1624 1625 1. G = W is po-before or equal to X; 1626 1627 2. X comes "before" Y in some sense (including rfe, co and fr); 1628 1629 3. Y is po-before Z; 1630 1631 4. Z is the rcu_read_unlock() event marking the end of C; 1632 1633 5. F is the rcu_read_lock() event marking the start of C. 1634 1635From 1 - 4 we deduce that the grace period G ends before the critical 1636section C. Then part (2) of the Grace Period Guarantee says not only 1637that G starts before C does, but also that any write which executes on 1638G's CPU before G starts must propagate to every CPU before C starts. 1639In particular, the write propagates to every CPU before F finishes 1640executing and hence before any instruction po-after F can execute. 1641This sort of reasoning can be extended to handle all the situations 1642covered by rcu-order. 1643 1644The rcu-fence relation is a simple extension of rcu-order. While 1645rcu-order only links certain fence events (calls to synchronize_rcu(), 1646rcu_read_lock(), or rcu_read_unlock()), rcu-fence links any events 1647that are separated by an rcu-order link. This is analogous to the way 1648the strong-fence relation links events that are separated by an 1649smp_mb() fence event (as mentioned above, rcu-order links act kind of 1650like strong fences). Written symbolically, X ->rcu-fence Y means 1651there are fence events E and F such that: 1652 1653 X ->po E ->rcu-order F ->po Y. 1654 1655From the discussion above, we see this implies not only that X 1656executes before Y, but also (if X is a store) that X propagates to 1657every CPU before Y executes. Thus rcu-fence is sort of a 1658"super-strong" fence: Unlike the original strong fences (smp_mb() and 1659synchronize_rcu()), rcu-fence is able to link events on different 1660CPUs. (Perhaps this fact should lead us to say that rcu-fence isn't 1661really a fence at all!) 1662 1663Finally, the LKMM defines the RCU-before (rb) relation in terms of 1664rcu-fence. This is done in essentially the same way as the pb 1665relation was defined in terms of strong-fence. We will omit the 1666details; the end result is that E ->rb F implies E must execute 1667before F, just as E ->pb F does (and for much the same reasons). 1668 1669Putting this all together, the LKMM expresses the Grace Period 1670Guarantee by requiring that the rb relation does not contain a cycle. 
1671Equivalently, this "rcu" axiom requires that there are no events E 1672and F with E ->rcu-link F ->rcu-order E. Or to put it a third way, 1673the axiom requires that there are no cycles consisting of rcu-gp and 1674rcu-rscsi alternating with rcu-link, where the number of rcu-gp links 1675is >= the number of rcu-rscsi links. 1676 1677Justifying the axiom isn't easy, but it is in fact a valid 1678formalization of the Grace Period Guarantee. We won't attempt to go 1679through the detailed argument, but the following analysis gives a 1680taste of what is involved. Suppose both parts of the Guarantee are 1681violated: A critical section starts before a grace period, and some 1682store propagates to the critical section's CPU before the end of the 1683critical section but doesn't propagate to some other CPU until after 1684the end of the grace period. 1685 1686Putting symbols to these ideas, let L and U be the rcu_read_lock() and 1687rcu_read_unlock() fence events delimiting the critical section in 1688question, and let S be the synchronize_rcu() fence event for the grace 1689period. Saying that the critical section starts before S means there 1690are events Q and R where Q is po-after L (which marks the start of the 1691critical section), Q is "before" R in the sense used by the rcu-link 1692relation, and R is po-before the grace period S. Thus we have: 1693 1694 L ->rcu-link S. 1695 1696Let W be the store mentioned above, let Y come before the end of the 1697critical section and witness that W propagates to the critical 1698section's CPU by reading from W, and let Z on some arbitrary CPU be a 1699witness that W has not propagated to that CPU, where Z happens after 1700some event X which is po-after S. Symbolically, this amounts to: 1701 1702 S ->po X ->hb* Z ->fr W ->rf Y ->po U. 1703 1704The fr link from Z to W indicates that W has not propagated to Z's CPU 1705at the time that Z executes. From this, it can be shown (see the 1706discussion of the rcu-link relation earlier) that S and U are related 1707by rcu-link: 1708 1709 S ->rcu-link U. 1710 1711Since S is a grace period we have S ->rcu-gp S, and since L and U are 1712the start and end of the critical section C we have U ->rcu-rscsi L. 1713From this we obtain: 1714 1715 S ->rcu-gp S ->rcu-link U ->rcu-rscsi L ->rcu-link S, 1716 1717a forbidden cycle. Thus the "rcu" axiom rules out this violation of 1718the Grace Period Guarantee. 1719 1720For something a little more down-to-earth, let's see how the axiom 1721works out in practice. Consider the RCU code example from above, this 1722time with statement labels added: 1723 1724 int x, y; 1725 1726 P0() 1727 { 1728 L: rcu_read_lock(); 1729 X: WRITE_ONCE(x, 1); 1730 Y: WRITE_ONCE(y, 1); 1731 U: rcu_read_unlock(); 1732 } 1733 1734 P1() 1735 { 1736 int r1, r2; 1737 1738 Z: r1 = READ_ONCE(x); 1739 S: synchronize_rcu(); 1740 W: r2 = READ_ONCE(y); 1741 } 1742 1743 1744If r2 = 0 at the end then P0's store at Y overwrites the value that 1745P1's load at W reads from, so we have W ->fre Y. Since S ->po W and 1746also Y ->po U, we get S ->rcu-link U. In addition, S ->rcu-gp S 1747because S is a grace period. 1748 1749If r1 = 1 at the end then P1's load at Z reads from P0's store at X, 1750so we have X ->rfe Z. Together with L ->po X and Z ->po S, this 1751yields L ->rcu-link S. And since L and U are the start and end of a 1752critical section, we have U ->rcu-rscsi L. 1753 1754Then U ->rcu-rscsi L ->rcu-link S ->rcu-gp S ->rcu-link U is a 1755forbidden cycle, violating the "rcu" axiom. 
Hence the outcome is not 1756allowed by the LKMM, as we would expect. 1757 1758For contrast, let's see what can happen in a more complicated example: 1759 1760 int x, y, z; 1761 1762 P0() 1763 { 1764 int r0; 1765 1766 L0: rcu_read_lock(); 1767 r0 = READ_ONCE(x); 1768 WRITE_ONCE(y, 1); 1769 U0: rcu_read_unlock(); 1770 } 1771 1772 P1() 1773 { 1774 int r1; 1775 1776 r1 = READ_ONCE(y); 1777 S1: synchronize_rcu(); 1778 WRITE_ONCE(z, 1); 1779 } 1780 1781 P2() 1782 { 1783 int r2; 1784 1785 L2: rcu_read_lock(); 1786 r2 = READ_ONCE(z); 1787 WRITE_ONCE(x, 1); 1788 U2: rcu_read_unlock(); 1789 } 1790 1791If r0 = r1 = r2 = 1 at the end, then similar reasoning to before shows 1792that U0 ->rcu-rscsi L0 ->rcu-link S1 ->rcu-gp S1 ->rcu-link U2 ->rcu-rscsi 1793L2 ->rcu-link U0. However this cycle is not forbidden, because the 1794sequence of relations contains fewer instances of rcu-gp (one) than of 1795rcu-rscsi (two). Consequently the outcome is allowed by the LKMM. 1796The following instruction timing diagram shows how it might actually 1797occur: 1798 1799P0 P1 P2 1800-------------------- -------------------- -------------------- 1801rcu_read_lock() 1802WRITE_ONCE(y, 1) 1803 r1 = READ_ONCE(y) 1804 synchronize_rcu() starts 1805 . rcu_read_lock() 1806 . WRITE_ONCE(x, 1) 1807r0 = READ_ONCE(x) . 1808rcu_read_unlock() . 1809 synchronize_rcu() ends 1810 WRITE_ONCE(z, 1) 1811 r2 = READ_ONCE(z) 1812 rcu_read_unlock() 1813 1814This requires P0 and P2 to execute their loads and stores out of 1815program order, but of course they are allowed to do so. And as you 1816can see, the Grace Period Guarantee is not violated: The critical 1817section in P0 both starts before P1's grace period does and ends 1818before it does, and the critical section in P2 both starts after P1's 1819grace period does and ends after it does. 1820 1821Addendum: The LKMM now supports SRCU (Sleepable Read-Copy-Update) in 1822addition to normal RCU. The ideas involved are much the same as 1823above, with new relations srcu-gp and srcu-rscsi added to represent 1824SRCU grace periods and read-side critical sections. There is a 1825restriction on the srcu-gp and srcu-rscsi links that can appear in an 1826rcu-order sequence (the srcu-rscsi links must be paired with srcu-gp 1827links having the same SRCU domain with proper nesting); the details 1828are relatively unimportant. 1829 1830 1831LOCKING 1832------- 1833 1834The LKMM includes locking. In fact, there is special code for locking 1835in the formal model, added in order to make tools run faster. 1836However, this special code is intended to be more or less equivalent 1837to concepts we have already covered. A spinlock_t variable is treated 1838the same as an int, and spin_lock(&s) is treated almost the same as: 1839 1840 while (cmpxchg_acquire(&s, 0, 1) != 0) 1841 cpu_relax(); 1842 1843This waits until s is equal to 0 and then atomically sets it to 1, 1844and the read part of the cmpxchg operation acts as an acquire fence. 1845An alternate way to express the same thing would be: 1846 1847 r = xchg_acquire(&s, 1); 1848 1849along with a requirement that at the end, r = 0. Similarly, 1850spin_trylock(&s) is treated almost the same as: 1851 1852 return !cmpxchg_acquire(&s, 0, 1); 1853 1854which atomically sets s to 1 if it is currently equal to 0 and returns 1855true if it succeeds (the read part of the cmpxchg operation acts as an 1856acquire fence only if the operation is successful). 
spin_unlock(&s) 1857is treated almost the same as: 1858 1859 smp_store_release(&s, 0); 1860 1861The "almost" qualifiers above need some explanation. In the LKMM, the 1862store-release in a spin_unlock() and the load-acquire which forms the 1863first half of the atomic rmw update in a spin_lock() or a successful 1864spin_trylock() -- we can call these things lock-releases and 1865lock-acquires -- have two properties beyond those of ordinary releases 1866and acquires. 1867 1868First, when a lock-acquire reads from or is po-after a lock-release, 1869the LKMM requires that every instruction po-before the lock-release 1870must execute before any instruction po-after the lock-acquire. This 1871would naturally hold if the release and acquire operations were on 1872different CPUs and accessed the same lock variable, but the LKMM says 1873it also holds when they are on the same CPU, even if they access 1874different lock variables. For example: 1875 1876 int x, y; 1877 spinlock_t s, t; 1878 1879 P0() 1880 { 1881 int r1, r2; 1882 1883 spin_lock(&s); 1884 r1 = READ_ONCE(x); 1885 spin_unlock(&s); 1886 spin_lock(&t); 1887 r2 = READ_ONCE(y); 1888 spin_unlock(&t); 1889 } 1890 1891 P1() 1892 { 1893 WRITE_ONCE(y, 1); 1894 smp_wmb(); 1895 WRITE_ONCE(x, 1); 1896 } 1897 1898Here the second spin_lock() is po-after the first spin_unlock(), and 1899therefore the load of x must execute before the load of y, even though 1900the two locking operations use different locks. Thus we cannot have 1901r1 = 1 and r2 = 0 at the end (this is an instance of the MP pattern). 1902 1903This requirement does not apply to ordinary release and acquire 1904fences, only to lock-related operations. For instance, suppose P0() 1905in the example had been written as: 1906 1907 P0() 1908 { 1909 int r1, r2, r3; 1910 1911 r1 = READ_ONCE(x); 1912 smp_store_release(&s, 1); 1913 r3 = smp_load_acquire(&s); 1914 r2 = READ_ONCE(y); 1915 } 1916 1917Then the CPU would be allowed to forward the s = 1 value from the 1918smp_store_release() to the smp_load_acquire(), executing the 1919instructions in the following order: 1920 1921 r3 = smp_load_acquire(&s); // Obtains r3 = 1 1922 r2 = READ_ONCE(y); 1923 r1 = READ_ONCE(x); 1924 smp_store_release(&s, 1); // Value is forwarded 1925 1926and thus it could load y before x, obtaining r2 = 0 and r1 = 1. 1927 1928Second, when a lock-acquire reads from or is po-after a lock-release, 1929and some other stores W and W' occur po-before the lock-release and 1930po-after the lock-acquire respectively, the LKMM requires that W must 1931propagate to each CPU before W' does. For example, consider: 1932 1933 int x, y; 1934 spinlock_t s; 1935 1936 P0() 1937 { 1938 spin_lock(&s); 1939 WRITE_ONCE(x, 1); 1940 spin_unlock(&s); 1941 } 1942 1943 P1() 1944 { 1945 int r1; 1946 1947 spin_lock(&s); 1948 r1 = READ_ONCE(x); 1949 WRITE_ONCE(y, 1); 1950 spin_unlock(&s); 1951 } 1952 1953 P2() 1954 { 1955 int r2, r3; 1956 1957 r2 = READ_ONCE(y); 1958 smp_rmb(); 1959 r3 = READ_ONCE(x); 1960 } 1961 1962If r1 = 1 at the end then the spin_lock() in P1 must have read from 1963the spin_unlock() in P0. Hence the store to x must propagate to P2 1964before the store to y does, so we cannot have r2 = 1 and r3 = 0. But 1965if P1 had used a lock variable different from s, the writes could have 1966propagated in either order. 
(On the other hand, if the code in P0 and 1967P1 had all executed on a single CPU, as in the example before this 1968one, then the writes would have propagated in order even if the two 1969critical sections used different lock variables.) 1970 1971These two special requirements for lock-release and lock-acquire do 1972not arise from the operational model. Nevertheless, kernel developers 1973have come to expect and rely on them because they do hold on all 1974architectures supported by the Linux kernel, albeit for various 1975differing reasons. 1976 1977 1978PLAIN ACCESSES AND DATA RACES 1979----------------------------- 1980 1981In the LKMM, memory accesses such as READ_ONCE(x), atomic_inc(&y), 1982smp_load_acquire(&z), and so on are collectively referred to as 1983"marked" accesses, because they are all annotated with special 1984operations of one kind or another. Ordinary C-language memory 1985accesses such as x or y = 0 are simply called "plain" accesses. 1986 1987Early versions of the LKMM had nothing to say about plain accesses. 1988The C standard allows compilers to assume that the variables affected 1989by plain accesses are not concurrently read or written by any other 1990threads or CPUs. This leaves compilers free to implement all manner 1991of transformations or optimizations of code containing plain accesses, 1992making such code very difficult for a memory model to handle. 1993 1994Here is just one example of a possible pitfall: 1995 1996 int a = 6; 1997 int *x = &a; 1998 1999 P0() 2000 { 2001 int *r1; 2002 int r2 = 0; 2003 2004 r1 = x; 2005 if (r1 != NULL) 2006 r2 = READ_ONCE(*r1); 2007 } 2008 2009 P1() 2010 { 2011 WRITE_ONCE(x, NULL); 2012 } 2013 2014On the face of it, one would expect that when this code runs, the only 2015possible final values for r2 are 6 and 0, depending on whether or not 2016P1's store to x propagates to P0 before P0's load from x executes. 2017But since P0's load from x is a plain access, the compiler may decide 2018to carry out the load twice (for the comparison against NULL, then again 2019for the READ_ONCE()) and eliminate the temporary variable r1. The 2020object code generated for P0 could therefore end up looking rather 2021like this: 2022 2023 P0() 2024 { 2025 int r2 = 0; 2026 2027 if (x != NULL) 2028 r2 = READ_ONCE(*x); 2029 } 2030 2031And now it is obvious that this code runs the risk of dereferencing a 2032NULL pointer, because P1's store to x might propagate to P0 after the 2033test against NULL has been made but before the READ_ONCE() executes. 2034If the original code had said "r1 = READ_ONCE(x)" instead of "r1 = x", 2035the compiler would not have performed this optimization and there 2036would be no possibility of a NULL-pointer dereference. 2037 2038Given the possibility of transformations like this one, the LKMM 2039doesn't try to predict all possible outcomes of code containing plain 2040accesses. It is instead content to determine whether the code 2041violates the compiler's assumptions, which would render the ultimate 2042outcome undefined. 2043 2044In technical terms, the compiler is allowed to assume that when the 2045program executes, there will not be any data races. A "data race" 2046occurs when there are two memory accesses such that: 2047 20481. they access the same location, 2049 20502. at least one of them is a store, 2051 20523. at least one of them is plain, 2053 20544. they occur on different CPUs (or in different threads on the 2055 same CPU), and 2056 20575. they execute concurrently. 
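As a minimal illustration, in the same informal style as the other
examples in this document (the variable name and the choice of a plain
store racing with a marked load are arbitrary), the following two
accesses satisfy all five conditions:

        int x;

        P0()
        {
                x = 1;                  // plain store
        }

        P1()
        {
                int r1;

                r1 = READ_ONCE(x);      // marked load
        }

Nothing orders P0's store against P1's load, so nothing prevents the
two accesses from executing concurrently, and the LKMM would therefore
flag this as a data race.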
2058 2059In the literature, two accesses are said to "conflict" if they satisfy 20601 and 2 above. We'll go a little farther and say that two accesses 2061are "race candidates" if they satisfy 1 - 4. Thus, whether or not two 2062race candidates actually do race in a given execution depends on 2063whether they are concurrent. 2064 2065The LKMM tries to determine whether a program contains race candidates 2066which may execute concurrently; if it does then the LKMM says there is 2067a potential data race and makes no predictions about the program's 2068outcome. 2069 2070Determining whether two accesses are race candidates is easy; you can 2071see that all the concepts involved in the definition above are already 2072part of the memory model. The hard part is telling whether they may 2073execute concurrently. The LKMM takes a conservative attitude, 2074assuming that accesses may be concurrent unless it can prove they 2075are not. 2076 2077If two memory accesses aren't concurrent then one must execute before 2078the other. Therefore the LKMM decides two accesses aren't concurrent 2079if they can be connected by a sequence of hb, pb, and rb links 2080(together referred to as xb, for "executes before"). However, there 2081are two complicating factors. 2082 2083If X is a load and X executes before a store Y, then indeed there is 2084no danger of X and Y being concurrent. After all, Y can't have any 2085effect on the value obtained by X until the memory subsystem has 2086propagated Y from its own CPU to X's CPU, which won't happen until 2087some time after Y executes and thus after X executes. But if X is a 2088store, then even if X executes before Y it is still possible that X 2089will propagate to Y's CPU just as Y is executing. In such a case X 2090could very well interfere somehow with Y, and we would have to 2091consider X and Y to be concurrent. 2092 2093Therefore when X is a store, for X and Y to be non-concurrent the LKMM 2094requires not only that X must execute before Y but also that X must 2095propagate to Y's CPU before Y executes. (Or vice versa, of course, if 2096Y executes before X -- then Y must propagate to X's CPU before X 2097executes if Y is a store.) This is expressed by the visibility 2098relation (vis), where X ->vis Y is defined to hold if there is an 2099intermediate event Z such that: 2100 2101 X is connected to Z by a possibly empty sequence of 2102 cumul-fence links followed by an optional rfe link (if none of 2103 these links are present, X and Z are the same event), 2104 2105and either: 2106 2107 Z is connected to Y by a strong-fence link followed by a 2108 possibly empty sequence of xb links, 2109 2110or: 2111 2112 Z is on the same CPU as Y and is connected to Y by a possibly 2113 empty sequence of xb links (again, if the sequence is empty it 2114 means Z and Y are the same event). 2115 2116The motivations behind this definition are straightforward: 2117 2118 cumul-fence memory barriers force stores that are po-before 2119 the barrier to propagate to other CPUs before stores that are 2120 po-after the barrier. 2121 2122 An rfe link from an event W to an event R says that R reads 2123 from W, which certainly means that W must have propagated to 2124 R's CPU before R executed. 2125 2126 strong-fence memory barriers force stores that are po-before 2127 the barrier, or that propagate to the barrier's CPU before the 2128 barrier executes, to propagate to all CPUs before any events 2129 po-after the barrier can execute. 
2130 2131To see how this works out in practice, consider our old friend, the MP 2132pattern (with fences and statement labels, but without the conditional 2133test): 2134 2135 int buf = 0, flag = 0; 2136 2137 P0() 2138 { 2139 X: WRITE_ONCE(buf, 1); 2140 smp_wmb(); 2141 W: WRITE_ONCE(flag, 1); 2142 } 2143 2144 P1() 2145 { 2146 int r1; 2147 int r2 = 0; 2148 2149 Z: r1 = READ_ONCE(flag); 2150 smp_rmb(); 2151 Y: r2 = READ_ONCE(buf); 2152 } 2153 2154The smp_wmb() memory barrier gives a cumul-fence link from X to W, and 2155assuming r1 = 1 at the end, there is an rfe link from W to Z. This 2156means that the store to buf must propagate from P0 to P1 before Z 2157executes. Next, Z and Y are on the same CPU and the smp_rmb() fence 2158provides an xb link from Z to Y (i.e., it forces Z to execute before 2159Y). Therefore we have X ->vis Y: X must propagate to Y's CPU before Y 2160executes. 2161 2162The second complicating factor mentioned above arises from the fact 2163that when we are considering data races, some of the memory accesses 2164are plain. Now, although we have not said so explicitly, up to this 2165point most of the relations defined by the LKMM (ppo, hb, prop, 2166cumul-fence, pb, and so on -- including vis) apply only to marked 2167accesses. 2168 2169There are good reasons for this restriction. The compiler is not 2170allowed to apply fancy transformations to marked accesses, and 2171consequently each such access in the source code corresponds more or 2172less directly to a single machine instruction in the object code. But 2173plain accesses are a different story; the compiler may combine them, 2174split them up, duplicate them, eliminate them, invent new ones, and 2175who knows what else. Seeing a plain access in the source code tells 2176you almost nothing about what machine instructions will end up in the 2177object code. 2178 2179Fortunately, the compiler isn't completely free; it is subject to some 2180limitations. For one, it is not allowed to introduce a data race into 2181the object code if the source code does not already contain a data 2182race (if it could, memory models would be useless and no multithreaded 2183code would be safe!). For another, it cannot move a plain access past 2184a compiler barrier. 2185 2186A compiler barrier is a kind of fence, but as the name implies, it 2187only affects the compiler; it does not necessarily have any effect on 2188how instructions are executed by the CPU. In Linux kernel source 2189code, the barrier() function is a compiler barrier. It doesn't give 2190rise directly to any machine instructions in the object code; rather, 2191it affects how the compiler generates the rest of the object code. 2192Given source code like this: 2193 2194 ... some memory accesses ... 2195 barrier(); 2196 ... some other memory accesses ... 2197 2198the barrier() function ensures that the machine instructions 2199corresponding to the first group of accesses will all end po-before 2200any machine instructions corresponding to the second group of accesses 2201-- even if some of the accesses are plain. (Of course, the CPU may 2202then execute some of those accesses out of program order, but we 2203already know how to deal with such issues.) Without the barrier() 2204there would be no such guarantee; the two groups of accesses could be 2205intermingled or even reversed in the object code. 2206 2207The LKMM doesn't say much about the barrier() function, but it does 2208require that all fences are also compiler barriers. 
In addition, it 2209requires that the ordering properties of memory barriers such as 2210smp_rmb() or smp_store_release() apply to plain accesses as well as to 2211marked accesses. 2212 2213This is the key to analyzing data races. Consider the MP pattern 2214again, now using plain accesses for buf: 2215 2216 int buf = 0, flag = 0; 2217 2218 P0() 2219 { 2220 U: buf = 1; 2221 smp_wmb(); 2222 X: WRITE_ONCE(flag, 1); 2223 } 2224 2225 P1() 2226 { 2227 int r1; 2228 int r2 = 0; 2229 2230 Y: r1 = READ_ONCE(flag); 2231 if (r1) { 2232 smp_rmb(); 2233 V: r2 = buf; 2234 } 2235 } 2236 2237This program does not contain a data race. Although the U and V 2238accesses are race candidates, the LKMM can prove they are not 2239concurrent as follows: 2240 2241 The smp_wmb() fence in P0 is both a compiler barrier and a 2242 cumul-fence. It guarantees that no matter what hash of 2243 machine instructions the compiler generates for the plain 2244 access U, all those instructions will be po-before the fence. 2245 Consequently U's store to buf, no matter how it is carried out 2246 at the machine level, must propagate to P1 before X's store to 2247 flag does. 2248 2249 X and Y are both marked accesses. Hence an rfe link from X to 2250 Y is a valid indicator that X propagated to P1 before Y 2251 executed, i.e., X ->vis Y. (And if there is no rfe link then 2252 r1 will be 0, so V will not be executed and ipso facto won't 2253 race with U.) 2254 2255 The smp_rmb() fence in P1 is a compiler barrier as well as a 2256 fence. It guarantees that all the machine-level instructions 2257 corresponding to the access V will be po-after the fence, and 2258 therefore any loads among those instructions will execute 2259 after the fence does and hence after Y does. 2260 2261Thus U's store to buf is forced to propagate to P1 before V's load 2262executes (assuming V does execute), ruling out the possibility of a 2263data race between them. 2264 2265This analysis illustrates how the LKMM deals with plain accesses in 2266general. Suppose R is a plain load and we want to show that R 2267executes before some marked access E. We can do this by finding a 2268marked access X such that R and X are ordered by a suitable fence and 2269X ->xb* E. If E was also a plain access, we would also look for a 2270marked access Y such that X ->xb* Y, and Y and E are ordered by a 2271fence. We describe this arrangement by saying that R is 2272"post-bounded" by X and E is "pre-bounded" by Y. 2273 2274In fact, we go one step further: Since R is a read, we say that R is 2275"r-post-bounded" by X. Similarly, E would be "r-pre-bounded" or 2276"w-pre-bounded" by Y, depending on whether E was a store or a load. 2277This distinction is needed because some fences affect only loads 2278(i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise 2279the two types of bounds are the same. And as a degenerate case, we 2280say that a marked access pre-bounds and post-bounds itself (e.g., if R 2281above were a marked load then X could simply be taken to be R itself.) 2282 2283The need to distinguish between r- and w-bounding raises yet another 2284issue. When the source code contains a plain store, the compiler is 2285allowed to put plain loads of the same location into the object code. 2286For example, given the source code: 2287 2288 x = 1; 2289 2290the compiler is theoretically allowed to generate object code that 2291looks like: 2292 2293 if (x != 1) 2294 x = 1; 2295 2296thereby adding a load (and possibly replacing the store entirely). 
For this reason, whenever the LKMM requires a plain store to be
w-pre-bounded or w-post-bounded by a marked access, it also requires
the store to be r-pre-bounded or r-post-bounded, so as to handle cases
where the compiler adds a load.

(This may be overly cautious.  We don't know of any examples where a
compiler has augmented a store with a load in this fashion, and the
Linux kernel developers would probably fight pretty hard to change a
compiler if it ever did this.  Still, better safe than sorry.)

Incidentally, the other transformation -- augmenting a plain load by
adding in a store to the same location -- is not allowed.  This is
because the compiler cannot know whether any other CPUs might perform
a concurrent load from that location.  Two concurrent loads don't
constitute a race (they can't interfere with each other), but a store
does race with a concurrent load.  Thus adding a store might create a
data race where one was not already present in the source code,
something the compiler is forbidden to do.  Augmenting a store with a
load, on the other hand, is acceptable because doing so won't create a
data race unless one already existed.

The LKMM includes a second way to pre-bound plain accesses, in
addition to fences: an address dependency from a marked load.  That
is, in the sequence:

        p = READ_ONCE(ptr);
        r = *p;

the LKMM says that the marked load of ptr pre-bounds the plain load of
*p; the marked load must execute before any of the machine
instructions corresponding to the plain load.  This is a reasonable
stipulation, since after all, the CPU can't perform the load of *p
until it knows what value p will hold.  Furthermore, without some
assumption like this one, some usages typical of RCU would count as
data races.  For example:

        int a = 1, b;
        int *ptr = &a;

        P0()
        {
                b = 2;
                rcu_assign_pointer(ptr, &b);
        }

        P1()
        {
                int *p;
                int r;

                rcu_read_lock();
                p = rcu_dereference(ptr);
                r = *p;
                rcu_read_unlock();
        }

(In this example the rcu_read_lock() and rcu_read_unlock() calls don't
really do anything, because there aren't any grace periods.  They are
included merely for the sake of good form; typically P0 would call
synchronize_rcu() somewhere after the rcu_assign_pointer().)

rcu_assign_pointer() performs a store-release, so the plain store to b
is definitely w-post-bounded before the store to ptr, and the two
stores will propagate to P1 in that order.  However, rcu_dereference()
is only equivalent to READ_ONCE().  While it is a marked access, it is
not a fence or compiler barrier.  Hence the only guarantee we have
that the load of ptr in P1 is r-pre-bounded before the load of *p
(thus avoiding a race) is the assumption about address dependencies.

This is a situation where the compiler can undermine the memory model,
and a certain amount of care is required when programming constructs
like this one.  In particular, comparisons between the pointer and
other known addresses can cause trouble.
If you have something like: 2370 2371 p = rcu_dereference(ptr); 2372 if (p == &x) 2373 r = *p; 2374 2375then the compiler just might generate object code resembling: 2376 2377 p = rcu_dereference(ptr); 2378 if (p == &x) 2379 r = x; 2380 2381or even: 2382 2383 rtemp = x; 2384 p = rcu_dereference(ptr); 2385 if (p == &x) 2386 r = rtemp; 2387 2388which would invalidate the memory model's assumption, since the CPU 2389could now perform the load of x before the load of ptr (there might be 2390a control dependency but no address dependency at the machine level). 2391 2392Finally, it turns out there is a situation in which a plain write does 2393not need to be w-post-bounded: when it is separated from the other 2394race-candidate access by a fence. At first glance this may seem 2395impossible. After all, to be race candidates the two accesses must 2396be on different CPUs, and fences don't link events on different CPUs. 2397Well, normal fences don't -- but rcu-fence can! Here's an example: 2398 2399 int x, y; 2400 2401 P0() 2402 { 2403 WRITE_ONCE(x, 1); 2404 synchronize_rcu(); 2405 y = 3; 2406 } 2407 2408 P1() 2409 { 2410 rcu_read_lock(); 2411 if (READ_ONCE(x) == 0) 2412 y = 2; 2413 rcu_read_unlock(); 2414 } 2415 2416Do the plain stores to y race? Clearly not if P1 reads a non-zero 2417value for x, so let's assume the READ_ONCE(x) does obtain 0. This 2418means that the read-side critical section in P1 must finish executing 2419before the grace period in P0 does, because RCU's Grace-Period 2420Guarantee says that otherwise P0's store to x would have propagated to 2421P1 before the critical section started and so would have been visible 2422to the READ_ONCE(). (Another way of putting it is that the fre link 2423from the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link 2424between those two events.) 2425 2426This means there is an rcu-fence link from P1's "y = 2" store to P0's 2427"y = 3" store, and consequently the first must propagate from P1 to P0 2428before the second can execute. Therefore the two stores cannot be 2429concurrent and there is no race, even though P1's plain store to y 2430isn't w-post-bounded by any marked accesses. 2431 2432Putting all this material together yields the following picture. For 2433race-candidate stores W and W', where W ->co W', the LKMM says the 2434stores don't race if W can be linked to W' by a 2435 2436 w-post-bounded ; vis ; w-pre-bounded 2437 2438sequence. If W is plain then they also have to be linked by an 2439 2440 r-post-bounded ; xb* ; w-pre-bounded 2441 2442sequence, and if W' is plain then they also have to be linked by a 2443 2444 w-post-bounded ; vis ; r-pre-bounded 2445 2446sequence. For race-candidate load R and store W, the LKMM says the 2447two accesses don't race if R can be linked to W by an 2448 2449 r-post-bounded ; xb* ; w-pre-bounded 2450 2451sequence or if W can be linked to R by a 2452 2453 w-post-bounded ; vis ; r-pre-bounded 2454 2455sequence. For the cases involving a vis link, the LKMM also accepts 2456sequences in which W is linked to W' or R by a 2457 2458 strong-fence ; xb* ; {w and/or r}-pre-bounded 2459 2460sequence with no post-bounding, and in every case the LKMM also allows 2461the link simply to be a fence with no bounding at all. If no sequence 2462of the appropriate sort exists, the LKMM says that the accesses race. 2463 2464There is one more part of the LKMM related to plain accesses (although 2465not to data races) we should discuss. Recall that many relations such 2466as hb are limited to marked accesses only. 
As a result, the
happens-before, propagates-before, and rcu axioms (which state that
various relations must not contain cycles) don't apply to plain
accesses.  Nevertheless, we do want to rule out such cycles, because
they don't make sense even for plain accesses.

To this end, the LKMM imposes three extra restrictions, together
called the "plain-coherence" axiom because of their resemblance to the
rules used by the operational model to ensure cache coherence (that
is, the rules governing the memory subsystem's choice of a store to
satisfy a load request and its determination of where a store will
fall in the coherence order):

        If R and W are race candidates and it is possible to link R to
        W by one of the xb* sequences listed above, then W ->rfe R is
        not allowed (i.e., a load cannot read from a store that it
        executes before, even if one or both is plain).

        If W and R are race candidates and it is possible to link W to
        R by one of the vis sequences listed above, then R ->fre W is
        not allowed (i.e., if a store is visible to a load then the
        load must read from that store or one coherence-after it).

        If W and W' are race candidates and it is possible to link W
        to W' by one of the vis sequences listed above, then W' ->co W
        is not allowed (i.e., if one store is visible to a second then
        the second must come after the first in the coherence order).

This is the extent to which the LKMM deals with plain accesses.
Perhaps it could say more (for example, plain accesses might
contribute to the ppo relation), but at the moment it seems that this
minimal, conservative approach is good enough.


ODDS AND ENDS
-------------

This section covers material that didn't quite fit anywhere in the
earlier sections.

The descriptions in this document don't always match the formal
version of the LKMM exactly.  For example, the actual formal
definition of the prop relation makes the initial coe or fre part
optional, and it doesn't require the events linked by the relation to
be on the same CPU.  These differences are very unimportant; indeed,
instances where the coe/fre part of prop is missing are of no interest
because all the other parts (fences and rfe) are already included in
hb anyway, and where the formal model adds prop into hb, it includes
an explicit requirement that the events being linked are on the same
CPU.

Another minor difference has to do with events that are both memory
accesses and fences, such as those corresponding to smp_load_acquire()
calls.  In the formal model, these events aren't actually both reads
and fences; rather, they are read events with an annotation marking
them as acquires.  (Or write events annotated as releases, in the case
of smp_store_release().)  The final effect is the same.

Although we didn't mention it above, the instruction execution
ordering provided by the smp_rmb() fence doesn't apply to read events
that are part of a non-value-returning atomic update.  For instance,
given:

        atomic_inc(&x);
        smp_rmb();
        r1 = READ_ONCE(y);

it is not guaranteed that the load from y will execute after the
update to x.  This is because the ARMv8 architecture allows
non-value-returning atomic operations effectively to be executed off
the CPU.
Basically, the CPU tells the memory subsystem to increment 2537x, and then the increment is carried out by the memory hardware with 2538no further involvement from the CPU. Since the CPU doesn't ever read 2539the value of x, there is nothing for the smp_rmb() fence to act on. 2540 2541The LKMM defines a few extra synchronization operations in terms of 2542things we have already covered. In particular, rcu_dereference() is 2543treated as READ_ONCE() and rcu_assign_pointer() is treated as 2544smp_store_release() -- which is basically how the Linux kernel treats 2545them. 2546 2547Although we said that plain accesses are not linked by the ppo 2548relation, they do contribute to it indirectly. Namely, when there is 2549an address dependency from a marked load R to a plain store W, 2550followed by smp_wmb() and then a marked store W', the LKMM creates a 2551ppo link from R to W'. The reasoning behind this is perhaps a little 2552shaky, but essentially it says there is no way to generate object code 2553for this source code in which W' could execute before R. Just as with 2554pre-bounding by address dependencies, it is possible for the compiler 2555to undermine this relation if sufficient care is not taken. 2556 2557There are a few oddball fences which need special treatment: 2558smp_mb__before_atomic(), smp_mb__after_atomic(), and 2559smp_mb__after_spinlock(). The LKMM uses fence events with special 2560annotations for them; they act as strong fences just like smp_mb() 2561except for the sets of events that they order. Instead of ordering 2562all po-earlier events against all po-later events, as smp_mb() does, 2563they behave as follows: 2564 2565 smp_mb__before_atomic() orders all po-earlier events against 2566 po-later atomic updates and the events following them; 2567 2568 smp_mb__after_atomic() orders po-earlier atomic updates and 2569 the events preceding them against all po-later events; 2570 2571 smp_mb__after_spinlock() orders po-earlier lock acquisition 2572 events and the events preceding them against all po-later 2573 events. 2574 2575Interestingly, RCU and locking each introduce the possibility of 2576deadlock. When faced with code sequences such as: 2577 2578 spin_lock(&s); 2579 spin_lock(&s); 2580 spin_unlock(&s); 2581 spin_unlock(&s); 2582 2583or: 2584 2585 rcu_read_lock(); 2586 synchronize_rcu(); 2587 rcu_read_unlock(); 2588 2589what does the LKMM have to say? Answer: It says there are no allowed 2590executions at all, which makes sense. But this can also lead to 2591misleading results, because if a piece of code has multiple possible 2592executions, some of which deadlock, the model will report only on the 2593non-deadlocking executions. For example: 2594 2595 int x, y; 2596 2597 P0() 2598 { 2599 int r0; 2600 2601 WRITE_ONCE(x, 1); 2602 r0 = READ_ONCE(y); 2603 } 2604 2605 P1() 2606 { 2607 rcu_read_lock(); 2608 if (READ_ONCE(x) > 0) { 2609 WRITE_ONCE(y, 36); 2610 synchronize_rcu(); 2611 } 2612 rcu_read_unlock(); 2613 } 2614 2615Is it possible to end up with r0 = 36 at the end? The LKMM will tell 2616you it is not, but the model won't mention that this is because P1 2617will self-deadlock in the executions where it stores 36 in y. 2618
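For illustration, consider a hypothetical variant (a sketch of ours,
not part of the model's own examples) in which P1's synchronize_rcu()
call is moved outside the critical section, so that no execution
deadlocks:

        int x, y;

        P0()
        {
                int r0;

                WRITE_ONCE(x, 1);
                r0 = READ_ONCE(y);
        }

        P1()
        {
                rcu_read_lock();
                if (READ_ONCE(x) > 0)
                        WRITE_ONCE(y, 36);
                rcu_read_unlock();
                synchronize_rcu();
        }

Now every execution completes, and nothing in the relations described
above forbids r0 = 36: P0's store to x and its load from y are not
ordered by ppo, so P1 can read x = 1, store 36 to y, and have that
store propagate to P0 before P0's load executes.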