memory-model/Documentation/explanation.txt

1 Explanation of the Linux-Kernel Memory Consistency Model
15   7. THE PROGRAM ORDER RELATION: po AND po-loc
18   10. THE READS-FROM RELATION: rf, rfi, and rfe
20   12. THE FROM-READS RELATION: fr, fri, and fre
22   14. PROPAGATION ORDER RELATION: cumul-fence
28   20. THE HAPPENS-BEFORE RELATION: hb
29   21. THE PROPAGATES-BEFORE RELATION: pb
30   22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
31   23. SRCU READ-SIDE CRITICAL SECTIONS
39 ------------
41 The Linux-kernel memory consistency model (LKMM) is rather complex and
43 linux-kernel.bell and linux-kernel.cat files that make up the formal
69 ----------
87 factors such as DMA and mixed-size accesses.)  But on multiprocessor
98 ----------------
159 predict that r1 = 42 or r2 = -7, because neither of those values ever
181 ----------------------------
255 -------------------
286 ------
306 Atomic read-modify-write accesses, such as atomic_inc() or xchg(),
313 logical computations, control-flow instructions, or accesses to
319 is concerned only with the store itself -- its value and its address
320 -- not the computation leading up to it.
328 THE PROGRAM ORDER RELATION: po AND po-loc
329 -----------------------------------------
336 that X is po-before Y (written as "X ->po Y" in formulas) if X occurs
339 This is inherently a single-CPU relation; two instructions executing
343 po-loc is a sub-relation of po.  It links two memory accesses when the
345 same memory location (the "-loc" suffix).
348 program order we need to explain.  The LKMM was inspired by low-level
375 need not even be stored in normal memory at all -- in principle a
381 ---------
428 ------------------------------------------
484 come earlier in program order.  Symbolically, if we have R ->data X,
485 R ->addr X, or R ->ctrl X (where R is a read event), then we must also
486 have R ->po X.  It wouldn't make sense for a computation to depend
507 the load generated for a READ_ONCE() -- that's one of the nice
508 properties of READ_ONCE() -- but it is allowed to ignore the load's
542 THE READS-FROM RELATION: rf, rfi, and rfe
543 -----------------------------------------
545 The reads-from relation (rf) links a write event to a read event when
548 write W ->rf R to indicate that the load R reads from the store W.  We
550 the same CPU (internal reads-from, or rfi) and where they occur on
551 different CPUs (external reads-from, or rfe).
559 of load-tearing, where a load obtains some of its bits from one store
561 and WRITE_ONCE() will prevent load-tearing; it's not possible to have:
580 On the other hand, load-tearing is unavoidable when mixed-size
601 If r1 = 0x56781234 (little-endian!) at the end, then P1 must have read
602 from both of P0's stores.  It is possible to handle mixed-size and
609 ------------------------------------------------------------------
612 multi-processor system, the CPUs must share a consistent view of the
628 hardware-centric view, the order in which the stores get written to
629 x's cache line).  We write W ->co W' if W comes before W' in the
636 	Write-write coherence: If W ->po-loc W' (i.e., W comes before
638 	and W' are two stores, then W ->co W'.
640 	Write-read coherence: If W ->po-loc R, where W is a store and R
644 	Read-write coherence: If R ->po-loc W, where R is a load and W
648 	Read-read coherence: If R ->po-loc R', where R and R' are two
675 write-write coherence rule: Since the store of 23 comes later in
689 If r1 = 666 at the end, this would violate the read-write coherence
712 would violate the read-read coherence rule: The r1 load comes before
719 encoded in Itanium's Very-Long-Instruction-Word format, and it is yet
723 Just like the po relation, co is inherently an ordering -- it is not
731 related by po.  Coherence order is strictly per-location, or if you
735 THE FROM-READS RELATION: fr, fri, and fre
736 -----------------------------------------
738 The from-reads relation (fr) can be a little difficult for people to
740 overwritten by a store.  In other words, we have R ->fr W when the
769 event W for the same location, we will have R ->fr W if and only if
770 the write which R reads from is co-before W.  In symbols,
772 	(R ->fr W) := (there exists W' with W' ->rf R and W' ->co W).
776 --------------------
795 arrange for the store to be co-later than (i.e., to overwrite) any
799 whether there are any as-yet unexecuted store instructions, for the
801 uses the value of the po-latest such store as the value obtained by R,
805 of the co-latest store to the location in question which has already
814 First-In-First-Out order, and consequently the processing delay
816 have a partitioned design that results in non-FIFO behavior.  We will
834 	the CPU to execute all po-earlier instructions before any
835 	po-later instructions;
837 	smp_rmb() forces the CPU to execute all po-earlier loads
838 	before any po-later loads;
840 	smp_wmb() forces the CPU to execute all po-earlier stores
841 	before any po-later stores;
845 	part of an smp_load_acquire()) before any po-later
849 	execute all po-earlier instructions before the store
856 	For each other CPU C', smp_wmb() forces all po-earlier stores
857 	on C to propagate to C' before any po-later stores do.
860 	a release fence is executed (including all po-earlier
865 	executed (including all po-earlier stores on C) is forced to
866 	propagate to all other CPUs before any instructions po-after
873 strong fences are A-cumulative.  By contrast, smp_wmb() fences are not
874 A-cumulative; they only affect the propagation of stores that are
882 PROPAGATION ORDER RELATION: cumul-fence
883 ---------------------------------------
886 smp_wmb() fences) are collectively referred to as cumul-fences, even
887 though smp_wmb() isn't A-cumulative.  The cumul-fence relation is
894 	where either X = E or else E ->rf X; or
897 	order, where either X = E or else E ->rf X.
900 and W ->cumul-fence W', then W must propagate to any given CPU
906 -------------------------------------------------
919 	Atomicity: This requires that atomic read-modify-write
923 	Happens-before: This requires that certain instructions are
929 	Rcu: This requires that RCU read-side critical sections and
931 	Grace-Period Guarantee.
933 	Plain-coherence: This requires that plain memory accesses
938 memory models (such as those for C11/C++11).  The "happens-before" and
940 "rcu" and "plain-coherence" axioms are specific to the LKMM.
946 -----------------------------------
957 and po-loc relations agree with this global ordering; in other words,
958 whenever we have X ->rf Y or X ->co Y or X ->fr Y or X ->po-loc Y, the
964 	X0 -> X1 -> X2 -> ... -> Xn -> X0,
966 where each of the links is either rf, co, fr, or po-loc.  This has to
976 -------------------
978 What does it mean to say that a read-modify-write (rmw) update, such
1002 atomic read-modify-write and W' is the write event which R reads from,
1006 	(R ->rmw W) implies (there is no X with R ->fr X and X ->co W),
1016 	Z0 ->rf Y1 ->rmw Z1 ->rf ... ->rf Yn ->rmw Zn,
1019 degenerate case).  We write this relation as: Z0 ->rmw-sequence Zn.
1023 cumul-fence relation.  That is, if we have:
1025 	U ->cumul-fence X -> rmw-sequence Y
1027 then also U ->cumul-fence Y.  Thinking about this in terms of the
1028 operational model, U ->cumul-fence X says that the store U propagates
1032 the w-post-bounded relation defined below in the PLAIN ACCESSES AND
1038 updates with full-barrier semantics did not always guarantee ordering
1039 at least as strong as atomic updates with release-barrier semantics.)
1043 -----------------------------------------
1047 "preserved program order") relation, which links the po-earlier
1048 instruction to the po-later instruction and is thus a sub-relation of
1053 memory accesses with X ->po Y; then the CPU must execute X before Y if
1072 	X and Y are both loads, X ->addr Y (i.e., there is an address
1094 To be fair about it, all Linux-supported architectures do execute
1098 the split-cache design used by Alpha can cause it to behave in a way
1106 store and a second, po-later load reads from that store:
1108 	R ->dep W ->rfi R',
1135 	R ->po-loc W
1137 (the po-loc link says that R comes before W in program order and they
1141 violation of the read-write coherence rule.  Similarly, if we had
1143 	W ->po-loc W'
1147 overwrite W', in violation of the write-write coherence rule.
1149 allowing out-of-order writes like this to occur.  The model avoided
1150 violating the write-write coherence rule by requiring the CPU not to
1155 ------------------------
1162 	int y = -1;
1209 effect of the fence is to cause the CPU not to execute any po-later
1230 share this property: They do not allow the CPU to execute any po-later
1231 instructions (or po-later loads in the case of smp_rmb()) until all
1234 po-earlier stores to propagate to every other CPU in the system; then
1236 as of that time -- not just the stores received when the strong fence
1243 THE HAPPENS-BEFORE RELATION: hb
1244 -------------------------------
1246 The happens-before relation (hb) links memory accesses that have to
1250 W ->rfe R implies that W and R are on different CPUs.  It also means
1253 must have executed before R, and so we have W ->hb R.
1255 The equivalent fact need not hold if W ->rfi R (i.e., W and R are on
1262 W ->coe W'.  This means that W and W' are stores to the same location,
1268 R ->fre W means that W overwrites the value which R reads, but it
1378 outcome is impossible -- as it should be.
1381 followed by an arbitrary number of cumul-fence links, ending with an
1385 followed by two cumul-fences and an rfe link, utilizing the fact that
1386 release fences are A-cumulative:
1417 store to y does (the first cumul-fence), the store to y propagates to P2
1420 store to z does (the second cumul-fence), and P0's load executes after the
1425 requirement is the content of the LKMM's "happens-before" axiom.
1433 THE PROPAGATES-BEFORE RELATION: pb
1434 ----------------------------------
1436 The propagates-before (pb) relation capitalizes on the special
1438 store is coherence-later than E and propagates to every CPU and to RAM
1440 F via a coe or fre link, an arbitrary number of cumul-fences, an
1448 	E ->coe W ->cumul-fence* X ->rfe? Y ->strong-fence Z ->hb* F,
1452 be equal to X).  Because of the cumul-fence links, we know that W will
1467 coherence order, contradicting the fact that E ->coe W.  If E was a
1470 contradicting the fact that E ->fre W.
1498 In this example, the sequences of cumul-fence and hb links are empty.
1513 RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
1514 ------------------------------------------------------------------------
1516 RCU (Read-Copy-Update) is a powerful synchronization mechanism.  It
1517 rests on two concepts: grace periods and read-side critical sections.
1520 synchronize_rcu().  A read-side critical section (or just critical
1526 Grace-Period Guarantee, which states that a critical section can never
1576 suitable places in the RCU-related code.  Thus, if a critical section
1589 rcu-link relation.  rcu-link encompasses a very general notion of
1592 E ->rcu-link F includes cases where E is po-before some memory-access
1593 event X, F is po-after some memory-access event Y, and we have any of
1594 X ->rfe Y, X ->co Y, or X ->fr Y.
1596 The formal definition of the rcu-link relation is more than a little
1600 about rcu-link is the information in the preceding paragraph.
1602 The LKMM also defines the rcu-gp and rcu-rscsi relations.  They bring
1603 grace periods and read-side critical sections into the picture, in the
1606 	E ->rcu-gp F means that E and F are in fact the same event,
1610 	E ->rcu-rscsi F means that E and F are the rcu_read_unlock()
1611 	and rcu_read_lock() fence events delimiting some read-side
1616 If we think of the rcu-link relation as standing for an extended
1617 "before", then X ->rcu-gp Y ->rcu-link Z roughly says that X is a
1621 after X ends.)  Similarly, X ->rcu-rscsi Y ->rcu-link Z says that X is
1624 The LKMM goes on to define the rcu-order relation as a sequence of
1625 rcu-gp and rcu-rscsi links separated by rcu-link links, in which the
1626 number of rcu-gp links is >= the number of rcu-rscsi links.  For
1629 	X ->rcu-gp Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V
1631 would imply that X ->rcu-order V, because this sequence contains two
1632 rcu-gp links and one rcu-rscsi link.  (It also implies that
1633 X ->rcu-order T and Z ->rcu-order V.)  On the other hand:
1635 	X ->rcu-rscsi Y ->rcu-link Z ->rcu-rscsi T ->rcu-link U ->rcu-gp V
1637 does not imply X ->rcu-order V, because the sequence contains only
1638 one rcu-gp link but two rcu-rscsi links.
1640 The rcu-order relation is important because the Grace Period Guarantee
1641 means that rcu-order links act kind of like strong fences.  In
1642 particular, E ->rcu-order F implies not only that E begins before F
1643 ends, but also that any write po-before E will propagate to every CPU
1644 before any instruction po-after F can execute.  (However, it does not
1646 fence event is linked to itself by rcu-order as a degenerate case.)
1651 	G ->rcu-gp W ->rcu-link Z ->rcu-rscsi F.
1654 and there are events X, Y and a read-side critical section C such that:
1656 	1. G = W is po-before or equal to X;
1660 	3. Y is po-before Z;
1666 From 1 - 4 we deduce that the grace period G ends before the critical
1671 executing and hence before any instruction po-after F can execute.
1673 covered by rcu-order.
1675 The rcu-fence relation is a simple extension of rcu-order.  While
1676 rcu-order only links certain fence events (calls to synchronize_rcu(),
1677 rcu_read_lock(), or rcu_read_unlock()), rcu-fence links any events
1678 that are separated by an rcu-order link.  This is analogous to the way
1679 the strong-fence relation links events that are separated by an
1680 smp_mb() fence event (as mentioned above, rcu-order links act kind of
1681 like strong fences).  Written symbolically, X ->rcu-fence Y means
1684 	X ->po E ->rcu-order F ->po Y.
1688 every CPU before Y executes.  Thus rcu-fence is sort of a
1689 "super-strong" fence: Unlike the original strong fences (smp_mb() and
1690 synchronize_rcu()), rcu-fence is able to link events on different
1691 CPUs.  (Perhaps this fact should lead us to say that rcu-fence isn't
1694 Finally, the LKMM defines the RCU-before (rb) relation in terms of
1695 rcu-fence.  This is done in essentially the same way as the pb
1696 relation was defined in terms of strong-fence.  We will omit the
1697 details; the end result is that E ->rb F implies E must execute
1698 before F, just as E ->pb F does (and for much the same reasons).
1703 and F with E ->rcu-link F ->rcu-order E.  Or to put it a third way,
1704 the axiom requires that there are no cycles consisting of rcu-gp and
1705 rcu-rscsi alternating with rcu-link, where the number of rcu-gp links
1706 is >= the number of rcu-rscsi links.
1721 are events Q and R where Q is po-after L (which marks the start of the
1722 critical section), Q is "before" R in the sense used by the rcu-link
1723 relation, and R is po-before the grace period S.  Thus we have:
1725 	L ->rcu-link S.
1731 some event X which is po-after S.  Symbolically, this amounts to:
1733 	S ->po X ->hb* Z ->fr W ->rf Y ->po U.
1737 discussion of the rcu-link relation earlier) that S and U are related
1738 by rcu-link:
1740 	S ->rcu-link U.
1742 Since S is a grace period we have S ->rcu-gp S, and since L and U are
1743 the start and end of the critical section C we have U ->rcu-rscsi L.
1746 	S ->rcu-gp S ->rcu-link U ->rcu-rscsi L ->rcu-link S,
1751 For something a little more down-to-earth, let's see how the axiom
1776 P1's load at W reads from, so we have W ->fre Y.  Since S ->po W and
1777 also Y ->po U, we get S ->rcu-link U.  In addition, S ->rcu-gp S
1781 so we have X ->rfe Z.  Together with L ->po X and Z ->po S, this
1782 yields L ->rcu-link S.  And since L and U are the start and end of a
1783 critical section, we have U ->rcu-rscsi L.
1785 Then U ->rcu-rscsi L ->rcu-link S ->rcu-gp S ->rcu-link U is a
1823 that U0 ->rcu-rscsi L0 ->rcu-link S1 ->rcu-gp S1 ->rcu-link U2 ->rcu-rscsi
1824 L2 ->rcu-link U0.  However this cycle is not forbidden, because the
1825 sequence of relations contains fewer instances of rcu-gp (one) than of
1826 rcu-rscsi (two).  Consequently the outcome is allowed by the LKMM.
1831 --------------------	--------------------	--------------------
1852 The LKMM supports SRCU (Sleepable Read-Copy-Update) in addition to
1854 relations srcu-gp and srcu-rscsi added to represent SRCU grace periods
1855 and read-side critical sections.  However, there are some significant
1856 differences between RCU read-side critical sections and their SRCU
1860 SRCU READ-SIDE CRITICAL SECTIONS
1861 --------------------------------
1863 The LKMM uses the srcu-rscsi relation to model SRCU read-side critical
1864 sections.  They differ from RCU read-side critical sections in the
1870 	an SRCU domain, and calls linked by srcu-rscsi must have the
1871 	same domain.  Read-side critical sections and grace periods
1882 	read-side critical sections to overlap partially, as in the
1894 	created two nested (fully overlapping) read-side critical
1903 	an SRCU read-side critical section to start on one CPU and end
1912 belonging to the "srcu-lock" and "srcu-unlock" event classes
1920 from the load (srcu-lock) to the store (srcu-unlock).  For example,
1938 except for the presence of the special srcu-lock and srcu-unlock
1939 annotations.  You can see at once that we have A ->data C and
1940 B ->data D.  These dependencies tell the LKMM that C is the
1941 srcu-unlock event matching srcu-lock event A, and D is the
1942 srcu-unlock event matching srcu-lock event B.
1986 	A[srcu-lock] ->data B[once] ->rf C[once] ->data D[srcu-unlock].
1988 The LKMM defines a carry-srcu-data relation to express this pattern;
1994 an srcu-lock event and the final data dependency leading to the
1995 matching srcu-unlock event.  carry-srcu-data is complicated by the
1997 sequence are instances of srcu-unlock.  This is necessary because in a
2008 	A ->data B ->rf C ->data D.
2010 This would cause carry-srcu-data to mistakenly extend a data
2012 srcu-unlock event matching A's srcu-lock.  To avoid such problems,
2013 carry-srcu-data does not accept sequences in which the ends of any of
2014 the intermediate ->data links (B above) is an srcu-unlock event.
2018 -------
2048 store-release in a spin_unlock() and the load-acquire which forms the
2050 spin_trylock() -- we can call these things lock-releases and
2051 lock-acquires -- have two properties beyond those of ordinary releases
2054 First, when a lock-acquire reads from or is po-after a lock-release,
2055 the LKMM requires that every instruction po-before the lock-release
2056 must execute before any instruction po-after the lock-acquire.  This
2084 Here the second spin_lock() is po-after the first spin_unlock(), and
2090 fences, only to lock-related operations.  For instance, suppose P0()
2114 Second, when a lock-acquire reads from or is po-after a lock-release,
2115 and some other stores W and W' occur po-before the lock-release and
2116 po-after the lock-acquire respectively, the LKMM requires that W must
2157 These two special requirements for lock-release and lock-acquire do
2165 -----------------------------
2170 operations of one kind or another.  Ordinary C-language memory
2176 threads or CPUs.  This leaves compilers free to implement all manner
2222 would be no possibility of a NULL-pointer dereference.
2247 are "race candidates" if they satisfy 1 - 4.  Thus, whether or not two
2279 Therefore when X is a store, for X and Y to be non-concurrent the LKMM
2282 Y executes before X -- then Y must propagate to X's CPU before X
2284 relation (vis), where X ->vis Y is defined to hold if there is an
2288 	cumul-fence links followed by an optional rfe link (if none of
2293 	Z is connected to Y by a strong-fence link followed by a
2304 	cumul-fence memory barriers force stores that are po-before
2306 	po-after the barrier.
2312 	strong-fence memory barriers force stores that are po-before
2315 	po-after the barrier can execute.
2340 The smp_wmb() memory barrier gives a cumul-fence link from X to W, and
2345 Y).  Therefore we have X ->vis Y: X must propagate to Y's CPU before Y
2352 cumul-fence, pb, and so on -- including vis) apply only to marked
2365 Fortunately, the compiler isn't completely free; it is subject to some
2385 corresponding to the first group of accesses will all end po-before
2387 -- even if some of the accesses are plain.  (Of course, the CPU may
2428 	cumul-fence.  It guarantees that no matter what hash of
2430 	access U, all those instructions will be po-before the fence.
2437 	executed, i.e., X ->vis Y.  (And if there is no rfe link then
2442 	fence.  It guarantees that all the machine-level instructions
2443 	corresponding to the access V will be po-after the fence, and
2455 X ->xb* E.  If E was also a plain access, we would also look for a
2456 marked access Y such that X ->xb* Y, and Y and E are ordered by a
2458 "post-bounded" by X and E is "pre-bounded" by Y.
2461 "r-post-bounded" by X.  Similarly, E would be "r-pre-bounded" or
2462 "w-pre-bounded" by Y, depending on whether E was a store or a load.
2466 say that a marked access pre-bounds and post-bounds itself (e.g., if R
2469 The need to distinguish between r- and w-bounding raises yet another
2484 w-pre-bounded or w-post-bounded by a marked access, it also requires
2485 the store to be r-pre-bounded or r-post-bounded, so as to handle cases
2493 Incidentally, the other tranformation -- augmenting a plain load by
2494 adding in a store to the same location -- is not allowed.  This is
2504 The LKMM includes a second way to pre-bound plain accesses, in
2511 the LKMM says that the marked load of ptr pre-bounds the plain load of
2544 rcu_assign_pointer() performs a store-release, so the plain store to b
2545 is definitely w-post-bounded before the store to ptr, and the two
2549 that the load of ptr in P1 is r-pre-bounded before the load of *p
2579 not need to be w-post-bounded: when it is separated from the other
2580 race-candidate access by a fence.  At first glance this may seem
2583 Well, normal fences don't -- but rcu-fence can!  Here's an example:
2602 Do the plain stores to y race?  Clearly not if P1 reads a non-zero
2604 means that the read-side critical section in P1 must finish executing
2605 before the grace period in P0 does, because RCU's Grace-Period
2609 from the READ_ONCE() to the WRITE_ONCE() gives rise to an rcu-link
2612 This means there is an rcu-fence link from P1's "y = 2" store to P0's
2616 isn't w-post-bounded by any marked accesses.
2619 race-candidate stores W and W', where W ->co W', the LKMM says the
2622 	w-post-bounded ; vis ; w-pre-bounded
2626 	r-post-bounded ; xb* ; w-pre-bounded
2630 	w-post-bounded ; vis ; r-pre-bounded
2632 sequence.  For race-candidate load R and store W, the LKMM says the
2635 	r-post-bounded ; xb* ; w-pre-bounded
2639 	w-post-bounded ; vis ; r-pre-bounded
2644 	strong-fence ; xb* ; {w and/or r}-pre-bounded
2646 sequence with no post-bounding, and in every case the LKMM also allows
2653 happens-before, propagates-before, and rcu axioms (which state that
2659 called the "plain-coherence" axiom because of their resemblance to the
2666 	W by one of the xb* sequences listed above, then W ->rfe R is
2671 	R by one of the vis sequences listed above, then R ->fre W is
2673 	load must read from that store or one coherence-after it).
2676 	to W' by one of the vis sequences listed above, then W' ->co W
2687 -------------
2712 that are part of a non-value-returning atomic update.  For instance,
2721 non-value-returning atomic operations effectively to be executed off
2730 smp_store_release() -- which is basically how the Linux kernel treats
2740 pre-bounding by address dependencies, it is possible for the compiler
2755 all po-earlier events against all po-later events, as smp_mb() does,
2758 	smp_mb__before_atomic() orders all po-earlier events against
2759 	po-later atomic updates and the events following them;
2761 	smp_mb__after_atomic() orders po-earlier atomic updates and
2762 	the events preceding them against all po-later events;
2764 	smp_mb__after_spinlock() orders po-earlier lock acquisition
2765 	events and the events preceding them against all po-later
2786 non-deadlocking executions.  For example:
2810 will self-deadlock in the executions where it stores 36 in y.