======================================================
A Tour Through TREE_RCU's Grace-Period Memory Ordering
======================================================

August 8, 2017

This article was contributed by Paul E. McKenney

Introduction
============

This document gives a rough visual overview of how Tree RCU's
grace-period memory ordering guarantee is provided.

What Is Tree RCU's Grace Period Memory Ordering Guarantee?
==========================================================

RCU grace periods provide extremely strong memory-ordering guarantees
for non-idle non-offline code.
Any code that happens after the end of a given RCU grace period is guaranteed
to see the effects of all accesses prior to the beginning of that grace
period that are within RCU read-side critical sections.
Similarly, any code that happens before the beginning of a given RCU grace
period is guaranteed not to see the effects of any accesses following the end
of that grace period that are within RCU read-side critical sections.

Note well that RCU-sched read-side critical sections include any region
of code for which preemption is disabled.
Given that each individual machine instruction can be thought of as
an extremely small region of preemption-disabled code, one can think of
``synchronize_rcu()`` as ``smp_mb()`` on steroids.

RCU updaters use this guarantee by splitting their updates into
two phases, one of which is executed before the grace period and
the other of which is executed after the grace period.
In the most common use case, phase one removes an element from
a linked RCU-protected data structure, and phase two frees that element.
For this to work, any readers that have witnessed state prior to the
phase-one update (in the common case, removal) must not witness state
following the phase-two update (in the common case, freeing).

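As a concrete illustration of this two-phase pattern, consider the
following sketch, which unlinks an element from an RCU-protected list
and then frees it. The ``struct foo``, ``foo_lock``, and ``foo_del()``
names are hypothetical, but the RCU APIs are the standard ones::

    struct foo {
            struct list_head list;
            int data;
    };

    static DEFINE_SPINLOCK(foo_lock);  /* serializes updaters only */

    void foo_del(struct foo *fp)
    {
            /* Phase one: unlink the element so new readers cannot find it. */
            spin_lock(&foo_lock);
            list_del_rcu(&fp->list);
            spin_unlock(&foo_lock);

            /* Wait for all pre-existing readers to complete. */
            synchronize_rcu();

            /* Phase two: no reader can still hold a reference, so free it. */
            kfree(fp);
    }

Any reader that runs concurrently with ``foo_del()`` either misses the
element entirely or completes its read-side critical section before the
``kfree()``, which is exactly the guarantee described above.
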
The RCU implementation provides this guarantee using a network
of lock-based critical sections, memory barriers, and per-CPU
processing, as is described in the following sections.

Tree RCU Grace Period Memory Ordering Building Blocks
=====================================================

The workhorse for RCU's grace-period memory ordering is the
critical section for the ``rcu_node`` structure's
``->lock``. These critical sections use helper functions for lock
acquisition, including ``raw_spin_lock_rcu_node()``,
``raw_spin_lock_irq_rcu_node()``, and ``raw_spin_lock_irqsave_rcu_node()``.
Their lock-release counterparts are ``raw_spin_unlock_rcu_node()``,
``raw_spin_unlock_irq_rcu_node()``, and
``raw_spin_unlock_irqrestore_rcu_node()``, respectively.
For completeness, a ``raw_spin_trylock_rcu_node()`` is also provided.
The key point is that the lock-acquisition functions, including
``raw_spin_trylock_rcu_node()``, all invoke ``smp_mb__after_unlock_lock()``
immediately after successful acquisition of the lock.

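These helpers are wrappers of roughly the following form; this is a
simplified sketch, as the actual kernel definitions also handle lockdep
annotations and private-field access::

    #define raw_spin_lock_rcu_node(p)          \
    do {                                       \
            raw_spin_lock(&(p)->lock);         \
            smp_mb__after_unlock_lock();       \
    } while (0)

On most architectures, ``smp_mb__after_unlock_lock()`` compiles to
nothing because an unlock followed by a lock already provides full
ordering, but on PowerPC it emits a full memory barrier.
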
Therefore, for any given ``rcu_node`` structure, any access
happening before one of the above lock-release functions will be seen
by all CPUs as happening before any access happening after a later
one of the above lock-acquisition functions.
Furthermore, any access happening before one of the
above lock-release functions on any given CPU will be seen by all
CPUs as happening before any access happening after a later one
of the above lock-acquisition functions executing on that same CPU,
even if the lock-release and lock-acquisition functions are operating
on different ``rcu_node`` structures.
Tree RCU uses these two ordering guarantees to form an ordering
network among all CPUs that were in any way involved in the grace
period, including any CPUs that came online or went offline during
the grace period in question.

The following litmus test exhibits the ordering effects of these
lock-acquisition and lock-release functions::

    int x, y, z;

    void task0(void)
    {
      raw_spin_lock_rcu_node(rnp);
      WRITE_ONCE(x, 1);
      r1 = READ_ONCE(y);
      raw_spin_unlock_rcu_node(rnp);
    }

    void task1(void)
    {
      raw_spin_lock_rcu_node(rnp);
      WRITE_ONCE(y, 1);
      r2 = READ_ONCE(z);
      raw_spin_unlock_rcu_node(rnp);
    }

    void task2(void)
    {
      WRITE_ONCE(z, 1);
      smp_mb();
      r3 = READ_ONCE(x);
    }

    WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);

The ``WARN_ON()`` is evaluated at "the end of time",
after all changes have propagated throughout the system.
Without the ``smp_mb__after_unlock_lock()`` provided by the
acquisition functions, this ``WARN_ON()`` could trigger, for example,
on PowerPC.
The ``smp_mb__after_unlock_lock()`` invocations prevent this
``WARN_ON()`` from triggering.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But the chain of rcu_node-structure lock acquisitions guarantees      |
| that new readers will see all of the updater's pre-grace-period       |
| accesses and also guarantees that the updater's post-grace-period     |
| accesses will see all of the old reader's accesses.  So why do we     |
| need all of those calls to smp_mb__after_unlock_lock()?               |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Because we must provide ordering for RCU's polling grace-period       |
| primitives, for example, get_state_synchronize_rcu() and              |
| poll_state_synchronize_rcu().  Consider this code::                   |
|                                                                       |
|  CPU 0                                     CPU 1                      |
|  ----                                      ----                       |
|  WRITE_ONCE(X, 1)                          WRITE_ONCE(Y, 1)           |
|  g = get_state_synchronize_rcu()           smp_mb()                   |
|  while (!poll_state_synchronize_rcu(g))    r1 = READ_ONCE(X)          |
|          continue;                                                    |
|  r0 = READ_ONCE(Y)                                                    |
|                                                                       |
| RCU guarantees that the outcome r0 == 0 && r1 == 0 will not           |
| happen, even if CPU 1 is in an RCU extended quiescent state           |
| (idle or offline) and thus won't interact directly with the RCU       |
| core processing at all.                                               |
+-----------------------------------------------------------------------+

This approach must be extended to include idle CPUs, which need
RCU's grace-period memory ordering guarantee to extend to any
RCU read-side critical sections preceding and following the current
idle sojourn.
This case is handled by calls to the strongly ordered
``atomic_add_return()`` read-modify-write atomic operation that
is invoked within ``rcu_dynticks_eqs_enter()`` at idle-entry
time and within ``rcu_dynticks_eqs_exit()`` at idle-exit time.
The grace-period kthread invokes ``rcu_dynticks_snap()`` and
``rcu_dynticks_in_eqs_since()`` (both of which invoke
an ``atomic_add_return()`` of zero) to detect idle CPUs.

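The following conceptual sketch shows the idea; the ``eqs_ctr`` counter
and the ``example_*()`` helpers are hypothetical simplifications of the
kernel's per-CPU ``->dynticks`` handling::

    static atomic_t eqs_ctr;  /* conceptually one counter per CPU */

    static void example_eqs_enter(void)  /* idle entry */
    {
            /* Fully ordered: prior accesses happen before the idle period. */
            atomic_add_return(1, &eqs_ctr);
    }

    static void example_eqs_exit(void)   /* idle exit */
    {
            /* Fully ordered: later accesses happen after the idle period. */
            atomic_add_return(1, &eqs_ctr);
    }

    static int example_eqs_snap(void)    /* grace-period kthread */
    {
            /* An atomic_add_return() of zero reads with full ordering. */
            return atomic_add_return(0, &eqs_ctr);
    }

If the grace-period kthread's snapshots show that the CPU was in (or
passed through) an extended quiescent state, the kthread can report a
quiescent state on that CPU's behalf, with the fully ordered atomics
supplying the needed memory ordering.
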
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But what about CPUs that remain offline for the entire grace period?  |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Such CPUs will be offline at the beginning of the grace period, so    |
| the grace period won't expect quiescent states from them. Races       |
| between grace-period start and CPU-hotplug operations are mediated    |
| by the CPU's leaf ``rcu_node`` structure's ``->lock`` as described    |
| above.                                                                |
+-----------------------------------------------------------------------+

The approach must be extended to handle one final case, that of waking a
task blocked in ``synchronize_rcu()``. This task might be affined to
a CPU that is not yet aware that the grace period has ended, and thus
might not yet be subject to the grace period's memory ordering.
Therefore, there is an ``smp_mb()`` after the return from
``wait_for_completion()`` in the ``synchronize_rcu()`` code path.

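The intended code path can be sketched as follows, where ``rs`` is a
hypothetical ``struct rcu_synchronize`` instance (see the Quick Quiz
below for this barrier's provenance)::

    wait_for_completion(&rs.completion);  /* grace-period kthread's wakeup */
    smp_mb();  /* Order post-grace-period accesses on the awakened CPU. */
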
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| What? Where??? I don't see any ``smp_mb()`` after the return from     |
| ``wait_for_completion()``!!!                                          |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| That would be because I spotted the need for that ``smp_mb()`` during |
| the creation of this documentation, and it is therefore unlikely to   |
| hit mainline before v4.14. Kudos to Lance Roy, Will Deacon, Peter     |
| Zijlstra, and Jonathan Cameron for asking questions that sensitized   |
| me to the rather elaborate sequence of events that demonstrate the    |
| need for this memory barrier.                                         |
+-----------------------------------------------------------------------+

Tree RCU's grace-period memory-ordering guarantees rely most heavily on
the ``rcu_node`` structure's ``->lock`` field, so much so that it is
necessary to abbreviate this pattern in the diagrams in the next
section. For example, consider the ``rcu_prepare_for_idle()`` function
shown below, which is one of several functions that enforce ordering of
newly arrived RCU callbacks against future grace periods:

::

    1 static void rcu_prepare_for_idle(void)
    2 {
    3   bool needwake;
    4   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
    5   struct rcu_node *rnp;
    6   int tne;
    7
    8   lockdep_assert_irqs_disabled();
    9   if (rcu_rdp_is_offloaded(rdp))
   10     return;
   11
   12   /* Handle nohz enablement switches conservatively. */
   13   tne = READ_ONCE(tick_nohz_active);
   14   if (tne != rdp->tick_nohz_enabled_snap) {
   15     if (!rcu_segcblist_empty(&rdp->cblist))
   16       invoke_rcu_core(); /* force nohz to see update. */
   17     rdp->tick_nohz_enabled_snap = tne;
   18     return;
   19   }
   20   if (!tne)
   21     return;
   22
   23   /*
   24    * If we have not yet accelerated this jiffy, accelerate all
   25    * callbacks on this CPU.
   26    */
   27   if (rdp->last_accelerate == jiffies)
   28     return;
   29   rdp->last_accelerate = jiffies;
   30   if (rcu_segcblist_pend_cbs(&rdp->cblist)) {
   31     rnp = rdp->mynode;
   32     raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
   33     needwake = rcu_accelerate_cbs(rnp, rdp);
   34     raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
   35     if (needwake)
   36       rcu_gp_kthread_wake();
   37   }
   38 }

But for this discussion, the only part of ``rcu_prepare_for_idle()``
that really matters is lines 32–34. We will therefore abbreviate this
function as follows:

.. kernel-figure:: rcu_node-lock.svg

The box represents the ``rcu_node`` structure's ``->lock`` critical
section, with the double line on top representing the additional
``smp_mb__after_unlock_lock()``.

Tree RCU Grace Period Memory Ordering Components
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tree RCU's grace-period memory-ordering guarantee is provided by a
number of RCU components:

#. `Callback Registry`_
#. `Grace-Period Initialization`_
#. `Self-Reported Quiescent States`_
#. `Dynamic Tick Interface`_
#. `CPU-Hotplug Interface`_
#. `Forcing Quiescent States`_
#. `Grace-Period Cleanup`_
#. `Callback Invocation`_

Each of the following sections looks at the corresponding component in
detail.

Callback Registry
^^^^^^^^^^^^^^^^^

If RCU's grace-period guarantee is to mean anything at all, any access
that happens before a given invocation of ``call_rcu()`` must also
happen before the corresponding grace period. The implementation of this
portion of RCU's grace period guarantee is shown in the following
figure:

.. kernel-figure:: TreeRCU-callback-registry.svg

Because ``call_rcu()`` normally acts only on CPU-local state, it
provides no ordering guarantees, either for itself or for phase one of
the update (which again will usually be removal of an element from an
RCU-protected data structure). It simply enqueues the ``rcu_head``
structure on a per-CPU list, which cannot become associated with a grace
period until a later call to ``rcu_accelerate_cbs()``, as shown in the
diagram above.

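For reference, a typical asynchronous use of ``call_rcu()`` looks like
the following sketch, in which ``struct foo`` and ``foo_reclaim()`` are
hypothetical names but the APIs are the standard ones::

    struct foo {
            int data;
            struct rcu_head rh;
    };

    static void foo_reclaim(struct rcu_head *rhp)
    {
            struct foo *fp = container_of(rhp, struct foo, rh);

            kfree(fp);  /* Phase two: free after the grace period. */
    }

    /* Phase one elsewhere unlinked fp; now schedule the reclamation. */
    call_rcu(&fp->rh, foo_reclaim);
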
One set of code paths shown on the left invokes ``rcu_accelerate_cbs()``
via ``note_gp_changes()``, either directly from ``call_rcu()`` (if the
current CPU is inundated with queued ``rcu_head`` structures) or more
likely from an ``RCU_SOFTIRQ`` handler. Another code path in the middle
is taken only in kernels built with ``CONFIG_RCU_FAST_NO_HZ=y``, which
invokes ``rcu_accelerate_cbs()`` via ``rcu_prepare_for_idle()``. The
final code path on the right is taken only in kernels built with
``CONFIG_HOTPLUG_CPU=y``, which invokes ``rcu_accelerate_cbs()`` via
``rcu_advance_cbs()``, ``rcu_migrate_callbacks()``,
``rcutree_migrate_callbacks()``, and ``takedown_cpu()``, which in turn
is invoked on a surviving CPU after the outgoing CPU has been completely
offlined.

There are a few other code paths within grace-period processing that
opportunistically invoke ``rcu_accelerate_cbs()``. However, either way,
all of the CPU's recently queued ``rcu_head`` structures are associated
with a future grace-period number under the protection of the CPU's leaf
``rcu_node`` structure's ``->lock``. In all cases, there is full
ordering against any prior critical section for that same ``rcu_node``
structure's ``->lock``, and also full ordering against any of the
current task's or CPU's prior critical sections for any ``rcu_node``
structure's ``->lock``.

The next section will show how this ordering ensures that any accesses
prior to the ``call_rcu()`` (particularly including phase one of the
update) happen before the start of the corresponding grace period.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But what about ``synchronize_rcu()``?                                 |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| The ``synchronize_rcu()`` passes ``call_rcu()`` to ``wait_rcu_gp()``, |
| which invokes it. So either way, it eventually comes down to          |
| ``call_rcu()``.                                                       |
+-----------------------------------------------------------------------+

Grace-Period Initialization
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Grace-period initialization is carried out by the grace-period kernel
thread, which makes several passes over the ``rcu_node`` tree within the
``rcu_gp_init()`` function. This means that showing the full flow of
ordering through the grace-period computation will require duplicating
this tree. If you find this confusing, please note that the state of the
``rcu_node`` tree changes over time, just like Heraclitus's river. However,
to keep the ``rcu_node`` river tractable, the grace-period kernel
thread's traversals are presented in multiple parts, starting in this
section with the various phases of grace-period initialization.

The first ordering-related grace-period initialization action is to
advance the ``rcu_state`` structure's ``->gp_seq`` grace-period-number
counter, as shown below:

.. kernel-figure:: TreeRCU-gp-init-1.svg

The actual increment is carried out using ``smp_store_release()``, which
helps reject false-positive RCU CPU stall detection. Note that only the
root ``rcu_node`` structure is touched.

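Conceptually, this advance can be sketched as follows; the kernel
additionally encodes grace-period phase in the low-order bits of
``->gp_seq``, which this sketch ignores::

    /* Publish the start of a new grace period with release ordering. */
    smp_store_release(&rcu_state.gp_seq, rcu_state.gp_seq + 1);
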
The first pass through the ``rcu_node`` tree updates bitmasks based on
CPUs having come online or gone offline since the start of the previous
grace period. In the common case where the number of online CPUs for
this ``rcu_node`` structure has not transitioned to or from zero, this
pass will scan only the leaf ``rcu_node`` structures. However, if the
number of online CPUs for a given leaf ``rcu_node`` structure has
transitioned from zero, ``rcu_init_new_rnp()`` will be invoked for the
first incoming CPU. Similarly, if the number of online CPUs for a given
leaf ``rcu_node`` structure has transitioned to zero,
``rcu_cleanup_dead_rnp()`` will be invoked for the last outgoing CPU.
The diagram below shows the path of ordering if the leftmost
``rcu_node`` structure onlines its first CPU and if the next
``rcu_node`` structure has no online CPUs (or, alternatively if the
leftmost ``rcu_node`` structure offlines its last CPU and if the next
``rcu_node`` structure has no online CPUs).

.. kernel-figure:: TreeRCU-gp-init-2.svg

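A hedged sketch of this first pass follows, heavily abbreviated from
``rcu_gp_init()``, which also handles priority boosting and concurrent
CPU-hotplug operations::

    /* Apply hotplug changes to each leaf's initial quiescent-state mask. */
    rcu_for_each_leaf_node(rnp) {
            raw_spin_lock_irqsave_rcu_node(rnp, flags);
            oldmask = rnp->qsmaskinit;
            rnp->qsmaskinit = rnp->qsmaskinitnext;
            if (!oldmask != !rnp->qsmaskinit) {
                    if (!oldmask)  /* First CPU onlined for this leaf. */
                            rcu_init_new_rnp(rnp);
                    else           /* Last CPU offlined for this leaf. */
                            rcu_cleanup_dead_rnp(rnp);
            }
            raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
    }
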
The final ``rcu_gp_init()`` pass through the ``rcu_node`` tree traverses
breadth-first, setting each ``rcu_node`` structure's ``->gp_seq`` field
to the newly advanced value from the ``rcu_state`` structure, as shown
in the following diagram.

.. kernel-figure:: TreeRCU-gp-init-3.svg

This change will also cause each CPU's next call to
``__note_gp_changes()`` to notice that a new grace period has started,
as described in the next section. But because the grace-period kthread
started the grace period at the root (with the advancing of the
``rcu_state`` structure's ``->gp_seq`` field) before setting each leaf
``rcu_node`` structure's ``->gp_seq`` field, each CPU's observation of
the start of the grace period will happen after the actual start of the
grace period.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But what about the CPU that started the grace period? Why wouldn't it |
| see the start of the grace period right when it started that grace    |
| period?                                                               |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| In some deep philosophical and overly anthropomorphized sense, yes,   |
| the CPU starting the grace period is immediately aware of having done |
| so. However, if we instead assume that RCU is not self-aware, then    |
| even the CPU starting the grace period does not really become aware   |
| of the start of this grace period until its first call to             |
| ``__note_gp_changes()``. On the other hand, this CPU potentially gets |
| early notification because it invokes ``__note_gp_changes()`` during  |
| its last ``rcu_gp_init()`` pass through its leaf ``rcu_node``         |
| structure.                                                            |
+-----------------------------------------------------------------------+

Self-Reported Quiescent States
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When all entities that might block the grace period have reported
quiescent states (or as described in a later section, had quiescent
states reported on their behalf), the grace period can end. Online
non-idle CPUs report their own quiescent states, as shown in the
following diagram:

.. kernel-figure:: TreeRCU-qs.svg

This is for the last CPU to report a quiescent state, which signals the
end of the grace period. Earlier quiescent states would push up the
``rcu_node`` tree only until they encountered an ``rcu_node`` structure
that is waiting for additional quiescent states. However, ordering is
nevertheless preserved because some later quiescent state will acquire
that ``rcu_node`` structure's ``->lock``.

Any number of events can lead up to a CPU invoking ``note_gp_changes()``
(or alternatively, directly invoking ``__note_gp_changes()``), at which
point that CPU will notice the start of a new grace period while holding
its leaf ``rcu_node`` lock. Therefore, all execution shown in this
diagram happens after the start of the grace period. In addition, this
CPU will consider any RCU read-side critical section that started before
the invocation of ``__note_gp_changes()`` to have started before the
grace period, and thus a critical section that the grace period must
wait on.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But an RCU read-side critical section might have started after the    |
| beginning of the grace period (the advancing of ``->gp_seq`` from     |
| earlier), so why should the grace period wait on such a critical      |
| section?                                                              |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| It is indeed not necessary for the grace period to wait on such a     |
| critical section. However, it is permissible to wait on it. And it is |
| furthermore important to wait on it, as this lazy approach is far     |
| more scalable than a “big bang” all-at-once grace-period start could  |
| possibly be.                                                          |
+-----------------------------------------------------------------------+

If the CPU does a context switch, a quiescent state will be noted by
``rcu_note_context_switch()`` on the left. On the other hand, if the CPU
takes a scheduler-clock interrupt while executing in usermode, a
quiescent state will be noted by ``rcu_sched_clock_irq()`` on the right.
Either way, the passage through a quiescent state will be noted in a
per-CPU variable.

The next time an ``RCU_SOFTIRQ`` handler executes on this CPU (for
example, after the next scheduler-clock interrupt), ``rcu_core()`` will
invoke ``rcu_check_quiescent_state()``, which will notice the recorded
quiescent state, and invoke ``rcu_report_qs_rdp()``. If
``rcu_report_qs_rdp()`` verifies that the quiescent state really does
apply to the current grace period, it invokes ``rcu_report_qs_rnp()``,
which traverses up the ``rcu_node`` tree as shown at the bottom of the
diagram, clearing bits from each ``rcu_node`` structure's ``->qsmask``
field, and propagating up the tree when the result is zero.

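A hedged sketch of this upward propagation, greatly simplified from
``rcu_report_qs_rnp()``, with grace-period-end processing and sanity
checks omitted::

    /* Clear this CPU's bit, moving up only when a subtree goes quiet. */
    for (;;) {
            raw_spin_lock_irqsave_rcu_node(rnp, flags);
            rnp->qsmask &= ~mask;
            if (rnp->qsmask != 0 || rnp->parent == NULL) {
                    /* Still waiting on others, or reached the root. */
                    raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
                    break;
            }
            mask = rnp->grpmask;  /* This rcu_node's bit in its parent. */
            raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
            rnp = rnp->parent;
    }
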
Note that traversal passes upwards out of a given ``rcu_node`` structure
only if the current CPU is reporting the last quiescent state for the
subtree headed by that ``rcu_node`` structure. A key point is that if a
CPU's traversal stops at a given ``rcu_node`` structure, then there will
be a later traversal by another CPU (or perhaps the same one) that
proceeds upwards from that point, and the ``rcu_node`` ``->lock``
guarantees that the first CPU's quiescent state happens before the
remainder of the second CPU's traversal. Applying this line of thought
repeatedly shows that all CPUs' quiescent states happen before the last
CPU traverses through the root ``rcu_node`` structure, the “last CPU”
being the one that clears the last bit in the root ``rcu_node``
structure's ``->qsmask`` field.

Dynamic Tick Interface
^^^^^^^^^^^^^^^^^^^^^^

Due to energy-efficiency considerations, RCU is forbidden from
disturbing idle CPUs. CPUs are therefore required to notify RCU when
entering or leaving idle state, which they do via fully ordered
value-returning atomic operations on a per-CPU variable. The ordering
effects are as shown below:

.. kernel-figure:: TreeRCU-dyntick.svg

The RCU grace-period kernel thread samples the per-CPU idleness variable
while holding the corresponding CPU's leaf ``rcu_node`` structure's
``->lock``. This means that any RCU read-side critical sections that
precede the idle period (the oval near the top of the diagram above)
will happen before the end of the current grace period. Similarly, the
beginning of the current grace period will happen before any RCU
read-side critical sections that follow the idle period (the oval near
the bottom of the diagram above).

Plumbing this into the full grace-period execution is described
`below <Forcing Quiescent States_>`__.

CPU-Hotplug Interface
^^^^^^^^^^^^^^^^^^^^^

RCU is also forbidden from disturbing offline CPUs, which might well be
powered off and removed from the system completely. CPUs are therefore
required to notify RCU of their comings and goings as part of the
corresponding CPU hotplug operations. The ordering effects are shown
below:

.. kernel-figure:: TreeRCU-hotplug.svg

Because CPU hotplug operations are much less frequent than idle
transitions, they are heavier weight, and thus acquire the CPU's leaf
``rcu_node`` structure's ``->lock`` and update this structure's
``->qsmaskinitnext``. The RCU grace-period kernel thread samples this
mask to detect CPUs having gone offline since the beginning of this
grace period.

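A hedged sketch of the CPU-online side of this notification, with the
offline side clearing the bit instead::

    /* The incoming CPU marks itself online for future grace periods. */
    rnp = rdp->mynode;
    raw_spin_lock_irqsave_rcu_node(rnp, flags);
    rnp->qsmaskinitnext |= rdp->grpmask;  /* This CPU's bit in its leaf. */
    raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
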
Plumbing this into the full grace-period execution is described
`below <Forcing Quiescent States_>`__.

Forcing Quiescent States
^^^^^^^^^^^^^^^^^^^^^^^^

As noted above, idle and offline CPUs cannot report their own quiescent
states, and therefore the grace-period kernel thread must do the
reporting on their behalf. This process is called “forcing quiescent
states”; it is repeated every few jiffies, and its ordering effects are
shown below:

.. kernel-figure:: TreeRCU-gp-fqs.svg

Each pass of quiescent state forcing is guaranteed to traverse the leaf
``rcu_node`` structures, and if there are no new quiescent states due to
recently idled and/or offlined CPUs, then only the leaves are traversed.
However, if there is a newly offlined CPU as illustrated on the left or
a newly idled CPU as illustrated on the right, the corresponding
quiescent state will be driven up towards the root. As with
self-reported quiescent states, the upwards driving stops once it
reaches an ``rcu_node`` structure that has quiescent states outstanding
from other CPUs.

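A hedged sketch of one such pass, heavily abbreviated from
``force_qs_rnp()``; the ``cpu_no_longer_blocking_gp()`` check is a
hypothetical stand-in for the dyntick-idle and offline tests::

    /* Scan the leaves, reporting quiescent states for idle/offline CPUs. */
    rcu_for_each_leaf_node(rnp) {
            raw_spin_lock_irqsave_rcu_node(rnp, flags);
            mask = 0;
            for_each_leaf_node_possible_cpu(rnp, cpu)
                    if (cpu_no_longer_blocking_gp(cpu))  /* hypothetical */
                            mask |= leaf_node_cpu_bit(rnp, cpu);
            if (mask)
                    /* Releases rnp->lock, propagating toward the root. */
                    rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
            else
                    raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
    }
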
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| The leftmost drive to root stopped before it reached the root         |
| ``rcu_node`` structure, which means that there are still CPUs         |
| subordinate to that structure on which the current grace period is    |
| waiting. Given that, how is it possible that the rightmost drive to   |
| root ended the grace period?                                          |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Good analysis! It is in fact impossible in the absence of bugs in     |
| RCU. But this diagram is complex enough as it is, so simplicity       |
| overrode accuracy. You can think of it as poetic license, or you can  |
| think of it as misdirection that is resolved in the                   |
| `stitched-together diagram <Putting It All Together_>`__.             |
+-----------------------------------------------------------------------+

Grace-Period Cleanup
^^^^^^^^^^^^^^^^^^^^

Grace-period cleanup first scans the ``rcu_node`` tree breadth-first
advancing all the ``->gp_seq`` fields, then it advances the
``rcu_state`` structure's ``->gp_seq`` field. The ordering effects are
shown below:

.. kernel-figure:: TreeRCU-gp-cleanup.svg

As indicated by the oval at the bottom of the diagram, once grace-period
cleanup is complete, the next grace period can begin.

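A hedged sketch of this cleanup scan, simplified from
``rcu_gp_cleanup()``, which also advances callbacks and checks whether
another grace period is needed::

    new_gp_seq = rcu_state.gp_seq;
    rcu_seq_end(&new_gp_seq);  /* Compute the post-grace-period value. */
    /* Mark the grace period as complete throughout the tree... */
    rcu_for_each_node_breadth_first(rnp) {
            raw_spin_lock_irq_rcu_node(rnp);
            WRITE_ONCE(rnp->gp_seq, new_gp_seq);
            raw_spin_unlock_irq_rcu_node(rnp);
    }
    /* ...and only then officially end it in the rcu_state structure. */
    rcu_seq_end(&rcu_state.gp_seq);
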
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But when precisely does the grace period end?                         |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| There is no useful single point at which the grace period can be said |
| to end. The earliest reasonable candidate is as soon as the last CPU  |
| has reported its quiescent state, but it may be some milliseconds     |
| before RCU becomes aware of this. The latest reasonable candidate is  |
| once the ``rcu_state`` structure's ``->gp_seq`` field has been        |
| updated, but it is quite possible that some CPUs have already         |
| completed phase two of their updates by that time. In short, if you   |
| are going to work with RCU, you need to learn to embrace uncertainty. |
+-----------------------------------------------------------------------+

Callback Invocation
^^^^^^^^^^^^^^^^^^^

Once a given CPU's leaf ``rcu_node`` structure's ``->gp_seq`` field has
been updated, that CPU can begin invoking its RCU callbacks that were
waiting for this grace period to end. These callbacks are identified by
``rcu_advance_cbs()``, which is usually invoked by
``__note_gp_changes()``. As shown in the diagram below, this invocation
can be triggered by the scheduling-clock interrupt
(``rcu_sched_clock_irq()`` on the left) or by idle entry
(``rcu_cleanup_after_idle()`` on the right, but only for kernels built
with ``CONFIG_RCU_FAST_NO_HZ=y``). Either way, ``RCU_SOFTIRQ`` is
raised, which results in ``rcu_do_batch()`` invoking the callbacks,
which in turn allows those callbacks to carry out (either directly or
indirectly via wakeup) the needed phase-two processing for each update.

.. kernel-figure:: TreeRCU-callback-invocation.svg

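A hedged sketch of the heart of ``rcu_do_batch()``, heavily simplified
(batch limits, offloading, and accounting are omitted; ``rdp`` is this
CPU's ``rcu_data`` and ``rcl`` is an on-stack ``struct rcu_cblist``)::

    /* Detach the ready-to-invoke callbacks, then invoke each in turn. */
    rcu_segcblist_extract_done_cbs(&rdp->cblist, &rcl);
    while ((rhp = rcu_cblist_dequeue(&rcl)) != NULL)
            rhp->func(rhp);  /* Phase-two processing for one update. */
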
Please note that callback invocation can also be prompted by any number
of corner-case code paths, for example, when a CPU notes that it has
excessive numbers of callbacks queued. In all cases, the CPU acquires
its leaf ``rcu_node`` structure's ``->lock`` before invoking callbacks,
which preserves the required ordering against the newly completed grace
period.

However, if the callback function communicates to other CPUs, for
example, doing a wakeup, then it is that function's responsibility to
maintain ordering. For example, if the callback function wakes up a task
that runs on some other CPU, proper ordering must be in place in both the
callback function and the task being awakened. To see why this is
important, consider the top half of the `grace-period
cleanup`_ diagram. The callback might be
running on a CPU corresponding to the leftmost leaf ``rcu_node``
structure, and awaken a task that is to run on a CPU corresponding to
the rightmost leaf ``rcu_node`` structure, and the grace-period kernel
thread might not yet have reached the rightmost leaf. In this case, the
grace period's memory ordering might not yet have reached that CPU, so
again the callback function and the awakened task must supply proper
ordering.

Putting It All Together
~~~~~~~~~~~~~~~~~~~~~~~

A stitched-together diagram is here:

.. kernel-figure:: TreeRCU-gp.svg

Legal Statement
~~~~~~~~~~~~~~~

This work represents the view of the author and does not necessarily
represent the view of IBM.

Linux is a registered trademark of Linus Torvalds.

Other company, product, and service names may be trademarks or service
marks of others.