xref: /openbmc/linux/Documentation/locking/rt-mutex-design.rst (revision 0898782247ae533d1f4e47a06bc5d4870931b284)
1*387b1468SMauro Carvalho Chehab==============================
2*387b1468SMauro Carvalho ChehabRT-mutex implementation design
3*387b1468SMauro Carvalho Chehab==============================
4*387b1468SMauro Carvalho Chehab
5*387b1468SMauro Carvalho ChehabCopyright (c) 2006 Steven Rostedt
6*387b1468SMauro Carvalho Chehab
7*387b1468SMauro Carvalho ChehabLicensed under the GNU Free Documentation License, Version 1.2
8*387b1468SMauro Carvalho Chehab
9*387b1468SMauro Carvalho Chehab
10*387b1468SMauro Carvalho ChehabThis document tries to describe the design of the rtmutex.c implementation.
11*387b1468SMauro Carvalho ChehabIt doesn't describe the reasons why rtmutex.c exists. For that please see
12*387b1468SMauro Carvalho ChehabDocumentation/locking/rt-mutex.rst.  Although this document does explain problems
13*387b1468SMauro Carvalho Chehabthat happen without this code, but that is in the concept to understand
14*387b1468SMauro Carvalho Chehabwhat the code actually is doing.
15*387b1468SMauro Carvalho Chehab
16*387b1468SMauro Carvalho ChehabThe goal of this document is to help others understand the priority
17*387b1468SMauro Carvalho Chehabinheritance (PI) algorithm that is used, as well as reasons for the
18*387b1468SMauro Carvalho Chehabdecisions that were made to implement PI in the manner that was done.
19*387b1468SMauro Carvalho Chehab
20*387b1468SMauro Carvalho Chehab
21*387b1468SMauro Carvalho ChehabUnbounded Priority Inversion
22*387b1468SMauro Carvalho Chehab----------------------------
23*387b1468SMauro Carvalho Chehab
24*387b1468SMauro Carvalho ChehabPriority inversion is when a lower priority process executes while a higher
25*387b1468SMauro Carvalho Chehabpriority process wants to run.  This happens for several reasons, and
26*387b1468SMauro Carvalho Chehabmost of the time it can't be helped.  Anytime a high priority process wants
27*387b1468SMauro Carvalho Chehabto use a resource that a lower priority process has (a mutex for example),
28*387b1468SMauro Carvalho Chehabthe high priority process must wait until the lower priority process is done
29*387b1468SMauro Carvalho Chehabwith the resource.  This is a priority inversion.  What we want to prevent
30*387b1468SMauro Carvalho Chehabis something called unbounded priority inversion.  That is when the high
31*387b1468SMauro Carvalho Chehabpriority process is prevented from running by a lower priority process for
32*387b1468SMauro Carvalho Chehaban undetermined amount of time.
33*387b1468SMauro Carvalho Chehab
34*387b1468SMauro Carvalho ChehabThe classic example of unbounded priority inversion is where you have three
35*387b1468SMauro Carvalho Chehabprocesses, let's call them processes A, B, and C, where A is the highest
36*387b1468SMauro Carvalho Chehabpriority process, C is the lowest, and B is in between. A tries to grab a lock
37*387b1468SMauro Carvalho Chehabthat C owns and must wait and lets C run to release the lock. But in the
38*387b1468SMauro Carvalho Chehabmeantime, B executes, and since B is of a higher priority than C, it preempts C,
39*387b1468SMauro Carvalho Chehabbut by doing so, it is in fact preempting A which is a higher priority process.
40*387b1468SMauro Carvalho ChehabNow there's no way of knowing how long A will be sleeping waiting for C
41*387b1468SMauro Carvalho Chehabto release the lock, because for all we know, B is a CPU hog and will
42*387b1468SMauro Carvalho Chehabnever give C a chance to release the lock.  This is called unbounded priority
43*387b1468SMauro Carvalho Chehabinversion.
44*387b1468SMauro Carvalho Chehab
45*387b1468SMauro Carvalho ChehabHere's a little ASCII art to show the problem::
46*387b1468SMauro Carvalho Chehab
47*387b1468SMauro Carvalho Chehab     grab lock L1 (owned by C)
48*387b1468SMauro Carvalho Chehab       |
49*387b1468SMauro Carvalho Chehab  A ---+
50*387b1468SMauro Carvalho Chehab          C preempted by B
51*387b1468SMauro Carvalho Chehab            |
52*387b1468SMauro Carvalho Chehab  C    +----+
53*387b1468SMauro Carvalho Chehab
54*387b1468SMauro Carvalho Chehab  B         +-------->
55*387b1468SMauro Carvalho Chehab                  B now keeps A from running.
56*387b1468SMauro Carvalho Chehab
57*387b1468SMauro Carvalho Chehab
58*387b1468SMauro Carvalho ChehabPriority Inheritance (PI)
59*387b1468SMauro Carvalho Chehab-------------------------
60*387b1468SMauro Carvalho Chehab
61*387b1468SMauro Carvalho ChehabThere are several ways to solve this issue, but other ways are out of scope
62*387b1468SMauro Carvalho Chehabfor this document.  Here we only discuss PI.
63*387b1468SMauro Carvalho Chehab
64*387b1468SMauro Carvalho ChehabPI is where a process inherits the priority of another process if the other
65*387b1468SMauro Carvalho Chehabprocess blocks on a lock owned by the current process.  To make this easier
66*387b1468SMauro Carvalho Chehabto understand, let's use the previous example, with processes A, B, and C again.
67*387b1468SMauro Carvalho Chehab
68*387b1468SMauro Carvalho ChehabThis time, when A blocks on the lock owned by C, C would inherit the priority
69*387b1468SMauro Carvalho Chehabof A.  So now if B becomes runnable, it would not preempt C, since C now has
70*387b1468SMauro Carvalho Chehabthe high priority of A.  As soon as C releases the lock, it loses its
71*387b1468SMauro Carvalho Chehabinherited priority, and A then can continue with the resource that C had.
72*387b1468SMauro Carvalho Chehab
73*387b1468SMauro Carvalho ChehabTerminology
74*387b1468SMauro Carvalho Chehab-----------
75*387b1468SMauro Carvalho Chehab
76*387b1468SMauro Carvalho ChehabHere I explain some terminology that is used in this document to help describe
77*387b1468SMauro Carvalho Chehabthe design that is used to implement PI.
78*387b1468SMauro Carvalho Chehab
79*387b1468SMauro Carvalho ChehabPI chain
80*387b1468SMauro Carvalho Chehab         - The PI chain is an ordered series of locks and processes that cause
81*387b1468SMauro Carvalho Chehab           processes to inherit priorities from a previous process that is
82*387b1468SMauro Carvalho Chehab           blocked on one of its locks.  This is described in more detail
83*387b1468SMauro Carvalho Chehab           later in this document.
84*387b1468SMauro Carvalho Chehab
85*387b1468SMauro Carvalho Chehabmutex
86*387b1468SMauro Carvalho Chehab         - In this document, to differentiate from locks that implement
87*387b1468SMauro Carvalho Chehab           PI and spin locks that are used in the PI code, from now on
88*387b1468SMauro Carvalho Chehab           the PI locks will be called a mutex.
89*387b1468SMauro Carvalho Chehab
90*387b1468SMauro Carvalho Chehablock
91*387b1468SMauro Carvalho Chehab         - In this document from now on, I will use the term lock when
92*387b1468SMauro Carvalho Chehab           referring to spin locks that are used to protect parts of the PI
93*387b1468SMauro Carvalho Chehab           algorithm.  These locks disable preemption for UP (when
94*387b1468SMauro Carvalho Chehab           CONFIG_PREEMPT is enabled) and on SMP prevents multiple CPUs from
95*387b1468SMauro Carvalho Chehab           entering critical sections simultaneously.
96*387b1468SMauro Carvalho Chehab
97*387b1468SMauro Carvalho Chehabspin lock
98*387b1468SMauro Carvalho Chehab         - Same as lock above.
99*387b1468SMauro Carvalho Chehab
100*387b1468SMauro Carvalho Chehabwaiter
101*387b1468SMauro Carvalho Chehab         - A waiter is a struct that is stored on the stack of a blocked
102*387b1468SMauro Carvalho Chehab           process.  Since the scope of the waiter is within the code for
103*387b1468SMauro Carvalho Chehab           a process being blocked on the mutex, it is fine to allocate
104*387b1468SMauro Carvalho Chehab           the waiter on the process's stack (local variable).  This
105*387b1468SMauro Carvalho Chehab           structure holds a pointer to the task, as well as the mutex that
106*387b1468SMauro Carvalho Chehab           the task is blocked on.  It also has rbtree node structures to
107*387b1468SMauro Carvalho Chehab           place the task in the waiters rbtree of a mutex as well as the
108*387b1468SMauro Carvalho Chehab           pi_waiters rbtree of a mutex owner task (described below).
109*387b1468SMauro Carvalho Chehab
110*387b1468SMauro Carvalho Chehab           waiter is sometimes used in reference to the task that is waiting
111*387b1468SMauro Carvalho Chehab           on a mutex. This is the same as waiter->task.
112*387b1468SMauro Carvalho Chehab
113*387b1468SMauro Carvalho Chehabwaiters
114*387b1468SMauro Carvalho Chehab         - A list of processes that are blocked on a mutex.
115*387b1468SMauro Carvalho Chehab
116*387b1468SMauro Carvalho Chehabtop waiter
117*387b1468SMauro Carvalho Chehab         - The highest priority process waiting on a specific mutex.
118*387b1468SMauro Carvalho Chehab
119*387b1468SMauro Carvalho Chehabtop pi waiter
120*387b1468SMauro Carvalho Chehab              - The highest priority process waiting on one of the mutexes
121*387b1468SMauro Carvalho Chehab                that a specific process owns.
122*387b1468SMauro Carvalho Chehab
123*387b1468SMauro Carvalho ChehabNote:
124*387b1468SMauro Carvalho Chehab       task and process are used interchangeably in this document, mostly to
125*387b1468SMauro Carvalho Chehab       differentiate between two processes that are being described together.
126*387b1468SMauro Carvalho Chehab
127*387b1468SMauro Carvalho Chehab
128*387b1468SMauro Carvalho ChehabPI chain
129*387b1468SMauro Carvalho Chehab--------
130*387b1468SMauro Carvalho Chehab
131*387b1468SMauro Carvalho ChehabThe PI chain is a list of processes and mutexes that may cause priority
132*387b1468SMauro Carvalho Chehabinheritance to take place.  Multiple chains may converge, but a chain
133*387b1468SMauro Carvalho Chehabwould never diverge, since a process can't be blocked on more than one
134*387b1468SMauro Carvalho Chehabmutex at a time.
135*387b1468SMauro Carvalho Chehab
136*387b1468SMauro Carvalho ChehabExample::
137*387b1468SMauro Carvalho Chehab
138*387b1468SMauro Carvalho Chehab   Process:  A, B, C, D, E
139*387b1468SMauro Carvalho Chehab   Mutexes:  L1, L2, L3, L4
140*387b1468SMauro Carvalho Chehab
141*387b1468SMauro Carvalho Chehab   A owns: L1
142*387b1468SMauro Carvalho Chehab           B blocked on L1
143*387b1468SMauro Carvalho Chehab           B owns L2
144*387b1468SMauro Carvalho Chehab                  C blocked on L2
145*387b1468SMauro Carvalho Chehab                  C owns L3
146*387b1468SMauro Carvalho Chehab                         D blocked on L3
147*387b1468SMauro Carvalho Chehab                         D owns L4
148*387b1468SMauro Carvalho Chehab                                E blocked on L4
149*387b1468SMauro Carvalho Chehab
150*387b1468SMauro Carvalho ChehabThe chain would be::
151*387b1468SMauro Carvalho Chehab
152*387b1468SMauro Carvalho Chehab   E->L4->D->L3->C->L2->B->L1->A
153*387b1468SMauro Carvalho Chehab
154*387b1468SMauro Carvalho ChehabTo show where two chains merge, we could add another process F and
155*387b1468SMauro Carvalho Chehabanother mutex L5 where B owns L5 and F is blocked on mutex L5.
156*387b1468SMauro Carvalho Chehab
157*387b1468SMauro Carvalho ChehabThe chain for F would be::
158*387b1468SMauro Carvalho Chehab
159*387b1468SMauro Carvalho Chehab   F->L5->B->L1->A
160*387b1468SMauro Carvalho Chehab
161*387b1468SMauro Carvalho ChehabSince a process may own more than one mutex, but never be blocked on more than
162*387b1468SMauro Carvalho Chehabone, the chains merge.
163*387b1468SMauro Carvalho Chehab
164*387b1468SMauro Carvalho ChehabHere we show both chains::
165*387b1468SMauro Carvalho Chehab
166*387b1468SMauro Carvalho Chehab   E->L4->D->L3->C->L2-+
167*387b1468SMauro Carvalho Chehab                       |
168*387b1468SMauro Carvalho Chehab                       +->B->L1->A
169*387b1468SMauro Carvalho Chehab                       |
170*387b1468SMauro Carvalho Chehab                 F->L5-+
171*387b1468SMauro Carvalho Chehab
172*387b1468SMauro Carvalho ChehabFor PI to work, the processes at the right end of these chains (or we may
173*387b1468SMauro Carvalho Chehabalso call it the Top of the chain) must be equal to or higher in priority
174*387b1468SMauro Carvalho Chehabthan the processes to the left or below in the chain.
175*387b1468SMauro Carvalho Chehab
176*387b1468SMauro Carvalho ChehabAlso since a mutex may have more than one process blocked on it, we can
177*387b1468SMauro Carvalho Chehabhave multiple chains merge at mutexes.  If we add another process G that is
178*387b1468SMauro Carvalho Chehabblocked on mutex L2::
179*387b1468SMauro Carvalho Chehab
180*387b1468SMauro Carvalho Chehab  G->L2->B->L1->A
181*387b1468SMauro Carvalho Chehab
182*387b1468SMauro Carvalho ChehabAnd once again, to show how this can grow I will show the merging chains
183*387b1468SMauro Carvalho Chehabagain::
184*387b1468SMauro Carvalho Chehab
185*387b1468SMauro Carvalho Chehab   E->L4->D->L3->C-+
186*387b1468SMauro Carvalho Chehab                   +->L2-+
187*387b1468SMauro Carvalho Chehab                   |     |
188*387b1468SMauro Carvalho Chehab                 G-+     +->B->L1->A
189*387b1468SMauro Carvalho Chehab                         |
190*387b1468SMauro Carvalho Chehab                   F->L5-+
191*387b1468SMauro Carvalho Chehab
192*387b1468SMauro Carvalho ChehabIf process G has the highest priority in the chain, then all the tasks up
193*387b1468SMauro Carvalho Chehabthe chain (A and B in this example), must have their priorities increased
194*387b1468SMauro Carvalho Chehabto that of G.
195*387b1468SMauro Carvalho Chehab
196*387b1468SMauro Carvalho ChehabMutex Waiters Tree
197*387b1468SMauro Carvalho Chehab------------------
198*387b1468SMauro Carvalho Chehab
199*387b1468SMauro Carvalho ChehabEvery mutex keeps track of all the waiters that are blocked on itself. The
200*387b1468SMauro Carvalho Chehabmutex has a rbtree to store these waiters by priority.  This tree is protected
201*387b1468SMauro Carvalho Chehabby a spin lock that is located in the struct of the mutex. This lock is called
202*387b1468SMauro Carvalho Chehabwait_lock.
203*387b1468SMauro Carvalho Chehab
204*387b1468SMauro Carvalho Chehab
205*387b1468SMauro Carvalho ChehabTask PI Tree
206*387b1468SMauro Carvalho Chehab------------
207*387b1468SMauro Carvalho Chehab
208*387b1468SMauro Carvalho ChehabTo keep track of the PI chains, each process has its own PI rbtree.  This is
209*387b1468SMauro Carvalho Chehaba tree of all top waiters of the mutexes that are owned by the process.
210*387b1468SMauro Carvalho ChehabNote that this tree only holds the top waiters and not all waiters that are
211*387b1468SMauro Carvalho Chehabblocked on mutexes owned by the process.
212*387b1468SMauro Carvalho Chehab
213*387b1468SMauro Carvalho ChehabThe top of the task's PI tree is always the highest priority task that
214*387b1468SMauro Carvalho Chehabis waiting on a mutex that is owned by the task.  So if the task has
215*387b1468SMauro Carvalho Chehabinherited a priority, it will always be the priority of the task that is
216*387b1468SMauro Carvalho Chehabat the top of this tree.
217*387b1468SMauro Carvalho Chehab
218*387b1468SMauro Carvalho ChehabThis tree is stored in the task structure of a process as a rbtree called
219*387b1468SMauro Carvalho Chehabpi_waiters.  It is protected by a spin lock also in the task structure,
220*387b1468SMauro Carvalho Chehabcalled pi_lock.  This lock may also be taken in interrupt context, so when
221*387b1468SMauro Carvalho Chehablocking the pi_lock, interrupts must be disabled.
222*387b1468SMauro Carvalho Chehab
223*387b1468SMauro Carvalho Chehab
224*387b1468SMauro Carvalho ChehabDepth of the PI Chain
225*387b1468SMauro Carvalho Chehab---------------------
226*387b1468SMauro Carvalho Chehab
227*387b1468SMauro Carvalho ChehabThe maximum depth of the PI chain is not dynamic, and could actually be
228*387b1468SMauro Carvalho Chehabdefined.  But is very complex to figure it out, since it depends on all
229*387b1468SMauro Carvalho Chehabthe nesting of mutexes.  Let's look at the example where we have 3 mutexes,
230*387b1468SMauro Carvalho ChehabL1, L2, and L3, and four separate functions func1, func2, func3 and func4.
231*387b1468SMauro Carvalho ChehabThe following shows a locking order of L1->L2->L3, but may not actually
232*387b1468SMauro Carvalho Chehabbe directly nested that way::
233*387b1468SMauro Carvalho Chehab
234*387b1468SMauro Carvalho Chehab  void func1(void)
235*387b1468SMauro Carvalho Chehab  {
236*387b1468SMauro Carvalho Chehab	mutex_lock(L1);
237*387b1468SMauro Carvalho Chehab
238*387b1468SMauro Carvalho Chehab	/* do anything */
239*387b1468SMauro Carvalho Chehab
240*387b1468SMauro Carvalho Chehab	mutex_unlock(L1);
241*387b1468SMauro Carvalho Chehab  }
242*387b1468SMauro Carvalho Chehab
243*387b1468SMauro Carvalho Chehab  void func2(void)
244*387b1468SMauro Carvalho Chehab  {
245*387b1468SMauro Carvalho Chehab	mutex_lock(L1);
246*387b1468SMauro Carvalho Chehab	mutex_lock(L2);
247*387b1468SMauro Carvalho Chehab
248*387b1468SMauro Carvalho Chehab	/* do something */
249*387b1468SMauro Carvalho Chehab
250*387b1468SMauro Carvalho Chehab	mutex_unlock(L2);
251*387b1468SMauro Carvalho Chehab	mutex_unlock(L1);
252*387b1468SMauro Carvalho Chehab  }
253*387b1468SMauro Carvalho Chehab
254*387b1468SMauro Carvalho Chehab  void func3(void)
255*387b1468SMauro Carvalho Chehab  {
256*387b1468SMauro Carvalho Chehab	mutex_lock(L2);
257*387b1468SMauro Carvalho Chehab	mutex_lock(L3);
258*387b1468SMauro Carvalho Chehab
259*387b1468SMauro Carvalho Chehab	/* do something else */
260*387b1468SMauro Carvalho Chehab
261*387b1468SMauro Carvalho Chehab	mutex_unlock(L3);
262*387b1468SMauro Carvalho Chehab	mutex_unlock(L2);
263*387b1468SMauro Carvalho Chehab  }
264*387b1468SMauro Carvalho Chehab
265*387b1468SMauro Carvalho Chehab  void func4(void)
266*387b1468SMauro Carvalho Chehab  {
267*387b1468SMauro Carvalho Chehab	mutex_lock(L3);
268*387b1468SMauro Carvalho Chehab
269*387b1468SMauro Carvalho Chehab	/* do something again */
270*387b1468SMauro Carvalho Chehab
271*387b1468SMauro Carvalho Chehab	mutex_unlock(L3);
272*387b1468SMauro Carvalho Chehab  }
273*387b1468SMauro Carvalho Chehab
274*387b1468SMauro Carvalho ChehabNow we add 4 processes that run each of these functions separately.
275*387b1468SMauro Carvalho ChehabProcesses A, B, C, and D which run functions func1, func2, func3 and func4
276*387b1468SMauro Carvalho Chehabrespectively, and such that D runs first and A last.  With D being preempted
277*387b1468SMauro Carvalho Chehabin func4 in the "do something again" area, we have a locking that follows::
278*387b1468SMauro Carvalho Chehab
279*387b1468SMauro Carvalho Chehab  D owns L3
280*387b1468SMauro Carvalho Chehab         C blocked on L3
281*387b1468SMauro Carvalho Chehab         C owns L2
282*387b1468SMauro Carvalho Chehab                B blocked on L2
283*387b1468SMauro Carvalho Chehab                B owns L1
284*387b1468SMauro Carvalho Chehab                       A blocked on L1
285*387b1468SMauro Carvalho Chehab
286*387b1468SMauro Carvalho Chehab  And thus we have the chain A->L1->B->L2->C->L3->D.
287*387b1468SMauro Carvalho Chehab
288*387b1468SMauro Carvalho ChehabThis gives us a PI depth of 4 (four processes), but looking at any of the
289*387b1468SMauro Carvalho Chehabfunctions individually, it seems as though they only have at most a locking
290*387b1468SMauro Carvalho Chehabdepth of two.  So, although the locking depth is defined at compile time,
291*387b1468SMauro Carvalho Chehabit still is very difficult to find the possibilities of that depth.
292*387b1468SMauro Carvalho Chehab
293*387b1468SMauro Carvalho ChehabNow since mutexes can be defined by user-land applications, we don't want a DOS
294*387b1468SMauro Carvalho Chehabtype of application that nests large amounts of mutexes to create a large
295*387b1468SMauro Carvalho ChehabPI chain, and have the code holding spin locks while looking at a large
296*387b1468SMauro Carvalho Chehabamount of data.  So to prevent this, the implementation not only implements
297*387b1468SMauro Carvalho Chehaba maximum lock depth, but also only holds at most two different locks at a
298*387b1468SMauro Carvalho Chehabtime, as it walks the PI chain.  More about this below.
299*387b1468SMauro Carvalho Chehab
300*387b1468SMauro Carvalho Chehab
301*387b1468SMauro Carvalho ChehabMutex owner and flags
302*387b1468SMauro Carvalho Chehab---------------------
303*387b1468SMauro Carvalho Chehab
304*387b1468SMauro Carvalho ChehabThe mutex structure contains a pointer to the owner of the mutex.  If the
305*387b1468SMauro Carvalho Chehabmutex is not owned, this owner is set to NULL.  Since all architectures
306*387b1468SMauro Carvalho Chehabhave the task structure on at least a two byte alignment (and if this is
307*387b1468SMauro Carvalho Chehabnot true, the rtmutex.c code will be broken!), this allows for the least
308*387b1468SMauro Carvalho Chehabsignificant bit to be used as a flag.  Bit 0 is used as the "Has Waiters"
309*387b1468SMauro Carvalho Chehabflag. It's set whenever there are waiters on a mutex.
310*387b1468SMauro Carvalho Chehab
311*387b1468SMauro Carvalho ChehabSee Documentation/locking/rt-mutex.rst for further details.
312*387b1468SMauro Carvalho Chehab
313*387b1468SMauro Carvalho Chehabcmpxchg Tricks
314*387b1468SMauro Carvalho Chehab--------------
315*387b1468SMauro Carvalho Chehab
316*387b1468SMauro Carvalho ChehabSome architectures implement an atomic cmpxchg (Compare and Exchange).  This
317*387b1468SMauro Carvalho Chehabis used (when applicable) to keep the fast path of grabbing and releasing
318*387b1468SMauro Carvalho Chehabmutexes short.
319*387b1468SMauro Carvalho Chehab
320*387b1468SMauro Carvalho Chehabcmpxchg is basically the following function performed atomically::
321*387b1468SMauro Carvalho Chehab
322*387b1468SMauro Carvalho Chehab  unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C)
323*387b1468SMauro Carvalho Chehab  {
324*387b1468SMauro Carvalho Chehab	unsigned long T = *A;
325*387b1468SMauro Carvalho Chehab	if (*A == *B) {
326*387b1468SMauro Carvalho Chehab		*A = *C;
327*387b1468SMauro Carvalho Chehab	}
328*387b1468SMauro Carvalho Chehab	return T;
329*387b1468SMauro Carvalho Chehab  }
330*387b1468SMauro Carvalho Chehab  #define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c)
331*387b1468SMauro Carvalho Chehab
332*387b1468SMauro Carvalho ChehabThis is really nice to have, since it allows you to only update a variable
333*387b1468SMauro Carvalho Chehabif the variable is what you expect it to be.  You know if it succeeded if
334*387b1468SMauro Carvalho Chehabthe return value (the old value of A) is equal to B.
335*387b1468SMauro Carvalho Chehab
336*387b1468SMauro Carvalho ChehabThe macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If
337*387b1468SMauro Carvalho Chehabthe architecture does not support CMPXCHG, then this macro is simply set
338*387b1468SMauro Carvalho Chehabto fail every time.  But if CMPXCHG is supported, then this will
339*387b1468SMauro Carvalho Chehabhelp out extremely to keep the fast path short.
340*387b1468SMauro Carvalho Chehab
341*387b1468SMauro Carvalho ChehabThe use of rt_mutex_cmpxchg with the flags in the owner field help optimize
342*387b1468SMauro Carvalho Chehabthe system for architectures that support it.  This will also be explained
343*387b1468SMauro Carvalho Chehablater in this document.
344*387b1468SMauro Carvalho Chehab
345*387b1468SMauro Carvalho Chehab
346*387b1468SMauro Carvalho ChehabPriority adjustments
347*387b1468SMauro Carvalho Chehab--------------------
348*387b1468SMauro Carvalho Chehab
349*387b1468SMauro Carvalho ChehabThe implementation of the PI code in rtmutex.c has several places that a
350*387b1468SMauro Carvalho Chehabprocess must adjust its priority.  With the help of the pi_waiters of a
351*387b1468SMauro Carvalho Chehabprocess this is rather easy to know what needs to be adjusted.
352*387b1468SMauro Carvalho Chehab
353*387b1468SMauro Carvalho ChehabThe functions implementing the task adjustments are rt_mutex_adjust_prio
354*387b1468SMauro Carvalho Chehaband rt_mutex_setprio. rt_mutex_setprio is only used in rt_mutex_adjust_prio.
355*387b1468SMauro Carvalho Chehab
356*387b1468SMauro Carvalho Chehabrt_mutex_adjust_prio examines the priority of the task, and the highest
357*387b1468SMauro Carvalho Chehabpriority process that is waiting any of mutexes owned by the task. Since
358*387b1468SMauro Carvalho Chehabthe pi_waiters of a task holds an order by priority of all the top waiters
359*387b1468SMauro Carvalho Chehabof all the mutexes that the task owns, we simply need to compare the top
360*387b1468SMauro Carvalho Chehabpi waiter to its own normal/deadline priority and take the higher one.
361*387b1468SMauro Carvalho ChehabThen rt_mutex_setprio is called to adjust the priority of the task to the
362*387b1468SMauro Carvalho Chehabnew priority. Note that rt_mutex_setprio is defined in kernel/sched/core.c
363*387b1468SMauro Carvalho Chehabto implement the actual change in priority.
364*387b1468SMauro Carvalho Chehab
365*387b1468SMauro Carvalho ChehabNote:
366*387b1468SMauro Carvalho Chehab	For the "prio" field in task_struct, the lower the number, the
367*387b1468SMauro Carvalho Chehab	higher the priority. A "prio" of 5 is of higher priority than a
368*387b1468SMauro Carvalho Chehab	"prio" of 10.
369*387b1468SMauro Carvalho Chehab
370*387b1468SMauro Carvalho ChehabIt is interesting to note that rt_mutex_adjust_prio can either increase
371*387b1468SMauro Carvalho Chehabor decrease the priority of the task.  In the case that a higher priority
372*387b1468SMauro Carvalho Chehabprocess has just blocked on a mutex owned by the task, rt_mutex_adjust_prio
373*387b1468SMauro Carvalho Chehabwould increase/boost the task's priority.  But if a higher priority task
374*387b1468SMauro Carvalho Chehabwere for some reason to leave the mutex (timeout or signal), this same function
375*387b1468SMauro Carvalho Chehabwould decrease/unboost the priority of the task.  That is because the pi_waiters
376*387b1468SMauro Carvalho Chehabalways contains the highest priority task that is waiting on a mutex owned
377*387b1468SMauro Carvalho Chehabby the task, so we only need to compare the priority of that top pi waiter
378*387b1468SMauro Carvalho Chehabto the normal priority of the given task.
379*387b1468SMauro Carvalho Chehab
380*387b1468SMauro Carvalho Chehab
381*387b1468SMauro Carvalho ChehabHigh level overview of the PI chain walk
382*387b1468SMauro Carvalho Chehab----------------------------------------
383*387b1468SMauro Carvalho Chehab
384*387b1468SMauro Carvalho ChehabThe PI chain walk is implemented by the function rt_mutex_adjust_prio_chain.
385*387b1468SMauro Carvalho Chehab
386*387b1468SMauro Carvalho ChehabThe implementation has gone through several iterations, and has ended up
387*387b1468SMauro Carvalho Chehabwith what we believe is the best.  It walks the PI chain by only grabbing
388*387b1468SMauro Carvalho Chehabat most two locks at a time, and is very efficient.
389*387b1468SMauro Carvalho Chehab
390*387b1468SMauro Carvalho ChehabThe rt_mutex_adjust_prio_chain can be used either to boost or lower process
391*387b1468SMauro Carvalho Chehabpriorities.
392*387b1468SMauro Carvalho Chehab
393*387b1468SMauro Carvalho Chehabrt_mutex_adjust_prio_chain is called with a task to be checked for PI
394*387b1468SMauro Carvalho Chehab(de)boosting (the owner of a mutex that a process is blocking on), a flag to
395*387b1468SMauro Carvalho Chehabcheck for deadlocking, the mutex that the task owns, a pointer to a waiter
396*387b1468SMauro Carvalho Chehabthat is the process's waiter struct that is blocked on the mutex (although this
397*387b1468SMauro Carvalho Chehabparameter may be NULL for deboosting), a pointer to the mutex on which the task
398*387b1468SMauro Carvalho Chehabis blocked, and a top_task as the top waiter of the mutex.
399*387b1468SMauro Carvalho Chehab
400*387b1468SMauro Carvalho ChehabFor this explanation, I will not mention deadlock detection. This explanation
401*387b1468SMauro Carvalho Chehabwill try to stay at a high level.
402*387b1468SMauro Carvalho Chehab
403*387b1468SMauro Carvalho ChehabWhen this function is called, there are no locks held.  That also means
404*387b1468SMauro Carvalho Chehabthat the state of the owner and lock can change when entered into this function.
405*387b1468SMauro Carvalho Chehab
406*387b1468SMauro Carvalho ChehabBefore this function is called, the task has already had rt_mutex_adjust_prio
407*387b1468SMauro Carvalho Chehabperformed on it.  This means that the task is set to the priority that it
408*387b1468SMauro Carvalho Chehabshould be at, but the rbtree nodes of the task's waiter have not been updated
409*387b1468SMauro Carvalho Chehabwith the new priorities, and this task may not be in the proper locations
410*387b1468SMauro Carvalho Chehabin the pi_waiters and waiters trees that the task is blocked on. This function
411*387b1468SMauro Carvalho Chehabsolves all that.
412*387b1468SMauro Carvalho Chehab
413*387b1468SMauro Carvalho ChehabThe main operation of this function is summarized by Thomas Gleixner in
414*387b1468SMauro Carvalho Chehabrtmutex.c. See the 'Chain walk basics and protection scope' comment for further
415*387b1468SMauro Carvalho Chehabdetails.
416*387b1468SMauro Carvalho Chehab
417*387b1468SMauro Carvalho ChehabTaking of a mutex (The walk through)
418*387b1468SMauro Carvalho Chehab------------------------------------
419*387b1468SMauro Carvalho Chehab
420*387b1468SMauro Carvalho ChehabOK, now let's take a look at the detailed walk through of what happens when
421*387b1468SMauro Carvalho Chehabtaking a mutex.
422*387b1468SMauro Carvalho Chehab
423*387b1468SMauro Carvalho ChehabThe first thing that is tried is the fast taking of the mutex.  This is
424*387b1468SMauro Carvalho Chehabdone when we have CMPXCHG enabled (otherwise the fast taking automatically
425*387b1468SMauro Carvalho Chehabfails).  Only when the owner field of the mutex is NULL can the lock be
426*387b1468SMauro Carvalho Chehabtaken with the CMPXCHG and nothing else needs to be done.
427*387b1468SMauro Carvalho Chehab
428*387b1468SMauro Carvalho ChehabIf there is contention on the lock, we go about the slow path
429*387b1468SMauro Carvalho Chehab(rt_mutex_slowlock).
430*387b1468SMauro Carvalho Chehab
431*387b1468SMauro Carvalho ChehabThe slow path function is where the task's waiter structure is created on
432*387b1468SMauro Carvalho Chehabthe stack.  This is because the waiter structure is only needed for the
433*387b1468SMauro Carvalho Chehabscope of this function.  The waiter structure holds the nodes to store
434*387b1468SMauro Carvalho Chehabthe task on the waiters tree of the mutex, and if need be, the pi_waiters
435*387b1468SMauro Carvalho Chehabtree of the owner.
436*387b1468SMauro Carvalho Chehab
437*387b1468SMauro Carvalho ChehabThe wait_lock of the mutex is taken since the slow path of unlocking the
438*387b1468SMauro Carvalho Chehabmutex also takes this lock.
439*387b1468SMauro Carvalho Chehab
440*387b1468SMauro Carvalho ChehabWe then call try_to_take_rt_mutex.  This is where the architecture that
441*387b1468SMauro Carvalho Chehabdoes not implement CMPXCHG would always grab the lock (if there's no
442*387b1468SMauro Carvalho Chehabcontention).
443*387b1468SMauro Carvalho Chehab
444*387b1468SMauro Carvalho Chehabtry_to_take_rt_mutex is used every time the task tries to grab a mutex in the
445*387b1468SMauro Carvalho Chehabslow path.  The first thing that is done here is an atomic setting of
446*387b1468SMauro Carvalho Chehabthe "Has Waiters" flag of the mutex's owner field. By setting this flag
447*387b1468SMauro Carvalho Chehabnow, the current owner of the mutex being contended for can't release the mutex
448*387b1468SMauro Carvalho Chehabwithout going into the slow unlock path, and it would then need to grab the
449*387b1468SMauro Carvalho Chehabwait_lock, which this code currently holds. So setting the "Has Waiters" flag
450*387b1468SMauro Carvalho Chehabforces the current owner to synchronize with this code.
451*387b1468SMauro Carvalho Chehab
452*387b1468SMauro Carvalho ChehabThe lock is taken if the following are true:
453*387b1468SMauro Carvalho Chehab
454*387b1468SMauro Carvalho Chehab   1) The lock has no owner
455*387b1468SMauro Carvalho Chehab   2) The current task is the highest priority against all other
456*387b1468SMauro Carvalho Chehab      waiters of the lock
457*387b1468SMauro Carvalho Chehab
458*387b1468SMauro Carvalho ChehabIf the task succeeds to acquire the lock, then the task is set as the
459*387b1468SMauro Carvalho Chehabowner of the lock, and if the lock still has waiters, the top_waiter
460*387b1468SMauro Carvalho Chehab(highest priority task waiting on the lock) is added to this task's
461*387b1468SMauro Carvalho Chehabpi_waiters tree.
462*387b1468SMauro Carvalho Chehab
463*387b1468SMauro Carvalho ChehabIf the lock is not taken by try_to_take_rt_mutex(), then the
464*387b1468SMauro Carvalho Chehabtask_blocks_on_rt_mutex() function is called. This will add the task to
465*387b1468SMauro Carvalho Chehabthe lock's waiter tree and propagate the pi chain of the lock as well
466*387b1468SMauro Carvalho Chehabas the lock's owner's pi_waiters tree. This is described in the next
467*387b1468SMauro Carvalho Chehabsection.
468*387b1468SMauro Carvalho Chehab
469*387b1468SMauro Carvalho ChehabTask blocks on mutex
470*387b1468SMauro Carvalho Chehab--------------------
471*387b1468SMauro Carvalho Chehab
472*387b1468SMauro Carvalho ChehabThe accounting of a mutex and process is done with the waiter structure of
473*387b1468SMauro Carvalho Chehabthe process.  The "task" field is set to the process, and the "lock" field
474*387b1468SMauro Carvalho Chehabto the mutex.  The rbtree node of waiter are initialized to the processes
475*387b1468SMauro Carvalho Chehabcurrent priority.
476*387b1468SMauro Carvalho Chehab
477*387b1468SMauro Carvalho ChehabSince the wait_lock was taken at the entry of the slow lock, we can safely
478*387b1468SMauro Carvalho Chehabadd the waiter to the task waiter tree.  If the current process is the
479*387b1468SMauro Carvalho Chehabhighest priority process currently waiting on this mutex, then we remove the
480*387b1468SMauro Carvalho Chehabprevious top waiter process (if it exists) from the pi_waiters of the owner,
481*387b1468SMauro Carvalho Chehaband add the current process to that tree.  Since the pi_waiter of the owner
482*387b1468SMauro Carvalho Chehabhas changed, we call rt_mutex_adjust_prio on the owner to see if the owner
483*387b1468SMauro Carvalho Chehabshould adjust its priority accordingly.
484*387b1468SMauro Carvalho Chehab
485*387b1468SMauro Carvalho ChehabIf the owner is also blocked on a lock, and had its pi_waiters changed
486*387b1468SMauro Carvalho Chehab(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead
487*387b1468SMauro Carvalho Chehaband run rt_mutex_adjust_prio_chain on the owner, as described earlier.
488*387b1468SMauro Carvalho Chehab
489*387b1468SMauro Carvalho ChehabNow all locks are released, and if the current process is still blocked on a
490*387b1468SMauro Carvalho Chehabmutex (waiter "task" field is not NULL), then we go to sleep (call schedule).
491*387b1468SMauro Carvalho Chehab
492*387b1468SMauro Carvalho ChehabWaking up in the loop
493*387b1468SMauro Carvalho Chehab---------------------
494*387b1468SMauro Carvalho Chehab
495*387b1468SMauro Carvalho ChehabThe task can then wake up for a couple of reasons:
496*387b1468SMauro Carvalho Chehab  1) The previous lock owner released the lock, and the task now is top_waiter
497*387b1468SMauro Carvalho Chehab  2) we received a signal or timeout
498*387b1468SMauro Carvalho Chehab
499*387b1468SMauro Carvalho ChehabIn both cases, the task will try again to acquire the lock. If it
500*387b1468SMauro Carvalho Chehabdoes, then it will take itself off the waiters tree and set itself back
501*387b1468SMauro Carvalho Chehabto the TASK_RUNNING state.
502*387b1468SMauro Carvalho Chehab
503*387b1468SMauro Carvalho ChehabIn first case, if the lock was acquired by another task before this task
504*387b1468SMauro Carvalho Chehabcould get the lock, then it will go back to sleep and wait to be woken again.
505*387b1468SMauro Carvalho Chehab
506*387b1468SMauro Carvalho ChehabThe second case is only applicable for tasks that are grabbing a mutex
507*387b1468SMauro Carvalho Chehabthat can wake up before getting the lock, either due to a signal or
508*387b1468SMauro Carvalho Chehaba timeout (i.e. rt_mutex_timed_futex_lock()). When woken, it will try to
509*387b1468SMauro Carvalho Chehabtake the lock again, if it succeeds, then the task will return with the
510*387b1468SMauro Carvalho Chehablock held, otherwise it will return with -EINTR if the task was woken
511*387b1468SMauro Carvalho Chehabby a signal, or -ETIMEDOUT if it timed out.
512*387b1468SMauro Carvalho Chehab
513*387b1468SMauro Carvalho Chehab
514*387b1468SMauro Carvalho ChehabUnlocking the Mutex
515*387b1468SMauro Carvalho Chehab-------------------
516*387b1468SMauro Carvalho Chehab
517*387b1468SMauro Carvalho ChehabThe unlocking of a mutex also has a fast path for those architectures with
518*387b1468SMauro Carvalho ChehabCMPXCHG.  Since the taking of a mutex on contention always sets the
519*387b1468SMauro Carvalho Chehab"Has Waiters" flag of the mutex's owner, we use this to know if we need to
520*387b1468SMauro Carvalho Chehabtake the slow path when unlocking the mutex.  If the mutex doesn't have any
521*387b1468SMauro Carvalho Chehabwaiters, the owner field of the mutex would equal the current process and
522*387b1468SMauro Carvalho Chehabthe mutex can be unlocked by just replacing the owner field with NULL.
523*387b1468SMauro Carvalho Chehab
524*387b1468SMauro Carvalho ChehabIf the owner field has the "Has Waiters" bit set (or CMPXCHG is not available),
525*387b1468SMauro Carvalho Chehabthe slow unlock path is taken.
526*387b1468SMauro Carvalho Chehab
527*387b1468SMauro Carvalho ChehabThe first thing done in the slow unlock path is to take the wait_lock of the
528*387b1468SMauro Carvalho Chehabmutex.  This synchronizes the locking and unlocking of the mutex.
529*387b1468SMauro Carvalho Chehab
530*387b1468SMauro Carvalho ChehabA check is made to see if the mutex has waiters or not.  On architectures that
531*387b1468SMauro Carvalho Chehabdo not have CMPXCHG, this is the location that the owner of the mutex will
532*387b1468SMauro Carvalho Chehabdetermine if a waiter needs to be awoken or not.  On architectures that
533*387b1468SMauro Carvalho Chehabdo have CMPXCHG, that check is done in the fast path, but it is still needed
534*387b1468SMauro Carvalho Chehabin the slow path too.  If a waiter of a mutex woke up because of a signal
535*387b1468SMauro Carvalho Chehabor timeout between the time the owner failed the fast path CMPXCHG check and
536*387b1468SMauro Carvalho Chehabthe grabbing of the wait_lock, the mutex may not have any waiters, thus the
537*387b1468SMauro Carvalho Chehabowner still needs to make this check. If there are no waiters then the mutex
538*387b1468SMauro Carvalho Chehabowner field is set to NULL, the wait_lock is released and nothing more is
539*387b1468SMauro Carvalho Chehabneeded.
540*387b1468SMauro Carvalho Chehab
541*387b1468SMauro Carvalho ChehabIf there are waiters, then we need to wake one up.
542*387b1468SMauro Carvalho Chehab
543*387b1468SMauro Carvalho ChehabOn the wake up code, the pi_lock of the current owner is taken.  The top
544*387b1468SMauro Carvalho Chehabwaiter of the lock is found and removed from the waiters tree of the mutex
545*387b1468SMauro Carvalho Chehabas well as the pi_waiters tree of the current owner. The "Has Waiters" bit is
546*387b1468SMauro Carvalho Chehabmarked to prevent lower priority tasks from stealing the lock.
547*387b1468SMauro Carvalho Chehab
548*387b1468SMauro Carvalho ChehabFinally we unlock the pi_lock of the pending owner and wake it up.
549*387b1468SMauro Carvalho Chehab
550*387b1468SMauro Carvalho Chehab
551*387b1468SMauro Carvalho ChehabContact
552*387b1468SMauro Carvalho Chehab-------
553*387b1468SMauro Carvalho Chehab
554*387b1468SMauro Carvalho ChehabFor updates on this document, please email Steven Rostedt <rostedt@goodmis.org>
555*387b1468SMauro Carvalho Chehab
556*387b1468SMauro Carvalho Chehab
557*387b1468SMauro Carvalho ChehabCredits
558*387b1468SMauro Carvalho Chehab-------
559*387b1468SMauro Carvalho Chehab
560*387b1468SMauro Carvalho ChehabAuthor:  Steven Rostedt <rostedt@goodmis.org>
561*387b1468SMauro Carvalho Chehab
562*387b1468SMauro Carvalho ChehabUpdated: Alex Shi <alex.shi@linaro.org>	- 7/6/2017
563*387b1468SMauro Carvalho Chehab
564*387b1468SMauro Carvalho ChehabOriginal Reviewers:
565*387b1468SMauro Carvalho Chehab		     Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and
566*387b1468SMauro Carvalho Chehab		     Randy Dunlap
567*387b1468SMauro Carvalho Chehab
568*387b1468SMauro Carvalho ChehabUpdate (7/6/2017) Reviewers: Steven Rostedt and Sebastian Siewior
569*387b1468SMauro Carvalho Chehab
570*387b1468SMauro Carvalho ChehabUpdates
571*387b1468SMauro Carvalho Chehab-------
572*387b1468SMauro Carvalho Chehab
573*387b1468SMauro Carvalho ChehabThis document was originally written for 2.6.17-rc3-mm1
574*387b1468SMauro Carvalho Chehabwas updated on 4.12
575