1*387b1468SMauro Carvalho Chehab============================== 2*387b1468SMauro Carvalho ChehabRT-mutex implementation design 3*387b1468SMauro Carvalho Chehab============================== 4*387b1468SMauro Carvalho Chehab 5*387b1468SMauro Carvalho ChehabCopyright (c) 2006 Steven Rostedt 6*387b1468SMauro Carvalho Chehab 7*387b1468SMauro Carvalho ChehabLicensed under the GNU Free Documentation License, Version 1.2 8*387b1468SMauro Carvalho Chehab 9*387b1468SMauro Carvalho Chehab 10*387b1468SMauro Carvalho ChehabThis document tries to describe the design of the rtmutex.c implementation. 11*387b1468SMauro Carvalho ChehabIt doesn't describe the reasons why rtmutex.c exists. For that please see 12*387b1468SMauro Carvalho ChehabDocumentation/locking/rt-mutex.rst. Although this document does explain problems 13*387b1468SMauro Carvalho Chehabthat happen without this code, but that is in the concept to understand 14*387b1468SMauro Carvalho Chehabwhat the code actually is doing. 15*387b1468SMauro Carvalho Chehab 16*387b1468SMauro Carvalho ChehabThe goal of this document is to help others understand the priority 17*387b1468SMauro Carvalho Chehabinheritance (PI) algorithm that is used, as well as reasons for the 18*387b1468SMauro Carvalho Chehabdecisions that were made to implement PI in the manner that was done. 19*387b1468SMauro Carvalho Chehab 20*387b1468SMauro Carvalho Chehab 21*387b1468SMauro Carvalho ChehabUnbounded Priority Inversion 22*387b1468SMauro Carvalho Chehab---------------------------- 23*387b1468SMauro Carvalho Chehab 24*387b1468SMauro Carvalho ChehabPriority inversion is when a lower priority process executes while a higher 25*387b1468SMauro Carvalho Chehabpriority process wants to run. This happens for several reasons, and 26*387b1468SMauro Carvalho Chehabmost of the time it can't be helped. Anytime a high priority process wants 27*387b1468SMauro Carvalho Chehabto use a resource that a lower priority process has (a mutex for example), 28*387b1468SMauro Carvalho Chehabthe high priority process must wait until the lower priority process is done 29*387b1468SMauro Carvalho Chehabwith the resource. This is a priority inversion. What we want to prevent 30*387b1468SMauro Carvalho Chehabis something called unbounded priority inversion. That is when the high 31*387b1468SMauro Carvalho Chehabpriority process is prevented from running by a lower priority process for 32*387b1468SMauro Carvalho Chehaban undetermined amount of time. 33*387b1468SMauro Carvalho Chehab 34*387b1468SMauro Carvalho ChehabThe classic example of unbounded priority inversion is where you have three 35*387b1468SMauro Carvalho Chehabprocesses, let's call them processes A, B, and C, where A is the highest 36*387b1468SMauro Carvalho Chehabpriority process, C is the lowest, and B is in between. A tries to grab a lock 37*387b1468SMauro Carvalho Chehabthat C owns and must wait and lets C run to release the lock. But in the 38*387b1468SMauro Carvalho Chehabmeantime, B executes, and since B is of a higher priority than C, it preempts C, 39*387b1468SMauro Carvalho Chehabbut by doing so, it is in fact preempting A which is a higher priority process. 40*387b1468SMauro Carvalho ChehabNow there's no way of knowing how long A will be sleeping waiting for C 41*387b1468SMauro Carvalho Chehabto release the lock, because for all we know, B is a CPU hog and will 42*387b1468SMauro Carvalho Chehabnever give C a chance to release the lock. This is called unbounded priority 43*387b1468SMauro Carvalho Chehabinversion. 44*387b1468SMauro Carvalho Chehab 45*387b1468SMauro Carvalho ChehabHere's a little ASCII art to show the problem:: 46*387b1468SMauro Carvalho Chehab 47*387b1468SMauro Carvalho Chehab grab lock L1 (owned by C) 48*387b1468SMauro Carvalho Chehab | 49*387b1468SMauro Carvalho Chehab A ---+ 50*387b1468SMauro Carvalho Chehab C preempted by B 51*387b1468SMauro Carvalho Chehab | 52*387b1468SMauro Carvalho Chehab C +----+ 53*387b1468SMauro Carvalho Chehab 54*387b1468SMauro Carvalho Chehab B +--------> 55*387b1468SMauro Carvalho Chehab B now keeps A from running. 56*387b1468SMauro Carvalho Chehab 57*387b1468SMauro Carvalho Chehab 58*387b1468SMauro Carvalho ChehabPriority Inheritance (PI) 59*387b1468SMauro Carvalho Chehab------------------------- 60*387b1468SMauro Carvalho Chehab 61*387b1468SMauro Carvalho ChehabThere are several ways to solve this issue, but other ways are out of scope 62*387b1468SMauro Carvalho Chehabfor this document. Here we only discuss PI. 63*387b1468SMauro Carvalho Chehab 64*387b1468SMauro Carvalho ChehabPI is where a process inherits the priority of another process if the other 65*387b1468SMauro Carvalho Chehabprocess blocks on a lock owned by the current process. To make this easier 66*387b1468SMauro Carvalho Chehabto understand, let's use the previous example, with processes A, B, and C again. 67*387b1468SMauro Carvalho Chehab 68*387b1468SMauro Carvalho ChehabThis time, when A blocks on the lock owned by C, C would inherit the priority 69*387b1468SMauro Carvalho Chehabof A. So now if B becomes runnable, it would not preempt C, since C now has 70*387b1468SMauro Carvalho Chehabthe high priority of A. As soon as C releases the lock, it loses its 71*387b1468SMauro Carvalho Chehabinherited priority, and A then can continue with the resource that C had. 72*387b1468SMauro Carvalho Chehab 73*387b1468SMauro Carvalho ChehabTerminology 74*387b1468SMauro Carvalho Chehab----------- 75*387b1468SMauro Carvalho Chehab 76*387b1468SMauro Carvalho ChehabHere I explain some terminology that is used in this document to help describe 77*387b1468SMauro Carvalho Chehabthe design that is used to implement PI. 78*387b1468SMauro Carvalho Chehab 79*387b1468SMauro Carvalho ChehabPI chain 80*387b1468SMauro Carvalho Chehab - The PI chain is an ordered series of locks and processes that cause 81*387b1468SMauro Carvalho Chehab processes to inherit priorities from a previous process that is 82*387b1468SMauro Carvalho Chehab blocked on one of its locks. This is described in more detail 83*387b1468SMauro Carvalho Chehab later in this document. 84*387b1468SMauro Carvalho Chehab 85*387b1468SMauro Carvalho Chehabmutex 86*387b1468SMauro Carvalho Chehab - In this document, to differentiate from locks that implement 87*387b1468SMauro Carvalho Chehab PI and spin locks that are used in the PI code, from now on 88*387b1468SMauro Carvalho Chehab the PI locks will be called a mutex. 89*387b1468SMauro Carvalho Chehab 90*387b1468SMauro Carvalho Chehablock 91*387b1468SMauro Carvalho Chehab - In this document from now on, I will use the term lock when 92*387b1468SMauro Carvalho Chehab referring to spin locks that are used to protect parts of the PI 93*387b1468SMauro Carvalho Chehab algorithm. These locks disable preemption for UP (when 94*387b1468SMauro Carvalho Chehab CONFIG_PREEMPT is enabled) and on SMP prevents multiple CPUs from 95*387b1468SMauro Carvalho Chehab entering critical sections simultaneously. 96*387b1468SMauro Carvalho Chehab 97*387b1468SMauro Carvalho Chehabspin lock 98*387b1468SMauro Carvalho Chehab - Same as lock above. 99*387b1468SMauro Carvalho Chehab 100*387b1468SMauro Carvalho Chehabwaiter 101*387b1468SMauro Carvalho Chehab - A waiter is a struct that is stored on the stack of a blocked 102*387b1468SMauro Carvalho Chehab process. Since the scope of the waiter is within the code for 103*387b1468SMauro Carvalho Chehab a process being blocked on the mutex, it is fine to allocate 104*387b1468SMauro Carvalho Chehab the waiter on the process's stack (local variable). This 105*387b1468SMauro Carvalho Chehab structure holds a pointer to the task, as well as the mutex that 106*387b1468SMauro Carvalho Chehab the task is blocked on. It also has rbtree node structures to 107*387b1468SMauro Carvalho Chehab place the task in the waiters rbtree of a mutex as well as the 108*387b1468SMauro Carvalho Chehab pi_waiters rbtree of a mutex owner task (described below). 109*387b1468SMauro Carvalho Chehab 110*387b1468SMauro Carvalho Chehab waiter is sometimes used in reference to the task that is waiting 111*387b1468SMauro Carvalho Chehab on a mutex. This is the same as waiter->task. 112*387b1468SMauro Carvalho Chehab 113*387b1468SMauro Carvalho Chehabwaiters 114*387b1468SMauro Carvalho Chehab - A list of processes that are blocked on a mutex. 115*387b1468SMauro Carvalho Chehab 116*387b1468SMauro Carvalho Chehabtop waiter 117*387b1468SMauro Carvalho Chehab - The highest priority process waiting on a specific mutex. 118*387b1468SMauro Carvalho Chehab 119*387b1468SMauro Carvalho Chehabtop pi waiter 120*387b1468SMauro Carvalho Chehab - The highest priority process waiting on one of the mutexes 121*387b1468SMauro Carvalho Chehab that a specific process owns. 122*387b1468SMauro Carvalho Chehab 123*387b1468SMauro Carvalho ChehabNote: 124*387b1468SMauro Carvalho Chehab task and process are used interchangeably in this document, mostly to 125*387b1468SMauro Carvalho Chehab differentiate between two processes that are being described together. 126*387b1468SMauro Carvalho Chehab 127*387b1468SMauro Carvalho Chehab 128*387b1468SMauro Carvalho ChehabPI chain 129*387b1468SMauro Carvalho Chehab-------- 130*387b1468SMauro Carvalho Chehab 131*387b1468SMauro Carvalho ChehabThe PI chain is a list of processes and mutexes that may cause priority 132*387b1468SMauro Carvalho Chehabinheritance to take place. Multiple chains may converge, but a chain 133*387b1468SMauro Carvalho Chehabwould never diverge, since a process can't be blocked on more than one 134*387b1468SMauro Carvalho Chehabmutex at a time. 135*387b1468SMauro Carvalho Chehab 136*387b1468SMauro Carvalho ChehabExample:: 137*387b1468SMauro Carvalho Chehab 138*387b1468SMauro Carvalho Chehab Process: A, B, C, D, E 139*387b1468SMauro Carvalho Chehab Mutexes: L1, L2, L3, L4 140*387b1468SMauro Carvalho Chehab 141*387b1468SMauro Carvalho Chehab A owns: L1 142*387b1468SMauro Carvalho Chehab B blocked on L1 143*387b1468SMauro Carvalho Chehab B owns L2 144*387b1468SMauro Carvalho Chehab C blocked on L2 145*387b1468SMauro Carvalho Chehab C owns L3 146*387b1468SMauro Carvalho Chehab D blocked on L3 147*387b1468SMauro Carvalho Chehab D owns L4 148*387b1468SMauro Carvalho Chehab E blocked on L4 149*387b1468SMauro Carvalho Chehab 150*387b1468SMauro Carvalho ChehabThe chain would be:: 151*387b1468SMauro Carvalho Chehab 152*387b1468SMauro Carvalho Chehab E->L4->D->L3->C->L2->B->L1->A 153*387b1468SMauro Carvalho Chehab 154*387b1468SMauro Carvalho ChehabTo show where two chains merge, we could add another process F and 155*387b1468SMauro Carvalho Chehabanother mutex L5 where B owns L5 and F is blocked on mutex L5. 156*387b1468SMauro Carvalho Chehab 157*387b1468SMauro Carvalho ChehabThe chain for F would be:: 158*387b1468SMauro Carvalho Chehab 159*387b1468SMauro Carvalho Chehab F->L5->B->L1->A 160*387b1468SMauro Carvalho Chehab 161*387b1468SMauro Carvalho ChehabSince a process may own more than one mutex, but never be blocked on more than 162*387b1468SMauro Carvalho Chehabone, the chains merge. 163*387b1468SMauro Carvalho Chehab 164*387b1468SMauro Carvalho ChehabHere we show both chains:: 165*387b1468SMauro Carvalho Chehab 166*387b1468SMauro Carvalho Chehab E->L4->D->L3->C->L2-+ 167*387b1468SMauro Carvalho Chehab | 168*387b1468SMauro Carvalho Chehab +->B->L1->A 169*387b1468SMauro Carvalho Chehab | 170*387b1468SMauro Carvalho Chehab F->L5-+ 171*387b1468SMauro Carvalho Chehab 172*387b1468SMauro Carvalho ChehabFor PI to work, the processes at the right end of these chains (or we may 173*387b1468SMauro Carvalho Chehabalso call it the Top of the chain) must be equal to or higher in priority 174*387b1468SMauro Carvalho Chehabthan the processes to the left or below in the chain. 175*387b1468SMauro Carvalho Chehab 176*387b1468SMauro Carvalho ChehabAlso since a mutex may have more than one process blocked on it, we can 177*387b1468SMauro Carvalho Chehabhave multiple chains merge at mutexes. If we add another process G that is 178*387b1468SMauro Carvalho Chehabblocked on mutex L2:: 179*387b1468SMauro Carvalho Chehab 180*387b1468SMauro Carvalho Chehab G->L2->B->L1->A 181*387b1468SMauro Carvalho Chehab 182*387b1468SMauro Carvalho ChehabAnd once again, to show how this can grow I will show the merging chains 183*387b1468SMauro Carvalho Chehabagain:: 184*387b1468SMauro Carvalho Chehab 185*387b1468SMauro Carvalho Chehab E->L4->D->L3->C-+ 186*387b1468SMauro Carvalho Chehab +->L2-+ 187*387b1468SMauro Carvalho Chehab | | 188*387b1468SMauro Carvalho Chehab G-+ +->B->L1->A 189*387b1468SMauro Carvalho Chehab | 190*387b1468SMauro Carvalho Chehab F->L5-+ 191*387b1468SMauro Carvalho Chehab 192*387b1468SMauro Carvalho ChehabIf process G has the highest priority in the chain, then all the tasks up 193*387b1468SMauro Carvalho Chehabthe chain (A and B in this example), must have their priorities increased 194*387b1468SMauro Carvalho Chehabto that of G. 195*387b1468SMauro Carvalho Chehab 196*387b1468SMauro Carvalho ChehabMutex Waiters Tree 197*387b1468SMauro Carvalho Chehab------------------ 198*387b1468SMauro Carvalho Chehab 199*387b1468SMauro Carvalho ChehabEvery mutex keeps track of all the waiters that are blocked on itself. The 200*387b1468SMauro Carvalho Chehabmutex has a rbtree to store these waiters by priority. This tree is protected 201*387b1468SMauro Carvalho Chehabby a spin lock that is located in the struct of the mutex. This lock is called 202*387b1468SMauro Carvalho Chehabwait_lock. 203*387b1468SMauro Carvalho Chehab 204*387b1468SMauro Carvalho Chehab 205*387b1468SMauro Carvalho ChehabTask PI Tree 206*387b1468SMauro Carvalho Chehab------------ 207*387b1468SMauro Carvalho Chehab 208*387b1468SMauro Carvalho ChehabTo keep track of the PI chains, each process has its own PI rbtree. This is 209*387b1468SMauro Carvalho Chehaba tree of all top waiters of the mutexes that are owned by the process. 210*387b1468SMauro Carvalho ChehabNote that this tree only holds the top waiters and not all waiters that are 211*387b1468SMauro Carvalho Chehabblocked on mutexes owned by the process. 212*387b1468SMauro Carvalho Chehab 213*387b1468SMauro Carvalho ChehabThe top of the task's PI tree is always the highest priority task that 214*387b1468SMauro Carvalho Chehabis waiting on a mutex that is owned by the task. So if the task has 215*387b1468SMauro Carvalho Chehabinherited a priority, it will always be the priority of the task that is 216*387b1468SMauro Carvalho Chehabat the top of this tree. 217*387b1468SMauro Carvalho Chehab 218*387b1468SMauro Carvalho ChehabThis tree is stored in the task structure of a process as a rbtree called 219*387b1468SMauro Carvalho Chehabpi_waiters. It is protected by a spin lock also in the task structure, 220*387b1468SMauro Carvalho Chehabcalled pi_lock. This lock may also be taken in interrupt context, so when 221*387b1468SMauro Carvalho Chehablocking the pi_lock, interrupts must be disabled. 222*387b1468SMauro Carvalho Chehab 223*387b1468SMauro Carvalho Chehab 224*387b1468SMauro Carvalho ChehabDepth of the PI Chain 225*387b1468SMauro Carvalho Chehab--------------------- 226*387b1468SMauro Carvalho Chehab 227*387b1468SMauro Carvalho ChehabThe maximum depth of the PI chain is not dynamic, and could actually be 228*387b1468SMauro Carvalho Chehabdefined. But is very complex to figure it out, since it depends on all 229*387b1468SMauro Carvalho Chehabthe nesting of mutexes. Let's look at the example where we have 3 mutexes, 230*387b1468SMauro Carvalho ChehabL1, L2, and L3, and four separate functions func1, func2, func3 and func4. 231*387b1468SMauro Carvalho ChehabThe following shows a locking order of L1->L2->L3, but may not actually 232*387b1468SMauro Carvalho Chehabbe directly nested that way:: 233*387b1468SMauro Carvalho Chehab 234*387b1468SMauro Carvalho Chehab void func1(void) 235*387b1468SMauro Carvalho Chehab { 236*387b1468SMauro Carvalho Chehab mutex_lock(L1); 237*387b1468SMauro Carvalho Chehab 238*387b1468SMauro Carvalho Chehab /* do anything */ 239*387b1468SMauro Carvalho Chehab 240*387b1468SMauro Carvalho Chehab mutex_unlock(L1); 241*387b1468SMauro Carvalho Chehab } 242*387b1468SMauro Carvalho Chehab 243*387b1468SMauro Carvalho Chehab void func2(void) 244*387b1468SMauro Carvalho Chehab { 245*387b1468SMauro Carvalho Chehab mutex_lock(L1); 246*387b1468SMauro Carvalho Chehab mutex_lock(L2); 247*387b1468SMauro Carvalho Chehab 248*387b1468SMauro Carvalho Chehab /* do something */ 249*387b1468SMauro Carvalho Chehab 250*387b1468SMauro Carvalho Chehab mutex_unlock(L2); 251*387b1468SMauro Carvalho Chehab mutex_unlock(L1); 252*387b1468SMauro Carvalho Chehab } 253*387b1468SMauro Carvalho Chehab 254*387b1468SMauro Carvalho Chehab void func3(void) 255*387b1468SMauro Carvalho Chehab { 256*387b1468SMauro Carvalho Chehab mutex_lock(L2); 257*387b1468SMauro Carvalho Chehab mutex_lock(L3); 258*387b1468SMauro Carvalho Chehab 259*387b1468SMauro Carvalho Chehab /* do something else */ 260*387b1468SMauro Carvalho Chehab 261*387b1468SMauro Carvalho Chehab mutex_unlock(L3); 262*387b1468SMauro Carvalho Chehab mutex_unlock(L2); 263*387b1468SMauro Carvalho Chehab } 264*387b1468SMauro Carvalho Chehab 265*387b1468SMauro Carvalho Chehab void func4(void) 266*387b1468SMauro Carvalho Chehab { 267*387b1468SMauro Carvalho Chehab mutex_lock(L3); 268*387b1468SMauro Carvalho Chehab 269*387b1468SMauro Carvalho Chehab /* do something again */ 270*387b1468SMauro Carvalho Chehab 271*387b1468SMauro Carvalho Chehab mutex_unlock(L3); 272*387b1468SMauro Carvalho Chehab } 273*387b1468SMauro Carvalho Chehab 274*387b1468SMauro Carvalho ChehabNow we add 4 processes that run each of these functions separately. 275*387b1468SMauro Carvalho ChehabProcesses A, B, C, and D which run functions func1, func2, func3 and func4 276*387b1468SMauro Carvalho Chehabrespectively, and such that D runs first and A last. With D being preempted 277*387b1468SMauro Carvalho Chehabin func4 in the "do something again" area, we have a locking that follows:: 278*387b1468SMauro Carvalho Chehab 279*387b1468SMauro Carvalho Chehab D owns L3 280*387b1468SMauro Carvalho Chehab C blocked on L3 281*387b1468SMauro Carvalho Chehab C owns L2 282*387b1468SMauro Carvalho Chehab B blocked on L2 283*387b1468SMauro Carvalho Chehab B owns L1 284*387b1468SMauro Carvalho Chehab A blocked on L1 285*387b1468SMauro Carvalho Chehab 286*387b1468SMauro Carvalho Chehab And thus we have the chain A->L1->B->L2->C->L3->D. 287*387b1468SMauro Carvalho Chehab 288*387b1468SMauro Carvalho ChehabThis gives us a PI depth of 4 (four processes), but looking at any of the 289*387b1468SMauro Carvalho Chehabfunctions individually, it seems as though they only have at most a locking 290*387b1468SMauro Carvalho Chehabdepth of two. So, although the locking depth is defined at compile time, 291*387b1468SMauro Carvalho Chehabit still is very difficult to find the possibilities of that depth. 292*387b1468SMauro Carvalho Chehab 293*387b1468SMauro Carvalho ChehabNow since mutexes can be defined by user-land applications, we don't want a DOS 294*387b1468SMauro Carvalho Chehabtype of application that nests large amounts of mutexes to create a large 295*387b1468SMauro Carvalho ChehabPI chain, and have the code holding spin locks while looking at a large 296*387b1468SMauro Carvalho Chehabamount of data. So to prevent this, the implementation not only implements 297*387b1468SMauro Carvalho Chehaba maximum lock depth, but also only holds at most two different locks at a 298*387b1468SMauro Carvalho Chehabtime, as it walks the PI chain. More about this below. 299*387b1468SMauro Carvalho Chehab 300*387b1468SMauro Carvalho Chehab 301*387b1468SMauro Carvalho ChehabMutex owner and flags 302*387b1468SMauro Carvalho Chehab--------------------- 303*387b1468SMauro Carvalho Chehab 304*387b1468SMauro Carvalho ChehabThe mutex structure contains a pointer to the owner of the mutex. If the 305*387b1468SMauro Carvalho Chehabmutex is not owned, this owner is set to NULL. Since all architectures 306*387b1468SMauro Carvalho Chehabhave the task structure on at least a two byte alignment (and if this is 307*387b1468SMauro Carvalho Chehabnot true, the rtmutex.c code will be broken!), this allows for the least 308*387b1468SMauro Carvalho Chehabsignificant bit to be used as a flag. Bit 0 is used as the "Has Waiters" 309*387b1468SMauro Carvalho Chehabflag. It's set whenever there are waiters on a mutex. 310*387b1468SMauro Carvalho Chehab 311*387b1468SMauro Carvalho ChehabSee Documentation/locking/rt-mutex.rst for further details. 312*387b1468SMauro Carvalho Chehab 313*387b1468SMauro Carvalho Chehabcmpxchg Tricks 314*387b1468SMauro Carvalho Chehab-------------- 315*387b1468SMauro Carvalho Chehab 316*387b1468SMauro Carvalho ChehabSome architectures implement an atomic cmpxchg (Compare and Exchange). This 317*387b1468SMauro Carvalho Chehabis used (when applicable) to keep the fast path of grabbing and releasing 318*387b1468SMauro Carvalho Chehabmutexes short. 319*387b1468SMauro Carvalho Chehab 320*387b1468SMauro Carvalho Chehabcmpxchg is basically the following function performed atomically:: 321*387b1468SMauro Carvalho Chehab 322*387b1468SMauro Carvalho Chehab unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C) 323*387b1468SMauro Carvalho Chehab { 324*387b1468SMauro Carvalho Chehab unsigned long T = *A; 325*387b1468SMauro Carvalho Chehab if (*A == *B) { 326*387b1468SMauro Carvalho Chehab *A = *C; 327*387b1468SMauro Carvalho Chehab } 328*387b1468SMauro Carvalho Chehab return T; 329*387b1468SMauro Carvalho Chehab } 330*387b1468SMauro Carvalho Chehab #define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c) 331*387b1468SMauro Carvalho Chehab 332*387b1468SMauro Carvalho ChehabThis is really nice to have, since it allows you to only update a variable 333*387b1468SMauro Carvalho Chehabif the variable is what you expect it to be. You know if it succeeded if 334*387b1468SMauro Carvalho Chehabthe return value (the old value of A) is equal to B. 335*387b1468SMauro Carvalho Chehab 336*387b1468SMauro Carvalho ChehabThe macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If 337*387b1468SMauro Carvalho Chehabthe architecture does not support CMPXCHG, then this macro is simply set 338*387b1468SMauro Carvalho Chehabto fail every time. But if CMPXCHG is supported, then this will 339*387b1468SMauro Carvalho Chehabhelp out extremely to keep the fast path short. 340*387b1468SMauro Carvalho Chehab 341*387b1468SMauro Carvalho ChehabThe use of rt_mutex_cmpxchg with the flags in the owner field help optimize 342*387b1468SMauro Carvalho Chehabthe system for architectures that support it. This will also be explained 343*387b1468SMauro Carvalho Chehablater in this document. 344*387b1468SMauro Carvalho Chehab 345*387b1468SMauro Carvalho Chehab 346*387b1468SMauro Carvalho ChehabPriority adjustments 347*387b1468SMauro Carvalho Chehab-------------------- 348*387b1468SMauro Carvalho Chehab 349*387b1468SMauro Carvalho ChehabThe implementation of the PI code in rtmutex.c has several places that a 350*387b1468SMauro Carvalho Chehabprocess must adjust its priority. With the help of the pi_waiters of a 351*387b1468SMauro Carvalho Chehabprocess this is rather easy to know what needs to be adjusted. 352*387b1468SMauro Carvalho Chehab 353*387b1468SMauro Carvalho ChehabThe functions implementing the task adjustments are rt_mutex_adjust_prio 354*387b1468SMauro Carvalho Chehaband rt_mutex_setprio. rt_mutex_setprio is only used in rt_mutex_adjust_prio. 355*387b1468SMauro Carvalho Chehab 356*387b1468SMauro Carvalho Chehabrt_mutex_adjust_prio examines the priority of the task, and the highest 357*387b1468SMauro Carvalho Chehabpriority process that is waiting any of mutexes owned by the task. Since 358*387b1468SMauro Carvalho Chehabthe pi_waiters of a task holds an order by priority of all the top waiters 359*387b1468SMauro Carvalho Chehabof all the mutexes that the task owns, we simply need to compare the top 360*387b1468SMauro Carvalho Chehabpi waiter to its own normal/deadline priority and take the higher one. 361*387b1468SMauro Carvalho ChehabThen rt_mutex_setprio is called to adjust the priority of the task to the 362*387b1468SMauro Carvalho Chehabnew priority. Note that rt_mutex_setprio is defined in kernel/sched/core.c 363*387b1468SMauro Carvalho Chehabto implement the actual change in priority. 364*387b1468SMauro Carvalho Chehab 365*387b1468SMauro Carvalho ChehabNote: 366*387b1468SMauro Carvalho Chehab For the "prio" field in task_struct, the lower the number, the 367*387b1468SMauro Carvalho Chehab higher the priority. A "prio" of 5 is of higher priority than a 368*387b1468SMauro Carvalho Chehab "prio" of 10. 369*387b1468SMauro Carvalho Chehab 370*387b1468SMauro Carvalho ChehabIt is interesting to note that rt_mutex_adjust_prio can either increase 371*387b1468SMauro Carvalho Chehabor decrease the priority of the task. In the case that a higher priority 372*387b1468SMauro Carvalho Chehabprocess has just blocked on a mutex owned by the task, rt_mutex_adjust_prio 373*387b1468SMauro Carvalho Chehabwould increase/boost the task's priority. But if a higher priority task 374*387b1468SMauro Carvalho Chehabwere for some reason to leave the mutex (timeout or signal), this same function 375*387b1468SMauro Carvalho Chehabwould decrease/unboost the priority of the task. That is because the pi_waiters 376*387b1468SMauro Carvalho Chehabalways contains the highest priority task that is waiting on a mutex owned 377*387b1468SMauro Carvalho Chehabby the task, so we only need to compare the priority of that top pi waiter 378*387b1468SMauro Carvalho Chehabto the normal priority of the given task. 379*387b1468SMauro Carvalho Chehab 380*387b1468SMauro Carvalho Chehab 381*387b1468SMauro Carvalho ChehabHigh level overview of the PI chain walk 382*387b1468SMauro Carvalho Chehab---------------------------------------- 383*387b1468SMauro Carvalho Chehab 384*387b1468SMauro Carvalho ChehabThe PI chain walk is implemented by the function rt_mutex_adjust_prio_chain. 385*387b1468SMauro Carvalho Chehab 386*387b1468SMauro Carvalho ChehabThe implementation has gone through several iterations, and has ended up 387*387b1468SMauro Carvalho Chehabwith what we believe is the best. It walks the PI chain by only grabbing 388*387b1468SMauro Carvalho Chehabat most two locks at a time, and is very efficient. 389*387b1468SMauro Carvalho Chehab 390*387b1468SMauro Carvalho ChehabThe rt_mutex_adjust_prio_chain can be used either to boost or lower process 391*387b1468SMauro Carvalho Chehabpriorities. 392*387b1468SMauro Carvalho Chehab 393*387b1468SMauro Carvalho Chehabrt_mutex_adjust_prio_chain is called with a task to be checked for PI 394*387b1468SMauro Carvalho Chehab(de)boosting (the owner of a mutex that a process is blocking on), a flag to 395*387b1468SMauro Carvalho Chehabcheck for deadlocking, the mutex that the task owns, a pointer to a waiter 396*387b1468SMauro Carvalho Chehabthat is the process's waiter struct that is blocked on the mutex (although this 397*387b1468SMauro Carvalho Chehabparameter may be NULL for deboosting), a pointer to the mutex on which the task 398*387b1468SMauro Carvalho Chehabis blocked, and a top_task as the top waiter of the mutex. 399*387b1468SMauro Carvalho Chehab 400*387b1468SMauro Carvalho ChehabFor this explanation, I will not mention deadlock detection. This explanation 401*387b1468SMauro Carvalho Chehabwill try to stay at a high level. 402*387b1468SMauro Carvalho Chehab 403*387b1468SMauro Carvalho ChehabWhen this function is called, there are no locks held. That also means 404*387b1468SMauro Carvalho Chehabthat the state of the owner and lock can change when entered into this function. 405*387b1468SMauro Carvalho Chehab 406*387b1468SMauro Carvalho ChehabBefore this function is called, the task has already had rt_mutex_adjust_prio 407*387b1468SMauro Carvalho Chehabperformed on it. This means that the task is set to the priority that it 408*387b1468SMauro Carvalho Chehabshould be at, but the rbtree nodes of the task's waiter have not been updated 409*387b1468SMauro Carvalho Chehabwith the new priorities, and this task may not be in the proper locations 410*387b1468SMauro Carvalho Chehabin the pi_waiters and waiters trees that the task is blocked on. This function 411*387b1468SMauro Carvalho Chehabsolves all that. 412*387b1468SMauro Carvalho Chehab 413*387b1468SMauro Carvalho ChehabThe main operation of this function is summarized by Thomas Gleixner in 414*387b1468SMauro Carvalho Chehabrtmutex.c. See the 'Chain walk basics and protection scope' comment for further 415*387b1468SMauro Carvalho Chehabdetails. 416*387b1468SMauro Carvalho Chehab 417*387b1468SMauro Carvalho ChehabTaking of a mutex (The walk through) 418*387b1468SMauro Carvalho Chehab------------------------------------ 419*387b1468SMauro Carvalho Chehab 420*387b1468SMauro Carvalho ChehabOK, now let's take a look at the detailed walk through of what happens when 421*387b1468SMauro Carvalho Chehabtaking a mutex. 422*387b1468SMauro Carvalho Chehab 423*387b1468SMauro Carvalho ChehabThe first thing that is tried is the fast taking of the mutex. This is 424*387b1468SMauro Carvalho Chehabdone when we have CMPXCHG enabled (otherwise the fast taking automatically 425*387b1468SMauro Carvalho Chehabfails). Only when the owner field of the mutex is NULL can the lock be 426*387b1468SMauro Carvalho Chehabtaken with the CMPXCHG and nothing else needs to be done. 427*387b1468SMauro Carvalho Chehab 428*387b1468SMauro Carvalho ChehabIf there is contention on the lock, we go about the slow path 429*387b1468SMauro Carvalho Chehab(rt_mutex_slowlock). 430*387b1468SMauro Carvalho Chehab 431*387b1468SMauro Carvalho ChehabThe slow path function is where the task's waiter structure is created on 432*387b1468SMauro Carvalho Chehabthe stack. This is because the waiter structure is only needed for the 433*387b1468SMauro Carvalho Chehabscope of this function. The waiter structure holds the nodes to store 434*387b1468SMauro Carvalho Chehabthe task on the waiters tree of the mutex, and if need be, the pi_waiters 435*387b1468SMauro Carvalho Chehabtree of the owner. 436*387b1468SMauro Carvalho Chehab 437*387b1468SMauro Carvalho ChehabThe wait_lock of the mutex is taken since the slow path of unlocking the 438*387b1468SMauro Carvalho Chehabmutex also takes this lock. 439*387b1468SMauro Carvalho Chehab 440*387b1468SMauro Carvalho ChehabWe then call try_to_take_rt_mutex. This is where the architecture that 441*387b1468SMauro Carvalho Chehabdoes not implement CMPXCHG would always grab the lock (if there's no 442*387b1468SMauro Carvalho Chehabcontention). 443*387b1468SMauro Carvalho Chehab 444*387b1468SMauro Carvalho Chehabtry_to_take_rt_mutex is used every time the task tries to grab a mutex in the 445*387b1468SMauro Carvalho Chehabslow path. The first thing that is done here is an atomic setting of 446*387b1468SMauro Carvalho Chehabthe "Has Waiters" flag of the mutex's owner field. By setting this flag 447*387b1468SMauro Carvalho Chehabnow, the current owner of the mutex being contended for can't release the mutex 448*387b1468SMauro Carvalho Chehabwithout going into the slow unlock path, and it would then need to grab the 449*387b1468SMauro Carvalho Chehabwait_lock, which this code currently holds. So setting the "Has Waiters" flag 450*387b1468SMauro Carvalho Chehabforces the current owner to synchronize with this code. 451*387b1468SMauro Carvalho Chehab 452*387b1468SMauro Carvalho ChehabThe lock is taken if the following are true: 453*387b1468SMauro Carvalho Chehab 454*387b1468SMauro Carvalho Chehab 1) The lock has no owner 455*387b1468SMauro Carvalho Chehab 2) The current task is the highest priority against all other 456*387b1468SMauro Carvalho Chehab waiters of the lock 457*387b1468SMauro Carvalho Chehab 458*387b1468SMauro Carvalho ChehabIf the task succeeds to acquire the lock, then the task is set as the 459*387b1468SMauro Carvalho Chehabowner of the lock, and if the lock still has waiters, the top_waiter 460*387b1468SMauro Carvalho Chehab(highest priority task waiting on the lock) is added to this task's 461*387b1468SMauro Carvalho Chehabpi_waiters tree. 462*387b1468SMauro Carvalho Chehab 463*387b1468SMauro Carvalho ChehabIf the lock is not taken by try_to_take_rt_mutex(), then the 464*387b1468SMauro Carvalho Chehabtask_blocks_on_rt_mutex() function is called. This will add the task to 465*387b1468SMauro Carvalho Chehabthe lock's waiter tree and propagate the pi chain of the lock as well 466*387b1468SMauro Carvalho Chehabas the lock's owner's pi_waiters tree. This is described in the next 467*387b1468SMauro Carvalho Chehabsection. 468*387b1468SMauro Carvalho Chehab 469*387b1468SMauro Carvalho ChehabTask blocks on mutex 470*387b1468SMauro Carvalho Chehab-------------------- 471*387b1468SMauro Carvalho Chehab 472*387b1468SMauro Carvalho ChehabThe accounting of a mutex and process is done with the waiter structure of 473*387b1468SMauro Carvalho Chehabthe process. The "task" field is set to the process, and the "lock" field 474*387b1468SMauro Carvalho Chehabto the mutex. The rbtree node of waiter are initialized to the processes 475*387b1468SMauro Carvalho Chehabcurrent priority. 476*387b1468SMauro Carvalho Chehab 477*387b1468SMauro Carvalho ChehabSince the wait_lock was taken at the entry of the slow lock, we can safely 478*387b1468SMauro Carvalho Chehabadd the waiter to the task waiter tree. If the current process is the 479*387b1468SMauro Carvalho Chehabhighest priority process currently waiting on this mutex, then we remove the 480*387b1468SMauro Carvalho Chehabprevious top waiter process (if it exists) from the pi_waiters of the owner, 481*387b1468SMauro Carvalho Chehaband add the current process to that tree. Since the pi_waiter of the owner 482*387b1468SMauro Carvalho Chehabhas changed, we call rt_mutex_adjust_prio on the owner to see if the owner 483*387b1468SMauro Carvalho Chehabshould adjust its priority accordingly. 484*387b1468SMauro Carvalho Chehab 485*387b1468SMauro Carvalho ChehabIf the owner is also blocked on a lock, and had its pi_waiters changed 486*387b1468SMauro Carvalho Chehab(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead 487*387b1468SMauro Carvalho Chehaband run rt_mutex_adjust_prio_chain on the owner, as described earlier. 488*387b1468SMauro Carvalho Chehab 489*387b1468SMauro Carvalho ChehabNow all locks are released, and if the current process is still blocked on a 490*387b1468SMauro Carvalho Chehabmutex (waiter "task" field is not NULL), then we go to sleep (call schedule). 491*387b1468SMauro Carvalho Chehab 492*387b1468SMauro Carvalho ChehabWaking up in the loop 493*387b1468SMauro Carvalho Chehab--------------------- 494*387b1468SMauro Carvalho Chehab 495*387b1468SMauro Carvalho ChehabThe task can then wake up for a couple of reasons: 496*387b1468SMauro Carvalho Chehab 1) The previous lock owner released the lock, and the task now is top_waiter 497*387b1468SMauro Carvalho Chehab 2) we received a signal or timeout 498*387b1468SMauro Carvalho Chehab 499*387b1468SMauro Carvalho ChehabIn both cases, the task will try again to acquire the lock. If it 500*387b1468SMauro Carvalho Chehabdoes, then it will take itself off the waiters tree and set itself back 501*387b1468SMauro Carvalho Chehabto the TASK_RUNNING state. 502*387b1468SMauro Carvalho Chehab 503*387b1468SMauro Carvalho ChehabIn first case, if the lock was acquired by another task before this task 504*387b1468SMauro Carvalho Chehabcould get the lock, then it will go back to sleep and wait to be woken again. 505*387b1468SMauro Carvalho Chehab 506*387b1468SMauro Carvalho ChehabThe second case is only applicable for tasks that are grabbing a mutex 507*387b1468SMauro Carvalho Chehabthat can wake up before getting the lock, either due to a signal or 508*387b1468SMauro Carvalho Chehaba timeout (i.e. rt_mutex_timed_futex_lock()). When woken, it will try to 509*387b1468SMauro Carvalho Chehabtake the lock again, if it succeeds, then the task will return with the 510*387b1468SMauro Carvalho Chehablock held, otherwise it will return with -EINTR if the task was woken 511*387b1468SMauro Carvalho Chehabby a signal, or -ETIMEDOUT if it timed out. 512*387b1468SMauro Carvalho Chehab 513*387b1468SMauro Carvalho Chehab 514*387b1468SMauro Carvalho ChehabUnlocking the Mutex 515*387b1468SMauro Carvalho Chehab------------------- 516*387b1468SMauro Carvalho Chehab 517*387b1468SMauro Carvalho ChehabThe unlocking of a mutex also has a fast path for those architectures with 518*387b1468SMauro Carvalho ChehabCMPXCHG. Since the taking of a mutex on contention always sets the 519*387b1468SMauro Carvalho Chehab"Has Waiters" flag of the mutex's owner, we use this to know if we need to 520*387b1468SMauro Carvalho Chehabtake the slow path when unlocking the mutex. If the mutex doesn't have any 521*387b1468SMauro Carvalho Chehabwaiters, the owner field of the mutex would equal the current process and 522*387b1468SMauro Carvalho Chehabthe mutex can be unlocked by just replacing the owner field with NULL. 523*387b1468SMauro Carvalho Chehab 524*387b1468SMauro Carvalho ChehabIf the owner field has the "Has Waiters" bit set (or CMPXCHG is not available), 525*387b1468SMauro Carvalho Chehabthe slow unlock path is taken. 526*387b1468SMauro Carvalho Chehab 527*387b1468SMauro Carvalho ChehabThe first thing done in the slow unlock path is to take the wait_lock of the 528*387b1468SMauro Carvalho Chehabmutex. This synchronizes the locking and unlocking of the mutex. 529*387b1468SMauro Carvalho Chehab 530*387b1468SMauro Carvalho ChehabA check is made to see if the mutex has waiters or not. On architectures that 531*387b1468SMauro Carvalho Chehabdo not have CMPXCHG, this is the location that the owner of the mutex will 532*387b1468SMauro Carvalho Chehabdetermine if a waiter needs to be awoken or not. On architectures that 533*387b1468SMauro Carvalho Chehabdo have CMPXCHG, that check is done in the fast path, but it is still needed 534*387b1468SMauro Carvalho Chehabin the slow path too. If a waiter of a mutex woke up because of a signal 535*387b1468SMauro Carvalho Chehabor timeout between the time the owner failed the fast path CMPXCHG check and 536*387b1468SMauro Carvalho Chehabthe grabbing of the wait_lock, the mutex may not have any waiters, thus the 537*387b1468SMauro Carvalho Chehabowner still needs to make this check. If there are no waiters then the mutex 538*387b1468SMauro Carvalho Chehabowner field is set to NULL, the wait_lock is released and nothing more is 539*387b1468SMauro Carvalho Chehabneeded. 540*387b1468SMauro Carvalho Chehab 541*387b1468SMauro Carvalho ChehabIf there are waiters, then we need to wake one up. 542*387b1468SMauro Carvalho Chehab 543*387b1468SMauro Carvalho ChehabOn the wake up code, the pi_lock of the current owner is taken. The top 544*387b1468SMauro Carvalho Chehabwaiter of the lock is found and removed from the waiters tree of the mutex 545*387b1468SMauro Carvalho Chehabas well as the pi_waiters tree of the current owner. The "Has Waiters" bit is 546*387b1468SMauro Carvalho Chehabmarked to prevent lower priority tasks from stealing the lock. 547*387b1468SMauro Carvalho Chehab 548*387b1468SMauro Carvalho ChehabFinally we unlock the pi_lock of the pending owner and wake it up. 549*387b1468SMauro Carvalho Chehab 550*387b1468SMauro Carvalho Chehab 551*387b1468SMauro Carvalho ChehabContact 552*387b1468SMauro Carvalho Chehab------- 553*387b1468SMauro Carvalho Chehab 554*387b1468SMauro Carvalho ChehabFor updates on this document, please email Steven Rostedt <rostedt@goodmis.org> 555*387b1468SMauro Carvalho Chehab 556*387b1468SMauro Carvalho Chehab 557*387b1468SMauro Carvalho ChehabCredits 558*387b1468SMauro Carvalho Chehab------- 559*387b1468SMauro Carvalho Chehab 560*387b1468SMauro Carvalho ChehabAuthor: Steven Rostedt <rostedt@goodmis.org> 561*387b1468SMauro Carvalho Chehab 562*387b1468SMauro Carvalho ChehabUpdated: Alex Shi <alex.shi@linaro.org> - 7/6/2017 563*387b1468SMauro Carvalho Chehab 564*387b1468SMauro Carvalho ChehabOriginal Reviewers: 565*387b1468SMauro Carvalho Chehab Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and 566*387b1468SMauro Carvalho Chehab Randy Dunlap 567*387b1468SMauro Carvalho Chehab 568*387b1468SMauro Carvalho ChehabUpdate (7/6/2017) Reviewers: Steven Rostedt and Sebastian Siewior 569*387b1468SMauro Carvalho Chehab 570*387b1468SMauro Carvalho ChehabUpdates 571*387b1468SMauro Carvalho Chehab------- 572*387b1468SMauro Carvalho Chehab 573*387b1468SMauro Carvalho ChehabThis document was originally written for 2.6.17-rc3-mm1 574*387b1468SMauro Carvalho Chehabwas updated on 4.12 575