195ca6d73SMauro Carvalho Chehab================ 295ca6d73SMauro Carvalho ChehabFutex Requeue PI 395ca6d73SMauro Carvalho Chehab================ 495ca6d73SMauro Carvalho Chehab 595ca6d73SMauro Carvalho ChehabRequeueing of tasks from a non-PI futex to a PI futex requires 695ca6d73SMauro Carvalho Chehabspecial handling in order to ensure the underlying rt_mutex is never 795ca6d73SMauro Carvalho Chehableft without an owner if it has waiters; doing so would break the PI 8*8c7a729dSAlexander Aringboosting logic [see rt-mutex-design.rst] For the purposes of 995ca6d73SMauro Carvalho Chehabbrevity, this action will be referred to as "requeue_pi" throughout 1095ca6d73SMauro Carvalho Chehabthis document. Priority inheritance is abbreviated throughout as 1195ca6d73SMauro Carvalho Chehab"PI". 1295ca6d73SMauro Carvalho Chehab 1395ca6d73SMauro Carvalho ChehabMotivation 1495ca6d73SMauro Carvalho Chehab---------- 1595ca6d73SMauro Carvalho Chehab 1695ca6d73SMauro Carvalho ChehabWithout requeue_pi, the glibc implementation of 1795ca6d73SMauro Carvalho Chehabpthread_cond_broadcast() must resort to waking all the tasks waiting 1895ca6d73SMauro Carvalho Chehabon a pthread_condvar and letting them try to sort out which task 1995ca6d73SMauro Carvalho Chehabgets to run first in classic thundering-herd formation. An ideal 2095ca6d73SMauro Carvalho Chehabimplementation would wake the highest-priority waiter, and leave the 2195ca6d73SMauro Carvalho Chehabrest to the natural wakeup inherent in unlocking the mutex 2295ca6d73SMauro Carvalho Chehabassociated with the condvar. 2395ca6d73SMauro Carvalho Chehab 2495ca6d73SMauro Carvalho ChehabConsider the simplified glibc calls:: 2595ca6d73SMauro Carvalho Chehab 2695ca6d73SMauro Carvalho Chehab /* caller must lock mutex */ 2795ca6d73SMauro Carvalho Chehab pthread_cond_wait(cond, mutex) 2895ca6d73SMauro Carvalho Chehab { 2995ca6d73SMauro Carvalho Chehab lock(cond->__data.__lock); 3095ca6d73SMauro Carvalho Chehab unlock(mutex); 3195ca6d73SMauro Carvalho Chehab do { 3295ca6d73SMauro Carvalho Chehab unlock(cond->__data.__lock); 3395ca6d73SMauro Carvalho Chehab futex_wait(cond->__data.__futex); 3495ca6d73SMauro Carvalho Chehab lock(cond->__data.__lock); 3595ca6d73SMauro Carvalho Chehab } while(...) 3695ca6d73SMauro Carvalho Chehab unlock(cond->__data.__lock); 3795ca6d73SMauro Carvalho Chehab lock(mutex); 3895ca6d73SMauro Carvalho Chehab } 3995ca6d73SMauro Carvalho Chehab 4095ca6d73SMauro Carvalho Chehab pthread_cond_broadcast(cond) 4195ca6d73SMauro Carvalho Chehab { 4295ca6d73SMauro Carvalho Chehab lock(cond->__data.__lock); 4395ca6d73SMauro Carvalho Chehab unlock(cond->__data.__lock); 4495ca6d73SMauro Carvalho Chehab futex_requeue(cond->data.__futex, cond->mutex); 4595ca6d73SMauro Carvalho Chehab } 4695ca6d73SMauro Carvalho Chehab 4795ca6d73SMauro Carvalho ChehabOnce pthread_cond_broadcast() requeues the tasks, the cond->mutex 4895ca6d73SMauro Carvalho Chehabhas waiters. Note that pthread_cond_wait() attempts to lock the 4995ca6d73SMauro Carvalho Chehabmutex only after it has returned to user space. This will leave the 5095ca6d73SMauro Carvalho Chehabunderlying rt_mutex with waiters, and no owner, breaking the 5195ca6d73SMauro Carvalho Chehabpreviously mentioned PI-boosting algorithms. 5295ca6d73SMauro Carvalho Chehab 5395ca6d73SMauro Carvalho ChehabIn order to support PI-aware pthread_condvar's, the kernel needs to 5495ca6d73SMauro Carvalho Chehabbe able to requeue tasks to PI futexes. This support implies that 5595ca6d73SMauro Carvalho Chehabupon a successful futex_wait system call, the caller would return to 5695ca6d73SMauro Carvalho Chehabuser space already holding the PI futex. The glibc implementation 5795ca6d73SMauro Carvalho Chehabwould be modified as follows:: 5895ca6d73SMauro Carvalho Chehab 5995ca6d73SMauro Carvalho Chehab 6095ca6d73SMauro Carvalho Chehab /* caller must lock mutex */ 6195ca6d73SMauro Carvalho Chehab pthread_cond_wait_pi(cond, mutex) 6295ca6d73SMauro Carvalho Chehab { 6395ca6d73SMauro Carvalho Chehab lock(cond->__data.__lock); 6495ca6d73SMauro Carvalho Chehab unlock(mutex); 6595ca6d73SMauro Carvalho Chehab do { 6695ca6d73SMauro Carvalho Chehab unlock(cond->__data.__lock); 6795ca6d73SMauro Carvalho Chehab futex_wait_requeue_pi(cond->__data.__futex); 6895ca6d73SMauro Carvalho Chehab lock(cond->__data.__lock); 6995ca6d73SMauro Carvalho Chehab } while(...) 7095ca6d73SMauro Carvalho Chehab unlock(cond->__data.__lock); 7195ca6d73SMauro Carvalho Chehab /* the kernel acquired the mutex for us */ 7295ca6d73SMauro Carvalho Chehab } 7395ca6d73SMauro Carvalho Chehab 7495ca6d73SMauro Carvalho Chehab pthread_cond_broadcast_pi(cond) 7595ca6d73SMauro Carvalho Chehab { 7695ca6d73SMauro Carvalho Chehab lock(cond->__data.__lock); 7795ca6d73SMauro Carvalho Chehab unlock(cond->__data.__lock); 7895ca6d73SMauro Carvalho Chehab futex_requeue_pi(cond->data.__futex, cond->mutex); 7995ca6d73SMauro Carvalho Chehab } 8095ca6d73SMauro Carvalho Chehab 8195ca6d73SMauro Carvalho ChehabThe actual glibc implementation will likely test for PI and make the 8295ca6d73SMauro Carvalho Chehabnecessary changes inside the existing calls rather than creating new 8395ca6d73SMauro Carvalho Chehabcalls for the PI cases. Similar changes are needed for 8495ca6d73SMauro Carvalho Chehabpthread_cond_timedwait() and pthread_cond_signal(). 8595ca6d73SMauro Carvalho Chehab 8695ca6d73SMauro Carvalho ChehabImplementation 8795ca6d73SMauro Carvalho Chehab-------------- 8895ca6d73SMauro Carvalho Chehab 8995ca6d73SMauro Carvalho ChehabIn order to ensure the rt_mutex has an owner if it has waiters, it 9095ca6d73SMauro Carvalho Chehabis necessary for both the requeue code, as well as the waiting code, 9195ca6d73SMauro Carvalho Chehabto be able to acquire the rt_mutex before returning to user space. 9295ca6d73SMauro Carvalho ChehabThe requeue code cannot simply wake the waiter and leave it to 9395ca6d73SMauro Carvalho Chehabacquire the rt_mutex as it would open a race window between the 9495ca6d73SMauro Carvalho Chehabrequeue call returning to user space and the waiter waking and 9595ca6d73SMauro Carvalho Chehabstarting to run. This is especially true in the uncontended case. 9695ca6d73SMauro Carvalho Chehab 9795ca6d73SMauro Carvalho ChehabThe solution involves two new rt_mutex helper routines, 9895ca6d73SMauro Carvalho Chehabrt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which 9995ca6d73SMauro Carvalho Chehaballow the requeue code to acquire an uncontended rt_mutex on behalf 10095ca6d73SMauro Carvalho Chehabof the waiter and to enqueue the waiter on a contended rt_mutex. 10195ca6d73SMauro Carvalho ChehabTwo new system calls provide the kernel<->user interface to 10295ca6d73SMauro Carvalho Chehabrequeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI. 10395ca6d73SMauro Carvalho Chehab 10495ca6d73SMauro Carvalho ChehabFUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() 10595ca6d73SMauro Carvalho Chehaband pthread_cond_timedwait()) to block on the initial futex and wait 10695ca6d73SMauro Carvalho Chehabto be requeued to a PI-aware futex. The implementation is the 10795ca6d73SMauro Carvalho Chehabresult of a high-speed collision between futex_wait() and 10895ca6d73SMauro Carvalho Chehabfutex_lock_pi(), with some extra logic to check for the additional 10995ca6d73SMauro Carvalho Chehabwake-up scenarios. 11095ca6d73SMauro Carvalho Chehab 11195ca6d73SMauro Carvalho ChehabFUTEX_CMP_REQUEUE_PI is called by the waker 11295ca6d73SMauro Carvalho Chehab(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and 11395ca6d73SMauro Carvalho Chehabpossibly wake the waiting tasks. Internally, this system call is 11495ca6d73SMauro Carvalho Chehabstill handled by futex_requeue (by passing requeue_pi=1). Before 11595ca6d73SMauro Carvalho Chehabrequeueing, futex_requeue() attempts to acquire the requeue target 11695ca6d73SMauro Carvalho ChehabPI futex on behalf of the top waiter. If it can, this waiter is 11795ca6d73SMauro Carvalho Chehabwoken. futex_requeue() then proceeds to requeue the remaining 11895ca6d73SMauro Carvalho Chehabnr_wake+nr_requeue tasks to the PI futex, calling 11995ca6d73SMauro Carvalho Chehabrt_mutex_start_proxy_lock() prior to each requeue to prepare the 12095ca6d73SMauro Carvalho Chehabtask as a waiter on the underlying rt_mutex. It is possible that 12195ca6d73SMauro Carvalho Chehabthe lock can be acquired at this stage as well, if so, the next 12295ca6d73SMauro Carvalho Chehabwaiter is woken to finish the acquisition of the lock. 12395ca6d73SMauro Carvalho Chehab 12495ca6d73SMauro Carvalho ChehabFUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but 12595ca6d73SMauro Carvalho Chehabtheir sum is all that really matters. futex_requeue() will wake or 12695ca6d73SMauro Carvalho Chehabrequeue up to nr_wake + nr_requeue tasks. It will wake only as many 12795ca6d73SMauro Carvalho Chehabtasks as it can acquire the lock for, which in the majority of cases 12895ca6d73SMauro Carvalho Chehabshould be 0 as good programming practice dictates that the caller of 12995ca6d73SMauro Carvalho Chehabeither pthread_cond_broadcast() or pthread_cond_signal() acquire the 13095ca6d73SMauro Carvalho Chehabmutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that 13195ca6d73SMauro Carvalho Chehabnr_wake=1. nr_requeue should be INT_MAX for broadcast and 0 for 13295ca6d73SMauro Carvalho Chehabsignal. 133