xref: /openbmc/linux/Documentation/locking/futex-requeue-pi.rst (revision 762f99f4f3cb41a775b5157dd761217beba65873)
195ca6d73SMauro Carvalho Chehab================
295ca6d73SMauro Carvalho ChehabFutex Requeue PI
395ca6d73SMauro Carvalho Chehab================
495ca6d73SMauro Carvalho Chehab
595ca6d73SMauro Carvalho ChehabRequeueing of tasks from a non-PI futex to a PI futex requires
695ca6d73SMauro Carvalho Chehabspecial handling in order to ensure the underlying rt_mutex is never
795ca6d73SMauro Carvalho Chehableft without an owner if it has waiters; doing so would break the PI
8*8c7a729dSAlexander Aringboosting logic [see rt-mutex-design.rst] For the purposes of
995ca6d73SMauro Carvalho Chehabbrevity, this action will be referred to as "requeue_pi" throughout
1095ca6d73SMauro Carvalho Chehabthis document.  Priority inheritance is abbreviated throughout as
1195ca6d73SMauro Carvalho Chehab"PI".
1295ca6d73SMauro Carvalho Chehab
1395ca6d73SMauro Carvalho ChehabMotivation
1495ca6d73SMauro Carvalho Chehab----------
1595ca6d73SMauro Carvalho Chehab
1695ca6d73SMauro Carvalho ChehabWithout requeue_pi, the glibc implementation of
1795ca6d73SMauro Carvalho Chehabpthread_cond_broadcast() must resort to waking all the tasks waiting
1895ca6d73SMauro Carvalho Chehabon a pthread_condvar and letting them try to sort out which task
1995ca6d73SMauro Carvalho Chehabgets to run first in classic thundering-herd formation.  An ideal
2095ca6d73SMauro Carvalho Chehabimplementation would wake the highest-priority waiter, and leave the
2195ca6d73SMauro Carvalho Chehabrest to the natural wakeup inherent in unlocking the mutex
2295ca6d73SMauro Carvalho Chehabassociated with the condvar.
2395ca6d73SMauro Carvalho Chehab
2495ca6d73SMauro Carvalho ChehabConsider the simplified glibc calls::
2595ca6d73SMauro Carvalho Chehab
2695ca6d73SMauro Carvalho Chehab	/* caller must lock mutex */
2795ca6d73SMauro Carvalho Chehab	pthread_cond_wait(cond, mutex)
2895ca6d73SMauro Carvalho Chehab	{
2995ca6d73SMauro Carvalho Chehab		lock(cond->__data.__lock);
3095ca6d73SMauro Carvalho Chehab		unlock(mutex);
3195ca6d73SMauro Carvalho Chehab		do {
3295ca6d73SMauro Carvalho Chehab		unlock(cond->__data.__lock);
3395ca6d73SMauro Carvalho Chehab		futex_wait(cond->__data.__futex);
3495ca6d73SMauro Carvalho Chehab		lock(cond->__data.__lock);
3595ca6d73SMauro Carvalho Chehab		} while(...)
3695ca6d73SMauro Carvalho Chehab		unlock(cond->__data.__lock);
3795ca6d73SMauro Carvalho Chehab		lock(mutex);
3895ca6d73SMauro Carvalho Chehab	}
3995ca6d73SMauro Carvalho Chehab
4095ca6d73SMauro Carvalho Chehab	pthread_cond_broadcast(cond)
4195ca6d73SMauro Carvalho Chehab	{
4295ca6d73SMauro Carvalho Chehab		lock(cond->__data.__lock);
4395ca6d73SMauro Carvalho Chehab		unlock(cond->__data.__lock);
4495ca6d73SMauro Carvalho Chehab		futex_requeue(cond->data.__futex, cond->mutex);
4595ca6d73SMauro Carvalho Chehab	}
4695ca6d73SMauro Carvalho Chehab
4795ca6d73SMauro Carvalho ChehabOnce pthread_cond_broadcast() requeues the tasks, the cond->mutex
4895ca6d73SMauro Carvalho Chehabhas waiters. Note that pthread_cond_wait() attempts to lock the
4995ca6d73SMauro Carvalho Chehabmutex only after it has returned to user space.  This will leave the
5095ca6d73SMauro Carvalho Chehabunderlying rt_mutex with waiters, and no owner, breaking the
5195ca6d73SMauro Carvalho Chehabpreviously mentioned PI-boosting algorithms.
5295ca6d73SMauro Carvalho Chehab
5395ca6d73SMauro Carvalho ChehabIn order to support PI-aware pthread_condvar's, the kernel needs to
5495ca6d73SMauro Carvalho Chehabbe able to requeue tasks to PI futexes.  This support implies that
5595ca6d73SMauro Carvalho Chehabupon a successful futex_wait system call, the caller would return to
5695ca6d73SMauro Carvalho Chehabuser space already holding the PI futex.  The glibc implementation
5795ca6d73SMauro Carvalho Chehabwould be modified as follows::
5895ca6d73SMauro Carvalho Chehab
5995ca6d73SMauro Carvalho Chehab
6095ca6d73SMauro Carvalho Chehab	/* caller must lock mutex */
6195ca6d73SMauro Carvalho Chehab	pthread_cond_wait_pi(cond, mutex)
6295ca6d73SMauro Carvalho Chehab	{
6395ca6d73SMauro Carvalho Chehab		lock(cond->__data.__lock);
6495ca6d73SMauro Carvalho Chehab		unlock(mutex);
6595ca6d73SMauro Carvalho Chehab		do {
6695ca6d73SMauro Carvalho Chehab		unlock(cond->__data.__lock);
6795ca6d73SMauro Carvalho Chehab		futex_wait_requeue_pi(cond->__data.__futex);
6895ca6d73SMauro Carvalho Chehab		lock(cond->__data.__lock);
6995ca6d73SMauro Carvalho Chehab		} while(...)
7095ca6d73SMauro Carvalho Chehab		unlock(cond->__data.__lock);
7195ca6d73SMauro Carvalho Chehab		/* the kernel acquired the mutex for us */
7295ca6d73SMauro Carvalho Chehab	}
7395ca6d73SMauro Carvalho Chehab
7495ca6d73SMauro Carvalho Chehab	pthread_cond_broadcast_pi(cond)
7595ca6d73SMauro Carvalho Chehab	{
7695ca6d73SMauro Carvalho Chehab		lock(cond->__data.__lock);
7795ca6d73SMauro Carvalho Chehab		unlock(cond->__data.__lock);
7895ca6d73SMauro Carvalho Chehab		futex_requeue_pi(cond->data.__futex, cond->mutex);
7995ca6d73SMauro Carvalho Chehab	}
8095ca6d73SMauro Carvalho Chehab
8195ca6d73SMauro Carvalho ChehabThe actual glibc implementation will likely test for PI and make the
8295ca6d73SMauro Carvalho Chehabnecessary changes inside the existing calls rather than creating new
8395ca6d73SMauro Carvalho Chehabcalls for the PI cases.  Similar changes are needed for
8495ca6d73SMauro Carvalho Chehabpthread_cond_timedwait() and pthread_cond_signal().
8595ca6d73SMauro Carvalho Chehab
8695ca6d73SMauro Carvalho ChehabImplementation
8795ca6d73SMauro Carvalho Chehab--------------
8895ca6d73SMauro Carvalho Chehab
8995ca6d73SMauro Carvalho ChehabIn order to ensure the rt_mutex has an owner if it has waiters, it
9095ca6d73SMauro Carvalho Chehabis necessary for both the requeue code, as well as the waiting code,
9195ca6d73SMauro Carvalho Chehabto be able to acquire the rt_mutex before returning to user space.
9295ca6d73SMauro Carvalho ChehabThe requeue code cannot simply wake the waiter and leave it to
9395ca6d73SMauro Carvalho Chehabacquire the rt_mutex as it would open a race window between the
9495ca6d73SMauro Carvalho Chehabrequeue call returning to user space and the waiter waking and
9595ca6d73SMauro Carvalho Chehabstarting to run.  This is especially true in the uncontended case.
9695ca6d73SMauro Carvalho Chehab
9795ca6d73SMauro Carvalho ChehabThe solution involves two new rt_mutex helper routines,
9895ca6d73SMauro Carvalho Chehabrt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which
9995ca6d73SMauro Carvalho Chehaballow the requeue code to acquire an uncontended rt_mutex on behalf
10095ca6d73SMauro Carvalho Chehabof the waiter and to enqueue the waiter on a contended rt_mutex.
10195ca6d73SMauro Carvalho ChehabTwo new system calls provide the kernel<->user interface to
10295ca6d73SMauro Carvalho Chehabrequeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI.
10395ca6d73SMauro Carvalho Chehab
10495ca6d73SMauro Carvalho ChehabFUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait()
10595ca6d73SMauro Carvalho Chehaband pthread_cond_timedwait()) to block on the initial futex and wait
10695ca6d73SMauro Carvalho Chehabto be requeued to a PI-aware futex.  The implementation is the
10795ca6d73SMauro Carvalho Chehabresult of a high-speed collision between futex_wait() and
10895ca6d73SMauro Carvalho Chehabfutex_lock_pi(), with some extra logic to check for the additional
10995ca6d73SMauro Carvalho Chehabwake-up scenarios.
11095ca6d73SMauro Carvalho Chehab
11195ca6d73SMauro Carvalho ChehabFUTEX_CMP_REQUEUE_PI is called by the waker
11295ca6d73SMauro Carvalho Chehab(pthread_cond_broadcast() and pthread_cond_signal()) to requeue and
11395ca6d73SMauro Carvalho Chehabpossibly wake the waiting tasks. Internally, this system call is
11495ca6d73SMauro Carvalho Chehabstill handled by futex_requeue (by passing requeue_pi=1).  Before
11595ca6d73SMauro Carvalho Chehabrequeueing, futex_requeue() attempts to acquire the requeue target
11695ca6d73SMauro Carvalho ChehabPI futex on behalf of the top waiter.  If it can, this waiter is
11795ca6d73SMauro Carvalho Chehabwoken.  futex_requeue() then proceeds to requeue the remaining
11895ca6d73SMauro Carvalho Chehabnr_wake+nr_requeue tasks to the PI futex, calling
11995ca6d73SMauro Carvalho Chehabrt_mutex_start_proxy_lock() prior to each requeue to prepare the
12095ca6d73SMauro Carvalho Chehabtask as a waiter on the underlying rt_mutex.  It is possible that
12195ca6d73SMauro Carvalho Chehabthe lock can be acquired at this stage as well, if so, the next
12295ca6d73SMauro Carvalho Chehabwaiter is woken to finish the acquisition of the lock.
12395ca6d73SMauro Carvalho Chehab
12495ca6d73SMauro Carvalho ChehabFUTEX_CMP_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but
12595ca6d73SMauro Carvalho Chehabtheir sum is all that really matters.  futex_requeue() will wake or
12695ca6d73SMauro Carvalho Chehabrequeue up to nr_wake + nr_requeue tasks.  It will wake only as many
12795ca6d73SMauro Carvalho Chehabtasks as it can acquire the lock for, which in the majority of cases
12895ca6d73SMauro Carvalho Chehabshould be 0 as good programming practice dictates that the caller of
12995ca6d73SMauro Carvalho Chehabeither pthread_cond_broadcast() or pthread_cond_signal() acquire the
13095ca6d73SMauro Carvalho Chehabmutex prior to making the call. FUTEX_CMP_REQUEUE_PI requires that
13195ca6d73SMauro Carvalho Chehabnr_wake=1.  nr_requeue should be INT_MAX for broadcast and 0 for
13295ca6d73SMauro Carvalho Chehabsignal.
133