====================
The robust futex ABI
====================

:Author: Started by Paul Jackson <pj@sgi.com>


Robust_futexes provide a mechanism, used in addition to normal futexes,
by which the kernel assists in cleaning up held locks on task exit.

The record of which futexes a thread is holding is kept on a linked
list in user space, where it can be updated efficiently as locks are
taken and dropped, without kernel intervention.  The only additional
kernel intervention required for robust_futexes above and beyond what is
required for futexes is:

 1) a one-time call, per thread, to tell the kernel where its list of
    held robust_futexes begins, and
 2) internal kernel code at exit, to handle any listed locks held
    by the exiting thread.

The existing normal futexes already provide a "Fast Userspace Locking"
mechanism, which handles uncontested locking without needing a system
call, and handles contested locking by maintaining a list of waiting
threads in the kernel.  Options on the sys_futex(2) system call support
waiting on a particular futex, and waking up the next waiter on a
particular futex.

For robust_futexes to work, the user code (typically in a library such
as glibc linked with the application) has to manage and place the
necessary list elements exactly as the kernel expects them.  If it fails
to do so, then improperly listed locks will not be cleaned up on exit,
probably causing deadlock or some other failure in the threads waiting
on the same locks.

A thread that anticipates possibly using robust_futexes should first
issue the system call::

    asmlinkage long
    sys_set_robust_list(struct robust_list_head __user *head, size_t len);

The pointer 'head' points to a structure in the thread's address space
consisting of three words.  Each word is 32 bits on 32 bit
architectures, or 64 bits on 64 bit architectures, in the local byte
order.  Each thread should have its own thread-private 'head'.

If a thread is running in 32 bit compatibility mode on a native 64 bit
kernel, then it can actually have two such structures: one using 32 bit
words for 32 bit compatibility mode, and one using 64 bit words for 64
bit native mode.  The kernel, if it is a 64 bit kernel supporting 32 bit
compatibility mode, will attempt to process both lists on each task
exit, if the corresponding sys_set_robust_list() call has been made to
set up that list.

  The first word in the memory structure at 'head' contains a
  pointer to a singly linked list of 'lock entries', one per lock,
  as described below.  If the list is empty, the pointer will point
  to itself, 'head'.  The last 'lock entry' points back to the 'head'.

  The second word, called 'offset', specifies the signed offset
  (positive or negative) from the address of a 'lock entry' to what
  will be called its 'lock word'.  The 'lock word' is always a 32 bit
  word, unlike the other words above.  The 'lock word' holds 2 flag
  bits in the upper 2 bits, and the thread ID (TID) of the thread
  holding the lock in the bottom 30 bits.  See further below for a
  description of the flag bits.

  The third word, called 'list_op_pending', contains a transient copy
  of the address of the 'lock entry' during list insertion and removal,
  and is needed to correctly resolve races should a thread exit while
  in the middle of a locking or unlocking operation.

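The three-word layout described above can be pictured with a small C
sketch.  The type and function names here (``lock_entry``,
``list_head_words``, ``head_init``, ``head_is_empty``) are assumptions
made for illustration, not the kernel's own declarations, and the sketch
assumes that ``long`` is as wide as a pointer, as it is on Linux.

```c
#include <stddef.h>

/* Illustrative mirror of a 'lock entry': a single word. */
struct lock_entry {
    struct lock_entry *next;            /* next entry, or back to the head */
};

/* Illustrative mirror of the three words found at 'head'. */
struct list_head_words {
    struct lock_entry list;             /* word 1: start of the list       */
    long offset;                        /* word 2: entry -> lock word      */
    struct lock_entry *list_op_pending; /* word 3: entry being added or
                                         * removed, if any                 */
};

/* Set up an empty list: the first word points back at 'head' itself. */
void head_init(struct list_head_words *head, long offset)
{
    head->list.next = &head->list;
    head->offset = offset;
    head->list_op_pending = NULL;
}

/* The list is empty when the first word points at the head. */
int head_is_empty(const struct list_head_words *head)
{
    return head->list.next == &head->list;
}
```

Because the first word sits at the start of the structure, the address
of 'head' and the address of its first word are the same, which is what
lets the last entry "point back to the head".
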
Each 'lock entry' on the singly linked list starting at 'head' consists
of just a single word, pointing to the next 'lock entry', or back to
'head' if there are no more entries.  In addition, near each 'lock
entry', at an offset from the 'lock entry' specified by the 'offset'
word, is one 'lock word'.

The 'lock word' is always 32 bits, and is intended to be the same 32 bit
lock variable used by the futex mechanism, in conjunction with
robust_futexes.  The kernel will only be able to wake up the next thread
waiting for a lock on a thread's exit if that next thread used the futex
mechanism to register the address of that 'lock word' with the kernel.

For each futex lock currently held by a thread, if it wants this
robust_futex support for exit cleanup of that lock, it should have one
'lock entry' on this list, with its associated 'lock word' at the
specified 'offset'.  Should a thread die while holding any such locks,
the kernel will walk this list, mark any such locks with a bit
indicating their holder died, and wake up the next thread waiting for
that lock using the futex mechanism.

When a thread has invoked the above system call to indicate it
anticipates using robust_futexes, the kernel stores the passed-in 'head'
pointer for that task.  The task may retrieve that value later on by
using the system call::

    asmlinkage long
    sys_get_robust_list(int pid, struct robust_list_head __user **head_ptr,
                        size_t __user *len_ptr);

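From user space, both calls can be reached through the generic
``syscall()`` wrapper.  The following is a minimal sketch, assuming a
Linux system whose ``<sys/syscall.h>`` defines ``SYS_set_robust_list``
and ``SYS_get_robust_list``; the wrapper names ``register_robust_list``
and ``query_own_robust_list`` are hypothetical.  On other systems the
sketch simply reports failure.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Register 'head' ('len' bytes long) as the calling thread's list.
 * Caution: this replaces whatever list the C library may already have
 * registered for this thread.
 */
long register_robust_list(void *head, size_t len)
{
#ifdef SYS_set_robust_list
    return syscall(SYS_set_robust_list, head, len);
#else
    (void)head;
    (void)len;
    return -1;      /* not available on this system */
#endif
}

/*
 * Fetch the 'head' pointer and length recorded for the calling thread;
 * a pid of 0 selects the caller.  Returns 0 on success, -1 on failure.
 */
long query_own_robust_list(void **head_out, size_t *len_out)
{
#ifdef SYS_get_robust_list
    return syscall(SYS_get_robust_list, 0, head_out, len_out);
#else
    (void)head_out;
    (void)len_out;
    return -1;      /* not available on this system */
#endif
}
```

A threading library normally makes the registration call once per
thread on the application's behalf, so application code rarely needs to
call it directly.
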
It is anticipated that threads will use robust_futexes embedded in
larger, user level locking structures, one per lock.  The kernel
robust_futex mechanism doesn't care what else is in that structure, so
long as the 'offset' to the 'lock word' is the same for all
robust_futexes used by that thread.  The thread should link those locks
it currently holds using the 'lock entry' pointers.  It may also have
other links between the locks, such as the reverse side of a doubly
linked list, but that doesn't matter to the kernel.

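As an illustration of such an embedding, the sketch below defines a
hypothetical user level lock, ``struct my_lock``, and derives the single
'offset' value shared by all of its instances.  All names are
assumptions made for this example; only the relationship between the
'lock entry', the 'offset' and the 'lock word' comes from the text
above.

```c
#include <stddef.h>
#include <stdint.h>

struct lock_entry {
    struct lock_entry *next;
};

/* Hypothetical user level lock; the kernel only looks at 'entry'
 * and, through 'offset', at 'word'. */
struct my_lock {
    uint32_t word;              /* the 32 bit 'lock word'               */
    int recursion_count;        /* private bookkeeping, ignored         */
    struct lock_entry entry;    /* the 'lock entry' linked from 'head'  */
    struct lock_entry *prev;    /* private back link, ignored           */
};

/* Signed distance from 'entry' to 'word'.  It is negative here because
 * 'word' precedes 'entry' in the structure. */
#define MY_LOCK_OFFSET \
    ((long)offsetof(struct my_lock, word) - \
     (long)offsetof(struct my_lock, entry))

/* Locate the 'lock word' that belongs to a given 'lock entry'. */
uint32_t *lock_word_of(struct lock_entry *entry, long offset)
{
    return (uint32_t *)((char *)entry + offset);
}
```

Since the same ``MY_LOCK_OFFSET`` applies to every ``struct my_lock``,
one value stored in the 'offset' word of 'head' is enough for all locks
of this type held by the thread.
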
By keeping its locks linked this way, on a list starting with a 'head'
pointer known to the kernel, the kernel can provide to a thread the
essential service available for robust_futexes, which is to help clean
up locks held at the time of a (perhaps unexpected) exit.

Actual locking and unlocking, during normal operations, is handled
entirely by user level code in the contending threads, and by the
existing futex mechanism to wait for locks and wake up waiters.  The
kernel's only essential involvement in robust_futexes is to remember
where the list 'head' is, and to walk the list on thread exit, handling
locks still held by the departing thread, as described below.

There may exist thousands of futex lock structures in a thread's shared
memory, on various data structures, at a given point in time. Only those
lock structures for locks currently held by that thread should be on
that thread's robust_futex linked lock list at a given time.

A given futex lock structure in a user shared memory region may be held
at different times by any of the threads with access to that region. The
thread currently holding such a lock, if any, is identified by that
thread's TID in the lower 30 bits of the 'lock word'.

When adding or removing a lock from its list of held locks, in order for
the kernel to correctly handle lock cleanup regardless of when the task
exits (perhaps it gets an unexpected signal 9 in the middle of
manipulating this list), the user code must observe the following
protocol on 'lock entry' insertion and removal:

On insertion:

 1) set the 'list_op_pending' word to the address of the 'lock entry'
    to be inserted,
 2) acquire the futex lock,
 3) add the lock entry, with its thread ID (TID) in the bottom 30 bits
    of the 'lock word', to the linked list starting at 'head', and
 4) clear the 'list_op_pending' word.

On removal:

 1) set the 'list_op_pending' word to the address of the 'lock entry'
    to be removed,
 2) remove the lock entry for this lock from the 'head' list,
 3) release the futex lock, and
 4) clear the 'list_op_pending' word.

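The two protocols can be sketched in C as follows.  This is a
simplified illustration rather than a complete lock: it handles only
the uncontended case, never calls into the futex mechanism to wait or
wake, and assumes the lock being released is on the list.  The names
``robust_trylock``, ``robust_unlock`` and ``ORDER_BARRIER`` are
hypothetical, and a C11 signal fence stands in for whatever barrier a
real library would use to keep the steps in order.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct lock_entry { struct lock_entry *next; };

struct list_head_words {
    struct lock_entry list;
    long offset;
    struct lock_entry *list_op_pending;
};

struct my_lock {
    struct lock_entry entry;
    _Atomic uint32_t word;      /* 0 when free, else the owner's TID */
};

/* Keep the compiler from reordering the protocol steps. */
#define ORDER_BARRIER() atomic_signal_fence(memory_order_seq_cst)

/* Insertion protocol; returns 1 if the lock was taken, 0 otherwise. */
int robust_trylock(struct list_head_words *head, struct my_lock *lk,
                   uint32_t tid)
{
    uint32_t expected = 0;

    head->list_op_pending = &lk->entry;             /* step 1 */
    ORDER_BARRIER();
    if (!atomic_compare_exchange_strong(&lk->word, &expected, tid)) {
        head->list_op_pending = NULL;               /* nothing taken */
        return 0;
    }                                               /* step 2 done */
    ORDER_BARRIER();
    lk->entry.next = head->list.next;               /* step 3 */
    head->list.next = &lk->entry;
    ORDER_BARRIER();
    head->list_op_pending = NULL;                   /* step 4 */
    return 1;
}

/* Removal protocol; 'lk' must currently be on the list. */
void robust_unlock(struct list_head_words *head, struct my_lock *lk)
{
    struct lock_entry *prev = &head->list;

    head->list_op_pending = &lk->entry;             /* step 1 */
    ORDER_BARRIER();
    while (prev->next != &lk->entry)                /* step 2 */
        prev = prev->next;
    prev->next = lk->entry.next;
    ORDER_BARRIER();
    atomic_store(&lk->word, 0);                     /* step 3 */
    ORDER_BARRIER();
    head->list_op_pending = NULL;                   /* step 4 */
}
```

In both directions 'list_op_pending' covers the window in which the
'lock word' and the list disagree, so an exit at any point still leaves
the kernel a way to find the lock.
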
On exit, the kernel will consider the address stored in
'list_op_pending' and the address of each 'lock entry' found by walking
the list starting at 'head'.  For each such address, if the bottom 30
bits of the 'lock word' at offset 'offset' from that address equal the
exiting thread's TID, then the kernel will do two things:

 1) if bit 31 (0x80000000) is set in that word, then attempt a futex
    wakeup on that address, which will waken the next thread that has
    used the futex mechanism to wait on that address, and
 2) atomically set bit 30 (0x40000000) in the 'lock word'.

In the above, bit 31 was set by futex waiters on that lock to indicate
they were waiting, and bit 30 is set by the kernel to indicate that the
lock owner died holding the lock.

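The bit layout of the 'lock word' can be summarised with a few helper
functions.  The macro and function names below are chosen for this
sketch; the values match the bit positions given above.

```c
#include <stdint.h>

#define LOCK_WAITERS     0x80000000u    /* bit 31: someone is waiting   */
#define LOCK_OWNER_DIED  0x40000000u    /* bit 30: owner died with lock */
#define LOCK_TID_MASK    0x3fffffffu    /* bits 0-29: owner's TID       */

/* TID of the current holder, or 0 if the TID field is empty. */
uint32_t lock_owner_tid(uint32_t word)
{
    return word & LOCK_TID_MASK;
}

/* Non-zero if a waiter has flagged itself on this lock. */
int lock_has_waiters(uint32_t word)
{
    return (word & LOCK_WAITERS) != 0;
}

/* Non-zero if the kernel marked the lock as abandoned by a dead owner. */
int lock_owner_died(uint32_t word)
{
    return (word & LOCK_OWNER_DIED) != 0;
}
```

A thread that acquires a lock and finds ``lock_owner_died()`` true knows
that the data protected by the lock may be in an inconsistent state.
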
The kernel exit code will silently stop scanning the list further if at
any point:

 1) the 'head' pointer or a subsequent linked list pointer
    is not a valid address of a user space word,
 2) the calculated location of the 'lock word' (address plus
    'offset') is not the valid address of a 32 bit user space
    word, or
 3) the list contains more than 1 million (subject to
    future kernel configuration changes) elements.

When the kernel sees a list entry whose 'lock word' doesn't have the
current thread's TID in the lower 30 bits, it does nothing with that
entry, and goes on to the next entry.

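The exit-time behaviour described above can be modelled in user space
as follows.  This is only a model of the description in this document,
not the kernel's code: a NULL pointer stands in for an invalid address,
the element limit is passed in as the hypothetical parameter
``max_entries``, the model counts wakeups instead of performing them,
and it skips the pending entry during the walk so that an entry which
is both listed and pending is handled once.  All names are assumptions
made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define LOCK_WAITERS     0x80000000u
#define LOCK_OWNER_DIED  0x40000000u
#define LOCK_TID_MASK    0x3fffffffu

struct lock_entry { struct lock_entry *next; };

struct list_head_words {
    struct lock_entry list;
    long offset;
    struct lock_entry *list_op_pending;
};

struct demo_lock {
    struct lock_entry entry;
    uint32_t word;
};

/* Handle one entry; returns 1 if a wakeup would be attempted. */
int model_handle_entry(struct lock_entry *e, long offset, uint32_t tid)
{
    uint32_t *word = (uint32_t *)((char *)e + offset);
    int wake;

    if ((*word & LOCK_TID_MASK) != tid)
        return 0;                   /* held by someone else: skip it */
    wake = (*word & LOCK_WAITERS) != 0;
    *word |= LOCK_OWNER_DIED;       /* mark: owner died holding lock */
    return wake;
}

/* Walk the list for an exiting thread 'tid'; returns the number of
 * wakeups that would be attempted. */
int model_exit_walk(struct list_head_words *head, uint32_t tid,
                    unsigned max_entries)
{
    struct lock_entry *pending = head->list_op_pending;
    struct lock_entry *e;
    unsigned seen = 0;
    int wakeups = 0;

    for (e = head->list.next; e != NULL && e != &head->list; e = e->next) {
        if (seen++ >= max_entries)
            break;                  /* too many elements: stop quietly */
        if (e != pending)
            wakeups += model_handle_entry(e, head->offset, tid);
    }
    if (pending != NULL)
        wakeups += model_handle_entry(pending, head->offset, tid);
    return wakeups;
}
```

For example, with one listed lock owned by the exiting thread and
flagged as having waiters, one listed lock owned by another thread, and
one pending lock owned by the exiting thread, the model reports a single
wakeup and marks the two owned locks with bit 30.
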