$Id: README.Locking,v 1.12 2005/04/13 13:22:35 dwmw2 Exp $

        JFFS2 LOCKING DOCUMENTATION
        ---------------------------

At least theoretically, JFFS2 does not require the Big Kernel Lock
(BKL), which was always helpfully obtained for it by Linux 2.4 VFS
code. It has its own locking, as described below.

This document attempts to describe the existing locking rules for
JFFS2. It is not expected to remain perfectly up to date, but ought to
be fairly close.


        alloc_sem
        ---------

The alloc_sem is a per-filesystem semaphore, used primarily to ensure
contiguous allocation of space on the medium. It is automatically
obtained during space allocations (jffs2_reserve_space()) and freed
upon write completion (jffs2_complete_reservation()). Note that
the garbage collector will obtain this right at the beginning of
jffs2_garbage_collect_pass() and release it at the end, thereby
preventing any other write activity on the file system during a
garbage collect pass.

When writing new nodes, the alloc_sem must be held until the new nodes
have been properly linked into the data structures for the inode to
which they belong.
This is for the benefit of NAND flash - adding new
nodes to an inode may obsolete old ones, and by holding the alloc_sem
until this happens we ensure that any data in the write-buffer at the
time this happens are part of the new node, not just something that
was written afterwards. Hence, we can ensure the newly-obsoleted nodes
don't actually get erased until the write-buffer has been flushed to
the medium.

With the introduction of NAND flash support and the write-buffer,
the alloc_sem is also used to protect the wbuf-related members of the
jffs2_sb_info structure. Atomically reading the wbuf_len member to see
if the wbuf is currently holding any data is permitted, though.

Ordering constraints: See f->sem.


        File Semaphore f->sem
        ---------------------

This is the JFFS2-internal equivalent of the inode semaphore i->i_sem.
It protects the contents of the jffs2_inode_info private inode data,
including the linked list of node fragments (but see the notes below on
erase_completion_lock), etc.

The reason that the i_sem itself isn't used for this purpose is to
avoid deadlocks with garbage collection -- the VFS will lock the i_sem
before calling a function which may need to allocate space.
The allocation may trigger garbage-collection, which may need to move
a node belonging to the inode which was locked in the first place by
the VFS. If the garbage collection code were to attempt to lock the
i_sem of the inode from which it's garbage-collecting a physical node,
this would lead to deadlock, unless we played games with unlocking the
i_sem before calling the space allocation functions.

Instead of playing such games, we just have an extra internal
semaphore, which is obtained by the garbage collection code and also
by the normal file system code _after_ allocation of space.

Ordering constraints:

        1. Never attempt to allocate space or lock alloc_sem with
           any f->sem held.
        2. Never attempt to lock two file semaphores in one thread.
           No ordering rules have been made for doing so.


        erase_completion_lock spinlock
        ------------------------------

This is used to serialise access to the eraseblock lists, to the
per-eraseblock lists of physical jffs2_raw_node_ref structures, and
(NB) the per-inode list of physical nodes. The latter is a special
case - see below.

As the MTD API no longer permits erase-completion callback functions
to be called from bottom-half (timer) context (on the basis that nobody
ever actually implemented such a thing), it's now sufficient to use
a simple spin_lock() rather than spin_lock_bh().

Note that the per-inode list of physical nodes (f->nodes) is a special
case. Any changes to _valid_ nodes (i.e. (->flash_offset & 1) == 0) in
the list are protected by the file semaphore f->sem. But the erase
code may remove _obsolete_ nodes from the list while holding only the
erase_completion_lock. So you can walk the list only while holding the
erase_completion_lock, and can drop the lock temporarily mid-walk as
long as the pointer you're holding is to a _valid_ node, not an
obsolete one.

The erase_completion_lock is also used to protect the c->gc_task
pointer when the garbage collection thread exits. The code to kill the
GC thread locks it, sends the signal, then unlocks it - while the GC
thread itself locks it, zeroes c->gc_task, then unlocks on the exit path.


        inocache_lock spinlock
        ----------------------

This spinlock protects the hashed list (c->inocache_list) of the
in-core jffs2_inode_cache objects (each inode in JFFS2 has the
corresponding jffs2_inode_cache object).
So, the inocache_lock
has to be locked while walking the c->inocache_list hash buckets.

This spinlock also covers allocation of new inode numbers, which is
currently just '++c->highest_ino', but might one day get more complicated
if we need to deal with wrapping after 4 milliard inode numbers are used.

Note, the f->sem guarantees that the corresponding jffs2_inode_cache
will not be removed. So, it is allowed to access it without locking
the inocache_lock spinlock.

Ordering constraints:

        If both erase_completion_lock and inocache_lock are needed, the
        c->erase_completion_lock has to be acquired first.


        erase_free_sem
        --------------

This semaphore is only used by the erase code which frees obsolete
node references and the jffs2_garbage_collect_deletion_dirent()
function. The latter function on NAND flash must read _obsolete_ nodes
to determine whether the 'deletion dirent' under consideration can be
discarded or whether it is still required to show that an inode has
been unlinked.
Because reading from the flash may sleep, the
erase_completion_lock cannot be held, so an alternative, more
heavyweight lock was required to prevent the erase code from freeing
the jffs2_raw_node_ref structures in question while the garbage
collection code is looking at them.

Suggestions for alternative solutions to this problem would be welcomed.


        wbuf_sem
        --------

This read/write semaphore protects against concurrent access to the
write-behind buffer ('wbuf') used for flash chips where we must write
in blocks. It protects both the contents of the wbuf and the metadata
which indicates which flash region (if any) is currently covered by
the buffer.

Ordering constraints:
        Lock wbuf_sem last, after the alloc_sem and/or f->sem.


        c->xattr_sem
        ------------

This read/write semaphore protects against concurrent access to the
xattr-related objects, which include data in the superblock and ic->xref.
On read-only paths, taking the semaphore for write excludes more than is
needed; taking it for read is enough. But you must hold it for write
when updating, creating or deleting any xattr-related object.

Once xattr_sem is released, there is no guarantee that those objects
still exist.
Thus, a sequence of operations often has to be retried
when it turns out that such an object must be updated while only the
read semaphore is held. For example, do_jffs2_getxattr() first takes
the semaphore for read to scan the xref and xdatum. If it then needs to
load a name/value pair from the medium, it releases the read semaphore
and retries the whole process with the semaphore held for write.

Ordering constraints:
        Lock xattr_sem last, after the alloc_sem.