	JFFS2 LOCKING DOCUMENTATION
	---------------------------

At least theoretically, JFFS2 does not require the Big Kernel Lock
(BKL), which was always helpfully obtained for it by Linux 2.4 VFS
code. It has its own locking, as described below.

This document attempts to describe the existing locking rules for
JFFS2. It is not expected to remain perfectly up to date, but ought to
be fairly close.


	alloc_sem
	---------

The alloc_sem is a per-filesystem mutex, used primarily to ensure
contiguous allocation of space on the medium. It is automatically
obtained during space allocations (jffs2_reserve_space()) and freed
upon write completion (jffs2_complete_reservation()). Note that
the garbage collector will obtain this right at the beginning of
jffs2_garbage_collect_pass() and release it at the end, thereby
preventing any other write activity on the file system during a
garbage collect pass.

When writing new nodes, the alloc_sem must be held until the new nodes
have been properly linked into the data structures for the inode to
which they belong.
This is for the benefit of NAND flash - adding new
nodes to an inode may obsolete old ones, and by holding the alloc_sem
until this happens we ensure that any data in the write-buffer at the
time this happens are part of the new node, not just something that
was written afterwards. Hence, we can ensure the newly-obsoleted nodes
don't actually get erased until the write-buffer has been flushed to
the medium.

With the introduction of NAND flash support and the write-buffer,
the alloc_sem is also used to protect the wbuf-related members of the
jffs2_sb_info structure. Atomically reading the wbuf_len member to see
if the wbuf is currently holding any data is permitted, though.

Ordering constraints: See f->sem.


	File Mutex f->sem
	-----------------

This is the JFFS2-internal equivalent of the inode mutex i->i_sem.
It protects the contents of the jffs2_inode_info private inode data,
including the linked list of node fragments (but see the notes below on
erase_completion_lock), etc.

The reason that the i_sem itself isn't used for this purpose is to
avoid deadlocks with garbage collection -- the VFS will lock the i_sem
before calling a function which may need to allocate space.
The allocation may trigger garbage-collection, which may need to
move a node belonging to the inode which was locked in the first place
by the VFS. If the garbage collection code were to attempt to lock the
i_sem of the inode from which it's garbage-collecting a physical node,
this would lead to deadlock, unless we played games with unlocking the
i_sem before calling the space allocation functions.

Instead of playing such games, we just have an extra internal
mutex, which is obtained by the garbage collection code and also
by the normal file system code _after_ allocation of space.

Ordering constraints:

	1. Never attempt to allocate space or lock alloc_sem with
	   any f->sem held.
	2. Never attempt to lock two file mutexes in one thread.
	   No ordering rules have been made for doing so.


	erase_completion_lock spinlock
	------------------------------

This is used to serialise access to the eraseblock lists, to the
per-eraseblock lists of physical jffs2_raw_node_ref structures, and
(NB) the per-inode list of physical nodes. The latter is a special
case - see below.

As the MTD API no longer permits erase-completion callback functions
to be called from bottom-half (timer) context (on the basis that nobody
ever actually implemented such a thing), it's now sufficient to use
a simple spin_lock() rather than spin_lock_bh().

Note that the per-inode list of physical nodes (f->nodes) is a special
case. Any changes to _valid_ nodes (i.e. (->flash_offset & 1) == 0) in
the list are protected by the file mutex f->sem. But the erase code
may remove _obsolete_ nodes from the list while holding only the
erase_completion_lock. So you can walk the list only while holding the
erase_completion_lock, and can drop the lock temporarily mid-walk as
long as the pointer you're holding is to a _valid_ node, not an
obsolete one.

The erase_completion_lock is also used to protect the c->gc_task
pointer when the garbage collection thread exits. The code to kill the
GC thread locks it, sends the signal, then unlocks it - while the GC
thread itself locks it, zeroes c->gc_task, then unlocks on the exit path.


	inocache_lock spinlock
	----------------------

This spinlock protects the hashed list (c->inocache_list) of the
in-core jffs2_inode_cache objects (each inode in JFFS2 has a
corresponding jffs2_inode_cache object).
So, the inocache_lock
has to be locked while walking the c->inocache_list hash buckets.

This spinlock also covers allocation of new inode numbers, which is
currently just '++c->highest_ino', but might one day get more complicated
if we need to deal with wrapping after 4 milliard inode numbers are used.

Note, the f->sem guarantees that the corresponding jffs2_inode_cache
will not be removed. So, it may be accessed without locking
the inocache_lock spinlock.

Ordering constraints:

	If both erase_completion_lock and inocache_lock are needed, the
	c->erase_completion_lock has to be acquired first.


	erase_free_sem
	--------------

This mutex is only used by the erase code which frees obsolete node
references and the jffs2_garbage_collect_deletion_dirent() function.
The latter function on NAND flash must read _obsolete_ nodes to
determine whether the 'deletion dirent' under consideration can be
discarded or whether it is still required to show that an inode has
been unlinked.
Because reading from the flash may sleep, the
erase_completion_lock cannot be held, so an alternative, more
heavyweight lock was required to prevent the erase code from freeing
the jffs2_raw_node_ref structures in question while the garbage
collection code is looking at them.

Suggestions for alternative solutions to this problem would be welcomed.


	wbuf_sem
	--------

This read/write semaphore protects against concurrent access to the
write-behind buffer ('wbuf') used for flash chips where we must write
in blocks. It protects both the contents of the wbuf and the metadata
which indicates which flash region (if any) is currently covered by
the buffer.

Ordering constraints:
	Lock wbuf_sem last, after the alloc_sem and/or f->sem.


	c->xattr_sem
	------------

This read/write semaphore protects against concurrent access to the
xattr-related objects, which include data in the superblock and
ic->xref. On read-only paths, taking the semaphore for writing is more
exclusion than needed; taking it for reading is enough. But you must
hold it for writing when updating, creating or deleting any
xattr-related object.

Once xattr_sem is released, there is no guarantee that those objects
still exist.
Thus, a sequence of operations often has to be retried
when it turns out, while holding the semaphore for reading, that such
an object must be updated. For example, do_jffs2_getxattr() first holds
the semaphore for reading while it scans xref and xdatum. If a
name/value pair has to be loaded from the medium, it releases the read
lock and retries the whole process while holding the semaphore for
writing.

Ordering constraints:
	Lock xattr_sem last, after the alloc_sem.