	$Id: README.Locking,v 1.12 2005/04/13 13:22:35 dwmw2 Exp $

	JFFS2 LOCKING DOCUMENTATION
	---------------------------

At least theoretically, JFFS2 does not require the Big Kernel Lock
(BKL), which was always helpfully obtained for it by Linux 2.4 VFS
code. It has its own locking, as described below.

This document attempts to describe the existing locking rules for
JFFS2. It is not expected to remain perfectly up to date, but ought to
be fairly close.


	alloc_sem
	---------

The alloc_sem is a per-filesystem semaphore, used primarily to ensure
contiguous allocation of space on the medium. It is automatically
obtained during space allocations (jffs2_reserve_space()) and freed
upon write completion (jffs2_complete_reservation()). Note that
the garbage collector will obtain this right at the beginning of
jffs2_garbage_collect_pass() and release it at the end, thereby
preventing any other write activity on the file system during a
garbage collect pass.

When writing new nodes, the alloc_sem must be held until the new nodes
have been properly linked into the data structures for the inode to
which they belong. This is for the benefit of NAND flash - adding new
nodes to an inode may obsolete old ones, and by holding the alloc_sem
until this happens we ensure that any data in the write-buffer at the
time this happens are part of the new node, not just something that
was written afterwards. Hence, we can ensure the newly-obsoleted nodes
don't actually get erased until the write-buffer has been flushed to
the medium.

With the introduction of NAND flash support and the write-buffer,
the alloc_sem is also used to protect the wbuf-related members of the
jffs2_sb_info structure. Atomically reading the wbuf_len member to see
if the wbuf is currently holding any data is permitted, though.

Ordering constraints: See f->sem.


	File Semaphore f->sem
	---------------------

This is the JFFS2-internal equivalent of the inode semaphore i->i_sem.
It protects the contents of the jffs2_inode_info private inode data,
including the linked list of node fragments (but see the notes below on
erase_completion_lock), etc.

The reason that the i_sem itself isn't used for this purpose is to
avoid deadlocks with garbage collection -- the VFS will lock the i_sem
before calling a function which may need to allocate space. The
allocation may trigger garbage-collection, which may need to move a
node belonging to the inode which was locked in the first place by the
VFS. If the garbage collection code were to attempt to lock the i_sem
of the inode from which it's garbage-collecting a physical node, this
would lead to deadlock, unless we played games with unlocking the i_sem
before calling the space allocation functions.

Instead of playing such games, we just have an extra internal
semaphore, which is obtained by the garbage collection code and also
by the normal file system code _after_ allocation of space.

Ordering constraints:

	1. Never attempt to allocate space or lock alloc_sem with
	   any f->sem held.
	2. Never attempt to lock two file semaphores in one thread.
	   No ordering rules have been made for doing so.
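
The ordering described above (space allocation first, f->sem second)
can be sketched in userspace. This is only a model, not the JFFS2 code:
pthread mutexes stand in for the kernel semaphores, and every name
beginning with model_ is hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* Userspace model only: pthread mutexes stand in for kernel semaphores. */
struct model_sb {                       /* stand-in for jffs2_sb_info */
	pthread_mutex_t alloc_sem;
	int alloc_held;                 /* bookkeeping for the assert below */
};

struct model_inode {                    /* stand-in for jffs2_inode_info */
	pthread_mutex_t sem;
	int nodes_written;
};

/* Model of a space reservation: returns with alloc_sem held. */
static void model_reserve_space(struct model_sb *c)
{
	pthread_mutex_lock(&c->alloc_sem);
	c->alloc_held = 1;
}

/* Model of completing the reservation: drops alloc_sem. */
static void model_complete_reservation(struct model_sb *c)
{
	c->alloc_held = 0;
	pthread_mutex_unlock(&c->alloc_sem);
}

/* Write path: alloc_sem first, then f->sem, never the other way round. */
static int model_write_node(struct model_sb *c, struct model_inode *f)
{
	model_reserve_space(c);         /* 1. allocate space (takes alloc_sem) */
	pthread_mutex_lock(&f->sem);    /* 2. only now take the file semaphore */
	assert(c->alloc_held);          /* alloc_sem is held while linking */
	f->nodes_written++;             /* link the new node into the inode */
	pthread_mutex_unlock(&f->sem);
	model_complete_reservation(c);  /* 3. drop alloc_sem after linking */
	return f->nodes_written;
}
```

A caller that already held f->sem and then called model_reserve_space()
would invert this order, which is exactly what rule 1 forbids.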


	erase_completion_lock spinlock
	------------------------------

This is used to serialise access to the eraseblock lists, to the
per-eraseblock lists of physical jffs2_raw_node_ref structures, and
(NB) the per-inode list of physical nodes. The latter is a special
case - see below.

As the MTD API no longer permits erase-completion callback functions
to be called from bottom-half (timer) context (on the basis that nobody
ever actually implemented such a thing), it's now sufficient to use
a simple spin_lock() rather than spin_lock_bh().

Note that the per-inode list of physical nodes (f->nodes) is a special
case. Any changes to _valid_ nodes (i.e. (->flash_offset & 1) == 0) in
the list are protected by the file semaphore f->sem. But the erase
code may remove _obsolete_ nodes from the list while holding only the
erase_completion_lock. So you can walk the list only while holding the
erase_completion_lock, and can drop the lock temporarily mid-walk as
long as the pointer you're holding is to a _valid_ node, not an
obsolete one.
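
The walking rule above can be illustrated with a small userspace model.
This is a sketch, not the kernel code: a pthread mutex stands in for the
spinlock, struct model_ref and its fields are hypothetical, and the
caller is assumed to hold f->sem (not modelled here), which is what
keeps valid nodes from changing while the lock is dropped.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-inode list of physical node refs. */
struct model_ref {
	unsigned int flash_offset;      /* low bit set: node is obsolete */
	struct model_ref *next_in_ino;
};

/* A mutex stands in for the erase_completion_lock spinlock. */
static pthread_mutex_t erase_completion_lock = PTHREAD_MUTEX_INITIALIZER;

static int ref_is_valid(const struct model_ref *ref)
{
	return (ref->flash_offset & 1) == 0;
}

/*
 * Count the valid nodes on the list. The lock is dropped mid-walk only
 * while 'ref' points at a valid node; obsolete nodes are stepped over
 * with the lock held, because the erase code may free them.
 */
static int count_valid_nodes(struct model_ref *head)
{
	int valid = 0;

	pthread_mutex_lock(&erase_completion_lock);
	for (struct model_ref *ref = head; ref; ref = ref->next_in_ino) {
		if (!ref_is_valid(ref))
			continue;       /* never park on an obsolete node */
		valid++;
		pthread_mutex_unlock(&erase_completion_lock);
		/* ... work that may sleep would go here ... */
		pthread_mutex_lock(&erase_completion_lock);
		/* ref->next_in_ino is re-read under the lock by the loop */
	}
	pthread_mutex_unlock(&erase_completion_lock);
	return valid;
}
```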

The erase_completion_lock is also used to protect the c->gc_task
pointer when the garbage collection thread exits. The code to kill the
GC thread locks it, sends the signal, then unlocks it - while the GC
thread itself locks it, zeroes c->gc_task, then unlocks on the exit path.


	inocache_lock spinlock
	----------------------

This spinlock protects the hashed list (c->inocache_list) of the
in-core jffs2_inode_cache objects (each inode in JFFS2 has a
corresponding jffs2_inode_cache object). So, the inocache_lock
has to be held while walking the c->inocache_list hash buckets.

This spinlock also covers allocation of new inode numbers, which is
currently just '++c->highest_ino', but might one day get more complicated
if we need to deal with wrapping after 4 billion inode numbers are used.
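
As a rough userspace illustration of allocating inode numbers under the
lock: a pthread mutex stands in for the spinlock, and struct ino_model_sb
and model_alloc_ino() are hypothetical names, not the JFFS2 ones.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of the superblock info. */
struct ino_model_sb {
	pthread_mutex_t inocache_lock;  /* a spinlock in the kernel */
	uint32_t highest_ino;
};

/* Hand out the next inode number while holding inocache_lock. */
static uint32_t model_alloc_ino(struct ino_model_sb *c)
{
	uint32_t ino;

	pthread_mutex_lock(&c->inocache_lock);
	ino = ++c->highest_ino;         /* pre-increment, under the lock */
	pthread_mutex_unlock(&c->inocache_lock);

	assert(ino != 0);               /* wrap-around is not handled here */
	return ino;
}
```

Two threads calling model_alloc_ino() concurrently can never receive the
same number, because the increment and the read happen under one lock.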

Note that the f->sem guarantees that the corresponding jffs2_inode_cache
will not be removed. So, it may be accessed without taking the
inocache_lock spinlock.

Ordering constraints:

	If both erase_completion_lock and inocache_lock are needed, the
	c->erase_completion_lock has to be acquired first.


	erase_free_sem
	--------------

This semaphore is only used by the erase code which frees obsolete
node references and the jffs2_garbage_collect_deletion_dirent()
function. The latter function on NAND flash must read _obsolete_ nodes
to determine whether the 'deletion dirent' under consideration can be
discarded or whether it is still required to show that an inode has
been unlinked. Because reading from the flash may sleep, the
erase_completion_lock cannot be held, so an alternative, more
heavyweight lock was required to prevent the erase code from freeing
the jffs2_raw_node_ref structures in question while the garbage
collection code is looking at them.

Suggestions for alternative solutions to this problem would be welcomed.


	wbuf_sem
	--------

This read/write semaphore protects against concurrent access to the
write-behind buffer ('wbuf') used for flash chips where we must write
in blocks. It protects both the contents of the wbuf and the metadata
which indicates which flash region (if any) is currently covered by
the buffer.

Ordering constraints:
	Lock wbuf_sem last, after the alloc_sem and/or f->sem.


	c->xattr_sem
	------------

This read/write semaphore protects against concurrent access to the
xattr-related objects, which include structures in the superblock and
ic->xref. On read-only paths, taking the semaphore for writing excludes
more than necessary; taking it for reading is enough. But you must hold
it for writing when updating, creating or deleting any xattr-related
object.

Once xattr_sem is released, there is no guarantee that those objects
still exist. Thus, a sequence of operations often has to be retried when
it turns out that such an object must be updated while only the read
semaphore is held. For example, do_jffs2_getxattr() first holds the
read semaphore to scan the xref and xdatum. But if it needs to load the
name/value pair from the medium, it releases the read semaphore and
retries the process while holding the write semaphore.

Ordering constraints:
	Lock xattr_sem last, after the alloc_sem.