
	JFFS2 LOCKING DOCUMENTATION
	---------------------------

This document attempts to describe the existing locking rules for
JFFS2. It is not expected to remain perfectly up to date, but ought to
be fairly close.


	alloc_sem
	---------

The alloc_sem is a per-filesystem mutex, used primarily to ensure
contiguous allocation of space on the medium. It is automatically
obtained during space allocation (jffs2_reserve_space()) and released
upon write completion (jffs2_complete_reservation()). Note that
the garbage collector obtains it at the very beginning of
jffs2_garbage_collect_pass() and releases it at the end, thereby
preventing any other write activity on the file system during a
garbage collect pass.

When writing new nodes, the alloc_sem must be held until the new nodes
have been properly linked into the data structures for the inode to
which they belong. This is for the benefit of NAND flash - adding new
nodes to an inode may obsolete old ones, and by holding the alloc_sem
until this happens we ensure that any data in the write-buffer at the
time this happens are part of the new node, not just something that
was written afterwards. Hence, we can ensure the newly-obsoleted nodes
don't actually get erased until the write-buffer has been flushed to
the medium.

With the introduction of NAND flash support and the write-buffer,
the alloc_sem is also used to protect the wbuf-related members of the
jffs2_sb_info structure. Atomically reading the wbuf_len member to see
if the wbuf is currently holding any data is permitted, though.

Ordering constraints: See f->sem.


	File Mutex f->sem
	-----------------

This is the JFFS2-internal equivalent of the inode mutex i->i_sem.
It protects the contents of the jffs2_inode_info private inode data,
including the linked list of node fragments (but see the notes below on
erase_completion_lock), etc.

The reason that the i_sem itself isn't used for this purpose is to
avoid deadlocks with garbage collection -- the VFS will lock the i_sem
before calling a function which may need to allocate space. The
allocation may trigger garbage-collection, which may need to move a
node belonging to the inode which was locked in the first place by the
VFS. If the garbage collection code were to attempt to lock the i_sem
of the inode from which it's garbage-collecting a physical node, this
would lead to deadlock, unless we played games with unlocking the i_sem
before calling the space allocation functions.

Instead of playing such games, we just have an extra internal
mutex, which is obtained by the garbage collection code and also
by the normal file system code _after_ allocation of space.

Ordering constraints:

	1. Never attempt to allocate space or lock alloc_sem with
	   any f->sem held.
	2. Never attempt to lock two file mutexes in one thread.
	   No ordering rules have been made for doing so.
	3. Never lock a page cache page with f->sem held.
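
As a rough illustration of rule 1, the following userspace sketch
models the two locks with pthread mutexes. It is not the kernel code:
the structures and the *_model() functions are hypothetical stand-ins,
and only the order of the lock operations is meant to mirror the text
above (reserve space, which takes alloc_sem; then take f->sem; link
the new node; release in reverse order).

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical userspace stand-ins for per-filesystem and per-inode state. */
struct sb_model {
	pthread_mutex_t alloc_sem;   /* models c->alloc_sem */
	unsigned int free_bytes;     /* space still available on the medium */
};

struct inode_model {
	pthread_mutex_t sem;         /* models f->sem */
	unsigned int nr_nodes;       /* nodes linked into this inode */
};

/* Models jffs2_reserve_space(): on success returns 0 with alloc_sem HELD. */
static int reserve_space_model(struct sb_model *c, unsigned int len)
{
	pthread_mutex_lock(&c->alloc_sem);
	if (c->free_bytes < len) {
		pthread_mutex_unlock(&c->alloc_sem);
		return -1;
	}
	c->free_bytes -= len;
	return 0;
}

/* Models jffs2_complete_reservation(): drops alloc_sem. */
static void complete_reservation_model(struct sb_model *c)
{
	pthread_mutex_unlock(&c->alloc_sem);
}

/* Write one node: alloc_sem first, f->sem second, release in reverse. */
static int write_node_model(struct sb_model *c, struct inode_model *f,
			    unsigned int len)
{
	if (reserve_space_model(c, len) != 0)
		return -1;

	pthread_mutex_lock(&f->sem);
	f->nr_nodes++;               /* link the new node into the inode */
	pthread_mutex_unlock(&f->sem);

	/* alloc_sem is dropped only after the node has been linked. */
	complete_reservation_model(c);
	return 0;
}

/* Small driver: returns the number of nodes linked into one inode. */
static unsigned int write_demo(void)
{
	struct sb_model c;
	struct inode_model f;
	int first, second, third;

	pthread_mutex_init(&c.alloc_sem, NULL);
	pthread_mutex_init(&f.sem, NULL);
	c.free_bytes = 100;
	f.nr_nodes = 0;

	first = write_node_model(&c, &f, 60);   /* succeeds, 40 bytes left */
	second = write_node_model(&c, &f, 60);  /* fails: not enough space */
	third = write_node_model(&c, &f, 40);   /* succeeds, medium now full */
	assert(first == 0 && second == -1 && third == 0);

	pthread_mutex_destroy(&f.sem);
	pthread_mutex_destroy(&c.alloc_sem);
	return f.nr_nodes;
}
```

Taking the locks in the opposite order (f->sem first, then reserving
space) is what rule 1 forbids: the reservation may trigger garbage
collection, which may in turn want the same f->sem.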


	erase_completion_lock spinlock
	------------------------------

This is used to serialise access to the eraseblock lists, to the
per-eraseblock lists of physical jffs2_raw_node_ref structures, and
(NB) the per-inode list of physical nodes. The latter is a special
case - see below.

As the MTD API no longer permits erase-completion callback functions
to be called from bottom-half (timer) context (on the basis that nobody
ever actually implemented such a thing), it's now sufficient to use
a simple spin_lock() rather than spin_lock_bh().

Note that the per-inode list of physical nodes (f->nodes) is a special
case. Any changes to _valid_ nodes (i.e. (->flash_offset & 1) == 0) in
the list are protected by the file mutex f->sem. But the erase code
may remove _obsolete_ nodes from the list while holding only the
erase_completion_lock. So you can walk the list only while holding the
erase_completion_lock, and can drop the lock temporarily mid-walk as
long as the pointer you're holding is to a _valid_ node, not an
obsolete one.
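
The distinction between valid and obsolete entries can be sketched in
plain C. This is a simplified single-threaded model, not the kernel's
data structures: struct raw_node_ref_model, its fields and the helper
names are hypothetical, and the obsolete test uses bit 0 of
flash_offset exactly as the paragraph above describes it. Locking is
indicated only in comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a physical node reference. */
struct raw_node_ref_model {
	uint32_t flash_offset;                   /* bit 0 set: node is obsolete */
	struct raw_node_ref_model *next_in_ino;  /* next node of the same inode */
};

/* A node is valid when bit 0 of flash_offset is clear. */
static int ref_is_valid(const struct raw_node_ref_model *ref)
{
	return (ref->flash_offset & 1) == 0;
}

/*
 * Count the valid nodes on a per-inode list. In the kernel such a walk
 * would run with erase_completion_lock held. Per the text above, the
 * lock may only be dropped temporarily while 'ref' points at a valid
 * node, because the erase code may unlink obsolete nodes at any time.
 */
static size_t count_valid_nodes(const struct raw_node_ref_model *head)
{
	const struct raw_node_ref_model *ref;
	size_t n = 0;

	for (ref = head; ref != NULL; ref = ref->next_in_ino) {
		if (ref_is_valid(ref))
			n++;   /* the only kind of point where unlocking is allowed */
	}
	return n;
}

/* Build a three-node list (valid, obsolete, valid) and count the valid ones. */
static size_t walk_demo(void)
{
	struct raw_node_ref_model c = { 0x300, NULL };
	struct raw_node_ref_model b = { 0x200 | 1, &c };
	struct raw_node_ref_model a = { 0x100, &b };

	assert(ref_is_valid(&a) && !ref_is_valid(&b) && ref_is_valid(&c));
	return count_valid_nodes(&a);
}
```

Note the parentheses in ref_is_valid(): in C, '==' binds more tightly
than '&', so the mask has to be applied before the comparison.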

The erase_completion_lock is also used to protect the c->gc_task
pointer when the garbage collection thread exits. The code to kill the
GC thread locks it, sends the signal, then unlocks it - while the GC
thread itself locks it, zeroes c->gc_task, then unlocks on the exit path.


	inocache_lock spinlock
	----------------------

This spinlock protects the hashed list (c->inocache_list) of the
in-core jffs2_inode_cache objects (each inode in JFFS2 has a
corresponding jffs2_inode_cache object). So, the inocache_lock
has to be held while walking the c->inocache_list hash buckets.

This spinlock also covers allocation of new inode numbers, which is
currently just '++c->highest_ino', but might one day get more complicated
if we need to deal with wrapping after 4 milliard inode numbers are used.

Note that holding f->sem guarantees that the corresponding
jffs2_inode_cache will not be removed, so it may be accessed without
taking the inocache_lock spinlock.

Ordering constraints:

	If both erase_completion_lock and inocache_lock are needed, the
	c->erase_completion_lock has to be acquired first.


	erase_free_sem
	--------------

This mutex is only used by the erase code which frees obsolete node
references and the jffs2_garbage_collect_deletion_dirent() function.
The latter function on NAND flash must read _obsolete_ nodes to
determine whether the 'deletion dirent' under consideration can be
discarded or whether it is still required to show that an inode has
been unlinked. Because reading from the flash may sleep, the
erase_completion_lock cannot be held, so an alternative, more
heavyweight lock was required to prevent the erase code from freeing
the jffs2_raw_node_ref structures in question while the garbage
collection code is looking at them.

Suggestions for alternative solutions to this problem would be welcomed.


	wbuf_sem
	--------

This read/write semaphore protects against concurrent access to the
write-behind buffer ('wbuf') used for flash chips where we must write
in blocks. It protects both the contents of the wbuf and the metadata
which indicates which flash region (if any) is currently covered by
the buffer.

Ordering constraints:
	Lock wbuf_sem last, after the alloc_sem and/or f->sem.


	c->xattr_sem
	------------

This read/write semaphore protects against concurrent access to the
xattr-related objects, which include data in the superblock and
ic->xref. On read-only paths, taking the semaphore for writing excludes
more than necessary; taking it for reading is enough. You must, however,
hold it for writing when updating, creating or deleting any
xattr-related object.

Once xattr_sem is released, there is no guarantee that those objects
still exist. A sequence of operations therefore often has to be retried
when it turns out, while holding the semaphore for reading, that one of
these objects must be updated. For example, do_jffs2_getxattr() first
holds the semaphore for reading to scan the xref and xdatum objects. If
it needs to load a name/value pair from the medium, it releases the
read lock and retries the scan while holding the semaphore for writing.

Ordering constraints:
	Lock xattr_sem last, after the alloc_sem.
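
The ordering constraints stated in the sections of this document can
be collected into a small table and checked mechanically. The sketch
below is only an illustration: the enum, the table and the function
names are hypothetical, it covers just the pairwise rules written
down in this document plus the "no two file mutexes" rule, and it does
not model the page cache rule.

```c
#include <assert.h>
#include <stddef.h>

enum lock_id {
	ALLOC_SEM, F_SEM, WBUF_SEM, XATTR_SEM,
	ERASE_COMPLETION_LOCK, INOCACHE_LOCK, NR_LOCKS
};

/* Each row {first, second} means: 'first' must be taken before 'second'. */
static const enum lock_id must_precede[][2] = {
	{ ALLOC_SEM, F_SEM },                      /* f->sem, rule 1 */
	{ ALLOC_SEM, WBUF_SEM },                   /* wbuf_sem is locked last */
	{ F_SEM, WBUF_SEM },
	{ ALLOC_SEM, XATTR_SEM },                  /* xattr_sem after alloc_sem */
	{ ERASE_COMPLETION_LOCK, INOCACHE_LOCK },  /* inocache_lock ordering */
};

/* Returns 1 if 'next' may be taken while the locks flagged in held[] are held. */
static int may_acquire(const int held[NR_LOCKS], enum lock_id next)
{
	size_t i;

	/* f->sem, rule 2: never hold two file mutexes in one thread. */
	if (next == F_SEM && held[F_SEM])
		return 0;

	for (i = 0; i < sizeof(must_precede) / sizeof(must_precede[0]); i++) {
		/* 'next' has to come first, but its successor is already held. */
		if (must_precede[i][0] == next && held[must_precede[i][1]])
			return 0;
	}
	return 1;
}

/* Returns the index of the first offending acquisition, or -1 if none. */
static int first_violation(const enum lock_id *seq, size_t n)
{
	int held[NR_LOCKS] = { 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		if (!may_acquire(held, seq[i]))
			return (int)i;
		held[seq[i]] = 1;
	}
	return -1;
}

/* Checks one permitted and one forbidden sequence. */
static int order_demo(void)
{
	static const enum lock_id good[] = { ALLOC_SEM, F_SEM, WBUF_SEM };
	static const enum lock_id bad[] = { F_SEM, ALLOC_SEM };

	assert(first_violation(good, 3) == -1);
	return first_violation(bad, 2);   /* alloc_sem taken with f->sem held */
}
```

A sequence passes if no lock is taken while one of its documented
successors is already held; order_demo() reports the position of the
alloc_sem acquisition in the forbidden sequence.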