=======
Locking
=======

The text below describes the locking rules for VFS-related methods.
It is (believed to be) up-to-date. *Please*, if you change anything in
prototypes or locking protocols - update this file. And update the relevant
instances in the tree, don't leave that to maintainers of filesystems/devices/
etc. At the very least, put the list of dubious cases at the end of this file.
Don't turn it into a log - maintainers of out-of-the-tree code are supposed to
be able to use diff(1).

Thing currently missing here: socket operations. Alexey?

dentry_operations
=================

prototypes::

	int (*d_revalidate)(struct dentry *, unsigned int);
	int (*d_weak_revalidate)(struct dentry *, unsigned int);
	int (*d_hash)(const struct dentry *, struct qstr *);
	int (*d_compare)(const struct dentry *,
			unsigned int, const char *, const struct qstr *);
	int (*d_delete)(struct dentry *);
	int (*d_init)(struct dentry *);
	void (*d_release)(struct dentry *);
	void (*d_iput)(struct dentry *, struct inode *);
	char *(*d_dname)(struct dentry *dentry, char *buffer, int buflen);
	struct vfsmount *(*d_automount)(struct path *path);
	int (*d_manage)(const struct path *, bool);
	struct dentry *(*d_real)(struct dentry *, const struct inode *);

locking rules:

================== =========== ======== ============== ========
ops                rename_lock ->d_lock may block      rcu-walk
================== =========== ======== ============== ========
d_revalidate:      no          no       yes (ref-walk) maybe
d_weak_revalidate: no          no       yes            no
d_hash             no          no       no             maybe
d_compare:         yes         no       no             maybe
d_delete:          no          yes      no             no
d_init:            no          no       yes            no
d_release:         no          no       yes            no
d_prune:           no          yes      no             no
d_iput:            no          no       yes            no
d_dname:           no          no       no             no
d_automount:       no          no       yes            no
d_manage:          no          no       yes (ref-walk) maybe
d_real             no          no       yes            no
================== =========== ======== ============== ========

inode_operations
================

prototypes::

	int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, bool);
	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
	int (*link) (struct dentry *,struct inode *,struct dentry *);
	int (*unlink) (struct inode *,struct dentry *);
	int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
	int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
	int (*rmdir) (struct inode *,struct dentry *);
	int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
	int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
			struct inode *, struct dentry *, unsigned int);
	int (*readlink) (struct dentry *, char __user *,int);
	const char *(*get_link) (struct dentry *, struct inode *, struct delayed_call *);
	void (*truncate) (struct inode *);
	int (*permission) (struct mnt_idmap *, struct inode *, int, unsigned int);
	struct posix_acl * (*get_inode_acl)(struct inode *, int, bool);
	int (*setattr) (struct mnt_idmap *, struct dentry *, struct iattr *);
	int (*getattr) (struct mnt_idmap *, const struct path *, struct kstat *, u32, unsigned int);
	ssize_t (*listxattr) (struct dentry *, char *, size_t);
	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len);
	void (*update_time)(struct inode *, struct timespec *, int);
	int (*atomic_open)(struct inode *, struct dentry *,
				struct file *, unsigned open_flag,
				umode_t create_mode);
	int (*tmpfile) (struct mnt_idmap *, struct inode *,
			struct file *, umode_t);
	int (*fileattr_set)(struct mnt_idmap *idmap,
			    struct dentry *dentry, struct fileattr *fa);
	int (*fileattr_get)(struct dentry *dentry, struct fileattr *fa);
	struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
	struct offset_ctx *(*get_offset_ctx)(struct inode *inode);

locking rules:
	all may block

============== ==================================================
ops            i_rwsem(inode)
============== ==================================================
lookup:        shared
create:        exclusive
link:          exclusive (both)
mknod:         exclusive
symlink:       exclusive
mkdir:         exclusive
unlink:        exclusive (both)
rmdir:         exclusive (both)(see below)
rename:        exclusive (both parents, some children) (see below)
readlink:      no
get_link:      no
setattr:       exclusive
permission:    no (may not block if called in rcu-walk mode)
get_inode_acl: no
get_acl:       no
getattr:       no
listxattr:     no
fiemap:        no
update_time:   no
atomic_open:   shared (exclusive if O_CREAT is set in open flags)
tmpfile:       no
fileattr_get:  no or exclusive
fileattr_set:  exclusive
get_offset_ctx no
============== ==================================================


	Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
	exclusive on the victim.
	Cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
	->unlink() and ->rename() have ->i_rwsem exclusive on all non-directories
	involved.
	->rename() has ->i_rwsem exclusive on any subdirectory that changes parent.

See Documentation/filesystems/directory-locking.rst for a more detailed
discussion of the locking scheme for directory operations.

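
For quick reference, the inode_operations table can also be written down as
data. The following is only a userspace sketch of a subset of the rows, not
kernel code: the names ``enum rwsem_mode``, ``struct inode_op_rule`` and
``required_i_rwsem()`` are hypothetical and exist nowhere in the tree. It
shows how the O_CREAT-dependent rule for ->atomic_open() differs from the
fixed rules of the other methods.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>

/* How the directory's i_rwsem is held when the method is called. */
enum rwsem_mode { RWSEM_NONE, RWSEM_SHARED, RWSEM_EXCLUSIVE };

/* Hypothetical record type: one row of the locking table. */
struct inode_op_rule {
	const char *op;
	enum rwsem_mode mode;
};

/* A subset of the rows from the table; illustrative only. */
static const struct inode_op_rule inode_op_rules[] = {
	{ "lookup",   RWSEM_SHARED },
	{ "create",   RWSEM_EXCLUSIVE },
	{ "link",     RWSEM_EXCLUSIVE },
	{ "unlink",   RWSEM_EXCLUSIVE },
	{ "mkdir",    RWSEM_EXCLUSIVE },
	{ "rmdir",    RWSEM_EXCLUSIVE },
	{ "rename",   RWSEM_EXCLUSIVE },
	{ "setattr",  RWSEM_EXCLUSIVE },
	{ "readlink", RWSEM_NONE },
	{ "getattr",  RWSEM_NONE },
};

/*
 * Return the i_rwsem mode listed for @op, or -1 if @op is not part of
 * this model.  ->atomic_open() is handled separately because its mode
 * depends on whether O_CREAT is set in the open flags.
 */
int required_i_rwsem(const char *op, int open_flags)
{
	size_t i;

	assert(op != NULL);

	if (strcmp(op, "atomic_open") == 0)
		return (open_flags & O_CREAT) ? RWSEM_EXCLUSIVE : RWSEM_SHARED;

	for (i = 0; i < sizeof(inode_op_rules) / sizeof(inode_op_rules[0]); i++)
		if (strcmp(op, inode_op_rules[i].op) == 0)
			return inode_op_rules[i].mode;

	return -1;
}
```

The sketch deliberately leaves out the "(both)" and "some children" details,
which concern additional inodes and are covered by directory-locking.rst.
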

xattr_handler operations
========================

prototypes::

	bool (*list)(struct dentry *dentry);
	int (*get)(const struct xattr_handler *handler, struct dentry *dentry,
		   struct inode *inode, const char *name, void *buffer,
		   size_t size);
	int (*set)(const struct xattr_handler *handler,
		   struct mnt_idmap *idmap,
		   struct dentry *dentry, struct inode *inode, const char *name,
		   const void *buffer, size_t size, int flags);

locking rules:
	all may block

===== ==============
ops   i_rwsem(inode)
===== ==============
list: no
get:  no
set:  exclusive
===== ==============

super_operations
================

prototypes::

	struct inode *(*alloc_inode)(struct super_block *sb);
	void (*free_inode)(struct inode *);
	void (*destroy_inode)(struct inode *);
	void (*dirty_inode) (struct inode *, int flags);
	int (*write_inode) (struct inode *, struct writeback_control *wbc);
	int (*drop_inode) (struct inode *);
	void (*evict_inode) (struct inode *);
	void (*put_super) (struct super_block *);
	int (*sync_fs)(struct super_block *sb, int wait);
	int (*freeze_fs) (struct super_block *);
	int (*unfreeze_fs) (struct super_block *);
	int (*statfs) (struct dentry *, struct kstatfs *);
	int (*remount_fs) (struct super_block *, int *, char *);
	void (*umount_begin) (struct super_block *);
	int (*show_options)(struct seq_file *, struct dentry *);
	ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);

locking rules:
	All may block [not true, see below]

====================== ============ ========================
ops                    s_umount     note
====================== ============ ========================
alloc_inode:
free_inode:                         called from RCU callback
destroy_inode:
dirty_inode:
write_inode:
drop_inode:                         !!!inode->i_lock!!!
evict_inode:
put_super:             write
sync_fs:               read
freeze_fs:             write
unfreeze_fs:           write
statfs:                maybe(read)  (see below)
remount_fs:            write
umount_begin:          no
show_options:          no           (namespace_sem)
quota_read:            no           (see below)
quota_write:           no           (see below)
====================== ============ ========================

->statfs() has s_umount (shared) when called by ustat(2) (native or
compat), but that's an accident of bad API; s_umount is used to pin
the superblock down when we only have dev_t given us by userland to
identify the superblock. Everything else (statfs(), fstatfs(), etc.)
doesn't hold it when calling ->statfs() - superblock is pinned down
by resolving the pathname passed to syscall.

->quota_read() and ->quota_write() functions are both guaranteed to
be the only ones operating on the quota file by the quota code (via
dqio_sem) (unless an admin really wants to screw up something and
writes to quota files with quotas on). For other details about locking
see also dquot_operations section.


file_system_type
================

prototypes::

	struct dentry *(*mount) (struct file_system_type *, int,
		       const char *, void *);
	void (*kill_sb) (struct super_block *);

locking rules:

======= =========
ops     may block
======= =========
mount   yes
kill_sb yes
======= =========

->mount() returns ERR_PTR or the root dentry; its superblock should be locked
on return.

->kill_sb() takes a write-locked superblock, does all shutdown work on it,
unlocks and drops the reference.

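
The ->mount() / ->kill_sb() contract above can be illustrated with a small
userspace model. This is a sketch under stated assumptions, not kernel code:
``toy_err_ptr()``, ``toy_ptr_err()`` and ``toy_is_err()`` are stand-ins
written here to mimic the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(), and
``struct toy_super``, ``struct toy_dentry``, ``toy_mount()`` and
``toy_kill_sb()`` are hypothetical names. A plain boolean stands in for the
write-held superblock lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer (models ERR_PTR()). */
void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Recover the errno value from an error pointer (models PTR_ERR()). */
long toy_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if @p lies in the top TOY_MAX_ERRNO addresses (models IS_ERR()). */
bool toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-(intptr_t)TOY_MAX_ERRNO);
}

struct toy_super {
	bool write_locked;	/* stands in for s_umount held for write */
	int refcount;
};

struct toy_dentry {
	struct toy_super *sb;
};

static struct toy_super toy_sb;
static struct toy_dentry toy_root = { .sb = &toy_sb };

/*
 * Models ->mount(): returns either an error-encoded pointer or the
 * root dentry, whose superblock is locked on return.
 */
struct toy_dentry *toy_mount(const char *dev_name)
{
	if (dev_name == NULL)
		return toy_err_ptr(-EINVAL);

	toy_sb.refcount = 1;
	toy_sb.write_locked = true;
	return &toy_root;
}

/*
 * Models ->kill_sb(): entered with the superblock write-locked, does
 * the shutdown work, then unlocks and drops the reference.
 */
void toy_kill_sb(struct toy_super *sb)
{
	assert(sb->write_locked);
	/* filesystem-specific shutdown work would go here */
	sb->write_locked = false;
	sb->refcount--;
}
```

A caller checks the result with ``toy_is_err()`` before touching it, and
only a successfully mounted superblock is ever passed to ``toy_kill_sb()``.
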

address_space_operations
========================

prototypes::

	int (*writepage)(struct page *page, struct writeback_control *wbc);
	int (*read_folio)(struct file *, struct folio *);
	int (*writepages)(struct address_space *, struct writeback_control *);
	bool (*dirty_folio)(struct address_space *, struct folio *folio);
	void (*readahead)(struct readahead_control *);
	int (*write_begin)(struct file *, struct address_space *mapping,
				loff_t pos, unsigned len,
				struct page **pagep, void **fsdata);
	int (*write_end)(struct file *, struct address_space *mapping,
				loff_t pos, unsigned len, unsigned copied,
				struct page *page, void *fsdata);
	sector_t (*bmap)(struct address_space *, sector_t);
	void (*invalidate_folio) (struct folio *, size_t start, size_t len);
	bool (*release_folio)(struct folio *, gfp_t);
	void (*free_folio)(struct folio *);
	int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
	int (*migrate_folio)(struct address_space *, struct folio *dst,
			struct folio *src, enum migrate_mode);
	int (*launder_folio)(struct folio *);
	bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
	int (*error_remove_page)(struct address_space *, struct page *);
	int (*swap_activate)(struct swap_info_struct *sis, struct file *f, sector_t *span);
	int (*swap_deactivate)(struct file *);
	int (*swap_rw)(struct kiocb *iocb, struct iov_iter *iter);

locking rules:
	All except dirty_folio and free_folio may block

====================== ======================== ========= ===============
ops                    folio locked             i_rwsem   invalidate_lock
====================== ======================== ========= ===============
writepage:             yes, unlocks (see below)
read_folio:            yes, unlocks                       shared
writepages:
dirty_folio:           maybe
readahead:             yes, unlocks                       shared
write_begin:           locks the page           exclusive
write_end:             yes, unlocks             exclusive
bmap:
invalidate_folio:      yes                                exclusive
release_folio:         yes
free_folio:            yes
direct_IO:
migrate_folio:         yes (both)
launder_folio:         yes
is_partially_uptodate: yes
error_remove_page:     yes
swap_activate:         no
swap_deactivate:       no
swap_rw:               yes, unlocks
====================== ======================== ========= ===============

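
The ->writepage() rules described in this section boil down to a small state
machine on the page flags. The following userspace model is a sketch only:
``struct toy_page``, ``enum toy_sync_mode``, ``toy_writepage()`` and
``toy_write_end_io()`` are hypothetical names, and plain booleans stand in
for the page lock, the dirty bit and the writeback bit. The model assumes the
page arrives locked with its dirty flag already cleared by the caller.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for "memory cleansing" and "sync" writeback. */
enum toy_sync_mode { TOY_SYNC_NONE, TOY_SYNC_ALL };

struct toy_page {
	bool locked;		/* page lock */
	bool dirty;		/* dirty flag */
	bool writeback;		/* under-writeback flag */
	bool io_in_progress;	/* older I/O we would have to wait for */
};

/* Models ->writepage(); always returns zero in this sketch. */
int toy_writepage(struct toy_page *page, enum toy_sync_mode mode)
{
	assert(page->locked);	/* the caller passes a locked page */

	if (page->io_in_progress) {
		if (mode == TOY_SYNC_NONE) {
			/* Do not block: redirty, unlock, return zero. */
			page->dirty = true;	/* models redirty_page_for_writepage() */
			page->locked = false;
			return 0;
		}
		/* Sync: must wait for the old I/O before starting new I/O. */
		page->io_in_progress = false;
	}

	page->writeback = true;		/* models set_page_writeback() */
	page->locked = false;		/* unlock before returning */
	/* write I/O would be submitted here */
	return 0;
}

/* Models the write I/O completion handler. */
void toy_write_end_io(struct toy_page *page)
{
	page->writeback = false;	/* models end_page_writeback() */
}
```

Every path either redirties the page or sets the writeback flag before the
page is unlocked, so the model never leaves a page that is clean, unlocked
and unwritten.
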
->write_begin(), ->write_end() and ->read_folio() may be called from
the request handler (/dev/loop).

->read_folio() unlocks the folio, either synchronously or via I/O
completion.

->readahead() unlocks the folios that I/O is attempted on like ->read_folio().

->writepage() is used for two purposes: for "memory cleansing" and for
"sync". These are quite different operations and the behaviour may differ
depending upon the mode.

If writepage is called for sync (wbc->sync_mode != WB_SYNC_NONE) then
it *must* start I/O against the page, even if that would involve
blocking on in-progress I/O.

If writepage is called for memory cleansing (sync_mode ==
WB_SYNC_NONE) then its role is to get as much writeout underway as
possible. So writepage should try to avoid blocking against
currently-in-progress I/O.

If the filesystem is not called for "sync" and it determines that it
would need to block against in-progress I/O to be able to start new I/O
against the page, the filesystem should redirty the page with
redirty_page_for_writepage(), then unlock the page and return zero.
This may also be done to avoid internal deadlocks, but rarely.

If the filesystem is called for sync then it must wait on any
in-progress I/O and then start new I/O.

The filesystem should unlock the page synchronously, before returning to the
caller, unless ->writepage() returns the special WRITEPAGE_ACTIVATE
value. WRITEPAGE_ACTIVATE means that the page cannot really be written out
currently, and the VM should stop calling ->writepage() on this page for some
time. The VM does this by moving the page to the head of the active list,
hence the name.

Unless the filesystem is going to redirty_page_for_writepage(), unlock the page
and return zero, writepage *must* run set_page_writeback() against the page,
followed by unlocking it. Once set_page_writeback() has been run against the
page, write I/O can be submitted and the write I/O completion handler must run
end_page_writeback() once the I/O is complete. If no I/O is submitted, the
filesystem must run end_page_writeback() against the page before returning from
writepage.

That is: after 2.5.12, pages which are under writeout are *not* locked. Note,
if the filesystem needs the page to be locked during writeout, that is OK too:
the page is allowed to be unlocked at any point in time between the calls to
set_page_writeback() and end_page_writeback().

Note, failure to run either redirty_page_for_writepage() or the combination of
set_page_writeback()/end_page_writeback() on a page submitted to writepage
will leave the page itself marked clean but it will be tagged as dirty in the
radix tree. This incoherency can lead to all sorts of hard-to-debug problems
in the filesystem like having dirty inodes at umount and losing written data.

->writepages() is used for periodic writeback and for syscall-initiated
sync operations. The address_space should start I/O against at least
``*nr_to_write`` pages. ``*nr_to_write`` must be decremented for each page
which is written. The address_space implementation may write more (or less)
pages than ``*nr_to_write`` asks for, but it should try to be reasonably close.
If nr_to_write is NULL, all dirty pages must be written.

writepages should _only_ write pages which are present on
mapping->io_pages.

->dirty_folio() is called from various places in the kernel when
the target folio is marked as needing writeback. The folio cannot be
truncated because either the caller holds the folio lock, or the caller
has found the folio while holding the page table lock which will block
truncation.

->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some
filesystems and by the swapper. The latter will eventually go away. Please,
keep it that way and don't breed new callers.

->invalidate_folio() is called when the filesystem must attempt to drop
some or all of the buffers from the page when it is being truncated. It
returns zero on success. The filesystem must exclusively acquire
invalidate_lock before invalidating page cache in truncate / hole punch
path (and thus calling into ->invalidate_folio) to block races between page
cache invalidation and page cache filling functions (fault, read, ...).

->release_folio() is called when the MM wants to make a change to the
folio that would invalidate the filesystem's private data. For example,
it may be about to be removed from the address_space or split. The folio
is locked and not under writeback. It may be dirty. The gfp parameter
is not usually used for allocation, but rather to indicate what the
filesystem may do to attempt to free the private data. The filesystem may
return false to indicate that the folio's private data cannot be freed.
If it returns true, it should have already removed the private data from
the folio. If a filesystem does not provide a ->release_folio method,
the pagecache will assume that private data is buffer_heads and call
try_to_free_buffers().

->free_folio() is called when the kernel has dropped the folio
from the page cache.

->launder_folio() may be called prior to releasing a folio if
it is still found to be dirty. It returns zero if the folio was successfully
cleaned, or an error value if not. Note that in order to prevent the folio
getting mapped back in and redirtied, it needs to be kept locked
across the entire operation.

->swap_activate() will be called to prepare the given file for swap. It
should perform any validation and preparation necessary to ensure that
writes can be performed with minimal memory allocation. It should call
add_swap_extent(), or the helper iomap_swapfile_activate(), and return
the number of extents added. If IO should be submitted through
->swap_rw(), it should set SWP_FS_OPS, otherwise IO will be submitted
directly to the block device ``sis->bdev``.

->swap_deactivate() will be called in the sys_swapoff()
path after ->swap_activate() returned success.

->swap_rw will be called for swap IO if SWP_FS_OPS was set by ->swap_activate().

file_lock_operations
====================

prototypes::

	void (*fl_copy_lock)(struct file_lock *, struct file_lock *);
	void (*fl_release_private)(struct file_lock *);


locking rules:

=================== ============= =========
ops                 inode->i_lock may block
=================== ============= =========
fl_copy_lock:       yes           no
fl_release_private: maybe         maybe [1]_
=================== ============= =========

.. [1]
   ->fl_release_private for flock or POSIX locks is currently allowed
   to block. Leases however can still be freed while the i_lock is held and
   so fl_release_private called on a lease should not block.
438ec23eb54SMauro Carvalho Chehab 439ec23eb54SMauro Carvalho Chehablock_manager_operations 440ec23eb54SMauro Carvalho Chehab======================= 441ec23eb54SMauro Carvalho Chehab 442ec23eb54SMauro Carvalho Chehabprototypes:: 443ec23eb54SMauro Carvalho Chehab 444ec23eb54SMauro Carvalho Chehab void (*lm_notify)(struct file_lock *); /* unblock callback */ 445ec23eb54SMauro Carvalho Chehab int (*lm_grant)(struct file_lock *, struct file_lock *, int); 446ec23eb54SMauro Carvalho Chehab void (*lm_break)(struct file_lock *); /* break_lease callback */ 447ec23eb54SMauro Carvalho Chehab int (*lm_change)(struct file_lock **, int); 44828df3d15SJ. Bruce Fields bool (*lm_breaker_owns_lease)(struct file_lock *); 4492443da22SDai Ngo bool (*lm_lock_expirable)(struct file_lock *); 4502443da22SDai Ngo void (*lm_expire_lock)(void); 451ec23eb54SMauro Carvalho Chehab 452ec23eb54SMauro Carvalho Chehablocking rules: 453ec23eb54SMauro Carvalho Chehab 4546cbef2adSRandy Dunlap====================== ============= ================= ========= 4559d664776SDai Ngoops flc_lock blocked_lock_lock may block 4566cbef2adSRandy Dunlap====================== ============= ================= ========= 4579d664776SDai Ngolm_notify: no yes no 458ec23eb54SMauro Carvalho Chehablm_grant: no no no 459ec23eb54SMauro Carvalho Chehablm_break: yes no no 460ec23eb54SMauro Carvalho Chehablm_change yes no no 4619d664776SDai Ngolm_breaker_owns_lease: yes no no 4622443da22SDai Ngolm_lock_expirable yes no no 4632443da22SDai Ngolm_expire_lock no no yes 4646cbef2adSRandy Dunlap====================== ============= ================= ========= 465ec23eb54SMauro Carvalho Chehab 466ec23eb54SMauro Carvalho Chehabbuffer_head 467ec23eb54SMauro Carvalho Chehab=========== 468ec23eb54SMauro Carvalho Chehab 469ec23eb54SMauro Carvalho Chehabprototypes:: 470ec23eb54SMauro Carvalho Chehab 471ec23eb54SMauro Carvalho Chehab void (*b_end_io)(struct buffer_head *bh, int uptodate); 472ec23eb54SMauro Carvalho Chehab 473ec23eb54SMauro Carvalho 
locking rules:

Called from interrupt context, so extreme care is needed here.
bh is locked, but that is the only guarantee we have here. Currently only RAID1,
highmem, fs/buffer.c, and fs/ntfs/aops.c provide these. Block devices
call this method upon IO completion.

block_device_operations
=======================

prototypes::

	int (*open) (struct block_device *, fmode_t);
	int (*release) (struct gendisk *, fmode_t);
	int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
	int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
	int (*direct_access) (struct block_device *, sector_t, void **,
				unsigned long *);
	void (*unlock_native_capacity) (struct gendisk *);
	int (*getgeo)(struct block_device *, struct hd_geometry *);
	void (*swap_slot_free_notify) (struct block_device *, unsigned long);

locking rules:

=======================  ===================
ops                      open_mutex
=======================  ===================
open:                    yes
release:                 yes
ioctl:                   no
compat_ioctl:            no
direct_access:           no
unlock_native_capacity:  no
getgeo:                  no
swap_slot_free_notify:   no (see below)
=======================  ===================

swap_slot_free_notify is called with swap_lock and sometimes the page lock
held.


file_operations
===============

prototypes::

	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
	int (*iopoll) (struct kiocb *kiocb, bool spin);
	int (*iterate_shared) (struct file *, struct dir_context *);
	__poll_t (*poll) (struct file *, struct poll_table_struct *);
	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*open) (struct inode *, struct file *);
	int (*flush) (struct file *);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, loff_t start, loff_t end, int datasync);
	int (*fasync) (int, struct file *, int);
	int (*lock) (struct file *, int, struct file_lock *);
	unsigned long (*get_unmapped_area)(struct file *, unsigned long,
			unsigned long, unsigned long, unsigned long);
	int (*check_flags)(int);
	int (*flock) (struct file *, int, struct file_lock *);
	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *,
			size_t, unsigned int);
	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *,
			size_t, unsigned int);
	int (*setlease)(struct file *, long, struct file_lock **, void **);
	long (*fallocate)(struct file *, int, loff_t, loff_t);
	void (*show_fdinfo)(struct seq_file *m, struct file *f);
	unsigned (*mmap_capabilities)(struct file *);
	ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
			loff_t, size_t, unsigned int);
	loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
			struct file *file_out, loff_t pos_out,
			loff_t len, unsigned int remap_flags);
	int (*fadvise)(struct file *, loff_t, loff_t, int);

locking rules:
	All may block.

->llseek() locking has moved from llseek to the individual llseek
implementations. If your fs is not using generic_file_llseek, you
need to acquire and release the appropriate locks in your ->llseek().
For many filesystems, it is probably safe to acquire the inode
mutex or just to use i_size_read() instead.
Note: this does not protect file->f_pos against concurrent modifications,
since that is something userspace has to take care of.

->iterate_shared() is called with i_rwsem held for reading, and with the
file f_pos_lock held exclusively.

->fasync() is responsible for maintaining the FASYNC bit in filp->f_flags.
Most instances call fasync_helper(), which does that maintenance, so it's
not normally something one needs to worry about. Return values > 0 will be
mapped to zero in the VFS layer.

->readdir() and ->ioctl() on directories must be changed. Ideally we would
move ->readdir() to inode_operations and use a separate method for directory
->ioctl() or kill the latter completely. One of the problems is that for
anything that resembles union-mount we won't have a struct file for all
components. And there are other reasons why the current interface is a mess...
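
As an illustration of the ->llseek() rule above, here is a minimal userspace
sketch, assuming a filesystem that does not use generic_file_llseek and so
takes its own lock around the size read and the position update. It is an
analogy, not kernel code: model_inode, model_file and model_llseek() are
hypothetical names, and a pthread mutex stands in for the inode lock.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct model_inode {
	pthread_mutex_t lock;	/* plays the role of the inode lock */
	int64_t size;		/* plays the role of i_size */
};

struct model_file {
	struct model_inode *inode;
	int64_t pos;		/* plays the role of file->f_pos, never negative */
};

/*
 * Compute and store the new position while holding the inode lock, so the
 * size used for SEEK_END cannot change between the read and the update.
 * Returns the new position, or a negative errno value on failure.
 */
int64_t model_llseek(struct model_file *file, int64_t offset, int whence)
{
	struct model_inode *inode = file->inode;
	int64_t base, ret;

	assert(inode != NULL);

	pthread_mutex_lock(&inode->lock);
	switch (whence) {
	case SEEK_SET:
		base = 0;
		break;
	case SEEK_CUR:
		base = file->pos;
		break;
	case SEEK_END:
		base = inode->size;
		break;
	default:
		ret = -EINVAL;
		goto out;
	}

	/* Reject positions that would overflow or become negative. */
	if (offset > 0 && base > INT64_MAX - offset) {
		ret = -EOVERFLOW;
		goto out;
	}
	if (base + offset < 0) {
		ret = -EINVAL;
		goto out;
	}

	ret = base + offset;
	file->pos = ret;
out:
	pthread_mutex_unlock(&inode->lock);
	return ret;
}
```

As in the note above, the lock only keeps one seek consistent with the file
size; it does not stop other users of the same model_file from moving the
position between calls.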

->read on directories probably must go away - we should just enforce -EISDIR
in sys_read() and friends.

->setlease operations should call generic_setlease() before or after setting
the lease within the individual filesystem to record the result of the
operation.

A ->fallocate implementation must be really careful to maintain page cache
consistency when punching holes or performing other operations that invalidate
page cache contents. Usually the filesystem needs to call
truncate_inode_pages_range() to invalidate the relevant range of the page cache.
However, the filesystem usually also needs to update its internal (and on disk)
view of file offset -> disk block mapping. Until this update is finished, the
filesystem needs to block page faults and reads from reloading now-stale page
cache contents from the disk. Since VFS acquires mapping->invalidate_lock in
shared mode when loading pages from disk (filemap_fault(), filemap_read(),
readahead paths), the fallocate implementation must take the invalidate_lock to
prevent reloading.

->copy_file_range and ->remap_file_range implementations need to serialize
against modifications of file data while the operation is running. For
blocking changes through write(2) and similar operations inode->i_rwsem can be
used. To block changes to file contents via a memory mapping during the
operation, the filesystem must take mapping->invalidate_lock to coordinate
with ->page_mkwrite.

dquot_operations
================

prototypes::

	int (*write_dquot) (struct dquot *);
	int (*acquire_dquot) (struct dquot *);
	int (*release_dquot) (struct dquot *);
	int (*mark_dirty) (struct dquot *);
	int (*write_info) (struct super_block *, int);

These operations are intended to be more or less wrapping functions that ensure
proper locking wrt the filesystem and call the generic quota operations.

What filesystem should expect from the generic quota functions:

==============  ============  =========================
ops             FS recursion  Held locks when called
==============  ============  =========================
write_dquot:    yes           dqonoff_sem or dqptr_sem
acquire_dquot:  yes           dqonoff_sem or dqptr_sem
release_dquot:  yes           dqonoff_sem or dqptr_sem
mark_dirty:     no            -
write_info:     yes           dqonoff_sem
==============  ============  =========================

FS recursion means calling ->quota_read() and ->quota_write() from superblock
operations.

More details about quota locking can be found in fs/dquot.c.

vm_operations_struct
====================

prototypes::

	void (*open)(struct vm_area_struct *);
	void (*close)(struct vm_area_struct *);
	vm_fault_t (*fault)(struct vm_fault *);
	vm_fault_t (*huge_fault)(struct vm_fault *, unsigned int order);
	vm_fault_t (*map_pages)(struct vm_fault *, pgoff_t start, pgoff_t end);
	vm_fault_t (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
	vm_fault_t (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
	int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);

locking rules:

=============  ==========  ===========================
ops            mmap_lock   PageLocked(page)
=============  ==========  ===========================
open:          write
close:         read/write
fault:         read        can return with page locked
huge_fault:    maybe-read
map_pages:     maybe-read
page_mkwrite:  read        can return with page locked
pfn_mkwrite:   read
access:        read
=============  ==========  ===========================

->fault() is called when a previously not present pte is about to be faulted
in. The filesystem must find and return the page associated with the passed in
"pgoff" in the vm_fault structure. If it is possible that the page may be
truncated and/or invalidated, then the filesystem must lock invalidate_lock,
then ensure the page is not already truncated (invalidate_lock will block
subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.

->huge_fault() is called when there is no PUD or PMD entry present. This
gives the filesystem the opportunity to install a PUD or PMD sized page.
Filesystems can also use the ->fault method to return a PMD sized page,
so implementing this function may not be necessary. In particular,
filesystems should not call filemap_fault() from ->huge_fault().
The mmap_lock may not be held when this method is called.

->map_pages() is called when the VM asks to map easily accessible pages.
The filesystem should find and map pages associated with offsets from
"start_pgoff" till "end_pgoff". ->map_pages() is called with the RCU lock held
and must not block. If it's not possible to reach a page without blocking,
the filesystem should skip it. The filesystem should use set_pte_range() to set
up the page table entry. A pointer to the entry associated with the page is
passed in the "pte" field in the vm_fault structure. Pointers to entries for
other offsets should be calculated relative to "pte".

->page_mkwrite() is called when a previously read-only pte is about to become
writeable. The filesystem again must ensure that there are no
truncate/invalidate races or races with operations such as ->remap_file_range
or ->copy_file_range, and then return with the page locked. Usually
mapping->invalidate_lock is suitable for proper serialization. If the page has
been truncated, the filesystem should not look up a new page like the ->fault()
handler, but simply return with VM_FAULT_NOPAGE, which will cause the VM to
retry the fault.

->pfn_mkwrite() is the same as page_mkwrite but when the pte is
VM_PFNMAP or VM_MIXEDMAP with a page-less entry. The expected return is
VM_FAULT_NOPAGE or one of the VM_FAULT_ERROR types. The default behavior
after this call is to make the pte read-write, unless pfn_mkwrite returns
an error.

->access() is called when get_user_pages() fails in
access_process_vm(), typically used to debug a process through
/proc/pid/mem or ptrace. This function is needed only for
VM_IO | VM_PFNMAP VMAs.

--------------------------------------------------------------------------------

	Dubious stuff

(if you break something or notice that it is broken and do not fix it yourself
- at least put it here)