Revision tags: v6.6.25, v6.6.24, v6.6.23 |
|
#
42954c37 |
| 06-Feb-2024 |
Jan Kara <jack@suse.cz> |
quota: Properly annotate i_dquot arrays with __rcu
[ Upstream commit ccb49011bb2ebfd66164dbf68c5bff48917bb5ef ]
Dquots pointed to from i_dquot arrays in inodes are protected by dquot_srcu. Annotate
quota: Properly annotate i_dquot arrays with __rcu
[ Upstream commit ccb49011bb2ebfd66164dbf68c5bff48917bb5ef ]
Dquots pointed to from i_dquot arrays in inodes are protected by dquot_srcu. Annotate them as such and change .get_dquots callback to return properly annotated pointer to make sparse happy.
Fixes: b9ba6f94b238 ("quota: remove dqptr_sem") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
c6f95031 |
| 21-Feb-2024 |
Gabriel Krisman Bertazi <krisman@suse.de> |
ovl: Always reject mounting over case-insensitive directories
[ Upstream commit 2824083db76cb9d4b7910607b367e93b02912865 ]
overlayfs relies on the filesystem setting DCACHE_OP_HASH or DCACHE_OP_COM
ovl: Always reject mounting over case-insensitive directories
[ Upstream commit 2824083db76cb9d4b7910607b367e93b02912865 ]
overlayfs relies on the filesystem setting DCACHE_OP_HASH or DCACHE_OP_COMPARE to reject mounting over case-insensitive directories.
Since commit bb9cd9106b22 ("fscrypt: Have filesystems handle their d_ops"), we set ->d_op through a hook in ->d_lookup, which means the root dentry won't have them, causing the mount to accidentally succeed.
In v6.7-rc7, the following sequence will succeed to mount, but any dentry other than the root dentry will be a "weird" dentry to ovl and fail with EREMOTE.
mkfs.ext4 -O casefold lower.img mount -O loop lower.img lower mount -t overlay -o lowerdir=lower,upperdir=upper,workdir=work ovl /mnt
Mounting on a subdirectory fails, as expected, because DCACHE_OP_HASH and DCACHE_OP_COMPARE are properly set by ->lookup.
Fix by explicitly rejecting superblocks that allow case-insensitive dentries. Yes, this will be solved when we move d_op configuration back to ->s_d_op. Yet, we better have an explicit fix to avoid messing up again.
While there, re-sort the entries to have more descriptive error messages first.
Fixes: bb9cd9106b22 ("fscrypt: Have filesystems handle their d_ops") Acked-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20240221171412.10710-2-krisman@suse.de Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
e7e23fc5 |
| 15-Feb-2024 |
Bart Van Assche <bvanassche@acm.org> |
fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaio
commit b820de741ae48ccf50dd95e297889c286ff4f760 upstream.
If kiocb_set_cancel_fn() is called for I/O submitted via io_uring, the f
fs/aio: Restrict kiocb_set_cancel_fn() to I/O submitted via libaio
commit b820de741ae48ccf50dd95e297889c286ff4f760 upstream.
If kiocb_set_cancel_fn() is called for I/O submitted via io_uring, the following kernel warning appears:
WARNING: CPU: 3 PID: 368 at fs/aio.c:598 kiocb_set_cancel_fn+0x9c/0xa8 Call trace: kiocb_set_cancel_fn+0x9c/0xa8 ffs_epfile_read_iter+0x144/0x1d0 io_read+0x19c/0x498 io_issue_sqe+0x118/0x27c io_submit_sqes+0x25c/0x5fc __arm64_sys_io_uring_enter+0x104/0xab0 invoke_syscall+0x58/0x11c el0_svc_common+0xb4/0xf4 do_el0_svc+0x2c/0xb0 el0_svc+0x2c/0xa4 el0t_64_sync_handler+0x68/0xb4 el0t_64_sync+0x1a4/0x1a8
Fix this by setting the IOCB_AIO_RW flag for read and write I/O that is submitted by libaio.
Suggested-by: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Cc: Avi Kivity <avi@scylladb.com> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240215204739.2677806-2-bvanassche@acm.org Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
Revision tags: v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6 |
|
#
5b5599a7 |
| 04-Oct-2023 |
Jeff Layton <jlayton@kernel.org> |
fs: new accessor methods for atime and mtime
[ Upstream commit 077c212f0344ae4198b2b51af128a94b614ccdf4 ]
Recently, we converted the ctime accesses in the kernel to use new accessor functions. Linu
fs: new accessor methods for atime and mtime
[ Upstream commit 077c212f0344ae4198b2b51af128a94b614ccdf4 ]
Recently, we converted the ctime accesses in the kernel to use new accessor functions. Linus recently pointed out though that if we add accessors for the atime and mtime, then that would allow us to seamlessly change how these timestamps are stored in the inode.
Add new accessor functions for the atime and mtime that mirror the accessors for the ctime.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185239.80830-1-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> Stable-dep-of: 01fe654f78fd ("fs: cifs: Fix atime update check") Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
03adc61e |
| 12-Oct-2023 |
Dan Clash <daclash@linux.microsoft.com> |
audit,io_uring: io_uring openat triggers audit reference count underflow
An io_uring openat operation can update an audit reference count from multiple threads resulting in the call trace below.
A
audit,io_uring: io_uring openat triggers audit reference count underflow
An io_uring openat operation can update an audit reference count from multiple threads resulting in the call trace below.
A call to io_uring_submit() with a single openat op with a flag of IOSQE_ASYNC results in the following reference count updates.
These first part of the system call performs two increments that do not race.
do_syscall_64() __do_sys_io_uring_enter() io_submit_sqes() io_openat_prep() __io_openat_prep() getname() getname_flags() /* update 1 (increment) */ __audit_getname() /* update 2 (increment) */
The openat op is queued to an io_uring worker thread which starts the opportunity for a race. The system call exit performs one decrement.
do_syscall_64() syscall_exit_to_user_mode() syscall_exit_to_user_mode_prepare() __audit_syscall_exit() audit_reset_context() putname() /* update 3 (decrement) */
The io_uring worker thread performs one increment and two decrements. These updates can race with the system call decrement.
io_wqe_worker() io_worker_handle_work() io_wq_submit_work() io_issue_sqe() io_openat() io_openat2() do_filp_open() path_openat() __audit_inode() /* update 4 (increment) */ putname() /* update 5 (decrement) */ __audit_uring_exit() audit_reset_context() putname() /* update 6 (decrement) */
The fix is to change the refcnt member of struct audit_names from int to atomic_t.
kernel BUG at fs/namei.c:262! Call Trace: ... ? putname+0x68/0x70 audit_reset_context.part.0.constprop.0+0xe1/0x300 __audit_uring_exit+0xda/0x1c0 io_issue_sqe+0x1f3/0x450 ? lock_timer_base+0x3b/0xd0 io_wq_submit_work+0x8d/0x2b0 ? __try_to_del_timer_sync+0x67/0xa0 io_worker_handle_work+0x17c/0x2b0 io_wqe_worker+0x10a/0x350
Cc: stable@vger.kernel.org Link: https://lore.kernel.org/lkml/MW2PR2101MB1033FFF044A258F84AEAA584F1C9A@MW2PR2101MB1033.namprd21.prod.outlook.com/ Fixes: 5bd2182d58e9 ("audit,io_uring,io-wq: add some basic audit support to io_uring") Signed-off-by: Dan Clash <daclash@linux.microsoft.com> Link: https://lore.kernel.org/r/20231012215518.GA4048@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
Revision tags: v6.5.5 |
|
#
647aa768 |
| 20-Sep-2023 |
Christian Brauner <brauner@kernel.org> |
Revert "fs: add infrastructure for multigrain timestamps"
This reverts commit ffb6cf19e06334062744b7e3493f71e500964f8e.
Users reported regressions due to enabling multi-grained timestamps unconditi
Revert "fs: add infrastructure for multigrain timestamps"
This reverts commit ffb6cf19e06334062744b7e3493f71e500964f8e.
Users reported regressions due to enabling multi-grained timestamps unconditionally. As no clear consensus on a solution has come up and the discussion has gone back to the drawing board revert the infrastructure changes for. If it isn't code that's here to stay, make it go away.
Message-ID: <20230920-keine-eile-c9755b5825db@brauner> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
Revision tags: v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50 |
|
#
69881be3 |
| 29-Aug-2023 |
Christian Brauner <brauner@kernel.org> |
fs: export sget_dev()
They will be used for mtd devices as well.
Acked-by: Richard Weinberger <richard@nod.at> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230829-vfs-super-mtd-v1-1-fecb572e
fs: export sget_dev()
They will be used for mtd devices as well.
Acked-by: Richard Weinberger <richard@nod.at> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230829-vfs-super-mtd-v1-1-fecb572e5df3@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
Revision tags: v6.5, v6.1.49, v6.1.48 |
|
#
2c18a63b |
| 18-Aug-2023 |
Christian Brauner <brauner@kernel.org> |
super: wait until we passed kill super
Recent rework moved block device closing out of sb->put_super() and into sb->kill_sb() to avoid deadlocks as s_umount is held in put_super() and blkdev_put() c
super: wait until we passed kill super
Recent rework moved block device closing out of sb->put_super() and into sb->kill_sb() to avoid deadlocks as s_umount is held in put_super() and blkdev_put() can end up taking s_umount again.
That means we need to move the removal of the superblock from @fs_supers out of generic_shutdown_super() and into deactivate_locked_super() to ensure that concurrent mounters don't fail to open block devices that are still in use because blkdev_put() in sb->kill_sb() hasn't been called yet.
We can now do this as we can make iterators through @fs_super and @super_blocks wait without holding s_umount. Concurrent mounts will wait until a dying superblock is fully dead so until sb->kill_sb() has been called and SB_DEAD been set. Concurrent iterators can already discard any SB_DYING superblock.
Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230818-vfs-super-fixes-v3-v3-4-9f0b1876e46b@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
5e874914 |
| 18-Aug-2023 |
Christian Brauner <brauner@kernel.org> |
super: wait for nascent superblocks
Recent patches experiment with making it possible to allocate a new superblock before opening the relevant block device. Naturally this has intricate side-effects
super: wait for nascent superblocks
Recent patches experiment with making it possible to allocate a new superblock before opening the relevant block device. Naturally this has intricate side-effects that we get to learn about while developing this.
Superblock allocators such as sget{_fc}() return with s_umount of the new superblock held and lock ordering currently requires that block level locks such as bdev_lock and open_mutex rank above s_umount.
Before aca740cecbe5 ("fs: open block device after superblock creation") ordering was guaranteed to be correct as block devices were opened prior to superblock allocation and thus s_umount wasn't held. But now s_umount must be dropped before opening block devices to avoid locking violations.
This has consequences. The main one being that iterators over @super_blocks and @fs_supers that grab a temporary reference to the superblock can now also grab s_umount before the caller has managed to open block devices and called fill_super(). So whereas before such iterators or concurrent mounts would have simply slept on s_umount until SB_BORN was set or the superblock was discard due to initalization failure they can now needlessly spin through sget{_fc}().
If the caller is sleeping on bdev_lock or open_mutex one caller waiting on SB_BORN will always spin somewhere and potentially this can go on for quite a while.
It should be possible to drop s_umount while allowing iterators to wait on a nascent superblock to either be born or discarded. This patch implements a wait_var_event() mechanism allowing iterators to sleep until they are woken when the superblock is born or discarded.
This also allows us to avoid relooping through @fs_supers and @super_blocks if a superblock isn't yet born or dying.
Link: aca740cecbe5 ("fs: open block device after superblock creation") Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230818-vfs-super-fixes-v3-v3-3-9f0b1876e46b@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
ed0360bb |
| 17-Aug-2023 |
Amir Goldstein <amir73il@gmail.com> |
fs: create kiocb_{start,end}_write() helpers
aio, io_uring, cachefiles and overlayfs, all open code an ugly variant of file_{start,end}_write() to silence lockdep warnings.
Create helpers for this
fs: create kiocb_{start,end}_write() helpers
aio, io_uring, cachefiles and overlayfs, all open code an ugly variant of file_{start,end}_write() to silence lockdep warnings.
Create helpers for this lockdep dance so we can use the helpers in all the callers.
Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Message-Id: <20230817141337.1025891-4-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
f6c05b9e |
| 17-Aug-2023 |
Amir Goldstein <amir73il@gmail.com> |
fs: add kerneldoc to file_{start,end}_write() helpers
and use sb_end_write() instead of open coded version.
Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> R
fs: add kerneldoc to file_{start,end}_write() helpers
and use sb_end_write() instead of open coded version.
Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Message-Id: <20230817141337.1025891-3-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
Revision tags: v6.1.46, v6.1.45 |
|
#
38bcdd38 |
| 11-Aug-2023 |
Christoph Hellwig <hch@lst.de> |
fs: remove get_super
get_super is unused now, remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda
fs: remove get_super
get_super is unused now, remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Message-Id: <20230811100828.1897174-17-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
Revision tags: v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39 |
|
#
aee79d4e |
| 16-Jul-2023 |
Zhu, Lipeng <lipeng.zhu@intel.com> |
fs/address_space: add alignment padding for i_map and i_mmap_rwsem to mitigate a false sharing.
When running UnixBench/Shell Scripts, we observed high false sharing for accessing i_mmap against i_mm
fs/address_space: add alignment padding for i_map and i_mmap_rwsem to mitigate a false sharing.
When running UnixBench/Shell Scripts, we observed high false sharing for accessing i_mmap against i_mmap_rwsem.
UnixBench/Shell Scripts are typical load/execute command test scenarios, which concurrently launch->execute->exit a lot of shell commands. A lot of processes invoke vma_interval_tree_remove which touch "i_mmap", the call stack:
----vma_interval_tree_remove |----unlink_file_vma | free_pgtables | |----exit_mmap | | mmput | | |----begin_new_exec | | | load_elf_binary | | | bprm_execve
Meanwhile, there are a lot of processes touch 'i_mmap_rwsem' to acquire the semaphore in order to access 'i_mmap'. In existing 'address_space' layout, 'i_mmap' and 'i_mmap_rwsem' are in the same cacheline.
The patch places the i_mmap and i_mmap_rwsem in separate cache lines to avoid this false sharing problem.
With this patch, based on kernel v6.4.0, on Intel Sapphire Rapids 112C/224T platform, the score improves by ~5.3%. And perf c2c tool shows the false sharing is resolved as expected, the symbol vma_interval_tree_remove disappeared in cache line 0 after this change.
Baseline: ================================================= Shared Cache Line Distribution Pareto ================================================= ------------------------------------------------------------- 0 3729 5791 0 0 0xff19b3818445c740 ------------------------------------------------------------- 3.27% 3.02% 0.00% 0.00% 0x18 0 1 0xffffffffa194403b 604 483 389 692 203 [k] vma_interval_tree_insert [kernel.kallsyms] vma_interval_tree_insert+75 0 1 4.13% 3.63% 0.00% 0.00% 0x20 0 1 0xffffffffa19440a2 553 413 415 962 215 [k] vma_interval_tree_remove [kernel.kallsyms] vma_interval_tree_remove+18 0 1 2.04% 1.35% 0.00% 0.00% 0x28 0 1 0xffffffffa219a1d6 1210 855 460 1229 222 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+678 0 1 0.62% 1.85% 0.00% 0.00% 0x28 0 1 0xffffffffa219a1bf 762 329 577 527 198 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+655 0 1 0.48% 0.31% 0.00% 0.00% 0x28 0 1 0xffffffffa219a58c 1677 1476 733 1544 224 [k] down_write [kernel.kallsyms] down_write+28 0 1 0.05% 0.07% 0.00% 0.00% 0x28 0 1 0xffffffffa219a21d 1040 819 689 33 27 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+749 0 1 0.00% 0.05% 0.00% 0.00% 0x28 0 1 0xffffffffa17707db 0 1005 786 1373 223 [k] up_write [kernel.kallsyms] up_write+27 0 1 0.00% 0.02% 0.00% 0.00% 0x28 0 1 0xffffffffa219a064 0 233 778 32 30 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+308 0 1 33.82% 34.10% 0.00% 0.00% 0x30 0 1 0xffffffffa1770945 779 495 534 6011 224 [k] rwsem_spin_on_owner [kernel.kallsyms] rwsem_spin_on_owner+53 0 1 17.06% 15.28% 0.00% 0.00% 0x30 0 1 0xffffffffa1770915 593 438 468 2715 224 [k] rwsem_spin_on_owner [kernel.kallsyms] rwsem_spin_on_owner+5 0 1 3.54% 3.52% 0.00% 0.00% 0x30 0 1 0xffffffffa2199f84 881 601 583 1421 223 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+84 0 1
With this change: ------------------------------------------------------------- 0 556 838 0 0 0xff2780d7965d2780 ------------------------------------------------------------- 0.18% 0.60% 0.00% 0.00% 0x8 0 1 0xffffffffafff27b8 503 453 569 14 13 [k] do_dentry_open [kernel.kallsyms] do_dentry_open+456 0 1 0.54% 0.12% 0.00% 0.00% 0x8 0 1 0xffffffffaffc51ac 510 199 428 15 12 [k] hugepage_vma_check [kernel.kallsyms] hugepage_vma_check+252 0 1 1.80% 2.15% 0.00% 0.00% 0x18 0 1 0xffffffffb079a1d6 1778 799 343 215 136 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+678 0 1 0.54% 1.31% 0.00% 0.00% 0x18 0 1 0xffffffffb079a1bf 547 296 528 91 71 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+655 0 1 0.72% 0.72% 0.00% 0.00% 0x18 0 1 0xffffffffb079a58c 1479 1534 676 288 163 [k] down_write [kernel.kallsyms] down_write+28 0 1 0.00% 0.12% 0.00% 0.00% 0x18 0 1 0xffffffffafd707db 0 2381 744 282 158 [k] up_write [kernel.kallsyms] up_write+27 0 1 0.00% 0.12% 0.00% 0.00% 0x18 0 1 0xffffffffb079a064 0 239 518 6 6 [k] rwsem_down_write_slowpath [kernel.kallsyms] rwsem_down_write_slowpath+308 0 1 46.58% 47.02% 0.00% 0.00% 0x20 0 1 0xffffffffafd70945 704 403 499 1137 219 [k] rwsem_spin_on_owner [kernel.kallsyms] rwsem_spin_on_owner+53 0 1 23.92% 25.78% 0.00% 0.00% 0x20 0 1 0xffffffffafd70915 558 413 500 542 185 [k] rwsem_spin_on_owner [kernel.kallsyms] rwsem_spin_on_owner+5 0 1
v1->v2: change padding to exchange fields.
Link: https://lkml.kernel.org/r/20230716145653.20122-1-lipeng.zhu@intel.com Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Yu Ma <yu.ma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
#
45e0d4b9 |
| 11-Aug-2023 |
Mateusz Guzik <mjguzik@gmail.com> |
vfs: fix up the assert in i_readcount_dec
Drops a race where 2 threads could spot a positive value and both proceed to dec to -1, without reporting anything.
Signed-off-by: Mateusz Guzik <mjguzik@g
vfs: fix up the assert in i_readcount_dec
Drops a race where 2 threads could spot a positive value and both proceed to dec to -1, without reporting anything.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> Message-Id: <20230811194814.1612336-1-mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
ffb6cf19 |
| 07-Aug-2023 |
Jeff Layton <jlayton@kernel.org> |
fs: add infrastructure for multigrain timestamps
The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optim
fs: add infrastructure for multigrain timestamps
The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot metadata updates, down to around 1 per jiffy, even when a file is under heavy writes.
Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g backup applications).
If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates.
What we need is a way to only use fine-grained timestamps when they are being actively queried.
POSIX generally mandates that when the the mtime changes, the ctime must also change. The kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used.
Use the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one.
Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps.
Later patches will convert individual filesystems to use the new infrastructure.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-9-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
913e9928 |
| 07-Aug-2023 |
Jeff Layton <jlayton@kernel.org> |
fs: drop the timespec64 argument from update_time
Now that all of the update_time operations are prepared for it, we can drop the timespec64 argument from the update_time operation. Do that and remo
fs: drop the timespec64 argument from update_time
Now that all of the update_time operations are prepared for it, we can drop the timespec64 argument from the update_time operation. Do that and remove it from some associated functions like inode_update_time and inode_needs_update_time.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-8-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
Revision tags: v6.1.38, v6.1.37 |
|
#
6faddda6 |
| 30-Jun-2023 |
Chuck Lever <chuck.lever@oracle.com> |
libfs: Add directory operations for stable offsets
Create a vector of directory operations in fs/libfs.c that handles directory seeks and readdir via stable offsets instead of the current cursor-bas
libfs: Add directory operations for stable offsets
Create a vector of directory operations in fs/libfs.c that handles directory seeks and readdir via stable offsets instead of the current cursor-based mechanism.
For the moment these are unused.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Message-Id: <168814732984.530310.11190772066786107220.stgit@manet.1015granger.net> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
541d4c79 |
| 07-Aug-2023 |
Jeff Layton <jlayton@kernel.org> |
fs: drop the timespec64 arg from generic_update_time
In future patches we're going to change how the ctime is updated to keep track of when it has been queried. The way that the update_time operatio
fs: drop the timespec64 arg from generic_update_time
In future patches we're going to change how the ctime is updated to keep track of when it has been queried. The way that the update_time operation works (and a lot of its callers) make this difficult, since they grab a timestamp early and then pass it down to eventually be copied into the inode.
All of the existing update_time callers pass in the result of current_time() in some fashion. Drop the "time" parameter from generic_update_time, and rework it to fetch its own timestamp.
This change means that an update_time could fetch a different timestamp than was seen in inode_needs_update_time. update_time is only ever called with one of two flag combinations: Either S_ATIME is set, or S_MTIME|S_CTIME|S_VERSION are set.
With this change we now treat the flags argument as an indicator that some value needed to be updated when last checked, rather than an indication to update specific timestamps.
Rework the logic for updating the timestamps and put it in a new inode_update_timestamps helper that other update_time routines can use. S_ATIME is as treated as we always have, but if any of the other three are set, then we attempt to update all three.
Also, some callers of generic_update_time need to know what timestamps were actually updated. Change it to return an S_* flag mask to indicate that and rework the callers to expect it.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230807-mgctime-v7-3-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
0d72b928 |
| 07-Aug-2023 |
Jeff Layton <jlayton@kernel.org> |
fs: pass the request_mask to generic_fillattr
generic_fillattr just fills in the entire stat struct indiscriminately today, copying data from the inode. There is at least one attribute (STATX_CHANGE
fs: pass the request_mask to generic_fillattr
generic_fillattr just fills in the entire stat struct indiscriminately today, copying data from the inode. There is at least one attribute (STATX_CHANGE_COOKIE) that can have side effects when it is reported, and we're looking at adding more with the addition of multigrain timestamps.
Add a request_mask argument to generic_fillattr and have most callers just pass in the value that is passed to getattr. Have other callers (e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of STATX_CHANGE_COOKIE into generic_fillattr.
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Paulo Alcantara (SUSE)" <pc@manguebit.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jeff Layton <jlayton@kernel.org> Message-Id: <20230807-mgctime-v7-2-d1dec143a704@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
3e327154 |
| 05-Aug-2023 |
Linus Torvalds <torvalds@linux-foundation.org> |
vfs: get rid of old '->iterate' directory operation
All users now just use '->iterate_shared()', which only takes the directory inode lock for reading.
Filesystems that never got convered to shared
vfs: get rid of old '->iterate' directory operation
All users now just use '->iterate_shared()', which only takes the directory inode lock for reading.
Filesystems that never got convered to shared mode now instead use a wrapper that drops the lock, re-takes it in write mode, calls the old function, and then downgrades the lock back to read mode.
This way the VFS layer and other callers no longer need to care about filesystems that never got converted to the modern era.
The filesystems that use the new wrapper are ceph, coda, exfat, jfs, ntfs, ocfs2, overlayfs, and vboxsf.
Honestly, several of them look like they really could just iterate their directories in shared mode and skip the wrapper entirely, but the point of this change is to not change semantics or fix filesystems that haven't been fixed in the last 7+ years, but to finally get rid of the dual iterators.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
9cf3516c |
| 08-Jul-2023 |
Jens Axboe <axboe@kernel.dk> |
fs: add IOCB flags related to passing back dio completions
Async dio completions generally happen from hard/soft IRQ context, which means that users like iomap may need to defer some of the completi
fs: add IOCB flags related to passing back dio completions
Async dio completions generally happen from hard/soft IRQ context, which means that users like iomap may need to defer some of the completion handling to a workqueue. This is less efficient than having the original issuer handle it, like we do for sync IO, and it adds latency to the completions.
Add IOCB_DIO_CALLER_COMP, which the issuer can set if it is able to safely punt these completions to a safe context. If the dio handler is aware of this flag, assign a callback handler in kiocb->dio_complete and associated data io kiocb->private. The issuer will then call this handler with that data from task context.
No functional changes in this patch.
Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
show more ...
|
#
13bc2445 |
| 05-Jul-2023 |
Jeff Layton <jlayton@kernel.org> |
fs: rename i_ctime field to __i_ctime
Now that everything in-tree is converted to use the accessor functions, rename the i_ctime field in the inode to discourage direct access.
Signed-off-by: Jeff
fs: rename i_ctime field to __i_ctime
Now that everything in-tree is converted to use the accessor functions, rename the i_ctime field in the inode to discourage direct access.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230705185812.579118-4-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
880b9577 |
| 17-Jul-2023 |
Darrick J. Wong <djwong@kernel.org> |
fs: distinguish between user initiated freeze and kernel initiated freeze
Userspace can freeze a filesystem using the FIFREEZE ioctl or by suspending the block device; this state persists until user
fs: distinguish between user initiated freeze and kernel initiated freeze
Userspace can freeze a filesystem using the FIFREEZE ioctl or by suspending the block device; this state persists until userspace thaws the filesystem with the FITHAW ioctl or resuming the block device. Since commit 18e9e5104fcd ("Introduce freeze_super and thaw_super for the fsfreeze ioctl") we only allow the first freeze command to succeed.
The kernel may decide that it is necessary to freeze a filesystem for its own internal purposes, such as suspends in progress, filesystem fsck activities, or quiescing a device prior to removal. Userspace thaw commands must never break a kernel freeze, and kernel thaw commands shouldn't undo userspace's freeze command.
Introduce a couple of freeze holder flags and wire it into the sb_writers state. One kernel and one userspace freeze are allowed to coexist at the same time; the filesystem will not thaw until both are lifted.
I wonder if the f2fs/gfs2 code should be using a kernel freeze here, but for now we'll use FREEZE_HOLDER_USERSPACE to preserve existing behaviors.
Cc: mcgrof@kernel.org Cc: jack@suse.cz Cc: hch@infradead.org Cc: ruansy.fnst@fujitsu.com Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz>
show more ...
|
Revision tags: v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10 |
|
#
ed5f17f6 |
| 01-Feb-2023 |
Luca Vizzarro <Luca.Vizzarro@arm.com> |
fs: Pass argument to fcntl_setlease as int
The interface for fcntl expects the argument passed for the command F_SETLEASE to be of type int. The current code wrongly treats it as a long. In order to
fs: Pass argument to fcntl_setlease as int
The interface for fcntl expects the argument passed for the command F_SETLEASE to be of type int. The current code wrongly treats it as a long. In order to avoid access to undefined bits, we should explicitly cast the argument to int.
Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Kevin Brodsky <Kevin.Brodsky@arm.com> Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: David Laight <David.Laight@ACULAB.com> Cc: Mark Rutland <Mark.Rutland@arm.com> Cc: linux-fsdevel@vger.kernel.org Cc: linux-cifs@vger.kernel.org Cc: linux-nfs@vger.kernel.org Cc: linux-morello@op-lists.linaro.org Signed-off-by: Luca Vizzarro <Luca.Vizzarro@arm.com> Message-Id: <20230414152459.816046-3-Luca.Vizzarro@arm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
bccb5c39 |
| 02-Feb-2023 |
Luca Vizzarro <Luca.Vizzarro@arm.com> |
fcntl: Cast commands with int args explicitly
According to the fcntl API specification commands that expect an integer, hence not a pointer, always take an int and not long. In order to avoid access
fcntl: Cast commands with int args explicitly
According to the fcntl API specification commands that expect an integer, hence not a pointer, always take an int and not long. In order to avoid access to undefined bits, we should explicitly cast the argument to int.
Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Kevin Brodsky <Kevin.Brodsky@arm.com> Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com> Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: David Laight <David.Laight@ACULAB.com> Cc: Mark Rutland <Mark.Rutland@arm.com> Cc: linux-fsdevel@vger.kernel.org Cc: linux-morello@op-lists.linaro.org Signed-off-by: Luca Vizzarro <Luca.Vizzarro@arm.com> Message-Id: <20230414152459.816046-2-Luca.Vizzarro@arm.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|