Revision tags: v6.6.25, v6.6.24, v6.6.23 |
|
#
e25447c3 |
| 01-Mar-2024 |
Josef Bacik <josef@toxicpanda.com> |
nfs: fix UAF in direct writes
[ Upstream commit 17f46b803d4f23c66cacce81db35fef3adb8f2af ]
In production we have been hitting the following warning consistently
------------[ cut here ]-----------
nfs: fix UAF in direct writes
[ Upstream commit 17f46b803d4f23c66cacce81db35fef3adb8f2af ]
In production we have been hitting the following warning consistently
------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 17 PID: 1800359 at lib/refcount.c:28 refcount_warn_saturate+0x9c/0xe0 Workqueue: nfsiod nfs_direct_write_schedule_work [nfs] RIP: 0010:refcount_warn_saturate+0x9c/0xe0 PKRU: 55555554 Call Trace: <TASK> ? __warn+0x9f/0x130 ? refcount_warn_saturate+0x9c/0xe0 ? report_bug+0xcc/0x150 ? handle_bug+0x3d/0x70 ? exc_invalid_op+0x16/0x40 ? asm_exc_invalid_op+0x16/0x20 ? refcount_warn_saturate+0x9c/0xe0 nfs_direct_write_schedule_work+0x237/0x250 [nfs] process_one_work+0x12f/0x4a0 worker_thread+0x14e/0x3b0 ? ZSTD_getCParams_internal+0x220/0x220 kthread+0xdc/0x120 ? __btf_name_valid+0xa0/0xa0 ret_from_fork+0x1f/0x30
This is because we're completing the nfs_direct_request twice in a row.
The source of this is when we have our commit requests to submit, we process them and send them off, and then in the completion path for the commit requests we have
if (nfs_commit_end(cinfo.mds)) nfs_direct_write_complete(dreq);
However since we're submitting asynchronous requests we sometimes have one that completes before we submit the next one, so we end up calling complete on the nfs_direct_request twice.
The only other place we use nfs_generic_commit_list() is in __nfs_commit_inode, which wraps this call in a
nfs_commit_begin(); nfs_commit_end();
Which is a common pattern for this style of completion handling, one that is also repeated in the direct code with get_dreq()/put_dreq() calls around where we process events as well as in the completion paths.
Fix this by using the same pattern for the commit requests.
Before with my 200 node rocksdb stress running this warning would pop every 10ish minutes. With my patch the stress test has been running for several hours without popping.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v6.6.25, v6.6.24, v6.6.23 |
|
#
e25447c3 |
| 01-Mar-2024 |
Josef Bacik <josef@toxicpanda.com> |
nfs: fix UAF in direct writes
[ Upstream commit 17f46b803d4f23c66cacce81db35fef3adb8f2af ]
In production we have been hitting the following warning consistently
------------[ cut here ]-----------
nfs: fix UAF in direct writes
[ Upstream commit 17f46b803d4f23c66cacce81db35fef3adb8f2af ]
In production we have been hitting the following warning consistently
------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 17 PID: 1800359 at lib/refcount.c:28 refcount_warn_saturate+0x9c/0xe0 Workqueue: nfsiod nfs_direct_write_schedule_work [nfs] RIP: 0010:refcount_warn_saturate+0x9c/0xe0 PKRU: 55555554 Call Trace: <TASK> ? __warn+0x9f/0x130 ? refcount_warn_saturate+0x9c/0xe0 ? report_bug+0xcc/0x150 ? handle_bug+0x3d/0x70 ? exc_invalid_op+0x16/0x40 ? asm_exc_invalid_op+0x16/0x20 ? refcount_warn_saturate+0x9c/0xe0 nfs_direct_write_schedule_work+0x237/0x250 [nfs] process_one_work+0x12f/0x4a0 worker_thread+0x14e/0x3b0 ? ZSTD_getCParams_internal+0x220/0x220 kthread+0xdc/0x120 ? __btf_name_valid+0xa0/0xa0 ret_from_fork+0x1f/0x30
This is because we're completing the nfs_direct_request twice in a row.
The source of this is when we have our commit requests to submit, we process them and send them off, and then in the completion path for the commit requests we have
if (nfs_commit_end(cinfo.mds)) nfs_direct_write_complete(dreq);
However since we're submitting asynchronous requests we sometimes have one that completes before we submit the next one, so we end up calling complete on the nfs_direct_request twice.
The only other place we use nfs_generic_commit_list() is in __nfs_commit_inode, which wraps this call in a
nfs_commit_begin(); nfs_commit_end();
Which is a common pattern for this style of completion handling, one that is also repeated in the direct code with get_dreq()/put_dreq() calls around where we process events as well as in the completion paths.
Fix this by using the same pattern for the commit requests.
Before with my 200 node rocksdb stress running this warning would pop every 10ish minutes. With my patch the stress test has been running for several hours without popping.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21 |
|
#
3db63daa |
| 21-Mar-2023 |
NeilBrown <neilb@suse.de> |
NFSv3: handle out-of-order write replies.
NFSv3 includes pre/post wcc attributes which allow the client to determine if all changes to the file have been made by the client itself, or if any might h
NFSv3: handle out-of-order write replies.
NFSv3 includes pre/post wcc attributes which allow the client to determine if all changes to the file have been made by the client itself, or if any might have been made by some other client.
If there are gaps in the pre/post ctime sequence it must be assumed that some other client changed the file in that gap and the local cache must be suspect. The next time the file is opened the cache should be invalidated.
Since Commit 1c341b777501 ("NFS: Add deferred cache invalidation for close-to-open consistency violations") in linux 5.3 the Linux client has been triggering this invalidation. The chunk in nfs_update_inode() in particularly triggers.
Unfortunately Linux NFS assumes that all replies will be processed in the order sent, and will arrive in the order processed. This is not true in general. Consequently Linux NFS might ignore the wcc info in a WRITE reply because the reply is in response to a WRITE that was sent before some other request for which a reply has already been seen. This is detected by Linux using the gencount tests in nfs_inode_attr_cmp().
Also, when the gencount tests pass it is still possible that the request were processed on the server in a different order, and a gap seen in the ctime sequence might be filled in by a subsequent reply, so gaps should not immediately trigger delayed invalidation.
The net result is that writing to a server and then reading the file back can result in going to the server for the read rather than serving it from cache - all because a couple of replies arrived out-of-order. This is a performance regression over kernels before 5.3, though the change in 5.3 is a correctness improvement.
This has been seen with Linux writing to a Netapp server which occasionally re-orders requests. In testing the majority of requests were in-order, but a few (maybe 2 or three at a time) could be re-ordered.
This patch addresses the problem by recording any gaps seen in the pre/post ctime sequence and not triggering invalidation until either there are too many gaps to fit in the table, or until there are no more active writes and the remaining gaps cannot be resolved.
We allocate a table of 16 gaps on demand. If the allocation fails we revert to current behaviour which is of little cost as we are unlikely to be able to cache the writes anyway.
In the table we store "start->end" pair when iversion is updated and "end<-start" pairs pre/post pairs reported by the server. Usually these exactly cancel out and so nothing is stored. When there are out-of-order replies we do store gaps and these will eventually be cancelled against later replies when this client is the only writer.
If the final write is out-of-order there may be one gap remaining when the file is closed. This will be noticed and if there is precisely on gap and if the iversion can be advanced to match it, then we do so.
This patch makes no attempt to handle directories correctly. The same problem potentially exists in the out-of-order replies to create/unlink requests can cause future lookup requires to be sent to the server unnecessarily. A similar scheme using the same primitives could be used to notice and handle out-of-order replies.
Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
show more ...
|
Revision tags: v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13 |
|
#
03f5bd75 |
| 20-Feb-2023 |
Dave Wysochanski <dwysocha@redhat.com> |
NFS: Remove fscache specific trace points and NFS_INO_FSCACHE bit
The NFS specific trace points are no longer needed as tracing is well covered by netfs and fscache.
Signed-off-by: Dave Wysochanski
NFS: Remove fscache specific trace points and NFS_INO_FSCACHE bit
The NFS specific trace points are no longer needed as tracing is well covered by netfs and fscache.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Tested-by: Daire Byrne <daire@dneg.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
show more ...
|
#
88a4d7bd |
| 20-Feb-2023 |
Dave Wysochanski <dwysocha@redhat.com> |
NFS: Configure support for netfs when NFS fscache is configured
As first steps for support of the netfs library when NFS_FSCACHE is configured, add NETFS_SUPPORT to Kconfig and add the required netf
NFS: Configure support for netfs when NFS fscache is configured
As first steps for support of the netfs library when NFS_FSCACHE is configured, add NETFS_SUPPORT to Kconfig and add the required netfs_inode into struct nfs_inode.
Using netfs requires we move the VFS inode structure to be stored inside struct netfs_inode, along with the fscache_cookie. Thus, if NFS_FSCACHE is configured, place netfs_inode inside an anonymous union so the vfs_inode memory is the same and we do not need to modify other non-fscache areas of NFS. In addition, inside the NFS fscache code, use the new helpers, netfs_inode() and netfs_i_cookie() helpers, and remove our own helper, nfs_i_fscache().
Later patches will convert NFS fscache to fully use netfs.
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Tested-by: Daire Byrne <daire@dneg.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
show more ...
|
Revision tags: v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9, v6.1.8 |
|
#
4cbf7694 |
| 19-Jan-2023 |
Trond Myklebust <trond.myklebust@hammerspace.com> |
NFS: Remove unused function nfs_wb_page()
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
#
0c493b5c |
| 19-Jan-2023 |
Trond Myklebust <trond.myklebust@hammerspace.com> |
NFS: Convert buffered writes to use folios
Mostly mechanical conversion of struct page and functions into struct folio equivalents. The lack of support for folios in write_cache_pages(), means we st
NFS: Convert buffered writes to use folios
Mostly mechanical conversion of struct page and functions into struct folio equivalents. The lack of support for folios in write_cache_pages(), means we still only support order 0 folio allocations. However the rest of the writeback code should now be ready for order n > 0.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
show more ...
|
#
4b27232a |
| 19-Jan-2023 |
Trond Myklebust <trond.myklebust@hammerspace.com> |
NFS: Add a helper nfs_wb_folio()
...and use it in nfs_launder_folio().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
|
Revision tags: v6.1.7, v6.1.6 |
|
#
4609e1f1 |
| 13-Jan-2023 |
Christian Brauner <brauner@kernel.org> |
fs: port ->permission() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is j
fs: port ->permission() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|
#
b74d24f7 |
| 13-Jan-2023 |
Christian Brauner <brauner@kernel.org> |
fs: port ->getattr() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just
fs: port ->getattr() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|
#
c1632a0f |
| 13-Jan-2023 |
Christian Brauner <brauner@kernel.org> |
fs: port ->setattr() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just
fs: port ->setattr() to pass mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|
Revision tags: v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66, v5.15.65, v5.15.64 |
|
#
0eb43812 |
| 26-Aug-2022 |
Trond Myklebust <trond.myklebust@hammerspace.com> |
NFS: Clear the file access cache upon login
POSIX typically only refreshes the user's supplementary group information upon login. Since NFS servers may often refresh their concept of the user supple
NFS: Clear the file access cache upon login
POSIX typically only refreshes the user's supplementary group information upon login. Since NFS servers may often refresh their concept of the user supplementary group membership at their own cadence, it is possible for the NFS client's access cache to become stale due to the user's group membership changing on the server after the user has already logged in on the client. While it is reasonable to expect that such group membership changes are rare, and that we do not want to optimise the cache to accommodate them, it is also not unreasonable for the user to expect that if they log out and log back in again, that the staleness would clear up.
Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Tested-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
show more ...
|
Revision tags: v5.15.63, v5.15.62, v5.15.61 |
|
#
5f6277a0 |
| 13-Aug-2022 |
Trond Myklebust <trond.myklebust@hammerspace.com> |
NFS: Cleanup to remove unused flag NFS_CONTEXT_RESEND_WRITES
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
|
#
67f4b5dc |
| 13-Aug-2022 |
Trond Myklebust <trond.myklebust@hammerspace.com> |
NFS: Fix another fsync() issue after a server reboot
Currently, when the writeback code detects a server reboot, it redirties any pages that were not committed to disk, and it sets the flag NFS_CONT
NFS: Fix another fsync() issue after a server reboot
Currently, when the writeback code detects a server reboot, it redirties any pages that were not committed to disk, and it sets the flag NFS_CONTEXT_RESEND_WRITES in the nfs_open_context of the file descriptor that dirtied the file. While this allows the file descriptor in question to redrive its own writes, it violates the fsync() requirement that we should be synchronising all writes to disk. While the problem is infrequent, we do see corner cases where an untimely server reboot causes the fsync() call to abandon its attempt to sync data to disk and causing data corruption issues due to missed error conditions or similar.
In order to tighted up the client's ability to deal with this situation without introducing livelocks, add a counter that records the number of times pages are redirtied due to a server reboot-like condition, and use that in fsync() to redrive the sync to disk.
Fixes: 2197e9b06c22 ("NFS: Fix up fsync() when the server rebooted") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
show more ...
|
Revision tags: v5.15.60, v5.15.59 |
|
#
3c59366c |
| 31-Jul-2022 |
NeilBrown <neilb@suse.de> |
NFS: don't unhash dentry during unlink/rename
NFS unlink() (and rename over existing target) must determine if the file is open, and must perform a "silly rename" instead of an unlink (or before ren
NFS: don't unhash dentry during unlink/rename
NFS unlink() (and rename over existing target) must determine if the file is open, and must perform a "silly rename" instead of an unlink (or before rename) if it is. Otherwise the client might hold a file open which has been removed on the server.
Consequently if it determines that the file isn't open, it must block any subsequent opens until the unlink/rename has been completed on the server.
This is currently achieved by unhashing the dentry. This forces any open attempt to the slow-path for lookup which will block on i_rwsem on the directory until the unlink/rename completes. A future patch will change the VFS to only get a shared lock on i_rwsem for unlink, so this will no longer work.
Instead we introduce an explicit interlock. A special value is stored in dentry->d_fsdata while the unlink/rename is running and ->d_revalidate blocks while that value is present. When ->d_revalidate unblocks, the dentry will be invalid. This closes the race without requiring exclusion on i_rwsem.
d_fsdata is already used in two different ways. 1/ an IS_ROOT directory dentry might have a "devname" stored in d_fsdata. Such a dentry doesn't have a name and so cannot be the target of unlink or rename. For safety we check if an old devname is still stored, and remove it if it is. 2/ a dentry with DCACHE_NFSFS_RENAMED set will have a 'struct nfs_unlinkdata' stored in d_fsdata. While this is set maydelete() will fail, so an unlink or rename will never proceed on such a dentry.
Neither of these can be in effect when a dentry is the target of unlink or rename. So we can expect d_fsdata to be NULL, and store a special value ((void*)1) which is given the name NFS_FSDATA_BLOCKED to indicate that any lookup will be blocked.
The d_count() is incremented under d_lock() when a lookup finds the dentry, so we check d_count() is low, and set NFS_FSDATA_BLOCKED under the same lock to avoid any races.
Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
show more ...
|
Revision tags: v5.19, v5.15.58, v5.15.57, v5.15.56, v5.15.55, v5.15.54, v5.15.53, v5.15.52, v5.15.51, v5.15.50, v5.15.49, v5.15.48, v5.15.47, v5.15.46, v5.15.45, v5.15.44, v5.15.43, v5.15.42, v5.18, v5.15.41, v5.15.40, v5.15.39 |
|
#
eb79f3af |
| 09-May-2022 |
NeilBrown <neilb@suse.de> |
nfs: rename nfs_direct_IO and use as ->swap_rw
The nfs_direct_IO() exists to support SWAP IO, but hasn't worked for a while. We now need a ->swap_rw function which behaves slightly differently, ret
nfs: rename nfs_direct_IO and use as ->swap_rw
The nfs_direct_IO() exists to support SWAP IO, but hasn't worked for a while. We now need a ->swap_rw function which behaves slightly differently, returning zero for success rather than a byte count.
So modify nfs_direct_IO accordingly, rename it, and use it as the ->swap_rw function.
Link: https://lkml.kernel.org/r/165119301493.15698.7491285551903597618.stgit@noble.brown Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> (on Renesas RSK+RZA1 with 32 MiB of SDRAM) Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
Revision tags: v5.15.38, v5.15.37 |
|
#
65d023af |
| 29-Apr-2022 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
nfs: Convert nfs to read_folio
This is a "weak" conversion which converts straight back to using pages. A full conversion should be performed at some point, hopefully by someone familiar with the fi
nfs: Convert nfs to read_folio
This is a "weak" conversion which converts straight back to using pages. A full conversion should be performed at some point, hopefully by someone familiar with the filesystem.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
show more ...
|
Revision tags: v5.15.36, v5.15.35, v5.15.34, v5.15.33, v5.15.32, v5.15.31 |
|
#
515dcdcd |
| 21-Mar-2022 |
Trond Myklebust <trond.myklebust@hammerspace.com> |
NFS: nfsiod should not block forever in mempool_alloc()
The concern is that since nfsiod is sometimes required to kick off a commit, it can get locked up waiting forever in mempool_alloc() instead o
NFS: nfsiod should not block forever in mempool_alloc()
The concern is that since nfsiod is sometimes required to kick off a commit, it can get locked up waiting forever in mempool_alloc() instead of failing gracefully and leaving the commit until later.
Try to allocate from the slab first, with GFP_KERNEL | __GFP_NORETRY, then fall back to a non-blocking attempt to allocate from the memory pool.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
show more ...
|
Revision tags: v5.17, v5.15.30, v5.15.29, v5.15.28, v5.15.27, v5.15.26, v5.15.25, v5.15.24, v5.15.23 |
|
#
6d740c76 |
| 09-Feb-2022 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
nfs: Convert from invalidatepage to invalidate_folio
Print the folio index instead of the pointer, since this is more useful. We also don't need to use page_file_mapping() as we do not invalidate s
nfs: Convert from invalidatepage to invalidate_folio
Print the folio index instead of the pointer, since this is more useful. We also don't need to use page_file_mapping() as we do not invalidate swapcache pages. Since this is the only caller of nfs_wb_page_cancel(), convert it to nfs_wb_folio_cancel().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs Tested-by: David Howells <dhowells@redhat.com> # afs
show more ...
|
#
64158668 |
| 06-Mar-2022 |
NeilBrown <neilb@suse.de> |
NFS: swap IO handling is slightly different for O_DIRECT IO
1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding possible deadlocks with "fs_reclaim". These deadlocks could, I b
NFS: swap IO handling is slightly different for O_DIRECT IO
1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding possible deadlocks with "fs_reclaim". These deadlocks could, I believe, eventuate if a buffered read on the swapfile was attempted.
We don't need coherence with the page cache for a swap file, and buffered writes are forbidden anyway. There is no other need for i_rwsem during direct IO. So never take it for swap_rw()
2/ generic_write_checks() explicitly forbids writes to swap, and performs checks that are not needed for swap. So bypass it for swap_rw().
Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
show more ...
|
#
89c2be8a |
| 06-Mar-2022 |
NeilBrown <neilb@suse.de> |
NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
NFS_RPC_SWAPFLAGS is only used for READ requests. It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to requests. This is not
NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
NFS_RPC_SWAPFLAGS is only used for READ requests. It sets RPC_TASK_SWAPPER which gives some memory-allocation priority to requests. This is not needed for swap READ - though it is for writes where it is set via a different mechanism.
RPC_TASK_ROOTCREDS causes the 'machine' credential to be used. This is not needed as the root credential is saved when the swap file is opened, and this is used for all IO.
So NFS_RPC_SWAPFLAGS isn't needed, and as it is the only user of RPC_TASK_ROOTCREDS, that isn't needed either.
Remove both.
Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
show more ...
|
#
b0365ccb |
| 23-Feb-2022 |
Trond Myklebust <trond.myklebust@hammerspace.com> |
NFS: Fix up forced readdirplus
Avoid clearing the entire readdir page cache if we're just doing forced readdirplus for the 'ls -l' heuristic.
Signed-off-by: Trond Myklebust <trond.myklebust@hammers
NFS: Fix up forced readdirplus
Avoid clearing the entire readdir page cache if we're just doing forced readdirplus for the 'ls -l' heuristic.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
show more ...
|
#
f648022f |
| 23-Feb-2022 |
Trond Myklebust <trond.myklebust@hammerspace.com> |
NFS: Convert readdir page cache to use a cookie based index
Instead of using a linear index to address the pages, use the cookie of the first entry, since that is what we use to match the page anywa
NFS: Convert readdir page cache to use a cookie based index
Instead of using a linear index to address the pages, use the cookie of the first entry, since that is what we use to match the page anyway.
This allows us to avoid re-reading the entire cache on a seekdir() type of operation. The latter is very common when re-exporting NFS, and is a major performance drain.
The change does affect our duplicate cookie detection, since we can no longer rely on the page index as a linear offset for detecting whether we looped backwards. However since we no longer do a linear search through all the pages on each call to nfs_readdir(), this is less of a concern than it was previously. The other downside is that invalidate_mapping_pages() no longer can use the page index to avoid clearing pages that have been read. A subsequent patch will restore the functionality this provides to the 'ls -l' heuristic.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
show more ...
|
#
230bc98f |
| 17-Feb-2022 |
Trond Myklebust <trond.myklebust@hammerspace.com> |
NFS: Improve heuristic for readdirplus
The heuristic for readdirplus is designed to try to detect 'ls -l' and similar patterns. It does so by looking for cache hit/miss patterns in both the attribut
NFS: Improve heuristic for readdirplus
The heuristic for readdirplus is designed to try to detect 'ls -l' and similar patterns. It does so by looking for cache hit/miss patterns in both the attribute cache and in the dcache of the files in a given directory, and then sets a flag for the readdirplus code to interpret.
The problem with this approach is that a single attribute or dcache miss can cause the NFS code to force a refresh of the attributes for the entire set of files contained in the directory.
To be able to make a more nuanced decision, let's sample the number of hits and misses in the set of open directory descriptors. That allows us to set thresholds at which we start preferring READDIRPLUS over regular READDIR, or at which we start to force a re-read of the remaining readdir cache using READDIRPLUS.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
show more ...
|
Revision tags: v5.15.22 |
|
#
580f2367 |
| 07-Feb-2022 |
Trond Myklebust <trond.myklebust@hammerspace.com> |
NFS: Adjust the amount of readahead performed by NFS readdir
The current NFS readdir code will always try to maximise the amount of readahead it performs on the assumption that we can cache anything
NFS: Adjust the amount of readahead performed by NFS readdir
The current NFS readdir code will always try to maximise the amount of readahead it performs on the assumption that we can cache anything that isn't immediately read by the process. There are several cases where this assumption breaks down, including when the 'ls -l' heuristic kicks in to try to force use of readdirplus as a batch replacement for lookup/getattr.
This patch therefore tries to tone down the amount of readahead we perform, and adjust it to try to match the amount of data being requested by user space.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
show more ...
|