Revision tags: v6.6.35, v6.6.34, v6.6.33, v6.6.32

# 9ee7a77c | 21-May-2024 | Xu Yang <xu.yang_2@nxp.com>
iomap: fault in smaller chunks for non-large folio mappings
commit 4e527d5841e24623181edc7fd6f6598ffa810e10 upstream.
Since commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace"), iomap will try to copy in chunks larger than PAGE_SIZE. However, if the mapping doesn't support large folios, only one page of at most 4KB is created and only 4KB of data is written to the page cache each time; the next 4KB is then handled in the next iteration. This causes a potential write performance problem.
If the chunk is 2MB, a total of 512 pages has to be handled in the end. During this period, fault_in_iov_iter_readable() is called to check that the iov_iter is readable. Since only 4KB is handled each time, the address ranges below are checked over and over again:
    start           end
    buf,            buf+2MB
    buf+4KB,        buf+2MB
    buf+8KB,        buf+2MB
    ...
    buf+2044KB,     buf+2MB
The checked size is obviously wrong, since only 4KB is handled in each iteration. Pick a correct chunk size so that iomap also works well in the non-large-folio case.
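A minimal sketch of the idea, assuming a helper along the lines of mapping_max_folio_size() (added in the same series); the helper shown here is illustrative rather than a quote of the upstream diff:

    #include <linux/pagemap.h>

    /*
     * Cap the per-iteration copy / fault-in window at what the page cache
     * can actually create for this mapping: a single page when large folios
     * are not supported, the maximum page cache folio size otherwise.
     */
    static size_t write_chunk_size(struct address_space *mapping)
    {
            if (!mapping_large_folio_support(mapping))
                    return PAGE_SIZE;
            return PAGE_SIZE << MAX_PAGECACHE_ORDER;
    }

iomap_write_iter() can then derive both the copy size and the fault_in_iov_iter_readable() window from this chunk instead of always assuming the maximum folio size.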
With this change, the write speed is stable. Tested on an ARM64 device.
Before:
- dd if=/dev/zero of=/dev/sda bs=400K  count=10485 (334 MB/s)
- dd if=/dev/zero of=/dev/sda bs=800K  count=5242  (278 MB/s)
- dd if=/dev/zero of=/dev/sda bs=1600K count=2621  (204 MB/s)
- dd if=/dev/zero of=/dev/sda bs=2200K count=1906  (170 MB/s)
- dd if=/dev/zero of=/dev/sda bs=3000K count=1398  (150 MB/s)
- dd if=/dev/zero of=/dev/sda bs=4500K count=932   (139 MB/s)
After:
- dd if=/dev/zero of=/dev/sda bs=400K  count=10485 (339 MB/s)
- dd if=/dev/zero of=/dev/sda bs=800K  count=5242  (330 MB/s)
- dd if=/dev/zero of=/dev/sda bs=1600K count=2621  (332 MB/s)
- dd if=/dev/zero of=/dev/sda bs=2200K count=1906  (333 MB/s)
- dd if=/dev/zero of=/dev/sda bs=3000K count=1398  (333 MB/s)
- dd if=/dev/zero of=/dev/sda bs=4500K count=932   (333 MB/s)
Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
Cc: stable@vger.kernel.org
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Xu Yang <xu.yang_2@nxp.com>
Link: https://lore.kernel.org/r/20240521114939.2541461-2-xu.yang_2@nxp.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Revision tags: v6.6.31, v6.6.30, v6.6.29, v6.6.28, v6.6.27, v6.6.26, v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5

# 0ab2a85c | 07-Dec-2023 | Christoph Hellwig <hch@lst.de>
iomap: clear the per-folio dirty bits on all writeback failures
[ Upstream commit 7ea1d9b4a840c2dd01d1234663d4a8ef256cfe39 ]
write_cache_pages always clears the page dirty bit before calling into the file systems, and leaves folios with a writeback failure without the dirty bit after return. We also clear the per-block writeback bits for writeback failures unless no I/O has been submitted, which leaves the folio in an inconsistent state where the folio dirty bit is clear but one or more per-block dirty bits are still set. This seems to be due to the place where the iomap_clear_range_dirty call was inserted into the existing, not very clearly structured code when per-block dirty bit support was added, and is not actually intentional. Switch to always clearing the per-block dirty bits on writeback failure.
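A rough sketch of the intended ordering in the writeback error path; the names (wpc, discard_folio, iomap_clear_range_dirty) follow the surrounding iomap code, but this illustrates the intent rather than the literal diff:

    if (unlikely(error) && wpc->ops->discard_folio)
            wpc->ops->discard_folio(folio, pos);

    /*
     * Clear the per-block dirty bits unconditionally: write_cache_pages has
     * already cleared the folio dirty flag, so leaving per-block dirty bits
     * behind on a failed writeback would be inconsistent.
     */
    iomap_clear_range_dirty(folio, 0, folio_size(folio));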
Fixes: 4ce02c679722 ("iomap: Add per-block dirty state tracking to improve performance")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20231207072710.176093-2-hch@lst.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>

Revision tags: v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8

# 3ac97479 | 19-Oct-2023 | Jan Stancek <jstancek@redhat.com>
iomap: fix short copy in iomap_write_iter()
Starting with commit 5d8edfb900d5 ("iomap: Copy larger chunks from userspace"), iomap_write_iter() can get into an endless loop. This can be reproduced with LTP writev07, which uses partially valid iovecs:

    struct iovec wr_iovec[] = {
            { buffer, 64 },
            { bad_addr, 64 },
            { buffer + 64, 64 },
            { buffer + 64 * 2, 64 },
    };
Commit bc1bb416bbb9 ("generic_perform_write()/iomap_write_actor(): saner logics for short copy") previously introduced the logic which made a short copy retry in the next iteration with the amount of "bytes" it managed to copy:
    if (unlikely(status == 0)) {
            /*
             * A short copy made iomap_write_end() reject the
             * thing entirely.  Might be memory poisoning
             * halfway through, might be a race with munmap,
             * might be severe memory pressure.
             */
            if (copied)
                    bytes = copied;
However, since 5d8edfb900d5, "bytes" is no longer carried into the next iteration, because it is now always initialized at the beginning of the loop. And for iov_iter_count < PAGE_SIZE, "bytes" ends up with the same value as in the previous iteration, making the loop retry the same copy over and over, which leads to the writev07 test case hanging.
Make the next iteration retry with the amount of bytes we managed to copy.
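The shape of the fix in iomap_write_iter(), sketched with the variable names used there (bytes, chunk, copied, status); treat this as an illustration rather than the exact upstream hunk:

    bytes = iov_iter_count(i);
    retry:
    offset = pos & (chunk - 1);
    bytes = min(chunk - offset, bytes);
    ...
    if (unlikely(status == 0)) {
            /*
             * A short copy made iomap_write_end() reject the thing
             * entirely.  Shrink the chunk and, if anything was copied,
             * retry with exactly that amount.
             */
            if (chunk > PAGE_SIZE)
                    chunk /= 2;
            if (copied) {
                    bytes = copied;
                    goto retry;
            }
    }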
Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace")
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

Revision tags: v6.5.7, v6.5.6

# 684f7e6d | 28-Sep-2023 | Geert Uytterhoeven <geert+renesas@glider.be>
iomap: Spelling s/preceeding/preceding/g
Fix a misspelling of "preceding".
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Revision tags: v6.5.5, v6.5.4

# a5f31a50 | 18-Sep-2023 | Darrick J. Wong <djwong@kernel.org>
iomap: convert iomap_unshare_iter to use large folios
Convert iomap_unshare_iter to create large folios if possible, since the write and zeroing paths already do that. I think this got missed in the conversion of the write paths that landed in 6.6-rc1.
Cc: ritesh.list@gmail.com, willy@infradead.org
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

# 35d30c9c | 18-Sep-2023 | Darrick J. Wong <djwong@kernel.org>
iomap: don't skip reading in !uptodate folios when unsharing a range
Prior to commit a01b8f225248e, we would always read in the contents of a !uptodate folio prior to writing userspace data into the folio, allocate a folio state object, etc. Ritesh introduced an optimization that skips all of that if the write would cover the entire folio.
Unfortunately, the optimization misses the unshare case, where we always have to read in the folio contents since there isn't a data buffer supplied by userspace. This can result in stale kernel memory exposure if userspace issues a FALLOC_FL_UNSHARE_RANGE call on part of a shared file that isn't already cached.
This was caught by observing fstests regressions in the "unshare around" mechanism that is used for unaligned writes to a reflinked realtime volume when the realtime extent size is larger than 1FSB, though I think it applies to any shared file.
Cc: ritesh.list@gmail.com, willy@infradead.org
Fixes: a01b8f225248e ("iomap: Allocate ifs in ->write_begin() early")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

Revision tags: v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43

# 2ba39cc4 | 01-Aug-2023 | Christoph Hellwig <hch@lst.de>
fs: rename and move block_page_mkwrite_return
block_page_mkwrite_return is neither block nor mkwrite specific, and should not be under CONFIG_BLOCK. Move it to mm.h and rename it to vmf_fs_error.
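For reference, the resulting helper in mm.h looks roughly like this (a sketch of the intended mapping from filesystem errors to vm_fault_t; the exact body should be checked against the tree):

    static inline vm_fault_t vmf_fs_error(int err)
    {
            if (err == 0)
                    return VM_FAULT_LOCKED;
            if (err == -EFAULT || err == -EAGAIN)
                    return VM_FAULT_NOPAGE;
            if (err == -ENOMEM)
                    return VM_FAULT_OOM;
            return VM_FAULT_SIGBUS;
    }

Callers such as iomap_page_mkwrite() can then return vmf_fs_error(ret) where they previously returned block_page_mkwrite_return(ret).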
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20230801172201.1923299-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Revision tags: v6.1.42, v6.1.41, v6.1.40, v6.1.39

# 4ce02c67 | 10-Jul-2023 | Ritesh Harjani (IBM) <ritesh.list@gmail.com>
iomap: Add per-block dirty state tracking to improve performance
When the filesystem block size is less than the folio size (either with mapping_large_folio_support() or with blocksize < pagesize) and the folio is uptodate in the page cache, then even a single-byte write can cause an entire folio to be written to disk during writeback. This happens because we currently don't have a mechanism to track per-block dirty state within struct iomap_folio_state; we only track uptodate state.
This patch implements support for tracking per-block dirty state in iomap_folio_state->state bitmap. This should help improve the filesystem write performance and help reduce write amplification.
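A sketch of the split-bitmap layout this describes, assuming ifs->state holds two bits per block (the first half uptodate, the second half dirty); the helper name is illustrative:

    /* "block" is a block index within the folio */
    static inline bool ifs_block_is_dirty(struct folio *folio,
                    struct iomap_folio_state *ifs, int block)
    {
            struct inode *inode = folio->mapping->host;
            unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);

            /* dirty bits live in the second half of the state bitmap */
            return test_bit(block + blks_per_folio, ifs->state);
    }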
Performance testing of the fio workload below reveals a ~16x performance improvement using NVMe with XFS (4k block size) on Power (64K page size). FIO-reported write bandwidth scores improved from around ~28 MBps to ~452 MBps.
1. <test_randwrite.fio>

    [global]
    ioengine=psync
    rw=randwrite
    overwrite=1
    pre_read=1
    direct=0
    bs=4k
    size=1G
    dir=./
    numjobs=8
    fdatasync=1
    runtime=60
    iodepth=64
    group_reporting=1

    [fio-run]
2. Also, our internal performance team reported that this patch improves their database workload performance by around ~83% (with XFS on Power).
Reported-by: Aravinda Herle <araherle@in.ibm.com>
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

# a01b8f22 | 10-Jul-2023 | Ritesh Harjani (IBM) <ritesh.list@gmail.com>
iomap: Allocate ifs in ->write_begin() early
We don't need to allocate an ifs in ->write_begin() for writes where the position and length completely overlap with the given folio. Therefore, such cases are skipped.
Currently, when the folio is uptodate, we only allocate the ifs at writeback time (in iomap_writepage_map()). This is ok until now, but when we are going to add support for a per-block dirty state bitmap in the ifs, this could cause some performance degradation. The reason is that if we don't allocate the ifs during ->write_begin(), then we will never mark the necessary dirty bits in the ->write_end() call. And we will have to mark all the bits as dirty at writeback time, which could cause the same write amplification and performance problems as we have now.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

# 7f79d85b | 10-Jul-2023 | Ritesh Harjani (IBM) <ritesh.list@gmail.com>
iomap: Refactor iomap_write_delalloc_punch() function out
This patch factors the iomap_write_delalloc_punch() function out. This function is responsible for the actual punch-out operation. The reason for doing this is to avoid deep indentation when, in a later patch (which adds per-block dirty status handling to iomap), we bring in punch-out of individual non-dirty blocks within a dirty folio to avoid a delalloc block leak.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

# 0af2b37d | 10-Jul-2023 | Ritesh Harjani (IBM) <ritesh.list@gmail.com>
iomap: Use iomap_punch_t typedef
It makes things much easier if we have an iomap_punch_t typedef for the "punch" function pointer in all delalloc-related punch, scan and release functions. It will be useful in later patches when we factor out the iomap_write_delalloc_punch() function.
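The assumed shape of the typedef (a sketch; the exact prototype should be checked against the tree):

    /* callback used to punch out a delalloc range backing part of a folio */
    typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset,
                    loff_t length);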
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

# eee2d2e6 | 10-Jul-2023 | Ritesh Harjani (IBM) <ritesh.list@gmail.com>
iomap: Fix possible overflow condition in iomap_write_delalloc_scan
folio_next_index() returns an unsigned long value which, left-shifted by PAGE_SHIFT, could possibly overflow on a 32-bit system. Instead use folio_pos(folio) + folio_size(folio), which does this correctly.
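Sketched side by side (the variable names here are illustrative):

    /* pgoff_t is unsigned long, so on 32-bit the shift below can wrap */
    loff_t end_bad  = folio_next_index(folio) << PAGE_SHIFT;

    /* folio_pos() returns loff_t, so this sum is done in 64-bit math */
    loff_t end_good = folio_pos(folio) + folio_size(folio);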
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

# cc86181a | 10-Jul-2023 | Ritesh Harjani (IBM) <ritesh.list@gmail.com>
iomap: Add some uptodate state handling helpers for ifs state bitmap
This patch adds two helper routines, ifs_is_fully_uptodate() and ifs_block_is_uptodate(), for managing the uptodate state of the "ifs" state bitmap.
In later patches the ifs state bitmap array will also handle the dirty state of all blocks of a folio. Hence this patch adds some helper routines for handling the uptodate state of the ifs state bitmap.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

# 3ea5c76c | 10-Jul-2023 | Ritesh Harjani (IBM) <ritesh.list@gmail.com>
iomap: Drop ifs argument from iomap_set_range_uptodate()
iomap_folio_state (ifs) can be derived directly from the folio, making it unnecessary to pass "ifs" as an argument to iomap_set_range_uptodate(). This patch eliminates the "ifs" argument from the iomap_set_range_uptodate() function.
Also, the definitions of the iomap_set_range_uptodate() and ifs_set_range_uptodate() functions are moved above ifs_alloc(). In upcoming patches, we plan to introduce additional helper routines for handling dirty state, with the intention of consolidating all of the "ifs" state handling routines in one place.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

# 04f52c4e | 10-Jul-2023 | Ritesh Harjani (IBM) <ritesh.list@gmail.com>
iomap: Rename iomap_page to iomap_folio_state and others
struct iomap_page actually tracks the per-block state of a folio. Hence it makes sense to rename some of these function names and data structures, for example:
1. struct iomap_page (iop) -> struct iomap_folio_state (ifs)
2. iomap_page_create() -> ifs_alloc()
3. iomap_page_release() -> ifs_free()
4. iomap_iop_set_range_uptodate() -> ifs_set_range_uptodate()
5. to_iomap_page() -> folio->private
Since in later patches we are also going to add per-block dirty state tracking to iomap_folio_state, this patch also renames the "uptodate" & "uptodate_lock" members of iomap_folio_state to "state" and "state_lock".
We don't really need the to_iomap_page() function; instead, directly open-code it as folio->private.
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Revision tags: v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30

# 5d8edfb9 | 20-May-2023 | Matthew Wilcox (Oracle) <willy@infradead.org>
iomap: Copy larger chunks from userspace
If we have a large folio, we can copy in larger chunks than PAGE_SIZE. Start at the maximum page cache size and shrink by half every time we hit the "we are short on memory" problem.
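Roughly, the copy loop in iomap_write_iter() then takes this shape (a sketch of the structure, not the exact code):

    size_t chunk = PAGE_SIZE << MAX_PAGECACHE_ORDER;    /* start large */

    do {
            size_t offset = pos & (chunk - 1);
            size_t bytes = min(chunk - offset, iov_iter_count(i));
            ...
            copied = copy_folio_from_iter_atomic(folio,
                            offset_in_folio(folio, pos), bytes, i);
            status = iomap_write_end(iter, pos, bytes, copied, folio);

            if (unlikely(status == 0)) {
                    /* short on memory: back off to a smaller chunk */
                    if (chunk > PAGE_SIZE)
                            chunk /= 2;
            }
            ...
    } while (iov_iter_count(i));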
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

# d6bb59a9 | 19-May-2023 | Matthew Wilcox (Oracle) <willy@infradead.org>
iomap: Create large folios in the buffered write path
Use the size of the write as a hint for the size of the folio to create.
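In practice the hint goes through the FGP flags when the folio is grabbed; a sketch of the call, where fgf_set_order() encodes a preferred folio order from the byte count and FGP_WRITEBEGIN is the usual write-path flag set:

    fgf_t fgp = FGP_WRITEBEGIN | fgf_set_order(len);

    folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
                    fgp, mapping_gfp_mask(iter->inode->i_mapping));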
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

# ffc143db | 26-May-2023 | Matthew Wilcox (Oracle) <willy@infradead.org>
filemap: Add fgf_t typedef
Similarly to gfp_t, define fgf_t as its own type to prevent various misuses and confusion. Leave the flags as FGP_* for now to reduce the size of this patch; they will be converted to FGF_* later. Move the documentation to the definition of the type instead of burying it in the __filemap_get_folio() documentation.
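The pattern borrowed from gfp_t, sketched (the flag values shown only illustrate the __bitwise annotation, not the exact constants):

    /* sparse-checked flag type for __filemap_get_folio() and friends */
    typedef unsigned int __bitwise fgf_t;

    #define FGP_ACCESSED    ((__force fgf_t)0x00000001)
    #define FGP_LOCK        ((__force fgf_t)0x00000002)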
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev>

# 7a8eb01b | 02-Jun-2023 | Matthew Wilcox (Oracle) <willy@infradead.org>
iomap: Remove unnecessary test from iomap_release_folio()
The check for the folio being under writeback is unnecessary; the caller has checked this and the folio is locked, so the folio cannot be under writeback at this point.
The comment is somewhat misleading in that it talks about one specific situation in which we can see a dirty folio. There are others, so change the comment to explain why we can't release the iomap_page.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

# a221ab71 | 02-Jun-2023 | Matthew Wilcox (Oracle) <willy@infradead.org>
iomap: Remove large folio handling in iomap_invalidate_folio()
We do not need to release the iomap_page in iomap_invalidate_folio() to allow the folio to be split. The splitting code will call ->release_folio() if there is still per-fs private data attached to the folio. At that point, we will check if the folio is still dirty and decline to release the iomap_page. It is possible to trigger the warning in perfectly legitimate circumstances (eg if a disk read fails, we do a partial write to the folio, then we truncate the folio), which will cause those writes to be lost.
Fixes: 60d8231089f0 ("iomap: Support large folios in invalidatepage")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

# efa96cc9 | 17-Jul-2023 | Christoph Hellwig <hch@lst.de>
iomap: micro optimize the ki_pos assignment in iomap_file_buffered_write
We have the new value for ki_pos right at hand in iter.pos, so assign that instead of recalculating it from ret.
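The change amounts to a single assignment at the end of iomap_file_buffered_write(), sketched under the assumption that the iterator's final position is left in iter.pos:

    if (ret > 0)
            iocb->ki_pos = iter.pos;    /* was: iocb->ki_pos += ret; */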
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.harjani@gmail.com>