#
9888c340 |
| 03-Apr-2018 |
David Sterba <dsterba@suse.com> |
btrfs: replace GPL boilerplate by SPDX -- headers
Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX h
btrfs: replace GPL boilerplate by SPDX -- headers
Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header.
Unify the include protection macros to match the file names.
Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v4.16 |
|
#
051c98eb |
| 07-Mar-2018 |
David Sterba <dsterba@suse.com> |
btrfs: open code trivial helper btrfs_page_exists_in_range
The called function name is self explanatory.
Signed-off-by: David Sterba <dsterba@suse.com>
|
#
965aab1c |
| 06-Mar-2018 |
Matthew Wilcox <mawilcox@microsoft.com> |
btrfs: Use filemap_range_has_page()
The current implementation of btrfs_page_exists_in_range() gives the wrong answer if the workingset code has stored a shadow entry in the page cache. The filemap
btrfs: Use filemap_range_has_page()
The current implementation of btrfs_page_exists_in_range() gives the wrong answer if the workingset code has stored a shadow entry in the page cache. The filemap_range_has_page() function does not have this problem, and it's shared code, so use it instead.
eigned-off-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v4.15 |
|
#
c1c3fac2 |
| 16-Jan-2018 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove btrfs_inode::delayed_iput_count
delayed_iput_count wa supposed to be used to implement, well, delayed iput. The idea is that we keep accumulating the number of iputs we do until eventu
btrfs: Remove btrfs_inode::delayed_iput_count
delayed_iput_count wa supposed to be used to implement, well, delayed iput. The idea is that we keep accumulating the number of iputs we do until eventually the inode is deleted. Turns out we never really switched the delayed_iput_count from 0 to 1, hence all conditional code relying on the value of that member being different than 0 was never executed. This, as it turns out, didn't cause any problem due to the simple fact that the generic inode's i_count member was always used to count the number of iputs. So let's just remove the unused member and all unused code. This patch essentially provides no functional changes. While at it, also add proper documentation for btrfs_add_delayed_iput
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comment ] Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v4.13.16, v4.14 |
|
#
69fe2d75 |
| 19-Oct-2017 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: make the delalloc block rsv per inode
The way we handle delalloc metadata reservations has gotten progressively more complicated over the years. There is so much cruft and weirdness around k
btrfs: make the delalloc block rsv per inode
The way we handle delalloc metadata reservations has gotten progressively more complicated over the years. There is so much cruft and weirdness around keeping the reserved count and outstanding counters consistent and handling the error cases that it's impossible to understand.
Fix this by making the delalloc block rsv per-inode. This way we can calculate the actual size of the outstanding metadata reservations every time we make a change, and then reserve the delta based on that amount. This greatly simplifies the code everywhere, and makes the error handling in btrfs_delalloc_reserve_metadata far less terrifying.
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
dd48d407 |
| 19-Oct-2017 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add tracepoints for outstanding extents mods
This is handy for tracing problems with modifying the outstanding extents counters.
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Dav
btrfs: add tracepoints for outstanding extents mods
This is handy for tracing problems with modifying the outstanding extents counters.
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
8b62f87b |
| 19-Oct-2017 |
Josef Bacik <josef@toxicpanda.com> |
Btrfs: rework outstanding_extents
Right now we do a lot of weird hoops around outstanding_extents in order to keep the extent count consistent. This is because we logically transfer the outstanding
Btrfs: rework outstanding_extents
Right now we do a lot of weird hoops around outstanding_extents in order to keep the extent count consistent. This is because we logically transfer the outstanding_extent count from the initial reservation through the set_delalloc_bits. This makes it pretty difficult to get a handle on how and when we need to mess with outstanding_extents.
Fix this by revamping the rules of how we deal with outstanding_extents. Now instead everybody that is holding on to a delalloc extent is required to increase the outstanding extents count for itself. This means we'll have something like this
btrfs_delalloc_reserve_metadata - outstanding_extents = 1 btrfs_set_extent_delalloc - outstanding_extents = 2 btrfs_release_delalloc_extents - outstanding_extents = 1
for an initial file write. Now take the append write where we extend an existing delalloc range but still under the maximum extent size
btrfs_delalloc_reserve_metadata - outstanding_extents = 2 btrfs_set_extent_delalloc btrfs_set_bit_hook - outstanding_extents = 3 btrfs_merge_extent_hook - outstanding_extents = 2 btrfs_delalloc_release_extents - outstanding_extnets = 1
In order to make the ordered extent transition we of course must now make ordered extents carry their own outstanding_extent reservation, so for cow_file_range we end up with
btrfs_add_ordered_extent - outstanding_extents = 2 clear_extent_bit - outstanding_extents = 1 btrfs_remove_ordered_extent - outstanding_extents = 0
This makes all manipulations of outstanding_extents much more explicit. Every successful call to btrfs_delalloc_reserve_metadata _must_ now be combined with btrfs_release_delalloc_extents, even in the error case, as that is the only function that actually modifies the outstanding_extents counter.
The drawback to this is now we are much more likely to have transient cases where outstanding_extents is much larger than it actually should be. This could happen before as we manipulated the delalloc bits, but now it happens basically at every write. This may put more pressure on the ENOSPC flushing code, but I think making this code simpler is worth the cost. I have another change coming to mitigate this side-effect somewhat.
I also added trace points for the counter manipulation. These were used by a bpf script I wrote to help track down leak issues.
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v4.13.5, v4.13 |
|
#
eec63c65 |
| 17-Jul-2017 |
David Sterba <dsterba@suse.com> |
btrfs: separate defrag and property compression
Add new value for compression to distinguish between defrag and property. Previously, a single variable was used and this caused clashes when the per-
btrfs: separate defrag and property compression
Add new value for compression to distinguish between defrag and property. Previously, a single variable was used and this caused clashes when the per-file 'compression' was set and a defrag -c was called.
The property-compression is loaded when the file is open, defrag will overwrite the same variable and reset to 0 (ie. NONE) at when the file defragmentaion is finished. That's considered a usability bug.
Now we won't touch the property value, use the defrag-compression. The precedence of defrag is higher than for property (and whole-filesystem).
Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
b52aa8c9 |
| 17-Jul-2017 |
David Sterba <dsterba@suse.com> |
btrfs: rename variable holding per-inode compression type
This is preparatory for separating inode compression requested by defrag and set via properties. This will fix a usability bug when defrag w
btrfs: rename variable holding per-inode compression type
This is preparatory for separating inode compression requested by defrag and set via properties. This will fix a usability bug when defrag will reset compression type to NONE. If the file has compression set via property, it will not apply anymore (until next mount or reset through command line).
We're going to fix that by adding another variable just for the defrag call and won't touch the property. The defrag will have higher priority when deciding whether to compress the data.
Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v4.12 |
|
#
9a35b637 |
| 28-Jun-2017 |
Jeff Mahoney <jeffm@suse.com> |
btrfs: constify tracepoint arguments
Tracepoint arguments are all read-only. If we mark the arguments as const, we're able to keep or convert those arguments to const where appropriate.
Signed-off
btrfs: constify tracepoint arguments
Tracepoint arguments are all read-only. If we mark the arguments as const, we're able to keep or convert those arguments to const where appropriate.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
4e4cbee9 |
| 03-Jun-2017 |
Christoph Hellwig <hch@lst.de> |
block: switch bios to blk_status_t
Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep a
block: switch bios to blk_status_t
Replace bi_error with a new bi_status to allow for a clear conversion. Note that device mapper overloaded bi_error with a private value, which we'll have to keep arround at least for now and thus propagate to a proper blk_status_t value.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
show more ...
|
Revision tags: v4.10.17, v4.10.16, v4.10.15, v4.10.14, v4.10.13, v4.10.12, v4.10.11, v4.10.10, v4.10.9 |
|
#
a7e3b975 |
| 03-Apr-2017 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: fix reported number of inode blocks
Currently when there are buffered writes that were not yet flushed and they fall within allocated ranges of the file (that is, not in holes or beyond eof a
Btrfs: fix reported number of inode blocks
Currently when there are buffered writes that were not yet flushed and they fall within allocated ranges of the file (that is, not in holes or beyond eof assuming there are no prealloc extents beyond eof), btrfs simply reports an incorrect number of used blocks through the stat(2) system call (or any of its variants), regardless of mount options or inode flags (compress, compress-force, nodatacow). This is because the number of blocks used that is reported is based on the current number of bytes in the vfs inode plus the number of dealloc bytes in the btrfs inode. The later covers bytes that both fall within allocated regions of the file and holes.
Example scenarios where the number of reported blocks is wrong while the buffered writes are not flushed:
$ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc
$ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)
$ sync
$ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)
# The following should have reported 64K... $ du -h /mnt/sdc/foo1 128K /mnt/sdc/foo1
$ sync
# After flushing the buffered write, it now reports the correct value. $ du -h /mnt/sdc/foo1 64K /mnt/sdc/foo1
$ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2 wrote 65536/65536 bytes at offset 0 64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)
$ sync
$ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2 wrote 65536/65536 bytes at offset 65536 64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)
# The following should have reported 128K... $ du -h /mnt/sdc/foo2 192K /mnt/sdc/foo2
$ sync
# After flushing the buffered write, it now reports the correct value. $ du -h /mnt/sdc/foo2 128K /mnt/sdc/foo2
So the number of used file blocks is simply incorrect, unlike in other filesystems such as ext4 and xfs for example, but only while the buffered writes are not flushed.
Fix this by tracking the number of delalloc bytes that fall within holes and beyond eof of a file, and use instead this new counter when reporting the number of used blocks for an inode.
Another different problem that exists is that the delalloc bytes counter is reset when writeback starts (by clearing the EXTENT_DEALLOC flag from the respective range in the inode's iotree) and the vfs inode's bytes counter is only incremented when writeback finishes (through insert_reserved_file_extent()). Therefore while writeback is ongoing we simply report a wrong number of blocks used by an inode if the write operation covers a range previously unallocated. While this change does not fix this problem, it does minimizes it a lot by shortening that time window, as the new dealloc bytes counter (new_delalloc_bytes) is only decremented when writeback finishes right before updating the vfs inode's bytes counter. Fully fixing this second problem is not trivial and will be addressed later by a different patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
show more ...
|
Revision tags: v4.10.8, v4.10.7, v4.10.6, v4.10.5, v4.10.4, v4.10.3, v4.10.2, v4.10.1 |
|
#
0b581701 |
| 20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_inode_resume_unlocked_dio take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
abcefb1e |
| 20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_inode_block_unlocked_dio take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
0970a22e |
| 20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_print_data_csum_error take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
70ddc553 |
| 20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_is_free_space_inode take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
6ef06d27 |
| 20-Feb-2017 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Make btrfs_i_size_write take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
Revision tags: v4.10 |
|
#
6f6b643e |
| 08-Feb-2017 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Better csum error message for data csum mismatch
The original csum error message only outputs inode number, offset, check sum and expected check sum.
However no root objectid is outputted, w
btrfs: Better csum error message for data csum mismatch
The original csum error message only outputs inode number, offset, check sum and expected check sum.
However no root objectid is outputted, which sometimes makes debugging quite painful under multi-subvolume case (including relocation).
Also the checksum output is decimal, which seldom makes sense for users/developers and is hard to read in most time.
This patch will add root objectid, which will be %lld for rootid larger than LAST_FREE_OBJECTID, and hex csum output for better readability.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
0f8939b8 |
| 17-Jan-2017 |
Nikolay Borisov <n.borisov.lkml@gmail.com> |
btrfs: Make btrfs_inode_in_log take btrfs_inode
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
#
4a0cc7ca |
| 10-Jan-2017 |
Nikolay Borisov <n.borisov.lkml@gmail.com> |
btrfs: Make btrfs_ino take a struct btrfs_inode
Currently btrfs_ino takes a struct inode and this causes a lot of internal btrfs functions which consume this ino to take a VFS inode, rather than btr
btrfs: Make btrfs_ino take a struct btrfs_inode
Currently btrfs_ino takes a struct inode and this causes a lot of internal btrfs functions which consume this ino to take a VFS inode, rather than btrfs' own struct btrfs_inode. In order to fix this "leak" of VFS structs into the internals of btrfs first it's necessary to eliminate all uses of struct inode for the purpose of inode. This patch does that by using BTRFS_I to convert an inode to btrfs_inode. With this problem eliminated subsequent patches will start eliminating the passing of struct inode altogether, eventually resulting in a lot cleaner code.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> [ fix btrfs_get_extent tracepoint prototype ] Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v4.9, openbmc-4.4-20161121-1, v4.4.33, v4.4.32, v4.4.31, v4.4.30, v4.4.29, v4.4.28, v4.4.27, v4.7.10, openbmc-4.4-20161021-1, v4.7.9, v4.4.26, v4.7.8, v4.4.25, v4.4.24, v4.7.7, v4.8, v4.4.23, v4.7.6, v4.7.5, v4.4.22, v4.4.21, v4.7.4, v4.7.3, v4.4.20 |
|
#
afcdd129 |
| 02-Sep-2016 |
Josef Bacik <jbacik@fb.com> |
Btrfs: add a flags field to btrfs_fs_info
We have a lot of random ints in btrfs_fs_info that can be put into flags. This is mostly equivalent with the exception of how we deal with quota going on o
Btrfs: add a flags field to btrfs_fs_info
We have a lot of random ints in btrfs_fs_info that can be put into flags. This is mostly equivalent with the exception of how we deal with quota going on or off, now instead we set a flag when we are turning it on or off and deal with that appropriately, rather than just having a pending state that the current quota_enabled gets set to. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v4.7.2, v4.4.19, openbmc-4.4-20160819-1, v4.7.1, v4.4.18, v4.4.17, openbmc-4.4-20160804-1, v4.4.16, v4.7, openbmc-4.4-20160722-1, openbmc-20160722-1, openbmc-20160713-1, v4.4.15, v4.6.4, v4.6.3, v4.4.14, v4.6.2, v4.4.13, openbmc-20160606-1, v4.6.1, v4.4.12, openbmc-20160521-1 |
|
#
01327610 |
| 19-May-2016 |
Nicholas D Steeves <nsteeves@gmail.com> |
btrfs: fix string and comment grammatical issues and typos
Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
Revision tags: v4.4.11, openbmc-20160518-1, v4.6 |
|
#
5f9a8a51 |
| 12-May-2016 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: add semaphore to synchronize direct IO writes with fsync
Due to the optimization of lockless direct IO writes (the inode's i_mutex is not held) introduced in commit 38851cc19adb ("Btrfs: impl
Btrfs: add semaphore to synchronize direct IO writes with fsync
Due to the optimization of lockless direct IO writes (the inode's i_mutex is not held) introduced in commit 38851cc19adb ("Btrfs: implement unlocked dio write"), we started having races between such writes with concurrent fsync operations that use the fast fsync path. These races were addressed in the patches titled "Btrfs: fix race between fsync and lockless direct IO writes" and "Btrfs: fix race between fsync and direct IO writes for prealloc extents". The races happened because the direct IO path, like every other write path, does create extent maps followed by the corresponding ordered extents while the fast fsync path collected first ordered extents and then it collected extent maps. This made it possible to log file extent items (based on the collected extent maps) without waiting for the corresponding ordered extents to complete (get their IO done). The two fixes mentioned before added a solution that consists of making the direct IO path create first the ordered extents and then the extent maps, while the fsync path attempts to collect any new ordered extents once it collects the extent maps. This was simple and did not require adding any synchonization primitive to any data structure (struct btrfs_inode for example) but it makes things more fragile for future development endeavours and adds an exceptional approach compared to the other write paths.
This change adds a read-write semaphore to the btrfs inode structure and makes the direct IO path create the extent maps and the ordered extents while holding read access on that semaphore, while the fast fsync path collects extent maps and ordered extents while holding write access on that semaphore. The logic for direct IO write path is encapsulated in a new helper function that is used both for cow and nocow direct IO writes.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <jbacik@fb.com>
show more ...
|
Revision tags: v4.4.10, openbmc-20160511-1, openbmc-20160505-1, v4.4.9, v4.4.8, v4.4.7, openbmc-20160329-2, openbmc-20160329-1, openbmc-20160321-1, v4.4.6, v4.5, v4.4.5, v4.4.4, v4.4.3, openbmc-20160222-1, v4.4.2, openbmc-20160212-1, openbmc-20160210-1, openbmc-20160202-2, openbmc-20160202-1, v4.4.1, openbmc-20160127-1, openbmc-20160120-1, v4.4, openbmc-20151217-1, openbmc-20151210-1, openbmc-20151202-1, openbmc-20151123-1 |
|
#
8089fe62 |
| 19-Nov-2015 |
David Sterba <dsterba@suse.cz> |
btrfs: put delayed item hook into inode
Inodes for delayed iput allocate a trivial helper structure, let's place the list hook directly into the inode and save a kmalloc (killing a __GFP_NOFAIL as a
btrfs: put delayed item hook into inode
Inodes for delayed iput allocate a trivial helper structure, let's place the list hook directly into the inode and save a kmalloc (killing a __GFP_NOFAIL as a bonus) at the cost of increasing size of btrfs_inode.
The inode can be put into the delayed_iputs list more than once and we have to keep the count. This means we can't use the list_splice to process a bunch of inodes because we'd lost track of the count if the inode is put into the delayed iputs again while it's processed.
Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: openbmc-20151118-1, openbmc-20151104-1, v4.3, openbmc-20151102-1, openbmc-20151028-1, v4.3-rc1, v4.2 |
|
#
50745b0a |
| 28-Aug-2015 |
chandan <chandan@linux.vnet.ibm.com> |
Btrfs: Direct I/O: Fix space accounting
The following call trace is seen when generic/095 test is executed,
WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_d
Btrfs: Direct I/O: Fix space accounting
The following call trace is seen when generic/095 test is executed,
WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0() Modules linked in: CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014 ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0 0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38 ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0 Call Trace: [<ffffffff81984058>] dump_stack+0x45/0x57 [<ffffffff81050385>] warn_slowpath_common+0x85/0xc0 [<ffffffff81050465>] warn_slowpath_null+0x15/0x20 [<ffffffff81340294>] btrfs_destroy_inode+0x284/0x2a0 [<ffffffff8117ce07>] destroy_inode+0x37/0x60 [<ffffffff8117cf39>] evict+0x109/0x170 [<ffffffff8117cfd5>] dispose_list+0x35/0x50 [<ffffffff8117dd3a>] evict_inodes+0xaa/0x100 [<ffffffff81165667>] generic_shutdown_super+0x47/0xf0 [<ffffffff81165951>] kill_anon_super+0x11/0x20 [<ffffffff81302093>] btrfs_kill_super+0x13/0x110 [<ffffffff81165c99>] deactivate_locked_super+0x39/0x70 [<ffffffff811660cf>] deactivate_super+0x5f/0x70 [<ffffffff81180e1e>] cleanup_mnt+0x3e/0x90 [<ffffffff81180ebd>] __cleanup_mnt+0xd/0x10 [<ffffffff81069c06>] task_work_run+0x96/0xb0 [<ffffffff81003a3d>] do_notify_resume+0x3d/0x50 [<ffffffff8198cbc2>] int_signal+0x12/0x17
This means that the inode had non-zero "outstanding extents" during eviction. This occurs because, during direct I/O a task which successfully used up its reserved data space would set BTRFS_INODE_DIO_READY bit and does not clear the bit after finishing the DIO write. A future DIO write could actually fail and the unused reserve space won't be freed because of the previously set BTRFS_INODE_DIO_READY bit.
Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the following issue, |-----------------------------------+-------------------------------------| | Task A | Task B | |-----------------------------------+-------------------------------------| | Start direct i/o write on inode X.| | | reserve space | | | Allocate ordered extent | | | release reserved space | | | Set BTRFS_INODE_DIO_READY bit. | | | | splice() | | | Transfer data from pipe buffer to | | | destination file. | | | - kmap(pipe buffer page) | | | - Start direct i/o write on | | | inode X. | | | - reserve space | | | - dio_refill_pages() | | | - sdio->blocks_available == 0 | | | - Since a kernel address is | | | being passed instead of a | | | user space address, | | | iov_iter_get_pages() returns | | | -EFAULT. | | | - Since BTRFS_INODE_DIO_READY is | | | set, we don't release reserved | | | space. | | | - Clear BTRFS_INODE_DIO_READY bit.| | -EIOCBQUEUED is returned. | | |-----------------------------------+-------------------------------------|
Hence this commit introduces "struct btrfs_dio_data" to track the usage of reserved data space. The remaining unused "reserve space" can now be freed reliably.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
show more ...
|