#
8d875f95 |
| 12-Aug-2014 |
Chris Mason <clm@fb.com> |
btrfs: disable strict file flushes for renames and truncates
Truncates and renames are often used to replace old versions of a file with new versions. Applications often expect this to be an atomic
btrfs: disable strict file flushes for renames and truncates
Truncates and renames are often used to replace old versions of a file with new versions. Applications often expect this to be an atomic replacement, even if they haven't done anything to make sure the new version is fully on disk.
Btrfs has strict flushing in place to make sure that renaming over an old file with a new file will fully flush out the new file before allowing the transaction commit with the rename to complete.
This ordering means the commit code needs to be able to lock file pages, and there are a few paths in the filesystem where we will try to end a transaction with the page lock held. It's rare, but these things can deadlock.
This patch removes the ordered flushes and switches to a best effort filemap_flush like ext4 uses. It's not perfect, but it should fix the deadlocks.
Signed-off-by: Chris Mason <clm@fb.com>
show more ...
|
Revision tags: v3.16, v3.16-rc7, v3.16-rc6 |
|
#
98ce2ded |
| 17-Jul-2014 |
Liu Bo <bo.li.liu@oracle.com> |
Btrfs: fix abnormal long waiting in fsync
xfstests generic/127 detected this problem.
With commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8, now fsync will only flush data within the passed range.
Btrfs: fix abnormal long waiting in fsync
xfstests generic/127 detected this problem.
With commit 7fc34a62ca4434a79c68e23e70ed26111b7a4cf8, now fsync will only flush data within the passed range. This is the cause of the above problem, -- btrfs's fsync has a stage called 'sync log' which will wait for all the ordered extents it've recorded to finish.
In xfstests/generic/127, with mixed operations such as truncate, fallocate, punch hole, and mapwrite, we get some pre-allocated extents, and mapwrite will mmap, and then msync. And I find that msync will wait for quite a long time (about 20s in my case), thanks to ftrace, it turns out that the previous fallocate calls 'btrfs_wait_ordered_range()' to flush dirty pages, but as the range of dirty pages may be larger than 'btrfs_wait_ordered_range()' wants, there can be some ordered extents created but not getting corresponding pages flushed, then they're left in memory until we fsync which runs into the stage 'sync log', and fsync will just wait for the system writeback thread to flush those pages and get ordered extents finished, so the latency is inevitable.
This adds a flush similar to btrfs_start_ordered_extent() in btrfs_wait_logged_extents() to fix that.
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
show more ...
|
Revision tags: v3.16-rc5, v3.16-rc4, v3.16-rc3, v3.16-rc2, v3.16-rc1, v3.15, v3.15-rc8, v3.15-rc7, v3.15-rc6 |
|
#
351fd353 |
| 15-May-2014 |
David Sterba <dsterba@suse.cz> |
btrfs: remove stale newlines from log messages
I've noticed an extra line after "use no compression", but search revealed much more in messages of more critical levels and rare errors.
Signed-off-b
btrfs: remove stale newlines from log messages
I've noticed an extra line after "use no compression", but search revealed much more in messages of more critical levels and rare errors.
Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
show more ...
|
Revision tags: v3.15-rc5, v3.15-rc4, v3.15-rc3, v3.15-rc2, v3.15-rc1, v3.14, v3.14-rc8, v3.14-rc7, v3.14-rc6 |
|
#
31f3d255 |
| 05-Mar-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: split the global ordered extents mutex
When we create a snapshot, we just need wait the ordered extents in the source fs/file root, but because we use the global mutex to protect this ordered
Btrfs: split the global ordered extents mutex
When we create a snapshot, we just need wait the ordered extents in the source fs/file root, but because we use the global mutex to protect this ordered extents list of the source fs/file root to avoid accessing a empty list, if someone got the mutex to access the ordered extents list of the other fs/file root, we had to wait.
This patch splits the above global mutex, now every fs/file root has its own mutex to protect its own list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
show more ...
|
#
af7a6509 |
| 05-Mar-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: wake up the tasks that wait for the io earlier
The tasks that wait for the IO_DONE flag just care about the io of the dirty pages, so it is better to wake up them immediately after all the pa
Btrfs: wake up the tasks that wait for the io earlier
The tasks that wait for the IO_DONE flag just care about the io of the dirty pages, so it is better to wake up them immediately after all the pages are written, not the whole process of the io completes.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
show more ...
|
#
8b9d83cd |
| 05-Mar-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix early enospc due to the race of the two ordered extent wait
btrfs_wait_ordered_roots() moves all the list entries to a new list, and then deals with them one by one. But if the other task
Btrfs: fix early enospc due to the race of the two ordered extent wait
btrfs_wait_ordered_roots() moves all the list entries to a new list, and then deals with them one by one. But if the other task invokes this function at that time, it would get a empty list. It makes the enospc error happens more early. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
show more ...
|
Revision tags: v3.14-rc5 |
|
#
d458b054 |
| 27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Cleanup the "_struct" suffix in btrfs_workequeue
Since the "_struct" suffix is mainly used for distinguish the differnt btrfs_work between the original and the newly created one, there is no
btrfs: Cleanup the "_struct" suffix in btrfs_workequeue
Since the "_struct" suffix is mainly used for distinguish the differnt btrfs_work between the original and the newly created one, there is no need using the suffix since all btrfs_workers are changed into btrfs_workqueue.
Also this patch fixed some codes whose code style is changed due to the too long "_struct" suffix.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
show more ...
|
#
a44903ab |
| 27-Feb-2014 |
Qu Wenruo <quwenruo@cn.fujitsu.com> |
btrfs: Replace fs_info->flush_workers with btrfs_workqueue.
Replace the fs_info->submit_workers with the newly created btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by:
btrfs: Replace fs_info->flush_workers with btrfs_workqueue.
Replace the fs_info->submit_workers with the newly created btrfs_workqueue.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
show more ...
|
Revision tags: v3.14-rc4, v3.14-rc3, v3.14-rc2, v3.14-rc1, v3.13 |
|
#
827463c4 |
| 14-Jan-2014 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: don't mix the ordered extents of all files together during logging the inodes
There was a problem in the old code: If we failed to log the csum, we would free all the ordered extents in the l
Btrfs: don't mix the ordered extents of all files together during logging the inodes
There was a problem in the old code: If we failed to log the csum, we would free all the ordered extents in the log list including those ordered extents that were logged successfully, it would make the log committer not to wait for the completion of the ordered extents.
This patch doesn't insert the ordered extents that is about to be logged into a global list, instead, we insert them into a local list. If we log the ordered extents successfully, we splice them with the global list, or we will throw them away, then do full sync. It can also reduce the lock contention and the traverse time of list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
show more ...
|
Revision tags: v3.13-rc8, v3.13-rc7, v3.13-rc6, v3.13-rc5 |
|
#
efe120a0 |
| 20-Dec-2013 |
Frank Holton <fholton@gmail.com> |
Btrfs: convert printk to btrfs_ and fix BTRFS prefix
Convert all applicable cases of printk and pr_* to the btrfs_* macros.
Fix all uses of the BTRFS prefix.
Signed-off-by: Frank Holton <fholton@g
Btrfs: convert printk to btrfs_ and fix BTRFS prefix
Convert all applicable cases of printk and pr_* to the btrfs_* macros.
Fix all uses of the BTRFS prefix.
Signed-off-by: Frank Holton <fholton@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
show more ...
|
Revision tags: v3.13-rc4, v3.13-rc3, v3.13-rc2, v3.13-rc1 |
|
#
1b8e7e45 |
| 22-Nov-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: avoid unnecessary ordered extent cache resets
After an ordered extent completes, don't blindly reset the inode's ordered tree last accessed ordered extent pointer.
While running the xfstests
Btrfs: avoid unnecessary ordered extent cache resets
After an ordered extent completes, don't blindly reset the inode's ordered tree last accessed ordered extent pointer.
While running the xfstests I noticed that about 29% of the time the ordered extent to which tree->last pointed was not the same as our just completed ordered extent. After that I ran the following sysbench test (after a prepare phase) and noticed that about 68% of the time tree->last pointed to a different ordered extent too.
sysbench --test=fileio --file-num=32 --file-total-size=4G \ --file-test-mode=rndwr --num-threads=512 \ --file-block-size=32768 --max-time=60 --max-requests=0 run
Therefore reset tree->last on ordered extent removal only if it pointed to the ordered extent we're removing from the tree.
Results from 4 runs of the following test before and after applying this patch:
$ sysbench --test=fileio --file-num=32 --file-total-size=4G \ --file-test-mode=seqwr --num-threads=512 \ --file-block-size=32768 --max-time=60 --file-io-mode=sync prepare $ sysbench --test=fileio --file-num=32 --file-total-size=4G \ --file-test-mode=seqwr --num-threads=512 \ --file-block-size=32768 --max-time=60 --file-io-mode=sync run
Before this path:
run 1 - 64.049Mb/sec run 2 - 63.455Mb/sec run 3 - 64.656Mb/sec run 4 - 63.833Mb/sec
After this patch:
run 1 - 66.149Mb/sec run 2 - 68.459Mb/sec run 3 - 66.338Mb/sec run 4 - 66.176Mb/sec
With random writes (--file-test-mode=rndwr) I had huge fluctuations on the results (+- 35% easily).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
show more ...
|
#
931aa877 |
| 14-Nov-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: fix list delete warning when removing ordered root from the list
Commit b02441999efcc6152b87cd58e7970bb7843f76cf "Btrfs: don't wait for the completion of all the ordered extents" introduced a
Btrfs: fix list delete warning when removing ordered root from the list
Commit b02441999efcc6152b87cd58e7970bb7843f76cf "Btrfs: don't wait for the completion of all the ordered extents" introduced a bug that broke the ordered root list: WARNING: CPU: 1 PID: 7119 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
It is because we forgot to return the roots in the splice list to the ordered list of the fs. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
show more ...
|
#
b52abf1e |
| 06-Nov-2013 |
Filipe David Borba Manana <fdmanana@gmail.com> |
Btrfs: don't wait for ordered data outside desired range
In btrfs_wait_ordered_range(), if we found an extent to the left of the start of our desired wait range and the last byte of that extent is 1
Btrfs: don't wait for ordered data outside desired range
In btrfs_wait_ordered_range(), if we found an extent to the left of the start of our desired wait range and the last byte of that extent is 1 less than the desired range's start, we would would wait for the IO completion of that extent unnecessarily.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
show more ...
|
#
b0244199 |
| 04-Nov-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: don't wait for the completion of all the ordered extents
It is very likely that there are lots of ordered extents in the filesytem, if we wait for the completion of all of them when we want t
Btrfs: don't wait for the completion of all the ordered extents
It is very likely that there are lots of ordered extents in the filesytem, if we wait for the completion of all of them when we want to reclaim some space for the metadata space reservation, we would be blocked for a long time. The performance would drop down suddenly for a long time.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
show more ...
|
Revision tags: v3.12 |
|
#
93858769 |
| 28-Oct-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: take ordered root lock when removing ordered operations inode
A user reported a list corruption warning from btrfs_remove_ordered_extent, it is because we aren't taking the ordered_root_lock
Btrfs: take ordered root lock when removing ordered operations inode
A user reported a list corruption warning from btrfs_remove_ordered_extent, it is because we aren't taking the ordered_root_lock when we remove the inode from the ordered operations list. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
show more ...
|
Revision tags: v3.12-rc7 |
|
#
0ef8b726 |
| 25-Oct-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: return an error from btrfs_wait_ordered_range
I noticed that if the free space cache has an error writing out it's data it won't actually error out, it will just carry on. This is because it
Btrfs: return an error from btrfs_wait_ordered_range
I noticed that if the free space cache has an error writing out it's data it won't actually error out, it will just carry on. This is because it doesn't check the return value of btrfs_wait_ordered_range, which didn't actually return anything. So fix this in order to keep us from making free space cache look valid when it really isnt. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
show more ...
|
Revision tags: v3.12-rc6 |
|
#
5ede859b |
| 14-Oct-2013 |
chandan <chandan@linux.vnet.ibm.com> |
Btrfs: btrfs_add_ordered_operation: Fix last modified transaction comparison.
Comparison of an inode's last modified transaction with the last committed transaction is incorrect. Fix it.
Signed-off
Btrfs: btrfs_add_ordered_operation: Fix last modified transaction comparison.
Comparison of an inode's last modified transaction with the last committed transaction is incorrect. Fix it.
Signed-off-by: chandan <chandan@linux.vnet.ibm.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
show more ...
|
Revision tags: v3.12-rc5, v3.12-rc4, v3.12-rc3, v3.12-rc2 |
|
#
f0de181c |
| 17-Sep-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: kill delay_iput arg to the wait_ordered functions
This is a left over of how we used to wait for ordered extents, which was to grab the inode and then run filemap flush on it. However if we
Btrfs: kill delay_iput arg to the wait_ordered functions
This is a left over of how we used to wait for ordered extents, which was to grab the inode and then run filemap flush on it. However if we have an ordered extent then we already are holding a ref on the inode, and we just use btrfs_start_ordered_extent anyway, so there is no reason to have an extra ref on the inode to start work on the ordered extent. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
show more ...
|
Revision tags: v3.12-rc1, v3.11 |
|
#
77cef2ec |
| 29-Aug-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: allow partial ordered extent completion
We currently have this problem where you can truncate pages that have not yet been written for an ordered extent. We do this because the truncate will
Btrfs: allow partial ordered extent completion
We currently have this problem where you can truncate pages that have not yet been written for an ordered extent. We do this because the truncate will be coming behind to clean us up anyway so what's the harm right? Well if truncate fails for whatever reason we leave an orphan item around for the file to be cleaned up later. But if the user goes and truncates up the file and tries to read from the area that had been discarded previously they will get a csum error because we never actually wrote that data out.
This patch fixes this by allowing us to either discard the ordered extent completely, by which I mean we just free up the space we had allocated and not add the file extent, or adjust the length of the file extent we write. We do this by setting the length we truncated down to in the ordered extent, and then we set the file extent length and ram bytes to this length. The total disk space stays unchanged since we may be compressed and we can't just chop off the disk space, but at least this way the file extent only points to the valid data. Then when the file extent is free'd the extent and csums will be freed normally.
This patch is needed for the next series which will give us more graceful recovery of failed truncates. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
show more ...
|
Revision tags: v3.11-rc7 |
|
#
c1c9ff7c |
| 20-Aug-2013 |
Geert Uytterhoeven <geert@linux-m68k.org> |
Btrfs: Remove superfluous casts from u64 to unsigned long long
u64 is "unsigned long long" on all architectures now, so there's no need to cast it when formatting it using the "ll" length modifier.
Btrfs: Remove superfluous casts from u64 to unsigned long long
u64 is "unsigned long long" on all architectures now, so there's no need to cast it when formatting it using the "ll" length modifier.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
show more ...
|
Revision tags: v3.11-rc6 |
|
#
9ffba8cd |
| 14-Aug-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: fix heavy delalloc related deadlock
I added a patch where we started taking the ordered operations mutex when we waited on ordered extents. We need this because we splice the list and proces
Btrfs: fix heavy delalloc related deadlock
I added a patch where we started taking the ordered operations mutex when we waited on ordered extents. We need this because we splice the list and process it, so if a flusher came in during this scenario it would think the list was empty and we'd usually get an early ENOSPC. The problem with this is that this lock is used in transaction committing. So we end up with something like this
Transaction commit -> wait on writers
Delalloc flusher -> run_ordered_operations (holds mutex) ->wait for filemap-flush to do its thing
flush task -> cow_file_range ->wait on btrfs_join_transaction because we're commiting
some other task -> commit_transaction because we notice trans->transaction->flush is set -> run_ordered_operations (hang on mutex)
We need to disentangle the ordered operations flushing from the delalloc flushing, since they are separate things. This solves the deadlock issue I was seeing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
show more ...
|
Revision tags: v3.11-rc5, v3.11-rc4, v3.11-rc3, v3.11-rc2, v3.11-rc1, v3.10, v3.10-rc7 |
|
#
f51a4a18 |
| 18-Jun-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is unnecessary, because the extents that btrfs_sector_sum points to are continuous, we can fi
Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is unnecessary, because the extents that btrfs_sector_sum points to are continuous, we can find out the expected checksums by btrfs_ordered_sum's bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After removing bytenr, there is only one member in the structure, so it makes no sense to keep the structure, just remove it, and use a u32 array to store the checksum value.
By this change, we don't use the while loop to get the checksums one by one. Now, we can get several checksum value at one time, it improved the performance by ~74% on my SSD (31MB/s -> 54MB/s).
test command: # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
show more ...
|
Revision tags: v3.10-rc6, v3.10-rc5, v3.10-rc4, v3.10-rc3, v3.10-rc2 |
|
#
199c2a9c |
| 15-May-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: introduce per-subvolume ordered extent list
The reason we introduce per-subvolume ordered extent list is the same as the per-subvolume delalloc inode list.
Signed-off-by: Miao Xie <miaox@cn.
Btrfs: introduce per-subvolume ordered extent list
The reason we introduce per-subvolume ordered extent list is the same as the per-subvolume delalloc inode list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
show more ...
|
Revision tags: v3.10-rc1, v3.9, v3.9-rc8, v3.9-rc7, v3.9-rc6 |
|
#
e4100d98 |
| 05-Apr-2013 |
Miao Xie <miaox@cn.fujitsu.com> |
Btrfs: improve the performance of the csums lookup
It is very likely that there are several blocks in bio, it is very inefficient if we get their csums one by one. This patch improves this problem b
Btrfs: improve the performance of the csums lookup
It is very likely that there are several blocks in bio, it is very inefficient if we get their csums one by one. This patch improves this problem by getting the csums in batch.
According to the result of the following test, the execute time of __btrfs_lookup_bio_sums() is down by ~28%(300us -> 217us).
# dd if=<mnt>/file of=/dev/null bs=1M count=1024
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
show more ...
|
Revision tags: v3.9-rc5 |
|
#
db1d607d |
| 26-Mar-2013 |
Josef Bacik <jbacik@fusionio.com> |
Btrfs: hold the ordered operations mutex when waiting on ordered extents
We need to hold the ordered_operations mutex while waiting on ordered extents since we splice and run the ordered extents lis
Btrfs: hold the ordered operations mutex when waiting on ordered extents
We need to hold the ordered_operations mutex while waiting on ordered extents since we splice and run the ordered extents list. We need to make sure anybody else who wants to wait on ordered extents does actually wait for them to be completed. This will keep us from bailing out of flushing in case somebody is already waiting on ordered extents to complete. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
show more ...
|