Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27 |
|
54d687c1 | 29-Apr-2023 | Josef Bacik <josef@toxicpanda.com>
btrfs: move btrfs_check_trunc_cache_free_space into block-rsv.c
This is completely related to block rsv's, move it out of the free space cache code and into block-rsv.c.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
Revision tags: v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9, v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6 |
|
cb9a10a6 | 26-Oct-2022 | Josef Bacik <josef@toxicpanda.com>
btrfs: convert discard stat defs to enum
Do away with the defines and use an enum as it's cleaner.
Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
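For illustration, a minimal sketch of the define-to-enum conversion this commit describes; the identifiers are placeholders, not the actual btrfs names:

    /* Before: bare defines, values and the entry count kept in sync by hand. */
    #define DISCARD_STAT_PREV_OLD   0
    #define DISCARD_STAT_CURR_OLD   1
    #define DISCARD_STAT_NR_OLD     2

    /* After: an enum assigns the values automatically and gives them a type. */
    enum discard_stat {
        DISCARD_STAT_PREV,
        DISCARD_STAT_CURR,
        DISCARD_STAT_NR,
    };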
|
Revision tags: v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68 |
|
eda517fd | 14-Sep-2022 | Josef Bacik <josef@toxicpanda.com>
btrfs: move free space cachep's out of ctree.h
This is local to the free-space-cache.c code, remove it from ctree.h and inode.c, create new init/exit functions for the cachep, and move it locally to free-space-cache.c.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
390d89cc | 14-Sep-2022 | Josef Bacik <josef@toxicpanda.com>
btrfs: move discard stat defs to free-space-cache.h
These definitions are used for discard statistics, move them out of ctree.h and put them in free-space-cache.h.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
Revision tags: v5.15.67, v5.15.66, v5.15.65, v5.15.64, v5.15.63, v5.15.62, v5.15.61, v5.15.60 |
|
fc80f7ac | 08-Aug-2022 | Josef Bacik <josef@toxicpanda.com>
btrfs: remove use btrfs_remove_free_space_cache instead of variant
We are calling __btrfs_remove_free_space_cache everywhere to clean up the block group free space, but we can just use btrfs_remove_free_space_cache and pass in the block group in all of these places. Then we can remove __btrfs_remove_free_space_cache and rename __btrfs_remove_free_space_cache_locked to __btrfs_remove_free_space_cache.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
Revision tags: v5.15.59, v5.19, v5.15.58, v5.15.57, v5.15.56, v5.15.55, v5.15.54, v5.15.53, v5.15.52, v5.15.51, v5.15.50, v5.15.49, v5.15.48, v5.15.47, v5.15.46, v5.15.45, v5.15.44, v5.15.43, v5.15.42, v5.18, v5.15.41, v5.15.40, v5.15.39, v5.15.38, v5.15.37, v5.15.36, v5.15.35, v5.15.34, v5.15.33, v5.15.32, v5.15.31, v5.17, v5.15.30, v5.15.29, v5.15.28, v5.15.27, v5.15.26, v5.15.25, v5.15.24, v5.15.23, v5.15.22, v5.15.21, v5.15.20, v5.15.19, v5.15.18, v5.15.17, v5.4.173, v5.15.16, v5.15.15, v5.16, v5.15.10, v5.15.9, v5.15.8, v5.15.7, v5.15.6, v5.15.5 |
|
364be842 | 23-Nov-2021 | Nikolay Borisov <nborisov@suse.com>
btrfs: change name and type of private member of btrfs_free_space_ctl
btrfs_free_space_ctl::private is either unset or it always points to struct btrfs_block_group when it is set. So there's no point in keeping the unhelpful 'private' name and keeping it an untyped pointer. Change both the type and name to be self-describing. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
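A before/after sketch of the member change this commit describes; the structs are reduced to the single field in question:

    struct btrfs_block_group;

    /* Before: an untyped pointer with an uninformative name. */
    struct free_space_ctl_before {
        void *private;              /* in practice always a block group */
    };

    /* After: the field is self-describing and the compiler can type-check uses. */
    struct free_space_ctl_after {
        struct btrfs_block_group *block_group;
    };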
|
290ef19a | 23-Nov-2021 | Nikolay Borisov <nborisov@suse.com>
btrfs: make __btrfs_add_free_space take just block group reference
There is no point in the function taking an fs_info and a btrfs_free_space_ctl because the ctl passed always belongs to the block group. Furthermore fs_info can be referenced from the block group. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
Revision tags: v5.15.4 |
|
59c7b566 | 18-Nov-2021 | Josef Bacik <josef@toxicpanda.com>
btrfs: index free space entries on size
Currently we index free space on offset only, because usually we have a hint from the allocator that we want to honor for locality reasons. However if we fail to use this hint we have to go back to a brute force search through the free space entries to find a large enough extent.
With sufficiently fragmented free space this becomes quite expensive, as we have to linearly search all of the free space entries to find if we have a part that's long enough.
To fix this add a cached rb tree to index based on free space entry bytes. This will allow us to quickly look up the largest chunk in the free space tree for this block group, and stop searching once we've found an entry that is too small to satisfy our allocation. We simply choose to use this tree if we're searching from the beginning of the block group, as we know we do not care about locality at that point.
I wrote an allocator test that creates a 10TiB ram backed null block device and then fallocates random files until the file system is full. I then go through and delete all of the odd files. Then I spawn 8 threads that fallocate 64MiB files (1/2 our extent size cap) until the file system is full again. I use bcc's funclatency to measure the latency of find_free_extent. The baseline results are
    nsecs                 : count    distribution
            0 -> 1        : 0        |
            2 -> 3        : 0        |
            4 -> 7        : 0        |
            8 -> 15       : 0        |
           16 -> 31       : 0        |
           32 -> 63       : 0        |
           64 -> 127      : 0        |
          128 -> 255      : 0        |
          256 -> 511      : 10356    |****
          512 -> 1023     : 58242    |*************************
         1024 -> 2047     : 74418    |********************************
         2048 -> 4095     : 90393    |****************************************
         4096 -> 8191     : 79119    |***********************************
         8192 -> 16383    : 35614    |***************
        16384 -> 32767    : 13418    |*****
        32768 -> 65535    : 12811    |*****
        65536 -> 131071   : 17090    |*******
       131072 -> 262143   : 26465    |***********
       262144 -> 524287   : 40179    |*****************
       524288 -> 1048575  : 55469    |************************
      1048576 -> 2097151  : 48807    |*********************
      2097152 -> 4194303  : 26744    |***********
      4194304 -> 8388607  : 35351    |***************
      8388608 -> 16777215 : 13918    |******
     16777216 -> 33554431 : 21       |
avg = 908079 nsecs, total: 580889071441 nsecs, count: 639690
And the patch results are
    nsecs                 : count    distribution
            0 -> 1        : 0        |
            2 -> 3        : 0        |
            4 -> 7        : 0        |
            8 -> 15       : 0        |
           16 -> 31       : 0        |
           32 -> 63       : 0        |
           64 -> 127      : 0        |
          128 -> 255      : 0        |
          256 -> 511      : 6883     |**
          512 -> 1023     : 54346    |*********************
         1024 -> 2047     : 79170    |********************************
         2048 -> 4095     : 98890    |****************************************
         4096 -> 8191     : 81911    |*********************************
         8192 -> 16383    : 27075    |**********
        16384 -> 32767    : 14668    |*****
        32768 -> 65535    : 13251    |*****
        65536 -> 131071   : 15340    |******
       131072 -> 262143   : 26715    |**********
       262144 -> 524287   : 43274    |*****************
       524288 -> 1048575  : 53870    |*********************
      1048576 -> 2097151  : 55368    |**********************
      2097152 -> 4194303  : 41036    |****************
      4194304 -> 8388607  : 24927    |**********
      8388608 -> 16777215 : 33       |
     16777216 -> 33554431 : 9        |
avg = 623599 nsecs, total: 397259314759 nsecs, count: 637042
There's a little variation in the amount of calls done because of timing of the threads with metadata requirements, but the avg, total, and counts are relatively consistent between runs (usually within 2-5% of each other). As you can see here we have around a 30% decrease in average latency with a 30% decrease in overall time spent in find_free_extent.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
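A toy sketch of the size-indexed lookup idea, with a descending-sorted array standing in for the cached rb-tree; the names are invented for illustration and this is not the kernel implementation:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    struct free_space_entry {
        uint64_t offset;
        uint64_t bytes;
    };

    /*
     * by_bytes_desc[] plays the role of the rb-tree indexed on entry size,
     * largest entry first.  Because the entries are ordered by size, the
     * search can stop at the first entry too small to satisfy the request.
     */
    static const struct free_space_entry *
    find_by_size(const struct free_space_entry *by_bytes_desc, size_t n,
                 uint64_t wanted)
    {
        const struct free_space_entry *best = NULL;

        for (size_t i = 0; i < n; i++) {
            if (by_bytes_desc[i].bytes < wanted)
                break;                      /* everything after is smaller */
            best = &by_bytes_desc[i];       /* tightest fit seen so far */
        }
        return best;
    }

    int main(void)
    {
        const struct free_space_entry entries[] = {
            { .offset = 4096,   .bytes = 1024 * 1024 },
            { .offset = 262144, .bytes = 64 * 1024 },
            { .offset = 131072, .bytes = 8 * 1024 },
        };
        const struct free_space_entry *e = find_by_size(entries, 3, 32 * 1024);

        if (e)
            printf("use extent at %llu, %llu bytes\n",
                   (unsigned long long)e->offset,
                   (unsigned long long)e->bytes);
        return 0;
    }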
|
Revision tags: v5.15.3, v5.15.2, v5.15.1, v5.15, v5.14.14, v5.14.13, v5.14.12, v5.14.11, v5.14.10, v5.14.9, v5.14.8, v5.14.7, v5.14.6, v5.10.67, v5.10.66, v5.14.5, v5.14.4, v5.10.65, v5.14.3, v5.10.64, v5.14.2, v5.10.63, v5.14.1, v5.10.62, v5.14, v5.10.61, v5.10.60, v5.10.53, v5.10.52, v5.10.51, v5.10.50, v5.10.49, v5.13, v5.10.46, v5.10.43, v5.10.42, v5.10.41, v5.10.40, v5.10.39, v5.4.119, v5.10.36, v5.10.35, v5.10.34, v5.4.116, v5.10.33, v5.12, v5.10.32, v5.10.31, v5.10.30, v5.10.27, v5.10.26, v5.10.25, v5.10.24, v5.10.23, v5.10.22, v5.10.21, v5.10.20, v5.10.19, v5.4.101, v5.10.18, v5.10.17, v5.11, v5.10.16, v5.10.15, v5.10.14 |
|
169e0da9 | 04-Feb-2021 | Naohiro Aota <naohiro.aota@wdc.com>
btrfs: zoned: track unusable bytes for zones
In a zoned filesystem a once written then freed region is not usable until the underlying zone has been reset. So we need to distinguish such unusable space from usable free space.
Therefore we need to introduce the "zone_unusable" field to the block group structure, and "bytes_zone_unusable" to the space_info structure to track the unusable space.
Pinned bytes are always reclaimed to the unusable space. But when an allocated region is returned before being used, e.g. because the block group becomes read-only between allocation time and reservation time, we can safely return the region to the block group. For this situation, this commit introduces "btrfs_add_free_space_unused". This behaves the same as btrfs_add_free_space() on a regular filesystem. On zoned filesystems, it rewinds the allocation offset.
Because the read-only bytes tracks free but unusable bytes when the block group is read-only, we need to migrate the zone_unusable bytes to read-only bytes when a block group is marked read-only.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
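A hedged sketch of the distinction described above; the struct, the helper name, and the offset check are assumptions made for illustration, not the kernel code:

    #include <stdbool.h>
    #include <stdint.h>

    struct block_group_sketch {
        bool zoned;
        uint64_t alloc_offset;      /* zoned: sequential allocation pointer */
        uint64_t free_bytes;        /* regular: free space cache total */
        uint64_t zone_unusable;     /* zoned: freed but not-yet-reset bytes */
    };

    /* An allocated region comes back without ever having been written, e.g.
     * the block group went read-only between allocation and reservation. */
    static void add_free_space_unused_sketch(struct block_group_sketch *bg,
                                             uint64_t start, uint64_t len)
    {
        if (!bg->zoned) {
            bg->free_bytes += len;  /* same as a regular free-space add */
            return;
        }
        /* Zoned: the region is still unwritten, so rewind the allocation
         * offset instead of accounting the bytes as zone_unusable. */
        if (bg->alloc_offset == start + len)
            bg->alloc_offset = start;
    }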
|
Revision tags: v5.10 |
|
36b216c8 | 18-Nov-2020 | Boris Burkov <boris@bur.io>
btrfs: remove free space items when disabling space cache v1
When the filesystem transitions from space cache v1 to v2 or to nospace_cache, it removes the old cached data, but does not remove the FREE_SPACE items nor the free space inodes they point to. This doesn't cause any issues besides being a bit inefficient, since these items no longer do anything useful.
To fix it, when we are mounting, and plan to disable the space cache, destroy each block group's free space item and free space inode. The code to remove the items is lifted from the existing use case of removing the block group, with a light adaptation to handle whether or not we have already looked up the free space inode.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
|
94846229 | 18-Nov-2020 | Boris Burkov <boris@bur.io>
btrfs: keep sb cache_generation consistent with space_cache
When mounting, btrfs uses the cache_generation in the super block to determine if space cache v1 is in use. However, by mounting with nospace_cache or space_cache=v2, it is possible to disable space cache v1, which does not result in un-setting cache_generation back to 0.
In order to base some logic, like mount option printing in /proc/mounts, on the current state of the space cache rather than just the values of the mount option, keep the value of cache_generation consistent with the status of space cache v1.
We ensure that cache_generation > 0 iff the file system is using space_cache v1. This requires committing a transaction on any mount which changes whether we are using v1. (v1->nospace_cache, v1->v2, nospace_cache->v1, v2->v1).
Since the mechanism for writing out the cache generation is transaction commit, but we want some finer grained control over when we un-set it, we can't just rely on the SPACE_CACHE mount option, and introduce an fs_info flag that mount can use when it wants to unset the generation.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
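A minimal sketch of the invariant described above, with hypothetical types standing in for the btrfs superblock and transaction state:

    #include <stdbool.h>
    #include <stdint.h>

    struct super_info {
        uint64_t cache_generation;
    };

    /* cache_generation > 0 iff space cache v1 is in use, so the state can be
     * derived from the superblock alone (e.g. for /proc/mounts printing). */
    static bool space_cache_v1_active(const struct super_info *sb)
    {
        return sb->cache_generation > 0;
    }

    /* Any change of the v1 state goes through a transaction commit that
     * rewrites cache_generation, which is what keeps the invariant true. */
    static void commit_space_cache_state(struct super_info *sb, bool v1_enabled,
                                         uint64_t transid)
    {
        sb->cache_generation = v1_enabled ? transid : 0;
    }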
|
fa598b06 | 03-Dec-2020 | David Sterba <dsterba@suse.com>
btrfs: remove recalc_thresholds from free space ops
After removing the inode number cache that was using the free space cache code, we can remove at least the recalc_thresholds callback from the ops. Both code and tests use the same callback function. It's moved before its first use.
The use_bitmaps callback is still needed by tests to create some extents/bitmap setup.
Signed-off-by: David Sterba <dsterba@suse.com>
|
7dbdb443 | 03-Dec-2020 | Nikolay Borisov <nborisov@suse.com>
btrfs: remove crc_check logic from free space
Following removal of the ino cache io_ctl_init will be called only on behalf of the freespace inode. In this case we always want to check CRCs so conditional code that depended on io_ctl::check_crc can be removed.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
5297199a | 26-Nov-2020 | Nikolay Borisov <nborisov@suse.com>
btrfs: remove inode number cache feature
It's been deprecated since commit b547a88ea577 ("btrfs: start deprecation of mount option inode_cache") which enumerates the reasons.
A filesystem that uses the feature (mount -o inode_cache) tracks the inode numbers in bitmaps, that data stay on the filesystem after this patch. The size is roughly 5MiB for 1M inodes [1], which is considered small enough to be left there. Removal of the change can be implemented in btrfs-progs if needed.
[1] https://lore.kernel.org/linux-btrfs/20201127145836.GZ6430@twin.jikos.cz/
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
Revision tags: v5.8.17 |
|
cd79909b | 23-Oct-2020 | Josef Bacik <josef@toxicpanda.com>
btrfs: load free space cache into a temporary ctl
The free space cache has been special in that we would load it right away instead of farming the work off to a worker thread. This resulted in some weirdness that had to be taken into account for this fact, namely that if we ever found a block group being cached the fast way we had to wait for it to finish, because we could get the cache before it had been validated and we may throw the cache away.
To handle this particular case instead create a temporary btrfs_free_space_ctl to load the free space cache into. Then once we've validated that it makes sense, copy its contents into the actual block_group->free_space_ctl. This allows us to avoid the problems of needing to wait for the caching to complete, we can clean up the discard extent handling stuff in __load_free_space_cache, and we no longer need to do the merge_space_tree() because the space is added one by one into the real free_space_ctl. This will allow further reworks of how we handle loading the free space cache.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
Revision tags: v5.8.16, v5.8.15, v5.9, v5.8.14, v5.8.13, v5.8.12, v5.8.11, v5.8.10, v5.8.9, v5.8.8, v5.8.7, v5.8.6, v5.4.62, v5.8.5, v5.8.4, v5.4.61, v5.8.3, v5.4.60, v5.8.2, v5.4.59, v5.8.1, v5.4.58, v5.4.57, v5.4.56, v5.8, v5.7.12, v5.4.55, v5.7.11, v5.4.54, v5.7.10, v5.4.53, v5.4.52, v5.7.9, v5.7.8, v5.4.51, v5.4.50, v5.7.7, v5.4.49, v5.7.6, v5.7.5, v5.4.48, v5.7.4, v5.7.3, v5.4.47, v5.4.46, v5.7.2, v5.4.45, v5.7.1 |
|
69b0e093 | 03-Jun-2020 | Anand Jain <anand.jain@oracle.com>
btrfs: let btrfs_return_cluster_to_free_space() return void
__btrfs_return_cluster_to_free_space() returns only 0. And all its parent functions don't need the return value either so make this a void function.
Further, as none of the callers of btrfs_return_cluster_to_free_space() is actually using the return from this function, make this function also return void.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
Revision tags: v5.4.44, v5.7, v5.4.43, v5.4.42, v5.4.41, v5.4.40, v5.4.39, v5.4.38, v5.4.37, v5.4.36, v5.4.35, v5.4.34, v5.4.33, v5.4.32, v5.4.31, v5.4.30, v5.4.29, v5.6, v5.4.28, v5.4.27, v5.4.26, v5.4.25, v5.4.24, v5.4.23, v5.4.22, v5.4.21, v5.4.20, v5.4.19, v5.4.18, v5.4.17, v5.4.16, v5.5, v5.4.15, v5.4.14, v5.4.13, v5.4.12, v5.4.11, v5.4.10, v5.4.9, v5.4.8 |
|
7fe6d45e | 02-Jan-2020 | Dennis Zhou <dennis@kernel.org>
btrfs: have multiple discard lists
Non-block group destruction discarding currently only had a single list with no minimum discard length. This can lead to caravaning more meaningful discards behind a heavily fragmented block group.
This adds support for multiple lists with minimum discard lengths to prevent the caravan effect. We promote block groups back up when we exceed the BTRFS_ASYNC_DISCARD_MAX_FILTER size, currently we support only 2 lists with filters of 1MB and 32KB respectively.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
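A rough sketch of the list-selection idea; the 1MB and 32KB values are taken from the message above, while the names and the exact promotion rule are simplifications for illustration:

    #include <stdint.h>

    #define ASYNC_DISCARD_MAX_FILTER    (1024 * 1024)   /* 1MB */
    #define ASYNC_DISCARD_MIN_FILTER    (32 * 1024)     /* 32KB */

    enum discard_list_id {
        DISCARD_LIST_LARGE,     /* block groups still holding >= 1MB extents */
        DISCARD_LIST_SMALL,     /* heavily fragmented block groups */
        DISCARD_LIST_NR,
    };

    /* Classify a block group by its largest discardable extent so fragmented
     * block groups cannot caravan more meaningful discards behind them. */
    static enum discard_list_id pick_discard_list(uint64_t largest_extent_bytes)
    {
        if (largest_extent_bytes >= ASYNC_DISCARD_MAX_FILTER)
            return DISCARD_LIST_LARGE;
        return DISCARD_LIST_SMALL;
    }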
|
Revision tags: v5.4.7, v5.4.6, v5.4.5, v5.4.4 |
|
5dc7c10b | 13-Dec-2019 | Dennis Zhou <dennis@kernel.org>
btrfs: keep track of discardable_bytes for async discard
Keep track of this metric so that we can understand how ahead or behind we are in discarding rate. This uses the same accounting method as discardable_extents, deltas between previous/current values and propagating them up.
Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
|
dfb79ddb | 13-Dec-2019 | Dennis Zhou <dennis@kernel.org>
btrfs: track discardable extents for async discard
The number of discardable extents will serve as the rate limiting metric for how often we should discard. This keeps track of discardable extents in the free space caches by maintaining deltas and propagating them to the global count.
The deltas are calculated from 2 values stored in PREV and CURR entries, then propagated up to the global discard ctl. The current counter value becomes the previous counter value after update.
Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
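A small sketch of the PREV/CURR delta propagation described above, using illustrative types; the kernel version additionally handles locking and the bytes counter:

    #include <stdint.h>

    enum { STAT_PREV, STAT_CURR, STAT_NR };

    struct cache_counters {
        int64_t discardable_extents[STAT_NR];
    };

    struct discard_ctl_sketch {
        int64_t discardable_extents;            /* global count */
    };

    /* Push only the delta since the last update to the global counter, then
     * make the current value the new previous value. */
    static void propagate_discardable_extents(struct discard_ctl_sketch *ctl,
                                              struct cache_counters *c)
    {
        int64_t delta = c->discardable_extents[STAT_CURR] -
                        c->discardable_extents[STAT_PREV];

        if (delta) {
            ctl->discardable_extents += delta;
            c->discardable_extents[STAT_PREV] =
                c->discardable_extents[STAT_CURR];
        }
    }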
|
2bee7eb8 | 13-Dec-2019 | Dennis Zhou <dennis@kernel.org>
btrfs: discard one region at a time in async discard
The prior two patches added discarding via a background workqueue. This just piggybacked off of the fstrim code to trim the whole block at once. Well inevitably this is worse performance wise and will aggressively overtrim. But it was nice to plumb the other infrastructure to keep the patches easier to review.
This adds the real goal of this series which is discarding slowly (ie. a slow long running fstrim). The discarding is split into two phases, extents and then bitmaps. The reason for this is two fold. First, the bitmap regions overlap the extent regions. Second, discarding the extents first will let the newly trimmed bitmaps have the highest chance of coalescing when being readded to the free space cache.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
6e80d4f8 | 13-Dec-2019 | Dennis Zhou <dennis@kernel.org>
btrfs: handle empty block_group removal for async discard
block_group removal is a little tricky. It can race with the extent allocator, the cleaner thread, and balancing. The current path is for a block_group to be added to the unused_bgs list. Then, when the cleaner thread comes around, it starts a transaction and then proceeds with removing the block_group. Extents that are pinned are subsequently removed from the pinned trees and then eventually a discard is issued for the entire block_group.
Async discard introduces another player into the game, the discard workqueue. While it has none of the racing issues, the new problem is ensuring we don't leave free space untrimmed prior to forgetting the block_group. This is handled by placing fully free block_groups on a separate discard queue. This is necessary to maintain discarding order as in the future we will slowly trim even fully free block_groups. The ordering helps us make progress on the same block_group rather than say the last fully freed block_group or needing to search through the fully freed block groups at the beginning of a list and insert after.
The new order of events is a fully freed block group gets placed on the unused discard queue first. Once it's processed, it will be placed on the unused_bgs list and then the original sequence of events will happen, just without the final whole block_group discard.
The mount flags can change when processing unused_bgs, so when flipping from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt free block groups on the discard_list to the unused_bg queue which will do the final discard for us.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
b0643e59 | 13-Dec-2019 | Dennis Zhou <dennis@kernel.org>
btrfs: add the beginning of async discard, discard workqueue
When discard is enabled, every time a pinned extent is released back to the block_group's free space cache, a discard is issued for the extent. This is an overeager approach when it comes to discarding and helping the SSD maintain enough free space to prevent severe garbage collection situations.
This adds the beginning of async discard. Instead of issuing a discard prior to returning it to the free space, it is just marked as untrimmed. The block_group is then added to a LRU which then feeds into a workqueue to issue discards at a much slower rate. Full discarding of unused block groups is still done and will be addressed in a future patch of the series.
For now, we don't persist the discard state of extents and bitmaps. Therefore, our failure recovery mode will be to consider extents untrimmed. This lets us handle failure and unmounting as one in the same.
On a number of Facebook webservers, I collected data every minute accounting the time we spent in btrfs_finish_extent_commit() (col. 1) and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit() is where we discard extents synchronously before returning them to the free space cache.
    discard=sync:
                      p99 total per minute     p99 total per minute
          Drive   |   extent_commit() (ms)  |   commit_trans() (ms)
          ---------------------------------------------------------------
          Drive A |           434           |          1170
          Drive B |           880           |          2330
          Drive C |          2943           |          3920
          Drive D |          4763           |          5701

    discard=async:
                      p99 total per minute     p99 total per minute
          Drive   |   extent_commit() (ms)  |   commit_trans() (ms)
          ---------------------------------------------------------------
          Drive A |           134           |           956
          Drive B |            64           |          1972
          Drive C |            59           |          1032
          Drive D |            62           |          1200
While it's not great that the stats are cumulative over 1m, all of these servers are running the same workload and the delta between the two is substantial. We are spending significantly less time in btrfs_finish_extent_commit() which is responsible for discarding.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
da080fe1 | 13-Dec-2019 | Dennis Zhou <dennis@kernel.org>
btrfs: keep track of free space bitmap trim status cleanliness
There is a cap in btrfs in the amount of free extents that a block group can have. When it surpasses that threshold, future extents are placed into bitmaps. Instead of keeping track of if a certain bit is trimmed or not in a second bitmap, keep track of the relative state of the bitmap.
With async discard, trimming bitmaps becomes a more frequent operation. As a trade off with simplicity, we keep track of if discarding a bitmap is in progress. If we fully scan a bitmap and trim as necessary, the bitmap is marked clean. This has some caveats as the min block size may skip over regions deemed too small. But this should be a reasonable trade off rather than keeping a second bitmap and making allocation paths more complex. The downside is we may overtrim, but ideally the min block size should prevent us from doing that too often and getting stuck trimming pathological cases.
BTRFS_TRIM_STATE_TRIMMING is added to indicate a bitmap is in the process of being trimmed. If additional free space is added to that bitmap, the bit is cleared. A bitmap will be marked BTRFS_TRIM_STATE_TRIMMED if the trimming code was able to reach the end of it and the former is still set.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
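A compact sketch of the trim state transitions described above; the shortened state and function names are placeholders for the BTRFS_TRIM_STATE_* values and the real free space cache hooks:

    enum trim_state {
        TRIM_UNTRIMMED,
        TRIM_TRIMMED,
        TRIM_TRIMMING,      /* a full scan/trim of the bitmap is in flight */
    };

    struct bitmap_sketch {
        enum trim_state trim_state;
    };

    /* Free space added to the bitmap invalidates a trim that is in progress. */
    static void bitmap_add_free_space(struct bitmap_sketch *b)
    {
        b->trim_state = TRIM_UNTRIMMED;
    }

    /* Only mark the bitmap clean if the scan reached the end uninterrupted. */
    static void bitmap_trim_finished(struct bitmap_sketch *b)
    {
        if (b->trim_state == TRIM_TRIMMING)
            b->trim_state = TRIM_TRIMMED;
    }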
|
a7ccb255 | 13-Dec-2019 | Dennis Zhou <dennis@kernel.org>
btrfs: keep track of which extents have been discarded
Async discard will use the free space cache as backing knowledge for which extents to discard. This patch plumbs knowledge about which extents need to be discarded into the free space cache from unpin_extent_range().
An untrimmed extent can merge with everything as this is a new region. Absorbing trimmed extents is a tradeoff for greater coalescing which makes life better for find_free_extent(). Additionally, it seems the size of a trim isn't as problematic as the trim io itself.
When reading in the free space cache from disk, if sync is set, mark all extents as trimmed. The current code ensures at transaction commit that all free space is trimmed when sync is set, so this reflects that.
Signed-off-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|
Revision tags: v5.4.3, v5.3.15, v5.4.2, v5.4.1, v5.3.14, v5.4, v5.3.13, v5.3.12, v5.3.11, v5.3.10, v5.3.9 |
|
32da5386 | 29-Oct-2019 | David Sterba <dsterba@suse.com>
btrfs: rename btrfs_block_group_cache
The type name is misleading, a single entry is named 'cache' while this normally means a collection of objects. Rename that everywhere. Also the identifier was quite long, making function prototypes harder to format.
Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
|