History log of /openbmc/linux/fs/btrfs/file.c (Results 1 – 25 of 3153)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
# fd5e9fcc 23-Feb-2025 Andrew Jeffery <andrew@codeconstruct.com.au>

Merge tag 'v6.6.79' into for/openbmc/dev-6.6

This is the 6.6.79 stable release

# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAme4eH0ACgkQONu9yGCS
# aT6dPRAAo8SiKI7

Merge tag 'v6.6.79' into for/openbmc/dev-6.6

This is the 6.6.79 stable release

# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAme4eH0ACgkQONu9yGCS
# aT6dPRAAo8SiKI7YXQB5KPEkcuD3rKr53C1EMxaiSPPXrgQw95R9HYjkPg6YQw+4
# pvyW2HNMbvC/s7HPVoT9R4VLOkcTJ8d0jtnZFgcwEjQzTNTI1Xju+rqate/lGhHY
# cMwb93ThgBUqxwmKSZ3LaknzNrCb1EXopvERXM05anbSQ+JQ5+oq0kA2xOJmu5dm
# GuQyi1AhkG4/Fh3r0wdUlP7/pO6Tx/XiZt6c7JZ/RQ6lGVkpzhn0m6r8U/7hkei7
# fKLGRfbQAqBrR5+tUQQq7pdSJcJWGygd7wdutD790yYhWC5pq36KzoLjxNEpDIcD
# k37e4teclWlRZb6VsprhLlX4KDDc/kgot7A6Ko44tWeo+dGsx+s5jhO6TVEbPeU/
# YgNP87FlrBf5Li0uk0iMYyAT11KEQXroJ9AZJ/KuNpCA47+2scvF4B3tK1QcdSmv
# 21W7ysiGepfk+NG3Gbm7vqKp7JaVZakIZquqx7CUMZz//VcWPj5AgOYzBMmtMALk
# I5Bqt8Zo2I0hWiqvdmOGnQMwUYQ7B3wPfR3lWM95UfWV2bxC3NvSnC88eI8/VolM
# IrH9l26UHIj5sTGJAZOahq9R7h+lBAr419tDw0Z+GeATqqXX/17BXUomHPuviGyj
# /rA67sxTulhw3oSJLsIDKgmMzau8SJQdWDa5aJ5wBPdrgpvaATs=
# =tz58
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 21 Feb 2025 23:28:37 ACDT
# gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E
# gpg: Good signature from "Greg Kroah-Hartman <gregkh@kernel.org>" [marginal]
# gpg: gregkh@kernel.org: Verified 8 signatures in the past 4 weeks. Encrypted
# 0 messages.
# gpg: Warning: you have yet to encrypt a message to this key!
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 647F 2865 4894 E3BD 4571 99BE 38DB BDC8 6092 693E

show more ...


Revision tags: v6.6.79, v6.6.78, v6.6.77, v6.6.76
# 29775681 05-Feb-2025 Filipe Manana <fdmanana@suse.com>

btrfs: fix hole expansion when writing at an offset beyond EOF

commit da2dccd7451de62b175fb8f0808d644959e964c7 upstream.

At btrfs_write_check() if our file's i_size is not sector size aligned and
w

btrfs: fix hole expansion when writing at an offset beyond EOF

commit da2dccd7451de62b175fb8f0808d644959e964c7 upstream.

At btrfs_write_check() if our file's i_size is not sector size aligned and
we have a write that starts at an offset larger than the i_size that falls
within the same page of the i_size, then we end up not zeroing the file
range [i_size, write_offset).

The code is this:

start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {
/* Expand hole size to cover write data, preventing empty gap */
loff_t end_pos = round_up(pos + count, fs_info->sectorsize);

ret = btrfs_cont_expand(BTRFS_I(inode), oldsize, end_pos);
if (ret)
return ret;
}

So if our file's i_size is 90269 bytes and a write at offset 90365 bytes
comes in, we get 'start_pos' set to 90112 bytes, which is less than the
i_size and therefore we don't zero out the range [90269, 90365) by
calling btrfs_cont_expand().

This is an old bug introduced in commit 9036c10208e1 ("Btrfs: update hole
handling v2"), from 2008, and the buggy code got moved around over the
years.

Fix this by discarding 'start_pos' and comparing against the write offset
('pos') without any alignment.

This bug was recently exposed by test case generic/363 which tests this
scenario by polluting ranges beyond EOF with an mmap write and than verify
that after a file increases we get zeroes for the range which is supposed
to be a hole and not what we wrote with the previous mmaped write.

We're only seeing this exposed now because generic/363 used to run only
on xfs until last Sunday's fstests update.

The test was failing like this:

$ ./check generic/363
FSTYP -- btrfs
PLATFORM -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025
MKFS_OPTIONS -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

generic/363 0s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad)
# --- tests/generic/363.out 2025-02-05 15:31:14.013646509 +0000
# +++ /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad 2025-02-05 17:25:33.112630781 +0000
@@ -1 +1,46 @@
QA output created by 363
+READ BAD DATA: offset = 0xdcad, size = 0xd921, fname = /home/fdmanana/btrfs-tests/dev/junk
+OFFSET GOOD BAD RANGE
+0x1609d 0x0000 0x3104 0x0
+operation# (mod 256) for the bad data may be 4
+0x1609e 0x0000 0x0472 0x1
+operation# (mod 256) for the bad data may be 4
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/363.out /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad' to see the entire diff)
Ran: generic/363
Failures: generic/363
Failed 1 of 1 tests

Fixes: 9036c10208e1 ("Btrfs: update hole handling v2")
CC: stable@vger.kernel.org
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


# 360823a0 17-Feb-2025 Andrew Jeffery <andrew@codeconstruct.com.au>

Merge tag 'v6.6.78' into for/openbmc/dev-6.6

This is the 6.6.78 stable release

# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmey9hUACgkQONu9yGCS
# aT7Ecw//Ts3+DVy

Merge tag 'v6.6.78' into for/openbmc/dev-6.6

This is the 6.6.78 stable release

# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAmey9hUACgkQONu9yGCS
# aT7Ecw//Ts3+DVyM1iMAUj6zZHQ7+UVqRxvVQ0yJwe1gzECrasxhu+ack0MDuRXb
# RTOHzrVkpHrOZ58T0kkkp4DVea4bq8kpq9wnnOxpta4SzQYuwxuypxw9ZML2u8kR
# A77akcb4MPBpeTwlLUTEX1K2CrF+Wfz9ZGauJRTmrnWogJe1hZWTxr3tc9TqGeMA
# tk93g9kWy7hxxubPJpAUbNVmWbpm/TfZuMAEyktpNf8E0DLukHjr0If85t3BC0KZ
# kxLSCN05ZmWoZVQjmaerS8pXFvwj08OeRbUtW+b4oaraUV7vsrwxW/WcOqb6vIBn
# AEohV3w7CpFj0moRPXJO+UuxmP5TrSCIGUaEGjnrMCPJfjxwnmFYaf+9DYi3bR4H
# U8UyU55PhGTWlWg238Qp64KsDn41M/rlNKOiPEGq08+1Qnhoj4LWfFFHzLhO8y4R
# xLfsOzu6cHgEUnMKPTV6TnkWSCEL9t51wgzsqa7iKdO7kyAL1YCb4+LkskJAqUzW
# t3i8Sw8nygE7cKQ5eHzG6CClKEfgxtMGiR63gan9npEUgcFbzoVP0uz9RYz7+0Vz
# 5oE2ZSGXSoiJNWhdjJVrr1gqg/TwrzmVjsmUEnf4uTDABh9GXL+g+UZHGSMvvvYi
# T8gUY4aFwXO5fGKN1RW8RXJSbJr4nKYde2s/h4ZT1EwRVdj5Zcc=
# =+i1A
# -----END PGP SIGNATURE-----
# gpg: Signature made Mon 17 Feb 2025 19:10:53 ACDT
# gpg: using RSA key 647F28654894E3BD457199BE38DBBDC86092693E
# gpg: Good signature from "Greg Kroah-Hartman <gregkh@kernel.org>" [marginal]
# gpg: gregkh@kernel.org: Verified 7 signatures in the past 3 weeks. Encrypted
# 0 messages.
# gpg: Warning: you have yet to encrypt a message to this key!
# gpg: Warning: if you think you've seen more signatures by this key and user
# id, then this key might be a forgery! Carefully examine the email address
# for small variations. If the key is suspect, then use
# gpg --tofu-policy bad 647F28654894E3BD457199BE38DBBDC86092693E
# to mark it as being bad.
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 647F 2865 4894 E3BD 4571 99BE 38DB BDC8 6092 693E

show more ...


Revision tags: v6.6.75, v6.6.74, v6.6.73, v6.6.72, v6.6.71, v6.12.9, v6.6.70, v6.12.8, v6.6.69, v6.12.7, v6.6.68, v6.12.6, v6.6.67, v6.12.5, v6.6.66, v6.6.65, v6.12.4, v6.6.64, v6.12.3, v6.12.2
# 726d3634 03-Dec-2024 Hao-ran Zheng <zhenghaoran154@gmail.com>

btrfs: fix data race when accessing the inode's disk_i_size at btrfs_drop_extents()

[ Upstream commit 5324c4e10e9c2ce307a037e904c0d9671d7137d9 ]

A data race occurs when the function `insert_ordered

btrfs: fix data race when accessing the inode's disk_i_size at btrfs_drop_extents()

[ Upstream commit 5324c4e10e9c2ce307a037e904c0d9671d7137d9 ]

A data race occurs when the function `insert_ordered_extent_file_extent()`
and the function `btrfs_inode_safe_disk_i_size_write()` are executed
concurrently. The function `insert_ordered_extent_file_extent()` is not
locked when reading inode->disk_i_size, causing
`btrfs_inode_safe_disk_i_size_write()` to cause data competition when
writing inode->disk_i_size, thus affecting the value of `modify_tree`.

The specific call stack that appears during testing is as follows:

============DATA_RACE============
btrfs_drop_extents+0x89a/0xa060 [btrfs]
insert_reserved_file_extent+0xb54/0x2960 [btrfs]
insert_ordered_extent_file_extent+0xff5/0x1760 [btrfs]
btrfs_finish_one_ordered+0x1b85/0x36a0 [btrfs]
btrfs_finish_ordered_io+0x37/0x60 [btrfs]
finish_ordered_fn+0x3e/0x50 [btrfs]
btrfs_work_helper+0x9c9/0x27a0 [btrfs]
process_scheduled_works+0x716/0xf10
worker_thread+0xb6a/0x1190
kthread+0x292/0x330
ret_from_fork+0x4d/0x80
ret_from_fork_asm+0x1a/0x30
============OTHER_INFO============
btrfs_inode_safe_disk_i_size_write+0x4ec/0x600 [btrfs]
btrfs_finish_one_ordered+0x24c7/0x36a0 [btrfs]
btrfs_finish_ordered_io+0x37/0x60 [btrfs]
finish_ordered_fn+0x3e/0x50 [btrfs]
btrfs_work_helper+0x9c9/0x27a0 [btrfs]
process_scheduled_works+0x716/0xf10
worker_thread+0xb6a/0x1190
kthread+0x292/0x330
ret_from_fork+0x4d/0x80
ret_from_fork_asm+0x1a/0x30
=================================

The main purpose of the check of the inode's disk_i_size is to avoid
taking write locks on a btree path when we have a write at or beyond
EOF, since in these cases we don't expect to find extent items in the
root to drop. However if we end up taking write locks due to a data
race on disk_i_size, everything is still correct, we only add extra
lock contention on the tree in case there's concurrency from other tasks.
If the race causes us to not take write locks when we actually need them,
then everything is functionally correct as well, since if we find out we
have extent items to drop and we took read locks (modify_tree set to 0),
we release the path and retry again with write locks.

Since this data race does not affect the correctness of the function,
it is a harmless data race, use data_race() to check inode->disk_i_size.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Hao-ran Zheng <zhenghaoran154@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>

show more ...


Revision tags: v6.6.63, v6.12.1, v6.12, v6.6.62, v6.6.61, v6.6.60, v6.6.59, v6.6.58, v6.6.57, v6.6.56, v6.6.55
# 34d6f206 07-Oct-2024 Andrew Jeffery <andrew@codeconstruct.com.au>

Merge tag 'v6.6.54' into for/openbmc/dev-6.6

This is the 6.6.54 stable release


Revision tags: v6.6.54, v6.6.53, v6.6.52, v6.6.51, v6.6.50, v6.6.49
# f56a6d9c 03-Sep-2024 Filipe Manana <fdmanana@suse.com>

btrfs: fix race setting file private on concurrent lseek using same fd

[ Upstream commit 7ee85f5515e86a4e2a2f51969795920733912bad ]

When doing concurrent lseek(2) system calls against the same file

btrfs: fix race setting file private on concurrent lseek using same fd

[ Upstream commit 7ee85f5515e86a4e2a2f51969795920733912bad ]

When doing concurrent lseek(2) system calls against the same file
descriptor, using multiple threads belonging to the same process, we have
a short time window where a race happens and can result in a memory leak.

The race happens like this:

1) A program opens a file descriptor for a file and then spawns two
threads (with the pthreads library for example), lets call them
task A and task B;

2) Task A calls lseek with SEEK_DATA or SEEK_HOLE and ends up at
file.c:find_desired_extent() while holding a read lock on the inode;

3) At the start of find_desired_extent(), it extracts the file's
private_data pointer into a local variable named 'private', which has
a value of NULL;

4) Task B also calls lseek with SEEK_DATA or SEEK_HOLE, locks the inode
in shared mode and enters file.c:find_desired_extent(), where it also
extracts file->private_data into its local variable 'private', which
has a NULL value;

5) Because it saw a NULL file private, task A allocates a private
structure and assigns to the file structure;

6) Task B also saw a NULL file private so it also allocates its own file
private and then assigns it to the same file structure, since both
tasks are using the same file descriptor.

At this point we leak the private structure allocated by task A.

Besides the memory leak, there's also the detail that both tasks end up
using the same cached state record in the private structure (struct
btrfs_file_private::llseek_cached_state), which can result in a
use-after-free problem since one task can free it while the other is
still using it (only one task took a reference count on it). Also, sharing
the cached state is not a good idea since it could result in incorrect
results in the future - right now it should not be a problem because it
end ups being used only in extent-io-tree.c:count_range_bits() where we do
range validation before using the cached state.

Fix this by protecting the private assignment and check of a file while
holding the inode's spinlock and keep track of the task that allocated
the private, so that it's used only by that task in order to prevent
user-after-free issues with the cached state record as well as potentially
using it incorrectly in the future.

Fixes: 3c32c7212f16 ("btrfs: use cached state when looking for delalloc ranges with lseek")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>

show more ...


# ca2478a7 12-Sep-2024 Andrew Jeffery <andrew@codeconstruct.com.au>

Merge tag 'v6.6.51' into for/openbmc/dev-6.6

This is the 6.6.51 stable release


# 7b5595f3 29-Aug-2024 Filipe Manana <fdmanana@suse.com>

btrfs: fix race between direct IO write and fsync when using same fd

commit cd9253c23aedd61eb5ff11f37a36247cd46faf86 upstream.

If we have 2 threads that are using the same file descriptor and one o

btrfs: fix race between direct IO write and fsync when using same fd

commit cd9253c23aedd61eb5ff11f37a36247cd46faf86 upstream.

If we have 2 threads that are using the same file descriptor and one of
them is doing direct IO writes while the other is doing fsync, we have a
race where we can end up either:

1) Attempt a fsync without holding the inode's lock, triggering an
assertion failures when assertions are enabled;

2) Do an invalid memory access from the fsync task because the file private
points to memory allocated on stack by the direct IO task and it may be
used by the fsync task after the stack was destroyed.

The race happens like this:

1) A user space program opens a file descriptor with O_DIRECT;

2) The program spawns 2 threads using libpthread for example;

3) One of the threads uses the file descriptor to do direct IO writes,
while the other calls fsync using the same file descriptor.

4) Call task A the thread doing direct IO writes and task B the thread
doing fsyncs;

5) Task A does a direct IO write, and at btrfs_direct_write() sets the
file's private to an on stack allocated private with the member
'fsync_skip_inode_lock' set to true;

6) Task B enters btrfs_sync_file() and sees that there's a private
structure associated to the file which has 'fsync_skip_inode_lock' set
to true, so it skips locking the inode's VFS lock;

7) Task A completes the direct IO write, and resets the file's private to
NULL since it had no prior private and our private was stack allocated.
Then it unlocks the inode's VFS lock;

8) Task B enters btrfs_get_ordered_extents_for_logging(), then the
assertion that checks the inode's VFS lock is held fails, since task B
never locked it and task A has already unlocked it.

The stack trace produced is the following:

assertion failed: inode_is_locked(&inode->vfs_inode), in fs/btrfs/ordered-data.c:983
------------[ cut here ]------------
kernel BUG at fs/btrfs/ordered-data.c:983!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 9 PID: 5072 Comm: worker Tainted: G U OE 6.10.5-1-default #1 openSUSE Tumbleweed 69f48d427608e1c09e60ea24c6c55e2ca1b049e8
Hardware name: Acer Predator PH315-52/Covini_CFS, BIOS V1.12 07/28/2020
RIP: 0010:btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs]
Code: 50 d6 86 c0 e8 (...)
RSP: 0018:ffff9e4a03dcfc78 EFLAGS: 00010246
RAX: 0000000000000054 RBX: ffff9078a9868e98 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff907dce4a7800 RDI: ffff907dce4a7800
RBP: ffff907805518800 R08: 0000000000000000 R09: ffff9e4a03dcfb38
R10: ffff9e4a03dcfb30 R11: 0000000000000003 R12: ffff907684ae7800
R13: 0000000000000001 R14: ffff90774646b600 R15: 0000000000000000
FS: 00007f04b96006c0(0000) GS:ffff907dce480000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f32acbfc000 CR3: 00000001fd4fa005 CR4: 00000000003726f0
Call Trace:
<TASK>
? __die_body.cold+0x14/0x24
? die+0x2e/0x50
? do_trap+0xca/0x110
? do_error_trap+0x6a/0x90
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? exc_invalid_op+0x50/0x70
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? asm_exc_invalid_op+0x1a/0x20
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? btrfs_get_ordered_extents_for_logging.cold+0x1f/0x42 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
btrfs_sync_file+0x21a/0x4d0 [btrfs bb26272d49b4cdc847cf3f7faadd459b62caee9a]
? __seccomp_filter+0x31d/0x4f0
__x64_sys_fdatasync+0x4f/0x90
do_syscall_64+0x82/0x160
? do_futex+0xcb/0x190
? __x64_sys_futex+0x10e/0x1d0
? switch_fpu_return+0x4f/0xd0
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
? syscall_exit_to_user_mode+0x72/0x220
? do_syscall_64+0x8e/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e

Another problem here is if task B grabs the private pointer and then uses
it after task A has finished, since the private was allocated in the stack
of task A, it results in some invalid memory access with a hard to predict
result.

This issue, triggering the assertion, was observed with QEMU workloads by
two users in the Link tags below.

Fix this by not relying on a file's private to pass information to fsync
that it should skip locking the inode and instead pass this information
through a special value stored in current->journal_info. This is safe
because in the relevant section of the direct IO write path we are not
holding a transaction handle, so current->journal_info is NULL.

The following C program triggers the issue:

$ cat repro.c
/* Get the O_DIRECT definition. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <pthread.h>

static int fd;

static ssize_t do_write(int fd, const void *buf, size_t count, off_t offset)
{
while (count > 0) {
ssize_t ret;

ret = pwrite(fd, buf, count, offset);
if (ret < 0) {
if (errno == EINTR)
continue;
return ret;
}
count -= ret;
buf += ret;
}
return 0;
}

static void *fsync_loop(void *arg)
{
while (1) {
int ret;

ret = fsync(fd);
if (ret != 0) {
perror("Fsync failed");
exit(6);
}
}
}

int main(int argc, char *argv[])
{
long pagesize;
void *write_buf;
pthread_t fsyncer;
int ret;

if (argc != 2) {
fprintf(stderr, "Use: %s <file path>\n", argv[0]);
return 1;
}

fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0666);
if (fd == -1) {
perror("Failed to open/create file");
return 1;
}

pagesize = sysconf(_SC_PAGE_SIZE);
if (pagesize == -1) {
perror("Failed to get page size");
return 2;
}

ret = posix_memalign(&write_buf, pagesize, pagesize);
if (ret) {
perror("Failed to allocate buffer");
return 3;
}

ret = pthread_create(&fsyncer, NULL, fsync_loop, NULL);
if (ret != 0) {
fprintf(stderr, "Failed to create writer thread: %d\n", ret);
return 4;
}

while (1) {
ret = do_write(fd, write_buf, pagesize, 0);
if (ret != 0) {
perror("Write failed");
exit(5);
}
}

return 0;
}

$ mkfs.btrfs -f /dev/sdi
$ mount /dev/sdi /mnt/sdi
$ timeout 10 ./repro /mnt/sdi/foo

Usually the race is triggered within less than 1 second. A test case for
fstests will follow soon.

Reported-by: Paulo Dias <paulo.miguel.dias@gmail.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219187
Reported-by: Andreas Jahn <jahn-andi@web.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219199
Reported-by: syzbot+4704b3cc972bd76024f1@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000044ff540620d7dee2@google.com/
Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


Revision tags: v6.6.48, v6.6.47
# c005e2f6 14-Aug-2024 Andrew Jeffery <andrew@codeconstruct.com.au>

Merge tag 'v6.6.46' into for/openbmc/dev-6.6

This is the 6.6.46 stable release


Revision tags: v6.6.46, v6.6.45, v6.6.44
# 7ba27f14 02-Aug-2024 Filipe Manana <fdmanana@suse.com>

btrfs: fix double inode unlock for direct IO sync writes

commit e0391e92f9ab4fb3dbdeb139c967dcfa7ac4b115 upstream.

If we do a direct IO sync write, at btrfs_sync_file(), and we need to skip
inode l

btrfs: fix double inode unlock for direct IO sync writes

commit e0391e92f9ab4fb3dbdeb139c967dcfa7ac4b115 upstream.

If we do a direct IO sync write, at btrfs_sync_file(), and we need to skip
inode logging or we get an error starting a transaction or an error when
flushing delalloc, we end up unlocking the inode when we shouldn't under
the 'out_release_extents' label, and then unlock it again at
btrfs_direct_write().

Fix that by checking if we have to skip inode unlocking under that label.

Reported-by: syzbot+7dbbb74af6291b5a5a8b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000dfd631061eaeb4bc@google.com/
Fixes: 939b656bc8ab ("btrfs: fix corruption after buffer fault in during direct IO append write")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


Revision tags: v6.6.43
# 0a108bde 26-Jul-2024 Filipe Manana <fdmanana@suse.com>

btrfs: fix corruption after buffer fault in during direct IO append write

commit 939b656bc8ab203fdbde26ccac22bcb7f0985be5 upstream.

During an append (O_APPEND write flag) direct IO write if the inp

btrfs: fix corruption after buffer fault in during direct IO append write

commit 939b656bc8ab203fdbde26ccac22bcb7f0985be5 upstream.

During an append (O_APPEND write flag) direct IO write if the input buffer
was not previously faulted in, we can corrupt the file in a way that the
final size is unexpected and it includes an unexpected hole.

The problem happens like this:

1) We have an empty file, with size 0, for example;

2) We do an O_APPEND direct IO with a length of 4096 bytes and the input
buffer is not currently faulted in;

3) We enter btrfs_direct_write(), lock the inode and call
generic_write_checks(), which calls generic_write_checks_count(), and
that function sets the iocb position to 0 with the following code:

if (iocb->ki_flags & IOCB_APPEND)
iocb->ki_pos = i_size_read(inode);

4) We call btrfs_dio_write() and enter into iomap, which will end up
calling btrfs_dio_iomap_begin() and that calls
btrfs_get_blocks_direct_write(), where we update the i_size of the
inode to 4096 bytes;

5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access
the page of the write input buffer (at iomap_dio_bio_iter(), with a
call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets
returned to btrfs at btrfs_direct_write() via btrfs_dio_write();

6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode,
fault in the write buffer and then goto to the label 'relock';

7) We lock again the inode, do all the necessary checks again and call
again generic_write_checks(), which calls generic_write_checks_count()
again, and there we set the iocb's position to 4K, which is the current
i_size of the inode, with the following code pointed above:

if (iocb->ki_flags & IOCB_APPEND)
iocb->ki_pos = i_size_read(inode);

8) Then we go again to btrfs_dio_write() and enter iomap and the write
succeeds, but it wrote to the file range [4K, 8K), leaving a hole in
the [0, 4K) range and an i_size of 8K, which goes against the
expectations of having the data written to the range [0, 4K) and get an
i_size of 4K.

Fix this by not unlocking the inode before faulting in the input buffer,
in case we get -EFAULT or an incomplete write, and not jumping to the
'relock' label after faulting in the buffer - instead jump to a location
immediately before calling iomap, skipping all the write checks and
relocking. This solves this problem and it's fine even in case the input
buffer is memory mapped to the same file range, since only holding the
range locked in the inode's io tree can cause a deadlock, it's safe to
keep the inode lock (VFS lock), as was fixed and described in commit
51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO
reads and writes").

A sample reproducer provided by a reporter is the following:

$ cat test.c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
if (argc < 2) {
fprintf(stderr, "Usage: %s <test file>\n", argv[0]);
return 1;
}

int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT |
O_APPEND, 0644);
if (fd < 0) {
perror("creating test file");
return 1;
}

char *buf = mmap(NULL, 4096, PROT_READ,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
ssize_t ret = write(fd, buf, 4096);
if (ret < 0) {
perror("pwritev2");
return 1;
}

struct stat stbuf;
ret = fstat(fd, &stbuf);
if (ret < 0) {
perror("stat");
return 1;
}

printf("size: %llu\n", (unsigned long long)stbuf.st_size);
return stbuf.st_size == 4096 ? 0 : 1;
}

A test case for fstests will be sent soon.

Reported-by: Hanna Czenczek <hreitz@redhat.com>
Link: https://lore.kernel.org/linux-btrfs/0b841d46-12fe-4e64-9abb-871d8d0de271@redhat.com/
Fixes: 8184620ae212 ("btrfs: fix lost file sync on direct IO write with nowait and dsync iocb")
CC: stable@vger.kernel.org # 6.1+
Tested-by: Hanna Czenczek <hreitz@redhat.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


Revision tags: v6.6.42, v6.6.41, v6.6.40, v6.6.39, v6.6.38, v6.6.37, v6.6.36, v6.6.35, v6.6.34, v6.6.33, v6.6.32, v6.6.31, v6.6.30, v6.6.29, v6.6.28, v6.6.27, v6.6.26, v6.6.25, v6.6.24, v6.6.23
# e6eff205 10-Feb-2024 Andrew Jeffery <andrew@codeconstruct.com.au>

Merge tag 'v6.6.8' into dev-6.6

This is the 6.6.8 stable release


Revision tags: v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4
# 18234915 01-Dec-2023 Boris Burkov <boris@bur.io>

btrfs: fix qgroup_free_reserved_data int overflow

commit 9e65bfca24cf1d77e4a5c7a170db5867377b3fe7 upstream.

The reserved data counter and input parameter is a u64, but we
inadvertently accumulate i

btrfs: fix qgroup_free_reserved_data int overflow

commit 9e65bfca24cf1d77e4a5c7a170db5867377b3fe7 upstream.

The reserved data counter and input parameter is a u64, but we
inadvertently accumulate it in an int. Overflowing that int results in
freeing the wrong amount of data and breaking reserve accounting.

Unfortunately, this overflow rot spreads from there, as the qgroup
release/free functions rely on returning an int to take advantage of
negative values for error codes.

Therefore, the full fix is to return the "released" or "freed" amount by
a u64 argument and to return 0 or negative error code via the return
value.

Most of the call sites simply ignore the return value, though some
of them handle the error and count the returned bytes. Change all of
them accordingly.

CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


# b97d6790 13-Dec-2023 Joel Stanley <joel@jms.id.au>

Merge tag 'v6.6.6' into dev-6.6

This is the 6.6.6 stable release

Signed-off-by: Joel Stanley <joel@jms.id.au>


Revision tags: v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3
# d5e09e38 12-Sep-2023 Filipe Manana <fdmanana@suse.com>

btrfs: abort transaction on generation mismatch when marking eb as dirty

[ Upstream commit 50564b651d01c19ce732819c5b3c3fd60707188e ]

When marking an extent buffer as dirty, at btrfs_mark_buffer_di

btrfs: abort transaction on generation mismatch when marking eb as dirty

[ Upstream commit 50564b651d01c19ce732819c5b3c3fd60707188e ]

When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(),
we check if its generation matches the running transaction and if not we
just print a warning. Such mismatch is an indicator that something really
went wrong and only printing a warning message (and stack trace) is not
enough to prevent a corruption. Allowing a transaction to commit with such
an extent buffer will trigger an error if we ever try to read it from disk
due to a generation mismatch with its parent generation.

So abort the current transaction with -EUCLEAN if we notice a generation
mismatch. For this we need to pass a transaction handle to
btrfs_mark_buffer_dirty() which is always available except in test code,
in which case we can pass NULL since it operates on dummy extent buffers
and all test roots have a single node/leaf (root node at level 0).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>

show more ...


# b5cbe7c0 21-Sep-2023 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'v6.6-rc3.vfs.ctime.revert' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull finegrained timestamp reverts from Christian Brauner:
"Earlier this week we sent a few minor fixe

Merge tag 'v6.6-rc3.vfs.ctime.revert' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull finegrained timestamp reverts from Christian Brauner:
"Earlier this week we sent a few minor fixes for the multi-grained
timestamp work in [1]. While we were polishing those up after Linus
realized that there might be a nicer way to fix them we received a
regression report in [2] that fine grained timestamps break gnulib
tests and thus possibly other tools.

The kernel will elide fine-grain timestamp updates when no one is
actively querying for them to avoid performance impacts. So a sequence
like write(f1) stat(f2) write(f2) stat(f2) write(f1) stat(f1) may
result in timestamp f1 to be older than the final f2 timestamp even
though f1 was last written too but the second write didn't update the
timestamp.

Such plotholes can lead to subtle bugs when programs compare
timestamps. For example, the nap() function in [2] will estimate that
it needs to wait one ns on a fine-grain timestamp enabled filesytem
between subsequent calls to observe a timestamp change. But in general
we don't update timestamps with more than one jiffie if we think that
no one is actively querying for fine-grain timestamps to avoid
performance impacts.

While discussing various fixes the decision was to go back to the
drawing board and ultimately to explore a solution that involves only
exposing such fine-grained timestamps to nfs internally and never to
userspace.

As there are multiple solutions discussed the honest thing to do here
is not to fix this up or disable it but to cleanly revert. The general
infrastructure will probably come back but there is no reason to keep
this code in mainline.

The general changes to timestamp handling are valid and a good cleanup
that will stay. The revert is fully bisectable"

Link: https://lore.kernel.org/all/20230918-hirte-neuzugang-4c2324e7bae3@brauner [1]
Link: https://lore.kernel.org/all/bf0524debb976627693e12ad23690094e4514303.camel@linuxfromscratch.org [2]

* tag 'v6.6-rc3.vfs.ctime.revert' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
Revert "fs: add infrastructure for multigrain timestamps"
Revert "btrfs: convert to multigrain timestamps"
Revert "ext4: switch to multigrain timestamps"
Revert "xfs: switch to multigrain timestamps"
Revert "tmpfs: add support for multigrain timestamps"

show more ...


# a229cf67 20-Sep-2023 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-6.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
"A few more followup fixes to the directory listing.

People have noti

Merge tag 'for-6.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
"A few more followup fixes to the directory listing.

People have noticed different behaviour compared to other filesystems
after changes in 6.5. This is now unified to more "logical" and
expected behaviour while still within POSIX. And a few more fixes for
stable.

- change behaviour of readdir()/rewinddir() when new directory
entries are created after opendir(), properly tracking the last
entry

- fix race in readdir when multiple threads can set the last entry
index for a directory

Additionally:

- use exclusive lock when direct io might need to drop privs and call
notify_change()

- don't clear uptodate bit on page after an error, this may lead to a
deadlock in subpage mode

- fix waiting pattern when multiple readers block on Merkle tree
data, switch to folios"

* tag 'for-6.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix race between reading a directory and adding entries to it
btrfs: refresh dir last index during a rewinddir(3) call
btrfs: set last dir index to the current last index when opening dir
btrfs: don't clear uptodate on write errors
btrfs: file_remove_privs needs an exclusive lock in direct io write
btrfs: convert btrfs_read_merkle_tree_page() to use a folio

show more ...


# efd34f03 20-Sep-2023 Christian Brauner <brauner@kernel.org>

Revert "btrfs: convert to multigrain timestamps"

This reverts commit 50e9ceef1d4f644ee0049e82e360058a64ec284c.

Users reported regressions due to enabling multi-grained timestamps
unconditionally. A

Revert "btrfs: convert to multigrain timestamps"

This reverts commit 50e9ceef1d4f644ee0049e82e360058a64ec284c.

Users reported regressions due to enabling multi-grained timestamps
unconditionally. As no clear consensus on a solution has come up and the
discussion has gone back to the drawing board revert the infrastructure
changes for. If it isn't code that's here to stay, make it go away.

Message-ID: <20230920-keine-eile-c9755b5825db@brauner>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>

show more ...


Revision tags: v6.5.2
# 9af86694 06-Sep-2023 Bernd Schubert <bschubert@ddn.com>

btrfs: file_remove_privs needs an exclusive lock in direct io write

This was noticed by Miklos that file_remove_privs might call into
notify_change(), which requires to hold an exclusive lock. The p

btrfs: file_remove_privs needs an exclusive lock in direct io write

This was noticed by Miklos that file_remove_privs might call into
notify_change(), which requires to hold an exclusive lock. The problem
exists in FUSE and btrfs. We can fix it without any additional helpers
from VFS, in case the privileges would need to be dropped, change the
lock type to be exclusive and redo the loop.

Fixes: e9adabb9712e ("btrfs: use shared lock for direct writes within EOF")
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# c900529f 12-Sep-2023 Thomas Zimmermann <tzimmermann@suse.de>

Merge drm/drm-fixes into drm-misc-fixes

Forwarding to v6.6-rc1.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>


Revision tags: v6.1.51, v6.5.1
# 1ac731c5 30-Aug-2023 Dmitry Torokhov <dmitry.torokhov@gmail.com>

Merge branch 'next' into for-linus

Prepare input updates for 6.6 merge window.


Revision tags: v6.1.50
# 547635c6 28-Aug-2023 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
"No new features, the bulk of the changes are fixes, refactoring and
cle

Merge tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
"No new features, the bulk of the changes are fixes, refactoring and
cleanups. The notable fix is the scrub performance restoration after
rewrite in 6.4, though still only partial.

Fixes:

- scrub performance drop due to rewrite in 6.4 partially restored:
- do IO grouping by blg_plug/blk_unplug again
- avoid unnecessary tree searches when processing stripes, in
extent and checksum trees
- the drop is noticeable on fast PCIe devices, -66% and restored
to -33% of the original
- backports to 6.4 planned

- handle more corner cases of transaction commit during orphan
cleanup or delayed ref processing

- use correct fsid/metadata_uuid when validating super block

- copy directory permissions and time when creating a stub subvolume

Core:

- debugging feature integrity checker deprecated, to be removed in
6.7

- in zoned mode, zones are activated just before the write, making
error handling easier, now the overcommit mechanism can be enabled
again which improves performance by avoiding more frequent flushing

- v0 extent handling completely removed, deprecated long time ago

- error handling improvements

- tests:
- extent buffer bitmap tests
- pinned extent splitting tests

- cleanups and refactoring:
- compression writeback
- extent buffer bitmap
- space flushing, ENOSPC handling"

* tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (110 commits)
btrfs: zoned: skip splitting and logical rewriting on pre-alloc write
btrfs: tests: test invalid splitting when skipping pinned drop extent_map
btrfs: tests: add a test for btrfs_add_extent_mapping
btrfs: tests: add extent_map tests for dropping with odd layouts
btrfs: scrub: move write back of repaired sectors to scrub_stripe_read_repair_worker()
btrfs: scrub: don't go ordered workqueue for dev-replace
btrfs: scrub: fix grouping of read IO
btrfs: scrub: avoid unnecessary csum tree search preparing stripes
btrfs: scrub: avoid unnecessary extent tree search preparing stripes
btrfs: copy dir permission and time when creating a stub subvolume
btrfs: remove pointless empty list check when reading delayed dir indexes
btrfs: drop redundant check to use fs_devices::metadata_uuid
btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super
btrfs: use the correct superblock to compare fsid in btrfs_validate_super
btrfs: simplify memcpy either of metadata_uuid or fsid
btrfs: add a helper to read the superblock metadata_uuid
btrfs: remove v0 extent handling
btrfs: output extra debug info if we failed to find an inline backref
btrfs: move the !zoned assert into run_delalloc_cow
btrfs: consolidate the error handling in run_delalloc_nocow
...

show more ...


# 6016fc91 28-Aug-2023 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
"We've got some big changes for this release -- I'm very happy to be
landing w

Merge tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
"We've got some big changes for this release -- I'm very happy to be
landing willy's work to enable large folios for the page cache for
general read and write IOs when the fs can make contiguous space
allocations, and Ritesh's work to track sub-folio dirty state to
eliminate the write amplification problems inherent in using large
folios.

As a bonus, io_uring can now process write completions in the caller's
context instead of bouncing through a workqueue, which should reduce
io latency dramatically. IOWs, XFS should see a nice performance bump
for both IO paths.

Summary:

- Make large writes to the page cache fill sparse parts of the cache
with large folios, then use large memcpy calls for the large folio.

- Track the per-block dirty state of each large folio so that a
buffered write to a single byte on a large folio does not result in
a (potentially) multi-megabyte writeback IO.

- Allow some directio completions to be performed in the initiating
task's context instead of punting through a workqueue. This will
reduce latency for some io_uring requests"

* tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits)
iomap: support IOCB_DIO_CALLER_COMP
io_uring/rw: add write support for IOCB_DIO_CALLER_COMP
fs: add IOCB flags related to passing back dio completions
iomap: add IOMAP_DIO_INLINE_COMP
iomap: only set iocb->private for polled bio
iomap: treat a write through cache the same as FUA
iomap: use an unsigned type for IOMAP_DIO_* defines
iomap: cleanup up iomap_dio_bio_end_io()
iomap: Add per-block dirty state tracking to improve performance
iomap: Allocate ifs in ->write_begin() early
iomap: Refactor iomap_write_delalloc_punch() function out
iomap: Use iomap_punch_t typedef
iomap: Fix possible overflow condition in iomap_write_delalloc_scan
iomap: Add some uptodate state handling helpers for ifs state bitmap
iomap: Drop ifs argument from iomap_set_range_uptodate()
iomap: Rename iomap_page to iomap_folio_state and others
iomap: Copy larger chunks from userspace
iomap: Create large folios in the buffered write path
filemap: Allow __filemap_get_folio to allocate large folios
filemap: Add fgf_t typedef
...

show more ...


# 615e9583 28-Aug-2023 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs timestamp updates from Christian Brauner:
"This adds VFS support for multi-grain timestamps and converts

Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs timestamp updates from Christian Brauner:
"This adds VFS support for multi-grain timestamps and converts tmpfs,
xfs, ext4, and btrfs to use them. This carries acks from all relevant
filesystems.

The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per
jiffy, even when a file is under heavy writes.

Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache.

Even with NFSv4, a lot of exported filesystems don't properly support
a change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps
(e.g., backup applications).

If we were to always use fine-grained timestamps, that would improve
the situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

This introduces fine-grained timestamps that are used when they are
actively queried.

This uses the 31st bit of the ctime tv_nsec field to indicate that
something has queried the inode for the mtime or ctime. When this flag
is set, on the next mtime or ctime update, the kernel will fetch a
fine-grained timestamp instead of the usual coarse-grained one.

As POSIX generally mandates that when the mtime changes, the ctime
must also change the kernel always stores normalized ctime values, so
only the first 30 bits of the tv_nsec field are ever used.

Filesytems can opt into this behavior by setting the FS_MGTIME flag in
the fstype. Filesystems that don't set this flag will continue to use
coarse-grained timestamps.

Various preparatory changes, fixes and cleanups are included:

- Fixup all relevant places where POSIX requires updating ctime
together with mtime. This is a wide-range of places and all
maintainers provided necessary Acks.

- Add new accessors for inode->i_ctime directly and change all
callers to rely on them. Plain accesses to inode->i_ctime are now
gone and it is accordingly rename to inode->__i_ctime and commented
as requiring accessors.

- Extend generic_fillattr() to pass in a request mask mirroring in a
sense the statx() uapi. This allows callers to pass in a request
mask to only get a subset of attributes filled in.

- Rework timestamp updates so it's possible to drop the @now
parameter the update_time() inode operation and associated helpers.

- Add inode_update_timestamps() and convert all filesystems to it
removing a bunch of open-coding"

* tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits)
btrfs: convert to multigrain timestamps
ext4: switch to multigrain timestamps
xfs: switch to multigrain timestamps
tmpfs: add support for multigrain timestamps
fs: add infrastructure for multigrain timestamps
fs: drop the timespec64 argument from update_time
xfs: have xfs_vn_update_time gets its own timestamp
fat: make fat_update_time get its own timestamp
fat: remove i_version handling from fat_update_time
ubifs: have ubifs_update_time use inode_update_timestamps
btrfs: have it use inode_update_timestamps
fs: drop the timespec64 arg from generic_update_time
fs: pass the request_mask to generic_fillattr
fs: remove silly warning from current_time
gfs2: fix timestamp handling on quota inodes
fs: rename i_ctime field to __i_ctime
selinux: convert to ctime accessor functions
security: convert to ctime accessor functions
apparmor: convert to ctime accessor functions
sunrpc: convert to ctime accessor functions
...

show more ...


Revision tags: v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45
# 84af994b 09-Aug-2023 Ruan Jinjie <ruanjinjie@huawei.com>

btrfs: use LIST_HEAD() to initialize the list_head

Use LIST_HEAD() to initialize the list_head instead of open-coding it.

Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Reviewed-by: David Sterb

btrfs: use LIST_HEAD() to initialize the list_head

Use LIST_HEAD() to initialize the list_head instead of open-coding it.

Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


12345678910>>...127