Revision tags: v6.6.25, v6.6.24, v6.6.23 |
|
#
31f71f2d |
| 06-Feb-2024 |
Christian Brauner <brauner@kernel.org> |
fs: relax mount_setattr() permission checks
commit 46f5ab762d048dad224436978315cbc2fa79c630 upstream.
When we added mount_setattr() I added additional checks compared to the legacy do_reconfigure_m
fs: relax mount_setattr() permission checks
commit 46f5ab762d048dad224436978315cbc2fa79c630 upstream.
When we added mount_setattr() I added additional checks compared to the legacy do_reconfigure_mnt() and do_change_type() helpers used by regular mount(2). If that mount had a parent then verify that the caller and the mount namespace the mount is attached to match and if not make sure that it's an anonymous mount.
The real rootfs falls into neither category. It is neither an anoymous mount because it is obviously attached to the initial mount namespace but it also obviously doesn't have a parent mount. So that means legacy mount(2) allows changing mount properties on the real rootfs but mount_setattr(2) blocks this. I never thought much about this but of course someone on this planet of earth changes properties on the real rootfs as can be seen in [1].
Since util-linux finally switched to the new mount api in 2.39 not so long ago it also relies on mount_setattr() and that surfaced this issue when Fedora 39 finally switched to it. Fix this.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2256843 Link: https://lore.kernel.org/r/20240206-vfs-mount-rootfs-v1-1-19b335eee133@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reported-by: Karel Zak <kzak@redhat.com> Cc: stable@vger.kernel.org # v5.12+ Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
Revision tags: v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3 |
|
#
5248b445 |
| 22-Nov-2023 |
Christian Brauner <brauner@kernel.org> |
fs: indicate request originates from old mount API
[ Upstream commit f67d922edb4e95a4a56d07d5d40a76dd4f23a85b ]
We already communicate to filesystems when a remount request comes from the old mount
fs: indicate request originates from old mount API
[ Upstream commit f67d922edb4e95a4a56d07d5d40a76dd4f23a85b ]
We already communicate to filesystems when a remount request comes from the old mount API as some filesystems choose to implement different behavior in the new mount API than the old mount API to e.g., take the chance to fix significant API bugs. Allow the same for regular mount requests.
Fixes: b330966f79fb ("fuse: reject options on reconfigure via fsconfig(2)") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35 |
|
#
d7439fb1 |
| 20-Jun-2023 |
Jan Kara <jack@suse.cz> |
fs: Provide helpers for manipulating sb->s_readonly_remount
Provide helpers to set and clear sb->s_readonly_remount including appropriate memory barriers. Also use this opportunity to document what
fs: Provide helpers for manipulating sb->s_readonly_remount
Provide helpers to set and clear sb->s_readonly_remount including appropriate memory barriers. Also use this opportunity to document what the barriers pair with and why they are needed.
Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Message-Id: <20230620112832.5158-1-jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
Revision tags: v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28 |
|
#
6ac39281 |
| 03-May-2023 |
Christian Brauner <brauner@kernel.org> |
fs: allow to mount beneath top mount
Various distributions are adding or are in the process of adding support for system extensions and in the future configuration extensions through various tools.
fs: allow to mount beneath top mount
Various distributions are adding or are in the process of adding support for system extensions and in the future configuration extensions through various tools. A more detailed explanation on system and configuration extensions can be found on the manpage which is listed below at [1].
System extension images may – dynamically at runtime — extend the /usr/ and /opt/ directory hierarchies with additional files. This is particularly useful on immutable system images where a /usr/ and/or /opt/ hierarchy residing on a read-only file system shall be extended temporarily at runtime without making any persistent modifications.
When one or more system extension images are activated, their /usr/ and /opt/ hierarchies are combined via overlayfs with the same hierarchies of the host OS, and the host /usr/ and /opt/ overmounted with it ("merging"). When they are deactivated, the mount point is disassembled — again revealing the unmodified original host version of the hierarchy ("unmerging"). Merging thus makes the extension's resources suddenly appear below the /usr/ and /opt/ hierarchies as if they were included in the base OS image itself. Unmerging makes them disappear again, leaving in place only the files that were shipped with the base OS image itself.
System configuration images are similar but operate on directories containing system or service configuration.
On nearly all modern distributions mount propagation plays a crucial role and the rootfs of the OS is a shared mount in a peer group (usually with peer group id 1):
TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID / / ext4 shared:1 29 1
On such systems all services and containers run in a separate mount namespace and are pivot_root()ed into their rootfs. A separate mount namespace is almost always used as it is the minimal isolation mechanism services have. But usually they are even much more isolated up to the point where they almost become indistinguishable from containers.
Mount propagation again plays a crucial role here. The rootfs of all these services is a slave mount to the peer group of the host rootfs. This is done so the service will receive mount propagation events from the host when certain files or directories are updated.
In addition, the rootfs of each service, container, and sandbox is also a shared mount in its separate peer group:
TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID / / ext4 shared:24 master:1 71 47
For people not too familiar with mount propagation, the master:1 means that this is a slave mount to peer group 1. Which as one can see is the host rootfs as indicated by shared:1 above. The shared:24 indicates that the service rootfs is a shared mount in a separate peer group with peer group id 24.
A service may run other services. Such nested services will also have a rootfs mount that is a slave to the peer group of the outer service rootfs mount.
For containers things are just slighly different. A container's rootfs isn't a slave to the service's or host rootfs' peer group. The rootfs mount of a container is simply a shared mount in its own peer group:
TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID /home/ubuntu/debian-tree / ext4 shared:99 61 60
So whereas services are isolated OS components a container is treated like a separate world and mount propagation into it is restricted to a single well known mount that is a slave to the peer group of the shared mount /run on the host:
TARGET SOURCE FSTYPE PROPAGATION MNT_ID PARENT_ID /propagate/debian-tree /run/host/incoming tmpfs master:5 71 68
Here, the master:5 indicates that this mount is a slave to the peer group with peer group id 5. This allows to propagate mounts into the container and served as a workaround for not being able to insert mounts into mount namespaces directly. But the new mount api does support inserting mounts directly. For the interested reader the blogpost in [2] might be worth reading where I explain the old and the new approach to inserting mounts into mount namespaces.
Containers of course, can themselves be run as services. They often run full systems themselves which means they again run services and containers with the exact same propagation settings explained above.
The whole system is designed so that it can be easily updated, including all services in various fine-grained ways without having to enter every single service's mount namespace which would be prohibitively expensive. The mount propagation layout has been carefully chosen so it is possible to propagate updates for system extensions and configurations from the host into all services.
The simplest model to update the whole system is to mount on top of /usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc will then propagate into every service. This works cleanly the first time. However, when the system is updated multiple times it becomes necessary to unmount the first update on /opt, /usr, /etc and then propagate the new update. But this means, there's an interval where the old base system is accessible. This has to be avoided to protect against downgrade attacks.
The vfs already exposes a mechanism to userspace whereby mounts can be mounted beneath an existing mount. Such mounts are internally referred to as "tucked". The patch series exposes the ability to mount beneath a top mount through the new MOVE_MOUNT_BENEATH flag for the move_mount() system call. This allows userspace to seamlessly upgrade mounts. After this series the only thing that will have changed is that mounting beneath an existing mount can be done explicitly instead of just implicitly.
Today, there are two scenarios where a mount can be mounted beneath an existing mount instead of on top of it:
(1) When a service or container is started in a new mount namespace and pivot_root()s into its new rootfs. The way this is done is by mounting the new rootfs beneath the old rootfs:
fd_newroot = open("/var/lib/machines/fedora", ...); fd_oldroot = open("/", ...); fchdir(fd_newroot); pivot_root(".", ".");
After the pivot_root(".", ".") call the new rootfs is mounted beneath the old rootfs which can then be unmounted to reveal the underlying mount:
fchdir(fd_oldroot); umount2(".", MNT_DETACH);
Since pivot_root() moves the caller into a new rootfs no mounts must be propagated out of the new rootfs as a consequence of the pivot_root() call. Thus, the mounts cannot be shared.
(2) When a mount is propagated to a mount that already has another mount mounted on the same dentry.
The easiest example for this is to create a new mount namespace. The following commands will create a mount namespace where the rootfs mount / will be a slave to the peer group of the host rootfs / mount's peer group. IOW, it will receive propagation from the host:
mount --make-shared / unshare --mount --propagation=slave
Now a new mount on the /mnt dentry in that mount namespace is created. (As it can be confusing it should be spelled out that the tmpfs mount on the /mnt dentry that was just created doesn't propagate back to the host because the rootfs mount / of the mount namespace isn't a peer of the host rootfs.):
mount -t tmpfs tmpfs /mnt
TARGET SOURCE FSTYPE PROPAGATION └─/mnt tmpfs tmpfs
Now another terminal in the host mount namespace can observe that the mount indeed hasn't propagated back to into the host mount namespace. A new mount can now be created on top of the /mnt dentry with the rootfs mount / as its parent:
mount --bind /opt /mnt
TARGET SOURCE FSTYPE PROPAGATION └─/mnt /dev/sda2[/opt] ext4 shared:1
The mount namespace that was created earlier can now observe that the bind mount created on the host has propagated into it:
TARGET SOURCE FSTYPE PROPAGATION └─/mnt /dev/sda2[/opt] ext4 master:1 └─/mnt tmpfs tmpfs
But instead of having been mounted on top of the tmpfs mount at the /mnt dentry the /opt mount has been mounted on top of the rootfs mount at the /mnt dentry. And the tmpfs mount has been remounted on top of the propagated /opt mount at the /opt dentry. So in other words, the propagated mount has been mounted beneath the preexisting mount in that mount namespace.
Mount namespaces make this easy to illustrate but it's also easy to mount beneath an existing mount in the same mount namespace (The following example assumes a shared rootfs mount / with peer group id 1):
mount --bind /opt /opt
TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION └─/opt /dev/sda2[/opt] ext4 188 29 shared:1
If another mount is mounted on top of the /opt mount at the /opt dentry:
mount --bind /tmp /opt
The following clunky mount tree will result:
TARGET SOURCE FSTYPE MNT_ID PARENT_ID PROPAGATION └─/opt /dev/sda2[/tmp] ext4 405 29 shared:1 └─/opt /dev/sda2[/opt] ext4 188 405 shared:1 └─/opt /dev/sda2[/tmp] ext4 404 188 shared:1
The /tmp mount is mounted beneath the /opt mount and another copy is mounted on top of the /opt mount. This happens because the rootfs / and the /opt mount are shared mounts in the same peer group.
When the new /tmp mount is supposed to be mounted at the /opt dentry then the /tmp mount first propagates to the root mount at the /opt dentry. But there already is the /opt mount mounted at the /opt dentry. So the old /opt mount at the /opt dentry will be mounted on top of the new /tmp mount at the /tmp dentry, i.e. @opt->mnt_parent is @tmp and @opt->mnt_mountpoint is /tmp (Note that @opt->mnt_root is /opt which is what shows up as /opt under SOURCE). So again, a mount will be mounted beneath a preexisting mount.
(Fwiw, a few iterations of mount --bind /opt /opt in a loop on a shared rootfs is a good example of what could be referred to as mount explosion.)
The main point is that such mounts allows userspace to umount a top mount and reveal an underlying mount. So for example, umounting the tmpfs mount on /mnt that was created in example (1) using mount namespaces reveals the /opt mount which was mounted beneath it.
In (2) where a mount was mounted beneath the top mount in the same mount namespace unmounting the top mount would unmount both the top mount and the mount beneath. In the process the original mount would be remounted on top of the rootfs mount / at the /opt dentry again.
This again, is a result of mount propagation only this time it's umount propagation. However, this can be avoided by simply making the parent mount / of the @opt mount a private or slave mount. Then the top mount and the original mount can be unmounted to reveal the mount beneath.
These two examples are fairly arcane and are merely added to make it clear how mount propagation has effects on current and future features.
More common use-cases will just be things like:
mount -t btrfs /dev/sdA /mnt mount -t xfs /dev/sdB --beneath /mnt umount /mnt
after which we'll have updated from a btrfs filesystem to a xfs filesystem without ever revealing the underlying mountpoint.
The crux is that the proposed mechanism already exists and that it is so powerful as to cover cases where mounts are supposed to be updated with new versions. Crucially, it offers an important flexibility. Namely that updates to a system may either be forced or can be delayed and the umount of the top mount be left to a service if it is a cooperative one.
This adds a new flag to move_mount() that allows to explicitly move a beneath the top mount adhering to the following semantics:
* Mounts cannot be mounted beneath the rootfs. This restriction encompasses the rootfs but also chroots via chroot() and pivot_root(). To mount a mount beneath the rootfs or a chroot, pivot_root() can be used as illustrated above. * The source mount must be a private mount to force the kernel to allocate a new, unused peer group id. This isn't a required restriction but a voluntary one. It avoids repeating a semantical quirk that already exists today. If bind mounts which already have a peer group id are inserted into mount trees that have the same peer group id this can cause a lot of mount propagation events to be generated (For example, consider running mount --bind /opt /opt in a loop where the parent mount is a shared mount.). * Avoid getting rid of the top mount in the kernel. Cooperative services need to be able to unmount the top mount themselves. This also avoids a good deal of additional complexity. The umount would have to be propagated which would be another rather expensive operation. So namespace_lock() and lock_mount_hash() would potentially have to be held for a long time for both a mount and umount propagation. That should be avoided. * The path to mount beneath must be mounted and attached. * The top mount and its parent must be in the caller's mount namespace and the caller must be able to mount in that mount namespace. * The caller must be able to unmount the top mount to prove that they could reveal the underlying mount. * The propagation tree is calculated based on the destination mount's parent mount and the destination mount's mountpoint on the parent mount. Of course, if the parent of the destination mount and the destination mount are shared mounts in the same peer group and the mountpoint of the new mount to be mounted is a subdir of their ->mnt_root then both will receive a mount of /opt. That's probably easier to understand with an example. Assuming a standard shared rootfs /:
mount --bind /opt /opt mount --bind /tmp /opt
will cause the same mount tree as:
mount --bind /opt /opt mount --beneath /tmp /opt
because both / and /opt are shared mounts/peers in the same peer group and the /opt dentry is a subdirectory of both the parent's and the child's ->mnt_root. If a mount tree like that is created it almost always is an accident or abuse of mount propagation. Realistically what most people probably mean in this scenarios is:
mount --bind /opt /opt mount --make-private /opt mount --make-shared /opt
This forces the allocation of a new separate peer group for the /opt mount. Aferwards a mount --bind or mount --beneath actually makes sense as the / and /opt mount belong to different peer groups. Before that it's likely just confusion about what the user wanted to achieve. * Refuse MOVE_MOUNT_BENEATH if: (1) the @mnt_from has been overmounted in between path resolution and acquiring @namespace_sem when locking @mnt_to. This avoids the proliferation of shadow mounts. (2) if @to_mnt is moved to a different mountpoint while acquiring @namespace_sem to lock @to_mnt. (3) if @to_mnt is unmounted while acquiring @namespace_sem to lock @to_mnt. (4) if the parent of the target mount propagates to the target mount at the same mountpoint. This would mean mounting @mnt_from on @mnt_to->mnt_parent and then propagating a copy @c of @mnt_from onto @mnt_to. This defeats the whole purpose of mounting @mnt_from beneath @mnt_to. (5) if the parent mount @mnt_to->mnt_parent propagates to @mnt_from at the same mountpoint. If @mnt_to->mnt_parent propagates to @mnt_from this would mean propagating a copy @c of @mnt_from on top of @mnt_from. Afterwards @mnt_from would be mounted on top of @mnt_to->mnt_parent and @mnt_to would be unmounted from @mnt->mnt_parent and remounted on @mnt_from. But since @c is already mounted on @mnt_from, @mnt_to would ultimately be remounted on top of @c. Afterwards, @mnt_from would be covered by a copy @c of @mnt_from and @c would be covered by @mnt_from itself. This defeats the whole purpose of mounting @mnt_from beneath @mnt_to. Cases (1) to (3) are required as they deal with races that would cause bugs or unexpected behavior for users. Cases (4) and (5) refuse semantical quirks that would not be a bug but would cause weird mount trees to be created. While they can already be created via other means (mount --bind /opt /opt x n) there's no reason to repeat past mistakes in new features.
Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [1] Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [2] Link: https://github.com/flatcar/sysext-bakery Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1 Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2 Link: https://github.com/systemd/systemd/pull/26013
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Message-Id: <20230202-fs-move-mount-replace-v4-4-98f3d80d7eaa@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
64f44b27 |
| 03-May-2023 |
Christian Brauner <brauner@kernel.org> |
fs: use a for loop when locking a mount
Currently, lock_mount() uses a goto to retry the lookup until it succeeded in acquiring the namespace_lock() preventing the top mount from being overmounted.
fs: use a for loop when locking a mount
Currently, lock_mount() uses a goto to retry the lookup until it succeeded in acquiring the namespace_lock() preventing the top mount from being overmounted. While that's perfectly fine we want to lookup the mountpoint on the parent of the top mount in later patches. So adapt the code to make this easier to implement. Also, the for loop is arguably a little cleaner and makes the code easier to follow. No functional changes intended.
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Message-Id: <20230202-fs-move-mount-replace-v4-3-98f3d80d7eaa@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
104026c2 |
| 03-May-2023 |
Christian Brauner <brauner@kernel.org> |
fs: properly document __lookup_mnt()
The comment on top of __lookup_mnt() states that it finds the first mount implying that there could be multiple mounts mounted at the same dentry with the same p
fs: properly document __lookup_mnt()
The comment on top of __lookup_mnt() states that it finds the first mount implying that there could be multiple mounts mounted at the same dentry with the same parent.
On older kernels "shadow mounts" could be created during mount propagation. So if a mount @m in the destination propagation tree already had a child mount @p mounted at @mp then any mount @n we propagated to @m at the same @mp would be appended after the preexisting mount @p in @mount_hashtable. This was a completely direct way of creating shadow mounts.
That direct way is gone but there are still subtle ways to create shadow mounts. For example, when attaching a source mnt @mnt to a shared mount. The root of the source mnt @mnt might be overmounted by a mount @o after we finished path lookup but before we acquired the namespace semaphore to copy the source mount tree @mnt.
After we acquired the namespace lock @mnt is copied including @o covering it. After we attach @mnt to a shared mount @dest_mnt we end up propagation it to all it's peer and slaves @d. If @d already has a mount @n mounted on top of it we tuck @mnt beneath @n. This means, we mount @mnt at @d and mount @n on @mnt. Now we have both @o and @n mounted on the same mountpoint at @mnt.
Explain this in the documentation as this is pretty subtle.
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Message-Id: <20230202-fs-move-mount-replace-v4-2-98f3d80d7eaa@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
78aa08a8 |
| 03-May-2023 |
Christian Brauner <brauner@kernel.org> |
fs: add path_mounted()
Add a small helper to check whether a path refers to the root of the mount instead of open-coding this everywhere.
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.o
fs: add path_mounted()
Add a small helper to check whether a path refers to the root of the mount instead of open-coding this everywhere.
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Message-Id: <20230202-fs-move-mount-replace-v4-1-98f3d80d7eaa@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
Revision tags: v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9, v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66, v5.15.65, v5.15.64, v5.15.63, v5.15.62, v5.15.61, v5.15.60, v5.15.59, v5.19, v5.15.58, v5.15.57, v5.15.56, v5.15.55, v5.15.54, v5.15.53, v5.15.52, v5.15.51, v5.15.50, v5.15.49, v5.15.48, v5.15.47, v5.15.46, v5.15.45 |
|
#
96e85e95 |
| 05-Jun-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
build_mount_idmapped(): switch to fdget()
Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
cb2239c1 |
| 30-Mar-2023 |
Christian Brauner <brauner@kernel.org> |
fs: drop peer group ids under namespace lock
When cleaning up peer group ids in the failure path we need to make sure to hold on to the namespace lock. Otherwise another thread might just turn the m
fs: drop peer group ids under namespace lock
When cleaning up peer group ids in the failure path we need to make sure to hold on to the namespace lock. Otherwise another thread might just turn the mount from a shared into a non-shared mount concurrently.
Link: https://lore.kernel.org/lkml/00000000000088694505f8132d77@google.com Fixes: 2a1867219c7b ("fs: add mount_setattr()") Reported-by: syzbot+8ac3859139c685c4f597@syzkaller.appspotmail.com Cc: stable@vger.kernel.org # 5.12+ Message-Id: <20230330-vfs-mount_setattr-propagation-fix-v1-1-37548d91533b@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
74e60b8b |
| 14-Mar-2023 |
Andy Shevchenko <andriy.shevchenko@linux.intel.com> |
fs/namespace: fnic: Switch to use %ptTd
Use %ptTd instead of open-coded variant to print contents of time64_t type in human readable form.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.in
fs/namespace: fnic: Switch to use %ptTd
Use %ptTd instead of open-coded variant to print contents of time64_t type in human readable form.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|
#
da27f796 |
| 27-Jan-2023 |
Rik van Riel <riel@surriel.com> |
ipc,namespace: batch free ipc_namespace structures
Instead of waiting for an RCU grace period between each ipc_namespace structure that is being freed, wait an RCU grace period for every batch of ip
ipc,namespace: batch free ipc_namespace structures
Instead of waiting for an RCU grace period between each ipc_namespace structure that is being freed, wait an RCU grace period for every batch of ipc_namespace structures.
Thanks to Al Viro for the suggestion of the helper function.
This speeds up the run time of the test case that allocates ipc_namespaces in a loop from 6 minutes, to a little over 1 second:
real 0m1.192s user 0m0.038s sys 0m1.152s
Signed-off-by: Rik van Riel <riel@surriel.com> Reported-by: Chris Mason <clm@meta.com> Tested-by: Giuseppe Scrivano <gscrivan@redhat.com> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
#
3707d84c |
| 13-Jan-2023 |
Christian Brauner <brauner@kernel.org> |
fs: move mnt_idmap
Now that we converted everything to just rely on struct mnt_idmap move it all into a separate file. This ensure that no code can poke around in struct mnt_idmap without any dedica
fs: move mnt_idmap
Now that we converted everything to just rely on struct mnt_idmap move it all into a separate file. This ensure that no code can poke around in struct mnt_idmap without any dedicated helpers and makes it easier to extend it in the future. Filesystems will now not be able to conflate mount and filesystem idmappings as they are two distinct types and require distinct helpers that cannot be used interchangeably. We are now also able to extend struct mnt_idmap as we see fit.
Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|
#
e67fe633 |
| 13-Jan-2023 |
Christian Brauner <brauner@kernel.org> |
fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
Convert to struct mnt_idmap. Remove legacy file_mnt_user_ns() and mnt_user_ns().
Last cycle we merged the necessary infrastructure in 256c8aed2b42
fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
Convert to struct mnt_idmap. Remove legacy file_mnt_user_ns() and mnt_user_ns().
Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|
#
61d8e426 |
| 24-Nov-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
copy_mnt_ns(): handle a corner case (overmounted mntns bindings) saner
copy_mnt_ns() has the old tree copied, with mntns binding *and* anything bound on top of them skipped. Then it proceeds to wal
copy_mnt_ns(): handle a corner case (overmounted mntns bindings) saner
copy_mnt_ns() has the old tree copied, with mntns binding *and* anything bound on top of them skipped. Then it proceeds to walk both trees in parallel. Unfortunately, it doesn't get the "skip the stuff we'd skipped when copying" quite right. Consequences are minor (the ->mnt_root comparison will return the situation to sanity pretty soon and the worst we get is the unexpected subset of opened non-directories being switched to new namespace), but it's confusing enough and it's not hard to get the expected behaviour...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
#
256c8aed |
| 26-Oct-2022 |
Christian Brauner <brauner@kernel.org> |
fs: introduce dedicated idmap type for mounts
Last cycle we've already made the interaction with idmapped mounts more robust and type safe by introducing the vfs{g,u}id_t type. This cycle we conclud
fs: introduce dedicated idmap type for mounts
Last cycle we've already made the interaction with idmapped mounts more robust and type safe by introducing the vfs{g,u}id_t type. This cycle we concluded the conversion and removed the legacy helpers.
Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate filesystem and mount namespaces and what different roles they have to play. Especially for filesystem developers without much experience in this area this is an easy source for bugs.
Instead of passing the plain namespace we introduce a dedicated type struct mnt_idmap and replace the pointer with a pointer to a struct mnt_idmap. There are no semantic or size changes for the mount struct caused by this.
We then start converting all places aware of idmapped mounts to rely on struct mnt_idmap. Once the conversion is done all helpers down to the really low-level make_vfs{g,u}id() and from_vfs{g,u}id() will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two, removing and thus eliminating the possibility of any bugs. Fwiw, I fixed some issues in that area a while ago in ntfs3 and ksmbd in the past. Afterwards, only low-level code can ultimately use the associated namespace for any permission checks. Even most of the vfs can be ultimately completely oblivious about this and filesystems will never interact with it directly in any form in the future.
A struct mnt_idmap currently encompasses a simple refcount and a pointer to the relevant namespace the mount is idmapped to. If a mount isn't idmapped then it will point to a static nop_mnt_idmap. If it is an idmapped mount it will point to a new struct mnt_idmap. As usual there are no allocations or anything happening for non-idmapped mounts. Everthing is carefully written to be a nop for non-idmapped mounts as has always been the case.
If an idmapped mount or mount tree is created a new struct mnt_idmap is allocated and a reference taken on the relevant namespace. For each mount in a mount tree that gets idmapped or a mount that inherits the idmap when it is cloned the reference count on the associated struct mnt_idmap is bumped. Just a reminder that we only allow a mount to change it's idmapping a single time and only if it hasn't already been attached to the filesystems and has no active writers.
The actual changes are fairly straightforward. This will have huge benefits for maintenance and security in the long run even if it causes some churn. I'm aware that there's some cost for all of you. And I'll commit to doing this work and make this as painless as I can.
Note that this also makes it possible to extend struct mount_idmap in the future. For example, it would be possible to place the namespace pointer in an anonymous union together with an idmapping struct. This would allow us to expose an api to userspace that would let it specify idmappings directly instead of having to go through the detour of setting up namespaces at all.
This just adds the infrastructure and doesn't do any conversions.
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|
#
bf1ac16e |
| 16-Aug-2022 |
Seth Forshee <sforshee@digitalocean.com> |
fs: require CAP_SYS_ADMIN in target namespace for idmapped mounts
Idmapped mounts should not allow a user to map file ownsership into a range of ids which is not under the control of that user. Howe
fs: require CAP_SYS_ADMIN in target namespace for idmapped mounts
Idmapped mounts should not allow a user to map file ownsership into a range of ids which is not under the control of that user. However, we currently don't check whether the mounter is privileged wrt to the target user namespace.
Currently no FS_USERNS_MOUNT filesystems support idmapped mounts, thus this is not a problem as only CAP_SYS_ADMIN in init_user_ns is allowed to set up idmapped mounts. But this could change in the future, so add a check to refuse to create idmapped mounts when the mounter does not have CAP_SYS_ADMIN in the target user namespace.
Fixes: bd303368b776 ("fs: support mapped mounts of mapped filesystems") Signed-off-by: Seth Forshee <sforshee@digitalocean.com> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/r/20220816164752.2595240-1-sforshee@digitalocean.com Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|
#
7e4745a0 |
| 05-Jul-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
switch try_to_unlazy_next() to __legitimize_mnt()
The tricky case (__legitimize_mnt() failing after having grabbed a reference) can be trivially dealt with by leaving nd->path.mnt non-NULL, for term
switch try_to_unlazy_next() to __legitimize_mnt()
The tricky case (__legitimize_mnt() failing after having grabbed a reference) can be trivially dealt with by leaving nd->path.mnt non-NULL, for terminate_walk() to drop it.
legitimize_mnt() becomes static after that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
Revision tags: v5.15.44, v5.15.43, v5.15.42, v5.18, v5.15.41, v5.15.40, v5.15.39, v5.15.38, v5.15.37, v5.15.36, v5.15.35, v5.15.34, v5.15.33, v5.15.32, v5.15.31, v5.17, v5.15.30, v5.15.29, v5.15.28, v5.15.27, v5.15.26 |
|
#
a5f85d78 |
| 28-Feb-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
uninline may_mount() and don't opencode it in fspick(2)/fsopen(2)
It's done once per (mount-related) syscall and there's no point whatsoever making it inline.
Signed-off-by: Al Viro <viro@zeniv.lin
uninline may_mount() and don't opencode it in fspick(2)/fsopen(2)
It's done once per (mount-related) syscall and there's no point whatsoever making it inline.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
#
e1bbcd27 |
| 10-May-2022 |
Christian Brauner <christian.brauner@ubuntu.com> |
fs: hold writers when changing mount's idmapping
Hold writers when changing a mount's idmapping to make it more robust.
The vfs layer takes care to retrieve the idmapping of a mount once ensuring t
fs: hold writers when changing mount's idmapping
Hold writers when changing a mount's idmapping to make it more robust.
The vfs layer takes care to retrieve the idmapping of a mount once ensuring that the idmapping used for vfs permission checking is identical to the idmapping passed down to the filesystem.
For ioctl codepaths the filesystem itself is responsible for taking the idmapping into account if they need to. While all filesystems with FS_ALLOW_IDMAP raised take the same precautions as the vfs we should enforce it explicitly by making sure there are no active writers on the relevant mount while changing the idmapping.
This is similar to turning a mount ro with the difference that in contrast to turning a mount ro changing the idmapping can only ever be done once while a mount can transition between ro and rw as much as it wants.
This is a minor user-visible change. But it is extremely unlikely to matter. The caller must've created a detached mount via OPEN_TREE_CLONE and then handed that O_PATH fd to another process or thread which then must've gotten a writable fd for that mount and started creating files in there while the caller is still changing mount properties. While not impossible it will be an extremely rare corner-case and should in general be considered a bug in the application. Consider making a mount MOUNT_ATTR_NOEXEC or MOUNT_ATTR_NODEV while allowing someone else to perform lookups or exec'ing in parallel by handing them a copy of the OPEN_TREE_CLONE fd or another fd beneath that mount.
Link: https://lore.kernel.org/r/20220510095840.152264-1-brauner@kernel.org Cc: Seth Forshee <seth.forshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|
#
0014edae |
| 20-Apr-2022 |
Christian Brauner <brauner@kernel.org> |
fs: unset MNT_WRITE_HOLD on failure
After mnt_hold_writers() has been called we will always have set MNT_WRITE_HOLD and consequently we always need to pair mnt_hold_writers() with mnt_unhold_writers
fs: unset MNT_WRITE_HOLD on failure
After mnt_hold_writers() has been called we will always have set MNT_WRITE_HOLD and consequently we always need to pair mnt_hold_writers() with mnt_unhold_writers(). After the recent cleanup in [1] where Al switched from a do-while to a for loop the cleanup currently fails to unset MNT_WRITE_HOLD for the first mount that was changed. Fix this and make sure that the first mount will be cleaned up and add some comments to make it more obvious.
Link: https://lore.kernel.org/lkml/0000000000007cc21d05dd0432b8@google.com Link: https://lore.kernel.org/lkml/00000000000080e10e05dd043247@google.com Link: https://lore.kernel.org/r/20220420131925.2464685-1-brauner@kernel.org Fixes: e257039f0fc7 ("mount_setattr(): clean the control flow and calling conventions") [1] Cc: Hillf Danton <hdanton@sina.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Reported-by: syzbot+10a16d1c43580983f6a2@syzkaller.appspotmail.com Reported-by: syzbot+306090cfa3294f0bbfb3@syzkaller.appspotmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|
#
a128b054 |
| 22-Mar-2022 |
Anthony Iliopoulos <ailiop@suse.com> |
mount: warn only once about timestamp range expiration
Commit f8b92ba67c5d ("mount: Add mount warning for impending timestamp expiry") introduced a mount warning regarding filesystem timestamp limit
mount: warn only once about timestamp range expiration
Commit f8b92ba67c5d ("mount: Add mount warning for impending timestamp expiry") introduced a mount warning regarding filesystem timestamp limits, that is printed upon each writable mount or remount.
This can result in a lot of unnecessary messages in the kernel log in setups where filesystems are being frequently remounted (or mounted multiple times).
Avoid this by setting a superblock flag which indicates that the warning has been emitted at least once for any particular mount, as suggested in [1].
Link: https://lore.kernel.org/CAHk-=wim6VGnxQmjfK_tDg6fbHYKL4EFkmnTjVr9QnRqjDBAeA@mail.gmail.com/ [1] Link: https://lkml.kernel.org/r/20220119202934.26495-1-ailiop@suse.com Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
e257039f |
| 28-Feb-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
mount_setattr(): clean the control flow and calling conventions
separate the "cleanup" and "apply" codepaths (they have almost no overlap), fold the "cleanup" into "prepare" (which eliminates the ne
mount_setattr(): clean the control flow and calling conventions
separate the "cleanup" and "apply" codepaths (they have almost no overlap), fold the "cleanup" into "prepare" (which eliminates the need of ->revert) and make loops more idiomatic.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
Revision tags: v5.15.25, v5.15.24, v5.15.23, v5.15.22, v5.15.21, v5.15.20 |
|
#
87bb5b60 |
| 03-Feb-2022 |
Christian Brauner <brauner@kernel.org> |
fs: clean up mount_setattr control flow
Simplify the control flow of mount_setattr_{prepare,commit} so they become easier to follow. We kept using both an integer error variable that was passed by p
fs: clean up mount_setattr control flow
Simplify the control flow of mount_setattr_{prepare,commit} so they become easier to follow. We kept using both an integer error variable that was passed by pointer as well as a pointer as an indicator for whether or not we need to revert or commit the prepared changes. Simplify this and just use the pointer. If we successfully changed properties the revert pointer will be NULL and if we failed to change properties it will indicate where we failed and thus need to stop reverting.
Link: https://lore.kernel.org/r/20220203131411.3093040-8-brauner@kernel.org Cc: Seth Forshee <seth.forshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
ad1844a0 |
| 03-Feb-2022 |
Christian Brauner <brauner@kernel.org> |
fs: don't open-code mnt_hold_writers()
Remove sb_prepare_remount_readonly()'s open-coded mnt_hold_writers() implementation with the real helper we introduced in commit fbdc2f6c40f6 ("fs: split out f
fs: don't open-code mnt_hold_writers()
Remove sb_prepare_remount_readonly()'s open-coded mnt_hold_writers() implementation with the real helper we introduced in commit fbdc2f6c40f6 ("fs: split out functions to hold writers").
Link: https://lore.kernel.org/r/20220203131411.3093040-7-brauner@kernel.org Cc: Seth Forshee <seth.forshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
#
03b6abee |
| 03-Feb-2022 |
Christian Brauner <brauner@kernel.org> |
fs: simplify check in mount_setattr_commit()
In order to determine whether we need to call mnt_unhold_writers() in mount_setattr_commit() we currently do not just check whether MNT_WRITE_HOLD is set
fs: simplify check in mount_setattr_commit()
In order to determine whether we need to call mnt_unhold_writers() in mount_setattr_commit() we currently do not just check whether MNT_WRITE_HOLD is set but also if a read-only mount was requested.
However, checking whether MNT_WRITE_HOLD is set is enough. Setting MNT_WRITE_HOLD requires lock_mount_hash() to be held and it must be unset before calling unlock_mount_hash(). This guarantees that if we see MNT_WRITE_HOLD we know that we were the ones who set it earlier. We don't need to care about why we set it. Plus, leaving this additional read-only check in makes the code more confusing because it implies that MNT_WRITE_HOLD could've been set by another thread when it really can't. Remove it and update the associated comment.
Link: https://lore.kernel.org/r/20220203131411.3093040-6-brauner@kernel.org Cc: Seth Forshee <seth.forshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|