Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32 |
|
#
f9010dbd |
| 01-Jun-2023 |
Mike Christie <michael.christie@oracle.com> |
fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
When switching from kthreads to vhost_tasks two bugs were added: 1. The vhost worker tasks's now show up as processes so scripts doing ps o
fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
When switching from kthreads to vhost_tasks two bugs were added: 1. The vhost worker tasks's now show up as processes so scripts doing ps or ps a would not incorrectly detect the vhost task as another process. 2. kthreads disabled freeze by setting PF_NOFREEZE, but vhost tasks's didn't disable or add support for them.
To fix both bugs, this switches the vhost task to be thread in the process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that SIGKILL/STOP support is required because CLONE_THREAD requires CLONE_SIGHAND which requires those 2 signals to be supported.
This is a modified version of the patch written by Mike Christie <michael.christie@oracle.com> which was a modified version of patch originally written by Linus.
Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER. Including ignoring signals, setting up the register state, and having get_signal return instead of calling do_group_exit.
Tidied up the vhost_task abstraction so that the definition of vhost_task only needs to be visible inside of vhost_task.c. Making it easier to review the code and tell what needs to be done where. As part of this the main loop has been moved from vhost_worker into vhost_task_fn. vhost_worker now returns true if work was done.
The main loop has been updated to call get_signal which handles SIGSTOP, freezing, and collects the message that tells the thread to exit as part of process exit. This collection clears __fatal_signal_pending. This collection is not guaranteed to clear signal_pending() so clear that explicitly so the schedule() sleeps.
For now the vhost thread continues to exist and run work until the last file descriptor is closed and the release function is called as part of freeing struct file. To avoid hangs in the coredump rendezvous and when killing threads in a multi-threaded exec. The coredump code and de_thread have been modified to ignore vhost threads.
Remvoing the special case for exec appears to require teaching vhost_dev_flush how to directly complete transactions in case the vhost thread is no longer running.
Removing the special case for coredump rendezvous requires either the above fix needed for exec or moving the coredump rendezvous into get_signal.
Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Co-developed-by: Mike Christie <michael.christie@oracle.com> Signed-off-by: Mike Christie <michael.christie@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
Revision tags: v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3 |
|
#
88e46070 |
| 20-Apr-2023 |
Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> |
coredump: require O_WRONLY instead of O_RDWR
The motivation for this patch has been to enable using a stricter apparmor profile to prevent programs from reading any coredump in the system.
However,
coredump: require O_WRONLY instead of O_RDWR
The motivation for this patch has been to enable using a stricter apparmor profile to prevent programs from reading any coredump in the system.
However, this became something else. The following details are based on Christian's and Linus' archeology into the history of the number "2" in the coredump handling code.
To make sure we're not accidently introducing some subtle behavioral change into the coredump code we set out on a voyage into the depths of history.git to figure out why this was O_RDWR in the first place.
Coredump handling was introduced over 30 years ago in commit ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)"). The original code used O_WRONLY:
open_namei("core",O_CREAT | O_WRONLY | O_TRUNC,0600,&inode,NULL)
However, this changed in 1993 and starting with commit 9cb9f18b5d26 ("[PATCH] Linux-0.99.10 (June 7, 1993)") the coredump code suddenly used the constant "2":
open_namei("core",O_CREAT | 2 | O_TRUNC,0600,&inode,NULL)
This was curious as in the same commit the kernel switched from constants to proper defines in other places such as KERNEL_DS and USER_DS and O_RDWR did already exist.
So why was "2" used? It turns out that open_namei() - an early version of what later turned into filp_open() - didn't accept O_RDWR.
A semantic quirk of the open() uapi is the definition of the O_RDONLY flag. It would seem natural to define:
#define O_RDWR (O_RDONLY | O_WRONLY)
but that isn't possible because:
#define O_RDONLY 0
This makes O_RDONLY effectively meaningless when passed to the kernel. In other words, there has never been a way - until O_PATH at least - to open a file without any permission; O_RDONLY was always implied on the uapi side while the kernel does in fact allow opening files without permissions.
The trouble comes when trying to map the uapi flags onto the corresponding file mode flags FMODE_{READ,WRITE}. This mapping still happens today and is causing issues to this day (We ran into this during additions for openat2() for example.).
So the special value "3" was used to indicate that the file was opened for special access:
f->f_flags = flag = flags; f->f_mode = (flag+1) & O_ACCMODE; if (f->f_mode) flag++;
This allowed the file mode to be set to FMODE_READ | FMODE_WRITE mapping the O_{RDONLY,WRONLY,RDWR} flags into the FMODE_{READ,WRITE} flags. The special access then required read-write permissions and 0 was used to access symlinks.
But back when ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") added coredump handling open_namei() took the FMODE_{READ,WRITE} flags as an argument. So the coredump handling introduced in ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") was buggy because O_WRONLY shouldn't have been passed. Since O_WRONLY is 1 but open_namei() took FMODE_{READ,WRITE} it was passed FMODE_READ on accident.
So 9cb9f18b5d26 ("[PATCH] Linux-0.99.10 (June 7, 1993)") was a bugfix for this and the 2 didn't really mean O_RDWR, it meant FMODE_WRITE which was correct.
The clue is that FMODE_{READ,WRITE} didn't exist yet and thus a raw "2" value was passed.
Fast forward 5 years when around 2.2.4pre4 (February 16, 1999) this code was changed to:
- dentry = open_namei(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600); ... + file = filp_open(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);
At this point the raw "2" should have become O_WRONLY again as filp_open() didn't take FMODE_{READ,WRITE} but O_{RDONLY,WRONLY,RDWR}.
Another 17 years later, the code was changed again cementing the mistake and making it almost impossible to detect when commit 378c6520e7d2 ("fs/coredump: prevent fsuid=0 dumps into user-controlled directories") replaced the raw "2" with O_RDWR.
And now, here we are with this patch that sent us on a quest to answer the big questions in life such as "Why are coredump files opened with O_RDWR?" and "Is it safe to just use O_WRONLY?".
So with this commit we're reintroducing O_WRONLY again and bringing this code back to its original state when it was first introduced in commit ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") over 30 years ago.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Message-Id: <20230420120409.602576-1-vsementsov@yandex-team.ru> [brauner@kernel.org: completely rewritten commit message] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
show more ...
|
Revision tags: v6.1.25 |
|
#
245f0922 |
| 16-Apr-2023 |
Kefeng Wang <wangkefeng.wang@huawei.com> |
mm: hwpoison: coredump: support recovery from dump_user_range()
dump_user_range() is used to copy the user page to a coredump file, but if a hardware memory error occurred during copy, which called
mm: hwpoison: coredump: support recovery from dump_user_range()
dump_user_range() is used to copy the user page to a coredump file, but if a hardware memory error occurred during copy, which called from __kernel_write_iter() in dump_user_range(), it crashes,
CPU: 112 PID: 7014 Comm: mca-recover Not tainted 6.3.0-rc2 #425
pc : __memcpy+0x110/0x260 lr : _copy_from_iter+0x3bc/0x4c8 ... Call trace: __memcpy+0x110/0x260 copy_page_from_iter+0xcc/0x130 pipe_write+0x164/0x6d8 __kernel_write_iter+0x9c/0x210 dump_user_range+0xc8/0x1d8 elf_core_dump+0x308/0x368 do_coredump+0x2e8/0xa40 get_signal+0x59c/0x788 do_signal+0x118/0x1f8 do_notify_resume+0xf0/0x280 el0_da+0x130/0x138 el0t_64_sync_handler+0x68/0xc0 el0t_64_sync+0x188/0x190
Generally, the '->write_iter' of file ops will use copy_page_from_iter() and copy_page_from_iter_atomic(), change memcpy() to copy_mc_to_kernel() in both of them to handle #MC during source read, which stop coredump processing and kill the task instead of kernel panic, but the source address may not always a user address, so introduce a new copy_mc flag in struct iov_iter{} to indicate that the iter could do a safe memory copy, also introduce the helpers to set/cleck the flag, for now, it's only used in coredump's dump_user_range(), but it could expand to any other scenarios to fix the similar issue.
Link: https://lkml.kernel.org/r/20230417045323.11054-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Tong Tiangen <tongtiangen@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
Revision tags: v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9, v6.1.8 |
|
#
e552cdb8 |
| 20-Jan-2023 |
Liam R. Howlett <Liam.Howlett@Oracle.com> |
coredump: convert to vma iterator
Use the vma iterator so that the iterator can be invalidated or updated to avoid each caller doing so.
Link: https://lkml.kernel.org/r/20230120162650.984577-20-Lia
coredump: convert to vma iterator
Use the vma iterator so that the iterator can be invalidated or updated to avoid each caller doing so.
Link: https://lkml.kernel.org/r/20230120162650.984577-20-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
#
cd598003 |
| 03-Feb-2023 |
Christoph Hellwig <hch@lst.de> |
coredump: use bvec_set_page to initialize a bvec
Use the bvec_set_page helper to initialize a bvec.
Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230203150634.3199
coredump: use bvec_set_page to initialize a bvec
Use the bvec_set_page helper to initialize a bvec.
Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20230203150634.3199647-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
show more ...
|
Revision tags: v6.1.7, v6.1.6 |
|
#
e67fe633 |
| 13-Jan-2023 |
Christian Brauner <brauner@kernel.org> |
fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
Convert to struct mnt_idmap. Remove legacy file_mnt_user_ns() and mnt_user_ns().
Last cycle we merged the necessary infrastructure in 256c8aed2b42
fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap
Convert to struct mnt_idmap. Remove legacy file_mnt_user_ns() and mnt_user_ns().
Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|
#
abf08576 |
| 13-Jan-2023 |
Christian Brauner <brauner@kernel.org> |
fs: port vfs_*() helpers to struct mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This i
fs: port vfs_*() helpers to struct mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in 256c8aed2b42 ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap.
Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs.
Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap.
Acked-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|
Revision tags: v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72 |
|
#
9c7417b5 |
| 03-Oct-2022 |
Geert Uytterhoeven <geert@linux-m68k.org> |
coredump: Move dump_emit_page() to kill unused warning
If CONFIG_ELF_CORE is not set:
fs/coredump.c:835:12: error: ‘dump_emit_page’ defined but not used [-Werror=unused-function] 835 | st
coredump: Move dump_emit_page() to kill unused warning
If CONFIG_ELF_CORE is not set:
fs/coredump.c:835:12: error: ‘dump_emit_page’ defined but not used [-Werror=unused-function] 835 | static int dump_emit_page(struct coredump_params *cprm, struct page *page) | ^~~~~~~~~~~~~~
Fix this by moving dump_emit_page() inside the existing section protected by #ifdef CONFIG_ELF_CORE.
Fixes: 06bbaa6dc53cb720 ("[coredump] don't use __kernel_write() on kmap_local_page()") Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
Revision tags: v6.0, v5.15.71, v5.15.70, v5.15.69 |
|
#
de4eda9d |
| 15-Sep-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
use less confusing names for iov_iter direction initializers
READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with wri
use less confusing names for iov_iter direction initializers
READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way.
Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
Revision tags: v5.15.68, v5.15.67, v5.15.66, v5.15.65 |
|
#
8603b6f5 |
| 03-Sep-2022 |
Oleksandr Natalenko <oleksandr@redhat.com> |
core_pattern: add CPU specifier
Statistically, in a large deployment regular segfaults may indicate a CPU issue.
Currently, it is not possible to find out what CPU the segfault happened on. There
core_pattern: add CPU specifier
Statistically, in a large deployment regular segfaults may indicate a CPU issue.
Currently, it is not possible to find out what CPU the segfault happened on. There are at least two attempts to improve segfault logging with this regard, but they do not help in case the logs rotate.
Hence, lets make sure it is possible to permanently record a CPU the task ran on using a new core_pattern specifier.
Link: https://lkml.kernel.org/r/20220903064330.20772-1-oleksandr@redhat.com Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com> Suggested-by: Renaud Métrich <rmetrich@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Grzegorz Halat <ghalat@redhat.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Joel Savitz <jsavitz@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <keescook@chromium.org> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Rob Herring <robh@kernel.org> Cc: Stephen Kitt <steve@sk2.org> Cc: Will Deacon <will@kernel.org> Cc: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
#
6dd142d9 |
| 20-Sep-2022 |
Kees Cook <keescook@chromium.org> |
coredump: Proactively round up to kmalloc bucket size
Instead of discovering the kmalloc bucket size _after_ allocation, round up proactively so the allocation is explicitly made for the full size,
coredump: Proactively round up to kmalloc bucket size
Instead of discovering the kmalloc bucket size _after_ allocation, round up proactively so the allocation is explicitly made for the full size, allowing the compiler to correctly reason about the resulting size of the buffer through the existing __alloc_size() hint.
Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
show more ...
|
Revision tags: v5.15.64, v5.15.63, v5.15.62, v5.15.61, v5.15.60, v5.15.59, v5.19, v5.15.58, v5.15.57, v5.15.56, v5.15.55, v5.15.54, v5.15.53, v5.15.52, v5.15.51, v5.15.50 |
|
#
a2bd096f |
| 22-Jun-2022 |
Christian Brauner <brauner@kernel.org> |
fs: use type safe idmapping helpers
We already ported most parts and filesystems over for v6.0 to the new vfs{g,u}id_t type and associated helpers for v6.0. Convert the remaining places so we can re
fs: use type safe idmapping helpers
We already ported most parts and filesystems over for v6.0 to the new vfs{g,u}id_t type and associated helpers for v6.0. Convert the remaining places so we can remove all the old helpers. This is a non-functional change.
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
show more ...
|
Revision tags: v5.15.49, v5.15.48, v5.15.47, v5.15.46, v5.15.45, v5.15.44, v5.15.43, v5.15.42, v5.18, v5.15.41, v5.15.40, v5.15.39, v5.15.38, v5.15.37, v5.15.36, v5.15.35, v5.15.34, v5.15.33, v5.15.32, v5.15.31, v5.17, v5.15.30, v5.15.29, v5.15.28, v5.15.27, v5.15.26, v5.15.25, v5.15.24, v5.15.23, v5.15.22, v5.15.21, v5.15.20, v5.15.19, v5.15.18, v5.15.17, v5.4.173, v5.15.16, v5.15.15, v5.16, v5.15.10, v5.15.9, v5.15.8, v5.15.7, v5.15.6, v5.15.5, v5.15.4, v5.15.3, v5.15.2, v5.15.1, v5.15, v5.14.14, v5.14.13, v5.14.12, v5.14.11, v5.14.10, v5.14.9, v5.14.8, v5.14.7, v5.14.6, v5.10.67, v5.10.66, v5.14.5, v5.14.4, v5.10.65, v5.14.3, v5.10.64, v5.14.2, v5.10.63, v5.14.1, v5.10.62, v5.14, v5.10.61, v5.10.60, v5.10.53, v5.10.52, v5.10.51, v5.10.50, v5.10.49, v5.13, v5.10.46, v5.10.43, v5.10.42, v5.10.41, v5.10.40, v5.10.39, v5.4.119, v5.10.36, v5.10.35, v5.10.34, v5.4.116, v5.10.33, v5.12, v5.10.32, v5.10.31, v5.10.30, v5.10.27, v5.10.26, v5.10.25, v5.10.24, v5.10.23, v5.10.22, v5.10.21, v5.10.20, v5.10.19, v5.4.101, v5.10.18, v5.10.17, v5.11, v5.10.16, v5.10.15, v5.10.14, v5.10, v5.8.17, v5.8.16, v5.8.15, v5.9, v5.8.14, v5.8.13, v5.8.12, v5.8.11, v5.8.10, v5.8.9, v5.8.8, v5.8.7, v5.8.6, v5.4.62, v5.8.5, v5.8.4, v5.4.61, v5.8.3, v5.4.60, v5.8.2, v5.4.59, v5.8.1, v5.4.58, v5.4.57, v5.4.56, v5.8, v5.7.12, v5.4.55, v5.7.11, v5.4.54, v5.7.10, v5.4.53, v5.4.52, v5.7.9, v5.7.8, v5.4.51, v5.4.50, v5.7.7, v5.4.49, v5.7.6, v5.7.5, v5.4.48, v5.7.4, v5.7.3, v5.4.47, v5.4.46, v5.7.2 |
|
#
9a938eba |
| 08-Jun-2020 |
Al Viro <viro@zeniv.linux.org.uk> |
kill coredump_params->regs
it's always task_pt_regs(current)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
#
6a542d1d |
| 08-Jun-2020 |
Al Viro <viro@zeniv.linux.org.uk> |
kill signal_pt_regs()
Once upon at it was used on hot paths, but that had not been true since 2013. IOW, there's no point for arch-optimized equivalent of task_pt_regs(current) - remaining two user
kill signal_pt_regs()
Once upon at it was used on hot paths, but that had not been true since 2013. IOW, there's no point for arch-optimized equivalent of task_pt_regs(current) - remaining two users are not worth bothering with.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
#
4f526fef |
| 03-Oct-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
[brown paperbag] fix coredump breakage
Let me count the ways in which I'd screwed up:
* when emitting a page, handling of gaps in coredump should happen before fetching the current file position. *
[brown paperbag] fix coredump breakage
Let me count the ways in which I'd screwed up:
* when emitting a page, handling of gaps in coredump should happen before fetching the current file position. * fix for a problem that occurs on rather uncommon setups (and hadn't been observed in the wild) had been sent very late in the cycle. * ... with badly insufficient testing, introducing an easily reproducible breakage. Without giving it time to soak in -next.
Fucked-up-by: Al Viro <viro@zeniv.linux.org.uk> Reported-by: "J. R. Okajima" <hooanon05g@gmail.com> Tested-by: "J. R. Okajima" <hooanon05g@gmail.com> Fixes: 06bbaa6dc53c "[coredump] don't use __kernel_write() on kmap_local_page()" Cc: stable@kernel.org # v6.0-only Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
#
06bbaa6d |
| 26-Sep-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
[coredump] don't use __kernel_write() on kmap_local_page()
passing kmap_local_page() result to __kernel_write() is unsafe - random ->write_iter() might (and 9p one does) get unhappy when passed ITER
[coredump] don't use __kernel_write() on kmap_local_page()
passing kmap_local_page() result to __kernel_write() is unsafe - random ->write_iter() might (and 9p one does) get unhappy when passed ITER_KVEC with pointer that came from kmap_local_page().
Fix by providing a variant of __kernel_write() that takes an iov_iter from caller (__kernel_write() becomes a trivial wrapper) and adding dump_emit_page() that parallels dump_emit(), except that instead of __kernel_write() it uses __kernel_write_iter() with ITER_BVEC source.
Fixes: 3159ed57792b "fs/coredump: use kmap_local_page()" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
#
182ea1d7 |
| 06-Sep-2022 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
coredump: remove vma linked list walk
Use the Maple Tree iterator instead. This is too complicated for the VMA iterator to handle, so let's open-code it for now. If this turns out to be a common p
coredump: remove vma linked list walk
Use the Maple Tree iterator instead. This is too complicated for the VMA iterator to handle, so let's open-code it for now. If this turns out to be a common pattern, we can migrate it to common code.
Link: https://lkml.kernel.org/r/20220906194824.2110408-41-Liam.Howlett@oracle.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: SeongJae Park <sj@kernel.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
show more ...
|
#
f5d39b02 |
| 22-Aug-2022 |
Peter Zijlstra <peterz@infradead.org> |
freezer,sched: Rewrite core freezer logic
Rewrite the core freezer to behave better wrt thawing and be simpler in general.
By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensu
freezer,sched: Rewrite core freezer logic
Rewrite the core freezer to behave better wrt thawing and be simpler in general.
By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensured frozen tasks stay frozen until thawed and don't randomly wake up early, as is currently possible.
As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up two PF_flags (yay!).
Specifically; the current scheme works a little like:
freezer_do_not_count(); schedule(); freezer_count();
And either the task is blocked, or it lands in try_to_freezer() through freezer_count(). Now, when it is blocked, the freezer considers it frozen and continues.
However, on thawing, once pm_freezing is cleared, freezer_count() stops working, and any random/spurious wakeup will let a task run before its time.
That is, thawing tries to thaw things in explicit order; kernel threads and workqueues before doing bringing SMP back before userspace etc.. However due to the above mentioned races it is entirely possible for userspace tasks to thaw (by accident) before SMP is back.
This can be a fatal problem in asymmetric ISA architectures (eg ARMv9) where the userspace task requires a special CPU to run.
As said; replace this with a special task state TASK_FROZEN and add the following state transitions:
TASK_FREEZABLE -> TASK_FROZEN __TASK_STOPPED -> TASK_FROZEN __TASK_TRACED -> TASK_FROZEN
The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state is already required to deal with spurious wakeups and the freezer causes one such when thawing the task (since the original state is lost).
The special __TASK_{STOPPED,TRACED} states *can* be restored since their canonical state is in ->jobctl.
With this, frozen tasks need an explicit TASK_FROZEN wakeup and are free of undue (early / spurious) wakeups.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org
show more ...
|
#
f9fc8cad |
| 06-Sep-2022 |
Peter Zijlstra <peterz@infradead.org> |
sched: Add TASK_ANY for wait_task_inactive()
Now that wait_task_inactive()'s @match_state argument is a mask (like ttwu()) it is possible to replace the special !match_state case with an 'all-states
sched: Add TASK_ANY for wait_task_inactive()
Now that wait_task_inactive()'s @match_state argument is a mask (like ttwu()) it is possible to replace the special !match_state case with an 'all-states' value such that any blocked state will match.
Suggested-by: Ingo Molnar (mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/YxhkzfuFTvRnpUaH@hirez.programming.kicks-ass.net
show more ...
|
#
9a95f78e |
| 08-Jan-2022 |
Eric W. Biederman <ebiederm@xmission.com> |
signal: Drop signals received after a fatal signal has been processed
In 403bad72b67d ("coredump: only SIGKILL should interrupt the coredumping task") Oleg modified the kernel to drop all signals th
signal: Drop signals received after a fatal signal has been processed
In 403bad72b67d ("coredump: only SIGKILL should interrupt the coredumping task") Oleg modified the kernel to drop all signals that come in during a coredump except SIGKILL, and suggested that it might be a good idea to generalize that to other cases after the process has received a fatal signal.
Semantically it does not make sense to perform any signal delivery after the process has already been killed.
When a signal is sent while a process is dying today the signal is placed in the signal queue by __send_signal and a single task of the process is woken up with signal_wake_up, if there are any tasks that have not set PF_EXITING.
Take things one step farther and have prepare_signal report that all signals that come after a process has been killed should be ignored. While retaining the historical exception of allowing SIGKILL to interrupt coredumps.
Update the comment in fs/coredump.c to make it clear coredumps are special in being able to receive SIGKILL.
This changes things so that a process stopped in PTRACE_EVENT_EXIT can not be made to escape it's ptracer and finish exiting by sending it SIGKILL. That a process can be made to leave PTRACE_EVENT_EXIT and escape it's tracer by sending the process a SIGKILL has been complicating tracer's for no apparent advantage. If the process needs to be made to leave PTRACE_EVENT_EXIT all that needs to happen is to kill the proceses's tracer. This differs from the coredump code where there is no other mechanism besides honoring SIGKILL to expedite the end of coredumping.
Link: https://lkml.kernel.org/r/875yksd4s9.fsf_-_@email.froward.int.ebiederm.org Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
show more ...
|
#
4e3299ea |
| 29-Jun-2022 |
Jason A. Donenfeld <Jason@zx2c4.com> |
fs: do not compare against ->llseek
Now vfs_llseek() can simply check for FMODE_LSEEK; if it's set, we know that ->llseek() won't be NULL and if it's not we should just fail with -ESPIPE.
A couple
fs: do not compare against ->llseek
Now vfs_llseek() can simply check for FMODE_LSEEK; if it's set, we know that ->llseek() won't be NULL and if it's not we should just fail with -ESPIPE.
A couple of other places where we used to check for special values of ->llseek() (somewhat inconsistently) switched to checking FMODE_LSEEK.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
#
355f841a |
| 09-Feb-2022 |
Eric W. Biederman <ebiederm@xmission.com> |
tracehook: Remove tracehook.h
Now that all of the definitions have moved out of tracehook.h into ptrace.h, sched/signal.h, resume_user_mode.h there is nothing left in tracehook.h so remove it.
Upda
tracehook: Remove tracehook.h
Now that all of the definitions have moved out of tracehook.h into ptrace.h, sched/signal.h, resume_user_mode.h there is nothing left in tracehook.h so remove it.
Update the few files that were depending upon tracehook.h to bring in definitions to use the headers they need directly.
Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lkml.kernel.org/r/20220309162454.123006-13-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
show more ...
|
#
390031c9 |
| 08-Mar-2022 |
Eric W. Biederman <ebiederm@xmission.com> |
coredump: Use the vma snapshot in fill_files_note
Matthew Wilcox reported that there is a missing mmap_lock in file_files_note that could possibly lead to a user after free.
Solve this by using the
coredump: Use the vma snapshot in fill_files_note
Matthew Wilcox reported that there is a missing mmap_lock in file_files_note that could possibly lead to a user after free.
Solve this by using the existing vma snapshot for consistency and to avoid the need to take the mmap_lock anywhere in the coredump code except for dump_vma_snapshot.
Update the dump_vma_snapshot to capture vm_pgoff and vm_file that are neeeded by fill_files_note.
Add free_vma_snapshot to free the captured values of vm_file.
Reported-by: Matthew Wilcox <willy@infradead.org> Link: https://lkml.kernel.org/r/20220131153740.2396974-1-willy@infradead.org Cc: stable@vger.kernel.org Fixes: a07279c9a8cd ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot") Fixes: 2aa362c49c31 ("coredump: extend core dump note section to contain file names of mapped files") Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
show more ...
|
#
49c18663 |
| 08-Mar-2022 |
Eric W. Biederman <ebiederm@xmission.com> |
coredump: Remove the WARN_ON in dump_vma_snapshot
The condition is impossible and to the best of my knowledge has never triggered.
We are in deep trouble if that conditions happens and we walk past
coredump: Remove the WARN_ON in dump_vma_snapshot
The condition is impossible and to the best of my knowledge has never triggered.
We are in deep trouble if that conditions happens and we walk past the end of our allocated array.
So delete the WARN_ON and the code that makes it look like the kernel can handle the case of walking past the end of it's vma_meta array.
Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
show more ...
|
#
95c5436a |
| 08-Mar-2022 |
Eric W. Biederman <ebiederm@xmission.com> |
coredump: Snapshot the vmas in do_coredump
Move the call of dump_vma_snapshot and kvfree(vma_meta) out of the individual coredump routines into do_coredump itself. This makes the code less error pr
coredump: Snapshot the vmas in do_coredump
Move the call of dump_vma_snapshot and kvfree(vma_meta) out of the individual coredump routines into do_coredump itself. This makes the code less error prone and easier to maintain.
Make the vma snapshot available to the coredump routines in struct coredump_params. This makes it easier to change and update what is captures in the vma snapshot and will be needed for fixing fill_file_notes.
Reviewed-by: Jann Horn <jannh@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
show more ...
|