#
35ce8ae9 |
| 16-Jan-2022 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull signal/exit/ptrace updates from Eric Biederman: "This set of changes deletes some dead
Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull signal/exit/ptrace updates from Eric Biederman: "This set of changes deletes some dead code, makes a lot of cleanups which hopefully make the code easier to follow, and fixes bugs found along the way.
The end-game which I have not yet reached yet is for fatal signals that generate coredumps to be short-circuit deliverable from complete_signal, for force_siginfo_to_task not to require changing userspace configured signal delivery state, and for the ptrace stops to always happen in locations where we can guarantee on all architectures that the all of the registers are saved and available on the stack.
Removal of profile_task_ext, profile_munmap, and profile_handoff_task are the big successes for dead code removal this round.
A bunch of small bug fixes are included, as most of the issues reported were small enough that they would not affect bisection so I simply added the fixes and did not fold the fixes into the changes they were fixing.
There was a bug that broke coredumps piped to systemd-coredump. I dropped the change that caused that bug and replaced it entirely with something much more restrained. Unfortunately that required some rebasing.
Some successes after this set of changes: There are few enough calls to do_exit to audit in a reasonable amount of time. The lifetime of struct kthread now matches the lifetime of struct task, and the pointer to struct kthread is no longer stored in set_child_tid. The flag SIGNAL_GROUP_COREDUMP is removed. The field group_exit_task is removed. Issues where task->exit_code was examined with signal->group_exit_code should been examined were fixed.
There are several loosely related changes included because I am cleaning up and if I don't include them they will probably get lost.
The original postings of these changes can be found at: https://lkml.kernel.org/r/87a6ha4zsd.fsf@email.froward.int.ebiederm.org https://lkml.kernel.org/r/87bl1kunjj.fsf@email.froward.int.ebiederm.org https://lkml.kernel.org/r/87r19opkx1.fsf_-_@email.froward.int.ebiederm.org
I trimmed back the last set of changes to only the obviously correct once. Simply because there was less time for review than I had hoped"
* 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (44 commits) ptrace/m68k: Stop open coding ptrace_report_syscall ptrace: Remove unused regs argument from ptrace_report_syscall ptrace: Remove second setting of PT_SEIZED in ptrace_attach taskstats: Cleanup the use of task->exit_code exit: Use the correct exit_code in /proc/<pid>/stat exit: Fix the exit_code for wait_task_zombie exit: Coredumps reach do_group_exit exit: Remove profile_handoff_task exit: Remove profile_task_exit & profile_munmap signal: clean up kernel-doc comments signal: Remove the helper signal_group_exit signal: Rename group_exit_task group_exec_task coredump: Stop setting signal->group_exit_task signal: Remove SIGNAL_GROUP_COREDUMP signal: During coredumps set SIGNAL_GROUP_EXIT in zap_process signal: Make coredump handling explicit in complete_signal signal: Have prepare_signal detect coredumps using signal->core_state signal: Have the oom killer detect coredumps using signal->core_state exit: Move force_uaccess back into do_exit exit: Guarantee make_task_dead leaks the tsk when calling do_task_exit ...
show more ...
|
Revision tags: v5.15.15 |
|
#
762f99f4 |
| 15-Jan-2022 |
Dmitry Torokhov <dmitry.torokhov@gmail.com> |
Merge branch 'next' into for-linus
Prepare input updates for 5.17 merge window.
|
#
f56caeda |
| 15-Jan-2022 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: "146 patches.
Subsystems affected by this patch series: kthread, ia64, scripts, ntfs, squashfs, ocfs2, vfs, and
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: "146 patches.
Subsystems affected by this patch series: kthread, ia64, scripts, ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak, dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap, memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb, userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp, ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits) mm/damon: hide kernel pointer from tracepoint event mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging mm/damon/dbgfs: remove an unnecessary variable mm/damon: move the implementation of damon_insert_region to damon.h mm/damon: add access checking for hugetlb pages Docs/admin-guide/mm/damon/usage: update for schemes statistics mm/damon/dbgfs: support all DAMOS stats Docs/admin-guide/mm/damon/reclaim: document statistics parameters mm/damon/reclaim: provide reclamation statistics mm/damon/schemes: account how many times quota limit has exceeded mm/damon/schemes: account scheme actions that successfully applied mm/damon: remove a mistakenly added comment for a future feature Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning Docs/admin-guide/mm/damon/usage: remove redundant information Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks mm/damon: convert macro functions to static inline functions mm/damon: modify damon_rand() macro to static inline function mm/damon: move damon_rand() definition into damon.h ...
show more ...
|
#
f530243a |
| 14-Jan-2022 |
Jann Horn <jannh@google.com> |
mm, oom: OOM sysrq should always kill a process
The OOM kill sysrq (alt+sysrq+F) should allow the user to kill the process with the highest OOM badness with a single execution.
However, at the mome
mm, oom: OOM sysrq should always kill a process
The OOM kill sysrq (alt+sysrq+F) should allow the user to kill the process with the highest OOM badness with a single execution.
However, at the moment, the OOM kill can bail out if an OOM notifier (e.g. the i915 one) says that it reclaimed a tiny amount of memory from somewhere. That's probably not what the user wants, so skip the bailout if the OOM was triggered via sysrq.
Link: https://lkml.kernel.org/r/20220106102605.635656-1-jannh@google.com Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
ba535c1c |
| 14-Jan-2022 |
Suren Baghdasaryan <surenb@google.com> |
mm/oom_kill: allow process_mrelease to run under mmap_lock protection
With exit_mmap holding mmap_write_lock during free_pgtables call, process_mrelease does not need to elevate mm->mm_users in orde
mm/oom_kill: allow process_mrelease to run under mmap_lock protection
With exit_mmap holding mmap_write_lock during free_pgtables call, process_mrelease does not need to elevate mm->mm_users in order to prevent exit_mmap from destrying pagetables while __oom_reap_task_mm is walking the VMA tree. The change prevents process_mrelease from calling the last mmput, which can lead to waiting for IO completion in exit_aio.
Link: https://lkml.kernel.org/r/20211209191325.3069345-3-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <christian@brauner.io> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Jan Engelhardt <jengelh@inai.de> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
b6bf9abb |
| 14-Jan-2022 |
Dan Schatzberg <schatzberg.dan@gmail.com> |
mm/memcg: add oom_group_kill memory event
Our container agent wants to know when a container exits if it was OOM killed or not to report to the user. We use memory.oom.group = 1 to ensure that OOM
mm/memcg: add oom_group_kill memory event
Our container agent wants to know when a container exits if it was OOM killed or not to report to the user. We use memory.oom.group = 1 to ensure that OOM kills within the container's cgroup kill everything. Existing memory.events are insufficient for knowing if this triggered:
1) Our current approach reads memory.events oom_kill and reports the container was killed if the value is non-zero. This is erroneous in some cases where containers create their children cgroups with memory.oom.group=1 as such OOM kills will get counted against the container cgroup's oom_kill counter despite not actually OOM killing the entire container.
2) Reading memory.events.local will fail to identify OOM kills in leaf cgroups (that don't set memory.oom.group) within the container cgroup.
This patch adds a new oom_group_kill event when memory.oom.group triggers to allow userspace to cleanly identify when an entire cgroup is oom killed.
[schatzberg.dan@gmail.com: changes from Johannes and Chris] Link: https://lkml.kernel.org/r/20211213162511.2492267-1-schatzberg.dan@gmail.com
Link: https://lkml.kernel.org/r/20211203162426.3375036-1-schatzberg.dan@gmail.com Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
8a2094d6 |
| 10-Jan-2022 |
Jiri Kosina <jkosina@suse.cz> |
Merge branch 'for-5.17/core' into for-linus
- support for USI style pens (Tero Kristo, Mika Westerberg) - quirk for devices that need inverted X/Y axes (Alistair Francis) - small core code cleanups
Merge branch 'for-5.17/core' into for-linus
- support for USI style pens (Tero Kristo, Mika Westerberg) - quirk for devices that need inverted X/Y axes (Alistair Francis) - small core code cleanups and deduplication (Benjamin Tissoires)
show more ...
|
Revision tags: v5.16, v5.15.10, v5.15.9, v5.15.8, v5.15.7, v5.15.6, v5.15.5, v5.15.4 |
|
#
98b24b16 |
| 19-Nov-2021 |
Eric W. Biederman <ebiederm@xmission.com> |
signal: Have the oom killer detect coredumps using signal->core_state
In preparation for removing the flag SIGNAL_GROUP_COREDUMP, change __task_will_free_mem to test signal->core_state instead of th
signal: Have the oom killer detect coredumps using signal->core_state
In preparation for removing the flag SIGNAL_GROUP_COREDUMP, change __task_will_free_mem to test signal->core_state instead of the flag SIGNAL_GROUP_COREDUMP.
Both fields are protected by siglock and both live in signal_struct so there are no real tradeoffs here, just a change to which field is being tested.
Link: https://lkml.kernel.org/r/20211213225350.27481-3-ebiederm@xmission.com Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
show more ...
|
#
f81483aa |
| 05-Jan-2022 |
Takashi Iwai <tiwai@suse.de> |
Merge branch 'for-next' into for-linus
Pull 5.17 materials.
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
#
17580470 |
| 17-Dec-2021 |
Thomas Zimmermann <tzimmermann@suse.de> |
Merge drm/drm-next into drm-misc-next-fixes
Backmerging to bring drm-misc-next-fixes up to the latest state for the current release cycle.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
|
#
86329873 |
| 09-Dec-2021 |
Daniel Lezcano <daniel.lezcano@linaro.org> |
Merge branch 'reset/of-get-optional-exclusive' of git://git.pengutronix.de/pza/linux into timers/drivers/next
"Add optional variant of of_reset_control_get_exclusive(). If the requested reset is not
Merge branch 'reset/of-get-optional-exclusive' of git://git.pengutronix.de/pza/linux into timers/drivers/next
"Add optional variant of of_reset_control_get_exclusive(). If the requested reset is not specified in the device tree, this function returns NULL instead of an error."
This dependency is needed for the Generic Timer Module (a.k.a OSTM) support for RZ/G2L.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
show more ...
|
#
5d8dfaa7 |
| 09-Dec-2021 |
Dmitry Torokhov <dmitry.torokhov@gmail.com> |
Merge tag 'v5.15' into next
Sync up with the mainline to get the latest APIs and DT bindings.
|
#
448cc2fb |
| 22-Nov-2021 |
Jani Nikula <jani.nikula@intel.com> |
Merge drm/drm-next into drm-intel-next
Sync up with drm-next to get v5.16-rc2.
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
|
#
8626afb1 |
| 22-Nov-2021 |
Tvrtko Ursulin <tvrtko.ursulin@intel.com> |
Merge drm/drm-next into drm-intel-gt-next
Thomas needs the dma_resv_for_each_fence API for i915/ttm async migration work.
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
|
Revision tags: v5.15.3 |
|
#
a713ca23 |
| 18-Nov-2021 |
Thomas Zimmermann <tzimmermann@suse.de> |
Merge drm/drm-next into drm-misc-next
Backmerging from drm/drm-next for v5.16-rc1.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
|
#
467dd91e |
| 16-Nov-2021 |
Maxime Ripard <maxime@cerno.tech> |
Merge drm/drm-fixes into drm-misc-fixes
We need -rc1 to address a breakage in drm/scheduler affecting panfrost.
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
|
Revision tags: v5.15.2 |
|
#
447212bb |
| 11-Nov-2021 |
Dave Airlie <airlied@redhat.com> |
BackMerge tag 'v5.15' into drm-next
I got a drm-fixes which had some 5.15 stuff in it, so to avoid the mess just backmerge here.
Linux 5.15
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
#
6752de1a |
| 10-Nov-2021 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'pidfd.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd updates from Christian Brauner: "Various places in the kernel have picked up pidfds.
The two mos
Merge tag 'pidfd.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd updates from Christian Brauner: "Various places in the kernel have picked up pidfds.
The two most recent additions have probably been the ability to use pidfds in bpf maps and the usage of pidfds in mm-based syscalls such as process_mrelease() and process_madvise().
The same pattern to turn a pidfd into a struct task exists in two places. One of those places used PIDTYPE_TGID while the other one used PIDTYPE_PID even though it is clearly documented in all pidfd-helpers that pidfds __currently__ only refer to thread-group leaders (subject to change in the future if need be).
This isn't a bug per se but has the potential to be one if we allow pidfds to refer to individual threads. If that happens we want to audit all codepaths that make use of them to ensure they can deal with pidfds refering to individual threads.
This adds a simple helper to turn a pidfd into a struct task making it easy to grep for such places. Plus, it gets rid of code-duplication"
* tag 'pidfd.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: mm: use pidfd_get_task() pid: add pidfd_get_task() helper
show more ...
|
#
512b7931 |
| 06-Nov-2021 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: "257 patches.
Subsystems affected by this patch series: scripts, ocfs2, vfs, and mm (slab-generic, slab, slub,
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton: "257 patches.
Subsystems affected by this patch series: scripts, ocfs2, vfs, and mm (slab-generic, slab, slub, kconfig, dax, kasan, debug, pagecache, gup, swap, memcg, pagemap, mprotect, mremap, iomap, tracing, vmalloc, pagealloc, memory-failure, hugetlb, userfaultfd, vmscan, tools, memblock, oom-kill, hugetlbfs, migration, thp, readahead, nommu, ksm, vmstat, madvise, memory-hotplug, rmap, zsmalloc, highmem, zram, cleanups, kfence, and damon)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (257 commits) mm/damon: remove return value from before_terminate callback mm/damon: fix a few spelling mistakes in comments and a pr_debug message mm/damon: simplify stop mechanism Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Docs/admin-guide/mm/damon/start: simplify the content Docs/admin-guide/mm/damon/start: fix a wrong link Docs/admin-guide/mm/damon/start: fix wrong example commands mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on mm/damon: remove unnecessary variable initialization Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) selftests/damon: support watermarks mm/damon/dbgfs: support watermarks mm/damon/schemes: activate schemes based on a watermarks mechanism tools/selftests/damon: update for regions prioritization of schemes mm/damon/dbgfs: support prioritization weights mm/damon/vaddr,paddr: support pageout prioritization mm/damon/schemes: prioritize regions within the quotas mm/damon/selftests: support schemes quotas mm/damon/dbgfs: support quotas of schemes ...
show more ...
|
Revision tags: v5.15.1 |
|
#
3723929e |
| 05-Nov-2021 |
Sultan Alsawaf <sultan@kerneltoast.com> |
mm: mark the OOM reaper thread as freezable
The OOM reaper alters user address space which might theoretically alter the snapshot if reaping is allowed to happen after the freezer quiescent state.
mm: mark the OOM reaper thread as freezable
The OOM reaper alters user address space which might theoretically alter the snapshot if reaping is allowed to happen after the freezer quiescent state. To this end, the reaper kthread uses wait_event_freezable() while waiting for any work so that it cannot run while the system freezes.
However, the current implementation doesn't respect the freezer because all kernel threads are created with the PF_NOFREEZE flag, so they are automatically excluded from freezing operations. This means that the OOM reaper can race with system snapshotting if it has work to do while the system is being frozen.
Fix this by adding a set_freezable() call which will clear the PF_NOFREEZE flag and thus make the OOM reaper visible to the freezer.
Please note that the OOM reaper altering the snapshot this way is mostly a theoretical concern and has not been observed in practice.
Link: https://lkml.kernel.org/r/20210921165758.6154-1-sultan@kerneltoast.com Link: https://lkml.kernel.org/r/20210918233920.9174-1-sultan@kerneltoast.com Fixes: aac453635549 ("mm, oom: introduce oom reaper") Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
60e2793d |
| 05-Nov-2021 |
Michal Hocko <mhocko@suse.com> |
mm, oom: do not trigger out_of_memory from the #PF
Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 d
mm, oom: do not trigger out_of_memory from the #PF
Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails.
The latter is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate.
Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain.
This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF.
[VvS: #PF allocation can hit into limit of cgroup v1 kmem controller. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. This has been broken since kmem accounting has been introduced for cgroup v1 (3.8). There was no kmem specific reclaim for the separate limit so the only way to handle kmem hard limit was to return with ENOMEM. In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable/LTS.]
Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
0b28179a |
| 05-Nov-2021 |
Vasily Averin <vvs@virtuozzo.com> |
mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3.
Memory cgroup charging allows killed or
mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3.
Memory cgroup charging allows killed or exiting tasks to exceed the hard limit. It can be misused and allowed to trigger global OOM from inside a memcg-limited container. On the other hand if memcg fails allocation, called from inside #PF handler it triggers global OOM from inside pagefault_out_of_memory().
To prevent these problems this patchset: (a) removes execution of out_of_memory() from pagefault_out_of_memory(), becasue nobody can explain why it is necessary. (b) allow memcg to fail allocation of dying/killed tasks.
This patch (of 3):
Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory which in turn executes out_out_memory() and can kill a random task.
An allocation might fail when the current task is the oom victim and there are no memory reserves left. The OOM killer is already handled at the page allocator level for the global OOM and at the charging level for the memcg one. Both have much more information about the scope of allocation/charge request. This means that either the OOM killer has been invoked properly and didn't lead to the allocation success or it has been skipped because it couldn't have been invoked. In both cases triggering it from here is pointless and even harmful.
It makes much more sense to let the killed task die rather than to wake up an eternally hungry oom-killer and send him to choose a fatter victim for breakfast.
Link: https://lkml.kernel.org/r/0828a149-786e-7c06-b70a-52d086818ea3@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
7f9f8792 |
| 06-Nov-2021 |
Arnaldo Carvalho de Melo <acme@redhat.com> |
Merge remote-tracking branch 'torvalds/master' into perf/core
To pick up some tools/perf/ patches that went via tip/perf/core, such as:
tools/perf: Add mem_hops field in perf_mem_data_src structu
Merge remote-tracking branch 'torvalds/master' into perf/core
To pick up some tools/perf/ patches that went via tip/perf/core, such as:
tools/perf: Add mem_hops field in perf_mem_data_src structure
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
show more ...
|
#
820e9906 |
| 05-Nov-2021 |
Jiri Kosina <jkosina@suse.cz> |
Merge branch 'for-5.16/asus' into for-linus
|
#
a602285a |
| 03-Nov-2021 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge branch 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull per signal_struct coredumps from Eric Biederman: "Current coredump
Merge branch 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull per signal_struct coredumps from Eric Biederman: "Current coredumps are mixed up with the exit code, the signal handling code, and the ptrace code making coredumps much more complicated than necessary and difficult to follow.
This series of changes starts with ptrace_stop and cleans it up, making it easier to follow what is happening in ptrace_stop. Then cleans up the exec interactions with coredumps. Then cleans up the coredump interactions with exit. Finally the coredump interactions with the signal handling code is cleaned up.
The first and last changes are bug fixes for minor bugs.
I believe the fact that vfork followed by execve can kill the process the called vfork if exec fails is sufficient justification to change the userspace visible behavior.
In previous discussions some of these changes were organized differently and individually appeared to make the code base worse. As currently written I believe they all stand on their own as cleanups and bug fixes.
Which means that even if the worst should happen and the last change needs to be reverted for some unimaginable reason, the code base will still be improved.
If the worst does not happen there are a more cleanups that can be made. Signals that generate coredumps can easily become eligible for short circuit delivery in complete_signal. The entire rendezvous for generating a coredump can move into get_signal. The function force_sig_info_to_task be written in a way that does not modify the signal handling state of the target task (because coredumps are eligible for short circuit delivery). Many of these future cleanups can be done another way but nothing so cleanly as if coredumps become per signal_struct"
* 'per_signal_struct_coredumps-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: coredump: Limit coredumps to a single thread group coredump: Don't perform any cleanups before dumping core exit: Factor coredump_exit_mm out of exit_mm exec: Check for a pending fatal signal instead of core_state ptrace: Remove the unnecessary arguments from arch_ptrace_stop signal: Remove the bogus sigkill_pending in ptrace_stop
show more ...
|