Revision tags: v6.6.25, v6.6.24, v6.6.23 |
|
# b9349432 | 22-Feb-2024 | Vlastimil Babka <vbabka@suse.cz>
mm, mmap: fix vma_merge() case 7 with vma_ops->close
commit fc0c8f9089c20d198d8fe51ddc28bfa1af588dce upstream.
When debugging issues with a workload using SysV shmem, Michal Hocko has come up with a reproducer that shows how a series of mprotect() operations can result in an elevated shm_nattch and thus leak of the resource.
The problem is caused by wrong assumptions in vma_merge() introduced by commit 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test"). The shmem vmas have a vma_ops->close callback that decrements shm_nattch, and we remove the vma without calling it.
vma_merge() has thus historically avoided merging vma's with vma_ops->close and commit 714965ca8252 was supposed to keep it that way. It relaxed the checks for vma_ops->close in can_vma_merge_after() assuming that it is never called on a vma that would be a candidate for removal. However, the vma_merge() code does also use the result of this check in the decision to remove a different vma in the merge case 7.
A robust solution would be to refactor vma_merge() code in a way that the vma_ops->close check is only done for vma's that are actually going to be removed, and not as part of the preliminary checks. That would both solve the existing bug, and also allow additional merges that the checks currently prevent unnecessarily in some cases.
However, to fix the existing bug first with minimized risk, and for easier stable backports, this patch only adds a vma_ops->close check to the buggy case 7 specifically. All other cases of vma removal are covered by the can_vma_merge_before() check that includes the test for vma_ops->close.
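A minimal sketch of that idea follows (illustrative only, not the upstream diff; the curr/end/err names merely follow vma_merge()'s style and are assumptions here):

	if (curr && end == curr->vm_end) {	/* case 7: curr would be removed */
		/*
		 * Removing curr without invoking its vm_ops->close callback
		 * would leak resources such as shm_nattch, so refuse the
		 * merge when a close callback exists.
		 */
		if (curr->vm_ops && curr->vm_ops->close)
			err = -EINVAL;
	}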
The reproducer code, adapted from Michal Hocko's code:
int main(int argc, char *argv[])
{
	int segment_id;
	size_t segment_size = 20 * PAGE_SIZE;
	char *sh_mem;
	struct shmid_ds shmid_ds;

	key_t key = 0x1234;
	segment_id = shmget(key, segment_size,
			    IPC_CREAT | IPC_EXCL | S_IRUSR | S_IWUSR);
	sh_mem = (char *)shmat(segment_id, NULL, 0);

	mprotect(sh_mem + 2*PAGE_SIZE, PAGE_SIZE, PROT_NONE);
	mprotect(sh_mem + PAGE_SIZE, PAGE_SIZE, PROT_WRITE);
	mprotect(sh_mem + 2*PAGE_SIZE, PAGE_SIZE, PROT_WRITE);

	shmdt(sh_mem);

	shmctl(segment_id, IPC_STAT, &shmid_ds);
	printf("nattch after shmdt(): %lu (expected: 0)\n", shmid_ds.shm_nattch);

	if (shmctl(segment_id, IPC_RMID, 0))
		printf("IPCRM failed %d\n", errno);
	return (shmid_ds.shm_nattch) ? 1 : 0;
}
Link: https://lkml.kernel.org/r/20240222215930.14637-2-vbabka@suse.cz Fixes: 714965ca8252 ("mm/mmap: start distinguishing if vma can be removed in mergeability test") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
Revision tags: v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6 |
|
# 824135c4 | 29-Sep-2023 | Liam R. Howlett <Liam.Howlett@oracle.com>
mmap: fix error paths with dup_anon_vma()
When the calling function fails after the dup_anon_vma(), the duplication of the anon_vma is not being undone. Add the necessary unlink_anon_vmas() calls to the error paths that are missing them.
This issue showed up during inspection of the error path in vma_merge() for an unrelated vma iterator issue.
Users may experience increased memory usage, which may be problematic as the failure would likely be caused by a low memory situation.
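Schematically, the unwinding pattern looks roughly like this (the anon_dup out-parameter and the exact failure site are assumptions based on the description, not the literal upstream change):

	struct vm_area_struct *anon_dup = NULL;
	int error;

	error = dup_anon_vma(prev, next, &anon_dup);
	if (error)
		return error;

	if (vma_iter_prealloc(vmi, prev)) {
		/* undo the anon_vma duplication before bailing out */
		if (anon_dup)
			unlink_anon_vmas(anon_dup);
		return -ENOMEM;
	}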
Link: https://lkml.kernel.org/r/20230929183041.2835469-3-Liam.Howlett@oracle.com Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
# 1419430c | 29-Sep-2023 | Liam R. Howlett <Liam.Howlett@oracle.com>
mmap: fix vma_iterator in error path of vma_merge()
During the error path, the vma iterator may not be correctly positioned or set to the correct range. Undo the vma_prev() call by resetting to the passed in address. Re-walking to the same range will fix the range to the area previously passed in.
Users would notice increased cycles as vma_merge() would be called an extra time with vma == prev, and thus would fail to merge and return.
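A rough sketch of the described error-path reset, assuming the usual vma iterator helpers (the exact placement inside vma_merge() is schematic):

	/*
	 * The earlier vma_prev() stepped the iterator back one entry; on
	 * failure, reset it to the address that was passed in so the next
	 * walk starts from the intended range again.
	 */
	vma_iter_set(vmi, addr);
	vma_iter_load(vmi);
	return NULL;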
Link: https://lore.kernel.org/linux-mm/CAG48ez12VN1JAOtTNMY+Y2YnsU45yL5giS-Qn=ejtiHpgJAbdQ@mail.gmail.com/ Link: https://lkml.kernel.org/r/20230929183041.2835469-2-Liam.Howlett@oracle.com Fixes: 18b098af2890 ("vma_merge: set vma iterator to correct position.") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reported-by: Jann Horn <jannh@google.com> Closes: https://lore.kernel.org/linux-mm/CAG48ez12VN1JAOtTNMY+Y2YnsU45yL5giS-Qn=ejtiHpgJAbdQ@mail.gmail.com/ Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
# e0f81ab1 | 29-Sep-2023 | Sebastian Ott <sebott@redhat.com>
mm: fix vm_brk_flags() to not bail out while holding lock
Calling vm_brk_flags() with flags set other than VM_EXEC will exit the function without releasing the mmap_write_lock.
Just do the sanity check before the lock is acquired. This doesn't fix an actual issue since no caller sets a flag other than VM_EXEC.
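In other words, the flag sanity check simply moves above the lock acquisition; roughly (the exact flag mask is an assumption):

	/* reject unsupported flags before taking the lock, so the early
	 * return can no longer leak a held mmap_write_lock
	 */
	if (unlikely(flags & ~VM_EXEC))
		return -EINVAL;

	if (mmap_write_lock_killable(mm))
		return -EINTR;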
Link: https://lkml.kernel.org/r/20230929171937.work.697-kees@kernel.org Fixes: 2e7ce7d354f2 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()") Signed-off-by: Sebastian Ott <sebott@redhat.com> Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
Revision tags: v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44 |
|
# c9d6e982 | 04-Aug-2023 | Suren Baghdasaryan <surenb@google.com>
mm: move vma locking out of vma_prepare and dup_anon_vma
vma_prepare() is currently the central place where vmas are being locked before vma_complete() applies changes to them. While this is convenient, it also obscures vma locking and makes it harder to follow the locking rules. Move vma locking out of vma_prepare() and take vma locks explicitly at the locations where vmas are being modified. Move vma locking and replace it with an assertion inside dup_anon_vma() to further clarify the locking pattern inside vma_merge().
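The resulting pattern can be sketched as follows (schematic, not the exact diff):

	/* at the call sites: take the VMA write lock explicitly, right
	 * before the modification it protects
	 */
	vma_start_write(vma);
	vma_prepare(&vp);

Inside dup_anon_vma() itself, the lock acquisition is then replaced by an assertion, vma_assert_write_locked(dst), documenting that the caller must already hold the write lock.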
Link: https://lkml.kernel.org/r/20230804152724.3090321-7-surenb@google.com Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
# ad9f0063 | 04-Aug-2023 | Suren Baghdasaryan <surenb@google.com>
mm: always lock new vma before inserting into vma tree
While it's not strictly necessary to lock a newly created vma before adding it into the vma tree (as long as no further changes are performed to it), it seems like a good policy to lock it and prevent accidental changes after it becomes visible to the page faults. Lock the vma before adding it into the vma tree.
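A minimal sketch of the policy, assuming a vma_link()-style insertion path (illustrative only):

	/* write-lock the new vma before it becomes reachable via the tree */
	vma_start_write(vma);
	vma_iter_store(&vmi, vma);
	mm->map_count++;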
[akpm@linux-foundation.org: fix reject fixing in vma_link(), per Jann] Link: https://lkml.kernel.org/r/20230804152724.3090321-6-surenb@google.com Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linuxfoundation.org> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
# b1773e0e | 03-Aug-2023 | ZhangPeng <zhangpeng362@huawei.com>
mm/mmap.c: use helper macro K()
Use helper macro K() to improve code readability. No functional modification involved.
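For reference, the helper converts a page count to kibibytes and is defined along these lines (the usage line is hypothetical):

#define K(x) ((x) << (PAGE_SHIFT - 10))

	/* hypothetical usage: report a page count as KiB */
	pr_info("VmData: %lu KiB\n", K(mm->data_vm));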
Link: https://lkml.kernel.org/r/20230804012559.2617515-7-zhangpeng362@huawei.com Signed-off-by: ZhangPeng <zhangpeng362@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
Revision tags: v6.1.43, v6.1.42 |
|
# 6935e052 | 24-Jul-2023 | Liam R. Howlett <Liam.Howlett@oracle.com>
mm/mmap: change vma iteration order in do_vmi_align_munmap()
By delaying the setting of prev/next VMA until after the write of NULL, the probability of the prev/next VMA already being in the CPU cache is significantly increased, especially for larger munmap operations. It also means that prev/next will be loaded closer to when they are used.
This requires changing the loop type when gathering the VMAs that will be freed.
Since prev will be set later in the function, it is better to reverse the splitting direction of the start VMA (modify the new_below argument to __split_vma).
Using the vma_iter_prev_range() to walk back to the correct location in the tree will, on the most part, mean walking within the CPU cache. Usually, this is two steps vs a node reset and a tree re-walk.
Link: https://lkml.kernel.org/r/20230724183157.3939892-16-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
# b5df0922 | 24-Jul-2023 | Liam R. Howlett <Liam.Howlett@oracle.com>
mm: set up vma iterator for vma_iter_prealloc() calls
Set the correct limits for vma_iter_prealloc() calls so that the maple tree can be smarter about how many nodes are needed.
Link: https://lkml.kernel.org/r/20230724183157.3939892-11-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
# f72cf24a | 24-Jul-2023 | Liam R. Howlett <Liam.Howlett@oracle.com>
mm: use vma_iter_clear_gfp() in nommu
Move the definition of vma_iter_clear_gfp() from mmap.c to internal.h so it can be used in the nommu code. This will reduce node preallocations in nommu.
Link: https://lkml.kernel.org/r/20230724183157.3939892-10-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
# da089254 | 24-Jul-2023 | Liam R. Howlett <Liam.Howlett@oracle.com>
maple_tree: re-introduce entry to mas_preallocate() arguments
The current preallocation strategy is to preallocate the absolute worst-case allocation for a tree modification. The entry (or NULL) is needed to know how many nodes are needed to write to the tree. Start by adding the argument to the mas_preallocate() definition.
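The prototype change described here looks roughly like this (sketched from the description; treat the exact signature as an assumption):

	/* assumed post-change prototype: the entry to be written is now
	 * passed in (previously only the ma_state and gfp flags were), so
	 * later patches can size the preallocation for the actual store
	 */
	int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp);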
Link: https://lkml.kernel.org/r/20230724183157.3939892-8-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
# 53bee98d | 24-Jul-2023 | Liam R. Howlett <Liam.Howlett@oracle.com>
mm: remove re-walk from mmap_region()
Using vma_iter_set() will reset the tree and cause a re-walk. Use vmi_iter_config() to set the write to a sub-set of the range. Change the file case to also use vmi_iter_config() so that the end is correctly set.
Link: https://lkml.kernel.org/r/20230724183157.3939892-7-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
# 445a2ea0 | 24-Jul-2023 | Liam R. Howlett <Liam.Howlett@oracle.com>
mm: remove prev check from do_vmi_align_munmap()
If the prev does not exist, the vma iterator will be set to MAS_NONE, which will be treated as a MAS_START when the mas_next or mas_find is used. In this case, the next caller will be the vma iterator, which uses mas_find() under the hood and will now do what the user expects.
Link: https://lkml.kernel.org/r/20230724183157.3939892-5-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
# fd892593 | 24-Jul-2023 | Liam R. Howlett <Liam.Howlett@oracle.com>
mm: change do_vmi_align_munmap() tracking of VMAs to remove
The majority of the calls to munmap a vm range is within a single vma. The maple tree is able to store a single entry at 0, with a size of 1 as a pointer and avoid any allocations. Change do_vmi_align_munmap() to store the VMAs being munmap()'ed into a tree indexed by the count. This will leverage the ability to store the first entry without a node allocation.
Storing the entries into a tree by the count and not the vma start and end means changing the functions which iterate over the entries. Update unmap_vmas() and free_pgtables() to take a maple state and a tree end address to support this functionality.
Passing through the same maple state to unmap_vmas() and free_pgtables() means the state needs to be reset between calls. This happens in the static unmap_region() and exit_mmap().
Link: https://lkml.kernel.org/r/20230724183157.3939892-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
Revision tags: v6.1.41, v6.1.40 |
|
# 90717566 | 20-Jul-2023 | Jann Horn <jannh@google.com>
mm: don't drop VMA locks in mm_drop_all_locks()
Despite its name, mm_drop_all_locks() does not drop _all_ locks; the mmap lock is held write-locked by the caller, and the caller is responsible for dropping the mmap lock at a later point (which will also release the VMA locks).
Calling vma_end_write_all() here is dangerous because the caller might have write-locked a VMA with the expectation that it will stay write-locked until the mmap_lock is released, as usual.
This _almost_ becomes a problem in the following scenario:
An anonymous VMA A and an SGX VMA B are mapped adjacent to each other. Userspace calls munmap() on a range starting at the start address of A and ending in the middle of B.
Hypothetical call graph with additional notes in brackets:
do_vmi_align_munmap
  [begin first for_each_vma_range loop]
  vma_start_write [on VMA A]
  vma_mark_detached [on VMA A]
  __split_vma [on VMA B]
    sgx_vma_open [== new->vm_ops->open]
      sgx_encl_mm_add
        __mmu_notifier_register [luckily THIS CAN'T ACTUALLY HAPPEN]
          mm_take_all_locks
          mm_drop_all_locks
            vma_end_write_all [drops VMA lock taken on VMA A before]
  vma_start_write [on VMA B]
  vma_mark_detached [on VMA B]
  [end first for_each_vma_range loop]
  vma_iter_clear_gfp [removes VMAs from maple tree]
  mmap_write_downgrade
  unmap_region
  mmap_read_unlock
In this hypothetical scenario, while do_vmi_align_munmap() thinks it still holds a VMA write lock on VMA A, the VMA write lock has actually been invalidated inside __split_vma().
The call from sgx_encl_mm_add() to __mmu_notifier_register() can't actually happen here, as far as I understand, because we are duplicating an existing SGX VMA, but sgx_encl_mm_add() only calls __mmu_notifier_register() for the first SGX VMA created in a given process. So this could only happen in fork(), not on munmap(). But in my view it is just pure luck that this can't happen.
Also, we wouldn't actually have any bad consequences from this in do_vmi_align_munmap(), because by the time the bug drops the lock on VMA A, we've already marked VMA A as detached, which makes it completely ineligible for any VMA-locked page faults. But again, that's just pure luck.
So remove the vma_end_write_all(), so that VMA write locks are only ever released on mmap_write_unlock() or mmap_write_downgrade().
Also add comments to document the locking rules established by this patch.
Link: https://lkml.kernel.org/r/20230720193436.454247-1-jannh@google.com Fixes: eeff9a5d47f8 ("mm/mmap: prevent pagefault handler from racing with mmu_notifier registration") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
Revision tags: v6.1.39 |
|
# 02fdb25f | 05-Jul-2023 | Liam R. Howlett <Liam.Howlett@oracle.com>
mm/mmap: change detached vma locking scheme
Don't set the lock to the mm lock so that the detached VMA tree does not complain about being unlocked when the mmap_lock is dropped prior to freeing the tree.
Introduce mt_on_stack() for setting the external lock to NULL only when LOCKDEP is used.
Move the destroying of the detached tree outside the mmap lock all together.
Link: https://lkml.kernel.org/r/20230719183142.ktgcmuj2pnlr3h3s@revolver Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oliver Sang <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
# 2574d5e4 | 14-Jul-2023 | Liam R. Howlett <Liam.Howlett@oracle.com>
mm/mmap: clean up validate_mm() calls
Patch series "More strict maple tree lockdep", v2.
Linus asked for more strict maple tree lockdep checking [1] and for them to resume the normal path through Andrew's tree.
This series of patches adds checks to ensure the lock is held in write mode during the write path of the maple tree instead of checking if it's held at all.
It also reduces the validate_mm() calls by consolidating into commonly used functions (patch 0001), and removes the necessity of holding the lock on the detached tree during munmap() operations.
This patch (of 4):
validate_mm() calls are too spread out and duplicated in numerous locations. Also, now that the stack write is done under the write lock, it is not necessary to validate the mm prior to write operations.
Add a validate_mm() to the stack expansions, and to vma_complete() so that numerous others may be dropped.
Note that vma_link() (and also insert_vm_struct() by call path) already call validate_mm().
vma_merge() also had an unnecessary call to vma_iter_free() since the logic change to abort earlier if no merging is necessary.
Drop extra validate_mm() calls at the start of functions and error paths which won't write to the tree.
Relocate the validate_mm() call in the do_brk_flags() to avoid re-running the same test when vma_complete() is used.
The call within the error path of mmap_region() is left intentionally because of the complexity of the function and the potential of drivers modifying the tree.
Link: https://lkml.kernel.org/r/20230714195551.894800-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20230714195551.894800-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oliver Sang <oliver.sang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
# 6852c46c | 12-Jul-2023 | Yu Ma <yu.ma@intel.com>
mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock
UnixBench/Execl represents a class of workload where bash scripts are spawned frequently to do some short jobs. When running multiple parallel tasks, hot osq_lock is observed from do_mmap and exit_mmap. Both of them come from load_elf_binary through the call chain "execl->do_execveat_common->bprm_execve->load_elf_binary".
In do_mmap(), it will call mmap_region() to create the vma node, initialize it and insert it into the vma maintenance structure in mm_struct and the i_mmap tree of the mapping file, then increase map_count to record the number of vma nodes used. The hot osq_lock is there to protect operations on the file's i_mmap tree. The mm_struct member changes, like vma insertion and the map_count update, do not affect the i_mmap tree. Move those operations out of the lock's critical section to reduce hold time on the lock.
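The reshuffle can be pictured like this (a simplified sketch of the ordering being described, not the literal diff):

	/* mm_struct-side updates no longer happen under the file rwsem */
	vma_iter_store(&vmi, vma);
	mm->map_count++;

	if (vma->vm_file) {
		/* only the i_mmap tree insertion stays inside the lock */
		i_mmap_lock_write(vma->vm_file->f_mapping);
		vma_interval_tree_insert(vma, &vma->vm_file->f_mapping->i_mmap);
		i_mmap_unlock_write(vma->vm_file->f_mapping);
	}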
With this change, on Intel Sapphire Rapids 112C/224T platform, based on v6.0-rc6, the 160 parallel score improves by 12%. The patch has no obvious performance gain on v6.5-rc1 due to regression of this benchmark from this commit f1a7941243c102a44e8847e3b94ff4ff3ec56f25 (mm: convert mm's rss stats into percpu_counter). Related discussion and conclusion can be referred at the mail thread initiated by 0day as below: Link: https://lore.kernel.org/linux-mm/a4aa2e13-7187-600b-c628-7e8fb108def0@intel.com/
Link: https://lkml.kernel.org/r/20230712145739.604215-1-yu.ma@intel.com Signed-off-by: Yu Ma <yu.ma@intel.com> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kirill A . Shutemov <kirill@shutemov.name> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Zhu, Lipeng <lipeng.zhu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
# d8ab9f7b | 20-Jul-2023 | Jann Horn <jannh@google.com>
mm: lock VMA in dup_anon_vma() before setting ->anon_vma
When VMAs are merged, dup_anon_vma() is called with `dst` pointing to the VMA that is being expanded to cover the area previously occupied by another VMA. This currently happens while `dst` is not write-locked.
This means that, in the `src->anon_vma && !dst->anon_vma` case, as soon as the assignment `dst->anon_vma = src->anon_vma` has happened, concurrent page faults can happen on `dst` under the per-VMA lock. This is already icky in itself, since such page faults can now install pages into `dst` that are attached to an `anon_vma` that is not yet tied back to the `anon_vma` with an `anon_vma_chain`. But if `anon_vma_clone()` fails due to an out-of-memory error, things get much worse: `anon_vma_clone()` then reverts `dst->anon_vma` back to NULL, and `dst` remains completely unconnected to the `anon_vma`, even though we can have pages in the area covered by `dst` that point to the `anon_vma`.
This means the `anon_vma` of such pages can be freed while the pages are still mapped into userspace, which leads to UAF when a helper like folio_lock_anon_vma_read() tries to look up the anon_vma of such a page.
This theoretically is a security bug, but I believe it is really hard to actually trigger as an unprivileged user because it requires that you can make an order-0 GFP_KERNEL allocation fail, and the page allocator tries pretty hard to prevent that.
I think doing the vma_start_write() call inside dup_anon_vma() is the most straightforward fix for now.
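A sketch of the shape of that fix (schematic; the surrounding vma_merge() plumbing is elided):

static int dup_anon_vma(struct vm_area_struct *dst,
			struct vm_area_struct *src)
{
	/*
	 * Write-lock dst before publishing src's anon_vma in it, so
	 * per-VMA-lock page faults cannot observe a half-connected (or
	 * later reverted) anon_vma.
	 */
	if (src->anon_vma && !dst->anon_vma) {
		vma_start_write(dst);
		dst->anon_vma = src->anon_vma;
		return anon_vma_clone(dst, src);
	}
	return 0;
}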
For a kernel-assisted reproducer, see the notes section of the patch mail.
Link: https://lkml.kernel.org/r/20230721034643.616851-1-jannh@google.com Fixes: 5e31275cc997 ("mm: add per-VMA lock and helper functions to control it") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
Revision tags: v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34 |
|
# 0266e7c5 | 12-Jun-2023 | Rick Edgecombe <rick.p.edgecombe@intel.com>
mm: Add guard pages around a shadow stack.
The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly.
The architecture of shadow stack constrains the ability of userspace to move the shadow stack pointer (SSP) in order to prevent corrupting or switching to other shadow stacks. The RSTORSSP instruction can move the SSP to different shadow stacks, but it requires a specially placed token in order to do this. However, the architecture does not prevent incrementing the stack pointer to wander onto an adjacent shadow stack. To prevent this in software, enforce guard pages at the beginning of shadow stack VMAs, such that there will always be a gap between adjacent shadow stacks.
Make the gap big enough so that no userspace SSP changing operations (besides RSTORSSP), can move the SSP from one stack to the next. The SSP can be incremented or decremented by CALL, RET and INCSSP. CALL and RET can move the SSP by a maximum of 8 bytes, at which point the shadow stack would be accessed.
The INCSSP instruction can also increment the shadow stack pointer. It is the shadow stack analog of an instruction like:
addq $0x80, %rsp
However, there is one important difference between an ADD on %rsp and INCSSP. In addition to modifying SSP, INCSSP also reads from the memory of the first and last elements that were "popped". It can be thought of as acting like this:
READ_ONCE(ssp);       // read+discard top element on stack
ssp += nr_to_pop * 8; // move the shadow stack
READ_ONCE(ssp-8);     // read+discard last popped stack element
The maximum distance INCSSP can move the SSP is 2040 bytes, before it would read the memory. Therefore, a single page gap will be enough to prevent any operation from shifting the SSP to an adjacent stack, since it would have to land in the gap at least once, causing a fault.
This could be accomplished by using VM_GROWSDOWN, but this has a downside. The behavior would allow shadow stacks to grow, which is unneeded and adds a strange difference to how most regular stacks work.
In the maple tree code, there is some logic for retrying the unmapped area search if a guard gap is violated. This retry should happen for shadow stack guard gap violations as well. This logic currently only checks for VM_GROWSDOWN for start gaps. Since shadow stacks also have a start gap, create a new define VM_STARTGAP_FLAGS to hold all the VM flag bits that have start gaps, and make mmap use it.
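The new define and its use can be sketched as follows (the exact flag set and check site are assumptions based on this description):

/* every VM flag whose vmas need a guard gap at their start */
#define VM_STARTGAP_FLAGS (VM_GROWSDOWN | VM_SHADOW_STACK)

	/* the unmapped-area retry logic then tests these bits instead of
	 * VM_GROWSDOWN alone
	 */
	if (vma && (vma->vm_flags & VM_STARTGAP_FLAGS)) {
		/* redo the search with the guard gap enforced */
	}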
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-17-rick.p.edgecombe%40intel.com
|
# 592b5fad | 12-Jun-2023 | Yu-cheng Yu <yu-cheng.yu@intel.com>
mm: Re-introduce vm_flags to do_mmap()
There were no more callers passing vm_flags to do_mmap(), and vm_flags was removed from the function's input by:
commit 45e55300f114 ("mm: remove unnecessary wrapper function do_mmap_pgoff()").
There is a new user now. Shadow stack allocation passes VM_SHADOW_STACK to do_mmap(). Thus, re-introduce vm_flags to do_mmap().
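With the parameter re-introduced, the prototype looks roughly like this again (sketched; the exact parameter order is an assumption), and the shadow stack code passes VM_SHADOW_STACK through the new vm_flags argument:

	unsigned long do_mmap(struct file *file, unsigned long addr,
			      unsigned long len, unsigned long prot,
			      unsigned long flags, vm_flags_t vm_flags,
			      unsigned long pgoff, unsigned long *populate,
			      struct list_head *uf);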
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Peter Collingbourne <pcc@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Mark Brown <broonie@kernel.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/all/20230613001108.3040476-5-rick.p.edgecombe%40intel.com
|
# 1c7873e3 | 08-Jul-2023 | Hugh Dickins <hughd@google.com>
mm: lock newly mapped VMA with corrected ordering
Lockdep is certainly right to complain about
(&vma->vm_lock->lock){++++}-{3:3}, at: vma_start_write+0x2d/0x3f

but task is already holding lock:

(&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: mmap_region+0x4dc/0x6db
Invert those to the usual ordering.
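Schematically, the corrected ordering in mmap_region() becomes (a sketch of the intent, not the literal diff):

	/* take the VMA write lock first ... */
	vma_start_write(vma);
	/* ... and only then the file's i_mmap_rwsem */
	if (vma->vm_file)
		i_mmap_lock_write(vma->vm_file->f_mapping);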
Fixes: 33313a747e81 ("mm: lock newly mapped VMA which can be modified after it becomes visible") Cc: stable@vger.kernel.org Signed-off-by: Hugh Dickins <hughd@google.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
# 33313a74 | 08-Jul-2023 | Suren Baghdasaryan <surenb@google.com>
mm: lock newly mapped VMA which can be modified after it becomes visible
mmap_region adds a newly created VMA into VMA tree and might modify it afterwards before dropping the mmap_lock. This poses a problem for page faults handled under per-VMA locks because they don't take the mmap_lock and can stumble on this VMA while it's still being modified. Currently this does not pose a problem since post-addition modifications are done only for file-backed VMAs, which are not handled under per-VMA lock. However, once support for handling file-backed page faults with per-VMA locks is added, this will become a race.
Fix this by write-locking the VMA before inserting it into the VMA tree. Other places where a new VMA is added into VMA tree do not modify it after the insertion, so do not need the same locking.
Cc: stable@vger.kernel.org Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
# c137381f | 08-Jul-2023 | Suren Baghdasaryan <surenb@google.com>
mm: lock a vma before stack expansion
With recent changes necessitating mmap_lock to be held for write while expanding a stack, per-VMA locks should follow the same rules and be write-locked to prevent page faults into the VMA being expanded. Add the necessary locking.
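Schematically (the surrounding expansion logic is elided):

	/* write-lock the vma before growing it, so per-VMA-lock faults
	 * cannot run against the range being expanded
	 */
	vma_start_write(vma);
	/* expand_upwards()/expand_downwards() then adjust vm_start/vm_end */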
Cc: stable@vger.kernel.org Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|