History log of /openbmc/linux/mm/memory.c (Results 1 – 25 of 4368)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.6.67, v6.6.66, v6.6.65, v6.6.64, v6.6.63, v6.6.62, v6.6.61
# 5f8b7d4b 10-Nov-2024 Andrew Jeffery <andrew@codeconstruct.com.au>

Merge tag 'v6.6.60' into for/openbmc/dev-6.6

This is the 6.6.60 stable release


Revision tags: v6.6.60, v6.6.59, v6.6.58, v6.6.57
# 3d544942 11-Oct-2024 David Hildenbrand <david@redhat.com>

mm: don't install PMD mappings when THPs are disabled by the hw/process/vma

commit 2b0f922323ccfa76219bcaacd35cd50aeaa13592 upstream.

We (or rather, readahead logic :) ) might be allocating a THP i

mm: don't install PMD mappings when THPs are disabled by the hw/process/vma

commit 2b0f922323ccfa76219bcaacd35cd50aeaa13592 upstream.

We (or rather, readahead logic :) ) might be allocating a THP in the
pagecache and then try mapping it into a process that explicitly disabled
THP: we might end up installing PMD mappings.

This is a problem for s390x KVM, which explicitly remaps all PMD-mapped
THPs to be PTE-mapped in s390_enable_sie()->thp_split_mm(), before
starting the VM.

For example, starting a VM backed on a file system with large folios
supported makes the VM crash when the VM tries accessing such a mapping
using KVM.

Is it also a problem when the HW disabled THP using
TRANSPARENT_HUGEPAGE_UNSUPPORTED? At least on x86 this would be the case
without X86_FEATURE_PSE.

In the future, we might be able to do better on s390x and only disallow
PMD mappings -- what s390x and likely TRANSPARENT_HUGEPAGE_UNSUPPORTED
really wants. For now, fix it by essentially performing the same check as
would be done in __thp_vma_allowable_orders() or in shmem code, where this
works as expected, and disallow PMD mappings, making us fallback to PTE
mappings.

Link: https://lkml.kernel.org/r/20241011102445.934409-3-david@redhat.com
Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Leo Fu <bfu@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Cc: Thomas Huth <thuth@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


Revision tags: v6.6.56, v6.6.55, v6.6.54, v6.6.53
# 76921820 18-Sep-2024 Andrew Jeffery <andrew@codeconstruct.com.au>

Merge tag 'v6.6.52' into for/openbmc/dev-6.6

This is the 6.6.52 stable release


Revision tags: v6.6.52, v6.6.51
# a95a24fc 11-Sep-2024 Linus Torvalds <torvalds@linux-foundation.org>

mm: avoid leaving partial pfn mappings around in error case

commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3 upstream.

As Jann points out, PFN mappings are special, because unlike normal
memory mapp

mm: avoid leaving partial pfn mappings around in error case

commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3 upstream.

As Jann points out, PFN mappings are special, because unlike normal
memory mappings, there is no lifetime information associated with the
mapping - it is just a raw mapping of PFNs with no reference counting of
a 'struct page'.

That's all very much intentional, but it does mean that it's easy to
mess up the cleanup in case of errors. Yes, a failed mmap() will always
eventually clean up any partial mappings, but without any explicit
lifetime in the page table mapping itself, it's very easy to do the
error handling in the wrong order.

In particular, it's easy to mistakenly free the physical backing store
before the page tables are actually cleaned up and (temporarily) have
stale dangling PTE entries.

To make this situation less error-prone, just make sure that any partial
pfn mapping is torn down early, before any other error handling.

Reported-and-tested-by: Jann Horn <jannh@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


Revision tags: v6.6.50, v6.6.49
# 26d0dfbb 29-Aug-2024 Andrew Jeffery <andrew@codeconstruct.com.au>

Merge tag 'v6.6.48' into for/openbmc/dev-6.6

This is the 6.6.48 stable release


Revision tags: v6.6.48, v6.6.47, v6.6.46, v6.6.45
# 19b4397c 09-Aug-2024 Zi Yan <ziy@nvidia.com>

mm/numa: no task_numa_fault() call if PTE is changed

commit 40b760cfd44566bca791c80e0720d70d75382b84 upstream.

When handling a numa page fault, task_numa_fault() should be called by a
process that

mm/numa: no task_numa_fault() call if PTE is changed

commit 40b760cfd44566bca791c80e0720d70d75382b84 upstream.

When handling a numa page fault, task_numa_fault() should be called by a
process that restores the page table of the faulted folio to avoid
duplicated stats counting. Commit b99a342d4f11 ("NUMA balancing: reduce
TLB flush via delaying mapping on hint page fault") restructured
do_numa_page() and did not avoid task_numa_fault() call in the second page
table check after a numa migration failure. Fix it by making all
!pte_same() return immediately.

This issue can cause task_numa_fault() being called more than necessary
and lead to unexpected numa balancing results (It is hard to tell whether
the issue will cause positive or negative performance impact due to
duplicated numa fault counting).

Link: https://lkml.kernel.org/r/20240809145906.1513458-2-ziy@nvidia.com
Fixes: b99a342d4f11 ("NUMA balancing: reduce TLB flush via delaying mapping on hint page fault")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reported-by: "Huang, Ying" <ying.huang@intel.com>
Closes: https://lore.kernel.org/linux-mm/87zfqfw0yw.fsf@yhuang6-desk2.ccr.corp.intel.com/
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


# 7e24a55b 04-Aug-2024 Andrew Jeffery <andrew@codeconstruct.com.au>

Merge tag 'v6.6.44' into for/openbmc/dev-6.6

This is the 6.6.44 stable release


Revision tags: v6.6.44, v6.6.43, v6.6.42, v6.6.41, v6.6.40, v6.6.39
# 46001644 09-Jul-2024 Ram Tummala <rtummala@nvidia.com>

mm: fix old/young bit handling in the faulting path

commit 4cd7ba16a0afb36550eed7690e73d3e7a743fa96 upstream.

Commit 3bd786f76de2 ("mm: convert do_set_pte() to set_pte_range()")
replaced do_set_pte

mm: fix old/young bit handling in the faulting path

commit 4cd7ba16a0afb36550eed7690e73d3e7a743fa96 upstream.

Commit 3bd786f76de2 ("mm: convert do_set_pte() to set_pte_range()")
replaced do_set_pte() with set_pte_range() and that introduced a
regression in the following faulting path of non-anonymous vmas which
caused the PTE for the faulting address to be marked as old instead of
young.

handle_pte_fault()
do_pte_missing()
do_fault()
do_read_fault() || do_cow_fault() || do_shared_fault()
finish_fault()
set_pte_range()

The polarity of prefault calculation is incorrect. This leads to prefault
being incorrectly set for the faulting address. The following check will
incorrectly mark the PTE old rather than young. On some architectures
this will cause a double fault to mark it young when the access is
retried.

if (prefault && arch_wants_old_prefaulted_pte())
entry = pte_mkold(entry);

On a subsequent fault on the same address, the faulting path will see a
non NULL vmf->pte and instead of reaching the do_pte_missing() path, PTE
will then be correctly marked young in handle_pte_fault() itself.

Due to this bug, performance degradation in the fault handling path will
be observed due to unnecessary double faulting.

Link: https://lkml.kernel.org/r/20240710014539.746200-1-rtummala@nvidia.com
Fixes: 3bd786f76de2 ("mm: convert do_set_pte() to set_pte_range()")
Signed-off-by: Ram Tummala <rtummala@nvidia.com>
Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


Revision tags: v6.6.38, v6.6.37, v6.6.36, v6.6.35, v6.6.34, v6.6.33, v6.6.32, v6.6.31, v6.6.30, v6.6.29, v6.6.28, v6.6.27
# 86aa961b 10-Apr-2024 Andrew Jeffery <andrew@codeconstruct.com.au>

Merge tag 'v6.6.26' into dev-6.6

This is the 6.6.26 stable release


Revision tags: v6.6.26, v6.6.25
# 51b7841f 03-Apr-2024 David Hildenbrand <david@redhat.com>

x86/mm/pat: fix VM_PAT handling in COW mappings

commit 04c35ab3bdae7fefbd7c7a7355f29fa03a035221 upstream.

PAT handling won't do the right thing in COW mappings: the first PTE (or,
in fact, all PTEs

x86/mm/pat: fix VM_PAT handling in COW mappings

commit 04c35ab3bdae7fefbd7c7a7355f29fa03a035221 upstream.

PAT handling won't do the right thing in COW mappings: the first PTE (or,
in fact, all PTEs) can be replaced during write faults to point at anon
folios. Reliably recovering the correct PFN and cachemode using
follow_phys() from PTEs will not work in COW mappings.

Using follow_phys(), we might just get the address+protection of the anon
folio (which is very wrong), or fail on swap/nonswap entries, failing
follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn() and
track_pfn_copy(), not properly calling free_pfn_range().

In free_pfn_range(), we either wouldn't call memtype_free() or would call
it with the wrong range, possibly leaking memory.

To fix that, let's update follow_phys() to refuse returning anon folios,
and fallback to using the stored PFN inside vma->vm_pgoff for COW mappings
if we run into that.

We will now properly handle untrack_pfn() with COW mappings, where we
don't need the cachemode. We'll have to fail fork()->track_pfn_copy() if
the first page was replaced by an anon folio, though: we'd have to store
the cachemode in the VMA to make this work, likely growing the VMA size.

For now, lets keep it simple and let track_pfn_copy() just fail in that
case: it would have failed in the past with swap/nonswap entries already,
and it would have done the wrong thing with anon folios.

Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn():

<--- C reproducer --->
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
#include <liburing.h>

int main(void)
{
struct io_uring_params p = {};
int ring_fd;
size_t size;
char *map;

ring_fd = io_uring_setup(1, &p);
if (ring_fd < 0) {
perror("io_uring_setup");
return 1;
}
size = p.sq_off.array + p.sq_entries * sizeof(unsigned);

/* Map the submission queue ring MAP_PRIVATE */
map = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
ring_fd, IORING_OFF_SQ_RING);
if (map == MAP_FAILED) {
perror("mmap");
return 1;
}

/* We have at least one page. Let's COW it. */
*map = 0;
pause();
return 0;
}
<--- C reproducer --->

On a system with 16 GiB RAM and swap configured:
# ./iouring &
# memhog 16G
# killall iouring
[ 301.552930] ------------[ cut here ]------------
[ 301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100
[ 301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g
[ 301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1
[ 301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4
[ 301.559569] RIP: 0010:untrack_pfn+0xf4/0x100
[ 301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000
[ 301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282
[ 301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047
[ 301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200
[ 301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000
[ 301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000
[ 301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000
[ 301.564186] FS: 0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000
[ 301.564773] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0
[ 301.565725] PKRU: 55555554
[ 301.565944] Call Trace:
[ 301.566148] <TASK>
[ 301.566325] ? untrack_pfn+0xf4/0x100
[ 301.566618] ? __warn+0x81/0x130
[ 301.566876] ? untrack_pfn+0xf4/0x100
[ 301.567163] ? report_bug+0x171/0x1a0
[ 301.567466] ? handle_bug+0x3c/0x80
[ 301.567743] ? exc_invalid_op+0x17/0x70
[ 301.568038] ? asm_exc_invalid_op+0x1a/0x20
[ 301.568363] ? untrack_pfn+0xf4/0x100
[ 301.568660] ? untrack_pfn+0x65/0x100
[ 301.568947] unmap_single_vma+0xa6/0xe0
[ 301.569247] unmap_vmas+0xb5/0x190
[ 301.569532] exit_mmap+0xec/0x340
[ 301.569801] __mmput+0x3e/0x130
[ 301.570051] do_exit+0x305/0xaf0
...

Link: https://lkml.kernel.org/r/20240403212131.929421-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Wupeng Ma <mawupeng1@huawei.com>
Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com
Fixes: b1a86e15dc03 ("x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines")
Fixes: 5899329b1910 ("x86: PAT: implement track/untrack of pfnmap regions for x86 - v3")
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


Revision tags: v6.6.24, v6.6.23
# 96d3c5a7 13-Mar-2024 Andrew Jeffery <andrew@codeconstruct.com.au>

Merge tag 'v6.6.19' into dev-6.6

This is the 6.6.19 stable release


# c595db6d 13-Mar-2024 Andrew Jeffery <andrew@codeconstruct.com.au>

Merge tag 'v6.6.18' into dev-6.6

This is the 6.6.18 stable release


# 30515231 06-Feb-2024 Kairui Song <kasong@tencent.com>

mm/swap: fix race when skipping swapcache

commit 13ddaf26be324a7f951891ecd9ccd04466d27458 upstream.

When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
swapin the same entry at t

mm/swap: fix race when skipping swapcache

commit 13ddaf26be324a7f951891ecd9ccd04466d27458 upstream.

When skipping swapcache for SWP_SYNCHRONOUS_IO, if two or more threads
swapin the same entry at the same time, they get different pages (A, B).
Before one thread (T0) finishes the swapin and installs page (A) to the
PTE, another thread (T1) could finish swapin of page (B), swap_free the
entry, then swap out the possibly modified page reusing the same entry.
It breaks the pte_same check in (T0) because PTE value is unchanged,
causing ABA problem. Thread (T0) will install a stalled page (A) into the
PTE and cause data corruption.

One possible callstack is like this:

CPU0 CPU1
---- ----
do_swap_page() do_swap_page() with same entry
<direct swapin path> <direct swapin path>
<alloc page A> <alloc page B>
swap_read_folio() <- read to page A swap_read_folio() <- read to page B
<slow on later locks or interrupt> <finished swapin first>
... set_pte_at()
swap_free() <- entry is free
<write to page B, now page A stalled>
<swap out page B to same swap entry>
pte_same() <- Check pass, PTE seems
unchanged, but page A
is stalled!
swap_free() <- page B content lost!
set_pte_at() <- staled page A installed!

And besides, for ZRAM, swap_free() allows the swap device to discard the
entry content, so even if page (B) is not modified, if swap_read_folio()
on CPU0 happens later than swap_free() on CPU1, it may also cause data
loss.

To fix this, reuse swapcache_prepare which will pin the swap entry using
the cache flag, and allow only one thread to swap it in, also prevent any
parallel code from putting the entry in the cache. Release the pin after
PT unlocked.

Racers just loop and wait since it's a rare and very short event. A
schedule_timeout_uninterruptible(1) call is added to avoid repeated page
faults wasting too much CPU, causing livelock or adding too much noise to
perf statistics. A similar livelock issue was described in commit
029c4628b2eb ("mm: swap: get rid of livelock in swapin readahead")

Reproducer:

This race issue can be triggered easily using a well constructed
reproducer and patched brd (with a delay in read path) [1]:

With latest 6.8 mainline, race caused data loss can be observed easily:
$ gcc -g -lpthread test-thread-swap-race.c && ./a.out
Polulating 32MB of memory region...
Keep swapping out...
Starting round 0...
Spawning 65536 workers...
32746 workers spawned, wait for done...
Round 0: Error on 0x5aa00, expected 32746, got 32743, 3 data loss!
Round 0: Error on 0x395200, expected 32746, got 32743, 3 data loss!
Round 0: Error on 0x3fd000, expected 32746, got 32737, 9 data loss!
Round 0 Failed, 15 data loss!

This reproducer spawns multiple threads sharing the same memory region
using a small swap device. Every two threads updates mapped pages one by
one in opposite direction trying to create a race, with one dedicated
thread keep swapping out the data out using madvise.

The reproducer created a reproduce rate of about once every 5 minutes, so
the race should be totally possible in production.

After this patch, I ran the reproducer for over a few hundred rounds and
no data loss observed.

Performance overhead is minimal, microbenchmark swapin 10G from 32G
zram:

Before: 10934698 us
After: 11157121 us
Cached: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)

[kasong@tencent.com: v4]
Link: https://lkml.kernel.org/r/20240219082040.7495-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20240206182559.32264-1-ryncsn@gmail.com
Fixes: 0bcac06f27d7 ("mm, swap: skip swapcache for swapin of synchronous device")
Reported-by: "Huang, Ying" <ying.huang@intel.com>
Closes: https://lore.kernel.org/lkml/87bk92gqpx.fsf_-_@yhuang6-desk2.ccr.corp.intel.com/
Link: https://github.com/ryncsn/emm-test-project/tree/master/swap-stress-race [1]
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Yu Zhao <yuzhao@google.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


Revision tags: v6.6.16
# 94d34a68 02-Feb-2024 Jiaxun Yang <jiaxun.yang@flygoat.com>

mm/memory: Use exception ip to search exception tables

[ Upstream commit 8fa5070833886268e4fb646daaca99f725b378e9 ]

On architectures with delay slot, instruction_pointer() may differ
from where exc

mm/memory: Use exception ip to search exception tables

[ Upstream commit 8fa5070833886268e4fb646daaca99f725b378e9 ]

On architectures with delay slot, instruction_pointer() may differ
from where exception was triggered.

Use exception_ip we just introduced to search exception tables to
get rid of the problem.

Fixes: 4bce37a68ff8 ("mips/mm: Convert to using lock_mm_and_find_vma()")
Reported-by: Xi Ruoyao <xry111@xry111.site>
Link: https://lore.kernel.org/r/75e9fd7b08562ad9b456a5bdaacb7cc220311cc9.camel@xry111.site/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jiaxun Yang <jiaxun.yang@flygoat.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>

show more ...


# 59f5a149 10-Feb-2024 Andrew Jeffery <andrew@codeconstruct.com.au>

Merge tag 'v6.6.11' into dev-6.6

This is the 6.6.11 stable release


Revision tags: v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8
# 2247df45 19-Dec-2023 Jiajun Xie <jiajun.xie.sh@gmail.com>

mm: fix unmap_mapping_range high bits shift bug

commit 9eab0421fa94a3dde0d1f7e36ab3294fc306c99d upstream.

The bug happens when highest bit of holebegin is 1, suppose holebegin is
0x8000000111111000

mm: fix unmap_mapping_range high bits shift bug

commit 9eab0421fa94a3dde0d1f7e36ab3294fc306c99d upstream.

The bug happens when highest bit of holebegin is 1, suppose holebegin is
0x8000000111111000, after shift, hba would be 0xfff8000000111111, then
vma_interval_tree_foreach would look it up fail or leads to the wrong
result.

error call seq e.g.:
- mmap(..., offset=0x8000000111111000)
|- syscall(mmap, ... unsigned long, off):
|- ksys_mmap_pgoff( ... , off >> PAGE_SHIFT);

here pgoff is correctly shifted to 0x8000000111111,
but pass 0x8000000111111000 as holebegin to unmap
would then cause terrible result, as shown below:

- unmap_mapping_range(..., loff_t const holebegin)
|- pgoff_t hba = holebegin >> PAGE_SHIFT;
/* hba = 0xfff8000000111111 unexpectedly */

The issue happens in Heterogeneous computing, where the device(e.g.
gpu) and host share the same virtual address space.

A simple workflow pattern which hit the issue is:
/* host */
1. userspace first mmap a file backed VA range with specified offset.
e.g. (offset=0x800..., mmap return: va_a)
2. write some data to the corresponding sys page
e.g. (va_a = 0xAABB)
/* device */
3. gpu workload touches VA, triggers gpu fault and notify the host.
/* host */
4. reviced gpu fault notification, then it will:
4.1 unmap host pages and also takes care of cpu tlb
(use unmap_mapping_range with offset=0x800...)
4.2 migrate sys page to device
4.3 setup device page table and resolve device fault.
/* device */
5. gpu workload continued, it accessed va_a and got 0xAABB.
6. gpu workload continued, it wrote 0xBBCC to va_a.
/* host */
7. userspace access va_a, as expected, it will:
7.1 trigger cpu vm fault.
7.2 driver handling fault to migrate gpu local page to host.
8. userspace then could correctly get 0xBBCC from va_a
9. done

But in step 4.1, if we hit the bug this patch mentioned, then userspace
would never trigger cpu fault, and still get the old value: 0xAABB.

Making holebegin unsigned first fixes the bug.

Link: https://lkml.kernel.org/r/20231220052839.26970-1-jiajun.xie.sh@gmail.com
Signed-off-by: Jiajun Xie <jiajun.xie.sh@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


Revision tags: v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9
# 4f828701 24-Oct-2023 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'mm-hotfixes-stable-2023-10-24-09-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"20 hotfixes. 12 are cc:stable and the remainder address

Merge tag 'mm-hotfixes-stable-2023-10-24-09-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"20 hotfixes. 12 are cc:stable and the remainder address post-6.5
issues or aren't considered necessary for earlier kernel versions"

* tag 'mm-hotfixes-stable-2023-10-24-09-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
maple_tree: add GFP_KERNEL to allocations in mas_expected_entries()
selftests/mm: include mman header to access MREMAP_DONTUNMAP identifier
mailmap: correct email aliasing for Oleksij Rempel
mailmap: map Bartosz's old address to the current one
mm/damon/sysfs: check DAMOS regions update progress from before_terminate()
MAINTAINERS: Ondrej has moved
kasan: disable kasan_non_canonical_hook() for HW tags
kasan: print the original fault addr when access invalid shadow
hugetlbfs: close race between MADV_DONTNEED and page fault
hugetlbfs: extend hugetlb_vma_lock to private VMAs
hugetlbfs: clear resv_map pointer if mmap fails
mm: zswap: fix pool refcount bug around shrink_worker()
mm/migrate: fix do_pages_move for compat pointers
riscv: fix set_huge_pte_at() for NAPOT mappings when a swap entry is set
riscv: handle VM_FAULT_[HWPOISON|HWPOISON_LARGE] faults instead of panicking
mmap: fix error paths with dup_anon_vma()
mmap: fix vma_iterator in error path of vma_merge()
mm: fix vm_brk_flags() to not bail out while holding lock
mm/mempolicy: fix set_mempolicy_home_node() previous VMA pointer
mm/page_alloc: correct start page when guard page debug is enabled

show more ...


Revision tags: v6.5.8, v6.5.7, v6.5.6
# 2820b0f0 05-Oct-2023 Rik van Riel <riel@surriel.com>

hugetlbfs: close race between MADV_DONTNEED and page fault

Malloc libraries, like jemalloc and tcalloc, take decisions on when to
call madvise independently from the code in the main application.

T

hugetlbfs: close race between MADV_DONTNEED and page fault

Malloc libraries, like jemalloc and tcalloc, take decisions on when to
call madvise independently from the code in the main application.

This sometimes results in the application page faulting on an address,
right after the malloc library has shot down the backing memory with
MADV_DONTNEED.

Usually this is harmless, because we always have some 4kB pages sitting
around to satisfy a page fault. However, with hugetlbfs systems often
allocate only the exact number of huge pages that the application wants.

Due to TLB batching, hugetlbfs MADV_DONTNEED will free pages outside of
any lock taken on the page fault path, which can open up the following
race condition:

CPU 1 CPU 2

MADV_DONTNEED
unmap page
shoot down TLB entry
page fault
fail to allocate a huge page
killed with SIGBUS
free page

Fix that race by pulling the locking from __unmap_hugepage_final_range
into helper functions called from zap_page_range_single. This ensures
page faults stay locked out of the MADV_DONTNEED VMA until the huge pages
have actually been freed.

Link: https://lkml.kernel.org/r/20231006040020.3677377-4-riel@surriel.com
Fixes: 04ada095dcfc ("hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing")
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

show more ...


Revision tags: v6.5.5, v6.5.4, v6.5.3
# c900529f 12-Sep-2023 Thomas Zimmermann <tzimmermann@suse.de>

Merge drm/drm-fixes into drm-misc-fixes

Forwarding to v6.6-rc1.

Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>


Revision tags: v6.5.2, v6.1.51, v6.5.1
# df57721f 31-Aug-2023 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 shadow stack support from Dave Hansen:
"This is the long awaited x86 shadow stack support, part o

Merge tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 shadow stack support from Dave Hansen:
"This is the long awaited x86 shadow stack support, part of Intel's
Control-flow Enforcement Technology (CET).

CET consists of two related security features: shadow stacks and
indirect branch tracking. This series implements just the shadow stack
part of this feature, and just for userspace.

The main use case for shadow stack is providing protection against
return oriented programming attacks. It works by maintaining a
secondary (shadow) stack using a special memory type that has
protections against modification. When executing a CALL instruction,
the processor pushes the return address to both the normal stack and
to the special permission shadow stack. Upon RET, the processor pops
the shadow stack copy and compares it to the normal stack copy.

For more information, refer to the links below for the earlier
versions of this patch set"

Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/
Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/

* tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits)
x86/shstk: Change order of __user in type
x86/ibt: Convert IBT selftest to asm
x86/shstk: Don't retry vm_munmap() on -EINTR
x86/kbuild: Fix Documentation/ reference
x86/shstk: Move arch detail comment out of core mm
x86/shstk: Add ARCH_SHSTK_STATUS
x86/shstk: Add ARCH_SHSTK_UNLOCK
x86: Add PTRACE interface for shadow stack
selftests/x86: Add shadow stack test
x86/cpufeatures: Enable CET CR4 bit for shadow stack
x86/shstk: Wire in shadow stack interface
x86: Expose thread features in /proc/$PID/status
x86/shstk: Support WRSS for userspace
x86/shstk: Introduce map_shadow_stack syscall
x86/shstk: Check that signal frame is shadow stack mem
x86/shstk: Check that SSP is aligned on sigreturn
x86/shstk: Handle signals for shadow stack
x86/shstk: Introduce routines modifying shstk
x86/shstk: Handle thread shadow stack
x86/shstk: Add user-mode shadow stack support
...

show more ...


# 1ac731c5 30-Aug-2023 Dmitry Torokhov <dmitry.torokhov@gmail.com>

Merge branch 'next' into for-linus

Prepare input updates for 6.6 merge window.


Revision tags: v6.1.50
# b96a3e91 29-Aug-2023 Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

- Some swap cleanups from Ma Wupeng ("fix WARN_ON in
add_to_a

Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

- Some swap cleanups from Ma Wupeng ("fix WARN_ON in
add_to_avail_list")

- Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which
reduces the special-case code for handling hugetlb pages in GUP. It
also speeds up GUP handling of transparent hugepages.

- Peng Zhang provides some maple tree speedups ("Optimize the fast path
of mas_store()").

- Sergey Senozhatsky has improved te performance of zsmalloc during
compaction (zsmalloc: small compaction improvements").

- Domenico Cerasuolo has developed additional selftest code for zswap
("selftests: cgroup: add zswap test program").

- xu xin has doe some work on KSM's handling of zero pages. These
changes are mainly to enable the user to better understand the
effectiveness of KSM's treatment of zero pages ("ksm: support
tracking KSM-placed zero-pages").

- Jeff Xu has fixes the behaviour of memfd's
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl
MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").

- David Howells has fixed an fscache optimization ("mm, netfs, fscache:
Stop read optimisation when folio removed from pagecache").

- Axel Rasmussen has given userfaultfd the ability to simulate memory
poisoning ("add UFFDIO_POISON to simulate memory poisoning with
UFFD").

- Miaohe Lin has contributed some routine maintenance work on the
memory-failure code ("mm: memory-failure: remove unneeded PageHuge()
check").

- Peng Zhang has contributed some maintenance work on the maple tree
code ("Improve the validation for maple tree and some cleanup").

- Hugh Dickins has optimized the collapsing of shmem or file pages into
THPs ("mm: free retracted page table by RCU").

- Jiaqi Yan has a patch series which permits us to use the healthy
subpages within a hardware poisoned huge page for general purposes
("Improve hugetlbfs read on HWPOISON hugepages").

- Kemeng Shi has done some maintenance work on the pagetable-check code
("Remove unused parameters in page_table_check").

- More folioification work from Matthew Wilcox ("More filesystem folio
conversions for 6.6"), ("Followup folio conversions for zswap"). And
from ZhangPeng ("Convert several functions in page_io.c to use a
folio").

- page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").

- Baoquan He has converted some architectures to use the
GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert
architectures to take GENERIC_IOREMAP way").

- Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support
batched/deferred tlb shootdown during page reclamation/migration").

- Better maple tree lockdep checking from Liam Howlett ("More strict
maple tree lockdep"). Liam also developed some efficiency
improvements ("Reduce preallocations for maple tree").

- Cleanup and optimization to the secondary IOMMU TLB invalidation,
from Alistair Popple ("Invalidate secondary IOMMU TLB on permission
upgrade").

- Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes
for arm64").

- Kemeng Shi provides some maintenance work on the compaction code
("Two minor cleanups for compaction").

- Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle
most file-backed faults under the VMA lock").

- Aneesh Kumar contributes code to use the vmemmap optimization for DAX
on ppc64, under some circumstances ("Add support for DAX vmemmap
optimization for ppc64").

- page-ext cleanups from Kemeng Shi ("add page_ext_data to get client
data in page_ext"), ("minor cleanups to page_ext header").

- Some zswap cleanups from Johannes Weiner ("mm: zswap: three
cleanups").

- kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").

- VMA handling cleanups from Kefeng Wang ("mm: convert to
vma_is_initial_heap/stack()").

- DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes:
implement DAMOS tried total bytes file"), ("Extend DAMOS filters for
address ranges and DAMON monitoring targets").

- Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").

- Liam Howlett has improved the maple tree node replacement code
("maple_tree: Change replacement strategy").

- ZhangPeng has a general code cleanup - use the K() macro more widely
("cleanup with helper macro K()").

- Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for
memmap on memory feature on ppc64").

- pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list
in page_alloc"), ("Two minor cleanups for get pageblock
migratetype").

- Vishal Moola introduces a memory descriptor for page table tracking,
"struct ptdesc" ("Split ptdesc from struct page").

- memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups
for vm.memfd_noexec").

- MM include file rationalization from Hugh Dickins ("arch: include
asm/cacheflush.h in asm/hugetlb.h").

- THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text
output").

- kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use
object_cache instead of kmemleak_initialized").

- More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor
and _folio_order").

- A VMA locking scalability improvement from Suren Baghdasaryan
("Per-VMA lock support for swap and userfaults").

- pagetable handling cleanups from Matthew Wilcox ("New page table
range API").

- A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop
using page->private on tail pages for THP_SWAP + cleanups").

- Cleanups and speedups to the hugetlb fault handling from Matthew
Wilcox ("Change calling convention for ->huge_fault").

- Matthew Wilcox has also done some maintenance work on the MM
subsystem documentation ("Improve mm documentation").

* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
maple_tree: shrink struct maple_tree
maple_tree: clean up mas_wr_append()
secretmem: convert page_is_secretmem() to folio_is_secretmem()
nios2: fix flush_dcache_page() for usage from irq context
hugetlb: add documentation for vma_kernel_pagesize()
mm: add orphaned kernel-doc to the rst files.
mm: fix clean_record_shared_mapping_range kernel-doc
mm: fix get_mctgt_type() kernel-doc
mm: fix kernel-doc warning from tlb_flush_rmaps()
mm: remove enum page_entry_size
mm: allow ->huge_fault() to be called without the mmap_lock held
mm: move PMD_ORDER to pgtable.h
mm: remove checks for pte_index
memcg: remove duplication detection for mem_cgroup_uncharge_swap
mm/huge_memory: work on folio->swap instead of page->private when splitting folio
mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
mm/swap: use dedicated entry for swap in folio
mm/swap: stop using page->private on tail pages for THP_SWAP
selftests/mm: fix WARNING comparing pointer to 0
selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
...

show more ...


# 692f5510 28-Aug-2023 Takashi Iwai <tiwai@suse.de>

Merge tag 'asoc-v6.6' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Updates for v6.6

The rest of the updates for v6.6, some of the highlights include:

- A

Merge tag 'asoc-v6.6' of https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Updates for v6.6

The rest of the updates for v6.6, some of the highlights include:

- A big API cleanup from Morimoto-san, rationalising the places we put
functions.
- Lots of work on the SOF framework, AMD and Intel drivers, including a
lot of cleanup and new device support.
- Standardisation of the presentation of jacks from drivers.
- Provision of some generic sound card DT properties.
- Conversion oof more drivers to the maple tree register cache.
- New drivers for AMD Van Gogh, AWInic AW88261, Cirrus Logic cs42l43,
various Intel platforms, Mediatek MT7986, RealTek RT1017 and StarFive
JH7110.

show more ...


Revision tags: v6.5, v6.1.49, v6.1.48
# 1d024e7a 18-Aug-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: remove enum page_entry_size

Remove the unnecessary encoding of page order into an enum and pass the
page order directly. That lets us get rid of pe_order().

The switch constructs have to be ch

mm: remove enum page_entry_size

Remove the unnecessary encoding of page order into an enum and pass the
page order directly. That lets us get rid of pe_order().

The switch constructs have to be changed to if/else constructs to prevent
GCC from warning on builds with 3-level page tables where PMD_ORDER and
PUD_ORDER have the same value.

If you are looking at this commit because your driver stopped compiling,
look at the previous commit as well and audit your driver to be sure it
doesn't depend on mmap_lock being held in its ->huge_fault method.

[willy@infradead.org: use "order %u" to match the (non dev_t) style]
Link: https://lkml.kernel.org/r/ZOUYekbtTv+n8hYf@casper.infradead.org
Link: https://lkml.kernel.org/r/20230818202335.2739663-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

show more ...


# 40d49a3c 18-Aug-2023 Matthew Wilcox (Oracle) <willy@infradead.org>

mm: allow ->huge_fault() to be called without the mmap_lock held

Remove the checks for the VMA lock being held, allowing the page fault
path to call into the filesystem instead of retrying with the

mm: allow ->huge_fault() to be called without the mmap_lock held

Remove the checks for the VMA lock being held, allowing the page fault
path to call into the filesystem instead of retrying with the mmap_lock
held. This will improve scalability for DAX page faults. Also update the
documentation to match (and fix some other changes that have happened
recently).

Link: https://lkml.kernel.org/r/20230818202335.2739663-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

show more ...


12345678910>>...175