#
8a8792f6 |
| 23-Jan-2021 |
Shakeel Butt <shakeelb@google.com> |
mm: memcg: fix memcg file_dirty numa stat
The kernel updates the per-node NR_FILE_DIRTY stats on page migration but not the memcg numa stats.
That was not an issue until recently the commit 5f9a4f4
mm: memcg: fix memcg file_dirty numa stat
The kernel updates the per-node NR_FILE_DIRTY stats on page migration but not the memcg numa stats.
That was not an issue until recently the commit 5f9a4f4a7096 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2") exposed numa stats for the memcg.
So fix the file_dirty per-memcg numa stat.
Link: https://lkml.kernel.org/r/20210108155813.2914586-1-shakeelb@google.com Fixes: 5f9a4f4a7096 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
8958b249 |
| 15-Dec-2020 |
Haitao Shi <shihaitao1@huawei.com> |
mm: fix some spelling mistakes in comments
Fix some spelling mistakes in comments: udpate ==> update succesful ==> successful exmaple ==> example unneccessary ==> unnecessary stoping ==> stoppi
mm: fix some spelling mistakes in comments
Fix some spelling mistakes in comments: udpate ==> update succesful ==> successful exmaple ==> example unneccessary ==> unnecessary stoping ==> stopping uknown ==> unknown
Link: https://lkml.kernel.org/r/20201127011747.86005-1-shihaitao1@huawei.com Signed-off-by: Haitao Shi <shihaitao1@huawei.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
d85c6db4 |
| 14-Dec-2020 |
Stephen Zhang <starzhangzsd@gmail.com> |
mm: migrate: remove unused parameter in migrate_vma_insert_page()
"dst" parameter to migrate_vma_insert_page() is not used anymore.
Link: https://lkml.kernel.org/r/CANubcdUwCAMuUyamG2dkWP=cqSR9MAS=
mm: migrate: remove unused parameter in migrate_vma_insert_page()
"dst" parameter to migrate_vma_insert_page() is not used anymore.
Link: https://lkml.kernel.org/r/CANubcdUwCAMuUyamG2dkWP=cqSR9MAS=tHLDc95kQkqU-rEnAg@mail.gmail.com Signed-off-by: Stephen Zhang <starzhangzsd@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
d532e2e5 |
| 14-Dec-2020 |
Yang Shi <shy828301@gmail.com> |
mm: migrate: return -ENOSYS if THP migration is unsupported
In the current implementation unmap_and_move() would return -ENOMEM if THP migration is unsupported, then the THP will be split. If split
mm: migrate: return -ENOSYS if THP migration is unsupported
In the current implementation unmap_and_move() would return -ENOMEM if THP migration is unsupported, then the THP will be split. If split is failed just exit without trying to migrate other pages. It doesn't make too much sense since there may be enough free memory to migrate other pages and there may be a lot base pages on the list.
Return -ENOSYS to make consistent with hugetlb. And if THP split is failed just skip and try other pages on the list.
Just skip the whole list and exit when free memory is really low.
Link: https://lkml.kernel.org/r/20201113205359.556831-6-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Song Liu <songliubraving@fb.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
236c32eb |
| 14-Dec-2020 |
Yang Shi <shy828301@gmail.com> |
mm: migrate: clean up migrate_prep{_local}
The migrate_prep{_local} never fails, so it is pointless to have return value and check the return value.
Link: https://lkml.kernel.org/r/20201113205359.5
mm: migrate: clean up migrate_prep{_local}
The migrate_prep{_local} never fails, so it is pointless to have return value and check the return value.
Link: https://lkml.kernel.org/r/20201113205359.556831-5-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Song Liu <songliubraving@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
c77c5cba |
| 14-Dec-2020 |
Yang Shi <shy828301@gmail.com> |
mm: migrate: skip shared exec THP for NUMA balancing
The NUMA balancing skip shared exec base page. Since CONFIG_READ_ONLY_THP_FOR_FS was introduced, there are probably shared exec THP, so skip suc
mm: migrate: skip shared exec THP for NUMA balancing
The NUMA balancing skip shared exec base page. Since CONFIG_READ_ONLY_THP_FOR_FS was introduced, there are probably shared exec THP, so skip such THPs for NUMA balancing as well.
And Willy's regular filesystem THP support patches could create shared exec THP wven without that config.
In addition, the page_is_file_lru() is used to tell if the page is file cache or not, but it filters out shmem page. It sounds like a typical usecase by putting executables in shmem to achieve performance gain via using shmem-THP, so it sounds worth skipping migration for such case too.
Link: https://lkml.kernel.org/r/20201113205359.556831-4-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jan Kara <jack@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Song Liu <songliubraving@fb.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
dd4ae78a |
| 14-Dec-2020 |
Yang Shi <shy828301@gmail.com> |
mm: migrate: simplify the logic for handling permanent failure
When unmap_and_move{_huge_page}() returns !-EAGAIN and !MIGRATEPAGE_SUCCESS, the page would be put back to LRU or proper list if it is
mm: migrate: simplify the logic for handling permanent failure
When unmap_and_move{_huge_page}() returns !-EAGAIN and !MIGRATEPAGE_SUCCESS, the page would be put back to LRU or proper list if it is non-LRU movable page. But, the callers always call putback_movable_pages() to put the failed pages back later on, so it seems not very efficient to put every single page back immediately, and the code looks convoluted.
Put the failed page on a separate list, then splice the list to migrate list when all pages are tried. It is the caller's responsibility to call putback_movable_pages() to handle failures. This also makes the code simpler and more readable.
After the change the rules are: * Success: non hugetlb page will be freed, hugetlb page will be put back * -EAGAIN: stay on the from list * -ENOMEM: stay on the from list * Other errno: put on ret_pages list then splice to from list
The from list would be empty iff all pages are migrated successfully, it was not so before. This has no impact to current existing callsites.
Link: https://lkml.kernel.org/r/20201113205359.556831-3-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Song Liu <songliubraving@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
d12b8951 |
| 14-Dec-2020 |
Yang Shi <shy828301@gmail.com> |
mm: truncate_complete_page() does not exist any more
Patch series "mm: misc migrate cleanup and improvement", v3.
This patch (of 5):
The commit 9f4e41f4717832e ("mm: refactor truncate_complete_pag
mm: truncate_complete_page() does not exist any more
Patch series "mm: misc migrate cleanup and improvement", v3.
This patch (of 5):
The commit 9f4e41f4717832e ("mm: refactor truncate_complete_page()") refactored truncate_complete_page(), and it is not existed anymore, correct the comment in vmscan and migrate to avoid confusion.
Link: https://lkml.kernel.org/r/20201113205359.556831-1-shy828301@gmail.com Link: https://lkml.kernel.org/r/20201113205359.556831-2-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Song Liu <songliubraving@fb.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
5e5dda81 |
| 14-Dec-2020 |
Ralph Campbell <rcampbell@nvidia.com> |
mm/migrate.c: optimize migrate_vma_pages() mmu notifier
When migrating a zero page or pte_none() anonymous page to device private memory, migrate_vma_setup() will initialize the src[] array with a N
mm/migrate.c: optimize migrate_vma_pages() mmu notifier
When migrating a zero page or pte_none() anonymous page to device private memory, migrate_vma_setup() will initialize the src[] array with a NULL PFN. This lets the device driver allocate device private memory and clear it instead of DMAing a page of zeros over the device bus.
Since the source page didn't exist at the time, no struct page was locked nor a migration PTE inserted into the CPU page tables. The actual PTE insertion happens in migrate_vma_pages() when it tries to insert the device private struct page PTE into the CPU page tables. migrate_vma_pages() has to call the mmu notifiers again since another device could fault on the same page before the page table locks are acquired.
Allow device drivers to optimize the invalidation similar to migrate_vma_setup() by calling mmu_notifier_range_init() which sets struct mmu_notifier_range event type to MMU_NOTIFY_MIGRATE and the migrate_pgmap_owner field.
Link: https://lkml.kernel.org/r/20201021191335.10916-1-rcampbell@nvidia.com Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
ab9dd4f8 |
| 14-Dec-2020 |
Long Li <lonuxli.64@gmail.com> |
mm/migrate.c: fix comment spelling
The word in the comment is misspelled, it should be "include".
Link: https://lkml.kernel.org/r/20201024114144.GA20552@lilong Signed-off-by: Long Li <lonuxli.64@gm
mm/migrate.c: fix comment spelling
The word in the comment is misspelled, it should be "include".
Link: https://lkml.kernel.org/r/20201024114144.GA20552@lilong Signed-off-by: Long Li <lonuxli.64@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
013339df |
| 14-Dec-2020 |
Shakeel Butt <shakeelb@google.com> |
mm/rmap: always do TTU_IGNORE_ACCESS
Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2"), the code to check the secondary MMU's page table access bit is broken for !(TTU_IG
mm/rmap: always do TTU_IGNORE_ACCESS
Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2"), the code to check the secondary MMU's page table access bit is broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the secondary MMU's page table before the check. More specifically for those secondary MMUs which unmap the memory in mmu_notifier_invalidate_range_start() like kvm.
However memory reclaim is the only user of !(TTU_IGNORE_ACCESS) or the absence of TTU_IGNORE_ACCESS and it explicitly performs the page table access check before trying to unmap the page. So, at worst the reclaim will miss accesses in a very short window if we remove page table access check in unmapping code.
There is an unintented consequence of !(TTU_IGNORE_ACCESS) for the memcg reclaim. From memcg reclaim the page_referenced() only account the accesses from the processes which are in the same memcg of the target page but the unmapping code is considering accesses from all the processes, so, decreasing the effectiveness of memcg reclaim.
The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping code.
Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
32f954e9 |
| 15-Jun-2021 |
Xu Yu <xuyu@linux.alibaba.com> |
mm, thp: use head page in __migration_entry_wait()
commit ffc90cbb2970ab88b66ea51dd580469eede57b67 upstream.
We notice that hung task happens in a corner but practical scenario when CONFIG_PREEMPT_
mm, thp: use head page in __migration_entry_wait()
commit ffc90cbb2970ab88b66ea51dd580469eede57b67 upstream.
We notice that hung task happens in a corner but practical scenario when CONFIG_PREEMPT_NONE is enabled, as follows.
Process 0 Process 1 Process 2..Inf split_huge_page_to_list unmap_page split_huge_pmd_address __migration_entry_wait(head) __migration_entry_wait(tail) remap_page (roll back) remove_migration_ptes rmap_walk_anon cond_resched
Where __migration_entry_wait(tail) is occurred in kernel space, e.g., copy_to_user in fstat, which will immediately fault again without rescheduling, and thus occupy the cpu fully.
When there are too many processes performing __migration_entry_wait on tail page, remap_page will never be done after cond_resched.
This makes __migration_entry_wait operate on the compound head page, thus waits for remap_page to complete, whether the THP is split successfully or roll back.
Note that put_and_wait_on_page_locked helps to drop the page reference acquired with get_page_unless_zero, as soon as the page is on the wait queue, before actually waiting. So splitting the THP is only prevented for a brief interval.
Link: https://lkml.kernel.org/r/b9836c1dd522e903891760af9f0c86a2cce987eb.1623144009.git.xuyu@linux.alibaba.com Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split") Suggested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
#
aa0d6d1d |
| 04-May-2021 |
Miaohe Lin <linmiaohe@huawei.com> |
mm/migrate.c: fix potential indeterminate pte entry in migrate_vma_insert_page()
[ Upstream commit 34f5e9b9d1990d286199084efa752530ee3d8297 ]
If the zone device page does not belong to un-addressab
mm/migrate.c: fix potential indeterminate pte entry in migrate_vma_insert_page()
[ Upstream commit 34f5e9b9d1990d286199084efa752530ee3d8297 ]
If the zone device page does not belong to un-addressable device memory, the variable entry will be uninitialized and lead to indeterminate pte entry ultimately. Fix this unexpected case and warn about it.
Link: https://lkml.kernel.org/r/20210325131524.48181-4-linmiaohe@huawei.com Fixes: df6ad69838fc ("mm/device-public-memory: device memory cache coherent with CPU") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
371f3fbf |
| 23-Jan-2021 |
Shakeel Butt <shakeelb@google.com> |
mm: fix numa stats for thp migration
commit 5c447d274f3746fbed6e695e7b9a2d7bd8b31b71 upstream.
Currently the kernel is not correctly updating the numa stats for NR_FILE_PAGES and NR_SHMEM on THP mi
mm: fix numa stats for thp migration
commit 5c447d274f3746fbed6e695e7b9a2d7bd8b31b71 upstream.
Currently the kernel is not correctly updating the numa stats for NR_FILE_PAGES and NR_SHMEM on THP migration. Fix that.
For NR_FILE_DIRTY and NR_ZONE_WRITE_PENDING, although at the moment there is no need to handle THP migration as kernel still does not have write support for file THP but to be more future proof, this patch adds the THP support for those stats as well.
Link: https://lkml.kernel.org/r/20210108155813.2914586-2-shakeelb@google.com Fixes: e71769ae52609 ("mm: enable thp migration for shmem thp") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
#
0dc3a130 |
| 23-Jan-2021 |
Shakeel Butt <shakeelb@google.com> |
mm: memcg: fix memcg file_dirty numa stat
commit 8a8792f600abacd7e1b9bb667759dca1c153f64c upstream.
The kernel updates the per-node NR_FILE_DIRTY stats on page migration but not the memcg numa stat
mm: memcg: fix memcg file_dirty numa stat
commit 8a8792f600abacd7e1b9bb667759dca1c153f64c upstream.
The kernel updates the per-node NR_FILE_DIRTY stats on page migration but not the memcg numa stats.
That was not an issue until recently the commit 5f9a4f4a7096 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2") exposed numa stats for the memcg.
So fix the file_dirty per-memcg numa stat.
Link: https://lkml.kernel.org/r/20210108155813.2914586-1-shakeelb@google.com Fixes: 5f9a4f4a7096 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2") Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
#
dd156e3f |
| 14-Dec-2020 |
Shakeel Butt <shakeelb@google.com> |
mm/rmap: always do TTU_IGNORE_ACCESS
[ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ]
Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2"), the code to check th
mm/rmap: always do TTU_IGNORE_ACCESS
[ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ]
Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2"), the code to check the secondary MMU's page table access bit is broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the secondary MMU's page table before the check. More specifically for those secondary MMUs which unmap the memory in mmu_notifier_invalidate_range_start() like kvm.
However memory reclaim is the only user of !(TTU_IGNORE_ACCESS) or the absence of TTU_IGNORE_ACCESS and it explicitly performs the page table access check before trying to unmap the page. So, at worst the reclaim will miss accesses in a very short window if we remove page table access check in unmapping code.
There is an unintented consequence of !(TTU_IGNORE_ACCESS) for the memcg reclaim. From memcg reclaim the page_referenced() only account the accesses from the processes which are in the same memcg of the target page but the unmapping code is considering accesses from all the processes, so, decreasing the effectiveness of memcg reclaim.
The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping code.
Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2") Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
336bf30e |
| 14-Nov-2020 |
Mike Kravetz <mike.kravetz@oracle.com> |
hugetlbfs: fix anon huge page migration race
Qian Cai reported the following BUG in [1]
LTP: starting move_pages12 BUG: unable to handle page fault for address: ffffffffffffffe0 ... RIP: 00
hugetlbfs: fix anon huge page migration race
Qian Cai reported the following BUG in [1]
LTP: starting move_pages12 BUG: unable to handle page fault for address: ffffffffffffffe0 ... RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63 Call Trace: rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864 try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763 migrate_pages+0x1005/0x1fb0 move_pages_and_store_status.isra.47+0xd7/0x1a0 __x64_sys_move_pages+0xa5c/0x1100 do_syscall_64+0x5f/0x310 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Hugh Dickins diagnosed this as a migration bug caused by code introduced to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for anon pages as the anon_vma_lock should be held in this case. Further analysis suggested that i_mmap_rwsem was not required to he held at all when calling try_to_unmap for anon pages as an anon page could never be part of a shared pmd mapping.
Discussion also revealed that the hack in hugetlb_page_mapping_lock_write to drop page lock and acquire i_mmap_rwsem is wrong. There is no way to keep mapping valid while dropping page lock.
This patch does the following:
- Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when calling try_to_unmap.
- Remove the hacky code in hugetlb_page_mapping_lock_write. The routine will now simply do a 'trylock' while still holding the page lock. If the trylock fails, it will return NULL. This could impact the callers:
- migration calling code will receive -EAGAIN and retry up to the hard coded limit (10).
- memory error code will treat the page as BUSY. This will force killing (SIGKILL) instead of SIGBUS any mapping tasks.
Do note that this change in behavior only happens when there is a race. None of the standard kernel testing suites actually hit this race, but it is possible.
[1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/ [2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/
Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization") Reported-by: Qian Cai <cai@lca.pw> Suggested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
4dc200ce |
| 17-Oct-2020 |
Miaohe Lin <linmiaohe@huawei.com> |
mm/migrate: avoid possible unnecessary process right check in kernel_move_pages()
There is no need to check if this process has the right to modify the specified process when they are same. And we
mm/migrate: avoid possible unnecessary process right check in kernel_move_pages()
There is no need to check if this process has the right to modify the specified process when they are same. And we could also skip the security hook call if a process is modifying its own pages. Add helper function to handle these.
Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Hongxiang Lou <louhongxiang@huawei.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Christopher Lameter <cl@linux.com> Link: https://lkml.kernel.org/r/20200819083331.19012-1-linmiaohe@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
79f5f8fa |
| 15-Oct-2020 |
Oscar Salvador <osalvador@suse.de> |
mm,hwpoison: rework soft offline for in-use pages
This patch changes the way we set and handle in-use poisoned pages. Until now, poisoned pages were released to the buddy allocator, trusting that t
mm,hwpoison: rework soft offline for in-use pages
This patch changes the way we set and handle in-use poisoned pages. Until now, poisoned pages were released to the buddy allocator, trusting that the checks that take place at allocation time would act as a safe net and would skip that page.
This has proved to be wrong, as we got some pfn walkers out there, like compaction, that all they care is the page to be in a buddy freelist.
Although this might not be the only user, having poisoned pages in the buddy allocator seems a bad idea as we should only have free pages that are ready and meant to be used as such.
Before explaining the taken approach, let us break down the kind of pages we can soft offline.
- Anonymous THP (after the split, they end up being 4K pages) - Hugetlb - Order-0 pages (that can be either migrated or invalited)
* Normal pages (order-0 and anon-THP)
- If they are clean and unmapped page cache pages, we invalidate then by means of invalidate_inode_page(). - If they are mapped/dirty, we do the isolate-and-migrate dance.
Either way, do not call put_page directly from those paths. Instead, we keep the page and send it to page_handle_poison to perform the right handling.
page_handle_poison sets the HWPoison flag and does the last put_page.
Down the chain, we placed a check for HWPoison page in free_pages_prepare, that just skips any poisoned page, so those pages do not end up in any pcplist/freelist.
After that, we set the refcount on the page to 1 and we increment the poisoned pages counter.
If we see that the check in free_pages_prepare creates trouble, we can always do what we do for free pages:
- wait until the page hits buddy's freelists - take it off, and flag it
The downside of the above approach is that we could race with an allocation, so by the time we want to take the page off the buddy, the page has been already allocated so we cannot soft offline it. But the user could always retry it.
* Hugetlb pages
- We isolate-and-migrate them
After the migration has been successful, we call dissolve_free_huge_page, and we set HWPoison on the page if we succeed. Hugetlb has a slightly different handling though.
While for non-hugetlb pages we cared about closing the race with an allocation, doing so for hugetlb pages requires quite some additional and intrusive code (we would need to hook in free_huge_page and some other places). So I decided to not make the code overly complicated and just fail normally if the page we allocated in the meantime.
We can always build on top of this.
As a bonus, because of the way we handle now in-use pages, we no longer need the put-as-isolation-migratetype dance, that was guarding for poisoned pages to end up in pcplists.
Signed-off-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Aristeu Rozanski <aris@ruivo.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
f1f4f3ab |
| 13-Oct-2020 |
Ralph Campbell <rcampbell@nvidia.com> |
mm/migrate: remove obsolete comment about device public
Device public memory never had an in tree consumer and was removed in commit 25b2995a35b6 ("mm: remove MEMORY_DEVICE_PUBLIC support"). Delete
mm/migrate: remove obsolete comment about device public
Device public memory never had an in tree consumer and was removed in commit 25b2995a35b6 ("mm: remove MEMORY_DEVICE_PUBLIC support"). Delete the obsolete comment.
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20200827190735.12752-2-rcampbell@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
42578891 |
| 13-Oct-2020 |
Ralph Campbell <rcampbell@nvidia.com> |
mm/migrate: remove cpages-- in migrate_vma_finalize()
The variable struct migrate_vma->cpages is only used in migrate_vma_setup(). There is no need to decrement it in migrate_vma_finalize() since i
mm/migrate: remove cpages-- in migrate_vma_finalize()
The variable struct migrate_vma->cpages is only used in migrate_vma_setup(). There is no need to decrement it in migrate_vma_finalize() since it is never checked.
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20200827190735.12752-1-rcampbell@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
6c5c7b9f |
| 25-Sep-2020 |
Zi Yan <ziy@nvidia.com> |
mm/migrate: correct thp migration stats
PageTransHuge returns true for both thp and hugetlb, so thp stats was counting both thp and hugetlb migrations. Exclude hugetlb migration by setting is_thp v
mm/migrate: correct thp migration stats
PageTransHuge returns true for both thp and hugetlb, so thp stats was counting both thp and hugetlb migrations. Exclude hugetlb migration by setting is_thp variable right.
Clean up thp handling code too when we are there.
Fixes: 1a5bae25e3cf ("mm/vmstat: add events for THP migration without split") Signed-off-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lkml.kernel.org/r/20200917210413.1462975-1-zi.yan@sent.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
f56753ac |
| 24-Sep-2020 |
Christoph Hellwig <hch@lst.de> |
bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
Replace the two negative flags that are always used together with a single positive flag that indicates the writeback capability ins
bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
Replace the two negative flags that are always used together with a single positive flag that indicates the writeback capability instead of two related non-capabilities. Also remove the pointless wrappers to just check the flag.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
show more ...
|
#
a333e3e7 |
| 18-Sep-2020 |
Hugh Dickins <hughd@google.com> |
mm: migration of hugetlbfs page skip memcg
hugetlbfs pages do not participate in memcg: so although they do find most of migrate_page_states() useful, it would be better if they did not call into me
mm: migration of hugetlbfs page skip memcg
hugetlbfs pages do not participate in memcg: so although they do find most of migrate_page_states() useful, it would be better if they did not call into mem_cgroup_migrate() - where Qian Cai reported that LTP's move_pages12 triggers the warning in Alex Shi's prospective commit "mm/memcg: warning on !memcg after readahead page charged".
Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxch.org> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Qian Cai <cai@lca.pw> Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008301359460.5954@eggly.anvils Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|
#
3d321bf8 |
| 04-Sep-2020 |
Ralph Campbell <rcampbell@nvidia.com> |
mm/migrate: preserve soft dirty in remove_migration_pte()
The code to remove a migration PTE and replace it with a device private PTE was not copying the soft dirty bit from the migration entry. Th
mm/migrate: preserve soft dirty in remove_migration_pte()
The code to remove a migration PTE and replace it with a device private PTE was not copying the soft dirty bit from the migration entry. This could lead to page contents not being marked dirty when faulting the page back from device private memory.
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Bharata B Rao <bharata@linux.ibm.com> Link: https://lkml.kernel.org/r/20200831212222.22409-3-rcampbell@nvidia.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
show more ...
|