#
33c3fc71 |
| 09-Sep-2015 |
Vladimir Davydov <vdavydov@parallels.com> |
mm: introduce idle page tracking
Knowing the portion of memory that is not used by a certain application or memory cgroup (idle memory) can be useful for partitioning the system efficiently, e.g. by setting memory cgroup limits appropriately. Currently, the only means to estimate the amount of idle memory provided by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the access bit for all pages mapped to a particular process by writing 1 to clear_refs, wait for some time, and then count smaps:Referenced. However, this method has two serious shortcomings:
- it does not count unmapped file pages
- it affects the reclaimer logic
To overcome these drawbacks, this patch introduces two new page flags, Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap. A page's Idle flag can only be set from userspace by setting the bit in /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page, and it is cleared whenever the page is accessed either through page tables (it is cleared in page_referenced() in this case) or using the read(2) system call (mark_page_accessed()). Thus, by setting the Idle flag for the pages of a particular workload (which can be found e.g. by reading /proc/PID/pagemap), waiting for some time to let the workload access its working set, and then reading the bitmap file, one can estimate the number of pages that are not used by the workload.
The Young page flag is used to avoid interference with the memory reclaimer. A page's Young flag is set whenever the Access bit of a page table entry pointing to the page is cleared by writing to the bitmap file. If page_referenced() is called on a Young page, it will add 1 to its return value, therefore concealing the fact that the Access bit was cleared.
Note, since there is no room for extra page flags on 32 bit, this feature uses extended page flags when compiled on 32 bit.
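The bit arithmetic behind the bitmap file can be sketched roughly as follows. This assumes the layout given in the kernel's idle page tracking documentation (an array of 8-byte words in native byte order, with PFN i stored at bit i % 64 of word i / 64), which the commit message itself does not spell out. The helper names are hypothetical.

```python
import struct

WORD_BITS = 64   # assumption: the bitmap is an array of 8-byte words
WORD_BYTES = 8


def bitmap_position(pfn):
    """Return (byte offset of the 8-byte word, bit index inside it) for a PFN."""
    return (pfn // WORD_BITS) * WORD_BYTES, pfn % WORD_BITS


def idle_pfns(word_bytes, first_pfn):
    """Decode one 8-byte word read from the bitmap into the idle PFNs it covers."""
    (word,) = struct.unpack("=Q", word_bytes)  # "=Q": native byte order, 64 bits
    return [first_pfn + bit for bit in range(WORD_BITS) if (word >> bit) & 1]
```

A monitoring tool would seek to the byte offset returned by `bitmap_position()`, write a word with the chosen bits set, wait, and then read the same word back and pass it to `idle_pfns()`.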
[akpm@linux-foundation.org: fix build] [akpm@linux-foundation.org: kpageidle requires an MMU] [akpm@linux-foundation.org: decouple from page-flags rework] Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
8334b962 |
| 08-Sep-2015 |
Minchan Kim <minchan@kernel.org> |
mm: /proc/pid/smaps:: show proportional swap share of the mapping
We want to know the per-process working set size for smart memory management in userland, and we use swap (e.g. zram) heavily to maximize memory efficiency, so the working set includes swap as well as RSS.
On such a system, if there are lots of shared anonymous pages (e.g. on Android), it is really hard to figure out exactly how much memory each process consumes (i.e. rss + swap).
This patch introduces SwapPss field on /proc/<pid>/smaps so we can get more exact workingset size per process.
Bongkyu tested it. Result is below.
1. 50M used swap
   SwapTotal: 461976 kB
   SwapFree: 411192 kB

   $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
   48236
   $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
   141184

2. 240M used swap
   SwapTotal: 461976 kB
   SwapFree: 216808 kB

   $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
   230315
   $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
   1387744
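The grep/awk pipelines above can be approximated in a few lines of Python. This is only a sketch: the function name is hypothetical and the sample text is made up for illustration, not taken from the test device.

```python
def sum_smaps_field(smaps_text, field):
    """Sum the kB values of every '<field>:' line in smaps-formatted text."""
    total = 0
    for line in smaps_text.splitlines():
        name, _, rest = line.partition(":")
        if name == field:
            total += int(rest.split()[0])  # first token after the colon is the kB value
    return total


# Illustrative sample only; real input would come from /proc/<pid>/smaps.
SAMPLE = """\
Swap:                 64 kB
SwapPss:              16 kB
Swap:                128 kB
SwapPss:              32 kB
"""
```

Matching on the exact field name keeps "Swap" and "SwapPss" apart, the same way the two grep patterns do.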
[akpm@linux-foundation.org: simplify kunmap_atomic() call] Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Bongkyu Kim <bongkyu.kim@lge.com> Tested-by: Bongkyu Kim <bongkyu.kim@lge.com> Cc: Hugh Dickins <hughd@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
77bb499b |
| 08-Sep-2015 |
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> |
pagemap: add mmap-exclusive bit for marking pages mapped only here
This patch sets bit 56 in pagemap if this page is mapped only once. It allows to detect exclusively used pages without exposing PFN:
present file exclusive state
0       0    0         non-present
1       1    0         file page mapped somewhere else
1       1    1         file page mapped only here
1       0    0         anon non-CoWed page (shared with parent/child)
1       0    1         anon CoWed page (or never forked)
CoWed pages in (MAP_FILE | MAP_PRIVATE) areas are anon in this context.
The mmap-exclusive bit doesn't reflect potential page sharing via the swapcache: a page could be mapped once but have several swap PTEs pointing to it. An application could detect that via the swap bit in the pagemap entry and touch that PTE via /proc/pid/mem to get real information.
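The state table above can be turned into a small decoder. Only bit 56 is stated in the commit message. The positions of the present bit (63) and the file bit (61) are assumptions taken from the kernel's pagemap documentation, and the function names are hypothetical.

```python
from typing import NamedTuple

PM_PRESENT = 1 << 63         # assumption: from the kernel pagemap documentation
PM_FILE = 1 << 61            # assumption: file-mapped or shared-anonymous page
PM_MMAP_EXCLUSIVE = 1 << 56  # bit 56, as stated in the commit message


class PageState(NamedTuple):
    present: bool
    file: bool
    exclusive: bool


def decode(entry):
    """Extract the three flags used by the state table from a 64-bit entry."""
    return PageState(
        bool(entry & PM_PRESENT),
        bool(entry & PM_FILE),
        bool(entry & PM_MMAP_EXCLUSIVE),
    )


def describe(entry):
    """Map a pagemap entry to the matching row of the state table."""
    state = decode(entry)
    if not state.present:
        return "non-present"
    if state.file:
        if state.exclusive:
            return "file page mapped only here"
        return "file page mapped somewhere else"
    if state.exclusive:
        return "anon CoWed page (or never forked)"
    return "anon non-CoWed page (shared with parent/child)"
```

None of this needs the PFN field, which is the point of the patch: the classification works for callers that are not allowed to see physical addresses.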
See http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com
Requested by Mark Williamson.
[akpm@linux-foundation.org: fix spello] Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1c90308e |
| 08-Sep-2015 |
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> |
pagemap: hide physical addresses from non-privileged users
This patch makes pagemap readable for normal users and hides physical addresses from them. For some use-cases PFN isn't required at all.
See http://lkml.kernel.org/r/1425935472-17949-1-git-send-email-kirill@shutemov.name
Fixes: ab676b7d6fbf ("pagemap: do not leak physical addresses to non-privileged userspace") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
356515e7 |
| 08-Sep-2015 |
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> |
pagemap: rework hugetlb and thp report
This patch moves pmd dissection out of the reporting loop: huge pages are reported as a bunch of normal pages with contiguous PFNs.
Add missing "FILE" bit in hugetlb vmas.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
deb94544 |
| 08-Sep-2015 |
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> |
pagemap: switch to the new format and do some cleanup
This patch removes the page-shift bits (scheduled for removal since 3.11) and completes the migration to the new bit layout. It also cleans up a messy macro.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
a06db751 |
| 08-Sep-2015 |
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> |
pagemap: check permissions and capabilities at open time
This patchset makes pagemap usable again in a safe way (after the row hammer bug it was made CAP_SYS_ADMIN-only). It restores access for non-privileged users but hides PFNs from them.
It also adds the 'map-exclusive' bit, which is set if the page is mapped only here: it helps in estimating the working set without exposing PFNs and allows distinguishing CoWed from non-CoWed private anonymous pages.
Second patch removes page-shift bits and completes migration to the new pagemap format: flags soft-dirty and mmap-exclusive are available only in the new format.
This patch (of 5):
This patch moves permission checks from pagemap_read() into pagemap_open().
Pointer to mm is saved in file->private_data. This reference pins only mm_struct itself. /proc/*/mem, maps, smaps already work in the same way.
See http://lkml.kernel.org/r/CA+55aFyKpWrt_Ajzh1rzp_GcwZ4=6Y=kOv8hBz172CFJp6L8Tg@mail.gmail.com
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Mark Williamson <mwilliamson@undo-software.com> Tested-by: Mark Williamson <mwilliamson@undo-software.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
16ba6f81 |
| 04-Sep-2015 |
Andrea Arcangeli <aarcange@redhat.com> |
userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP
These two flags get set in vma->vm_flags to tell the VM common code if the userfaultfd is armed and in which mode (only tracking missing faults, only tracking wrprotect faults, or both). If neither flag is set, the userfaultfd is not armed on the vma.
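The four states that the two flags encode can be modelled as below. This is a toy userspace model, not kernel code: the bit values are placeholders chosen for illustration and should not be read as the kernel's definitions, and `uffd_mode` is a hypothetical name.

```python
VM_UFFD_MISSING = 0x200   # placeholder bit value, for illustration only
VM_UFFD_WP = 0x1000       # placeholder bit value, for illustration only


def uffd_mode(vm_flags):
    """Describe how userfaultfd is armed on a VMA with the given flag word."""
    missing = bool(vm_flags & VM_UFFD_MISSING)
    wrprotect = bool(vm_flags & VM_UFFD_WP)
    if missing and wrprotect:
        return "missing and wrprotect faults"
    if missing:
        return "missing faults only"
    if wrprotect:
        return "wrprotect faults only"
    return "not armed"
```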
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
2726d566 |
| 19-Jun-2015 |
Miklos Szeredi <mszeredi@suse.cz> |
vfs: add seq_file_path() helper
Turn seq_path(..., &file->f_path, ...); into seq_file_path(..., file, ...);
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
Revision tags: v4.0, v4.0-rc7, v4.0-rc6, v4.0-rc5, v4.0-rc4 |
|
#
ab676b7d |
| 09-Mar-2015 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
pagemap: do not leak physical addresses to non-privileged userspace
As pointed out by a recent post [1] on exploiting DRAM physical imperfections, /proc/PID/pagemap exposes sensitive information which can be used to mount attacks.
This prevents anybody without CAP_SYS_ADMIN from reading the pagemap.
[1] http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
[ Eventually we might want to do anything more finegrained, but for now this is the simple model. - Linus ]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Seaborn <mseaborn@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
Revision tags: v4.0-rc3, v4.0-rc2, v4.0-rc1 |
|
#
198d1597 |
| 12-Feb-2015 |
Rafael Aquini <aquini@redhat.com> |
fs: proc: task_mmu: show page size in /proc/<pid>/numa_maps
The output of /proc/$pid/numa_maps is in terms of number of pages like anon=22 or dirty=54. Here's some output:
7f4680000000 default file=/hugetlb/bigfile anon=50 dirty=50 N0=50
7f7659600000 default file=/anon_hugepage\040(deleted) anon=50 dirty=50 N0=50
7fff8d425000 default stack anon=50 dirty=50 N0=50
It looks like we have a stack and a couple of anonymous hugetlbfs areas which all use the same amount of memory. They don't.
The 'bigfile' uses 1GB pages and takes up ~50GB of space. The anon_hugepage uses 2MB pages and takes up ~100MB of space while the stack uses normal 4k pages. You can go over to smaps to figure out what the page size _really_ is with KernelPageSize or MMUPageSize. But, I think this is a pretty nasty and counterintuitive interface as it stands.
This patch introduces a 'kernelpagesize_kB' line element to the /proc/<pid>/numa_maps report file in order to help identify the size of the pages that back the memory areas mapped by a given task. This is especially useful for differentiating between HUGE and GIGANTIC page backed VMAs.
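With the page size on the same line, the page counts can be converted to bytes without consulting smaps. The sketch below assumes the new element appears as a `kernelpagesize_kB=<n>` field alongside the existing `key=value` fields. The function names are hypothetical, and the exact position of the field in real output is not taken from the commit message.

```python
def parse_numa_maps_line(line):
    """Split a numa_maps line into (start address, policy, dict of fields)."""
    addr, policy, *fields = line.split()
    attrs = {}
    for field in fields:
        key, sep, value = field.partition("=")
        attrs[key] = value if sep else True  # bare words such as 'stack'
    return int(addr, 16), policy, attrs


def anon_bytes(line):
    """Bytes of anonymous memory reported by one numa_maps line."""
    _, _, attrs = parse_numa_maps_line(line)
    pages = int(attrs.get("anon", 0))
    return pages * int(attrs["kernelpagesize_kB"]) * 1024
```

For the three mappings described above, 50 pages at 1048576 kB, 2048 kB and 4 kB per page come to 50 GiB, 100 MiB and 200 KiB respectively, which is the difference the bare page counts hide.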
This patch is based on Dave Hansen's proposal and reviewers' follow-ups taken from the following discussion threads:
* https://lkml.org/lkml/2011/9/21/454
* https://lkml.org/lkml/2014/12/20/66
Signed-off-by: Rafael Aquini <aquini@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Hansen <dave.hansen@intel.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
695f0559 |
| 12-Feb-2015 |
Petr Cermak <petrcermak@chromium.org> |
fs/proc/task_mmu.c: add user-space support for resetting mm->hiwater_rss (peak RSS)
Peak resident size of a process can be reset back to the process's current rss value by writing "5" to /proc/pid/clear_refs. The driving use-case for this would be getting the peak RSS value, which can be retrieved from the VmHWM field in /proc/pid/status, per benchmark iteration or test scenario.
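A per-iteration measurement along those lines might look like the sketch below. The function names are hypothetical, and `reset_peak_rss()` only works on a kernel that carries this patch.

```python
def reset_peak_rss(pid="self"):
    """Reset VmHWM to the current RSS by writing "5" to clear_refs."""
    with open(f"/proc/{pid}/clear_refs", "w") as f:
        f.write("5")


def parse_vm_hwm_kb(status_text):
    """Return the VmHWM value in kB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmHWM:"):
            return int(line.split()[1])
    return None


def peak_rss_kb(pid="self"):
    """Read the current peak RSS (VmHWM) of a process in kB."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_hwm_kb(f.read())
```

A benchmark harness would call `reset_peak_rss()` before each iteration and `peak_rss_kb()` after it, so that each reading reflects only that iteration's peak.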
[akpm@linux-foundation.org: clarify behaviour in documentation] Signed-off-by: Petr Cermak <petrcermak@chromium.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Primiano Tucci <primiano@chromium.org> Cc: Petr Cermak <petrcermak@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
7d5b3bfa |
| 11-Feb-2015 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
mm: /proc/pid/clear_refs: avoid split_huge_page()
Currently pagewalker splits all THP pages on any clear_refs request. It's not necessary. We can handle this on PMD level.
One side effect is that soft dirty will potentially see more dirty memory, since we will mark whole THP page dirty at once.
Sanity checked with CRIU test suite. More testing is required.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
48684a65 |
| 11-Feb-2015 |
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> |
mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)
walk_page_range() silently skips vma having VM_PFNMAP set, which leads to undesirable behaviour at client end (who called walk_page_range). For example for pagemap_read(), when no callbacks are called against VM_PFNMAP vma, pagemap_read() may prepare pagemap data for next virtual address range at wrong index. That could confuse and/or break userspace applications.
This patch avoids the misbehavior caused by vma(VM_PFNMAP) as follows:
- for pagemap_read(), which has its own ->pte_hole(), call ->pte_hole() over vma(VM_PFNMAP),
- for clear_refs and queue_pages, which have their own ->test_walk(), just return 1 and skip vma(VM_PFNMAP). This is no problem because these are not interested in hole regions,
- for other callers, just skip the vma(VM_PFNMAP) as the default behavior.
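The index problem in the pagemap_read() case can be illustrated with a toy model. This is not the kernel's walker: VMAs are reduced to (page count, is_pfnmap) pairs, the emitted values are stand-ins for pagemap entries, and the function name is hypothetical.

```python
def walk(vmas, skip_pfnmap_silently):
    """Emit one value per page for a list of (num_pages, is_pfnmap) VMAs."""
    out = []
    for index, (pages, is_pfnmap) in enumerate(vmas):
        if is_pfnmap:
            if skip_pfnmap_silently:
                continue             # old behaviour: no callback, nothing emitted
            out.extend([0] * pages)  # new behaviour: hole callback emits empty entries
        else:
            out.extend([index + 1] * pages)  # stand-in for real per-page data
    return out


VMAS = [(2, False), (3, True), (1, False)]
old = walk(VMAS, skip_pfnmap_silently=True)
new = walk(VMAS, skip_pfnmap_silently=False)
```

In the old output the last VMA's page lands at index 2 instead of index 5, so a reader that indexes by virtual page number sees data for the wrong address. The new output keeps one slot per page.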
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Shiraz Hashim <shashim@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
d85f4d6d |
| 11-Feb-2015 |
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> |
numa_maps: remove numa_maps->vma
pagewalk.c can handle vma in itself, so we don't have to pass vma via walk->private. And show_numa_map() walks pages on vma basis, so using walk_page_vma() is preferable.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
632fd60f |
| 11-Feb-2015 |
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> |
numa_maps: fix typo in gather_hugetbl_stats
Just doing s/gather_hugetbl_stats/gather_hugetlb_stats/g, this makes code grep-friendly.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
f995ece2 |
| 11-Feb-2015 |
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> |
pagemap: use walk->vma instead of calling find_vma()
Page table walker has the information of the current vma in mm_walk, so we don't have to call find_vma() in each pagemap_(pte|hugetlb)_range() call any longer. Currently pagemap_pte_range() does vma loop itself, so this patch reduces many lines of code.
NULL-vma check is omitted because we assume that we never run these callbacks on any address outside vma. And even if it were broken, NULL pointer dereference would be detected, so we can get enough information for debugging.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
5c64f52a |
| 11-Feb-2015 |
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> |
clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk()
clear_refs_write() has some prechecks to determine if we really walk over a given vma. Now we have a test_walk() callback to filter vmas, so let's utilize it.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
14eb6fdd |
| 11-Feb-2015 |
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> |
smaps: remove mem_size_stats->vma and use walk_page_vma()
pagewalk.c can handle vma in itself, so we don't have to pass vma via walk->private. And show_smap() walks pages on vma basis, so using walk_page_vma() is preferable.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
05fbf357 |
| 11-Feb-2015 |
Konstantin Khlebnikov <khlebnikov@yandex-team.ru> |
proc/pagemap: walk page tables under pte lock
Lockless access to pte in pagemap_pte_range() might race with page migration and trigger BUG_ON(!PageLocked()) in migration_entry_to_page():
CPU A (pagemap)                         CPU B (migration)

                                        lock_page()
                                        try_to_unmap(page, TTU_MIGRATION...)
                                             make_migration_entry()
                                             set_pte_at()
<read *pte>
pte_to_pagemap_entry()
                                        remove_migration_ptes()
                                        unlock_page()
    if(is_migration_entry())
        migration_entry_to_page()
            BUG_ON(!PageLocked(page))
Also lockless read might be non-atomic if pte is larger than wordsize. Other pte walkers (smaps, numa_maps, clear_refs) already lock ptes.
Fixes: 052fb0d635df ("proc: report file/anon bit in /proc/pid/pagemap") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> [3.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
dc6c9a35 |
| 11-Feb-2015 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
mm: account pmd page tables to the process
Dave noticed that an unprivileged process can allocate a significant amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the oom-killer and the memory cgroup. The trick is to allocate a lot of PMD page tables. The Linux kernel doesn't account PMD tables to the process, only PTE tables.
The test case below uses a few tricks to allocate a lot of PMD page tables while keeping VmRSS and VmPTE low. oom_score for the process will be 0.
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#define PUD_SIZE (1UL << 30)
#define PMD_SIZE (1UL << 21)

#define NR_PUD 130000

int main(void)
{
	char *addr = NULL;
	unsigned long i;

	/* Disable THP so that touching a page needs a regular PTE table. */
	prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
	for (i = 0; i < NR_PUD ; i++) {
		/* One PUD-sized mapping per iteration, each needing its own PMD table. */
		addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
				MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
		if (addr == MAP_FAILED) {
			perror("mmap");
			break;
		}
		*addr = 'x';
		/* Unmap the touched 2 MiB and map it back untouched. */
		munmap(addr, PMD_SIZE);
		addr = mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
				MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
		if (addr == MAP_FAILED)
			perror("re-mmap"), exit(1);
	}
	printf("PID %d consumed %lu KiB in PMD page tables\n",
			getpid(), i * 4096 >> 10);
	return pause();
}
The patch addresses the issue by accounting PMD tables to the process the same way we account PTE tables.
The main places where PMD tables are accounted are __pmd_alloc() and free_pmd_range(). But there are a few corner cases:
- HugeTLB can share PMD page tables. The patch handles this by accounting the table to all processes that share it.
- x86 PAE pre-allocates few PMD tables on fork.
- Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity check on exit(2).
Accounting only happens on configuration where PMD page table's level is present (PMD is not folded). As with nr_ptes we use per-mm counter. The counter value is used to calculate baseline for badness score by oom-killer.
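The ">500 MiB" figure follows from the numbers in the test program: one 4096-byte PMD table per iteration, 130000 iterations. A quick check of that arithmetic, mirroring the program's `i * 4096 >> 10` expression (variable names here are illustrative):

```python
PAGE_SIZE = 4096   # bytes per PMD table page, as used in the program's printf
NR_PUD = 130000    # loop count from the test program


def pmd_table_bytes(nr_tables, page_size=PAGE_SIZE):
    """Memory held by nr_tables page-table pages, in bytes."""
    return nr_tables * page_size


total = pmd_table_bytes(NR_PUD)
kib = total >> 10          # what the test program prints
mib = total / (1 << 20)
```

If every iteration succeeds, the program reports 520000 KiB, which is a little under 508 MiB held in page tables that were previously invisible to VmRSS and VmPTE.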
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: David Rientjes <rientjes@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
1da4b35b |
| 10-Feb-2015 |
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> |
proc: drop handling non-linear mappings
We had to handle non-linear mappings for /proc/PID/{smaps,clear_refs}, and that code is unused now. Let's drop it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
Revision tags: v3.19, v3.19-rc7, v3.19-rc6, v3.19-rc5, v3.19-rc4, v3.19-rc3, v3.19-rc2, v3.19-rc1 |
|
#
c164e038 |
| 10-Dec-2014 |
Kirill A. Shutemov <kirill@shutemov.name> |
mm: fix huge zero page accounting in smaps report
Like the small zero page, the huge zero page should not be accounted in the smaps report as a normal page.
For small pages we rely on vm_normal_page() to filter out the zero page, but vm_normal_page() is not designed to handle pmds. We only get here due to a hackish cast of pmd to pte in smaps_pte_range() -- the pte and pmd formats are not necessarily compatible on each and every architecture.
Let's add separate codepath to handle pmds. follow_trans_huge_pmd() will detect huge zero page for us.
We would need pmd_dirty() helper to do this properly. The patch adds it to THP-enabled architectures which don't yet have one.
[akpm@linux-foundation.org: use do_div to fix 32-bit build] Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Tested-by: Fengwei Yin <yfw.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
Revision tags: v3.18, v3.18-rc7, v3.18-rc6, v3.18-rc5 |
|
#
4aae7e43 |
| 14-Nov-2014 |
Qiaowei Ren <qiaowei.ren@intel.com> |
x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific
MPX-enabled applications using large swaths of memory can potentially have large numbers of bounds tables in process address space to save bounds information. These tables can take up huge swaths of memory (as much as 80% of the memory on the system) even if we clean them up aggressively. In the worst-case scenario, the tables can be 4x the size of the data structure being tracked. IOW, a 1-page structure can require 4 bounds-table pages.
Being this huge, our expectation is that folks using MPX are going to be keen on figuring out how much memory is being dedicated to it. So we need a way to track memory use for MPX.
If we want to specifically track MPX VMAs we need to be able to distinguish them from normal VMAs, and keep them from getting merged with normal VMAs. A new VM_ flag set only on MPX VMAs does both of those things. With this flag, MPX bounds-table VMAs can be distinguished from other VMAs, and userspace can also walk /proc/$pid/smaps to get memory usage for MPX.
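Walking smaps for MPX memory could look roughly like the sketch below. It assumes that MPX VMAs show up with an "mp" mnemonic on the VmFlags line and that VmFlags is the last line of each smaps entry. Neither detail is stated in the commit message, so treat both as assumptions. The function name is hypothetical.

```python
def flagged_kb(smaps_text, flag="mp", field="Rss"):
    """Sum `field` (in kB) over every VMA whose VmFlags line contains `flag`."""
    total = 0
    value = 0
    prefix = field + ":"
    for line in smaps_text.splitlines():
        if line.startswith(prefix):
            value = int(line.split()[1])   # remember this VMA's value
        elif line.startswith("VmFlags:"):
            if flag in line.split()[1:]:   # assumed: VmFlags closes the entry
                total += value
            value = 0
    return total
```

Passing `field="Size"` instead gives the reserved address space rather than the resident part.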
In addition to this flag, we also introduce a special ->vm_ops specific to MPX VMAs (see the patch "add MPX specific mmap interface"), but currently different ->vm_ops do not by themselves prevent VMA merging, so we still need this flag.
We understand that VM_ flags are scarce and are open to other options.
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: linux-mm@kvack.org Cc: linux-mips@linux-mips.org Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20141114151825.565625B3@viggo.jf.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
Revision tags: v3.18-rc4, v3.18-rc3, v3.18-rc2, v3.18-rc1 |
|
#
64e45507 |
| 13-Oct-2014 |
Peter Feiner <pfeiner@google.com> |
mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
For VMAs that don't want write notifications, PTEs created for read faults have their write bit set. If the read fault happens after VM_SOFTDIRTY is cleared, then the PTE's softdirty bit will remain clear after subsequent writes.
Here's a simple code snippet to demonstrate the bug:
    /* soft_dirty() is a helper that is not shown here; it reports the page's soft-dirty state */
    char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
    assert(*m == '\0');     /* new PTE allows write access */
    assert(!soft_dirty(m));
    *m = 'x';               /* should dirty the page */
    assert(soft_dirty(m));  /* fails */
With this patch, write notifications are enabled when VM_SOFTDIRTY is cleared. Furthermore, to avoid unnecessary faults, write notifications are disabled when VM_SOFTDIRTY is set.
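The snippet above relies on a soft_dirty() helper that the commit message does not define. One way such a check could be written from userspace is sketched below. It assumes the soft-dirty state is bit 55 of the page's 64-bit /proc/PID/pagemap entry, as described in the kernel's soft-dirty documentation, and all names are hypothetical.

```python
import struct

PM_SOFT_DIRTY = 1 << 55   # assumption: from the kernel soft-dirty documentation
ENTRY_BYTES = 8           # one 64-bit pagemap entry per virtual page


def pagemap_offset(vaddr, page_size):
    """File offset in /proc/<pid>/pagemap of the entry covering vaddr."""
    return (vaddr // page_size) * ENTRY_BYTES


def is_soft_dirty(entry):
    """True if the soft-dirty bit is set in a pagemap entry."""
    return bool(entry & PM_SOFT_DIRTY)


def read_entry(pid, vaddr, page_size):
    """Read the raw pagemap entry for one virtual address of a process."""
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek(pagemap_offset(vaddr, page_size))
        (entry,) = struct.unpack("=Q", f.read(ENTRY_BYTES))
    return entry
```

A checker would write "4" to clear_refs, let the workload run, and then call `is_soft_dirty(read_entry(pid, vaddr, page_size))` for each page of interest.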
As a side effect of enabling and disabling write notifications with care, this patch fixes a bug in mprotect where vm_page_prot bits set by drivers were zapped on mprotect. An analogous bug was fixed in mmap by commit c9d0bf241451 ("mm: uncached vma support with writenotify").
Signed-off-by: Peter Feiner <pfeiner@google.com> Reported-by: Peter Feiner <pfeiner@google.com> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Jamie Liu <jamieliu@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|