==================
Idle Page Tracking
==================

Motivation
==========

The idle page tracking feature makes it possible to track which memory pages
are being accessed by a workload and which are idle. This information can be
useful for estimating the workload's working set size, which, in turn, can be
taken into account when configuring the workload parameters, setting memory
cgroup limits, or deciding where to place the workload within a compute
cluster.

It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.

.. _user_api:

User API
========

The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
Currently, it consists of a single read-write file,
``/sys/kernel/mm/page_idle/bitmap``.

The file implements a bitmap where each bit corresponds to a memory page. The
bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
mapped to bit #i%64 of array element #i/64. Byte order is native. When a bit is
set, the corresponding page is idle.

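The mapping above can be sketched in Python as follows. This is a minimal
illustration rather than a reference implementation; the helper names
``pfn_to_bit`` and ``is_idle`` are hypothetical and not part of any kernel
interface.

.. code-block:: python

    import struct

    WORD_BYTES = 8   # the bitmap is an array of 8-byte integers
    WORD_BITS = 64   # each array element covers 64 consecutive PFNs

    def pfn_to_bit(pfn):
        """Return (byte offset of the array element, bit index) for a PFN."""
        return (pfn // WORD_BITS) * WORD_BYTES, pfn % WORD_BITS

    def is_idle(word_bytes, pfn):
        """Test the bit for `pfn` in the 8-byte element read at its offset."""
        # "=" selects native byte order with standard sizes ("Q" is 8 bytes).
        (word,) = struct.unpack("=Q", word_bytes)
        return bool((word >> (pfn % WORD_BITS)) & 1)

For example, PFN 130 lives in array element 2, that is, at byte offset 16 of
the file, and is represented by bit 2 of that element.
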
A page is considered idle if it has not been accessed since it was marked idle
(for more details on what "accessed" actually means see the :ref:`Implementation
Details <impl_details>` section). To mark a page idle, one has to set the bit
corresponding to the page by writing to the file. A value written to the file
is OR-ed with the current bitmap value.

Only accesses to user memory pages are tracked. These are pages mapped to a
process address space, page cache and buffer pages, and swap cache pages. For
other page types (e.g. SLAB pages) an attempt to mark a page idle is silently
ignored, and hence such pages are never reported idle.

For huge pages the idle flag is set only on the head page, so one has to read
``/proc/kpageflags`` in order to correctly count idle huge pages.

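One possible way to account for this is sketched below. It is an
interpretation, not a description of what any existing tool does: it assumes
that every tail page should be counted with the idle state of its head page,
and that the caller supplies per-PFN idle bits and ``/proc/kpageflags`` words
for one contiguous PFN range. The function name ``count_idle_pages`` is
hypothetical; the bit numbers are the KPF_COMPOUND_HEAD and KPF_COMPOUND_TAIL
values listed in Documentation/admin-guide/mm/pagemap.rst.

.. code-block:: python

    KPF_COMPOUND_HEAD = 1 << 15
    KPF_COMPOUND_TAIL = 1 << 16

    def count_idle_pages(idle_bits, kpageflags):
        """Count idle pages in a contiguous PFN range.

        idle_bits  -- one truthy/falsy value per PFN, taken from the bitmap
        kpageflags -- one 64-bit flags word per PFN, from /proc/kpageflags
        """
        total = 0
        head_idle = False  # idle state of the most recent head page seen
        for idle, flags in zip(idle_bits, kpageflags):
            if flags & KPF_COMPOUND_TAIL:
                # Assumption: a tail page inherits the state of its head.
                idle = head_idle
            elif flags & KPF_COMPOUND_HEAD:
                head_idle = bool(idle)
            total += bool(idle)
        return total

If the range starts in the middle of a huge page, the leading tail pages are
counted as not idle in this sketch, because their head page was never seen.
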
Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
-EINVAL if the read/write does not start on an 8-byte boundary, or if the size
of the read/write is not a multiple of 8 bytes. Writing to this file beyond
max PFN will return -ENXIO.

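Because written values are OR-ed with the current bitmap, a write that covers
whole 8-byte elements can leave neighbouring pages untouched by writing zero
bits for them. The following sketch builds such an aligned payload for an
inclusive PFN range; ``build_idle_write`` is a hypothetical helper name, not
an existing API.

.. code-block:: python

    import struct

    def build_idle_write(first_pfn, last_pfn):
        """Return (file offset, payload) marking first_pfn..last_pfn idle."""
        first_word, last_word = first_pfn // 64, last_pfn // 64
        words = []
        for w in range(first_word, last_word + 1):
            # Clamp the requested range to the 64 PFNs covered by element w.
            lo = max(first_pfn, w * 64) - w * 64
            hi = min(last_pfn, w * 64 + 63) - w * 64
            words.append(((1 << (hi - lo + 1)) - 1) << lo)
        # The offset is a multiple of 8 and the payload length is a multiple
        # of 8 bytes, as the file requires; byte order is native ("=").
        return first_word * 8, struct.pack("=%dQ" % len(words), *words)

With a file descriptor opened for writing on the bitmap file, the result could
then be passed to ``os.pwrite(fd, payload, offset)``.
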
That said, in order to estimate the number of pages that are not used by a
workload, one should:

 1. Mark all the workload's pages as idle by setting corresponding bits in
    ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
    ``/proc/pid/pagemap`` if the workload is represented by a process, or by
    filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
    is placed in a memory cgroup.

 2. Wait until the workload accesses its working set.

 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
    To ignore certain types of pages, e.g. mlocked pages since they are not
    reclaimable, filter them out using ``/proc/kpageflags``.

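The data handling in steps 1 and 3 can be sketched as follows for the
single-process case. The sketch operates on bytes that have already been read
from ``/proc/pid/pagemap`` and from the bitmap file, and it assumes the
pagemap entry layout given in Documentation/admin-guide/mm/pagemap.rst (bit 63
is "page present", bits 0-54 hold the PFN). For simplicity it also assumes the
bitmap data starts at file offset 0. Both function names are hypothetical.

.. code-block:: python

    import struct

    PM_PRESENT = 1 << 63          # pagemap: page is present in RAM
    PM_PFN_MASK = (1 << 55) - 1   # pagemap: bits 0-54 hold the PFN

    def pfns_from_pagemap(raw):
        """Decode raw pagemap bytes into the PFNs of present pages."""
        entries = struct.unpack("=%dQ" % (len(raw) // 8), raw)
        return [e & PM_PFN_MASK for e in entries if e & PM_PRESENT]

    def count_still_idle(bitmap, pfns):
        """Count the given PFNs whose bit is still set in `bitmap`."""
        idle = 0
        for pfn in pfns:
            # Element pfn // 64 starts at byte offset (pfn // 64) * 8.
            (word,) = struct.unpack_from("=Q", bitmap, (pfn // 64) * 8)
            idle += (word >> (pfn % 64)) & 1
        return idle

Note that reading real PFNs from ``/proc/pid/pagemap`` requires sufficient
privileges; see the pagemap documentation for details.
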
The page-types tool in the tools/mm directory can be used to assist in this.
If the tool is run initially with the appropriate option, it will mark all the
queried pages as idle. Subsequent runs of the tool can then show which pages
have had their idle flag cleared in the interim.

See Documentation/admin-guide/mm/pagemap.rst for more information about
``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``.

.. _impl_details:

Implementation Details
======================

The kernel internally keeps track of accesses to user memory pages in order to
reclaim unreferenced pages first under memory shortage. A page is considered
referenced if it has been recently accessed via a process address space, in
which case one or more PTEs it is mapped to will have the Accessed bit set, or
if it has been marked accessed explicitly by the kernel (see
mark_page_accessed()). The latter happens when:

 - a userspace process reads or writes a page using a system call (e.g. read(2)
   or write(2))

 - a page that is used for storing filesystem buffers is read or written,
   because a process needs filesystem metadata stored in it (e.g. when listing
   a directory tree)

 - a page is accessed by a device driver using get_user_pages()

When a dirty page is written to swap or disk as a result of memory reclaim or
exceeding the dirty memory limit, it is not marked referenced.

The idle memory tracking feature adds a new page flag, the Idle flag. This flag
is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
:ref:`User API <user_api>` section), and cleared automatically whenever a page
is referenced as defined above.

When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
mapped to, otherwise we will not be able to detect accesses to the page coming
from a process address space. To avoid interference with the reclaimer, which,
as noted above, uses the Accessed bit to promote actively referenced pages, one
more page flag is introduced, the Young flag. When the PTE Accessed bit is
cleared as a result of setting or updating a page's Idle flag, the Young flag
is set on the page. The reclaimer treats the Young flag as an extra PTE
Accessed bit and therefore will consider such a page as referenced.

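The interplay between the PTE Accessed bit and the Idle and Young flags can be
illustrated with a toy model. This is a simplified sketch of the behaviour
described above, not kernel code: it models a single PTE per page, and the
class and method names are invented for illustration.

.. code-block:: python

    class Page:
        """Toy model of one page mapped by a single PTE."""

        def __init__(self):
            self.pte_accessed = False  # PTE Accessed bit
            self.idle = False          # Idle page flag
            self.young = False         # Young page flag

        def touch(self):
            """Access through the mapping: the Accessed bit gets set."""
            self.pte_accessed = True

        def _harvest_pte(self):
            # Clearing a set Accessed bit on behalf of idle tracking means
            # the page was referenced: it is no longer idle, and Young
            # preserves the information for the reclaimer.
            if self.pte_accessed:
                self.pte_accessed = False
                self.idle = False
                self.young = True

        def mark_idle(self):
            """Model of setting the page's bit in the bitmap."""
            self._harvest_pte()
            self.idle = True

        def is_idle(self):
            """Model of reading the page's bit from the bitmap."""
            self._harvest_pte()
            return self.idle

        def reclaimer_referenced(self):
            """Reclaimer's view: Young counts as an extra Accessed bit."""
            if self.pte_accessed:
                self.idle = False  # a referenced page is not idle
            referenced = self.pte_accessed or self.young
            self.pte_accessed = False
            self.young = False
            return referenced

In this model, marking a recently touched page idle clears its Accessed bit
but sets Young, so a subsequent reclaimer check still sees the page as
referenced while the page stays idle until it is touched again.
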
Since the idle memory tracking feature is based on the memory reclaimer logic,
it only works with pages that are on an LRU list; other pages are silently
ignored. That means it will ignore a user memory page if it is isolated, but
since there are usually not many of them, it should not affect the overall
result noticeably. In order not to stall scanning of the idle page bitmap,
locked pages may be skipped too.