==================
Idle Page Tracking
==================

Motivation
==========

The idle page tracking feature makes it possible to track which memory pages
are being accessed by a workload and which are idle. This information can be
useful for estimating the workload's working set size, which, in turn, can be
taken into account when configuring the workload parameters, setting memory
cgroup limits, or deciding where to place the workload within a compute
cluster.

It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.

.. _user_api:

User API
========

The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
Currently, it consists of a single read-write file,
``/sys/kernel/mm/page_idle/bitmap``.

The file implements a bitmap where each bit corresponds to a memory page. The
bitmap is represented by an array of 8-byte integers in native byte order, and
the page at PFN #i is mapped to bit #i%64 of array element #i/64. When a bit is
set, the corresponding page is idle.

A page is considered idle if it has not been accessed since it was marked idle
(for more details on what "accessed" actually means see the :ref:`Implementation
Details <impl_details>` section).
To mark a page idle, set the bit corresponding to the page by writing to the
file. A value written to the file is OR-ed with the current bitmap value.

Only accesses to user memory pages are tracked. These are pages mapped to a
process address space, page cache and buffer pages, and swap cache pages. For
other page types (e.g. SLAB pages) an attempt to mark a page idle is silently
ignored, and hence such pages are never reported idle.

For huge pages the idle flag is set only on the head page, so one has to read
``/proc/kpageflags`` in order to correctly count idle huge pages.

Reading from or writing to ``/sys/kernel/mm/page_idle/bitmap`` will return
-EINVAL if the read/write does not start on an 8-byte boundary, or if the size
of the read/write is not a multiple of 8 bytes. Writing to this file beyond max
PFN will return -ENXIO.

That said, in order to estimate the number of pages that are not used by a
workload one should:

 1. Mark all the workload's pages as idle by setting corresponding bits in
    ``/sys/kernel/mm/page_idle/bitmap``. The pages can be found by reading
    ``/proc/pid/pagemap`` if the workload is represented by a process, or by
    filtering out alien pages using ``/proc/kpagecgroup`` in case the workload
    is placed in a memory cgroup.

 2. Wait until the workload accesses its working set.

 3. Read ``/sys/kernel/mm/page_idle/bitmap`` and count the number of bits set.
    If one wants to ignore certain types of pages, e.g. mlocked pages since they
    are not reclaimable, those pages can be filtered out using
    ``/proc/kpageflags``.

The page-types tool in the tools/mm directory can be used to assist in this.
If the tool is run initially with the appropriate option, it will mark all the
queried pages as idle. Subsequent runs of the tool can then show which pages have
had their idle flag cleared in the interim.

See Documentation/admin-guide/mm/pagemap.rst for more information about
``/proc/pid/pagemap``, ``/proc/kpageflags``, and ``/proc/kpagecgroup``.

.. _impl_details:

Implementation Details
======================

The kernel internally keeps track of accesses to user memory pages in order to
reclaim unreferenced pages first on memory shortage conditions. A page is
considered referenced if it has been recently accessed via a process address
space, in which case one or more PTEs it is mapped to will have the Accessed bit
set, or marked accessed explicitly by the kernel (see mark_page_accessed()). The
latter happens when:

 - a userspace process reads or writes a page using a system call (e.g.
   read(2) or write(2))

 - a page that is used for storing filesystem buffers is read or written,
   because a process needs filesystem metadata stored in it (e.g. when listing
   a directory tree)

 - a page is accessed by a device driver using get_user_pages()

When a dirty page is written to swap or disk as a result of memory reclaim or
exceeding the dirty memory limit, it is not marked referenced.

The idle memory tracking feature adds a new page flag, the Idle flag. This flag
is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
:ref:`User API <user_api>`
section), and cleared automatically whenever a page is referenced as defined
above.

When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
mapped to, otherwise we will not be able to detect accesses to the page coming
from a process address space. To avoid interference with the reclaimer, which,
as noted above, uses the Accessed bit to promote actively referenced pages, one
more page flag is introduced, the Young flag. When the PTE Accessed bit is
cleared as a result of setting or updating a page's Idle flag, the Young flag
is set on the page. The reclaimer treats the Young flag as an extra PTE
Accessed bit and therefore will consider such a page as referenced.
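
The interaction between the PTE Accessed bit, the Idle flag and the Young flag
can be illustrated with a small toy model. This is only a sketch under
simplifying assumptions (a single PTE per page, no locking), it is not kernel
code, and every name in it (``Page``, ``touch``, ``mark_idle``, ``is_idle``,
``reclaimer_sees_referenced``) is hypothetical:

.. code-block:: python

   class Page:
       """Toy page with a single PTE; all names here are hypothetical."""

       def __init__(self):
           self.pte_accessed = False  # stands in for the PTE Accessed bit
           self.idle = False          # the Idle page flag
           self.young = False         # the Young page flag


   def touch(page):
       """Model an access through a process address space."""
       page.pte_accessed = True


   def _clear_pte_refs(page):
       """Clear the Accessed bit on behalf of idle tracking.

       If the bit was set, the page has been referenced: it stops being
       idle, and Young is set so the reclaimer still learns of the access.
       """
       if page.pte_accessed:
           page.pte_accessed = False
           page.idle = False
           page.young = True


   def mark_idle(page):
       """Model setting the page's bit by writing to the bitmap file."""
       _clear_pte_refs(page)
       page.idle = True


   def is_idle(page):
       """Model reading the page's bit from the bitmap file."""
       if page.idle:
           _clear_pte_refs(page)
       return page.idle


   def reclaimer_sees_referenced(page):
       """Model the reclaimer: Young counts as an extra Accessed bit."""
       referenced = page.pte_accessed or page.young
       if page.pte_accessed:
           page.idle = False  # a referenced page is no longer idle
       page.pte_accessed = False
       page.young = False
       return referenced

In this model, marking a recently touched page idle clears its Accessed bit but
leaves Young set, so a subsequent reclaimer pass still treats the page as
referenced.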

Since the idle memory tracking feature is based on the memory reclaimer logic,
it only works with pages that are on an LRU list; other pages are silently
ignored. That means it will ignore a user memory page if it is isolated, but
since there are usually not many of them, it should not affect the overall
result noticeably. In order not to stall scanning of the idle page bitmap,
locked pages may be skipped too.
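
As a closing illustration of the bitmap layout described in the User API
section (PFN #i maps to bit #i%64 of 8-byte array element #i/64, native byte
order), the following Python sketch computes file offsets and encodes or decodes
bitmap words. The helper names (``bitmap_location``, ``encode_idle_words``,
``decode_idle_pfns``) are made up for this example and are not part of any
kernel or tool interface; the sketch only manipulates byte strings and does not
touch the bitmap file itself:

.. code-block:: python

   import struct

   WORD_BYTES = 8   # the bitmap is an array of 8-byte integers
   WORD_BITS = 64


   def bitmap_location(pfn):
       """Return (byte offset of the array element, bit index) for a PFN."""
       return (pfn // WORD_BITS) * WORD_BYTES, pfn % WORD_BITS


   def encode_idle_words(pfns):
       """Return {byte offset: packed 8-byte word} with one bit set per PFN."""
       words = {}
       for pfn in pfns:
           offset, bit = bitmap_location(pfn)
           words[offset] = words.get(offset, 0) | (1 << bit)
       # "=Q" is an unsigned 8-byte integer in native byte order.
       return {off: struct.pack("=Q", val) for off, val in words.items()}


   def decode_idle_pfns(data, first_pfn=0):
       """List the PFNs whose bit is set in data read from the bitmap.

       first_pfn is the PFN of bit 0 of the first word in data, so it
       must be a multiple of 64; len(data) must be a multiple of 8.
       """
       if first_pfn % WORD_BITS or len(data) % WORD_BYTES:
           raise ValueError("reads must be aligned to whole 8-byte words")
       idle = []
       for index, (word,) in enumerate(struct.iter_unpack("=Q", data)):
           for bit in range(WORD_BITS):
               if word & (1 << bit):
                   idle.append(first_pfn + index * WORD_BITS + bit)
       return idle

On a kernel built with CONFIG_IDLE_PAGE_TRACKING=y, and given sufficient
privileges, each word returned by ``encode_idle_words()`` could be written to
``/sys/kernel/mm/page_idle/bitmap`` at its offset (for instance with
``os.pwrite()``), and data read back from the file at an 8-byte-aligned offset
could be passed to ``decode_idle_pfns()``. Because written values are OR-ed with
the current bitmap, bits left clear in a written word do not unmark other pages.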