.. SPDX-License-Identifier: GPL-2.0

=========================================
A vmemmap diet for HugeTLB and Device DAX
=========================================

HugeTLB
=======

This section explains how HugeTLB Vmemmap Optimization (HVO) works.

The ``struct page`` structures are used to describe a physical page frame. By
default, there is a one-to-one mapping from a page frame to its corresponding
``struct page``.

HugeTLB pages consist of multiple base-page-sized pages and are supported by
many architectures. See Documentation/admin-guide/mm/hugetlbpage.rst for more
details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB are
currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page
consists of 512 base pages and a 1GB HugeTLB page consists of 262144 base pages.
For each base page, there is a corresponding ``struct page``.

Within the HugeTLB subsystem, only the first 4 ``struct page`` are used to
contain unique information about a HugeTLB page. ``__NR_USED_SUBPAGE`` provides
this upper limit. The only 'useful' information in the remaining ``struct page``
is the compound_head field, and this field is the same for all tail pages.
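
As a quick check of the page counts quoted above, the following minimal
user-space sketch computes how many base pages (and therefore how many
``struct page`` structures) back one HugeTLB page. The helper name
``base_pages_per_hugepage()`` is hypothetical and exists only for this
illustration; it is not a kernel function.

.. code-block:: c

    #include <assert.h>
    #include <stdio.h>

    /* Hypothetical helper for illustration only; not a kernel function. */
    static unsigned long long base_pages_per_hugepage(unsigned long long huge_size,
                                                      unsigned long long page_size)
    {
        /* A HugeTLB page is always a whole multiple of the base page size. */
        assert(page_size != 0 && huge_size % page_size == 0);
        return huge_size / page_size;
    }

    int main(void)
    {
        const unsigned long long KB = 1024ULL;
        const unsigned long long MB = 1024ULL * KB;
        const unsigned long long GB = 1024ULL * MB;

        /* x86-64 with 4KB base pages, as in the text above. */
        printf("2MB HugeTLB page: %llu struct pages\n",
               base_pages_per_hugepage(2 * MB, 4 * KB));
        printf("1GB HugeTLB page: %llu struct pages\n",
               base_pages_per_hugepage(1 * GB, 4 * KB));
        return 0;
    }

With 4KB base pages this reports 512 and 262144 ``struct page`` structures,
of which only the first 4 carry unique information.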

By removing redundant ``struct page`` for HugeTLB pages, memory can be returned
to the buddy allocator for other uses.

Different architectures support different HugeTLB page sizes. For example, the
following table lists the HugeTLB page sizes supported by the x86 and arm64
architectures. Because arm64 supports 4k, 16k, and 64k base pages as well as
contiguous entries, it supports many HugeTLB page sizes.

+--------------+-----------+-----------------------------------------------+
| Architecture | Page Size |               HugeTLB Page Size               |
+--------------+-----------+-----------+-----------+-----------+-----------+
|    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
+--------------+-----------+-----------+-----------+-----------+-----------+
|              |    4KB    |   64KB    |    2MB    |   32MB    |    1GB    |
|              +-----------+-----------+-----------+-----------+-----------+
|    arm64     |   16KB    |    2MB    |   32MB    |    1GB    |           |
|              +-----------+-----------+-----------+-----------+-----------+
|              |   64KB    |    2MB    |  512MB    |   16GB    |           |
+--------------+-----------+-----------+-----------+-----------+-----------+

When the system boots up, every HugeTLB page has more than one ``struct page``
struct, whose total size is (unit: pages)::

   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE

Where HugeTLB_Size is the size of the HugeTLB page.
We know that the size
of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
relationship::

   HugeTLB_Size = n * PAGE_SIZE

Then::

   struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
               = n * sizeof(struct page) / PAGE_SIZE

We can use huge mapping at the pud/pmd level for the HugeTLB page.

For the HugeTLB page of the pmd level mapping, then::

   struct_size = n * sizeof(struct page) / PAGE_SIZE
               = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
               = sizeof(struct page) / sizeof(pte_t)
               = 64 / 8
               = 8 (pages)

Where n is the number of pte entries that one page can contain. So the value of
n is (PAGE_SIZE / sizeof(pte_t)).

This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
is 8. This optimization is also applicable only when the size of ``struct page``
is a power of two. In most cases, the size of ``struct page`` is 64 bytes (e.g.
x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, its
``struct page`` structs occupy 8 page frames, whose size depends on the
size of the base page.
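
The pmd-level derivation above can be checked numerically. The sketch below is
a user-space illustration, not kernel code: ``pmd_huge_size()`` and
``vmemmap_pages()`` are hypothetical helper names, and the 64-byte
``struct page`` and 8-byte ``pte_t`` are the assumptions stated in the text.

.. code-block:: c

    #include <assert.h>
    #include <stdio.h>

    /* Assumptions taken from the text: 64-byte struct page, 8-byte pte_t. */
    #define STRUCT_PAGE_SIZE 64ULL
    #define PTE_SIZE         8ULL

    /* Size of a pmd-mapped HugeTLB page: one page table full of ptes. */
    static unsigned long long pmd_huge_size(unsigned long long page_size)
    {
        return (page_size / PTE_SIZE) * page_size;
    }

    /* struct_size = n * sizeof(struct page) / PAGE_SIZE, in base pages. */
    static unsigned long long vmemmap_pages(unsigned long long huge_size,
                                            unsigned long long page_size)
    {
        unsigned long long n = huge_size / page_size;

        return n * STRUCT_PAGE_SIZE / page_size;
    }

    int main(void)
    {
        const unsigned long long page_sizes[] = { 4096ULL, 16384ULL, 65536ULL };
        size_t i;

        for (i = 0; i < sizeof(page_sizes) / sizeof(page_sizes[0]); i++) {
            unsigned long long ps = page_sizes[i];
            unsigned long long hs = pmd_huge_size(ps);

            /* The result is 8 regardless of the base page size. */
            assert(vmemmap_pages(hs, ps) == 8);
            printf("PAGE_SIZE=%lluKB pmd HugeTLB=%lluMB vmemmap pages=%llu\n",
                   ps >> 10, hs >> 20, vmemmap_pages(hs, ps));
        }
        return 0;
    }

For 4KB, 16KB and 64KB base pages the pmd-level HugeTLB sizes come out as 2MB,
32MB and 512MB, matching the table of supported sizes, and each needs 8 vmemmap
pages.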

For the HugeTLB page of the pud level mapping, then::

   struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
               = PAGE_SIZE / 8 * 8 (pages)
               = PAGE_SIZE (pages)

Where the struct_size(pmd) is the size of the ``struct page`` structs of a
HugeTLB page of the pmd level mapping.

E.g.: On x86_64, the ``struct page`` structs of a 2MB HugeTLB page occupy 8
page frames, while those of a 1GB HugeTLB page occupy 4096.

Next, we take the pmd level mapping of the HugeTLB page as an example to
show the internal implementation of this optimization. There are 8 pages of
``struct page`` structs associated with a HugeTLB page which is pmd mapped.

Here is how things look before optimization::

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | -------------> |     2     |
 |           |                     +-----------+                +-----------+
 |           |                     |     3     | -------------> |     3     |
 |           |                     +-----------+                +-----------+
 |           |                     |     4     | -------------> |     4     |
 |    PMD    |                     +-----------+                +-----------+
 |   level   |                     |     5     | -------------> |     5     |
 |  mapping  |                     +-----------+                +-----------+
 |           |                     |     6     | -------------> |     6     |
 |           |                     +-----------+                +-----------+
 |           |                     |     7     | -------------> |     7     |
 |           |                     +-----------+                +-----------+
 |           |
 |           |
 |           |
 +-----------+

The value of page->compound_head is the same for all tail pages. The first
page of ``struct page`` (page 0) associated with the HugeTLB page contains the 4
``struct page`` necessary to describe the HugeTLB. The only use of the remaining
pages of ``struct page`` (page 1 to page 7) is to point to page->compound_head.
Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of ``struct page``
will be used for each HugeTLB page. This will allow us to free the remaining
7 pages to the buddy allocator.
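
The remapping step can be pictured with a toy user-space model. This is only a
sketch of the idea: the kernel rewrites vmemmap page table entries, whereas
here a plain array stands in for them, and ``remap_tail_pages()`` is a
hypothetical name introduced for this illustration.

.. code-block:: c

    #include <assert.h>
    #include <stdio.h>

    #define VMEMMAP_PAGES 8 /* vmemmap pages of one pmd-mapped HugeTLB page */

    /*
     * Toy model: map[i] is the page frame backing vmemmap page i.
     * Point every tail vmemmap page at the frame of page 0 and return
     * how many frames became unused as a result.
     */
    static int remap_tail_pages(int map[VMEMMAP_PAGES])
    {
        int i, freed = 0;

        for (i = 1; i < VMEMMAP_PAGES; i++) {
            if (map[i] != map[0]) {
                map[i] = map[0]; /* tail page now aliases page 0 */
                freed++;         /* its old frame is no longer needed */
            }
        }
        return freed;
    }

    int main(void)
    {
        int map[VMEMMAP_PAGES];
        int i, freed;

        /* Before optimization: one-to-one mapping, as in the diagram. */
        for (i = 0; i < VMEMMAP_PAGES; i++)
            map[i] = i;

        freed = remap_tail_pages(map);
        assert(freed == VMEMMAP_PAGES - 1);
        printf("frames freed per 2MB HugeTLB page: %d\n", freed);
        return 0;
    }

Running the model on an already remapped array frees nothing further, which
reflects that the optimization is applied once per HugeTLB page.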

Here is how things look after remapping::

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
 |           |                     +-----------+                  | | | | | |
 |           |                     |     2     | -----------------+ | | | | |
 |           |                     +-----------+                    | | | | |
 |           |                     |     3     | -------------------+ | | | |
 |           |                     +-----------+                      | | | |
 |           |                     |     4     | ---------------------+ | | |
 |    PMD    |                     +-----------+                        | | |
 |   level   |                     |     5     | -----------------------+ | |
 |  mapping  |                     +-----------+                          | |
 |           |                     |     6     | -------------------------+ |
 |           |                     +-----------+                            |
 |           |                     |     7     | ---------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+

When a HugeTLB page is freed to the buddy system, we should allocate 7 pages for
vmemmap pages and restore the previous mapping relationship.

A HugeTLB page of the pud level mapping is handled similarly. We can use the
same approach to free (PAGE_SIZE - 1) vmemmap pages, e.g. 4095 of the 4096
vmemmap pages of a 1GB HugeTLB page on x86-64.

Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
(e.g. aarch64) provide a contiguous bit in the translation table entries
that hints to the MMU that the entry is one of a contiguous set of
entries that can be cached in a single TLB entry.

The contiguous bit is used to increase the mapping size at the pmd and pte
(last) level. So this type of HugeTLB page can be optimized only when the
size of its ``struct page`` structs is greater than **1** page.

Notice: The head vmemmap page is not freed to the buddy allocator and all
tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
more than one ``struct page`` struct with ``PG_head`` (e.g. 8 per 2 MB HugeTLB
page) associated with each HugeTLB page. The ``compound_head()`` can handle
this correctly. There is only **one** head ``struct page``; the tail
``struct page`` with ``PG_head`` are fake head ``struct page``. We need an
approach to distinguish between those two different types of ``struct page`` so
that ``compound_head()`` can return the real head ``struct page`` when the
parameter is the tail ``struct page`` but with ``PG_head``. The following code
snippet describes how to distinguish between real and fake head ``struct page``.

.. code-block:: c

    if (test_bit(PG_head, &page->flags)) {
        unsigned long head = READ_ONCE(page[1].compound_head);

        if (head & 1) {
            if (head == (unsigned long)page + 1)
                /* head struct page */
            else
                /* tail struct page */
        } else {
            /* head struct page */
        }
    }

We can safely access the field of the **page[1]** with ``PG_head`` because the
page is a compound page composed of at least two contiguous pages.
See ``page_fixed_fake_head()`` for the implementation.

Device DAX
==========

The device-dax interface uses the same tail deduplication technique explained
in the previous chapter, except when used with the vmemmap in
the device (altmap).

The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
For powerpc equivalent details see Documentation/powerpc/vmemmap_dedup.rst.

The differences with HugeTLB are relatively minor.

Device DAX only uses 3 ``struct page`` for storing all information, as opposed
to 4 on HugeTLB pages.

There's no remapping of vmemmap given that device-dax memory is not part of
System RAM ranges initialized at boot.
Thus the tail page deduplication
happens at a later stage when we populate the sections. HugeTLB reuses the
head vmemmap page, whereas device-dax reuses the tail
vmemmap page. This results in only half of the savings compared to HugeTLB.

Deduplicated tail pages are not mapped read-only.

Here's how things look on device-dax after the sections are populated::

 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
 |           |                     +-----------+                   | | | | |
 |           |                     |     3     | ------------------+ | | | |
 |           |                     +-----------+                     | | | |
 |           |                     |     4     | --------------------+ | | |
 |    PMD    |                     +-----------+                       | | |
 |   level   |                     |     5     | ----------------------+ | |
 |  mapping  |                     +-----------+                         | |
 |           |                     |     6     | ------------------------+ |
 |           |                     +-----------+                           |
 |           |                     |     7     | --------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+
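
The layout in the diagram above can be modelled in user space as follows. This
is a toy sketch, not kernel code: ``dax_populate()`` and ``distinct_frames()``
are hypothetical names, and the ``map`` array merely stands in for the vmemmap
page table entries of one PMD-sized device-dax page.

.. code-block:: c

    #include <assert.h>
    #include <stdio.h>

    #define VMEMMAP_PAGES 8 /* vmemmap pages of one PMD-sized page */

    /*
     * Toy model of the device-dax layout: vmemmap pages 0 and 1 get
     * their own frames, pages 2..7 reuse the frame of tail page 1.
     */
    static void dax_populate(int map[VMEMMAP_PAGES])
    {
        int i;

        map[0] = 0;
        map[1] = 1;
        for (i = 2; i < VMEMMAP_PAGES; i++)
            map[i] = map[1];
    }

    /* Count how many different frames back the n vmemmap pages. */
    static int distinct_frames(const int *map, int n)
    {
        int i, j, count = 0;

        for (i = 0; i < n; i++) {
            for (j = 0; j < i; j++)
                if (map[j] == map[i])
                    break;
            if (j == i)
                count++; /* first occurrence of this frame */
        }
        return count;
    }

    int main(void)
    {
        int map[VMEMMAP_PAGES];

        dax_populate(map);
        assert(distinct_frames(map, VMEMMAP_PAGES) == 2);
        printf("frames backing the vmemmap: %d\n",
               distinct_frames(map, VMEMMAP_PAGES));
        return 0;
    }

In this model two frames remain populated for the device-dax layout, compared
with the single frame that remains in the HugeTLB layout shown earlier.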