.. SPDX-License-Identifier: GPL-2.0

=========================================
A vmemmap diet for HugeTLB and Device DAX
=========================================

HugeTLB
=======

This section explains how HugeTLB Vmemmap Optimization (HVO) works.

The struct page structures (page structs) are used to describe a physical
page frame. By default, there is a one-to-one mapping from a page frame to
its corresponding page struct.

HugeTLB pages consist of multiple base page size pages and are supported by
many architectures. See Documentation/admin-guide/mm/hugetlbpage.rst for more
details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB are
currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page
consists of 512 base pages and a 1GB HugeTLB page consists of 262144 base pages.
For each base page, there is a corresponding page struct.

Within the HugeTLB subsystem, only the first 4 page structs are used to
contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
this upper limit. The only 'useful' information in the remaining page structs
is the compound_head field, and this field is the same for all tail pages.

By removing redundant page structs for HugeTLB pages, memory can be returned
to the buddy allocator for other uses.
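
The page counts above follow from dividing the HugeTLB page size by the base
page size. The following userspace sketch reproduces that arithmetic. It is
illustrative only: the function names and the NR_USED_PAGE_STRUCTS constant are
invented for this example (the constant simply mirrors the limit of 4 described
above), and none of this is kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the limit of 4 page structs described above. */
#define NR_USED_PAGE_STRUCTS 4ULL

/* Number of base pages (and therefore page structs) in one HugeTLB page. */
uint64_t base_pages_per_hugepage(uint64_t hugepage_size, uint64_t base_page_size)
{
	assert(base_page_size != 0 && hugepage_size % base_page_size == 0);
	return hugepage_size / base_page_size;
}

/* Page structs whose only useful content is the compound_head field. */
uint64_t tail_only_page_structs(uint64_t hugepage_size, uint64_t base_page_size)
{
	uint64_t n = base_pages_per_hugepage(hugepage_size, base_page_size);

	assert(n >= NR_USED_PAGE_STRUCTS);
	return n - NR_USED_PAGE_STRUCTS;
}
```

With a 4KB base page, base_pages_per_hugepage() gives 512 for a 2MB HugeTLB
page and 262144 for a 1GB HugeTLB page.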

Different architectures support different HugeTLB page sizes. For example, the
following table lists the HugeTLB page sizes supported by the x86 and arm64
architectures. Because arm64 supports 4k, 16k, and 64k base pages and
supports contiguous entries, it supports many HugeTLB page sizes.

+--------------+-----------+-----------------------------------------------+
| Architecture | Page Size |                HugeTLB Page Size              |
+--------------+-----------+-----------+-----------+-----------+-----------+
|    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
+--------------+-----------+-----------+-----------+-----------+-----------+
|              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
|              +-----------+-----------+-----------+-----------+-----------+
|    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
|              +-----------+-----------+-----------+-----------+-----------+
|              |   64KB    |    2MB    |  512MB    |    16GB   |           |
+--------------+-----------+-----------+-----------+-----------+-----------+

When the system boots up, every HugeTLB page has more than one page struct.
Their total size is (unit: pages)::

    struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE

Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
relationship::

    HugeTLB_Size = n * PAGE_SIZE

Then::

    struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
                = n * sizeof(struct page) / PAGE_SIZE

We can use huge mapping at the pud/pmd level for the HugeTLB page.

For the HugeTLB page of the pmd level mapping, then::

    struct_size = n * sizeof(struct page) / PAGE_SIZE
                = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
                = sizeof(struct page) / sizeof(pte_t)
                = 64 / 8
                = 8 (pages)

Where n is the number of pte entries that one page can contain. So the value
of n is (PAGE_SIZE / sizeof(pte_t)).

This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
is 8. The optimization is also applicable only when the size of struct page
is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, its
page structs occupy 8 page frames, whose size depends on the size of the
base page.
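
As a cross-check of the derivation, the following userspace sketch evaluates
struct_size = n * sizeof(struct page) / PAGE_SIZE numerically. STRUCT_PAGE_SIZE
is an assumption fixed at 64 bytes (the common case mentioned above), and the
function name is invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: sizeof(struct page) is 64 bytes, the common case noted above. */
#define STRUCT_PAGE_SIZE 64ULL

/*
 * Pages of vmemmap needed to hold the page structs of one HugeTLB page:
 *   struct_size = n * sizeof(struct page) / PAGE_SIZE
 * where n = HugeTLB_Size / PAGE_SIZE.
 */
uint64_t vmemmap_pages(uint64_t hugetlb_size, uint64_t page_size)
{
	assert(page_size != 0 && hugetlb_size % page_size == 0);

	uint64_t n = hugetlb_size / page_size;

	/* Integer division: the result is 0 when the structs fill less than a page. */
	return n * STRUCT_PAGE_SIZE / page_size;
}
```

For a 2MB HugeTLB page with a 4KB base page, vmemmap_pages() returns 8, which
matches the result of the pmd level derivation.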

For the HugeTLB page of the pud level mapping, then::

    struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
                = PAGE_SIZE / 8 * 8 (pages)
                = PAGE_SIZE (pages)

Where struct_size(pmd) is the size of the page structs of a HugeTLB page of
the pmd level mapping.

E.g.: The page structs of a 2MB HugeTLB page on x86_64 occupy 8 page frames,
while those of a 1GB HugeTLB page occupy 4096.

Next, we take the pmd level mapping of the HugeTLB page as an example to
show the internal implementation of this optimization. There are 8 pages of
page structs associated with a HugeTLB page which is pmd mapped.

Here is how things look before optimization::

       HugeTLB                    struct pages(8 pages)         page frame(8 pages)
    +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
    |           |                     |     0     | -------------> |     0     |
    |           |                     +-----------+                +-----------+
    |           |                     |     1     | -------------> |     1     |
    |           |                     +-----------+                +-----------+
    |           |                     |     2     | -------------> |     2     |
    |           |                     +-----------+                +-----------+
    |           |                     |     3     | -------------> |     3     |
    |           |                     +-----------+                +-----------+
    |           |                     |     4     | -------------> |     4     |
    |    PMD    |                     +-----------+                +-----------+
    |   level   |                     |     5     | -------------> |     5     |
    |  mapping  |                     +-----------+                +-----------+
    |           |                     |     6     | -------------> |     6     |
    |           |                     +-----------+                +-----------+
    |           |                     |     7     | -------------> |     7     |
    |           |                     +-----------+                +-----------+
    |           |
    |           |
    |           |
    +-----------+

The value of page->compound_head is the same for all tail pages. The first
page of page structs (page 0) associated with the HugeTLB page contains the 4
page structs necessary to describe the HugeTLB. The only use of the remaining
pages of page structs (page 1 to page 7) is to point to page->compound_head.
Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of page structs
will be used for each HugeTLB page. This will allow us to free the remaining
7 pages to the buddy allocator.
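
The remapping step can be modeled in userspace as an array of 8 vmemmap page
slots, each recording the page frame that backs it. This is only a toy model
under stated assumptions: the struct and function names are invented for
illustration, and the kernel rewrites page table entries rather than an array.

```c
#include <assert.h>
#include <stddef.h>

#define VMEMMAP_PAGES_PER_PMD_HUGEPAGE 8

/* Toy model: frame[i] is the page frame backing vmemmap page i. */
struct vmemmap_model {
	int frame[VMEMMAP_PAGES_PER_PMD_HUGEPAGE];
};

/* Before optimization: vmemmap page i is backed by its own frame i. */
void vmemmap_model_init(struct vmemmap_model *m)
{
	assert(m != NULL);
	for (int i = 0; i < VMEMMAP_PAGES_PER_PMD_HUGEPAGE; i++)
		m->frame[i] = i;
}

/*
 * Remap vmemmap pages 1..7 to the frame of page 0. Returns the number of
 * frames that are no longer referenced and could be freed in this model.
 */
int vmemmap_model_remap_tails(struct vmemmap_model *m)
{
	int freed = 0;

	assert(m != NULL);
	for (int i = 1; i < VMEMMAP_PAGES_PER_PMD_HUGEPAGE; i++) {
		if (m->frame[i] != m->frame[0]) {
			m->frame[i] = m->frame[0];
			freed++;
		}
	}
	return freed;
}
```

Calling vmemmap_model_init() followed by vmemmap_model_remap_tails() reports 7
freed frames, and a second call reports 0 because every slot already shares
frame 0.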

Here is how things look after remapping::

       HugeTLB                    struct pages(8 pages)         page frame(8 pages)
    +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
    |           |                     |     0     | -------------> |     0     |
    |           |                     +-----------+                +-----------+
    |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
    |           |                     +-----------+                  | | | | | |
    |           |                     |     2     | -----------------+ | | | | |
    |           |                     +-----------+                    | | | | |
    |           |                     |     3     | -------------------+ | | | |
    |           |                     +-----------+                      | | | |
    |           |                     |     4     | ---------------------+ | | |
    |    PMD    |                     +-----------+                        | | |
    |   level   |                     |     5     | -----------------------+ | |
    |  mapping  |                     +-----------+                          | |
    |           |                     |     6     | -------------------------+ |
    |           |                     +-----------+                            |
    |           |                     |     7     | ---------------------------+
    |           |                     +-----------+
    |           |
    |           |
    |           |
    +-----------+

When a HugeTLB page is freed to the buddy system, we must allocate 7 pages
for vmemmap pages and restore the previous mapping relationship.

The HugeTLB page of the pud level mapping is handled similarly. We can use
the same approach to free (PAGE_SIZE - 1) vmemmap pages.

Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
(e.g. aarch64) provide a contiguous bit in the translation table entries
that hints to the MMU that the entry is one of a contiguous set of
entries that can be cached in a single TLB entry.

The contiguous bit is used to increase the mapping size at the pmd and pte
(last) level. So this type of HugeTLB page can be optimized only when the
size of its page structs is greater than 1 page.

Notice: The head vmemmap page is not freed to the buddy allocator and all
tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
more than one page struct with PG_head (e.g. 8 per 2 MB HugeTLB page)
associated with each HugeTLB page. compound_head() handles this correctly
(for more details, refer to the comment above compound_head()).

Device DAX
==========

The device-dax interface uses the same tail deduplication technique explained
in the previous section, except when used with the vmemmap in
the device (altmap).

The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).

The differences with HugeTLB are relatively minor.

It only uses 3 page structs for storing all information, as opposed
to 4 on HugeTLB pages.

There's no remapping of vmemmap given that device-dax memory is not part of
System RAM ranges initialized at boot. Thus the tail page deduplication
happens at a later stage when we populate the sections. HugeTLB reuses the
head vmemmap page, whereas device-dax reuses the tail vmemmap page. This
results in only half of the savings compared to HugeTLB.

Deduplicated tail pages are not mapped read-only.

Here's how things look on device-dax after the sections are populated::

    +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
    |           |                     |     0     | -------------> |     0     |
    |           |                     +-----------+                +-----------+
    |           |                     |     1     | -------------> |     1     |
    |           |                     +-----------+                +-----------+
    |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
    |           |                     +-----------+                   | | | | |
    |           |                     |     3     | ------------------+ | | | |
    |           |                     +-----------+                     | | | |
    |           |                     |     4     | --------------------+ | | |
    |    PMD    |                     +-----------+                       | | |
    |   level   |                     |     5     | ----------------------+ | |
    |  mapping  |                     +-----------+                         | |
    |           |                     |     6     | ------------------------+ |
    |           |                     +-----------+                           |
    |           |                     |     7     | --------------------------+
    |           |                     +-----------+
    |           |
    |           |
    |           |
    +-----------+
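
The layout in the diagram can be modeled in userspace as an array that is
filled in once, since device-dax populates the mapping directly instead of
remapping it later. This is a toy sketch only: the function name and the
NR_VMEMMAP_PAGES constant are invented for illustration and do not correspond
to kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

#define NR_VMEMMAP_PAGES 8

/*
 * Toy model of the device-dax layout shown in the diagram: vmemmap pages 0
 * and 1 get their own frames, and pages 2..7 all reuse the frame of page 1
 * (the first tail vmemmap page). Fills frame[] and returns the number of
 * distinct frames used.
 */
int dax_vmemmap_populate(int *frame)
{
	int distinct = 0;

	assert(frame != NULL);
	for (int i = 0; i < NR_VMEMMAP_PAGES; i++) {
		if (i < 2) {
			frame[i] = i;
			distinct++;
		} else {
			frame[i] = frame[1];
		}
	}
	return distinct;
}
```

In this model two frames back the 8 vmemmap pages, and slots 2 to 7 all point
at frame 1, as drawn in the diagram.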