.. SPDX-License-Identifier: GPL-2.0

=========================================
A vmemmap diet for HugeTLB and Device DAX
=========================================

HugeTLB
=======

This section explains how HugeTLB Vmemmap Optimization (HVO) works.

The struct page structures (page structs) are used to describe a physical
page frame. By default, there is a one-to-one mapping from a page frame to
its corresponding page struct.

HugeTLB pages consist of multiple base page size pages and are supported by many
architectures. See Documentation/admin-guide/mm/hugetlbpage.rst for more
details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB are
currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page
consists of 512 base pages and a 1GB HugeTLB page consists of 262144 base pages.
For each base page, there is a corresponding page struct.

Within the HugeTLB subsystem, only the first 4 page structs are used to
contain unique information about a HugeTLB page. __NR_USED_SUBPAGE provides
this upper limit. The only 'useful' information in the remaining page structs
is the compound_head field, and this field is the same for all tail pages.

By removing redundant page structs for HugeTLB pages, memory can be returned
to the buddy allocator for other uses.

Different architectures support different HugeTLB page sizes. For example, the
following table lists the HugeTLB page sizes supported by the x86 and arm64
architectures. Because arm64 supports 4k, 16k, and 64k base pages and
supports contiguous entries, it supports many sizes of HugeTLB
page.

+--------------+-----------+-----------------------------------------------+
| Architecture | Page Size |                HugeTLB Page Size              |
+--------------+-----------+-----------+-----------+-----------+-----------+
|    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
+--------------+-----------+-----------+-----------+-----------+-----------+
|              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
|              +-----------+-----------+-----------+-----------+-----------+
|    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
|              +-----------+-----------+-----------+-----------+-----------+
|              |   64KB    |    2MB    |  512MB    |    16GB   |           |
+--------------+-----------+-----------+-----------+-----------+-----------+

When the system boots up, every HugeTLB page has more than one struct page
struct, whose total size is (unit: pages)::

   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE

Where HugeTLB_Size is the size of the HugeTLB page. We know that the size
of the HugeTLB page is always n times PAGE_SIZE. So we can get the following
relationship::

   HugeTLB_Size = n * PAGE_SIZE

Then::

   struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
               = n * sizeof(struct page) / PAGE_SIZE

We can use huge mapping at the pud/pmd level for the HugeTLB page.

For a HugeTLB page mapped at the pmd level::

   struct_size = n * sizeof(struct page) / PAGE_SIZE
               = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
               = sizeof(struct page) / sizeof(pte_t)
               = 64 / 8
               = 8 (pages)

Where n is the number of pte entries that one page can contain. So the value of
n is (PAGE_SIZE / sizeof(pte_t)).

This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
is 8. This optimization is also applicable only when the size of struct page
is a power of two. In most cases, the size of struct page is 64 bytes (e.g.
x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, its
struct page structs occupy 8 page frames, the size of which depends on the
size of the base page.

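The arithmetic above can be checked with a few lines of C. This is only an
illustrative user-space sketch: the helper name ``vmemmap_pages`` is
hypothetical, and the struct page size is passed in as a parameter (64 bytes
being the common case mentioned above) rather than taken from a real kernel
build.

.. code-block:: c

   #include <assert.h>

   /*
    * Number of base pages needed to hold the struct page structs that
    * describe one HugeTLB page:
    *
    *   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
    *
    * All sizes are in bytes. struct_page_size is an assumption supplied by
    * the caller, since this sketch runs outside the kernel.
    */
   static unsigned long long vmemmap_pages(unsigned long long hugetlb_size,
                                           unsigned long long page_size,
                                           unsigned long long struct_page_size)
   {
           assert(page_size != 0 && hugetlb_size % page_size == 0);

           /* n: number of base pages that make up the HugeTLB page */
           unsigned long long n = hugetlb_size / page_size;

           return n * struct_page_size / page_size;
   }

With 4KB base pages and a 64-byte struct page,
``vmemmap_pages(2ULL << 20, 4096, 64)`` evaluates to 8. With 64KB base pages,
``vmemmap_pages(512ULL << 20, 65536, 64)`` also evaluates to 8, this time in
units of 64KB page frames.
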
For a HugeTLB page mapped at the pud level::

   struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
               = PAGE_SIZE / 8 * 8 (pages)
               = PAGE_SIZE (pages)

Where the struct_size(pmd) is the size of the struct page structs of a
HugeTLB page of the pmd level mapping.

E.g.: On x86_64, the struct page structs of a 2MB HugeTLB page occupy 8 page
frames, while those of a 1GB HugeTLB page occupy 4096.

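The pmd and pud results can also be reproduced from the page table entry
sizes alone. In the sketch below, the function and parameter names are
hypothetical; the values 64 and 8 used in the example calls are the struct
page and entry sizes assumed in the text.

.. code-block:: c

   #include <assert.h>

   /* pmd level: struct_size = sizeof(struct page) / sizeof(pte_t) */
   static unsigned long pmd_vmemmap_pages(unsigned long struct_page_size,
                                          unsigned long pte_entry_size)
   {
           assert(pte_entry_size != 0);
           assert(struct_page_size % pte_entry_size == 0);

           return struct_page_size / pte_entry_size;
   }

   /* pud level: struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd) */
   static unsigned long pud_vmemmap_pages(unsigned long page_size,
                                          unsigned long pmd_entry_size,
                                          unsigned long pmd_level_pages)
   {
           assert(pmd_entry_size != 0);
           assert(page_size % pmd_entry_size == 0);

           return page_size / pmd_entry_size * pmd_level_pages;
   }

For x86_64, ``pmd_vmemmap_pages(64, 8)`` gives 8 and
``pud_vmemmap_pages(4096, 8, 8)`` gives 4096, which matches the figures quoted
above.
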
Next, we take the pmd level mapping of the HugeTLB page as an example to
show the internal implementation of this optimization. There are 8 pages of
struct page structs associated with a HugeTLB page which is pmd mapped.

Here is how things look before optimization::

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | -------------> |     2     |
 |           |                     +-----------+                +-----------+
 |           |                     |     3     | -------------> |     3     |
 |           |                     +-----------+                +-----------+
 |           |                     |     4     | -------------> |     4     |
 |    PMD    |                     +-----------+                +-----------+
 |   level   |                     |     5     | -------------> |     5     |
 |  mapping  |                     +-----------+                +-----------+
 |           |                     |     6     | -------------> |     6     |
 |           |                     +-----------+                +-----------+
 |           |                     |     7     | -------------> |     7     |
 |           |                     +-----------+                +-----------+
 |           |
 |           |
 |           |
 +-----------+

The value of page->compound_head is the same for all tail pages. The first
page of page structs (page 0) associated with the HugeTLB page contains the 4
page structs necessary to describe the HugeTLB page. The only use of the remaining
pages of page structs (page 1 to page 7) is to point to page->compound_head.
Therefore, we can remap pages 1 to 7 to page 0. Only 1 page of page structs
will be used for each HugeTLB page. This will allow us to free the remaining
7 pages to the buddy allocator.

Here is how things look after remapping::

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
 |           |                     +-----------+                  | | | | | |
 |           |                     |     2     | -----------------+ | | | | |
 |           |                     +-----------+                    | | | | |
 |           |                     |     3     | -------------------+ | | | |
 |           |                     +-----------+                      | | | |
 |           |                     |     4     | ---------------------+ | | |
 |    PMD    |                     +-----------+                        | | |
 |   level   |                     |     5     | -----------------------+ | |
 |  mapping  |                     +-----------+                          | |
 |           |                     |     6     | -------------------------+ |
 |           |                     +-----------+                            |
 |           |                     |     7     | ---------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+

When a HugeTLB page is freed to the buddy system, we should allocate 7 pages for
vmemmap pages and restore the previous mapping relationship.

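The remap and restore steps can be mimicked with a toy model in which an
array stands in for the page table entries of the 8 vmemmap pages. This is
not how the kernel implements it: the type and function names below are
hypothetical, and no real pages are freed or allocated.

.. code-block:: c

   #include <assert.h>
   #include <stddef.h>

   #define VMEMMAP_PAGES 8 /* vmemmap pages of one pmd-mapped HugeTLB page */

   /* map[i] is the page frame backing vmemmap page i (stand-in for a PTE). */
   struct vmemmap_model {
           int map[VMEMMAP_PAGES];
   };

   /* Before optimization: vmemmap page i is backed by its own frame i. */
   static void vmemmap_init(struct vmemmap_model *m)
   {
           assert(m);
           for (size_t i = 0; i < VMEMMAP_PAGES; i++)
                   m->map[i] = (int)i;
   }

   /* Remap pages 1..7 to frame 0; returns how many frames became free. */
   static int vmemmap_optimize(struct vmemmap_model *m)
   {
           int freed = 0;

           assert(m);
           for (size_t i = 1; i < VMEMMAP_PAGES; i++) {
                   if (m->map[i] != 0) {
                           m->map[i] = 0;
                           freed++;
                   }
           }
           return freed;
   }

   /* Restore the one-to-one mapping; returns how many frames are needed. */
   static int vmemmap_restore(struct vmemmap_model *m)
   {
           int allocated = 0;

           assert(m);
           for (size_t i = 1; i < VMEMMAP_PAGES; i++) {
                   if (m->map[i] == 0) {
                           m->map[i] = (int)i;
                           allocated++;
                   }
           }
           return allocated;
   }

After ``vmemmap_init()``, a call to ``vmemmap_optimize()`` returns 7 and
leaves every entry pointing at frame 0; ``vmemmap_restore()`` then returns 7
again, corresponding to the 7 pages that have to be allocated when the
HugeTLB page is freed.
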
The HugeTLB page of the pud level mapping is handled similarly to the former.
We can also use this approach to free (PAGE_SIZE - 1) vmemmap pages.

Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
(e.g. aarch64) provide a contiguous bit in the translation table entries
that hints to the MMU that the entry is one of a contiguous set of
entries that can be cached in a single TLB entry.

The contiguous bit is used to increase the mapping size at the pmd and pte
(last) level. So this type of HugeTLB page can be optimized only when the
size of its struct page structs is greater than 1 page.

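The "greater than 1 page" condition can be expressed as a simple predicate.
The sketch below is an assumption-laden illustration: ``hvo_can_optimize`` is
a hypothetical name, it only models the size condition stated above, and the
struct page size is a caller-supplied parameter.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>

   /*
    * Returns true when the struct page structs of one HugeTLB page occupy
    * more than one base page, i.e. when there is at least one tail vmemmap
    * page that could be remapped. All sizes are in bytes.
    */
   static bool hvo_can_optimize(unsigned long long hugetlb_size,
                                unsigned long long page_size,
                                unsigned long long struct_page_size)
   {
           assert(page_size != 0 && hugetlb_size % page_size == 0);

           /* Total bytes of struct page structs for this HugeTLB page. */
           unsigned long long vmemmap_bytes =
                   hugetlb_size / page_size * struct_page_size;

           return vmemmap_bytes > page_size;
   }

Taking sizes from the arm64 rows of the table and a 64-byte struct page: a
64KB HugeTLB page with 4KB base pages needs only 1KB of struct page structs,
so the predicate is false, whereas a 32MB HugeTLB page with 4KB base pages
needs 512KB, so the predicate is true.
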
Notice: The head vmemmap page is not freed to the buddy allocator and all
tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
more than one struct page struct with PG_head (e.g. 8 per 2 MB HugeTLB page)
associated with each HugeTLB page. compound_head() handles this
correctly (for more details, refer to the comment above compound_head()).

Device DAX
==========

The device-dax interface uses the same tail deduplication technique explained
in the previous chapter, except when used with the vmemmap in
the device (altmap).

The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).

The differences with HugeTLB are relatively minor.

Device-dax only uses 3 page structs for storing all information as opposed
to 4 on HugeTLB pages.

There's no remapping of vmemmap given that device-dax memory is not part of
System RAM ranges initialized at boot. Thus the tail page deduplication
happens at a later stage when we populate the sections. HugeTLB reuses the
head vmemmap page, whereas device-dax reuses the tail
vmemmap page. This results in only half of the savings compared to HugeTLB.

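The difference between reusing the head and reusing the tail vmemmap page can
be written down as two small mapping functions. This is a sketch of one
reading of the device-dax diagram in this section, not kernel code; all
function names are hypothetical.

.. code-block:: c

   #include <assert.h>

   /* HugeTLB: every vmemmap page ends up backed by the head frame (0). */
   static unsigned int hugetlb_backing_frame(unsigned int vmemmap_page)
   {
           (void)vmemmap_page;
           return 0;
   }

   /*
    * device-dax: page 0 keeps frame 0, page 1 keeps frame 1, and every
    * later vmemmap page reuses frame 1 (the first tail vmemmap page).
    */
   static unsigned int devdax_backing_frame(unsigned int vmemmap_page)
   {
           return vmemmap_page < 2 ? vmemmap_page : 1;
   }

   /*
    * Number of frames still in use for nr_pages vmemmap pages. This relies
    * on both mappings above using frames 0..max without gaps.
    */
   static unsigned int frames_in_use(unsigned int (*backing)(unsigned int),
                                     unsigned int nr_pages)
   {
           unsigned int max = 0;

           assert(backing);
           for (unsigned int i = 0; i < nr_pages; i++) {
                   if (backing(i) > max)
                           max = backing(i);
           }
           return nr_pages ? max + 1 : 0;
   }

Under this reading, the 8 vmemmap pages of a pmd-sized mapping stay backed by
1 frame with HugeTLB and by 2 frames with device-dax.
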
Deduplicated tail pages are not mapped read-only.

Here's how things look on device-dax after the sections are populated::

 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
 |           |                     +-----------+                   | | | | |
 |           |                     |     3     | ------------------+ | | | |
 |           |                     +-----------+                     | | | |
 |           |                     |     4     | --------------------+ | | |
 |    PMD    |                     +-----------+                       | | |
 |   level   |                     |     5     | ----------------------+ | |
 |  mapping  |                     +-----------+                         | |
 |           |                     |     6     | ------------------------+ |
 |           |                     +-----------+                           |
 |           |                     |     7     | --------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+