.. SPDX-License-Identifier: GPL-2.0

=========================================
A vmemmap diet for HugeTLB and Device DAX
=========================================

HugeTLB
=======

This section explains how HugeTLB Vmemmap Optimization (HVO) works.

A ``struct page`` describes a physical page frame. By default, there is a
one-to-one mapping from a page frame to its corresponding ``struct page``.

A HugeTLB page consists of multiple base pages, and HugeTLB is supported by
many architectures. See Documentation/admin-guide/mm/hugetlbpage.rst for more
details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB are
currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page
consists of 512 base pages and a 1GB HugeTLB page consists of 262144 base pages.
For each base page, there is a corresponding ``struct page``.

Within the HugeTLB subsystem, only the first 4 ``struct page`` are used to
contain unique information about a HugeTLB page. ``__NR_USED_SUBPAGE`` provides
this upper limit. The only 'useful' information in the remaining ``struct page``
is the ``compound_head`` field, and this field is the same for all tail pages.

By removing redundant ``struct page`` for HugeTLB pages, memory can be returned
to the buddy allocator for other uses.

Different architectures support different HugeTLB page sizes. For example, the
following table lists the HugeTLB page sizes supported by the x86 and arm64
architectures. Because arm64 supports 4k, 16k, and 64k base pages as well as
contiguous entries, it supports many HugeTLB page sizes.

+--------------+-----------+-----------------------------------------------+
| Architecture | Page Size |                HugeTLB Page Size              |
+--------------+-----------+-----------+-----------+-----------+-----------+
|    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
+--------------+-----------+-----------+-----------+-----------+-----------+
|              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
|              +-----------+-----------+-----------+-----------+-----------+
|    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
|              +-----------+-----------+-----------+-----------+-----------+
|              |   64KB    |    2MB    |  512MB    |    16GB   |           |
+--------------+-----------+-----------+-----------+-----------+-----------+

When the system boots, every HugeTLB page has more than one ``struct page``.
Their total size, in units of pages, is::

   struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE

where HugeTLB_Size is the size of the HugeTLB page. The size of a HugeTLB page
is always a multiple n of PAGE_SIZE, which gives the following relationship::

   HugeTLB_Size = n * PAGE_SIZE

Then::

   struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
               = n * sizeof(struct page) / PAGE_SIZE
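
The formula can be checked with a small user-space sketch. It is not kernel
code: ``vmemmap_struct_size()`` and ``STRUCT_PAGE_SIZE`` are hypothetical names
introduced here, and a 64-byte ``struct page`` is assumed.

.. code-block:: c

	#include <assert.h>

	/* Assumption: sizeof(struct page) is 64 bytes. */
	#define STRUCT_PAGE_SIZE 64UL

	/*
	 * Size, in base pages, of the struct page array that describes one
	 * HugeTLB page:
	 *
	 *   n           = HugeTLB_Size / PAGE_SIZE
	 *   struct_size = n * sizeof(struct page) / PAGE_SIZE
	 */
	static unsigned long vmemmap_struct_size(unsigned long hugetlb_size,
						 unsigned long page_size)
	{
		unsigned long n;

		/* A HugeTLB page is always a whole number of base pages. */
		assert(page_size != 0 && hugetlb_size % page_size == 0);

		n = hugetlb_size / page_size;
		return n * STRUCT_PAGE_SIZE / page_size;
	}

With 4KB base pages, ``vmemmap_struct_size(2UL << 20, 4096UL)`` evaluates to 8
and ``vmemmap_struct_size(1UL << 30, 4096UL)`` to 4096.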

We can use huge mapping at the pud/pmd level for the HugeTLB page.

For a HugeTLB page mapped at the pmd level::

   struct_size = n * sizeof(struct page) / PAGE_SIZE
               = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
               = sizeof(struct page) / sizeof(pte_t)
               = 64 / 8
               = 8 (pages)

where n is the number of pte entries that one page can contain, so the value
of n is (PAGE_SIZE / sizeof(pte_t)).

This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
is 8. The optimization is also applicable only when the size of ``struct page``
is a power of two. In most cases, the size of ``struct page`` is 64 bytes (e.g.
x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, its
``struct page`` structs occupy 8 page frames, whose size depends on the size of
the base page.

For a HugeTLB page mapped at the pud level::

   struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
               = PAGE_SIZE / 8 * 8 (pages)
               = PAGE_SIZE (pages)

where struct_size(pmd) is the size of the ``struct page`` structs of a
HugeTLB page mapped at the pmd level.

For example, on x86_64 the ``struct page`` structs of a 2MB HugeTLB page occupy
8 page frames, while those of a 1GB HugeTLB page occupy 4096.

Next, we take the pmd level mapping of the HugeTLB page as an example to
show the internal implementation of this optimization. There are 8 pages of
``struct page`` structs associated with a pmd-mapped HugeTLB page.

Here is how things look before optimization::

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | -------------> |     2     |
 |           |                     +-----------+                +-----------+
 |           |                     |     3     | -------------> |     3     |
 |           |                     +-----------+                +-----------+
 |           |                     |     4     | -------------> |     4     |
 |    PMD    |                     +-----------+                +-----------+
 |   level   |                     |     5     | -------------> |     5     |
 |  mapping  |                     +-----------+                +-----------+
 |           |                     |     6     | -------------> |     6     |
 |           |                     +-----------+                +-----------+
 |           |                     |     7     | -------------> |     7     |
 |           |                     +-----------+                +-----------+
 |           |
 |           |
 |           |
 +-----------+

The value of ``page->compound_head`` is the same for all tail pages. The first
page of ``struct page`` (page 0) associated with the HugeTLB page contains the 4
``struct page`` necessary to describe the HugeTLB page. The only use of the
remaining pages of ``struct page`` (page 1 to page 7) is to point to
``page->compound_head``. Therefore, we can remap pages 1 to 7 to page 0. Only 1
page of ``struct page`` will be used for each HugeTLB page. This allows us to
free the remaining 7 pages to the buddy allocator.
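
The remapping step can be modelled in user space with an array that records
which page frame backs each vmemmap page. This is only an illustration of the
idea: ``remap_tails_to_head()`` is a hypothetical helper, and the real kernel
code rewrites page table entries rather than an array.

.. code-block:: c

	#include <assert.h>
	#include <stddef.h>

	/*
	 * frame[i] holds the page frame number backing vmemmap page i.
	 * Point every tail vmemmap page at the frame of vmemmap page 0 and
	 * return how many frames are no longer referenced.
	 */
	static size_t remap_tails_to_head(unsigned long frame[], size_t nr)
	{
		size_t i, freed = 0;

		assert(nr >= 1);

		for (i = 1; i < nr; i++) {
			if (frame[i] != frame[0]) {
				frame[i] = frame[0];
				freed++;
			}
		}
		return freed;
	}

For the 8 vmemmap pages of a pmd-mapped HugeTLB page, each initially backed by
its own frame, the helper returns 7. A second call returns 0 because every
entry already refers to the head frame.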

Here is how things look after remapping::

    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
 |           |                     +-----------+                  | | | | | |
 |           |                     |     2     | -----------------+ | | | | |
 |           |                     +-----------+                    | | | | |
 |           |                     |     3     | -------------------+ | | | |
 |           |                     +-----------+                      | | | |
 |           |                     |     4     | ---------------------+ | | |
 |    PMD    |                     +-----------+                        | | |
 |   level   |                     |     5     | -----------------------+ | |
 |  mapping  |                     +-----------+                          | |
 |           |                     |     6     | -------------------------+ |
 |           |                     +-----------+                            |
 |           |                     |     7     | ---------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+

When a HugeTLB page is freed to the buddy system, we should allocate 7 pages
for the vmemmap and restore the previous mapping relationship.

A HugeTLB page mapped at the pud level is handled in a similar way. The same
approach can be used to free (PAGE_SIZE - 1) vmemmap pages.
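
As a rough worked example of the savings, the sketch below turns the number of
vmemmap pages per HugeTLB page into freed pages and freed bytes.
``hvo_freed_pages()`` and ``hvo_freed_bytes()`` are hypothetical names. The
same number of pages has to be allocated again when the HugeTLB page is freed
to the buddy system.

.. code-block:: c

	#include <assert.h>

	/* All vmemmap pages except the head page can be freed. */
	static unsigned long hvo_freed_pages(unsigned long struct_size)
	{
		assert(struct_size >= 1);
		return struct_size - 1;
	}

	/* Memory returned to the buddy allocator per HugeTLB page. */
	static unsigned long hvo_freed_bytes(unsigned long struct_size,
					     unsigned long page_size)
	{
		return hvo_freed_pages(struct_size) * page_size;
	}

With 4KB base pages this gives 7 pages (28672 bytes) for a pmd-mapped HugeTLB
page and 4095 pages (16773120 bytes) for a pud-mapped one.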

Apart from HugeTLB pages mapped at the pmd/pud level, some architectures
(e.g. aarch64) provide a contiguous bit in the translation table entries
that hints to the MMU that the entry is one of a contiguous set of entries
that can be cached in a single TLB entry.

The contiguous bit is used to increase the mapping size at the pmd and pte
(last) level. So this type of HugeTLB page can be optimized only when the
size of its ``struct page`` structs is greater than **1** page.
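
The "greater than 1 page" condition can be sketched as a simple size check.
``hvo_optimizable()`` is a hypothetical user-space helper that assumes a
64-byte ``struct page``; the kernel may express its own check differently.

.. code-block:: c

	#include <assert.h>
	#include <stdbool.h>

	#define STRUCT_PAGE_SIZE 64ULL	/* assumption: sizeof(struct page) */

	/*
	 * A HugeTLB page is a candidate for the optimization only if its
	 * struct page structs occupy more than one base page.
	 */
	static bool hvo_optimizable(unsigned long long hugetlb_size,
				    unsigned long long page_size)
	{
		unsigned long long vmemmap_bytes;

		assert(page_size != 0 && hugetlb_size % page_size == 0);

		vmemmap_bytes = hugetlb_size / page_size * STRUCT_PAGE_SIZE;
		return vmemmap_bytes > page_size;
	}

Applied to the arm64 sizes in the table above, a 64KB HugeTLB page with 4KB
base pages needs only 1024 bytes of ``struct page`` and does not qualify,
whereas a 32MB HugeTLB page with 4KB base pages needs 512KB and does.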

Notice: The head vmemmap page is not freed to the buddy allocator and all
tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
more than one ``struct page`` with ``PG_head`` (e.g. 8 per 2 MB HugeTLB
page) associated with each HugeTLB page. ``compound_head()`` can handle
this correctly. There is only **one** head ``struct page``; the tail
``struct page`` with ``PG_head`` are fake head ``struct page``. We need an
approach to distinguish between those two types of ``struct page`` so
that ``compound_head()`` can return the real head ``struct page`` when the
parameter is a tail ``struct page`` but with ``PG_head``. The following code
snippet describes how to distinguish between real and fake head ``struct page``.

.. code-block:: c

	if (test_bit(PG_head, &page->flags)) {
		unsigned long head = READ_ONCE(page[1].compound_head);

		if (head & 1) {
			if (head == (unsigned long)page + 1) {
				/* head struct page */
			} else {
				/* tail struct page */
			}
		} else {
			/* head struct page */
		}
	}

We can safely access the field of **page[1]** with ``PG_head`` because the
page is a compound page composed of at least two contiguous pages.
See ``page_fixed_fake_head()`` for the implementation.
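
To make the check concrete, here is a user-space model of a 2MB HugeTLB page
after the optimization, assuming 4KB base pages and a 64-byte ``struct page``.
Everything in it (``struct model_page``, ``VMEMMAP_BASE``, ``lookup()``,
``is_real_head()`` and so on) is hypothetical and only mimics the aliasing
described above; it is not the kernel implementation.

.. code-block:: c

	#include <assert.h>
	#include <stdbool.h>

	#define PG_HEAD_FLAG	 1UL		/* stand-in for the PG_head bit */
	#define STRUCT_PAGE_SIZE 64UL		/* assumption: sizeof(struct page) */
	#define STRUCTS_PER_PAGE 64U		/* 4096 / 64 */
	#define VMEMMAP_PAGES	 8U		/* vmemmap pages of a 2MB HugeTLB page */
	#define VMEMMAP_BASE	 0x1000000UL	/* made-up address of struct page 0 */

	struct model_page {
		unsigned long flags;
		unsigned long compound_head;	/* bit 0 set: tail, rest: head address */
	};

	/* After remapping, this single frame backs all 8 vmemmap pages. */
	static struct model_page frame0[STRUCTS_PER_PAGE];

	/* Virtual address of the struct page with index idx. */
	static unsigned long vaddr(unsigned int idx)
	{
		return VMEMMAP_BASE + idx * STRUCT_PAGE_SIZE;
	}

	/* Contents seen when reading struct page idx through the vmemmap. */
	static struct model_page *lookup(unsigned int idx)
	{
		assert(idx < VMEMMAP_PAGES * STRUCTS_PER_PAGE);
		return &frame0[idx % STRUCTS_PER_PAGE];
	}

	static void model_init(void)
	{
		unsigned int i;

		frame0[0].flags = PG_HEAD_FLAG;
		frame0[0].compound_head = 0;
		for (i = 1; i < STRUCTS_PER_PAGE; i++) {
			frame0[i].flags = 0;
			frame0[i].compound_head = vaddr(0) + 1;
		}
	}

	/* Same test as the snippet above, expressed on the model. */
	static bool is_real_head(unsigned int idx)
	{
		unsigned long head;

		if (!(lookup(idx)->flags & PG_HEAD_FLAG))
			return false;

		head = lookup(idx + 1)->compound_head;
		if (head & 1)
			return head == vaddr(idx) + 1;
		return true;	/* page[1] is not marked as a tail */
	}

	/* Count struct pages showing PG_head, optionally only real heads. */
	static unsigned int count_heads(bool real_only)
	{
		unsigned int idx, n = 0;

		for (idx = 0; idx < VMEMMAP_PAGES * STRUCTS_PER_PAGE; idx++) {
			if (!(lookup(idx)->flags & PG_HEAD_FLAG))
				continue;
			if (!real_only || is_real_head(idx))
				n++;
		}
		return n;
	}

After ``model_init()``, ``count_heads(false)`` finds 8 ``struct page`` with
the head flag, one per vmemmap page, while ``count_heads(true)`` finds only
the one at index 0.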

Device DAX
==========

The device-dax interface uses the same tail deduplication technique explained
in the previous chapter, except when used with the vmemmap in
the device (altmap).

The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
For powerpc equivalent details see Documentation/powerpc/vmemmap_dedup.rst.

The differences with HugeTLB are relatively minor.

It only uses 3 ``struct page`` for storing all information as opposed
to 4 on HugeTLB pages.

There's no remapping of vmemmap given that device-dax memory is not part of
System RAM ranges initialized at boot. Thus the tail page deduplication
happens at a later stage when we populate the sections. HugeTLB reuses the
head vmemmap page, whereas device-dax reuses the tail
vmemmap page. This results in only half of the savings compared to HugeTLB.

Deduplicated tail pages are not mapped read-only.

Here's how things look on device-dax after the sections are populated::

 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
 |           |                     +-----------+                   | | | | |
 |           |                     |     3     | ------------------+ | | | |
 |           |                     +-----------+                     | | | |
 |           |                     |     4     | --------------------+ | | |
 |    PMD    |                     +-----------+                       | | |
 |   level   |                     |     5     | ----------------------+ | |
 |  mapping  |                     +-----------+                         | |
 |           |                     |     6     | ------------------------+ |
 |           |                     +-----------+                           |
 |           |                     |     7     | --------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+
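
The two layouts drawn in this document can be compared by counting how many
distinct page frames back the 8 vmemmap pages of a PMD-level mapping. This is
a user-space sketch that only restates the diagrams; the function names are
hypothetical.

.. code-block:: c

	#include <assert.h>
	#include <stddef.h>

	/* Device-dax layout: pages 0 and 1 keep their frame, 2..7 reuse frame 1. */
	static size_t dax_backing_frame(size_t idx)
	{
		return idx < 2 ? idx : 1;
	}

	/* HugeTLB layout: every vmemmap page is backed by frame 0. */
	static size_t hugetlb_backing_frame(size_t idx)
	{
		(void)idx;
		return 0;
	}

	/*
	 * Number of distinct frames used by nr vmemmap pages. Both layouts
	 * use frames 0..max without gaps, so max + 1 is the count.
	 */
	static size_t distinct_frames(size_t (*layout)(size_t), size_t nr)
	{
		size_t i, max = 0;

		assert(nr >= 1);

		for (i = 0; i < nr; i++) {
			if (layout(i) > max)
				max = layout(i);
		}
		return max + 1;
	}

For 8 vmemmap pages, ``distinct_frames(dax_backing_frame, 8)`` is 2 and
``distinct_frames(hugetlb_backing_frame, 8)`` is 1.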