.. SPDX-License-Identifier: GPL-2.0

===========
Page Tables
===========

Paged virtual memory was invented along with virtual memory as a concept in
1962 on the Ferranti Atlas Computer which was the first computer with paged
virtual memory. The feature migrated to newer computers and became a de facto
feature of all Unix-like systems as time went by. In 1985 the feature was
included in the Intel 80386, which was the CPU Linux 1.0 was developed on.

Page tables map virtual addresses as seen by the CPU into physical addresses
as seen on the external memory bus.

Linux defines page tables as a hierarchy which is currently five levels in
height. The architecture code for each supported architecture will then
map this to the restrictions of the hardware.

The physical address corresponding to the virtual address is often referenced
by the underlying physical page frame. The **page frame number** or **pfn**
is the physical address of the page (as seen on the external memory bus)
divided by `PAGE_SIZE`.

Physical memory address 0 will be *pfn 0* and the highest pfn will be
the last page of physical memory the external address bus of the CPU can
address.

With a page granularity of 4KB and an address range of 32 bits, pfn 0 is at
address 0x00000000, pfn 1 is at address 0x00001000, pfn 2 is at 0x00002000
and so on until we reach pfn 0xfffff at 0xfffff000. With 16KB pages pfns are
at 0x00004000, 0x00008000 ... 0xffffc000 and pfn goes from 0 to 0x3ffff.

As you can see, with 4KB pages the page base address uses bits 12-31 of the
address, and this is why `PAGE_SHIFT` in this case is defined as 12 and
`PAGE_SIZE` is usually defined in terms of the page shift as `(1 << PAGE_SHIFT)`.

Over time a deeper hierarchy has been developed in response to increasing memory
sizes. When Linux was created, 4KB pages and a single page table called
`swapper_pg_dir` with 1024 entries were used, covering 4MB, which coincided with
the fact that Torvalds's first computer had 4MB of physical memory. Entries in
this single table were referred to as *PTEs* - page table entries.

The software page table hierarchy reflects the fact that page table hardware has
become hierarchical and that in turn is done to save page table memory and
speed up mapping.

One could of course imagine a single, linear page table with an enormous number
of entries, breaking down the whole memory into single pages. Such a page table
would be very sparse, because large portions of the virtual memory usually
remain unused. By using hierarchical page tables large holes in the virtual
address space do not waste valuable page table memory, because it will suffice
to mark large areas as unmapped at a higher level in the page table hierarchy.

Additionally, on modern CPUs, a higher level page table entry can point directly
to a physical memory range, which allows mapping a contiguous range of several
megabytes or even gigabytes in a single high-level page table entry, taking
shortcuts in mapping virtual memory to physical memory: there is no need to
traverse deeper in the hierarchy when you find a large mapped range like this.

The page table hierarchy has now developed into this::

  +-----+
  | PGD |
  +-----+
     |
     |   +-----+
     +-->| P4D |
         +-----+
            |
            |   +-----+
            +-->| PUD |
                +-----+
                   |
                   |   +-----+
                   +-->| PMD |
                       +-----+
                          |
                          |   +-----+
                          +-->| PTE |
                              +-----+


Symbols on the different levels of the page table hierarchy have the following
meaning beginning from the bottom:

- **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
  The *pte* is an array of `PTRS_PER_PTE` elements of the `pteval_t` type, each
  mapping a single page of virtual memory to a single page of physical memory.
  The architecture defines the size and contents of `pteval_t`.

  A typical example is that the `pteval_t` is a 32- or 64-bit value with the
  upper bits being a **pfn** (page frame number), and the lower bits being some
  architecture-specific bits such as memory protection.

  The **entry** part of the name is a bit confusing because while in Linux 1.0
  this did refer to a single page table entry in the single top level page
  table, it was retrofitted to be an array of mapping elements when two-level
  page tables were first introduced, so the *pte* is the lowermost page
  *table*, not a page table *entry*.

- **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**, the hierarchy right
  above the *pte*, with `PTRS_PER_PMD` references to the *ptes*.

- **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced after
  the other levels to handle 4-level page tables. It is potentially unused,
  or *folded* as we will discuss later.

- **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
  handle 5-level page tables after the *pud* was introduced. Now it was clear
  that we needed to replace *pgd*, *pmd*, *pud* etc. with a figure indicating the
  directory level and that we cannot go on with ad hoc names any more. This
  is only used on systems which actually have 5 levels of page tables, otherwise
  it is folded.

- **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the Linux kernel
  main page table handling the PGD for the kernel memory is still found in
  `swapper_pg_dir`, but each userspace process in the system also has its own
  memory context and thus its own *pgd*, found in `struct mm_struct` which
  in turn is referenced by each `struct task_struct`. So tasks have memory
  context in the form of a `struct mm_struct` and this in turn has a
  `pgd_t *pgd` pointer to the corresponding page global directory.

To repeat: each level in the page table hierarchy is an *array of pointers*, so
the **pgd** contains `PTRS_PER_PGD` pointers to the next level below, **p4d**
contains `PTRS_PER_P4D` pointers to **pud** items and so on. The number of
pointers on each level is architecture-defined::

        PMD
  --> +-----+           PTE
      | ptr |-------> +-----+
      | ptr |-        | ptr |-------> PAGE
      | ptr | \       | ptr |
      | ptr |  \        ...
      | ... |   \
      | ptr |    \         PTE
      +-----+     +----> +-----+
                         | ptr |-------> PAGE
                         | ptr |
                           ...


Page Table Folding
==================

If the architecture does not use all the page table levels, they can be *folded*
which means skipped, and all operations performed on page tables will be
compile-time augmented to just skip a level when accessing the next lower
level.

Page table handling code that wishes to be architecture-neutral, such as the
virtual memory manager, will need to be written so that it traverses all of the
currently five levels. This style should also be preferred for
architecture-specific code, so as to be robust to future changes.