.. SPDX-License-Identifier: GPL-2.0

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
to the CPU. Then there could be several contiguous ranges at
completely distinct addresses. And there is also NUMA, where
different memory banks are attached to different CPUs.

Linux abstracts this diversity using one of the two memory models:
FLATMEM and SPARSEMEM. Each architecture defines what
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

All the memory models track the status of physical page frames using
struct page arranged in one or more arrays.

Regardless of the selected memory model, there exists a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture-specific setup code should
call the :c:func:`free_area_init` function. Yet, the `mem_map` array is not
usable until the call to :c:func:`memblock_free_all` that hands all the
memory to the page allocator.

An architecture may free parts of the `mem_map` array that do not cover the
actual physical pages. In such a case, the architecture-specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.
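
As a rough illustration only, a hole-aware validity check could consult a
table of populated PFN ranges. The sketch below is plain userspace C, not
kernel code: `struct pfn_range`, `populated` and `sketch_pfn_valid()` are
hypothetical names, and the ranges are made up.

.. code-block:: c

   #include <stdbool.h>
   #include <stddef.h>

   /* Hypothetical: a half-open range [start, end) of PFNs that have a memory map. */
   struct pfn_range {
       unsigned long start;
       unsigned long end;
   };

   /* Made-up layout: two banks with a hole between them. */
   static const struct pfn_range populated[] = {
       { 0x00000, 0x08000 },
       { 0x10000, 0x18000 },
   };

   /* Returns true only if the PFN falls inside one of the populated ranges. */
   static bool sketch_pfn_valid(unsigned long pfn)
   {
       size_t i;

       for (i = 0; i < sizeof(populated) / sizeof(populated[0]); i++) {
           if (pfn >= populated[i].start && pfn < populated[i].end)
               return true;
       }
       return false;
   }

With this table, PFNs inside the hole between the two banks, such as
0x8000, are reported as invalid, while PFNs inside either bank are valid.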

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the
`mem_map` array.

The `ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address different from 0.
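
The conversion described above can be sketched in userspace C as follows.
This is a minimal model, not the kernel implementation: `struct page` is
reduced to a stand-in, the `ARCH_PFN_OFFSET` and `NR_FRAMES` values are
assumed for illustration, and the `flat_*` helper names are hypothetical.

.. code-block:: c

   #include <stddef.h>

   /* Stand-in for the kernel's struct page; the real one has many more fields. */
   struct page {
       unsigned long flags;
   };

   /* Assumed for illustration: memory starts at PFN 0x80000 and has 1024 frames. */
   #define ARCH_PFN_OFFSET 0x80000UL
   #define NR_FRAMES       1024UL

   static struct page mem_map[NR_FRAMES];

   /* PFN - ARCH_PFN_OFFSET is the index into mem_map. */
   static struct page *flat_pfn_to_page(unsigned long pfn)
   {
       return mem_map + (pfn - ARCH_PFN_OFFSET);
   }

   /* The index of the page in mem_map plus ARCH_PFN_OFFSET is the PFN. */
   static unsigned long flat_page_to_pfn(const struct page *page)
   {
       return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
   }

Both directions are plain pointer arithmetic, which is why FLATMEM
conversions are cheap as long as a single array can cover all the memory.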

SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with struct mem_section
that contains `section_mem_map` that is, logically, a pointer to an
array of struct pages. However, the pointer is stored together with
some other encoded information that aids the management of the
sections. The section size and the maximal number of sections are
specified using the `SECTION_SIZE_BITS` and `MAX_PHYSMEM_BITS`
constants defined by each architecture that supports SPARSEMEM. While
`MAX_PHYSMEM_BITS` is the actual width of a physical address that an
architecture supports, `SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}
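
As a worked example of this formula, the sketch below plugs in
illustrative values. The numbers 27 and 46 are assumptions chosen for the
example, since each architecture defines its own; `SECTIONS_SHIFT` and
`SECTION_SIZE` are helper names introduced here for the arithmetic.

.. code-block:: c

   /* Illustrative values only; each architecture defines its own. */
   #define SECTION_SIZE_BITS 27
   #define MAX_PHYSMEM_BITS  46

   /* The exponent in the formula above. */
   #define SECTIONS_SHIFT    (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
   #define NR_MEM_SECTIONS   (1ULL << SECTIONS_SHIFT)

   /* Bytes of physical address space covered by one section. */
   #define SECTION_SIZE      (1ULL << SECTION_SIZE_BITS)

With these values there are 2^19 = 524288 sections, each covering
128 MiB, and together they span the whole 2^46 byte physical address
space.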

The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains PAGE_SIZE worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.
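
The row and column arithmetic for the `CONFIG_SPARSEMEM_EXTREME` case
can be sketched as below. This is a model under stated assumptions, not
the kernel code: the 4 KiB `PAGE_SIZE`, the 2^19 section count and the
16-byte stand-in `struct mem_section` are assumed, and the helper names
are chosen for illustration.

.. code-block:: c

   #include <stdint.h>

   #define PAGE_SIZE        4096UL        /* assumed page size */
   #define NR_MEM_SECTIONS  (1UL << 19)   /* assumed section count */

   /* Stand-in with an assumed 16-byte layout; the real struct differs. */
   struct mem_section {
       uint64_t section_mem_map;
       uint64_t other;
   };

   /* Each row holds PAGE_SIZE worth of mem_section objects. */
   #define SECTIONS_PER_ROOT (PAGE_SIZE / sizeof(struct mem_section))

   /* Number of rows, rounded up so that every section fits. */
   #define NR_SECTION_ROOTS \
       ((NR_MEM_SECTIONS + SECTIONS_PER_ROOT - 1) / SECTIONS_PER_ROOT)

   /* Row that holds the given section number. */
   static unsigned long section_nr_to_row(unsigned long nr)
   {
       return nr / SECTIONS_PER_ROOT;
   }

   /* Position of the given section number inside its row. */
   static unsigned long section_nr_to_col(unsigned long nr)
   {
       return nr % SECTIONS_PER_ROOT;
   }

Under these assumptions a row holds 256 objects and 2048 rows are needed.
Allocating rows only for the sections that are present is what keeps the
dynamically allocated variant small on systems with sparse memory.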

The architecture setup code should call sparse_init() to
initialize the memory sections and the memory maps.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page`: "classic sparse" and "sparse
vmemmap". The selection is made at build time and it is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

The classic sparse encodes the section number of a page in page->flags
and uses the high bits of a PFN to access the section that maps that page
frame. Inside a section, the PFN is the index to the array of pages.
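
A toy model of the classic sparse lookup is sketched below. It is not the
kernel implementation. The section size (16 pages) and section count are
toy values, `flags` holds nothing but the section number, and the
`sketch_*` and `sparse_*` helpers are hypothetical. To stay within
portable C, the sketch indexes each per-section array with the low bits
of the PFN.

.. code-block:: c

   #include <stddef.h>

   #define PFN_SECTION_SHIFT  4                        /* toy: 16 pages per section */
   #define PAGES_PER_SECTION  (1UL << PFN_SECTION_SHIFT)
   #define NR_SECTIONS        8UL                      /* toy value */

   struct page {
       unsigned long flags;    /* in this sketch: only the section number */
   };

   struct mem_section {
       struct page *map;       /* NULL when the section is not present */
   };

   static struct page section2_pages[PAGES_PER_SECTION];
   static struct page section5_pages[PAGES_PER_SECTION];
   static struct mem_section sections[NR_SECTIONS];

   /* Marks sections 2 and 5 present and records the section number per page. */
   static void sketch_sparse_init(void)
   {
       unsigned long i;

       sections[2].map = section2_pages;
       sections[5].map = section5_pages;
       for (i = 0; i < PAGES_PER_SECTION; i++) {
           section2_pages[i].flags = 2;
           section5_pages[i].flags = 5;
       }
   }

   /* High bits of the PFN select the section, low bits select the page. */
   static struct page *sparse_pfn_to_page(unsigned long pfn)
   {
       unsigned long nr = pfn >> PFN_SECTION_SHIFT;

       if (nr >= NR_SECTIONS || !sections[nr].map)
           return NULL;
       return sections[nr].map + (pfn & (PAGES_PER_SECTION - 1));
   }

   /* The section number comes from the page itself, the rest from its offset. */
   static unsigned long sparse_page_to_pfn(const struct page *page)
   {
       unsigned long nr = page->flags;

       return (nr << PFN_SECTION_SHIFT) +
              (unsigned long)(page - sections[nr].map);
   }

Note that the reverse conversion needs the section number stored in the
page, because a pointer into one of several separate arrays does not by
itself say which section it belongs to.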

The sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page and page_to_pfn operations. There is a global `struct
page *vmemmap` pointer that points to a virtually contiguous array of
`struct page` objects. A PFN is an index to that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.
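
The vmemmap conversions reduce to pointer arithmetic, as the following
userspace sketch shows. The small `vmemmap_storage` array is a stand-in
for the virtually contiguous range, the `struct page` here is a
placeholder, and the `vmemmap_*` helper names are hypothetical.

.. code-block:: c

   /* Stand-in for the kernel's struct page. */
   struct page {
       unsigned long flags;
   };

   /* Stand-in for the virtually contiguous array of struct page objects. */
   static struct page vmemmap_storage[64];
   static struct page *vmemmap = vmemmap_storage;

   /* A PFN is an index into the array that vmemmap points to. */
   static struct page *vmemmap_pfn_to_page(unsigned long pfn)
   {
       return vmemmap + pfn;
   }

   /* The offset of the struct page from vmemmap is the PFN. */
   static unsigned long vmemmap_page_to_pfn(const struct page *page)
   {
       return (unsigned long)(page - vmemmap);
   }

Compared with the classic sparse lookup, no section table is consulted and
nothing has to be stored in the page to find its PFN.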

To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with struct vmem_altmap
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.
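
To give a feel for how much pre-allocated storage such a memory map needs,
the sketch below does the arithmetic for a device of a given size. The
4 KiB page size and the 64-byte `struct page` size are assumptions, since
both depend on the architecture and configuration, and
`memmap_pages_needed()` is a hypothetical helper, not a kernel function.

.. code-block:: c

   #define PAGE_SIZE         4096ULL   /* assumed page size */
   #define STRUCT_PAGE_SIZE  64ULL     /* assumed sizeof(struct page) */

   /* Pages of storage needed to hold the memory map of the whole device. */
   static unsigned long long memmap_pages_needed(unsigned long long device_bytes)
   {
       unsigned long long nr_pages = device_bytes / PAGE_SIZE;
       unsigned long long memmap_bytes = nr_pages * STRUCT_PAGE_SIZE;

       /* Round up to whole pages. */
       return (memmap_bytes + PAGE_SIZE - 1) / PAGE_SIZE;
   }

Under these assumptions a 16 GiB device needs 65536 pages (256 MiB) for
its memory map, which is 1/64 of its capacity. Placing that storage on
the device itself avoids taking it from regular system memory.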

ZONE_DEVICE
===========

The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device-driver-identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page,
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` service for the given range of pfns. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users have a need
for smaller granularity of populating the `mem_map`. Given that
`ZONE_DEVICE` memory is never marked online, it is subsequently never
subject to its memory ranges being exposed through the sysfs memory
hotplug API on memory block boundaries. The implementation relies on
this lack of user-API constraint to allow sub-section sized memory
ranges to be specified to :c:func:`arch_add_memory`, the top-half of
memory hotplug. Sub-section support allows for 2MB as the cross-arch
common alignment granularity for :c:func:`devm_memremap_pages`.
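
The 2MB granularity can be illustrated with a small alignment check. This
is a userspace sketch, not the validation the kernel performs:
`SUBSECTION_SIZE`, `subsection_aligned()` and `subsection_align_up()` are
names introduced here for illustration.

.. code-block:: c

   #include <stdbool.h>

   #define SUBSECTION_SIZE (2ULL << 20)   /* 2MB, the granularity named above */

   /* True if both the start and the size of a range are multiples of 2MB. */
   static bool subsection_aligned(unsigned long long start,
                                  unsigned long long size)
   {
       return size != 0 &&
              (start % SUBSECTION_SIZE) == 0 &&
              (size % SUBSECTION_SIZE) == 0;
   }

   /* Rounds an address up to the next 2MB boundary. */
   static unsigned long long subsection_align_up(unsigned long long addr)
   {
       return (addr + SUBSECTION_SIZE - 1) & ~(SUBSECTION_SIZE - 1);
   }

For example, a range starting at 0x100100000 is only 1MB aligned and
would have to be moved up to 0x100200000 to satisfy the 2MB granularity.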

The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device-driver to coordinate memory management
  events related to device-memory, typically GPU memory. See
  Documentation/mm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/-E topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.