.. SPDX-License-Identifier: GPL-2.0

===============
Physical Memory
===============

Linux is available for a wide range of architectures so there is a need for an
architecture-independent abstraction to represent the physical memory. This
chapter describes the structures used to manage physical memory in a running
system.

The first principal concept prevalent in memory management is
`Non-Uniform Memory Access (NUMA)
<https://en.wikipedia.org/wiki/Non-uniform_memory_access>`_.
With multi-core and multi-socket machines, memory may be arranged into banks
that incur a different cost to access depending on the “distance” from the
processor. For example, there might be a bank of memory assigned to each CPU or
a bank of memory very suitable for DMA near peripheral devices.

Each bank is called a node and the concept is represented under Linux by a
``struct pglist_data`` even if the architecture is UMA. This structure is
always referenced by its typedef ``pg_data_t``. A ``pg_data_t`` structure
for a particular node can be referenced by the ``NODE_DATA(nid)`` macro where
``nid`` is the ID of that node.

For NUMA architectures, the node structures are allocated by the architecture
specific code early during boot. Usually, these structures are allocated
locally on the memory bank they represent. For UMA architectures, only one
static ``pg_data_t`` structure called ``contig_page_data`` is used. Nodes will
be discussed further in Section :ref:`Nodes <nodes>`.

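As a brief illustration, the hypothetical helper below looks up the node of
the currently running CPU with ``NODE_DATA()`` and prints the page frame
range it spans. It is only a sketch; the helper itself is not part of the
kernel and it relies on the ``node_start_pfn`` and ``node_spanned_pages``
fields described later in this chapter::

  #include <linux/mmzone.h>
  #include <linux/topology.h>
  #include <linux/printk.h>

  /* Illustrative only: report the extents of the local node. */
  static void report_local_node(void)
  {
          int nid = numa_node_id();
          pg_data_t *pgdat = NODE_DATA(nid);

          pr_info("node %d spans PFNs [%lu, %lu)\n", nid,
                  pgdat->node_start_pfn,
                  pgdat->node_start_pfn + pgdat->node_spanned_pages);
  }
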
The entire physical address space is partitioned into one or more blocks
called zones which represent ranges within memory. These ranges are usually
determined by architectural constraints for accessing the physical memory.
The memory range within a node that corresponds to a particular zone is
described by a ``struct zone``. Each zone has one of the types described
below.

* ``ZONE_DMA`` and ``ZONE_DMA32`` historically represented memory suitable for
  DMA by peripheral devices that cannot access all of the addressable
  memory. For many years there have been better and more robust interfaces to
  get memory with DMA specific requirements
  (Documentation/core-api/dma-api.rst), but ``ZONE_DMA`` and ``ZONE_DMA32``
  still represent memory ranges that have restrictions on how they can be
  accessed.
  Depending on the architecture, either of these zone types or even both of
  them can be disabled at build time using the ``CONFIG_ZONE_DMA`` and
  ``CONFIG_ZONE_DMA32`` configuration options. Some 64-bit platforms may need
  both zones as they support peripherals with different DMA addressing
  limitations.

* ``ZONE_NORMAL`` is for normal memory that can be accessed by the kernel all
  the time. DMA operations can be performed on pages in this zone if the DMA
  devices support transfers to all addressable memory. ``ZONE_NORMAL`` is
  always enabled.

* ``ZONE_HIGHMEM`` is the part of the physical memory that is not covered by a
  permanent mapping in the kernel page tables. The memory in this zone is only
  accessible to the kernel using temporary mappings. This zone is available
  only on some 32-bit architectures and is enabled with ``CONFIG_HIGHMEM``.

* ``ZONE_MOVABLE`` is for normal accessible memory, just like ``ZONE_NORMAL``.
  The difference is that the contents of most pages in ``ZONE_MOVABLE`` are
  movable. That means that while the virtual addresses of these pages do not
  change, their content may move between different physical pages. Often
  ``ZONE_MOVABLE`` is populated during memory hotplug, but it may also be
  populated at boot using one of the ``kernelcore``, ``movablecore`` and
  ``movable_node`` kernel command line parameters. See
  Documentation/mm/page_migration.rst and
  Documentation/admin-guide/mm/memory_hotplug.rst for additional details.

* ``ZONE_DEVICE`` represents memory residing on devices such as PMEM and GPU.
  It has different characteristics than the RAM zone types and it exists to
  provide :ref:`struct page <Pages>` and memory map services for device driver
  identified physical address ranges. ``ZONE_DEVICE`` is enabled with the
  configuration option ``CONFIG_ZONE_DEVICE``.

It is important to note that many kernel operations can only take place using
``ZONE_NORMAL`` so it is the most performance critical zone. Zones are
discussed further in Section :ref:`Zones <zones>`.

The relation between node and zone extents is determined by the physical memory
map reported by the firmware, architectural constraints for memory addressing
and certain parameters in the kernel command line.

For example, with a 32-bit kernel on an x86 UMA machine with 2 Gbytes of RAM
the entire memory will be on node 0 and there will be three zones:
``ZONE_DMA``, ``ZONE_NORMAL`` and ``ZONE_HIGHMEM``::

  0                                                            2G
  +-------------------------------------------------------------+
  |                            node 0                           |
  +-------------------------------------------------------------+

  0         16M                     896M                       2G
  +----------+-----------------------+--------------------------+
  | ZONE_DMA |      ZONE_NORMAL      |       ZONE_HIGHMEM       |
  +----------+-----------------------+--------------------------+

With a kernel built with ``ZONE_DMA`` disabled and ``ZONE_DMA32`` enabled and
booted with the ``movablecore=80%`` parameter on an arm64 machine with 16
Gbytes of RAM equally split between two nodes, there will be ``ZONE_DMA32``,
``ZONE_NORMAL`` and ``ZONE_MOVABLE`` on node 0, and ``ZONE_NORMAL`` and
``ZONE_MOVABLE`` on node 1::

  1G                               9G                         17G
  +--------------------------------+ +--------------------------+
  |             node 0             | |          node 1          |
  +--------------------------------+ +--------------------------+

  1G        4G         4200M       9G           9320M         17G
  +---------+----------+-----------+ +------------+-------------+
  |  DMA32  |  NORMAL  |  MOVABLE  | |   NORMAL   |   MOVABLE   |
  +---------+----------+-----------+ +------------+-------------+

.. _nodes:

Nodes
=====

As we have mentioned, each node in memory is described by a ``pg_data_t`` which
is a typedef for a ``struct pglist_data``. When allocating a page, by default
Linux uses a node-local allocation policy to allocate memory from the node
closest to the running CPU. As processes tend to run on the same CPU, it is
likely the memory from the current node will be used. The allocation policy can
be controlled by users as described in
Documentation/admin-guide/mm/numa_memory_policy.rst.

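As a sketch of what explicitly targeting a node looks like in kernel code, the
hypothetical snippet below requests a single page from a particular node with
``alloc_pages_node()``. With plain ``GFP_KERNEL`` the allocation may still
fall back to other nodes if the requested node has no free memory; the helper
name is made up for illustration::

  #include <linux/gfp.h>
  #include <linux/errno.h>

  /* Illustrative only: allocate and release one page on node 'nid'. */
  static int touch_page_on_node(int nid)
  {
          /* Order-0 allocation; may fall back to other nodes. */
          struct page *page = alloc_pages_node(nid, GFP_KERNEL, 0);

          if (!page)
                  return -ENOMEM;

          /* ... use the page ... */

          __free_pages(page, 0);
          return 0;
  }
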
Most NUMA architectures maintain an array of pointers to the node
structures. The actual structures are allocated early during boot when
architecture specific code parses the physical memory map reported by the
firmware. The bulk of the node initialization happens slightly later in the
boot process by the free_area_init() function, described later in Section
:ref:`Initialization <initialization>`.

Along with the node structures, the kernel maintains an array of ``nodemask_t``
bitmasks called ``node_states``. Each bitmask in this array represents a set of
nodes with particular properties as defined by ``enum node_states``:

``N_POSSIBLE``
  The node could become online at some point.
``N_ONLINE``
  The node is online.
``N_NORMAL_MEMORY``
  The node has regular memory.
``N_HIGH_MEMORY``
  The node has regular or high memory. When ``CONFIG_HIGHMEM`` is disabled,
  it is aliased to ``N_NORMAL_MEMORY``.
``N_MEMORY``
  The node has memory (regular, high or movable).
``N_CPU``
  The node has one or more CPUs.

For each node that has a property described above, the bit corresponding to the
node ID in the ``node_states[<property>]`` bitmask is set.

For example, for node 2 with normal memory and CPUs, bit 2 will be set in::

  node_states[N_POSSIBLE]
  node_states[N_ONLINE]
  node_states[N_NORMAL_MEMORY]
  node_states[N_HIGH_MEMORY]
  node_states[N_MEMORY]
  node_states[N_CPU]

For the various operations possible with nodemasks please refer to
``include/linux/nodemask.h``.

Among other things, nodemasks are used to provide macros for node traversal,
namely ``for_each_node()`` and ``for_each_online_node()``.

For instance, to call a function foo() for each online node::

  for_each_online_node(nid) {
          pg_data_t *pgdat = NODE_DATA(nid);

          foo(pgdat);
  }

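Similarly, a traversal can be restricted to nodes in a particular state. The
sketch below uses ``for_each_node_state()`` from ``include/linux/nodemask.h``
to visit only the nodes that have memory; ``bar()`` is a placeholder for
whatever per-node work is needed::

  int nid;

  /* Skip memoryless nodes by iterating the N_MEMORY state mask. */
  for_each_node_state(nid, N_MEMORY) {
          pg_data_t *pgdat = NODE_DATA(nid);

          bar(pgdat);
  }
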
Node structure
--------------

The node structure ``struct pglist_data`` is declared in
``include/linux/mmzone.h``. Here we briefly describe the fields of this
structure:

General
~~~~~~~

``node_zones``
  The zones for this node. Not all of the zones may be populated, but it is
  the full list. It is referenced by this node's ``node_zonelists`` as well
  as other nodes' ``node_zonelists``.

``node_zonelists``
  The list of all zones in all nodes. This list defines the order of zones
  that allocations are preferred from. The ``node_zonelists`` is set up by
  ``build_zonelists()`` in ``mm/page_alloc.c`` during the initialization of
  the core memory management structures.

``nr_zones``
  Number of populated zones in this node.

``node_mem_map``
  For UMA systems that use the FLATMEM memory model, node 0's
  ``node_mem_map`` is an array of struct pages representing each physical
  frame.

``node_page_ext``
  For UMA systems that use the FLATMEM memory model, node 0's
  ``node_page_ext`` is an array of extensions of struct pages. Available only
  in kernels built with ``CONFIG_PAGE_EXTENSION`` enabled.

``node_start_pfn``
  The page frame number of the starting page frame in this node.

``node_present_pages``
  Total number of physical pages present in this node.

``node_spanned_pages``
  Total size of the physical page range, including holes.

``node_size_lock``
  A lock that protects the fields defining the node extents. Only defined when
  at least one of the ``CONFIG_MEMORY_HOTPLUG`` and
  ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` configuration options is enabled.
  ``pgdat_resize_lock()`` and ``pgdat_resize_unlock()`` are provided to
  manipulate ``node_size_lock`` without checking for ``CONFIG_MEMORY_HOTPLUG``
  or ``CONFIG_DEFERRED_STRUCT_PAGE_INIT``.

``node_id``
  The Node ID (NID) of the node, starts at 0.

``totalreserve_pages``
  This is a per-node reserve of pages that are not available to userspace
  allocations.

``first_deferred_pfn``
  If memory initialization on large machines is deferred then this is the
  first PFN that needs to be initialized. Defined only when
  ``CONFIG_DEFERRED_STRUCT_PAGE_INIT`` is enabled.

``deferred_split_queue``
  Per-node queue of huge pages whose splitting was deferred. Defined only
  when ``CONFIG_TRANSPARENT_HUGEPAGE`` is enabled.

``__lruvec``
  Per-node lruvec holding LRU lists and related parameters. Used only when
  memory cgroups are disabled. It should not be accessed directly; use
  ``mem_cgroup_lruvec()`` to look up lruvecs instead.

Reclaim control
~~~~~~~~~~~~~~~

See also Documentation/mm/page_reclaim.rst.

``kswapd``
  Per-node instance of the kswapd kernel thread.

``kswapd_wait``, ``pfmemalloc_wait``, ``reclaim_wait``
  Wait queues used to synchronize memory reclaim tasks.

``nr_writeback_throttled``
  Number of tasks that are throttled waiting on dirty pages to clean.

``nr_reclaim_start``
  Number of pages written while reclaim is throttled waiting for writeback.

``kswapd_order``
  Controls the order kswapd tries to reclaim.

``kswapd_highest_zoneidx``
  The highest zone index to be reclaimed by kswapd.

``kswapd_failures``
  Number of runs in which kswapd was unable to reclaim any pages.

``min_unmapped_pages``
  Minimal number of unmapped file-backed pages that cannot be reclaimed.
  Determined by the ``vm.min_unmapped_ratio`` sysctl. Only defined when
  ``CONFIG_NUMA`` is enabled.

``min_slab_pages``
  Minimal number of SLAB pages that cannot be reclaimed. Determined by the
  ``vm.min_slab_ratio`` sysctl. Only defined when ``CONFIG_NUMA`` is enabled.

``flags``
  Flags controlling reclaim behavior.

Compaction control
~~~~~~~~~~~~~~~~~~

``kcompactd_max_order``
  Page order that kcompactd should try to achieve.

``kcompactd_highest_zoneidx``
  The highest zone index to be compacted by kcompactd.

``kcompactd_wait``
  Wait queue used to synchronize memory compaction tasks.

``kcompactd``
  Per-node instance of the kcompactd kernel thread.

``proactive_compact_trigger``
  Determines if proactive compaction is enabled. Controlled by the
  ``vm.compaction_proactiveness`` sysctl.

Statistics
~~~~~~~~~~

``per_cpu_nodestats``
  Per-CPU VM statistics for the node.

``vm_stat``
  VM statistics for the node.

.. _zones:

Zones
=====

.. admonition:: Stub

  This section is incomplete. Please list and describe the appropriate fields.

.. _pages:

Pages
=====

.. admonition:: Stub

  This section is incomplete. Please list and describe the appropriate fields.

.. _folios:

Folios
======

.. admonition:: Stub

  This section is incomplete. Please list and describe the appropriate fields.

.. _initialization:

Initialization
==============

.. admonition:: Stub

  This section is incomplete. Please list and describe the appropriate fields.