==================
Memory Hot(Un)Plug
==================

This document describes generic Linux support for memory hot(un)plug with
a focus on System RAM, including ZONE_MOVABLE support.

.. contents:: :local:

Introduction
============

Memory hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime. In the simplest case, it consists of
physically plugging or unplugging a DIMM at runtime, coordinated with the
operating system.

Memory hot(un)plug is used for various purposes:

- The physical memory available to a machine can be adjusted at runtime, up- or
  downgrading the memory capacity. This dynamic memory resizing, sometimes
  referred to as "capacity on demand", is frequently used with virtual machines
  and logical partitions.

- Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One
  example is replacing failing memory modules.

- Reducing energy consumption either by physically unplugging memory modules or
  by logically unplugging (parts of) memory modules from Linux.

Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also
used to expose persistent memory, other performance-differentiated memory and
reserved memory regions as ordinary system RAM to Linux.

Linux only supports memory hot(un)plug on selected 64-bit architectures, such as
x86_64, arm64, ppc64, s390x and ia64.

Memory Hot(Un)Plug Granularity
------------------------------

Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the
physical memory address space into chunks of the same size: memory sections. The
size of a memory section is architecture dependent. For example, x86_64 uses
128 MiB and ppc64 uses 16 MiB.

Memory sections are combined into chunks referred to as "memory blocks". The
size of a memory block is architecture dependent and corresponds to the smallest
granularity that can be hot(un)plugged. The default size of a memory block is
the same as the memory section size, unless an architecture specifies otherwise.

All memory blocks have the same size.
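
Because all memory blocks have the same size, mapping a physical address to
its memory block is plain integer arithmetic. The following is a minimal
sketch, not kernel code: the helper names are hypothetical, and it assumes the
block size comes from the ``block_size_bytes`` sysfs file and that this file
holds a hexadecimal number without a ``0x`` prefix (an assumption about the
file format that this document does not state).

```python
def parse_block_size(text):
    """Parse the contents of block_size_bytes.

    Assumption: the file holds a hexadecimal number without a 0x prefix.
    """
    return int(text.strip(), 16)


def block_for_address(addr, block_size):
    """Return (block id, start, end) of the memory block covering addr.

    The returned range is half-open: [start, end).
    """
    block_id = addr // block_size
    start = block_id * block_size
    return block_id, start, start + block_size


def blocks_spanned(start, size, block_size):
    """Number of memory blocks touched by the range [start, start + size)."""
    first = start // block_size
    last = (start + size - 1) // block_size
    return last - first + 1
```

With a 128 MiB block size, for example, ``blocks_spanned(0, 4 << 30, 128 << 20)``
reports that a 4 GiB range starting at address 0 covers 32 memory blocks.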

Phases of Memory Hotplug
------------------------

Memory hotplug consists of two phases:

(1) Adding the memory to Linux
(2) Onlining memory blocks

In the first phase, metadata, such as the memory map ("memmap") and page tables
for the direct mapping, is allocated and initialized, and memory blocks are
created; the latter also creates sysfs files for managing newly created memory
blocks.

In the second phase, added memory is exposed to the page allocator. After this
phase, the memory is visible in memory statistics, such as free and total
memory, of the system.

Phases of Memory Hotunplug
--------------------------

Memory hotunplug consists of two phases:

(1) Offlining memory blocks
(2) Removing the memory from Linux

In the first phase, memory is "hidden" from the page allocator again, for
example, by migrating busy memory to other memory locations and removing all
relevant free pages from the page allocator. After this phase, the memory is no
longer visible in memory statistics of the system.

In the second phase, the memory blocks are removed and metadata is freed.
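
The two hotplug phases and the two hotunplug phases can be read as a small
state machine for a single memory block. The sketch below is only a
conceptual model of the ordering described above; the state and action names
are made up for illustration and do not correspond to kernel identifiers.

```python
from enum import Enum


class BlockState(Enum):
    ABSENT = "absent"    # memory is not known to Linux
    OFFLINE = "offline"  # added: metadata and sysfs files exist
    ONLINE = "online"    # exposed to the page allocator


# Allowed transitions, keyed by (current state, action).
TRANSITIONS = {
    (BlockState.ABSENT, "add"): BlockState.OFFLINE,      # hotplug phase 1
    (BlockState.OFFLINE, "online"): BlockState.ONLINE,   # hotplug phase 2
    (BlockState.ONLINE, "offline"): BlockState.OFFLINE,  # hotunplug phase 1
    (BlockState.OFFLINE, "remove"): BlockState.ABSENT,   # hotunplug phase 2
}


def step(state, action):
    """Apply one action to a block state, rejecting out-of-order requests."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"cannot {action} a block that is {state.value}"
        ) from None
```

The model makes the ordering constraint explicit: an online block cannot be
removed directly, it has to be offlined first.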

Memory Hotplug Notifications
============================

There are various ways Linux can be notified about memory hotplug events such
that it can start adding hotplugged memory. This description is limited to
systems that support ACPI; mechanisms specific to other firmware interfaces or
virtual machines are not described.

ACPI Notifications
------------------

Platforms that support ACPI, such as x86_64, can support memory hotplug
notifications via ACPI.

In general, a firmware supporting memory hotplug defines a memory class object
_HID "PNP0C80". When notified about hotplug of a new memory device, the ACPI
driver will hotplug the memory to Linux.

If the firmware supports hotplug of NUMA nodes, it defines an object _HID
"ACPI0004", "PNP0A05", or "PNP0A06". When notified about a hotplug event, all
assigned memory devices are added to Linux by the ACPI driver.

Similarly, Linux can be notified about requests to hotunplug a memory device or
a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory
blocks, and, if successful, hotunplug the memory from Linux.

Manual Probing
--------------

On some architectures, the firmware may not be able to notify the operating
system about a memory hotplug event. Instead, the memory has to be manually
probed from user space.

The probe interface is located at::

        /sys/devices/system/memory/probe

Only complete memory blocks can be probed. Individual memory blocks are probed
by providing the physical start address of the memory block::

        % echo addr > /sys/devices/system/memory/probe

This results in a memory block being created for the range
[addr, addr + memory_block_size).

.. note::

  Using the probe interface is discouraged as it is easy to crash the kernel,
  because Linux cannot validate user input; this interface might be removed in
  the future.

Onlining and Offlining Memory Blocks
====================================

After a memory block has been created, Linux has to be instructed to actually
make use of that memory: the memory block has to be "onlined".

Before a memory block can be removed, Linux has to stop using any memory that
is part of the memory block: the memory block has to be "offlined".

The Linux kernel can be configured to automatically online added memory blocks,
and drivers may automatically trigger offlining of memory blocks when
attempting hotunplug of memory. Memory blocks can only be removed once
offlining has succeeded.

Onlining Memory Blocks Manually
-------------------------------

If auto-onlining of memory blocks isn't enabled, user space has to manually
trigger onlining of memory blocks. Often, udev rules are used to automate this
task in user space.

Onlining of a memory block can be triggered via::

        % echo online > /sys/devices/system/memory/memoryXXX/state

Or alternatively::

        % echo 1 > /sys/devices/system/memory/memoryXXX/online

The kernel will select the target zone automatically, depending on the
configured ``online_policy``.
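
The ``echo`` commands above can also be wrapped in a small helper. This is a
minimal sketch under stated assumptions: ``online_block`` is a hypothetical
name, the caller passes the path of one ``memoryXXX`` directory, and success is
judged by reading ``state`` back after the write. The zone-specific values
``online_kernel`` and ``online_movable`` are accepted as well because the
``state`` file documents them.

```python
import os

# Values the state file accepts for onlining, per this document.
ONLINE_MODES = ("online", "online_kernel", "online_movable")


def online_block(block_dir, mode="online"):
    """Try to online one memory block; return True if it reads back online.

    block_dir is the path of a memoryXXX directory (supplied by the caller).
    """
    if mode not in ONLINE_MODES:
        raise ValueError(f"unsupported online mode: {mode!r}")
    state_path = os.path.join(block_dir, "state")
    try:
        # A rejected request surfaces as an OSError from the write or close.
        with open(state_path, "w") as f:
            f.write(mode)
    except OSError:
        return False
    with open(state_path) as f:
        return f.read().strip() == "online"
```

Reading the state back rather than trusting the write keeps the helper
honest: if onlining fails, the state of the memory block remains unchanged.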

One can explicitly request to associate an offline memory block with
ZONE_MOVABLE by::

        % echo online_movable > /sys/devices/system/memory/memoryXXX/state

Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by::

        % echo online_kernel > /sys/devices/system/memory/memoryXXX/state

In any case, if onlining succeeds, the state of the memory block is changed to
be "online". If it fails, the state of the memory block will remain unchanged
and the above commands will fail.

Onlining Memory Blocks Automatically
------------------------------------

The kernel can be configured to try auto-onlining of newly added memory blocks.
If this feature is disabled, the memory blocks will stay offline until
explicitly onlined from user space.

The configured auto-online behavior can be observed via::

        % cat /sys/devices/system/memory/auto_online_blocks

Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
``online_movable`` to that file, like::

        % echo online > /sys/devices/system/memory/auto_online_blocks

Similarly to manual onlining, with ``online`` the kernel will select the
target zone automatically, depending on the configured ``online_policy``.

Modifying the auto-online behavior will only affect subsequently added
memory blocks.

.. note::

  In corner cases, auto-onlining can fail. The kernel won't retry. Note that
  auto-onlining is not expected to fail in default configurations.

.. note::

  DLPAR on ppc64 ignores the ``offline`` setting and will still online added
  memory blocks; if onlining fails, memory blocks are removed again.

Offlining Memory Blocks
-----------------------

In the current implementation, Linux's memory offlining will try migrating all
movable pages off the affected memory block.
As most kernel allocations, such as page tables, are unmovable, page migration
can fail and, therefore, prevent memory offlining from succeeding.

Having the memory provided by a memory block managed by ZONE_MOVABLE
significantly increases memory offlining reliability; still, memory offlining
can fail in some corner cases.

Further, memory offlining might retry for a long time (or even forever), until
aborted by the user.

Offlining of a memory block can be triggered via::

        % echo offline > /sys/devices/system/memory/memoryXXX/state

Or alternatively::

        % echo 0 > /sys/devices/system/memory/memoryXXX/online

If offlining succeeds, the state of the memory block is changed to be "offline".
If it fails, the state of the memory block will remain unchanged and the above
commands will fail, for example, via::

        bash: echo: write error: Device or resource busy

or via::

        bash: echo: write error: Invalid argument

Observing the State of Memory Blocks
------------------------------------

The state (online/offline/going-offline) of a memory block can be observed
either via::

        % cat /sys/devices/system/memory/memoryXXX/state

Or alternatively (1/0) via::

        % cat /sys/devices/system/memory/memoryXXX/online

For an online memory block, the managing zone can be observed via::

        % cat /sys/devices/system/memory/memoryXXX/valid_zones

Configuring Memory Hot(Un)Plug
==============================

System administrators can configure memory hot(un)plug and interact with
memory blocks in various ways, in particular to online them.
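
Before changing any configuration, it can help to get an overview of all
memory blocks at once. The following sketch walks the sysfs directory and
collects the ``state`` and ``valid_zones`` of every ``memoryXXX`` entry. The
function names are hypothetical, and the root directory is a parameter so the
code can be tried against a copy of the tree; a missing file is reported as
``None`` rather than treated as an error.

```python
import os
import re

# memoryXXX, where XXX is the memory block id with a variable digit count.
_BLOCK_RE = re.compile(r"^memory(\d+)$")


def read_attr(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def list_blocks(root="/sys/devices/system/memory"):
    """Map memory block id -> {'state': ..., 'valid_zones': ...}."""
    blocks = {}
    for name in os.listdir(root):
        match = _BLOCK_RE.match(name)
        if match is None:
            continue  # skip non-block entries such as probe
        block_dir = os.path.join(root, name)
        blocks[int(match.group(1))] = {
            "state": read_attr(os.path.join(block_dir, "state")),
            "valid_zones": read_attr(os.path.join(block_dir, "valid_zones")),
        }
    return dict(sorted(blocks.items()))
```

A caller could, for instance, filter the result for blocks whose ``state`` is
``offline`` to find candidates for manual onlining.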

Memory Hot(Un)Plug Configuration via Sysfs
------------------------------------------

Some memory hot(un)plug properties can be configured or inspected via sysfs in::

        /sys/devices/system/memory/

The following files are currently defined:

====================== =========================================================
``auto_online_blocks`` read-write: set or get the default state of new memory
                       blocks; configure auto-onlining.

                       The default value depends on the
                       CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration
                       option.

                       See the ``state`` property of memory blocks for details.
``block_size_bytes``   read-only: the size in bytes of a memory block.
``probe``              write-only: add (probe) selected memory blocks manually
                       from user space by supplying the physical start address.

                       Availability depends on the CONFIG_ARCH_MEMORY_PROBE
                       kernel configuration option.
``uevent``             read-write: generic udev file for device subsystems.
``crash_hotplug``      read-only: when changes to the system memory map
                       occur due to hot un/plug of memory, this file contains
                       '1' if the kernel updates the kdump capture kernel memory
                       map itself (via elfcorehdr), or '0' if userspace must update
                       the kdump capture kernel memory map.

                       Availability depends on the CONFIG_MEMORY_HOTPLUG kernel
                       configuration option.
====================== =========================================================

.. note::

  When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two
  additional files ``hard_offline_page`` and ``soft_offline_page`` are available
  to trigger hwpoisoning of pages, for example, for testing purposes. Note that
  this functionality is not really related to memory hot(un)plug or actual
  offlining of memory blocks.

Memory Block Configuration via Sysfs
------------------------------------

Each memory block is represented as a memory block device that can be
onlined or offlined. All memory blocks have their device information located in
sysfs.
Each present memory block is listed under
``/sys/devices/system/memory`` as::

        /sys/devices/system/memory/memoryXXX

where XXX is the memory block id; the number of digits is variable.

A present memory block indicates that some memory in the range is present;
however, a memory block might span memory holes. A memory block spanning memory
holes cannot be offlined.

For example, assume a 1 GiB memory block size. The device for memory starting
at 0x100000000 is ``/sys/devices/system/memory/memory4``::

        (0x100000000 / 1 GiB = 4)

This device covers the address range [0x100000000 ... 0x140000000).

The following files are currently defined:

=================== ============================================================
``online``          read-write: simplified interface to trigger onlining /
                    offlining and to observe the state of a memory block.
                    When onlining, the zone is selected automatically.
``phys_device``     read-only: legacy interface only ever used on s390x to
                    expose the covered storage increment.
``phys_index``      read-only: the memory block id (XXX).
``removable``       read-only: legacy interface that indicated whether a memory
                    block was likely to be offlineable or not.
                    Nowadays, the kernel returns ``1`` if and only if it
                    supports memory offlining.
``state``           read-write: advanced interface to trigger onlining /
                    offlining and to observe the state of a memory block.

                    When writing, ``online``, ``offline``, ``online_kernel`` and
                    ``online_movable`` are supported.

                    ``online_movable`` specifies onlining to ZONE_MOVABLE.
                    ``online_kernel`` specifies onlining to the default kernel
                    zone for the memory block, such as ZONE_NORMAL.
                    ``online`` lets the kernel select the zone automatically.

                    When reading, ``online``, ``offline`` and ``going-offline``
                    may be returned.
``uevent``          read-write: generic uevent file for devices.
``valid_zones``     read-only: when a block is online, shows the zone it
                    belongs to; when a block is offline, shows what zone will
                    manage it when the block is onlined.

                    For online memory blocks, ``DMA``, ``DMA32``, ``Normal``,
                    ``Movable`` and ``none`` may be returned. ``none`` indicates
                    that memory provided by a memory block is managed by
                    multiple zones or spans multiple nodes; such memory blocks
                    cannot be offlined. ``Movable`` indicates ZONE_MOVABLE.
                    Other values indicate a kernel zone.

                    For offline memory blocks, the first column shows the
                    zone the kernel would select when onlining the memory block
                    right now without further specifying a zone.

                    Availability depends on the CONFIG_MEMORY_HOTREMOVE
                    kernel configuration option.
=================== ============================================================

.. note::

  If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/
  directories can also be accessed via symbolic links located in the
  ``/sys/devices/system/node/node*`` directories.

  For example::

        /sys/devices/system/node/node0/memory9 -> ../../memory/memory9

  A backlink will also be created::

        /sys/devices/system/memory/memory9/node0 -> ../../node/node0

Command Line Parameters
-----------------------

Some command line parameters affect memory hot(un)plug handling. The following
command line parameters are relevant:

======================== =======================================================
``memhp_default_state``  configure auto-onlining by essentially setting
                         ``/sys/devices/system/memory/auto_online_blocks``.
``movable_node``         configure automatic zone selection in the kernel when
                         using the ``contig-zones`` online policy. When
                         set, the kernel will default to ZONE_MOVABLE when
                         onlining a memory block, unless other zones can be kept
                         contiguous.
======================== =======================================================

See Documentation/admin-guide/kernel-parameters.txt for a more generic
description of these command line parameters.

Module Parameters
-----------------

Instead of additional command line parameters or sysfs files, the
``memory_hotplug`` subsystem now provides a dedicated namespace for module
parameters. Module parameters can be set via the command line by prefixing
them with ``memory_hotplug.`` such as::

        memory_hotplug.memmap_on_memory=1

and they can be observed (and some even modified at runtime) via::

        /sys/module/memory_hotplug/parameters/

The following module parameters are currently defined:

================================ ===============================================
``memmap_on_memory``             read-write: Allocate memory for the memmap from
                                 the added memory block itself.
                                 Even if enabled,
                                 actual support depends on various other system
                                 properties and should only be regarded as a
                                 hint whether the behavior would be desired.

                                 While allocating the memmap from the memory
                                 block itself makes memory hotplug less likely
                                 to fail and keeps the memmap on the same NUMA
                                 node in any case, it can fragment physical
                                 memory in a way that huge pages in bigger
                                 granularity cannot be formed on hotplugged
                                 memory.

                                 With value "force" it could result in memory
                                 wastage due to memmap size limitations. For
                                 example, if the memmap for a memory block
                                 requires 1 MiB, but the pageblock size is 2
                                 MiB, 1 MiB of hotplugged memory will be wasted.
                                 Note that there are still cases where the
                                 feature cannot be enforced: for example, if the
                                 memmap is smaller than a single page, or if the
                                 architecture does not support the forced mode
                                 in all configurations.

``online_policy``                read-write: Set the basic policy used for
                                 automatic zone selection when onlining memory
                                 blocks without specifying a target zone.
                                 ``contig-zones`` has been the kernel default
                                 before this parameter was added.
                                 After an
                                 online policy was configured and memory was
                                 onlined, the policy should not be changed
                                 anymore.

                                 When set to ``contig-zones``, the kernel will
                                 try keeping zones contiguous. If a memory block
                                 intersects multiple zones or no zone, the
                                 behavior depends on the ``movable_node`` kernel
                                 command line parameter: default to ZONE_MOVABLE
                                 if set, default to the applicable kernel zone
                                 (usually ZONE_NORMAL) if not set.

                                 When set to ``auto-movable``, the kernel will
                                 try onlining memory blocks to ZONE_MOVABLE if
                                 possible according to the configuration and
                                 memory device details. With this policy, one
                                 can avoid zone imbalances when eventually
                                 hotplugging a lot of memory later and still
                                 wanting to be able to hotunplug as much as
                                 possible reliably, which is very desirable in
                                 virtualized environments.
                                 This policy ignores
                                 the ``movable_node`` kernel command line
                                 parameter and isn't really applicable in
                                 environments that require it (e.g., bare metal
                                 with hotunpluggable nodes) where hotplugged
                                 memory might be exposed via the
                                 firmware-provided memory map early during boot
                                 to the system instead of getting detected,
                                 added and onlined later during boot (such as
                                 done by virtio-mem or by some hypervisors
                                 implementing emulated DIMMs). As one example, a
                                 hotplugged DIMM will be onlined either
                                 completely to ZONE_MOVABLE or completely to
                                 ZONE_NORMAL, not a mixture.
                                 As another example, as many memory blocks
                                 belonging to a virtio-mem device will be
                                 onlined to ZONE_MOVABLE as possible,
                                 special-casing units of memory blocks that can
                                 only get hotunplugged together. *This policy
                                 does not protect from setups that are
                                 problematic with ZONE_MOVABLE and does not
                                 change the zone of memory blocks dynamically
                                 after they were onlined.*
``auto_movable_ratio``           read-write: Set the maximum MOVABLE:KERNEL
                                 memory ratio in % for the ``auto-movable``
                                 online policy.
                                 Whether the ratio applies only
                                 for the system across all NUMA nodes or also
                                 per NUMA node depends on the
                                 ``auto_movable_numa_aware`` configuration.

                                 All accounting is based on present memory pages
                                 in the zones combined with accounting per
                                 memory device. Memory dedicated to the CMA
                                 allocator is accounted as MOVABLE, although
                                 residing on one of the kernel zones. The
                                 possible ratio depends on the actual workload.
                                 The kernel default is "301" %, which, for
                                 example, allows hotplugging 24 GiB to an 8 GiB
                                 VM and automatically onlining all hotplugged
                                 memory to ZONE_MOVABLE in many setups. The
                                 additional 1% deals with some pages not being
                                 present, for example, because of some firmware
                                 allocations.

                                 Note that ZONE_NORMAL memory provided by one
                                 memory device does not allow for more
                                 ZONE_MOVABLE memory for a different memory
                                 device. As one example, onlining memory of a
                                 hotplugged DIMM to ZONE_NORMAL will not allow
                                 for another hotplugged DIMM to get onlined to
                                 ZONE_MOVABLE automatically.
                                 In contrast, memory
                                 hotplugged by a virtio-mem device that got
                                 onlined to ZONE_NORMAL will allow for more
                                 ZONE_MOVABLE memory within *the same*
                                 virtio-mem device.
``auto_movable_numa_aware``      read-write: Configure whether the
                                 ``auto_movable_ratio`` in the ``auto-movable``
                                 online policy also applies per NUMA
                                 node in addition to the whole system across all
                                 NUMA nodes. The kernel default is "Y".

                                 Disabling NUMA awareness can be helpful when
                                 dealing with NUMA nodes that should be
                                 completely hotunpluggable, onlining the memory
                                 completely to ZONE_MOVABLE automatically if
                                 possible.

                                 Parameter availability depends on CONFIG_NUMA.
================================ ===============================================

ZONE_MOVABLE
============

ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
Further, having system RAM managed by ZONE_MOVABLE instead of one of the
kernel zones can increase the number of possible transparent huge pages and
dynamically allocated huge pages.

Most kernel allocations are unmovable. Important examples include the memory
map (usually 1/64th of memory), page tables, and kmalloc().
Such allocations can only be served from the kernel zones.

Most user space pages, such as anonymous memory, and page cache pages are
movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones.

Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable
allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is
absolutely no guarantee whether a memory block can be offlined successfully.

Zone Imbalances
---------------

Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance,
which can harm the system or degrade performance. As one example, the kernel
might crash because it runs out of free memory for unmovable allocations,
although there is still plenty of free memory left in ZONE_MOVABLE.

Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1
are definitely impossible: the memory map alone, at roughly 1/64th of memory,
would then fill the kernel zones.

Actual safe zone ratios depend on the workload. Extreme cases, like excessive
long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all.

.. note::

  CMA memory that is part of a kernel zone essentially behaves like memory in
  ZONE_MOVABLE and similar considerations apply, especially when combining
  CMA with ZONE_MOVABLE.

ZONE_MOVABLE Sizing Considerations
----------------------------------

We usually expect that a large portion of available system RAM will actually
be consumed by user space, either directly or indirectly via the page cache. In
the normal case, ZONE_MOVABLE can be used when allocating such pages just fine.

With that in mind, it makes sense that we can have a big portion of system RAM
managed by ZONE_MOVABLE. However, there are some things to consider when using
ZONE_MOVABLE, especially when fine-tuning zone ratios:

- Having a lot of offline memory blocks. Even offline memory blocks consume
  memory for metadata and page tables in the direct map; having a lot of offline
  memory blocks is not a typical case, though.

- Memory ballooning without balloon compaction is incompatible with
  ZONE_MOVABLE. Only some implementations, such as virtio-balloon and
  pseries CMM, fully support balloon compaction.

  Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be
  disabled.
  In that case, balloon inflation will only perform unmovable
  allocations and silently create a zone imbalance, usually triggered by
  inflation requests from the hypervisor.

- Gigantic pages are unmovable, resulting in user space consuming a
  lot of unmovable memory.

- Huge pages are unmovable when an architecture does not support huge
  page migration, resulting in a similar issue as with gigantic pages.

- Page tables are unmovable. Excessive swapping, mapping extremely large
  files or ZONE_DEVICE memory can be problematic, although only really relevant
  in corner cases. When we manage a lot of user space memory that has been
  swapped out or is served from a file/persistent memory/... we still need a lot
  of page tables to manage that memory once user space accessed that memory.

- In certain DAX configurations the memory map for the device memory will be
  allocated from the kernel zones.

- KASAN can have a significant memory overhead, for example, consuming 1/8th of
  the total system memory size as (unmovable) tracking metadata.

- Long-term pinning of pages. Techniques that rely on long-term pinnings
  (especially, RDMA and vfio/mdev) are fundamentally problematic with
  ZONE_MOVABLE, and therefore, memory offlining.
  Pinned pages cannot reside
  on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they
  have to be migrated off that zone while pinning. Pinning a page can fail
  even if there is plenty of free memory in ZONE_MOVABLE.

  In addition, using ZONE_MOVABLE might make page pinning more expensive,
  because of the page migration overhead.

By default, all the memory configured at boot time is managed by the kernel
zones and ZONE_MOVABLE is not used.

To enable ZONE_MOVABLE to include the memory present at boot and to control the
ratio between movable and kernel zones there are two command line options:
``kernelcore=`` and ``movablecore=``. See
Documentation/admin-guide/kernel-parameters.rst for their description.

Memory Offlining and ZONE_MOVABLE
---------------------------------

Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
block might fail:

- Memory blocks with memory holes; this applies to memory blocks present during
  boot and can apply to memory blocks hotplugged via the XEN balloon and the
  Hyper-V balloon.

- Mixed NUMA nodes and mixed zones within a single memory block prevent memory
  offlining; this applies to memory blocks present during boot only.

- Special memory blocks prevented by the system from getting offlined. Examples
  include any memory available during boot on arm64 or memory blocks spanning
  the crashkernel area on s390x; this usually applies to memory blocks present
  during boot only.

- Memory blocks overlapping with CMA areas cannot be offlined; this applies to
  memory blocks present during boot only.

- Concurrent activity that operates on the same physical memory area, such as
  allocating gigantic pages, can result in temporary offlining failures.

- Out of memory when dissolving huge pages, especially when HugeTLB Vmemmap
  Optimization (HVO) is enabled.

  Offlining code may be able to migrate huge page contents, but may not be able
  to dissolve the source huge page because it fails allocating (unmovable) pages
  for the vmemmap, because the system might not have free memory in the kernel
  zones left.

  Users that depend on memory offlining to succeed for movable zones should
  carefully consider whether the memory savings gained from this feature are
  worth the risk of possibly not being able to offline memory in certain
  situations.

Further, when running into out of memory situations while migrating pages, or
when still encountering permanently unmovable pages within ZONE_MOVABLE
(-> BUG), memory offlining will keep retrying until it eventually succeeds.

When offlining is triggered from user space, the offlining context can be
terminated by sending a signal. A timeout-based offlining can easily be
implemented via::

        % timeout $TIMEOUT offline_block | failure_handling
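
As a slightly more concrete sketch of this pattern, the shell function below
writes ``offline`` to a memory block's ``state`` file while running under
``timeout``. It is an illustration under assumptions rather than a tested
recipe: the function name ``offline_with_timeout``, the block number in the
usage line and the failure handling are hypothetical, and the sketch relies on
``timeout`` from GNU coreutils exiting with status 124 when the time limit is
hit.

.. code-block:: sh

    # Try to offline one memory block, giving up after a time limit.
    # $1: time limit in seconds
    # $2: path to the block's "state" file, for example
    #     /sys/devices/system/memory/memory42/state (block 42 is a placeholder)
    offline_with_timeout() {
        secs=$1
        state=$2

        # The write blocks while the kernel keeps retrying. Once the time
        # limit expires, timeout(1) signals the writing process, which is
        # expected to end the offlining attempt.
        timeout "$secs" sh -c 'echo offline > "$1"' sh "$state"
        rc=$?

        # Hypothetical failure handling: report the outcome and pass the
        # exit status on to the caller.
        case $rc in
            0)   echo "offlined: $state" ;;
            124) echo "timed out after ${secs}s: $state" >&2 ;;
            *)   echo "offlining failed (exit $rc): $state" >&2 ;;
        esac
        return "$rc"
    }

A call such as ``offline_with_timeout 30 /sys/devices/system/memory/memory42/state``
would typically be run as root. Whether a block that timed out should be
retried later or left online is a policy decision for the caller.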