.. _admin_guide_memory_hotplug:

==================
Memory Hot(Un)Plug
==================

This document describes generic Linux support for memory hot(un)plug with
a focus on System RAM, including ZONE_MOVABLE support.

.. contents:: :local:

Introduction
============

Memory hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime. In the simplest case, it consists of
physically plugging or unplugging a DIMM at runtime, coordinated with the
operating system.

Memory hot(un)plug is used for various purposes:

- The physical memory available to a machine can be adjusted at runtime, up- or
  downgrading the memory capacity. This dynamic memory resizing, sometimes
  referred to as "capacity on demand", is frequently used with virtual machines
  and logical partitions.

- Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One
  example is replacing failing memory modules.

- Reducing energy consumption either by physically unplugging memory modules or
  by logically unplugging (parts of) memory modules from Linux.

Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also
used to expose persistent memory, other performance-differentiated memory and
reserved memory regions as ordinary system RAM to Linux.

Linux only supports memory hot(un)plug on selected 64-bit architectures, such as
x86_64, arm64, ppc64, s390x and ia64.

Memory Hot(Un)Plug Granularity
------------------------------

Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the
physical memory address space into chunks of the same size: memory sections. The
size of a memory section is architecture dependent. For example, x86_64 uses
128 MiB and ppc64 uses 16 MiB.

Memory sections are combined into chunks referred to as "memory blocks". The
size of a memory block is architecture dependent and corresponds to the smallest
granularity that can be hot(un)plugged. The default size of a memory block is
the same as the memory section size, unless an architecture specifies otherwise.

All memory blocks have the same size.
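
Because all memory blocks have the same size, the block that contains a given
physical address can be found with a single division. The following is a
minimal sketch, not an interface provided by the kernel: ``block_id_for_addr``
is a hypothetical helper name, and the default branch assumes that
``/sys/devices/system/memory/block_size_bytes`` reports the size as a
hexadecimal number without a ``0x`` prefix.

.. code-block:: sh

        # Hypothetical helper, for illustration only.
        # Usage: block_id_for_addr <phys_addr> [block_size_bytes]
        block_id_for_addr() {
            addr=$1
            # Assumption: block_size_bytes holds a hex value without "0x",
            # so the prefix is added before doing arithmetic on it.
            size=${2:-0x$(cat /sys/devices/system/memory/block_size_bytes)}
            echo $(( $addr / $size ))
        }

With a memory block size of 1 GiB, ``block_id_for_addr 0x100000000 0x40000000``
prints ``4``.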

Phases of Memory Hotplug
------------------------

Memory hotplug consists of two phases:

(1) Adding the memory to Linux
(2) Onlining memory blocks

In the first phase, metadata, such as the memory map ("memmap") and page tables
for the direct mapping, is allocated and initialized, and memory blocks are
created; the latter also creates sysfs files for managing newly created memory
blocks.

In the second phase, added memory is exposed to the page allocator. After this
phase, the memory is visible in memory statistics, such as free and total
memory, of the system.

Phases of Memory Hotunplug
--------------------------

Memory hotunplug consists of two phases:

(1) Offlining memory blocks
(2) Removing the memory from Linux

In the first phase, memory is "hidden" from the page allocator again, for
example, by migrating busy memory to other memory locations and removing all
relevant free pages from the page allocator. After this phase, the memory is no
longer visible in memory statistics of the system.

In the second phase, the memory blocks are removed and metadata is freed.

Memory Hotplug Notifications
============================

There are various ways in which Linux is notified about memory hotplug events
such that it can start adding hotplugged memory. This description is limited to
systems that support ACPI; mechanisms specific to other firmware interfaces or
virtual machines are not described.

ACPI Notifications
------------------

Platforms that support ACPI, such as x86_64, can support memory hotplug
notifications via ACPI.

In general, a firmware supporting memory hotplug defines a memory class object
HID "PNP0C80". When notified about hotplug of a new memory device, the ACPI
driver will hotplug the memory to Linux.

If the firmware supports hotplug of NUMA nodes, it defines an object _HID
"ACPI0004", "PNP0A05", or "PNP0A06". When notified about a hotplug event, all
assigned memory devices are added to Linux by the ACPI driver.

Similarly, Linux can be notified about requests to hotunplug a memory device or
a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory
blocks, and, if successful, hotunplug the memory from Linux.

Manual Probing
--------------

On some architectures, the firmware may not be able to notify the operating
system about a memory hotplug event. Instead, the memory has to be manually
probed from user space.

The probe interface is located at::

        /sys/devices/system/memory/probe

Only complete memory blocks can be probed. Individual memory blocks are probed
by providing the physical start address of the memory block::

        % echo addr > /sys/devices/system/memory/probe

This results in a memory block for the range [addr, addr + memory_block_size)
being created.

.. note::

  Using the probe interface is discouraged as it is easy to crash the kernel,
  because Linux cannot validate user input; this interface might be removed in
  the future.

Onlining and Offlining Memory Blocks
====================================

After a memory block has been created, Linux has to be instructed to actually
make use of that memory: the memory block has to be "onlined".

Before a memory block can be removed, Linux has to stop using any memory that is
part of the memory block: the memory block has to be "offlined".

The Linux kernel can be configured to automatically online added memory blocks.
Memory blocks can only be removed once offlining succeeded, and drivers may
automatically trigger offlining of memory blocks when attempting hotunplug of
memory.

Onlining Memory Blocks Manually
-------------------------------

If auto-onlining of memory blocks isn't enabled, user space has to manually
trigger onlining of memory blocks. Often, udev rules are used to automate this
task in user space.

Onlining of a memory block can be triggered via::

        % echo online > /sys/devices/system/memory/memoryXXX/state

Or alternatively::

        % echo 1 > /sys/devices/system/memory/memoryXXX/online

The kernel will select the target zone automatically, depending on the
configured ``online_policy``.
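
When several memory blocks are offline, the same write can be repeated for each
of them. The following is a minimal sketch of such a loop, not a replacement
for udev rules: ``online_offline_blocks`` is a hypothetical function name, and
the ``MEMORY_SYSFS`` variable is an assumption introduced here so that the loop
can be pointed at a different directory for testing.

.. code-block:: sh

        # Directory holding the memoryXXX devices; overridable (assumption).
        MEMORY_SYSFS=${MEMORY_SYSFS:-/sys/devices/system/memory}

        # Hypothetical helper: online every block whose state is "offline".
        online_offline_blocks() {
            for state in "$MEMORY_SYSFS"/memory*/state; do
                # Skip the unexpanded pattern if no memory block exists.
                [ -e "$state" ] || continue
                if [ "$(cat "$state")" = "offline" ]; then
                    # Let the kernel pick the zone; report failed writes.
                    echo online > "$state" || echo "failed: $state" >&2
                fi
            done
        }

Blocks that are already online are left untouched, and a failed write leaves
the state of the affected block unchanged.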

One can explicitly request to associate an offline memory block with
ZONE_MOVABLE by::

        % echo online_movable > /sys/devices/system/memory/memoryXXX/state

Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by::

        % echo online_kernel > /sys/devices/system/memory/memoryXXX/state

In any case, if onlining succeeds, the state of the memory block is changed to
be "online". If it fails, the state of the memory block will remain unchanged
and the above commands will fail.

Onlining Memory Blocks Automatically
------------------------------------

The kernel can be configured to try auto-onlining of newly added memory blocks.
If this feature is disabled, the memory blocks will stay offline until
explicitly onlined from user space.

The configured auto-online behavior can be observed via::

        % cat /sys/devices/system/memory/auto_online_blocks

Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
``online_movable`` to that file, like::

        % echo online > /sys/devices/system/memory/auto_online_blocks

Similarly to manual onlining, with ``online`` the kernel will select the
target zone automatically, depending on the configured ``online_policy``.

Modifying the auto-online behavior will only affect subsequently added
memory blocks.

.. note::

  In corner cases, auto-onlining can fail. The kernel won't retry. Note that
  auto-onlining is not expected to fail in default configurations.

.. note::

  DLPAR on ppc64 ignores the ``offline`` setting and will still online added
  memory blocks; if onlining fails, memory blocks are removed again.

Offlining Memory Blocks
-----------------------

In the current implementation, Linux's memory offlining will try migrating all
movable pages off the affected memory block. As most kernel allocations, such as
page tables, are unmovable, page migration can fail and, therefore, inhibit
memory offlining from succeeding.

Having the memory provided by a memory block managed by ZONE_MOVABLE
significantly increases memory offlining reliability; still, memory offlining
can fail in some corner cases.

Further, memory offlining might retry for a long time (or even forever), until
aborted by the user.

Offlining of a memory block can be triggered via::

        % echo offline > /sys/devices/system/memory/memoryXXX/state

Or alternatively::

        % echo 0 > /sys/devices/system/memory/memoryXXX/online

If offlining succeeds, the state of the memory block is changed to be "offline".
If it fails, the state of the memory block will remain unchanged and the above
commands will fail, for example, via::

        bash: echo: write error: Device or resource busy

or via::

        bash: echo: write error: Invalid argument

Observing the State of Memory Blocks
------------------------------------

The state (online/offline/going-offline) of a memory block can be observed
either via::

        % cat /sys/devices/system/memory/memoryXXX/state

Or alternatively (1/0) via::

        % cat /sys/devices/system/memory/memoryXXX/online

For an online memory block, the managing zone can be observed via::

        % cat /sys/devices/system/memory/memoryXXX/valid_zones

Configuring Memory Hot(Un)Plug
==============================

There are various ways in which system administrators can configure memory
hot(un)plug and interact with memory blocks, especially, to online them.

Memory Hot(Un)Plug Configuration via Sysfs
------------------------------------------

Some memory hot(un)plug properties can be configured or inspected via sysfs in::

        /sys/devices/system/memory/

The following files are currently defined:

====================== =========================================================
``auto_online_blocks`` read-write: set or get the default state of new memory
                       blocks; configure auto-onlining.

                       The default value depends on the
                       CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration
                       option.

                       See the ``state`` property of memory blocks for details.
``block_size_bytes``   read-only: the size in bytes of a memory block.
``probe``              write-only: add (probe) selected memory blocks manually
                       from user space by supplying the physical start address.

                       Availability depends on the CONFIG_ARCH_MEMORY_PROBE
                       kernel configuration option.
``uevent``             read-write: generic udev file for device subsystems.
====================== =========================================================

.. note::

  When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two
  additional files ``hard_offline_page`` and ``soft_offline_page`` are available
  to trigger hwpoisoning of pages, for example, for testing purposes. Note that
  this functionality is not really related to memory hot(un)plug or actual
  offlining of memory blocks.

Memory Block Configuration via Sysfs
------------------------------------

Each memory block is represented as a memory block device that can be
onlined or offlined. All memory blocks have their device information located in
sysfs. Each present memory block is listed under
``/sys/devices/system/memory`` as::

        /sys/devices/system/memory/memoryXXX

where XXX is the memory block id; the number of digits is variable.

A present memory block indicates that some memory in the range is present;
however, a memory block might span memory holes. A memory block spanning memory
holes cannot be offlined.

For example, assume 1 GiB memory block size. A device for a memory starting at
0x100000000 is ``/sys/devices/system/memory/memory4``::

        (0x100000000 / 1GiB = 4)

This device covers address range [0x100000000 ... 0x140000000).

The following files are currently defined:

=================== ============================================================
``online``          read-write: simplified interface to trigger onlining /
                    offlining and to observe the state of a memory block.
                    When onlining, the zone is selected automatically.
``phys_device``     read-only: legacy interface only ever used on s390x to
                    expose the covered storage increment.
``phys_index``      read-only: the memory block id (XXX).
``removable``       read-only: legacy interface that indicated whether a memory
                    block was likely to be offlineable or not. Nowadays, the
                    kernel returns ``1`` if and only if it supports memory
                    offlining.
``state``           read-write: advanced interface to trigger onlining /
                    offlining and to observe the state of a memory block.

                    When writing, ``online``, ``offline``, ``online_kernel`` and
                    ``online_movable`` are supported.

                    ``online_movable`` specifies onlining to ZONE_MOVABLE.
                    ``online_kernel`` specifies onlining to the default kernel
                    zone for the memory block, such as ZONE_NORMAL.
                    ``online`` lets the kernel select the zone automatically.

                    When reading, ``online``, ``offline`` and ``going-offline``
                    may be returned.
``uevent``          read-write: generic uevent file for devices.
``valid_zones``     read-only: when a block is online, shows the zone it
                    belongs to; when a block is offline, shows what zone will
                    manage it when the block will be onlined.

                    For online memory blocks, ``DMA``, ``DMA32``, ``Normal``,
                    ``Movable`` and ``none`` may be returned. ``none`` indicates
                    that memory provided by a memory block is managed by
                    multiple zones or spans multiple nodes; such memory blocks
                    cannot be offlined. ``Movable`` indicates ZONE_MOVABLE.
                    Other values indicate a kernel zone.

                    For offline memory blocks, the first column shows the
                    zone the kernel would select when onlining the memory block
                    right now without further specifying a zone.

                    Availability depends on the CONFIG_MEMORY_HOTREMOVE
                    kernel configuration option.
=================== ============================================================

.. note::

  If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/
  directories can also be accessed via symbolic links located in the
  ``/sys/devices/system/node/node*`` directories.

  For example::

        /sys/devices/system/node/node0/memory9 -> ../../memory/memory9

  A backlink will also be created::

        /sys/devices/system/memory/memory9/node0 -> ../../node/node0

Command Line Parameters
-----------------------

Some command line parameters affect memory hot(un)plug handling. The following
command line parameters are relevant:

======================== =======================================================
``memhp_default_state``  configure auto-onlining by essentially setting
                         ``/sys/devices/system/memory/auto_online_blocks``.
``movable_node``         configure automatic zone selection in the kernel when
                         using the ``contig-zones`` online policy. When
                         set, the kernel will default to ZONE_MOVABLE when
                         onlining a memory block, unless other zones can be kept
                         contiguous.
======================== =======================================================

See Documentation/admin-guide/kernel-parameters.txt for a more generic
description of these command line parameters.

Module Parameters
-----------------

Instead of additional command line parameters or sysfs files, the
``memory_hotplug`` subsystem now provides a dedicated namespace for module
parameters. Module parameters can be set via the command line by prefixing
them with ``memory_hotplug.`` such as::

        memory_hotplug.memmap_on_memory=1

and they can be observed (and some even modified at runtime) via::

        /sys/module/memory_hotplug/parameters/

The following module parameters are currently defined:

================================ ===============================================
``memmap_on_memory``             read-write: Allocate memory for the memmap from
                                 the added memory block itself. Even if enabled,
                                 actual support depends on various other system
                                 properties and should only be regarded as a
                                 hint whether the behavior would be desired.

                                 While allocating the memmap from the memory
                                 block itself makes memory hotplug less likely
                                 to fail and keeps the memmap on the same NUMA
                                 node in any case, it can fragment physical
                                 memory in a way that huge pages in bigger
                                 granularity cannot be formed on hotplugged
                                 memory.
``online_policy``                read-write: Set the basic policy used for
                                 automatic zone selection when onlining memory
                                 blocks without specifying a target zone.
                                 ``contig-zones`` has been the kernel default
                                 before this parameter was added. After an
                                 online policy was configured and memory was
                                 onlined, the policy should not be changed
                                 anymore.

                                 When set to ``contig-zones``, the kernel will
                                 try keeping zones contiguous. If a memory block
                                 intersects multiple zones or no zone, the
                                 behavior depends on the ``movable_node`` kernel
                                 command line parameter: default to ZONE_MOVABLE
                                 if set, default to the applicable kernel zone
                                 (usually ZONE_NORMAL) if not set.

                                 When set to ``auto-movable``, the kernel will
                                 try onlining memory blocks to ZONE_MOVABLE if
                                 possible according to the configuration and
                                 memory device details. With this policy, one
                                 can avoid zone imbalances when eventually
                                 hotplugging a lot of memory later and still
                                 wanting to be able to hotunplug as much as
                                 possible reliably, very desirable in
                                 virtualized environments. This policy ignores
                                 the ``movable_node`` kernel command line
                                 parameter and isn't really applicable in
                                 environments that require it (e.g., bare metal
                                 with hotunpluggable nodes) where hotplugged
                                 memory might be exposed via the
                                 firmware-provided memory map early during boot
                                 to the system instead of getting detected,
                                 added and onlined later during boot (such as
                                 done by virtio-mem or by some hypervisors
                                 implementing emulated DIMMs). As one example, a
                                 hotplugged DIMM will be onlined either
                                 completely to ZONE_MOVABLE or completely to
                                 ZONE_NORMAL, not a mixture.
                                 As another example, as many memory blocks
                                 belonging to a virtio-mem device will be
                                 onlined to ZONE_MOVABLE as possible,
                                 special-casing units of memory blocks that can
                                 only get hotunplugged together. *This policy
                                 does not protect from setups that are
                                 problematic with ZONE_MOVABLE and does not
                                 change the zone of memory blocks dynamically
                                 after they were onlined.*
``auto_movable_ratio``           read-write: Set the maximum MOVABLE:KERNEL
                                 memory ratio in % for the ``auto-movable``
                                 online policy. Whether the ratio applies only
                                 for the system across all NUMA nodes or also
                                 per NUMA node depends on the
                                 ``auto_movable_numa_aware`` configuration.

                                 All accounting is based on present memory pages
                                 in the zones combined with accounting per
                                 memory device. Memory dedicated to the CMA
                                 allocator is accounted as MOVABLE, although
                                 residing on one of the kernel zones. The
                                 possible ratio depends on the actual workload.
                                 The kernel default is "301" %, for example,
                                 allowing for hotplugging 24 GiB to an 8 GiB VM
                                 and automatically onlining all hotplugged
                                 memory to ZONE_MOVABLE in many setups. The
                                 additional 1% deals with some pages being not
                                 present, for example, because of some firmware
                                 allocations.

                                 Note that ZONE_NORMAL memory provided by one
                                 memory device does not allow for more
                                 ZONE_MOVABLE memory for a different memory
                                 device. As one example, onlining memory of a
                                 hotplugged DIMM to ZONE_NORMAL will not allow
                                 for another hotplugged DIMM to get onlined to
                                 ZONE_MOVABLE automatically. In contrast, memory
                                 hotplugged by a virtio-mem device that got
                                 onlined to ZONE_NORMAL will allow for more
                                 ZONE_MOVABLE memory within *the same*
                                 virtio-mem device.
``auto_movable_numa_aware``      read-write: Configure whether the
                                 ``auto_movable_ratio`` in the ``auto-movable``
                                 online policy also applies per NUMA
                                 node in addition to the whole system across all
                                 NUMA nodes. The kernel default is "Y".

                                 Disabling NUMA awareness can be helpful when
                                 dealing with NUMA nodes that should be
                                 completely hotunpluggable, onlining the memory
                                 completely to ZONE_MOVABLE automatically if
                                 possible.

                                 Parameter availability depends on CONFIG_NUMA.
================================ ===============================================

ZONE_MOVABLE
============

ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
Further, having system RAM managed by ZONE_MOVABLE instead of one of the
kernel zones can increase the number of possible transparent huge pages and
dynamically allocated huge pages.

Most kernel allocations are unmovable. Important examples include the memory
map (usually 1/64th of memory), page tables, and kmalloc().
Such allocations
can only be served from the kernel zones.

Most user space pages, such as anonymous memory, and page cache pages are
movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones.

Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable
allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is
absolutely no guarantee whether a memory block can be offlined successfully.

Zone Imbalances
---------------

Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance,
which can harm the system or degrade performance. As one example, the kernel
might crash because it runs out of free memory for unmovable allocations,
although there is still plenty of free memory left in ZONE_MOVABLE.

Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1
are definitely impossible due to the overhead for the memory map.

Actual safe zone ratios depend on the workload. Extreme cases, like excessive
long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all.

.. note::

  CMA memory that is part of a kernel zone essentially behaves like memory in
  ZONE_MOVABLE and similar considerations apply, especially when combining
  CMA with ZONE_MOVABLE.

ZONE_MOVABLE Sizing Considerations
----------------------------------

We usually expect that a large portion of available system RAM will actually
be consumed by user space, either directly or indirectly via the page cache. In
the normal case, ZONE_MOVABLE can be used when allocating such pages just fine.

With that in mind, it makes sense that we can have a big portion of system RAM
managed by ZONE_MOVABLE. However, there are some things to consider when using
ZONE_MOVABLE, especially when fine-tuning zone ratios:

- Having a lot of offline memory blocks. Even offline memory blocks consume
  memory for metadata and page tables in the direct map; having a lot of offline
  memory blocks is not a typical case, though.

- Memory ballooning without balloon compaction is incompatible with
  ZONE_MOVABLE. Only some implementations, such as virtio-balloon and
  pseries CMM, fully support balloon compaction.

  Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be
  disabled. In that case, balloon inflation will only perform unmovable
  allocations and silently create a zone imbalance, usually triggered by
  inflation requests from the hypervisor.

- Gigantic pages are unmovable, resulting in user space consuming a
  lot of unmovable memory.

- Huge pages are unmovable when an architecture does not support huge
  page migration, resulting in a similar issue as with gigantic pages.

- Page tables are unmovable. Excessive swapping, mapping extremely large
  files or ZONE_DEVICE memory can be problematic, although only really relevant
  in corner cases. When we manage a lot of user space memory that has been
  swapped out or is served from a file/persistent memory/... we still need a lot
  of page tables to manage that memory once user space has accessed it.

- In certain DAX configurations the memory map for the device memory will be
  allocated from the kernel zones.

- KASAN can have a significant memory overhead, for example, consuming 1/8th of
  the total system memory size as (unmovable) tracking metadata.

- Long-term pinning of pages. Techniques that rely on long-term pinning
  (especially RDMA and vfio/mdev) are fundamentally problematic with
  ZONE_MOVABLE, and therefore, memory offlining. Pinned pages cannot reside
  on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they
  have to be migrated off that zone while pinning. Pinning a page can fail
  even if there is plenty of free memory in ZONE_MOVABLE.

  In addition, using ZONE_MOVABLE might make page pinning more expensive,
  because of the page migration overhead.

By default, all the memory configured at boot time is managed by the kernel
zones and ZONE_MOVABLE is not used.

To enable ZONE_MOVABLE to include the memory present at boot and to control the
ratio between movable and kernel zones there are two command line options:
``kernelcore=`` and ``movablecore=``. See
Documentation/admin-guide/kernel-parameters.rst for their description.

Memory Offlining and ZONE_MOVABLE
---------------------------------

Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
block might fail:

- Memory blocks with memory holes; this applies to memory blocks present during
  boot and can apply to memory blocks hotplugged via the XEN balloon and the
  Hyper-V balloon.

- Mixed NUMA nodes and mixed zones within a single memory block prevent memory
  offlining; this applies to memory blocks present during boot only.

- Special memory blocks prevented by the system from getting offlined. Examples
  include any memory available during boot on arm64 or memory blocks spanning
  the crashkernel area on s390x; this usually applies to memory blocks present
  during boot only.

- Memory blocks overlapping with CMA areas cannot be offlined; this applies to
  memory blocks present during boot only.

- Concurrent activity that operates on the same physical memory area, such as
  allocating gigantic pages, can result in temporary offlining failures.

- Out of memory when dissolving huge pages, especially when freeing unused
  vmemmap pages associated with each hugetlb page is enabled.

  Offlining code may be able to migrate huge page contents, but may not be able
  to dissolve the source huge page because it fails allocating (unmovable) pages
  for the vmemmap, because the system might not have free memory in the kernel
  zones left.

  Users that depend on memory offlining to succeed for movable zones should
  carefully consider whether the memory savings gained from this feature are
  worth the risk of possibly not being able to offline memory in certain
  situations.


Further, when running into out of memory situations while migrating pages, or
when still encountering permanently unmovable pages within ZONE_MOVABLE
(-> BUG), memory offlining will keep retrying until it eventually succeeds.

When offlining is triggered from user space, the offlining context can be
terminated by sending a fatal signal. A timeout based offlining can easily be
implemented via::

        % timeout $TIMEOUT offline_block | failure_handling
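
A minimal sketch of such a wrapper, assuming the memory block is exposed under
``/sys/devices/system/memory/memoryN/state`` and that ``timeout`` from GNU
coreutils is available. The ``offline_block`` function, its argument order and
the ``MEMORY_SYSFS`` override are hypothetical names chosen for illustration;
they are not part of any kernel or distribution interface::

        #!/bin/sh
        # Hypothetical helper: try to offline one memory block and give up
        # after a time limit. MEMORY_SYSFS exists only so the sketch can be
        # pointed at a different directory; it is an assumption of this example.
        MEMORY_SYSFS=${MEMORY_SYSFS:-/sys/devices/system/memory}

        # usage: offline_block <block number> <timeout in seconds>
        offline_block() {
            block=$1
            secs=$2
            state="$MEMORY_SYSFS/memory$block/state"

            if [ ! -w "$state" ]; then
                echo "memory$block: no writable state file" >&2
                return 2
            fi

            # timeout(1) signals the writer when the limit expires, which is
            # how the one-liner above terminates the offlining context.
            if timeout "$secs" sh -c 'echo offline > "$1"' sh "$state"; then
                echo "memory$block: offline"
            else
                echo "memory$block: offlining failed or timed out" >&2
                return 1
            fi
        }

It could be used as ``offline_block 42 30 || failure_handling``, where the
block number and the timeout in seconds are picked by the administrator and
``failure_handling`` stands for whatever recovery step is appropriate.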