.. _admin_guide_memory_hotplug:

==================
Memory Hot(Un)Plug
==================

This document describes generic Linux support for memory hot(un)plug with
a focus on System RAM, including ZONE_MOVABLE support.

.. contents:: :local:

Introduction
============

Memory hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime. In the simplest case, it consists of
physically plugging or unplugging a DIMM at runtime, coordinated with the
operating system.

Memory hot(un)plug is used for various purposes:

- The physical memory available to a machine can be adjusted at runtime, up- or
  downgrading the memory capacity. This dynamic memory resizing, sometimes
  referred to as "capacity on demand", is frequently used with virtual machines
  and logical partitions.

- Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One
  example is replacing failing memory modules.

- Reducing energy consumption either by physically unplugging memory modules or
  by logically unplugging (parts of) memory modules from Linux.

Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also
used to expose persistent memory, other performance-differentiated memory and
reserved memory regions as ordinary system RAM to Linux.

Linux only supports memory hot(un)plug on selected 64 bit architectures, such as
x86_64, arm64, ppc64, s390x and ia64.

Memory Hot(Un)Plug Granularity
------------------------------

Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the
physical memory address space into chunks of the same size: memory sections. The
size of a memory section is architecture dependent. For example, x86_64 uses
128 MiB and ppc64 uses 16 MiB.

Memory sections are combined into chunks referred to as "memory blocks". The
size of a memory block is architecture dependent and corresponds to the smallest
granularity that can be hot(un)plugged. The default size of a memory block is
the same as memory section size, unless an architecture specifies otherwise.

All memory blocks have the same size.

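Because all memory blocks have the same size, the memory block covering a
physical address follows from an integer division. The following is a minimal
sketch, not a kernel interface: the function names ``memory_block_id`` and
``memory_block_range`` are made up for illustration, and the block size is
assumed to be given as a hexadecimal string without ``0x`` prefix, which is the
format ``/sys/devices/system/memory/block_size_bytes`` is assumed to use.

.. code-block:: sh

   # Print the id of the memory block that contains a physical address.
   #   $1: physical address (decimal, or hexadecimal with 0x prefix)
   #   $2: memory block size in bytes, hexadecimal without 0x prefix
   memory_block_id() {
       echo $(( $1 / 0x$2 ))
   }

   # Print the covered range [start, end) of a memory block in hexadecimal.
   #   $1: memory block id
   #   $2: memory block size in bytes, hexadecimal without 0x prefix
   memory_block_range() {
       printf '0x%x 0x%x\n' $(( $1 * 0x$2 )) $(( ($1 + 1) * 0x$2 ))
   }

With 128 MiB memory blocks (``8000000``), ``memory_block_id 0x100000000 8000000``
prints ``32``.
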
Phases of Memory Hotplug
------------------------

Memory hotplug consists of two phases:

(1) Adding the memory to Linux
(2) Onlining memory blocks

In the first phase, metadata, such as the memory map ("memmap") and page tables
for the direct mapping, is allocated and initialized, and memory blocks are
created; the latter also creates sysfs files for managing newly created memory
blocks.

In the second phase, added memory is exposed to the page allocator. After this
phase, the memory is visible in memory statistics, such as free and total
memory, of the system.

Phases of Memory Hotunplug
--------------------------

Memory hotunplug consists of two phases:

(1) Offlining memory blocks
(2) Removing the memory from Linux

In the first phase, memory is "hidden" from the page allocator again, for
example, by migrating busy memory to other memory locations and removing all
relevant free pages from the page allocator. After this phase, the memory is no
longer visible in memory statistics of the system.

In the second phase, the memory blocks are removed and metadata is freed.

Memory Hotplug Notifications
============================

There are various ways in which Linux is notified about memory hotplug events
such that it can start adding hotplugged memory. This description is limited to
systems that support ACPI; mechanisms specific to other firmware interfaces or
virtual machines are not described.

ACPI Notifications
------------------

Platforms that support ACPI, such as x86_64, can support memory hotplug
notifications via ACPI.

In general, a firmware supporting memory hotplug defines a memory class object
with _HID "PNP0C80". When notified about hotplug of a new memory device, the
ACPI driver will hotplug the memory to Linux.

If the firmware supports hotplug of NUMA nodes, it defines an object with _HID
"ACPI0004", "PNP0A05", or "PNP0A06". When notified about a hotplug event, all
assigned memory devices are added to Linux by the ACPI driver.

Similarly, Linux can be notified about requests to hotunplug a memory device or
a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory
blocks, and, if successful, hotunplug the memory from Linux.

Manual Probing
--------------

On some architectures, the firmware may not be able to notify the operating
system about a memory hotplug event. Instead, the memory has to be manually
probed from user space.

The probe interface is located at::

        /sys/devices/system/memory/probe

Only complete memory blocks can be probed. Individual memory blocks are probed
by providing the physical start address of the memory block::

        % echo addr > /sys/devices/system/memory/probe

This results in a memory block for the range [addr, addr + memory_block_size)
being created.

.. note::

  Using the probe interface is discouraged as it is easy to crash the kernel,
  because Linux cannot validate user input; this interface might be removed in
  the future.

Onlining and Offlining Memory Blocks
====================================

After a memory block has been created, Linux has to be instructed to actually
make use of that memory: the memory block has to be "onlined".

Before a memory block can be removed, Linux has to stop using any memory part of
the memory block: the memory block has to be "offlined".

The Linux kernel can be configured to automatically online added memory blocks,
and drivers automatically trigger offlining of memory blocks when attempting
hotunplug of memory. Memory blocks can only be removed once offlining succeeded.

Onlining Memory Blocks Manually
-------------------------------

If auto-onlining of memory blocks isn't enabled, user space has to manually
trigger onlining of memory blocks. Often, udev rules are used to automate this
task in user space.

Onlining of a memory block can be triggered via::

        % echo online > /sys/devices/system/memory/memoryXXX/state

Or alternatively::

        % echo 1 > /sys/devices/system/memory/memoryXXX/online

The kernel will select the target zone automatically, usually defaulting to
``ZONE_NORMAL`` unless ``movable_node`` has been specified on the kernel
command line or if the memory block would intersect the ZONE_MOVABLE already.

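As an illustration of triggering onlining manually for many memory blocks, the
following sketch walks all memory block directories and writes ``online`` to
the ``state`` file of every block that reports ``offline``. The function name
``online_offline_blocks`` and its optional directory argument are assumptions
made for this example; on a real system, writing to these files requires
sufficient privileges.

.. code-block:: sh

   # Online every memory block that is currently offline.
   #   $1: memory sysfs directory (defaults to the real one; passing a
   #       different directory is only meant for dry runs and tests)
   online_offline_blocks() {
       root=${1:-/sys/devices/system/memory}
       for state_file in "$root"/memory*/state; do
           # Skip the unexpanded pattern if no memory block exists.
           [ -e "$state_file" ] || continue
           [ "$(cat "$state_file")" = "offline" ] || continue

           block=${state_file%/state}
           block=${block##*/}
           # Writing "online" lets the kernel select the zone automatically.
           if echo online > "$state_file"; then
               echo "onlined $block"
           else
               echo "failed to online $block" >&2
           fi
       done
   }

Blocks that are already online are left untouched, and a failed write is
reported on standard error without aborting the loop.
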
One can explicitly request to associate an offline memory block with
ZONE_MOVABLE by::

        % echo online_movable > /sys/devices/system/memory/memoryXXX/state

Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by::

        % echo online_kernel > /sys/devices/system/memory/memoryXXX/state

In any case, if onlining succeeds, the state of the memory block is changed to
be "online". If it fails, the state of the memory block will remain unchanged
and the above commands will fail.

Onlining Memory Blocks Automatically
------------------------------------

The kernel can be configured to try auto-onlining of newly added memory blocks.
If this feature is disabled, the memory blocks will stay offline until
explicitly onlined from user space.

The configured auto-online behavior can be observed via::

        % cat /sys/devices/system/memory/auto_online_blocks

Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
``online_movable`` to that file, like::

        % echo online > /sys/devices/system/memory/auto_online_blocks

Modifying the auto-online behavior will only affect subsequently added memory
blocks.

.. note::

  In corner cases, auto-onlining can fail. The kernel won't retry. Note that
  auto-onlining is not expected to fail in default configurations.

.. note::

  DLPAR on ppc64 ignores the ``offline`` setting and will still online added
  memory blocks; if onlining fails, memory blocks are removed again.

Offlining Memory Blocks
-----------------------

In the current implementation, Linux's memory offlining will try migrating all
movable pages off the affected memory block. As most kernel allocations, such as
page tables, are unmovable, page migration can fail and, therefore, inhibit
memory offlining from succeeding.

Having the memory provided by a memory block managed by ZONE_MOVABLE
significantly increases memory offlining reliability; still, memory offlining
can fail in some corner cases.

Further, memory offlining might retry for a long time (or even forever), until
aborted by the user.

Offlining of a memory block can be triggered via::

        % echo offline > /sys/devices/system/memory/memoryXXX/state

Or alternatively::

        % echo 0 > /sys/devices/system/memory/memoryXXX/online

If offlining succeeds, the state of the memory block is changed to be "offline".
If it fails, the state of the memory block will remain unchanged and the above
commands will fail, for example, via::

        bash: echo: write error: Device or resource busy

or via::

        bash: echo: write error: Invalid argument

Observing the State of Memory Blocks
------------------------------------

The state (online/offline/going-offline) of a memory block can be observed
either via::

        % cat /sys/devices/system/memory/memoryXXX/state

Or alternatively (1/0) via::

        % cat /sys/devices/system/memory/memoryXXX/online

For an online memory block, the managing zone can be observed via::

        % cat /sys/devices/system/memory/memoryXXX/valid_zones

Configuring Memory Hot(Un)Plug
==============================

There are various ways in which system administrators can configure memory
hot(un)plug and interact with memory blocks, especially to online them.

Memory Hot(Un)Plug Configuration via Sysfs
------------------------------------------

Some memory hot(un)plug properties can be configured or inspected via sysfs in::

        /sys/devices/system/memory/

The following files are currently defined:

====================== =========================================================
``auto_online_blocks`` read-write: set or get the default state of new memory
                       blocks; configure auto-onlining.

                       The default value depends on the
                       CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration
                       option.

                       See the ``state`` property of memory blocks for details.
``block_size_bytes``   read-only: the size in bytes of a memory block.
``probe``              write-only: add (probe) selected memory blocks manually
                       from user space by supplying the physical start address.

                       Availability depends on the CONFIG_ARCH_MEMORY_PROBE
                       kernel configuration option.
``uevent``             read-write: generic udev file for device subsystems.
====================== =========================================================

.. note::

  When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two
  additional files ``hard_offline_page`` and ``soft_offline_page`` are available
  to trigger hwpoisoning of pages, for example, for testing purposes. Note that
  this functionality is not really related to memory hot(un)plug or actual
  offlining of memory blocks.

Memory Block Configuration via Sysfs
------------------------------------

Each memory block is represented as a memory block device that can be
onlined or offlined. All memory blocks have their device information located in
sysfs. Each present memory block is listed under
``/sys/devices/system/memory`` as::

        /sys/devices/system/memory/memoryXXX

where XXX is the memory block id; the number of digits is variable.

A present memory block indicates that some memory in the range is present;
however, a memory block might span memory holes. A memory block spanning memory
holes cannot be offlined.

For example, assume 1 GiB memory block size. A device for a memory starting at
0x100000000 is ``/sys/devices/system/memory/memory4``::

        (0x100000000 / 1 GiB = 4)

This device covers address range [0x100000000 ... 0x140000000)

The following files are currently defined:

=================== ============================================================
``online``          read-write: simplified interface to trigger onlining /
                    offlining and to observe the state of a memory block.
                    When onlining, the zone is selected automatically.
``phys_device``     read-only: legacy interface only ever used on s390x to
                    expose the covered storage increment.
``phys_index``      read-only: the memory block id (XXX).
``removable``       read-only: legacy interface that indicated whether a memory
                    block was likely to be offlineable or not. Nowadays, the
                    kernel returns ``1`` if and only if it supports memory
                    offlining.
``state``           read-write: advanced interface to trigger onlining /
                    offlining and to observe the state of a memory block.

                    When writing, ``online``, ``offline``, ``online_kernel`` and
                    ``online_movable`` are supported.

                    ``online_movable`` specifies onlining to ZONE_MOVABLE.
                    ``online_kernel`` specifies onlining to the default kernel
                    zone for the memory block, such as ZONE_NORMAL.
                    ``online`` lets the kernel select the zone automatically.

                    When reading, ``online``, ``offline`` and ``going-offline``
                    may be returned.
``uevent``          read-write: generic uevent file for devices.
``valid_zones``     read-only: when a block is online, shows the zone it
                    belongs to; when a block is offline, shows what zone will
                    manage it when the block is onlined.

                    For online memory blocks, ``DMA``, ``DMA32``, ``Normal``,
                    ``Movable`` and ``none`` may be returned. ``none`` indicates
                    that memory provided by a memory block is managed by
                    multiple zones or spans multiple nodes; such memory blocks
                    cannot be offlined. ``Movable`` indicates ZONE_MOVABLE.
                    Other values indicate a kernel zone.

                    For offline memory blocks, the first column shows the
                    zone the kernel would select when onlining the memory block
                    right now without further specifying a zone.

                    Availability depends on the CONFIG_MEMORY_HOTREMOVE
                    kernel configuration option.
=================== ============================================================

.. note::

  If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/
  directories can also be accessed via symbolic links located in the
  ``/sys/devices/system/node/node*`` directories.

  For example::

        /sys/devices/system/node/node0/memory9 -> ../../memory/memory9

  A backlink will also be created::

        /sys/devices/system/memory/memory9/node0 -> ../../node/node0

Command Line Parameters
-----------------------

Some command line parameters affect memory hot(un)plug handling. The following
command line parameters are relevant:

======================== =======================================================
``memhp_default_state``  configure auto-onlining by essentially setting
                         ``/sys/devices/system/memory/auto_online_blocks``.
``movable_node``         configure automatic zone selection of the kernel. When
                         set, the kernel will default to ZONE_MOVABLE, unless
                         other zones can be kept contiguous.
======================== =======================================================

Module Parameters
-----------------

Instead of additional command line parameters or sysfs files, the
``memory_hotplug`` subsystem now provides a dedicated namespace for module
parameters. Module parameters can be set via the command line by prefixing
them with ``memory_hotplug.`` such as::

        memory_hotplug.memmap_on_memory=1

and they can be observed (and some even modified at runtime) via::

        /sys/module/memory_hotplug/parameters/

The following module parameters are currently defined:

======================== =======================================================
``memmap_on_memory``     read-write: Allocate memory for the memmap from the
                         added memory block itself. Even if enabled, actual
                         support depends on various other system properties and
                         should only be regarded as a hint whether the behavior
                         would be desired.

                         While allocating the memmap from the memory block
                         itself makes memory hotplug less likely to fail and
                         keeps the memmap on the same NUMA node in any case, it
                         can fragment physical memory in a way that huge pages
                         in bigger granularity cannot be formed on hotplugged
                         memory.
======================== =======================================================

ZONE_MOVABLE
============

ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
Further, having system RAM managed by ZONE_MOVABLE instead of one of the
kernel zones can increase the number of possible transparent huge pages and
dynamically allocated huge pages.

Most kernel allocations are unmovable. Important examples include the memory
map (usually 1/64th of memory), page tables, and kmalloc(). Such allocations
can only be served from the kernel zones.

Most user space pages, such as anonymous memory, and page cache pages are
movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones.

Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable
allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is
absolutely no guarantee whether a memory block can be offlined successfully.

Zone Imbalances
---------------

Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance,
which can harm the system or degrade performance. As one example, the kernel
might crash because it runs out of free memory for unmovable allocations,
although there is still plenty of free memory left in ZONE_MOVABLE.

Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1
are definitely impossible due to the overhead for the memory map.

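To get a rough idea of the current MOVABLE:KERNEL ratio, the pages managed by
each zone can be summed up. The following sketch assumes that
``/proc/zoneinfo`` contains ``Node N, zone NAME`` header lines followed by a
``managed`` line per zone that counts pages; the function name ``zone_ratio``
and its optional file argument are made up for this example.

.. code-block:: sh

   # Print the pages managed by ZONE_MOVABLE and by all other zones,
   # followed by the resulting MOVABLE:KERNEL ratio.
   #   $1: zoneinfo file (defaults to /proc/zoneinfo)
   zone_ratio() {
       zoneinfo=${1:-/proc/zoneinfo}
       awk '
           # Remember which zone the following lines belong to.
           $1 == "Node" && $3 == "zone" { zone = $4 }
           $1 == "managed" {
               if (zone == "Movable")
                   movable += $2
               else
                   kernel += $2
           }
           END {
               printf "movable_pages=%d kernel_pages=%d\n", movable, kernel
               # Avoid a division by zero on unexpected input.
               if (kernel > 0)
                   printf "ratio=%.2f:1\n", movable / kernel
           }
       ' "$zoneinfo"
   }

The result is only a coarse indicator: it treats every zone other than
``Movable`` as a kernel zone and does not account for CMA or offline memory.
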
4616bf53999SMike Rapoport 462*ac3332c4SDavid HildenbrandActual safe zone ratios depend on the workload. Extreme cases, like excessive 463*ac3332c4SDavid Hildenbrandlong-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all. 4646bf53999SMike Rapoport 4656bf53999SMike Rapoport.. note:: 4666bf53999SMike Rapoport 467*ac3332c4SDavid Hildenbrand CMA memory part of a kernel zone essentially behaves like memory in 468*ac3332c4SDavid Hildenbrand ZONE_MOVABLE and similar considerations apply, especially when combining 469*ac3332c4SDavid Hildenbrand CMA with ZONE_MOVABLE. 4706bf53999SMike Rapoport 471*ac3332c4SDavid HildenbrandZONE_MOVABLE Sizing Considerations 472*ac3332c4SDavid Hildenbrand---------------------------------- 473ad2fa371SMuchun Song 474*ac3332c4SDavid HildenbrandWe usually expect that a large portion of available system RAM will actually 475*ac3332c4SDavid Hildenbrandbe consumed by user space, either directly or indirectly via the page cache. In 476*ac3332c4SDavid Hildenbrandthe normal case, ZONE_MOVABLE can be used when allocating such pages just fine. 477ad2fa371SMuchun Song 478*ac3332c4SDavid HildenbrandWith that in mind, it makes sense that we can have a big portion of system RAM 479*ac3332c4SDavid Hildenbrandmanaged by ZONE_MOVABLE. However, there are some things to consider when using 480*ac3332c4SDavid HildenbrandZONE_MOVABLE, especially when fine-tuning zone ratios: 481fa965fd5SPavel Tatashin 482*ac3332c4SDavid Hildenbrand- Having a lot of offline memory blocks. Even offline memory blocks consume 483*ac3332c4SDavid Hildenbrand memory for metadata and page tables in the direct map; having a lot of offline 484*ac3332c4SDavid Hildenbrand memory blocks is not a typical case, though. 4856bf53999SMike Rapoport 486*ac3332c4SDavid Hildenbrand- Memory ballooning without balloon compaction is incompatible with 487*ac3332c4SDavid Hildenbrand ZONE_MOVABLE. 
  Only some implementations, such as virtio-balloon and pseries CMM, fully
  support balloon compaction.

  Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be
  disabled. In that case, balloon inflation will only perform unmovable
  allocations and silently create a zone imbalance, usually triggered by
  inflation requests from the hypervisor.

- Gigantic pages are unmovable, resulting in user space consuming a
  lot of unmovable memory.

- Huge pages are unmovable when an architecture does not support huge
  page migration, resulting in a similar issue as with gigantic pages.

- Page tables are unmovable. Excessive swapping, mapping extremely large
  files or ZONE_DEVICE memory can be problematic, although only really relevant
  in corner cases. When we manage a lot of user space memory that has been
  swapped out or is served from a file/persistent memory/... we still need a lot
  of page tables to manage that memory once user space has accessed it.

- In certain DAX configurations the memory map for the device memory will be
  allocated from the kernel zones.

- KASAN can have a significant memory overhead, for example, consuming 1/8th of
  the total system memory size as (unmovable) tracking metadata.

- Long-term pinning of pages. Techniques that rely on long-term pinnings
  (especially, RDMA and vfio/mdev) are fundamentally problematic with
  ZONE_MOVABLE, and therefore, memory offlining. Pinned pages cannot reside
  on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they
  have to be migrated off that zone while pinning. Pinning a page can fail
  even if there is plenty of free memory in ZONE_MOVABLE.

  In addition, using ZONE_MOVABLE might make page pinning more expensive,
  because of the page migration overhead.

By default, all the memory configured at boot time is managed by the kernel
zones and ZONE_MOVABLE is not used.

To enable ZONE_MOVABLE to include the memory present at boot and to control the
ratio between movable and kernel zones there are two command line options:
``kernelcore=`` and ``movablecore=``. See
Documentation/admin-guide/kernel-parameters.rst for their description.
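
The arithmetic behind choosing a ``movablecore=`` value for a target
MOVABLE:KERNEL ratio can be sketched as follows. This is only an illustration:
the helper name ``movablecore_for_ratio`` is made up, the total memory size is
passed in by hand, and the sketch assumes that a size in MiB with an ``M``
suffix is an acceptable value for the option. The kernel decides how the
requested amount is actually distributed, so the resulting zone sizes should be
verified after boot.

.. code-block:: sh

    # Hypothetical helper: given the total memory in MiB and a target
    # MOVABLE:KERNEL ratio, print a matching movablecore= option.
    movablecore_for_ratio() {
        total_mib=$1
        movable_parts=$2
        kernel_parts=$3
        # The movable share of the total is movable / (movable + kernel);
        # shell arithmetic rounds the result down to whole MiB.
        echo "movablecore=$(( total_mib * movable_parts / (movable_parts + kernel_parts) ))M"
    }

    # 16 GiB of boot memory with a 3:1 ratio.
    movablecore_for_ratio 16384 3 1    # prints movablecore=12288M

With a 3:1 ratio, three quarters of the 16 GiB (12 GiB) would be requested for
ZONE_MOVABLE, leaving 4 GiB to the kernel zones.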

Memory Offlining and ZONE_MOVABLE
---------------------------------

Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
block might fail:

- Memory blocks with memory holes; this applies to memory blocks present during
  boot and can apply to memory blocks hotplugged via the XEN balloon and the
  Hyper-V balloon.

- Mixed NUMA nodes and mixed zones within a single memory block prevent memory
  offlining; this applies to memory blocks present during boot only.

- Special memory blocks prevented by the system from getting offlined. Examples
  include any memory available during boot on arm64 or memory blocks spanning
  the crashkernel area on s390x; this usually applies to memory blocks present
  during boot only.

- Memory blocks overlapping with CMA areas cannot be offlined; this applies to
  memory blocks present during boot only.

- Concurrent activity that operates on the same physical memory area, such as
  allocating gigantic pages, can result in temporary offlining failures.

- Out of memory when dissolving huge pages, especially when freeing unused
  vmemmap pages associated with each hugetlb page is enabled.

  Offlining code may be able to migrate huge page contents, but may not be able
  to dissolve the source huge page because it fails to allocate (unmovable)
  pages for the vmemmap when the system has no free memory left in the kernel
  zones.

  Users that depend on memory offlining to succeed for movable zones should
  carefully consider whether the memory savings gained from this feature are
  worth the risk of possibly not being able to offline memory in certain
  situations.

Further, when running into out of memory situations while migrating pages, or
when still encountering permanently unmovable pages within ZONE_MOVABLE
(-> BUG), memory offlining will keep retrying until it eventually succeeds.

When offlining is triggered from user space, the offlining context can be
terminated by sending a fatal signal. A timeout-based offlining can easily be
implemented via::

        % timeout $TIMEOUT offline_block | failure_handling
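
The one-liner above can be fleshed out into a small shell function. This is a
sketch under stated assumptions rather than a reference implementation: the
function name ``offline_block`` and its optional third argument (an alternative
sysfs root, useful for dry runs) are made up for illustration. It relies on
writing ``offline`` to the ``state`` file of a memory block below
``/sys/devices/system/memory``, which requires sufficient privileges, and on
the coreutils ``timeout`` utility, which exits with status 124 when the time
limit is hit.

.. code-block:: sh

    # Hypothetical helper: try to offline memory block $1, giving up after
    # $2 seconds. $3 optionally overrides the sysfs root (assumption, for
    # testing only).
    offline_block() {
        block=$1
        secs=$2
        root=${3:-/sys/devices/system/memory}
        state="$root/memory$block/state"

        rc=0
        # timeout sends SIGTERM to the writer once the limit expires, which
        # terminates the offlining context.
        timeout "$secs" sh -c 'echo offline > "$1"' sh "$state" || rc=$?

        case $rc in
            0)   echo "memory$block: offline" ;;
            124) echo "memory$block: timed out after ${secs}s" >&2 ;;
            *)   echo "memory$block: offlining failed (exit $rc)" >&2 ;;
        esac
        return "$rc"
    }

    # Example (as root): offline_block 42 60

The exit status lets a caller distinguish a timeout from other failures and
decide whether to retry, pick a different memory block, or give up.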