.. _admin_guide_memory_hotplug:

==============
Memory Hotplug
==============

:Created: Jul 28 2007
:Updated: Add some details about locking internals: Aug 20 2018

This document describes memory hotplug, including how to use it and its
current status. Because memory hotplug is still under development, the
contents of this text will change often.

.. contents:: :local:

.. note::

    (1) x86_64 has a special implementation for memory hotplug.
        This text does not describe it.
    (2) This text assumes that sysfs is mounted at ``/sys``.


Introduction
============

Purpose of memory hotplug
-------------------------

Memory hotplug allows users to increase or decrease the amount of memory.
Generally, there are two purposes.

(A) For changing the amount of memory.
    This is to allow a feature like capacity on demand.
(B) For installing/removing DIMMs or NUMA-nodes physically.
    This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc.

(A) is required by highly virtualized environments and (B) is required by
hardware which supports memory power management.

Linux memory hotplug is designed for both purposes.

Phases of memory hotplug
------------------------

There are 2 phases in memory hotplug:

  1) Physical Memory Hotplug phase
  2) Logical Memory Hotplug phase.

The first phase communicates with the hardware/firmware and sets up or tears
down the environment for hotplugged memory. Basically, this phase is necessary
for purpose (B), but it is also a good phase for communication in highly
virtualized environments.

When memory is hotplugged, the kernel recognizes the new memory, creates new
memory management tables, and creates the sysfs files used to operate on the
new memory.

If the firmware supports notifying the OS that new memory has been connected,
this phase is triggered automatically. ACPI can notify this event. If not,
the "probe" operation by the system administrator is used instead
(see :ref:`memory_hotplug_physical_mem`).

The Logical Memory Hotplug phase changes the memory state to
available/unavailable for users. The amount of memory from the user's view is
changed by this phase. When a memory range becomes available, the kernel makes
all memory in it available as free pages.

In this document, this phase is described as online/offline.

The Logical Memory Hotplug phase is triggered by the system administrator
writing to a sysfs file. For the hot-add case, it must be executed by hand
after the Physical Memory Hotplug phase.
(However, if you write udev hotplug scripts for memory hotplug, these
phases can be executed in a seamless way.)

Unit of Memory online/offline operation
---------------------------------------

Memory hotplug uses the SPARSEMEM memory model, which allows memory to be
divided into chunks of the same size. These chunks are called "sections". The
size of a memory section is architecture dependent. For example, power uses
16MiB, ia64 uses 1GiB.

Memory sections are combined into chunks referred to as "memory blocks". The
size of a memory block is architecture dependent and represents the logical
unit upon which memory online/offline operations are to be performed. The
default size of a memory block is the same as the memory section size unless an
architecture specifies otherwise. (see :ref:`memory_hotplug_sysfs_files`.)

To determine the size (in bytes) of a memory block please read this file::

    /sys/devices/system/memory/block_size_bytes

Kernel Configuration
====================

To use the memory hotplug feature, the kernel must be compiled with the
following config options.

- For all memory hotplug:

  - Memory model -> Sparse Memory (``CONFIG_SPARSEMEM``)
  - Allow for memory hot-add (``CONFIG_MEMORY_HOTPLUG``)

- To enable memory removal, the following are also necessary:

  - Allow for memory hot remove (``CONFIG_MEMORY_HOTREMOVE``)
  - Page Migration (``CONFIG_MIGRATION``)

- For ACPI memory hotplug, the following are also necessary:

  - Memory hotplug (under ACPI Support menu) (``CONFIG_ACPI_HOTPLUG_MEMORY``)
  - This option can be built as a kernel module.

- As a related configuration, if your box has a feature of NUMA-node hotplug
  via ACPI, then this option is necessary too.

  - ACPI0004, PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
    (``CONFIG_ACPI_CONTAINER``).

    This option can be built as a kernel module too.


.. _memory_hotplug_sysfs_files:

sysfs files for memory hotplug
==============================

All memory blocks have their device information in sysfs. Each memory block
is described under ``/sys/devices/system/memory`` as::

    /sys/devices/system/memory/memoryXXX

where XXX is the memory block id.

For the memory block covered by a sysfs directory, it is expected that all
memory sections in this range are present and that no memory holes exist in
the range. Currently there is no way to determine if there is a memory hole,
but the existence of one should not affect the hotplug capabilities of the
memory block.

For example, assume 1GiB memory block size. A device for a memory starting at
0x100000000 is ``/sys/devices/system/memory/memory4``::

    (0x100000000 / 1GiB = 4)

This device covers address range [0x100000000 ... 0x140000000)

Under each memory block, you can see 5 files:

- ``/sys/devices/system/memory/memoryXXX/phys_index``
- ``/sys/devices/system/memory/memoryXXX/phys_device``
- ``/sys/devices/system/memory/memoryXXX/state``
- ``/sys/devices/system/memory/memoryXXX/removable``
- ``/sys/devices/system/memory/memoryXXX/valid_zones``

=================== ============================================================
``phys_index``      read-only and contains memory block id, same as XXX.
``state``           read-write

                    - at read:  contains online/offline state of memory.
                    - at write: user can specify "online_kernel",
                      "online_movable", "online", "offline" command
                      which will be performed on all sections in the block.
``phys_device``     read-only: legacy interface only ever used on s390x to
                    expose the covered storage increment.
``removable``       read-only: legacy interface that indicated whether a memory
                    block was likely to be offlineable or not.  Newer kernel
                    versions return "1" if and only if the kernel supports
                    memory offlining.
``valid_zones``     read-only: designed to show by which zone memory provided by
                    a memory block is managed, and to show by which zone memory
                    provided by an offline memory block could be managed when
                    onlining.

                    The first column shows its default zone.

                    "memory6/valid_zones: Normal Movable" shows this memory
                    block can be onlined to ZONE_NORMAL by default and to
                    ZONE_MOVABLE by online_movable.

                    "memory7/valid_zones: Movable Normal" shows this memory
                    block can be onlined to ZONE_MOVABLE by default and to
                    ZONE_NORMAL by online_kernel.
=================== ============================================================

.. note::

  These directories/files appear after the physical memory hotplug phase.

If CONFIG_NUMA is enabled, the memoryXXX/ directories can also be accessed
via symbolic links located in the ``/sys/devices/system/node/node*`` directories.

For example::

    /sys/devices/system/node/node0/memory9 -> ../../memory/memory9

A backlink will also be created::

    /sys/devices/system/memory/memory9/node0 -> ../../node/node0

.. _memory_hotplug_physical_mem:

Physical memory hot-add phase
=============================

Hardware(Firmware) Support
--------------------------

On x86_64 and ia64 platforms, memory hotplug via ACPI is supported.

In general, firmware (ACPI) which supports memory hotplug defines a
memory class object with _HID "PNP0C80". When a notify is asserted to PNP0C80,
Linux's ACPI handler hot-adds the memory to the system and calls a hotplug udev
script. This is done automatically.

However, scripts for memory hotplug are not (yet) contained in the generic udev
package. You may have to write them yourself or online/offline memory by hand.
Please see :ref:`memory_hotplug_how_to_online_memory` and
:ref:`memory_hotplug_how_to_offline_memory`.

If the firmware supports NUMA-node hotplug and defines an object with _HID
"ACPI0004", "PNP0A05", or "PNP0A06", the notification is asserted to it, and the
ACPI handler calls the hotplug code for all of the objects which are defined in it.
If a memory device is found, the memory hotplug code will be called.

Notify memory hot-add event by hand
-----------------------------------

On some architectures, the firmware may not notify the kernel of a memory
hotplug event. Therefore, the memory "probe" interface is supported to
explicitly notify the kernel. This interface depends on
CONFIG_ARCH_MEMORY_PROBE and can be configured on powerpc, sh, and x86
if hotplug is supported, although for x86 this should be handled by ACPI
notification.

The probe interface is located at::

    /sys/devices/system/memory/probe

You can tell the kernel the physical address of new memory by::

    % echo start_address_of_new_memory > /sys/devices/system/memory/probe

Then, the memory range [start_address_of_new_memory,
start_address_of_new_memory + memory_block_size) is hot-added. In this case,
the hotplug script is not called (in the current implementation). You'll have
to online the memory yourself. Please see
:ref:`memory_hotplug_how_to_online_memory`.
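
As an illustration of the probe interface described above, the following is a
minimal shell sketch, not a definitive implementation. The helper names
``block_size``, ``block_id_for_addr`` and ``probe_memory`` and the
``MEMORY_SYSFS`` override are hypothetical and exist only for this example.
It assumes that ``block_size_bytes`` prints a hexadecimal value without a
``0x`` prefix and that the probed address must be aligned to the memory block
size; please verify both on your kernel::

    #!/bin/sh
    # Hypothetical helpers for illustration only.
    # MEMORY_SYSFS can be overridden, e.g. to test without touching real sysfs.
    MEMORY_SYSFS="${MEMORY_SYSFS:-/sys/devices/system/memory}"

    # Print the memory block size in bytes as a decimal number.
    # Assumption: block_size_bytes holds hex digits without a "0x" prefix.
    block_size() {
        echo $((0x$(cat "$MEMORY_SYSFS/block_size_bytes")))
    }

    # Print the id of the memory block that covers physical address $1.
    block_id_for_addr() {
        echo $(($1 / $(block_size)))
    }

    # Ask the kernel to hot-add the memory block starting at address $1.
    probe_memory() {
        if [ ! -w "$MEMORY_SYSFS/probe" ]; then
            echo "probe interface not available" >&2
            return 1
        fi
        # Assumption: the address must be aligned to the memory block size.
        if [ $(($1 % $(block_size))) -ne 0 ]; then
            echo "address $1 is not aligned to the memory block size" >&2
            return 1
        fi
        echo "$1" > "$MEMORY_SYSFS/probe"
    }

With a 1GiB memory block size, ``block_id_for_addr 0x100000000`` prints 4,
which matches the memory4 example earlier in this document, and
``probe_memory 0x100000000`` asks the kernel to hot-add that block. Writing to
the real probe file requires root privileges.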

Logical Memory hot-add phase
============================

State of memory
---------------

To see the (online/offline) state of a memory block, read the 'state' file::

    % cat /sys/devices/system/memory/memoryXXX/state


- If the memory block is online, you'll read "online".
- If the memory block is offline, you'll read "offline".


.. _memory_hotplug_how_to_online_memory:

How to online memory
--------------------

When memory is hot-added, the kernel decides whether or not to "online"
it according to the policy which can be read from the "auto_online_blocks"
file::

    % cat /sys/devices/system/memory/auto_online_blocks

The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
option. If it is disabled, the default is "offline", which means the newly added
memory is not in a ready-to-use state and you have to "online" the newly added
memory blocks manually.
Automatic onlining can be requested by writing "online" to the
"auto_online_blocks" file::

    % echo online > /sys/devices/system/memory/auto_online_blocks

This sets a global policy and impacts all memory blocks that will subsequently
be hotplugged. Currently offline blocks keep their state. It is possible, under
certain circumstances, that some memory blocks will be added but will fail to
online. User space tools can check their "state" files
(``/sys/devices/system/memory/memoryXXX/state``) and try to online them manually.

If automatic onlining wasn't requested or failed, or if some memory block was
offlined, it is possible to change the individual block's state by writing to
the "state" file::

    % echo online > /sys/devices/system/memory/memoryXXX/state

This onlining will not change the ZONE type of the target memory block.
If the memory block doesn't belong to any zone, an appropriate kernel zone
(usually ZONE_NORMAL) will be used, unless the movable_node kernel command line
option is specified, in which case ZONE_MOVABLE will be used.

You can explicitly request to associate it with ZONE_MOVABLE by::

    % echo online_movable > /sys/devices/system/memory/memoryXXX/state

.. note:: current limit: this memory block must be adjacent to ZONE_MOVABLE

Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by::

    % echo online_kernel > /sys/devices/system/memory/memoryXXX/state

.. note:: current limit: this memory block must be adjacent to ZONE_NORMAL

An explicit zone onlining can fail (e.g. when the range is already within
an existing and incompatible zone).

After this, memory block XXX's state will be 'online' and the amount of
available memory will be increased.

This may be changed in the future.

Logical memory remove
=====================

Memory offline and ZONE_MOVABLE
-------------------------------

Memory offlining is more complicated than memory onlining. Because memory
offlining has to make the whole memory block unused, it can fail if
the memory block includes memory which cannot be freed.

In general, memory offline can use 2 techniques.

(1) reclaim and free all memory in the memory block.
(2) migrate all pages in the memory block.

In the current implementation, Linux's memory offline uses method (2), freeing
all pages in the memory block by page migration.
But not all pages are migratable. Under current Linux, migratable pages are
anonymous pages and page caches. For offlining a memory block by migration,
the kernel has to guarantee that the memory block contains only migratable
pages.

A boot option for creating memory blocks which consist only of migratable pages
is now supported. By specifying the "kernelcore=" or "movablecore=" boot option,
you can create ZONE_MOVABLE, a zone which is used only for movable pages.
(See also Documentation/admin-guide/kernel-parameters.rst)

Assuming the system has "TOTAL" amount of memory at boot time, these boot
options create ZONE_MOVABLE as follows.

1) When the kernelcore=YYYY boot option is used,
   Size of memory not for movable pages (not for offline) is YYYY.
   Size of memory for movable pages (for offline) is TOTAL - YYYY.

2) When the movablecore=ZZZZ boot option is used,
   Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
   Size of memory for movable pages (for offline) is ZZZZ.

.. note::

   Unfortunately, there is no information to show which memory block belongs
   to ZONE_MOVABLE. This is TBD.

   Memory offlining can fail when dissolving a free huge page on ZONE_MOVABLE
   if the feature of freeing unused vmemmap pages associated with each hugetlb
   page is enabled.

   This can happen when we have plenty of ZONE_MOVABLE memory, but not enough
   kernel memory to allocate vmemmap pages. We may even be able to migrate
   huge page contents, but will not be able to dissolve the source huge page.
   This will prevent an offline operation and is unfortunate as memory offlining
   is expected to succeed on movable zones. Users that depend on memory hotplug
   to succeed for movable zones should carefully consider whether the memory
   savings gained from this feature are worth the risk of possibly not being
   able to offline memory in certain situations.

.. note::
   Techniques that rely on long-term pinnings of memory (especially, RDMA and
   vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
   hot remove. Pinned pages cannot reside on ZONE_MOVABLE, to guarantee that
   memory can still get hot removed. Be aware that pinning can fail even if
   there is plenty of free memory in ZONE_MOVABLE. In addition, using
   ZONE_MOVABLE might make page pinning more expensive, because pages have to be
   migrated off that zone first.

.. _memory_hotplug_how_to_offline_memory:

How to offline memory
---------------------

You can offline a memory block by using the same sysfs interface that was used
in memory onlining::

    % echo offline > /sys/devices/system/memory/memoryXXX/state

If offline succeeds, the state of the memory block is changed to "offline".
If it fails, an error code (like -EBUSY) will be returned by the kernel.
Even if a memory block does not belong to ZONE_MOVABLE, you can try to offline
it. If it doesn't contain 'unmovable' memory, the operation will succeed.

A memory block under ZONE_MOVABLE is considered easy to offline. But in some
busy states, offlining may return -EBUSY. Even if a memory block cannot be
offlined due to -EBUSY, you can retry offlining it, and the retry may succeed
(or not). (For example, a page may be referenced by some kernel-internal call
and released soon afterwards.)

Consideration:
  Memory hotplug's design direction is to make the possibility of memory
  offlining higher and to guarantee unplugging memory under any situation. But
  it needs more work. Returning -EBUSY in some situations may be good because
  users can decide for themselves whether to retry. Currently, the memory
  offlining code does some amount of retry with a 120 second timeout.
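
The retry behaviour described above can also be driven from user space. The
following is a minimal shell sketch under stated assumptions, not a definitive
implementation: the helper name ``offline_block``, its default retry count and
delay, and the ``MEMORY_SYSFS`` override are hypothetical choices made for this
example only::

    #!/bin/sh
    # Hypothetical helper for illustration only.
    # MEMORY_SYSFS can be overridden, e.g. to test without touching real sysfs.
    MEMORY_SYSFS="${MEMORY_SYSFS:-/sys/devices/system/memory}"

    # Try to offline memory block $1.
    # $2: number of attempts (default 3), $3: delay in seconds (default 1).
    offline_block() {
        ob_state="$MEMORY_SYSFS/memory$1/state"
        ob_tries=${2:-3}
        ob_delay=${3:-1}
        while [ "$ob_tries" -gt 0 ]; do
            # The write fails (e.g. with EBUSY) if the block cannot be
            # offlined right now; read the state back to confirm success.
            if { echo offline > "$ob_state"; } 2>/dev/null &&
               [ "$(cat "$ob_state" 2>/dev/null)" = "offline" ]; then
                return 0
            fi
            ob_tries=$((ob_tries - 1))
            [ "$ob_tries" -gt 0 ] && sleep "$ob_delay"
        done
        return 1
    }

For example, ``offline_block 9 5 2`` tries up to five times, two seconds apart,
to offline memory9 and returns non-zero if the block is still not offline.
Because the kernel already retries internally, a small number of attempts is
probably enough; the right values depend on your workload.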

Physical memory remove
======================

More implementation work is still needed:

 - Notifying the firmware when the OS has completed removal.
 - Guarding against removal while that is not yet done.


Locking Internals
=================

When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
the device_hotplug_lock should be held to:

- synchronize against online/offline requests (e.g. via sysfs). This way, memory
  block devices can only be accessed (.online/.state attributes) by user
  space once memory has been fully added. And when removing memory, we
  know nobody is in critical sections.
- synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)

In particular, there is a possible lock inversion that is avoided using
device_hotplug_lock when adding memory and user space tries to online that
memory faster than expected:

- device_online() will first take the device_lock(), followed by
  mem_hotplug_lock
- add_memory_resource() will first take the mem_hotplug_lock, followed by
  the device_lock() (while creating the devices, during bus_add_device()).

As the device is visible to user space before taking the device_lock(), this
can result in a lock inversion.

Onlining/offlining of memory should be done via device_online()/
device_offline(), to make sure it is properly synchronized with actions
via sysfs. Holding device_hotplug_lock is advised (e.g. to protect online_type).

When adding/removing/onlining/offlining memory or adding/removing
heterogeneous/device memory, we should always hold the mem_hotplug_lock in
write mode to serialise memory hotplug (e.g. access to global/zone
variables).

In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read
mode allows for a quite efficient get_online_mems/put_online_mems
implementation, so code accessing memory can protect from that memory
vanishing.


Future Work
===========

  - allowing memory hot-add to ZONE_MOVABLE. Maybe we need some switch like
    a sysctl or a new control file.
  - showing the relationship between memory blocks and physical devices.
  - testing and improving memory offlining.
  - supporting HugeTLB page migration and offlining.
  - removing the memmap at memory offline.
  - physical memory removal.