.. _admin_guide_memory_hotplug:

==============
Memory Hotplug
==============

:Created:							Jul 28 2007
:Updated: Add some details about locking internals:		Aug 20 2018

This document describes memory hotplug, including how to use it and its
current status. Because memory hotplug is still under development, the
contents of this text will change often.

.. contents:: :local:

.. note::

    (1) x86_64 has a special implementation for memory hotplug.
        This text does not describe it.
    (2) This text assumes that sysfs is mounted at ``/sys``.


Introduction
============

Purpose of memory hotplug
-------------------------

Memory hotplug allows users to increase or decrease the amount of memory.
Generally, there are two purposes.

(A) For changing the amount of memory.
    This is to allow a feature like capacity on demand.
(B) For installing/removing DIMMs or NUMA-nodes physically.
    This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc.

(A) is required by highly virtualized environments and (B) is required by
hardware which supports memory power management.

Linux memory hotplug is designed for both purposes.

Phases of memory hotplug
------------------------

There are 2 phases in memory hotplug:

  1) Physical Memory Hotplug phase
  2) Logical Memory Hotplug phase.

The first phase communicates with the hardware/firmware and sets up or tears
down the environment for hotplugged memory. This phase is necessary for
purpose (B), but it is also a good phase for communication in highly
virtualized environments.

When memory is hotplugged, the kernel recognizes the new memory, creates new
memory management tables, and creates sysfs files for operating on the new
memory.

If the firmware supports notifying the OS that new memory has been connected,
this phase is triggered automatically. ACPI can deliver this notification. If
not, the "probe" operation is used by the system administrator instead
(see :ref:`memory_hotplug_physical_mem`).

The Logical Memory Hotplug phase changes the memory state to
available/unavailable for users. The amount of memory from the user's view is
changed by this phase. When a memory range becomes available, the kernel makes
all memory in it available as free pages.

In this document, this phase is described as online/offline.

The Logical Memory Hotplug phase is triggered by the system administrator
writing to a sysfs file. For the hot-add case, it must be executed by hand
after the Physical Hotplug phase.
(However, if you write udev hotplug scripts for memory hotplug, these
phases can be executed seamlessly.)

Unit of Memory online/offline operation
---------------------------------------

Memory hotplug uses the SPARSEMEM memory model, which allows memory to be
divided into chunks of the same size. These chunks are called "sections". The
size of a memory section is architecture dependent. For example, power uses
16MiB, ia64 uses 1GiB.

Memory sections are combined into chunks referred to as "memory blocks". The
size of a memory block is architecture dependent and represents the logical
unit upon which memory online/offline operations are to be performed. The
default size of a memory block is the same as memory section size unless an
architecture specifies otherwise. (see :ref:`memory_hotplug_sysfs_files`.)

To determine the size (in bytes) of a memory block, read this file::

  /sys/devices/system/memory/block_size_bytes
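
The value can be converted into a more readable unit. The following is a
minimal sketch, assuming that the file holds a hexadecimal number without a
``0x`` prefix (verify this on your kernel). ``block_size_mib`` and the
``MEMORY_SYSFS`` override are hypothetical names introduced here so the
function can be tried against a copy of the directory::

	# Print the memory block size in MiB.
	# MEMORY_SYSFS is a hypothetical override for testing, not a kernel interface.
	block_size_mib() {
	    dir=${MEMORY_SYSFS:-/sys/devices/system/memory}
	    # Assumption: the file content is hexadecimal without a "0x" prefix.
	    hex=$(cat "$dir/block_size_bytes") || return 1
	    echo $(( 0x$hex / 1024 / 1024 ))
	}

With a file containing ``8000000``, ``block_size_mib`` prints ``128``.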

Kernel Configuration
====================

To use the memory hotplug feature, the kernel must be compiled with the
following config options.

- For all memory hotplug:

    - Memory model -> Sparse Memory  (``CONFIG_SPARSEMEM``)
    - Allow for memory hot-add       (``CONFIG_MEMORY_HOTPLUG``)

- To enable memory removal, the following are also necessary:

    - Allow for memory hot remove    (``CONFIG_MEMORY_HOTREMOVE``)
    - Page Migration                 (``CONFIG_MIGRATION``)

- For ACPI memory hotplug, the following are also necessary:

    - Memory hotplug (under ACPI Support menu) (``CONFIG_ACPI_HOTPLUG_MEMORY``)
    - This option can be built as a kernel module.

- As a related configuration, if your box supports NUMA-node hotplug
  via ACPI, then this option is necessary too.

    - ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
      (``CONFIG_ACPI_CONTAINER``).

      This option can be built as a kernel module too.


.. _memory_hotplug_sysfs_files:

sysfs files for memory hotplug
==============================

All memory blocks have their device information in sysfs.  Each memory block
is described under ``/sys/devices/system/memory`` as::

	/sys/devices/system/memory/memoryXXX

where XXX is the memory block id.

For the memory block covered by the sysfs directory, it is expected that all
memory sections in this range are present and no memory holes exist in the
range. Currently there is no way to determine if there is a memory hole, but
the existence of one should not affect the hotplug capabilities of the memory
block.

For example, assume 1GiB memory block size. A device for a memory starting at
0x100000000 is ``/sys/devices/system/memory/memory4``::

	(0x100000000 / 1GiB = 4)

This device covers address range [0x100000000 ... 0x140000000).
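
The arithmetic above can be sketched as a small shell function. It prints the
block id together with the start and the exclusive end of the covered range.
``memory_block_for_addr`` is a hypothetical helper name, and both arguments
(address and block size) are supplied by the caller::

	# Print "<block id> <start> <end>" for the block containing an address.
	# The end address is exclusive.
	memory_block_for_addr() {
	    addr=$(( $1 ))
	    size=$(( $2 ))
	    id=$(( addr / size ))
	    start=$(( id * size ))
	    printf '%d 0x%x 0x%x\n' "$id" "$start" "$(( start + size ))"
	}

	memory_block_for_addr 0x100000000 0x40000000

For the example values, this prints ``4 0x100000000 0x140000000``.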

Under each memory block, you can see 5 files:

- ``/sys/devices/system/memory/memoryXXX/phys_index``
- ``/sys/devices/system/memory/memoryXXX/phys_device``
- ``/sys/devices/system/memory/memoryXXX/state``
- ``/sys/devices/system/memory/memoryXXX/removable``
- ``/sys/devices/system/memory/memoryXXX/valid_zones``

=================== ============================================================
``phys_index``      read-only and contains memory block id, same as XXX.
``state``           read-write

                    - at read:  contains online/offline state of memory.
                    - at write: user can specify "online_kernel",
                      "online_movable", "online", "offline" command
                      which will be performed on all sections in the block.
``phys_device``     read-only: legacy interface only ever used on s390x to
                    expose the covered storage increment.
``removable``       read-only: legacy interface that indicated whether a memory
                    block was likely to be offlineable or not.  Newer kernel
                    versions return "1" if and only if the kernel supports
                    memory offlining.
``valid_zones``     read-only: designed to show by which zone memory provided by
                    a memory block is managed, and to show by which zone memory
                    provided by an offline memory block could be managed when
                    onlining.

                    The first column shows its default zone.

                    "memory6/valid_zones: Normal Movable" shows this memory
                    block can be onlined to ZONE_NORMAL by default and to
                    ZONE_MOVABLE by online_movable.

                    "memory7/valid_zones: Movable Normal" shows this memory
                    block can be onlined to ZONE_MOVABLE by default and to
                    ZONE_NORMAL by online_kernel.
=================== ============================================================

.. note::

  These directories/files appear after physical memory hotplug phase.
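
The files above can be combined into a short overview. The following sketch
prints one line per memory block with its name, its state, and its default
zone (the first column of ``valid_zones``). ``list_memory_blocks`` and the
``MEMORY_SYSFS`` override are hypothetical names introduced for illustration::

	# List every memory block with its state and default zone.
	# MEMORY_SYSFS is a hypothetical override for testing, not a kernel interface.
	list_memory_blocks() {
	    dir=${MEMORY_SYSFS:-/sys/devices/system/memory}
	    for block in "$dir"/memory[0-9]*; do
	        [ -r "$block/state" ] || continue
	        state=$(cat "$block/state")
	        # valid_zones may be missing or empty; keep only its first column.
	        zones=$(cat "$block/valid_zones" 2>/dev/null)
	        default_zone=${zones%% *}
	        printf '%s %s %s\n' "${block##*/}" "$state" "${default_zone:--}"
	    done
	}

A block whose ``valid_zones`` cannot be read is shown with ``-`` in the last
column.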

If CONFIG_NUMA is enabled, the memoryXXX/ directories can also be accessed
via symbolic links located in the ``/sys/devices/system/node/node*`` directories.

For example::

	/sys/devices/system/node/node0/memory9 -> ../../memory/memory9

A backlink will also be created::

	/sys/devices/system/memory/memory9/node0 -> ../../node/node0

.. _memory_hotplug_physical_mem:

Physical memory hot-add phase
=============================

Hardware(Firmware) Support
--------------------------

On x86_64/ia64 platforms, memory hotplug by ACPI is supported.

In general, firmware (ACPI) which supports memory hotplug defines a
memory class object with _HID "PNP0C80". When a notify is asserted to PNP0C80,
Linux's ACPI handler hot-adds the memory to the system and calls a hotplug udev
script. This is done automatically.

However, scripts for memory hotplug are currently not contained in the generic
udev package. You may have to write them yourself or online/offline memory by
hand. Please see :ref:`memory_hotplug_how_to_online_memory` and
:ref:`memory_hotplug_how_to_offline_memory`.

If the firmware supports NUMA-node hotplug and defines an object with _HID
"ACPI0004", "PNP0A05", or "PNP0A06", a notification is asserted to it, and the
ACPI handler calls hotplug code for all objects which are defined in it.
If a memory device is found, the memory hotplug code will be called.

Notify memory hot-add event by hand
-----------------------------------

On some architectures, the firmware may not notify the kernel of a memory
hotplug event.  Therefore, the memory "probe" interface is supported to
explicitly notify the kernel.  This interface depends on
CONFIG_ARCH_MEMORY_PROBE and can be configured on powerpc, sh, and x86
if hotplug is supported, although for x86 this should be handled by ACPI
notification.

The probe interface is located at::

	/sys/devices/system/memory/probe

You can tell the kernel the physical address of the new memory by::

	% echo start_address_of_new_memory > /sys/devices/system/memory/probe

Then, the memory range [start_address_of_new_memory,
start_address_of_new_memory + memory_block_size) is hot-added. In this case,
the hotplug script is not called (in the current implementation). You'll have
to online the memory yourself.  Please see
:ref:`memory_hotplug_how_to_online_memory`.
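
Before writing to the probe file, it can help to check which range a given
address would add. The following dry-run sketch only does arithmetic and never
touches sysfs. It assumes that the probed address has to be aligned to the
memory block size, which should be verified against your kernel.
``probe_range`` is a hypothetical helper name::

	# Print "<start> <end>" (end exclusive) for the range a probe would add.
	# Assumption: the address must be aligned to the memory block size.
	probe_range() {
	    addr=$(( $1 ))
	    size=$(( $2 ))
	    if [ $(( addr % size )) -ne 0 ]; then
	        echo "address is not aligned to the memory block size" >&2
	        return 1
	    fi
	    printf '0x%x 0x%x\n' "$addr" "$(( addr + size ))"
	}

For example, ``probe_range 0x140000000 0x40000000`` prints
``0x140000000 0x180000000``.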

Logical Memory hot-add phase
============================

State of memory
---------------

To see the (online/offline) state of a memory block, read the 'state' file::

	% cat /sys/devices/system/memory/memoryXXX/state


- If the memory block is online, you'll read "online".
- If the memory block is offline, you'll read "offline".


.. _memory_hotplug_how_to_online_memory:

How to online memory
--------------------

When memory is hot-added, the kernel decides whether or not to "online"
it according to the policy which can be read from the "auto_online_blocks"
file::

	% cat /sys/devices/system/memory/auto_online_blocks

The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
option. If it is disabled, the default is "offline", which means the newly added
memory is not in a ready-to-use state and you have to "online" the newly added
memory blocks manually. Automatic onlining can be requested by writing "online"
to the "auto_online_blocks" file::

	% echo online > /sys/devices/system/memory/auto_online_blocks

This sets a global policy and impacts all memory blocks that will subsequently
be hotplugged. Blocks that are currently offline keep their state. It is
possible, under certain circumstances, that some memory blocks will be added
but will fail to online. User space tools can check their "state" files
(``/sys/devices/system/memory/memoryXXX/state``) and try to online them manually.
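
Such a check can be sketched as a loop over all "state" files that writes
"online" to every block currently reporting "offline". This is only an
illustration: ``online_offline_blocks`` and the ``MEMORY_SYSFS`` override are
hypothetical names, and on a real system the function needs sufficient
privileges to write to sysfs::

	# Try to online every memory block that is currently offline.
	# MEMORY_SYSFS is a hypothetical override for testing, not a kernel interface.
	online_offline_blocks() {
	    dir=${MEMORY_SYSFS:-/sys/devices/system/memory}
	    failed=0
	    for state in "$dir"/memory[0-9]*/state; do
	        [ -w "$state" ] || continue
	        [ "$(cat "$state")" = "offline" ] || continue
	        if ! echo online 2>/dev/null > "$state"; then
	            echo "failed to online ${state%/state}" >&2
	            failed=1
	        fi
	    done
	    return "$failed"
	}

The function returns a non-zero status if at least one block could not be
onlined, so a caller can decide whether to retry.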

If automatic onlining wasn't requested or failed, or some memory block was
offlined, it is possible to change the individual block's state by writing to
the "state" file::

	% echo online > /sys/devices/system/memory/memoryXXX/state

This onlining will not change the ZONE type of the target memory block.
If the memory block doesn't belong to any zone, an appropriate kernel zone
(usually ZONE_NORMAL) will be used, unless the movable_node kernel command line
option is specified, in which case ZONE_MOVABLE will be used.

You can explicitly request to associate it with ZONE_MOVABLE by::

	% echo online_movable > /sys/devices/system/memory/memoryXXX/state

.. note:: current limit: this memory block must be adjacent to ZONE_MOVABLE

Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by::

	% echo online_kernel > /sys/devices/system/memory/memoryXXX/state

.. note:: current limit: this memory block must be adjacent to ZONE_NORMAL

An explicit zone onlining can fail (e.g. when the range is already within
an existing and incompatible zone).

After this, memory block XXX's state will be 'online' and the amount of
available memory will be increased.

This may be changed in the future.

Logical memory remove
=====================

Memory offline and ZONE_MOVABLE
-------------------------------

Memory offlining is more complicated than memory onlining. Because memory
offlining has to make the whole memory block unused, it can fail if
the memory block includes memory which cannot be freed.

In general, memory offline can use 2 techniques.

(1) reclaim and free all memory in the memory block.
(2) migrate all pages in the memory block.

In the current implementation, Linux's memory offline uses method (2), freeing
all pages in the memory block by page migration. But not all pages are
migratable. Under current Linux, migratable pages are anonymous pages and
page caches. For offlining a memory block by migration, the kernel has to
guarantee that the memory block contains only migratable pages.

A boot option for creating memory blocks which consist only of migratable pages
is supported. By specifying the "kernelcore=" or "movablecore=" boot option,
you can create ZONE_MOVABLE, a zone which is used only for movable pages.
(See also Documentation/admin-guide/kernel-parameters.rst)

Assuming the system has "TOTAL" amount of memory at boot time, these boot
options create ZONE_MOVABLE as follows.

1) When kernelcore=YYYY boot option is used,
   Size of memory not for movable pages (not for offline) is YYYY.
   Size of memory for movable pages (for offline) is TOTAL-YYYY.

2) When movablecore=ZZZZ boot option is used,
   Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
   Size of memory for movable pages (for offline) is ZZZZ.

.. note::

   Unfortunately, there is no information to show which memory block belongs
   to ZONE_MOVABLE. This is TBD.

   Memory offlining can fail when dissolving a free huge page on ZONE_MOVABLE
   and the feature of freeing unused vmemmap pages associated with each hugetlb
   page is enabled.

   This can happen when we have plenty of ZONE_MOVABLE memory, but not enough
   kernel memory to allocate vmemmap pages.  We may even be able to migrate
   huge page contents, but will not be able to dissolve the source huge page.
   This will prevent an offline operation and is unfortunate as memory offlining
   is expected to succeed on movable zones.  Users that depend on memory hotplug
   to succeed for movable zones should carefully consider whether the memory
   savings gained from this feature are worth the risk of possibly not being
   able to offline memory in certain situations.

.. note::

   Techniques that rely on long-term pinnings of memory (especially, RDMA and
   vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
   hot remove. Pinned pages cannot reside on ZONE_MOVABLE, to guarantee that
   memory can still get hot removed - be aware that pinning can fail even if
   there is plenty of free memory in ZONE_MOVABLE. In addition, using
   ZONE_MOVABLE might make page pinning more expensive, because pages have to be
   migrated off that zone first.

.. _memory_hotplug_how_to_offline_memory:

How to offline memory
---------------------

You can offline a memory block by using the same sysfs interface that was used
in memory onlining::

	% echo offline > /sys/devices/system/memory/memoryXXX/state

If offline succeeds, the state of the memory block is changed to "offline".
If it fails, an error code (like -EBUSY) will be returned by the kernel.
Even if a memory block does not belong to ZONE_MOVABLE, you can try to offline
it.  If it doesn't contain 'unmovable' memory, the offline will succeed.

A memory block under ZONE_MOVABLE is considered to be able to be offlined
easily.  But under some busy state, it may return -EBUSY. Even if a memory
block cannot be offlined due to -EBUSY, you can retry offlining it and may be
able to offline it (or not). (For example, a page is referred to by some kernel
internal call and released soon.)
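
Retrying by hand can be sketched as a small loop around the write to the
"state" file. The retry count and the one second pause are arbitrary choices
for illustration, and ``offline_with_retry`` and the ``MEMORY_SYSFS`` override
are hypothetical names::

	# Try to offline one memory block, retrying a few times on failure.
	# Usage: offline_with_retry memoryXXX [tries]
	# MEMORY_SYSFS is a hypothetical override for testing, not a kernel interface.
	offline_with_retry() {
	    state=${MEMORY_SYSFS:-/sys/devices/system/memory}/$1/state
	    tries=${2:-3}
	    while :; do
	        # The write fails (e.g. with -EBUSY) if the block cannot be offlined.
	        if echo offline 2>/dev/null > "$state"; then
	            return 0
	        fi
	        tries=$(( tries - 1 ))
	        [ "$tries" -gt 0 ] || return 1
	        sleep 1
	    done
	}

The function returns 0 as soon as one write succeeds and 1 once all attempts
have failed, leaving the decision about further retries to the caller.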

Consideration:
  Memory hotplug's design direction is to make the possibility of memory
  offlining higher and to guarantee unplugging memory under any situation. But
  it needs more work. Returning -EBUSY under some situations may be good because
  the user can decide whether or not to retry. Currently, the memory offlining
  code does some amount of retry with a 120 second timeout.

Physical memory remove
======================

More implementation work is needed:

 - Notifying the firmware that the OS has completed its removal work.
 - Guarding against removal while that work is not yet complete.


Locking Internals
=================

When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
the device_hotplug_lock should be held to:

- synchronize against online/offline requests (e.g. via sysfs). This way, memory
  block devices can only be accessed (.online/.state attributes) by user
  space once memory has been fully added. And when removing memory, we
  know nobody is in critical sections.
- synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)

In particular, there is a possible lock inversion that is avoided using
device_hotplug_lock when adding memory and user space tries to online that
memory faster than expected:

- device_online() will first take the device_lock(), followed by
  mem_hotplug_lock
- add_memory_resource() will first take the mem_hotplug_lock, followed by
  the device_lock() (while creating the devices, during bus_add_device()).

As the device is visible to user space before taking the device_lock(), this
can result in a lock inversion.

Onlining/offlining of memory should be done via device_online()/
device_offline() to make sure it is properly synchronized with actions
via sysfs. Holding device_hotplug_lock is advised (e.g. to protect online_type).

When adding/removing/onlining/offlining memory or adding/removing
heterogeneous/device memory, we should always hold the mem_hotplug_lock in
write mode to serialise memory hotplug (e.g. access to global/zone
variables).

In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read
mode allows for a quite efficient get_online_mems/put_online_mems
implementation, so code accessing memory can protect from that memory
vanishing.


Future Work
===========

  - allowing memory hot-add to ZONE_MOVABLE. Maybe we need some switch like
    sysctl or a new control file.
  - showing memory block and physical device relationship.
  - testing and improving memory offlining.
  - supporting HugeTLB page migration and offlining.
  - removing the memmap at memory offline.
  - physical removal of memory.