.. _admin_guide_memory_hotplug:

==================
Memory Hot(Un)Plug
==================

This document describes generic Linux support for memory hot(un)plug with
a focus on System RAM, including ZONE_MOVABLE support.

.. contents:: :local:

Introduction
============

Memory hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime. In the simplest case, it consists of
physically plugging or unplugging a DIMM at runtime, coordinated with the
operating system.

Memory hot(un)plug is used for various purposes:

- The physical memory available to a machine can be adjusted at runtime, up- or
  downgrading the memory capacity. This dynamic memory resizing, sometimes
  referred to as "capacity on demand", is frequently used with virtual machines
  and logical partitions.

- Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One
  example is replacing failing memory modules.

- Reducing energy consumption either by physically unplugging memory modules or
  by logically unplugging (parts of) memory modules from Linux.

Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also
used to expose persistent memory, other performance-differentiated memory and
reserved memory regions as ordinary system RAM to Linux.

Linux only supports memory hot(un)plug on selected 64-bit architectures, such as
x86_64, arm64, ppc64, s390x and ia64.

Memory Hot(Un)Plug Granularity
------------------------------

Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the
physical memory address space into chunks of the same size: memory sections. The
size of a memory section is architecture dependent. For example, x86_64 uses
128 MiB and ppc64 uses 16 MiB.

Memory sections are combined into chunks referred to as "memory blocks". The
size of a memory block is architecture dependent and corresponds to the smallest
granularity that can be hot(un)plugged. The default size of a memory block is
the same as the memory section size, unless an architecture specifies otherwise.

All memory blocks have the same size.

Phases of Memory Hotplug
------------------------

Memory hotplug consists of two phases:

(1) Adding the memory to Linux
(2) Onlining memory blocks

In the first phase, metadata, such as the memory map ("memmap") and page tables
for the direct mapping, is allocated and initialized, and memory blocks are
created; the latter also creates sysfs files for managing newly created memory
blocks.

In the second phase, added memory is exposed to the page allocator. After this
phase, the memory is visible in memory statistics, such as free and total
memory, of the system.

Phases of Memory Hotunplug
--------------------------

Memory hotunplug consists of two phases:

(1) Offlining memory blocks
(2) Removing the memory from Linux

In the first phase, memory is "hidden" from the page allocator again, for
example, by migrating busy memory to other memory locations and removing all
relevant free pages from the page allocator. After this phase, the memory is no
longer visible in memory statistics of the system.

In the second phase, the memory blocks are removed and metadata is freed.

Memory Hotplug Notifications
============================

Linux can be notified about memory hotplug events in various ways, such that it
can start adding hotplugged memory. This description is limited to systems that
support ACPI; mechanisms specific to other firmware interfaces or virtual
machines are not described.

ACPI Notifications
------------------

Platforms that support ACPI, such as x86_64, can support memory hotplug
notifications via ACPI.

In general, a firmware supporting memory hotplug defines a memory class object
with _HID "PNP0C80". When notified about hotplug of a new memory device, the
ACPI driver will hotplug the memory to Linux.

If the firmware supports hotplug of NUMA nodes, it defines an object with _HID
"ACPI0004", "PNP0A05", or "PNP0A06". When notified about a hotplug event, all
assigned memory devices are added to Linux by the ACPI driver.

Similarly, Linux can be notified about requests to hotunplug a memory device or
a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory
blocks, and, if successful, hotunplug the memory from Linux.

Manual Probing
--------------

On some architectures, the firmware may not be able to notify the operating
system about a memory hotplug event. Instead, the memory has to be manually
probed from user space.

The probe interface is located at::

	/sys/devices/system/memory/probe

Only complete memory blocks can be probed. Individual memory blocks are probed
by providing the physical start address of the memory block::

	% echo addr > /sys/devices/system/memory/probe

This results in a memory block for the range [addr, addr + memory_block_size)
being created.

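
The covered range can be worked out before writing to the probe file. The
following is a minimal shell sketch, not a kernel-provided tool: the function
name ``probe_range`` is hypothetical, the memory block size is passed in
explicitly, and the alignment check assumes that the start address has to be a
multiple of the memory block size::

	# Hypothetical helper: print the [start, end) range that probing
	# "addr" would cover, given the memory block size in bytes. Both
	# arguments may be decimal or 0x-prefixed hexadecimal.
	probe_range() {
		addr=$(($1))
		size=$(($2))
		# Only complete memory blocks can be probed, so the start
		# address is assumed to be a multiple of the block size.
		if [ $((addr % size)) -ne 0 ]; then
			echo "start address is not block aligned" >&2
			return 1
		fi
		printf '[0x%x, 0x%x)\n' "$addr" $((addr + size))
	}

	probe_range 0x100000000 0x8000000

With a memory block size of 128 MiB (0x8000000), the call above prints
``[0x100000000, 0x108000000)``.
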
.. note::

  Using the probe interface is discouraged as it is easy to crash the kernel,
  because Linux cannot validate user input; this interface might be removed in
  the future.

Onlining and Offlining Memory Blocks
====================================

After a memory block has been created, Linux has to be instructed to actually
make use of that memory: the memory block has to be "onlined".

Before a memory block can be removed, Linux has to stop using any memory that is
part of the memory block: the memory block has to be "offlined".

The Linux kernel can be configured to automatically online added memory blocks,
and drivers automatically trigger offlining of memory blocks when attempting
hotunplug of memory. Memory blocks can only be removed once offlining has
succeeded.

Onlining Memory Blocks Manually
-------------------------------

If auto-onlining of memory blocks isn't enabled, user space has to manually
trigger onlining of memory blocks. Often, udev rules are used to automate this
task in user space.

Onlining of a memory block can be triggered via::

	% echo online > /sys/devices/system/memory/memoryXXX/state

Or alternatively::

	% echo 1 > /sys/devices/system/memory/memoryXXX/online

The kernel will select the target zone automatically, depending on the
configured ``online_policy``.

One can explicitly request to associate an offline memory block with
ZONE_MOVABLE by::

	% echo online_movable > /sys/devices/system/memory/memoryXXX/state

Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by::

	% echo online_kernel > /sys/devices/system/memory/memoryXXX/state

In any case, if onlining succeeds, the state of the memory block is changed to
be "online". If it fails, the state of the memory block will remain unchanged
and the above commands will fail.

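
Because a failed onlining request only shows up as a failed write, scripts
should check the result of the write. The following is a minimal sketch under
stated assumptions: ``online_block`` and ``MEMORY_SYSFS`` are hypothetical
names introduced for illustration, and only the three onlining modes described
above are accepted::

	# MEMORY_SYSFS is an assumed variable; it only exists so that this
	# sketch can be pointed at another directory, e.g. for testing.
	MEMORY_SYSFS=${MEMORY_SYSFS:-/sys/devices/system/memory}

	# online_block <id> [online|online_kernel|online_movable]
	online_block() {
		block="$MEMORY_SYSFS/memory$1"
		mode=${2:-online}
		case "$mode" in
		online|online_kernel|online_movable) ;;
		*) echo "unsupported mode: $mode" >&2; return 2 ;;
		esac
		# The write fails if onlining fails; in that case the state
		# of the memory block remains unchanged.
		if ! echo "$mode" > "$block/state"; then
			echo "memory$1: onlining failed" >&2
			return 1
		fi
	}

For example, ``online_block 42 online_movable`` would request onlining of
``memory42`` to ZONE_MOVABLE and return a non-zero status on failure.
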
Onlining Memory Blocks Automatically
------------------------------------

The kernel can be configured to try auto-onlining of newly added memory blocks.
If this feature is disabled, the memory blocks will stay offline until
explicitly onlined from user space.

The configured auto-online behavior can be observed via::

	% cat /sys/devices/system/memory/auto_online_blocks

Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
``online_movable`` to that file, like::

	% echo online > /sys/devices/system/memory/auto_online_blocks

Similarly to manual onlining, with ``online`` the kernel will select the
target zone automatically, depending on the configured ``online_policy``.

Modifying the auto-online behavior will only affect subsequently added
memory blocks.

.. note::

  In corner cases, auto-onlining can fail. The kernel won't retry. Note that
  auto-onlining is not expected to fail in default configurations.

.. note::

  DLPAR on ppc64 ignores the ``offline`` setting and will still online added
  memory blocks; if onlining fails, memory blocks are removed again.

Offlining Memory Blocks
-----------------------

In the current implementation, Linux's memory offlining will try migrating all
movable pages off the affected memory block. As most kernel allocations, such as
page tables, are unmovable, page migration can fail and, therefore, prevent
memory offlining from succeeding.

Having the memory provided by a memory block managed by ZONE_MOVABLE
significantly increases memory offlining reliability; still, memory offlining
can fail in some corner cases.

Further, memory offlining might retry for a long time (or even forever), until
aborted by the user.

Offlining of a memory block can be triggered via::

	% echo offline > /sys/devices/system/memory/memoryXXX/state

Or alternatively::

	% echo 0 > /sys/devices/system/memory/memoryXXX/online

If offlining succeeds, the state of the memory block is changed to be "offline".
If it fails, the state of the memory block will remain unchanged and the above
commands will fail, for example, via::

	bash: echo: write error: Device or resource busy

or via::

	bash: echo: write error: Invalid argument

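
Since offlining might retry for a long time, a script may want to bound the
time spent on one request. The following is a rough sketch, not a recommended
procedure: ``offline_block`` and ``MEMORY_SYSFS`` are hypothetical names, the
``timeout`` utility from coreutils is assumed to be installed, and the sketch
assumes that signalling the writing process aborts the pending request::

	# MEMORY_SYSFS is an assumed variable, used so that the sketch can
	# be pointed at another directory, e.g. for testing.
	MEMORY_SYSFS=${MEMORY_SYSFS:-/sys/devices/system/memory}

	# offline_block <id> [seconds]: try to offline a memory block and
	# give up after the given number of seconds (default: 60).
	offline_block() {
		state="$MEMORY_SYSFS/memory$1/state"
		# timeout(1) signals the child shell once the time is up,
		# which is assumed to abort the offlining attempt.
		if ! timeout "${2:-60}" sh -c 'echo offline > "$1"' sh "$state"; then
			echo "memory$1: offlining failed or timed out" >&2
			return 1
		fi
	}

For example, ``offline_block 42 30`` would give up on ``memory42`` after
roughly 30 seconds; the state file can then be read to confirm that the memory
block is still online.
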
Observing the State of Memory Blocks
------------------------------------

The state (online/offline/going-offline) of a memory block can be observed
either via::

	% cat /sys/devices/system/memory/memoryXXX/state

Or alternatively (1/0) via::

	% cat /sys/devices/system/memory/memoryXXX/online

For an online memory block, the managing zone can be observed via::

	% cat /sys/devices/system/memory/memoryXXX/valid_zones

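
To get an overview of all memory blocks at once, the files above can be read
in a loop. The following is a minimal sketch; ``list_blocks`` and
``MEMORY_SYSFS`` are hypothetical names introduced for illustration::

	# MEMORY_SYSFS is an assumed variable, used so that the sketch can
	# be pointed at another directory, e.g. for testing.
	MEMORY_SYSFS=${MEMORY_SYSFS:-/sys/devices/system/memory}

	# Print one line per memory block: name, state and valid zones.
	list_blocks() {
		for dir in "$MEMORY_SYSFS"/memory[0-9]*; do
			[ -d "$dir" ] || continue
			state=$(cat "$dir/state")
			# valid_zones may not exist, depending on the kernel
			# configuration; print "-" in that case.
			zones=$(cat "$dir/valid_zones" 2>/dev/null || echo -)
			printf '%s %s %s\n' "${dir##*/}" "$state" "$zones"
		done
	}

	list_blocks

The memory blocks are listed in the order in which the shell expands the glob,
so ``memory10`` is printed before ``memory2``.
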
Configuring Memory Hot(Un)Plug
==============================

System administrators can configure memory hot(un)plug and interact with memory
blocks, especially to online them, in various ways.

Memory Hot(Un)Plug Configuration via Sysfs
------------------------------------------

Some memory hot(un)plug properties can be configured or inspected via sysfs in::

	/sys/devices/system/memory/

The following files are currently defined:

====================== =========================================================
``auto_online_blocks`` read-write: set or get the default state of new memory
                       blocks; configure auto-onlining.

                       The default value depends on the
                       CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration
                       option.

                       See the ``state`` property of memory blocks for details.
``block_size_bytes``   read-only: the size in bytes of a memory block.
``probe``              write-only: add (probe) selected memory blocks manually
                       from user space by supplying the physical start address.

                       Availability depends on the CONFIG_ARCH_MEMORY_PROBE
                       kernel configuration option.
``uevent``             read-write: generic uevent file for device subsystems.
====================== =========================================================

.. note::

  When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two
  additional files ``hard_offline_page`` and ``soft_offline_page`` are available
  to trigger hwpoisoning of pages, for example, for testing purposes. Note that
  this functionality is not really related to memory hot(un)plug or actual
  offlining of memory blocks.

Memory Block Configuration via Sysfs
------------------------------------

Each memory block is represented as a memory block device that can be
onlined or offlined. All memory blocks have their device information located in
sysfs. Each present memory block is listed under
``/sys/devices/system/memory`` as::

	/sys/devices/system/memory/memoryXXX

where XXX is the memory block id; the number of digits is variable.

A present memory block indicates that some memory in the range is present;
however, a memory block might span memory holes. A memory block spanning memory
holes cannot be offlined.

For example, assume a memory block size of 1 GiB. The device for memory starting
at 0x100000000 is ``/sys/devices/system/memory/memory4``::

	(0x100000000 / 1 GiB = 4)

This device covers the address range [0x100000000 ... 0x140000000).

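
The same division can be done with shell arithmetic. The following is a minimal
sketch; ``block_for_addr`` is a hypothetical helper name, and the memory block
size is passed in explicitly rather than read from sysfs::

	# Hypothetical helper: print the name of the memory block device
	# covering a physical address, given the memory block size in
	# bytes. Both arguments may be decimal or 0x-prefixed hexadecimal.
	block_for_addr() {
		printf 'memory%d\n' $(($1 / $2))
	}

	block_for_addr 0x100000000 0x40000000    # prints "memory4"

On a real system, the second argument would be taken from
``block_size_bytes``; depending on how that file reports the value, a ``0x``
prefix might have to be added before using it in shell arithmetic.
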
The following files are currently defined:

=================== ============================================================
``online``          read-write: simplified interface to trigger onlining /
                    offlining and to observe the state of a memory block.
                    When onlining, the zone is selected automatically.
``phys_device``     read-only: legacy interface only ever used on s390x to
                    expose the covered storage increment.
``phys_index``      read-only: the memory block id (XXX).
``removable``       read-only: legacy interface that indicated whether a memory
                    block was likely to be offlineable or not. Nowadays, the
                    kernel returns ``1`` if and only if it supports memory
                    offlining.
``state``           read-write: advanced interface to trigger onlining /
                    offlining and to observe the state of a memory block.

                    When writing, ``online``, ``offline``, ``online_kernel`` and
                    ``online_movable`` are supported.

                    ``online_movable`` specifies onlining to ZONE_MOVABLE.
                    ``online_kernel`` specifies onlining to the default kernel
                    zone for the memory block, such as ZONE_NORMAL.
                    ``online`` lets the kernel select the zone automatically.

                    When reading, ``online``, ``offline`` and ``going-offline``
                    may be returned.
``uevent``          read-write: generic uevent file for devices.
``valid_zones``     read-only: when a block is online, shows the zone it
                    belongs to; when a block is offline, shows what zone will
                    manage it when the block is onlined.

                    For online memory blocks, ``DMA``, ``DMA32``, ``Normal``,
                    ``Movable`` and ``none`` may be returned. ``none`` indicates
                    that memory provided by a memory block is managed by
                    multiple zones or spans multiple nodes; such memory blocks
                    cannot be offlined. ``Movable`` indicates ZONE_MOVABLE.
                    Other values indicate a kernel zone.

                    For offline memory blocks, the first column shows the
                    zone the kernel would select when onlining the memory block
                    right now without further specifying a zone.

                    Availability depends on the CONFIG_MEMORY_HOTREMOVE
                    kernel configuration option.
=================== ============================================================

.. note::

  If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/
  directories can also be accessed via symbolic links located in the
  ``/sys/devices/system/node/node*`` directories.

  For example::

	/sys/devices/system/node/node0/memory9 -> ../../memory/memory9

  A backlink will also be created::

	/sys/devices/system/memory/memory9/node0 -> ../../node/node0

Command Line Parameters
-----------------------

Some command line parameters affect memory hot(un)plug handling. The following
command line parameters are relevant:

======================== =======================================================
``memhp_default_state``  configure auto-onlining by essentially setting
                         ``/sys/devices/system/memory/auto_online_blocks``.
``movable_node``         configure automatic zone selection in the kernel when
                         using the ``contig-zones`` online policy. When
                         set, the kernel will default to ZONE_MOVABLE when
                         onlining a memory block, unless other zones can be kept
                         contiguous.
======================== =======================================================

See Documentation/admin-guide/kernel-parameters.txt for a more generic
description of these command line parameters.

Module Parameters
-----------------

Instead of additional command line parameters or sysfs files, the
``memory_hotplug`` subsystem now provides a dedicated namespace for module
parameters. Module parameters can be set via the command line by prefixing
them with ``memory_hotplug.`` such as::

	memory_hotplug.memmap_on_memory=1

and they can be observed (and some even modified at runtime) via::

	/sys/module/memory_hotplug/parameters/

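
The current values can be inspected by reading every file in that directory.
The following is a minimal sketch; ``show_params`` and ``PARAMS`` are
hypothetical names introduced for illustration::

	# PARAMS is an assumed variable, used so that the sketch can be
	# pointed at another directory, e.g. for testing.
	PARAMS=${PARAMS:-/sys/module/memory_hotplug/parameters}

	# Print "name = value" for every readable parameter file.
	show_params() {
		for f in "$PARAMS"/*; do
			[ -r "$f" ] || continue
			printf '%s = %s\n' "${f##*/}" "$(cat "$f")"
		done
	}

	show_params

Which parameters show up depends on the kernel version and configuration.
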
422ac3332c4SDavid HildenbrandThe following module parameters are currently defined:
4236bf53999SMike Rapoport
424ac3332c4SDavid Hildenbrand================================ ===============================================
4259e122cc1SDavid Hildenbrand``memmap_on_memory``		 read-write: Allocate memory for the memmap from
4269e122cc1SDavid Hildenbrand				 the added memory block itself. Even if enabled,
4279e122cc1SDavid Hildenbrand				 actual support depends on various other system
4289e122cc1SDavid Hildenbrand				 properties and should only be regarded as a
4299e122cc1SDavid Hildenbrand				 hint whether the behavior would be desired.
4306bf53999SMike Rapoport
4319e122cc1SDavid Hildenbrand				 While allocating the memmap from the memory
4329e122cc1SDavid Hildenbrand				 block itself makes memory hotplug less likely
4339e122cc1SDavid Hildenbrand				 to fail and keeps the memmap on the same NUMA
4349e122cc1SDavid Hildenbrand				 node in any case, it can fragment physical
4359e122cc1SDavid Hildenbrand				 memory in a way that huge pages in bigger
4369e122cc1SDavid Hildenbrand				 granularity cannot be formed on hotplugged
437ac3332c4SDavid Hildenbrand				 memory.
4389e122cc1SDavid Hildenbrand``online_policy``		 read-write: Set the basic policy used for
4399e122cc1SDavid Hildenbrand				 automatic zone selection when onlining memory
4409e122cc1SDavid Hildenbrand				 blocks without specifying a target zone.
4419e122cc1SDavid Hildenbrand				 ``contig-zones`` has been the kernel default
4429e122cc1SDavid Hildenbrand				 before this parameter was added. After an
4439e122cc1SDavid Hildenbrand				 online policy was configured and memory was
4449e122cc1SDavid Hildenbrand				 online, the policy should not be changed
4459e122cc1SDavid Hildenbrand				 anymore.
4469e122cc1SDavid Hildenbrand
4479e122cc1SDavid Hildenbrand				 When set to ``contig-zones``, the kernel will
4489e122cc1SDavid Hildenbrand				 try keeping zones contiguous. If a memory block
4499e122cc1SDavid Hildenbrand				 intersects multiple zones or no zone, the
4509e122cc1SDavid Hildenbrand				 behavior depends on the ``movable_node`` kernel
4519e122cc1SDavid Hildenbrand				 command line parameter: default to ZONE_MOVABLE
4529e122cc1SDavid Hildenbrand				 if set, default to the applicable kernel zone
				 (usually ZONE_NORMAL) if not set.

				 When set to ``auto-movable``, the kernel will
				 try onlining memory blocks to ZONE_MOVABLE if
				 possible according to the configuration and
				 memory device details. With this policy, one
				 can avoid zone imbalances when eventually
				 hotplugging a lot of memory later and still
				 wanting to be able to hotunplug as much as
				 possible reliably, which is very desirable in
				 virtualized environments. This policy ignores
				 the ``movable_node`` kernel command line
				 parameter and isn't really applicable in
				 environments that require it (e.g., bare metal
				 with hotunpluggable nodes) where hotplugged
				 memory might be exposed via the
				 firmware-provided memory map early during boot
				 to the system instead of getting detected,
				 added and onlined later during boot (such as
				 done by virtio-mem or by some hypervisors
				 implementing emulated DIMMs). As one example, a
				 hotplugged DIMM will be onlined either
				 completely to ZONE_MOVABLE or completely to
				 ZONE_NORMAL, not a mixture.
				 As another example, as many memory blocks
				 belonging to a virtio-mem device will be
				 onlined to ZONE_MOVABLE as possible,
				 special-casing units of memory blocks that can
				 only get hotunplugged together. *This policy
				 does not protect from setups that are
				 problematic with ZONE_MOVABLE and does not
				 change the zone of memory blocks dynamically
				 after they were onlined.*
``auto_movable_ratio``		 read-write: Set the maximum MOVABLE:KERNEL
				 memory ratio in % for the ``auto-movable``
				 online policy. Whether the ratio applies only
				 for the system across all NUMA nodes or also
				 per NUMA node depends on the
				 ``auto_movable_numa_aware`` configuration.

				 All accounting is based on present memory pages
				 in the zones combined with accounting per
				 memory device. Memory dedicated to the CMA
				 allocator is accounted as MOVABLE, although
				 residing on one of the kernel zones. The
				 possible ratio depends on the actual workload.
				 The kernel default is "301" %, for example,
				 allowing for hotplugging 24 GiB to an 8 GiB VM
				 and automatically onlining all hotplugged
				 memory to ZONE_MOVABLE in many setups. The
				 additional 1% deals with some pages not being
				 present, for example, because of some firmware
				 allocations.

				 Note that ZONE_NORMAL memory provided by one
				 memory device does not allow for more
				 ZONE_MOVABLE memory for a different memory
				 device. As one example, onlining memory of a
				 hotplugged DIMM to ZONE_NORMAL will not allow
				 for another hotplugged DIMM to get onlined to
				 ZONE_MOVABLE automatically. In contrast, memory
				 hotplugged by a virtio-mem device that got
				 onlined to ZONE_NORMAL will allow for more
				 ZONE_MOVABLE memory within *the same*
				 virtio-mem device.
``auto_movable_numa_aware``	 read-write: Configure whether the
				 ``auto_movable_ratio`` in the ``auto-movable``
				 online policy also applies per NUMA
				 node in addition to the whole system across all
				 NUMA nodes. The kernel default is "Y".

				 Disabling NUMA awareness can be helpful when
				 dealing with NUMA nodes that should be
				 completely hotunpluggable, onlining the memory
				 completely to ZONE_MOVABLE automatically if
				 possible.

				 Parameter availability depends on CONFIG_NUMA.
================================ ===============================================

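
As a minimal sketch of how the three ``auto-movable`` parameters above fit
together, the following shell function writes them in one go. It assumes that
the parameters are exposed as writable files below
``/sys/module/memory_hotplug/parameters/``. The function name
``set_auto_movable`` and its optional directory argument are made up for this
example and are not part of any kernel interface::

	# Hypothetical helper: select the auto-movable online policy.
	# $1: maximum MOVABLE:KERNEL ratio in % (example default: 301)
	# $2: Y or N, apply the ratio per NUMA node as well
	# $3: parameter directory (defaults to the assumed sysfs location)
	set_auto_movable() {
		ratio=${1:-301}
		numa_aware=${2:-Y}
		params=${3:-/sys/module/memory_hotplug/parameters}

		echo "$ratio" > "$params/auto_movable_ratio" || return 1
		# This file only exists with CONFIG_NUMA.
		if [ -e "$params/auto_movable_numa_aware" ]; then
			echo "$numa_aware" > "$params/auto_movable_numa_aware" || return 1
		fi
		# Switch the policy last, so it starts out with the new limits.
		echo auto-movable > "$params/online_policy"
	}

Only memory blocks onlined after the switch are affected, as the policy does
not change the zone of memory blocks that are already online.
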
ZONE_MOVABLE
============

ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
Further, having system RAM managed by ZONE_MOVABLE instead of one of the
kernel zones can increase the number of possible transparent huge pages and
dynamically allocated huge pages.

Most kernel allocations are unmovable. Important examples include the memory
map (usually 1/64th of memory), page tables, and kmalloc(). Such allocations
can only be served from the kernel zones.

Most user space pages, such as anonymous memory, and page cache pages are
movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones.

Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable
allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is
absolutely no guarantee whether a memory block can be offlined successfully.

Zone Imbalances
---------------

Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance,
which can harm the system or degrade performance. As one example, the kernel
might crash because it runs out of free memory for unmovable allocations,
although there is still plenty of free memory left in ZONE_MOVABLE.

Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1
are definitely impossible due to the overhead for the memory map.

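
The arithmetic behind these ratios can be sketched as follows, assuming that
the memory map takes 1/64th of all memory and is allocated from the kernel
zones (that is, without ``memmap_on_memory``). With a MOVABLE:KERNEL ratio of
R:1, the kernel zones hold 1/(R+1) of all memory, so the memory map alone
occupies (R+1)/64 of them. The helper name ``memmap_share_pct`` is made up for
this example::

	# Percentage of kernel-zone memory taken by the memory map for a
	# MOVABLE:KERNEL ratio of $1:1 (integer arithmetic, rounded down).
	memmap_share_pct() {
		echo $(( ($1 + 1) * 100 / 64 ))
	}

	memmap_share_pct 3	# prints 6
	memmap_share_pct 63	# prints 100

At 63:1, the memory map alone would fill the kernel zones completely, leaving
nothing for page tables or any other unmovable allocation.
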
Actual safe zone ratios depend on the workload. Extreme cases, like excessive
long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all.

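
To see where a running system stands, the present pages per zone can be summed
up from ``/proc/zoneinfo``. The following sketch assumes the usual layout of
that file (a ``Node N, zone NAME`` header per zone, followed by a ``present``
line), treats every zone other than ``Movable`` and ``Device`` as a kernel zone
and, unlike the accounting of the ``auto-movable`` policy, does not count CMA
memory as MOVABLE. The function name is hypothetical::

	# Print the current MOVABLE:KERNEL ratio in %, based on present pages.
	# $1: file to parse (defaults to /proc/zoneinfo)
	movable_kernel_ratio_pct() {
		awk '
			$1 == "Node" && $3 == "zone" { zone = $4 }
			$1 == "present" {
				if (zone == "Movable")
					movable += $2
				else if (zone != "Device")
					kernel += $2
			}
			END {
				if (kernel == 0) {
					print "no kernel-zone pages found"
					exit 1
				}
				printf "%d\n", movable * 100 / kernel
			}
		' "${1:-/proc/zoneinfo}"
	}

A result of 300 corresponds to a MOVABLE:KERNEL ratio of 3:1.
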
.. note::

  CMA memory that is part of a kernel zone essentially behaves like memory in
  ZONE_MOVABLE and similar considerations apply, especially when combining
  CMA with ZONE_MOVABLE.

ZONE_MOVABLE Sizing Considerations
----------------------------------

We usually expect that a large portion of available system RAM will actually
be consumed by user space, either directly or indirectly via the page cache. In
the normal case, ZONE_MOVABLE can be used when allocating such pages just fine.

With that in mind, it makes sense that we can have a big portion of system RAM
managed by ZONE_MOVABLE. However, there are some things to consider when using
ZONE_MOVABLE, especially when fine-tuning zone ratios:

- Having a lot of offline memory blocks. Even offline memory blocks consume
  memory for metadata and page tables in the direct map; having a lot of offline
  memory blocks is not a typical case, though.

- Memory ballooning without balloon compaction is incompatible with
  ZONE_MOVABLE. Only some implementations, such as virtio-balloon and
  pseries CMM, fully support balloon compaction.

  Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be
  disabled. In that case, balloon inflation will only perform unmovable
  allocations and silently create a zone imbalance, usually triggered by
  inflation requests from the hypervisor.

- Gigantic pages are unmovable, resulting in user space consuming a
  lot of unmovable memory.

- Huge pages are unmovable when an architecture does not support huge
  page migration, resulting in a similar issue as with gigantic pages.

- Page tables are unmovable. Excessive swapping, mapping extremely large
  files or ZONE_DEVICE memory can be problematic, although only really relevant
  in corner cases. When we manage a lot of user space memory that has been
  swapped out or is served from a file/persistent memory/... we still need a lot
  of page tables to manage that memory once user space accessed that memory.

- In certain DAX configurations the memory map for the device memory will be
  allocated from the kernel zones.

- KASAN can have a significant memory overhead, for example, consuming 1/8th of
  the total system memory size as (unmovable) tracking metadata.

- Long-term pinning of pages. Techniques that rely on long-term pinnings
  (especially, RDMA and vfio/mdev) are fundamentally problematic with
  ZONE_MOVABLE, and therefore, memory offlining. Pinned pages cannot reside
  on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they
  have to be migrated off that zone while pinning. Pinning a page can fail
  even if there is plenty of free memory in ZONE_MOVABLE.

  In addition, using ZONE_MOVABLE might make page pinning more expensive,
  because of the page migration overhead.

By default, all the memory configured at boot time is managed by the kernel
zones and ZONE_MOVABLE is not used.

To enable ZONE_MOVABLE to include the memory present at boot and to control the
ratio between movable and kernel zones, there are two command line options:
``kernelcore=`` and ``movablecore=``. See
Documentation/admin-guide/kernel-parameters.rst for their description.

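
As an illustration only, a kernel command line containing ``kernelcore=4G``
would request 4G for the kernel zones and leave the remaining boot memory to
ZONE_MOVABLE; the value is an arbitrary example, not a recommendation. Whether
either option is in effect on a running system can be checked with a small
helper like the following, where the name ``boot_zone_options`` is made up for
this sketch::

	# Print kernelcore=/movablecore= settings from the kernel command line.
	# $1: file to parse (defaults to /proc/cmdline)
	boot_zone_options() {
		tr ' ' '\n' < "${1:-/proc/cmdline}" |
			grep -E '^(kernelcore|movablecore)='
	}

No output means that neither option was given and that all boot memory is
managed by the kernel zones.
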
Memory Offlining and ZONE_MOVABLE
---------------------------------

Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
block might fail:

- Memory blocks with memory holes; this applies to memory blocks present during
  boot and can apply to memory blocks hotplugged via the Xen balloon and the
  Hyper-V balloon.

- Mixed NUMA nodes and mixed zones within a single memory block prevent memory
  offlining; this applies to memory blocks present during boot only.

- Special memory blocks prevented by the system from getting offlined. Examples
  include any memory available during boot on arm64 or memory blocks spanning
  the crashkernel area on s390x; this usually applies to memory blocks present
  during boot only.

- Memory blocks overlapping with CMA areas cannot be offlined; this applies to
  memory blocks present during boot only.

- Concurrent activity that operates on the same physical memory area, such as
  allocating gigantic pages, can result in temporary offlining failures.

- Out of memory when dissolving huge pages, especially when HugeTLB Vmemmap
  Optimization (HVO) is enabled.

  Offlining code may be able to migrate huge page contents, but may not be able
  to dissolve the source huge page because it fails allocating (unmovable) pages
  for the vmemmap, because the system might not have free memory in the kernel
  zones left.

  Users that depend on memory offlining to succeed for movable zones should
  carefully consider whether the memory savings gained from this feature are
  worth the risk of possibly not being able to offline memory in certain
  situations.

Further, when running into out of memory situations while migrating pages, or
when still encountering permanently unmovable pages within ZONE_MOVABLE
(-> BUG), memory offlining will keep retrying until it eventually succeeds.

When offlining is triggered from user space, the offlining context can be
terminated by sending a fatal signal. A timeout-based offlining can easily be
implemented via::

	% timeout $TIMEOUT offline_block | failure_handling

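
The ``offline_block`` and ``failure_handling`` placeholders above could be
filled in as in the following sketch. It assumes GNU ``timeout`` and a memory
block directory such as ``/sys/devices/system/memory/memoryXXX``; the function
name and the messages are made up for this example::

	# Hypothetical helper: try to offline one memory block and give up
	# after a time limit.
	# $1: time limit in seconds
	# $2: memory block directory
	offline_block_timeout() {
		secs=$1
		block=$2

		if timeout "$secs" sh -c 'echo offline > "$1/state"' sh "$block"; then
			echo "offlined $block"
		else
			# GNU timeout exits with status 124 if the limit was hit.
			echo "offlining $block failed or timed out" >&2
			return 1
		fi
	}

The write runs in a child shell, so the signal sent by ``timeout`` terminates
only the offlining context and not the calling script.
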