==================
Memory Hot(Un)Plug
==================

This document describes generic Linux support for memory hot(un)plug with
a focus on System RAM, including ZONE_MOVABLE support.

.. contents:: :local:

Introduction
============

Memory hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime. In the simplest case, it consists of
physically plugging or unplugging a DIMM at runtime, coordinated with the
operating system.

Memory hot(un)plug is used for various purposes:

- The physical memory available to a machine can be adjusted at runtime, up- or
  downgrading the memory capacity. This dynamic memory resizing, sometimes
  referred to as "capacity on demand", is frequently used with virtual machines
  and logical partitions.

- Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One
  example is replacing failing memory modules.

- Reducing energy consumption either by physically unplugging memory modules or
  by logically unplugging (parts of) memory modules from Linux.

Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also
used to expose persistent memory, other performance-differentiated memory and
reserved memory regions as ordinary system RAM to Linux.

Linux only supports memory hot(un)plug on selected 64-bit architectures, such as
x86_64, arm64, ppc64, s390x and ia64.

Memory Hot(Un)Plug Granularity
------------------------------

Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the
physical memory address space into chunks of the same size: memory sections. The
size of a memory section is architecture dependent. For example, x86_64 uses
128 MiB and ppc64 uses 16 MiB.

Memory sections are combined into chunks referred to as "memory blocks". The
size of a memory block is architecture dependent and corresponds to the smallest
granularity that can be hot(un)plugged. The default size of a memory block is
the same as memory section size, unless an architecture specifies otherwise.

All memory blocks have the same size.
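
The relationship between memory sections and memory blocks can be checked from
user space. The following is a minimal sketch, not part of any kernel interface.
It assumes that the ``block_size_bytes`` sysfs file (described later in this
document) reports the memory block size as a hexadecimal number without a
``0x`` prefix, and that the caller supplies the section size of the
architecture. The function names and the ``MEM_SYSFS`` override are
hypothetical.

.. code-block:: sh

    # Print the memory block size in bytes as a decimal number.
    # MEM_SYSFS is a hypothetical override (useful for testing); by default
    # the real sysfs directory is used.
    mem_block_size() {
        dir="${MEM_SYSFS:-/sys/devices/system/memory}"
        # Assumption: the file holds a hexadecimal number without "0x".
        hex=$(cat "$dir/block_size_bytes") || return 1
        echo $(( 0x$hex ))
    }

    # Print how many memory sections of $1 bytes make up one memory block.
    sections_per_block() {
        size=$(mem_block_size) || return 1
        echo $(( size / $1 ))
    }

For example, ``sections_per_block $((128 * 1024 * 1024))`` prints ``1`` when
the memory block size equals a 128 MiB section size.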

Phases of Memory Hotplug
------------------------

Memory hotplug consists of two phases:

(1) Adding the memory to Linux
(2) Onlining memory blocks

In the first phase, metadata, such as the memory map ("memmap") and page tables
for the direct mapping, is allocated and initialized, and memory blocks are
created; the latter also creates sysfs files for managing newly created memory
blocks.

In the second phase, added memory is exposed to the page allocator. After this
phase, the memory is visible in memory statistics, such as free and total
memory, of the system.

Phases of Memory Hotunplug
--------------------------

Memory hotunplug consists of two phases:

(1) Offlining memory blocks
(2) Removing the memory from Linux

In the first phase, memory is "hidden" from the page allocator again, for
example, by migrating busy memory to other memory locations and removing all
relevant free pages from the page allocator. After this phase, the memory is no
longer visible in memory statistics of the system.

In the second phase, the memory blocks are removed and metadata is freed.

Memory Hotplug Notifications
============================

There are various ways in which Linux is notified about memory hotplug events
such that it can start adding hotplugged memory. This description is limited to
systems that support ACPI; mechanisms specific to other firmware interfaces or
virtual machines are not described.

ACPI Notifications
------------------

Platforms that support ACPI, such as x86_64, can support memory hotplug
notifications via ACPI.

In general, a firmware supporting memory hotplug defines a memory class object
with the _HID "PNP0C80". When notified about hotplug of a new memory device, the
ACPI driver will hotplug the memory to Linux.

If the firmware supports hotplug of NUMA nodes, it defines an object with the
_HID "ACPI0004", "PNP0A05", or "PNP0A06". When notified about a hotplug event,
all assigned memory devices are added to Linux by the ACPI driver.

Similarly, Linux can be notified about requests to hotunplug a memory device or
a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory
blocks, and, if successful, hotunplug the memory from Linux.

Manual Probing
--------------

On some architectures, the firmware may not be able to notify the operating
system about a memory hotplug event. Instead, the memory has to be manually
probed from user space.

The probe interface is located at::

	/sys/devices/system/memory/probe

Only complete memory blocks can be probed. Individual memory blocks are probed
by providing the physical start address of the memory block::

	% echo addr > /sys/devices/system/memory/probe

This results in a memory block for the range [addr, addr + memory_block_size)
being created.

.. note::

  Using the probe interface is discouraged as it is easy to crash the kernel,
  because Linux cannot validate user input; this interface might be removed in
  the future.
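
Because only complete memory blocks can be probed, a wrapper script might at
least refuse start addresses that are not aligned to the memory block size. The
following sketch illustrates that idea. The function names and the
``PROBE_FILE`` override are hypothetical, and the check does not make the
interface safe: it cannot tell whether memory actually exists at the given
address.

.. code-block:: sh

    # Succeed if physical address $1 is aligned to the memory block size $2.
    # Both arguments may be decimal or 0x-prefixed hexadecimal.
    probe_addr_aligned() {
        [ $(( $1 % $2 )) -eq 0 ]
    }

    # Write address $1 to the probe file only if it is block-aligned.
    # PROBE_FILE is a hypothetical override (useful for testing).
    probe_block() {
        file="${PROBE_FILE:-/sys/devices/system/memory/probe}"
        if ! probe_addr_aligned "$1" "$2"; then
            echo "refusing to probe unaligned address $1" >&2
            return 1
        fi
        echo "$1" > "$file"
    }

A call such as ``probe_block 0x100000000 0x8000000`` would then probe one
128 MiB memory block starting at 4 GiB.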

Onlining and Offlining Memory Blocks
====================================

After a memory block has been created, Linux has to be instructed to actually
make use of that memory: the memory block has to be "onlined".

Before a memory block can be removed, Linux has to stop using any memory that is
part of the memory block: the memory block has to be "offlined".

The Linux kernel can be configured to automatically online added memory blocks,
and drivers automatically trigger offlining of memory blocks when attempting
hotunplug of memory. Memory blocks can only be removed once offlining has
succeeded.

Onlining Memory Blocks Manually
-------------------------------

If auto-onlining of memory blocks isn't enabled, user space has to manually
trigger onlining of memory blocks. Often, udev rules are used to automate this
task in user space.

Onlining of a memory block can be triggered via::

	% echo online > /sys/devices/system/memory/memoryXXX/state

Or alternatively::

	% echo 1 > /sys/devices/system/memory/memoryXXX/online

The kernel will select the target zone automatically, depending on the
configured ``online_policy``.

One can explicitly request to associate an offline memory block with
ZONE_MOVABLE by::

	% echo online_movable > /sys/devices/system/memory/memoryXXX/state

Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by::

	% echo online_kernel > /sys/devices/system/memory/memoryXXX/state

In any case, if onlining succeeds, the state of the memory block is changed to
be "online". If it fails, the state of the memory block will remain unchanged
and the above commands will fail.
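
The commands above can be wrapped in a small helper that validates the
requested mode before writing it. This is only a sketch: the function name and
the ``MEM_SYSFS`` override are hypothetical, and skipping blocks that are
already online is a design choice of this example, not something the kernel
requires.

.. code-block:: sh

    # Online memory block $1 (the XXX in memoryXXX), optionally into a zone.
    # $2 is one of: online (default), online_kernel, online_movable.
    # MEM_SYSFS is a hypothetical override (useful for testing).
    online_block() {
        dir="${MEM_SYSFS:-/sys/devices/system/memory}"
        mode="${2:-online}"
        case "$mode" in
            online|online_kernel|online_movable) ;;
            *) echo "unsupported mode: $mode" >&2; return 2 ;;
        esac
        state_file="$dir/memory$1/state"
        # Nothing to do if the block already reports "online".
        if [ "$(cat "$state_file")" = "online" ]; then
            return 0
        fi
        # The write fails, and so does this function, if onlining fails.
        echo "$mode" > "$state_file"
    }

For example, ``online_block 42 online_movable`` requests ZONE_MOVABLE for
``memory42``.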

Onlining Memory Blocks Automatically
------------------------------------

The kernel can be configured to try auto-onlining of newly added memory blocks.
If this feature is disabled, the memory blocks will stay offline until
explicitly onlined from user space.

The configured auto-online behavior can be observed via::

	% cat /sys/devices/system/memory/auto_online_blocks

Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
``online_movable`` to that file, like::

	% echo online > /sys/devices/system/memory/auto_online_blocks

Similarly to manual onlining, with ``online`` the kernel will select the
target zone automatically, depending on the configured ``online_policy``.

Modifying the auto-online behavior will only affect subsequently added
memory blocks.

.. note::

  In corner cases, auto-onlining can fail. The kernel won't retry. Note that
  auto-onlining is not expected to fail in default configurations.

.. note::

  DLPAR on ppc64 ignores the ``offline`` setting and will still online added
  memory blocks; if onlining fails, memory blocks are removed again.
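
Reading and changing the auto-online setting can be combined into one helper.
The sketch below assumes the four values named in this document (``offline``,
``online``, ``online_kernel`` and ``online_movable``); the function name and
the ``MEM_SYSFS`` override are hypothetical.

.. code-block:: sh

    # Without arguments, print the current auto-online setting.
    # With one argument, validate it and write it as the new setting.
    # MEM_SYSFS is a hypothetical override (useful for testing).
    auto_online() {
        file="${MEM_SYSFS:-/sys/devices/system/memory}/auto_online_blocks"
        if [ $# -eq 0 ]; then
            cat "$file"
            return
        fi
        case "$1" in
            offline|online|online_kernel|online_movable)
                echo "$1" > "$file" ;;
            *)
                echo "unsupported value: $1" >&2
                return 2 ;;
        esac
    }

As stated above, a new setting only affects memory blocks added afterwards;
blocks that are already present keep their state.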

Offlining Memory Blocks
-----------------------

In the current implementation, Linux's memory offlining will try migrating all
movable pages off the affected memory block. As most kernel allocations, such as
page tables, are unmovable, page migration can fail and, therefore, inhibit
memory offlining from succeeding.

Having the memory provided by a memory block managed by ZONE_MOVABLE
significantly increases memory offlining reliability; still, memory offlining
can fail in some corner cases.

Further, memory offlining might retry for a long time (or even forever), until
aborted by the user.

Offlining of a memory block can be triggered via::

	% echo offline > /sys/devices/system/memory/memoryXXX/state

Or alternatively::

	% echo 0 > /sys/devices/system/memory/memoryXXX/online

If offlining succeeds, the state of the memory block is changed to be "offline".
If it fails, the state of the memory block will remain unchanged and the above
commands will fail, for example, via::

	bash: echo: write error: Device or resource busy

or via::

	bash: echo: write error: Invalid argument
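
Since offlining might retry for a long time, scripts may want to bound the
attempt. The sketch below uses the ``timeout`` utility from GNU coreutils for
that. It assumes that the signal sent by ``timeout`` aborts the pending
offlining request in the same way a user interrupt would; the function name and
the ``MEM_SYSFS`` override are hypothetical.

.. code-block:: sh

    # Try to offline memory block $1, giving up after $2 seconds (default 30).
    # Returns 0 on success, 124 if the attempt timed out, and another
    # non-zero status if the write failed (for example, "Device or
    # resource busy").
    # MEM_SYSFS is a hypothetical override (useful for testing).
    offline_block() {
        state_file="${MEM_SYSFS:-/sys/devices/system/memory}/memory$1/state"
        # The write runs in a child shell so that timeout can signal it.
        timeout "${2:-30}" sh -c 'echo offline > "$1"' sh "$state_file"
    }

For example, ``offline_block 42 10`` gives up on ``memory42`` after ten
seconds.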

Observing the State of Memory Blocks
------------------------------------

The state (online/offline/going-offline) of a memory block can be observed
either via::

	% cat /sys/devices/system/memory/memoryXXX/state

Or alternatively (1/0) via::

	% cat /sys/devices/system/memory/memoryXXX/online

For an online memory block, the managing zone can be observed via::

	% cat /sys/devices/system/memory/memoryXXX/valid_zones
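
To get an overview of all memory blocks at once, the files above can be read in
a loop. The following sketch prints one line per block. The function name and
the ``MEM_SYSFS`` override are hypothetical, and ``unknown`` is merely what this
example prints when ``valid_zones`` is not available.

.. code-block:: sh

    # Print "<block> <state> <valid_zones>" for every memory block.
    # MEM_SYSFS is a hypothetical override (useful for testing).
    list_blocks() {
        dir="${MEM_SYSFS:-/sys/devices/system/memory}"
        for block in "$dir"/memory[0-9]*; do
            # An unmatched glob is left as-is, so skip non-directories.
            [ -d "$block" ] || continue
            state=$(cat "$block/state")
            # valid_zones might be absent, depending on the kernel config.
            zones=$(cat "$block/valid_zones" 2>/dev/null)
            echo "${block##*/} $state ${zones:-unknown}"
        done
    }

The blocks are listed in glob order, which is lexical rather than numerical, so
``memory10`` is printed before ``memory2``.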

Configuring Memory Hot(Un)Plug
==============================

There are various ways in which system administrators can configure memory
hot(un)plug and interact with memory blocks, especially to online them.

Memory Hot(Un)Plug Configuration via Sysfs
------------------------------------------

Some memory hot(un)plug properties can be configured or inspected via sysfs in::

	/sys/devices/system/memory/

The following files are currently defined:

====================== =========================================================
``auto_online_blocks`` read-write: set or get the default state of new memory
                       blocks; configure auto-onlining.

                       The default value depends on the
                       CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration
                       option.

                       See the ``state`` property of memory blocks for details.
``block_size_bytes``   read-only: the size in bytes of a memory block.
``probe``              write-only: add (probe) selected memory blocks manually
                       from user space by supplying the physical start address.

                       Availability depends on the CONFIG_ARCH_MEMORY_PROBE
                       kernel configuration option.
``uevent``             read-write: generic udev file for device subsystems.
``crash_hotplug``      read-only: when changes to the system memory map
                       occur due to hot un/plug of memory, this file contains
                       '1' if the kernel updates the kdump capture kernel memory
                       map itself (via elfcorehdr), or '0' if userspace must update
                       the kdump capture kernel memory map.

                       Availability depends on the CONFIG_MEMORY_HOTPLUG kernel
                       configuration option.
====================== =========================================================

.. note::

  When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two
  additional files ``hard_offline_page`` and ``soft_offline_page`` are available
  to trigger hwpoisoning of pages, for example, for testing purposes. Note that
  this functionality is not really related to memory hot(un)plug or actual
  offlining of memory blocks.

Memory Block Configuration via Sysfs
------------------------------------

Each memory block is represented as a memory block device that can be
onlined or offlined. All memory blocks have their device information located in
sysfs. Each present memory block is listed under
``/sys/devices/system/memory`` as::

	/sys/devices/system/memory/memoryXXX

where XXX is the memory block id; the number of digits is variable.

A present memory block indicates that some memory in the range is present;
however, a memory block might span memory holes. A memory block spanning memory
holes cannot be offlined.

For example, assume 1 GiB memory block size. The device for memory starting at
0x100000000 is ``/sys/devices/system/memory/memory4``::

	(0x100000000 / 1 GiB = 4)

This device covers address range [0x100000000 ... 0x140000000).
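
The arithmetic of this example can be reproduced with shell arithmetic. The
sketch below assumes, as in the example, that the memory block id equals the
physical start address divided by the memory block size; the function names are
hypothetical.

.. code-block:: sh

    # Print the id of the memory block that contains physical address $1,
    # given the memory block size $2 (both decimal or 0x-prefixed hex).
    block_id_for_addr() {
        echo $(( $1 / $2 ))
    }

    # Print the half-open physical address range [start, end) covered by
    # memory block id $1 for memory block size $2.
    block_range() {
        printf '[0x%x, 0x%x)\n' $(( $1 * $2 )) $(( ($1 + 1) * $2 ))
    }

With a 1 GiB block size (``0x40000000``), ``block_id_for_addr 0x100000000
0x40000000`` prints ``4`` and ``block_range 4 0x40000000`` prints
``[0x100000000, 0x140000000)``.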

The following files are currently defined:

=================== ============================================================
``online``          read-write: simplified interface to trigger onlining /
                    offlining and to observe the state of a memory block.
                    When onlining, the zone is selected automatically.
``phys_device``     read-only: legacy interface only ever used on s390x to
                    expose the covered storage increment.
``phys_index``      read-only: the memory block id (XXX).
``removable``       read-only: legacy interface that indicated whether a memory
                    block was likely to be offlineable or not. Nowadays, the
                    kernel returns ``1`` if and only if it supports memory
                    offlining.
``state``           read-write: advanced interface to trigger onlining /
                    offlining and to observe the state of a memory block.

                    When writing, ``online``, ``offline``, ``online_kernel`` and
                    ``online_movable`` are supported.

                    ``online_movable`` specifies onlining to ZONE_MOVABLE.
                    ``online_kernel`` specifies onlining to the default kernel
                    zone for the memory block, such as ZONE_NORMAL.
                    ``online`` lets the kernel select the zone automatically.

                    When reading, ``online``, ``offline`` and ``going-offline``
                    may be returned.
``uevent``          read-write: generic uevent file for devices.
``valid_zones``     read-only: when a block is online, shows the zone it
                    belongs to; when a block is offline, shows what zone will
                    manage it when the block is onlined.

                    For online memory blocks, ``DMA``, ``DMA32``, ``Normal``,
                    ``Movable`` and ``none`` may be returned. ``none`` indicates
                    that memory provided by a memory block is managed by
                    multiple zones or spans multiple nodes; such memory blocks
                    cannot be offlined. ``Movable`` indicates ZONE_MOVABLE.
                    Other values indicate a kernel zone.

                    For offline memory blocks, the first column shows the
                    zone the kernel would select when onlining the memory block
                    right now without further specifying a zone.

                    Availability depends on the CONFIG_MEMORY_HOTREMOVE
                    kernel configuration option.
=================== ============================================================

.. note::

  If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/
  directories can also be accessed via symbolic links located in the
  ``/sys/devices/system/node/node*`` directories.

  For example::

	/sys/devices/system/node/node0/memory9 -> ../../memory/memory9

  A backlink will also be created::

	/sys/devices/system/memory/memory9/node0 -> ../../node/node0
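
The backlinks make it possible to look up the NUMA node of a memory block
without scanning the node directories. A minimal sketch, assuming the link
layout shown above; the function name and the ``MEM_SYSFS`` override are
hypothetical.

.. code-block:: sh

    # Print the NUMA node(s) linked from memory block $1, one per line.
    # Prints nothing if the block has no node link (for example, without
    # CONFIG_NUMA).
    # MEM_SYSFS is a hypothetical override (useful for testing).
    block_nodes() {
        dir="${MEM_SYSFS:-/sys/devices/system/memory}/memory$1"
        for link in "$dir"/node[0-9]*; do
            # An unmatched glob is left as-is, so check that it exists.
            [ -e "$link" ] || continue
            echo "${link##*/}"
        done
    }

For the example above, ``block_nodes 9`` would print ``node0``.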

Command Line Parameters
-----------------------

Some command line parameters affect memory hot(un)plug handling. The following
command line parameters are relevant:

======================== =======================================================
``memhp_default_state``  configure auto-onlining by essentially setting
                         ``/sys/devices/system/memory/auto_online_blocks``.
``movable_node``         configure automatic zone selection in the kernel when
                         using the ``contig-zones`` online policy. When
                         set, the kernel will default to ZONE_MOVABLE when
                         onlining a memory block, unless other zones can be kept
                         contiguous.
======================== =======================================================

See Documentation/admin-guide/kernel-parameters.txt for a more generic
description of these command line parameters.
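
Whether one of these parameters was passed to the running kernel can be checked
in ``/proc/cmdline``. The sketch below is a simplified parser: it splits the
command line on whitespace, does not handle quoted values that contain spaces,
and reports the first match only. The function name and the ``CMDLINE_FILE``
override are hypothetical.

.. code-block:: sh

    # Print the value of kernel command line parameter $1. Prints an empty
    # line for a bare flag such as movable_node; fails if it is absent.
    # CMDLINE_FILE is a hypothetical override (useful for testing).
    cmdline_param() (
        # The function body runs in a subshell, so "set -f" (which keeps
        # words such as "*" from being expanded as globs) and "exit" do
        # not affect the calling shell.
        set -f
        for word in $(cat "${CMDLINE_FILE:-/proc/cmdline}"); do
            case "$word" in
                "$1")   echo; exit 0 ;;
                "$1"=*) echo "${word#*=}"; exit 0 ;;
            esac
        done
        exit 1
    )

For example, ``cmdline_param memhp_default_state`` prints the configured
default state, and ``cmdline_param movable_node >/dev/null`` succeeds only if
the flag is present.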

Module Parameters
-----------------

Instead of additional command line parameters or sysfs files, the
``memory_hotplug`` subsystem now provides a dedicated namespace for module
parameters. Module parameters can be set via the command line by prefixing
them with ``memory_hotplug.`` such as::

	memory_hotplug.memmap_on_memory=1

and they can be observed (and some even modified at runtime) via::

	/sys/module/memory_hotplug/parameters/
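
Reading and writing these parameters amounts to plain file access. A minimal
sketch; the function names and the ``PARAM_DIR`` override are hypothetical.

.. code-block:: sh

    # Print the current value of memory_hotplug module parameter $1,
    # for example memmap_on_memory or online_policy.
    # PARAM_DIR is a hypothetical override (useful for testing).
    hotplug_param() {
        cat "${PARAM_DIR:-/sys/module/memory_hotplug/parameters}/$1"
    }

    # Set parameter $1 to $2 at runtime. The write fails if the running
    # kernel does not allow modifying that parameter.
    set_hotplug_param() {
        echo "$2" > "${PARAM_DIR:-/sys/module/memory_hotplug/parameters}/$1"
    }

For example, ``hotplug_param online_policy`` prints the active online policy.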

The following module parameters are currently defined:

================================ ===============================================
``memmap_on_memory``             read-write: Allocate memory for the memmap from
                                 the added memory block itself. Even if enabled,
                                 actual support depends on various other system
                                 properties and should only be regarded as a
                                 hint whether the behavior would be desired.

                                 While allocating the memmap from the memory
                                 block itself makes memory hotplug less likely
                                 to fail and keeps the memmap on the same NUMA
                                 node in any case, it can fragment physical
                                 memory in a way that huge pages in bigger
                                 granularity cannot be formed on hotplugged
                                 memory.

                                 With value "force" it could result in memory
                                 wastage due to memmap size limitations. For
                                 example, if the memmap for a memory block
                                 requires 1 MiB, but the pageblock size is 2
                                 MiB, 1 MiB of hotplugged memory will be wasted.
                                 Note that there are still cases where the
                                 feature cannot be enforced: for example, if the
                                 memmap is smaller than a single page, or if the
                                 architecture does not support the forced mode
                                 in all configurations.

``online_policy``                read-write: Set the basic policy used for
                                 automatic zone selection when onlining memory
                                 blocks without specifying a target zone.
                                 ``contig-zones`` has been the kernel default
                                 before this parameter was added. After an
                                 online policy was configured and memory was
                                 onlined, the policy should not be changed
                                 anymore.

                                 When set to ``contig-zones``, the kernel will
                                 try keeping zones contiguous. If a memory block
                                 intersects multiple zones or no zone, the
                                 behavior depends on the ``movable_node`` kernel
                                 command line parameter: default to ZONE_MOVABLE
                                 if set, default to the applicable kernel zone
                                 (usually ZONE_NORMAL) if not set.

                                 When set to ``auto-movable``, the kernel will
                                 try onlining memory blocks to ZONE_MOVABLE if
                                 possible according to the configuration and
                                 memory device details. With this policy, one
                                 can avoid zone imbalances when eventually
                                 hotplugging a lot of memory later and still
                                 wanting to be able to hotunplug as much as
                                 possible reliably, very desirable in
                                 virtualized environments. This policy ignores
                                 the ``movable_node`` kernel command line
                                 parameter and isn't really applicable in
                                 environments that require it (e.g., bare metal
                                 with hotunpluggable nodes) where hotplugged
                                 memory might be exposed via the
                                 firmware-provided memory map early during boot
                                 to the system instead of getting detected,
                                 added and onlined later during boot (such as
                                 done by virtio-mem or by some hypervisors
                                 implementing emulated DIMMs). As one example, a
                                 hotplugged DIMM will be onlined either
                                 completely to ZONE_MOVABLE or completely to
                                 ZONE_NORMAL, not a mixture.
                                 As another example, as many memory blocks
                                 belonging to a virtio-mem device will be
                                 onlined to ZONE_MOVABLE as possible,
                                 special-casing units of memory blocks that can
                                 only get hotunplugged together. *This policy
                                 does not protect from setups that are
                                 problematic with ZONE_MOVABLE and does not
                                 change the zone of memory blocks dynamically
                                 after they were onlined.*
``auto_movable_ratio``           read-write: Set the maximum MOVABLE:KERNEL
                                 memory ratio in % for the ``auto-movable``
                                 online policy. Whether the ratio applies only
5079e122cc1SDavid Hildenbrand				 for the system across all NUMA nodes or also
5089e122cc1SDavid Hildenbrand				 per NUMA nodes depends on the
5099e122cc1SDavid Hildenbrand				 ``auto_movable_numa_aware`` configuration.
5109e122cc1SDavid Hildenbrand
5119e122cc1SDavid Hildenbrand				 All accounting is based on present memory pages
5129e122cc1SDavid Hildenbrand				 in the zones combined with accounting per
5139e122cc1SDavid Hildenbrand				 memory device. Memory dedicated to the CMA
5149e122cc1SDavid Hildenbrand				 allocator is accounted as MOVABLE, although
5159e122cc1SDavid Hildenbrand				 residing on one of the kernel zones. The
5169e122cc1SDavid Hildenbrand				 possible ratio depends on the actual workload.
5179e122cc1SDavid Hildenbrand				 The kernel default is "301" %, for example,
5189e122cc1SDavid Hildenbrand				 allowing for hotplugging 24 GiB to a 8 GiB VM
5199e122cc1SDavid Hildenbrand				 and automatically onlining all hotplugged
5209e122cc1SDavid Hildenbrand				 memory to ZONE_MOVABLE in many setups. The
5219e122cc1SDavid Hildenbrand				 additional 1% deals with some pages being not
5229e122cc1SDavid Hildenbrand				 present, for example, because of some firmware
5239e122cc1SDavid Hildenbrand				 allocations.
5249e122cc1SDavid Hildenbrand
5259e122cc1SDavid Hildenbrand				 Note that ZONE_NORMAL memory provided by one
5269e122cc1SDavid Hildenbrand				 memory device does not allow for more
5279e122cc1SDavid Hildenbrand				 ZONE_MOVABLE memory for a different memory
5289e122cc1SDavid Hildenbrand				 device. As one example, onlining memory of a
5299e122cc1SDavid Hildenbrand				 hotplugged DIMM to ZONE_NORMAL will not allow
5309e122cc1SDavid Hildenbrand				 for another hotplugged DIMM to get onlined to
5319e122cc1SDavid Hildenbrand				 ZONE_MOVABLE automatically. In contrast, memory
5329e122cc1SDavid Hildenbrand				 hotplugged by a virtio-mem device that got
5339e122cc1SDavid Hildenbrand				 onlined to ZONE_NORMAL will allow for more
5349e122cc1SDavid Hildenbrand				 ZONE_MOVABLE memory within *the same*
5359e122cc1SDavid Hildenbrand				 virtio-mem device.
5369e122cc1SDavid Hildenbrand``auto_movable_numa_aware``	 read-write: Configure whether the
5379e122cc1SDavid Hildenbrand				 ``auto_movable_ratio`` in the ``auto-movable``
5389e122cc1SDavid Hildenbrand				 online policy also applies per NUMA
5399e122cc1SDavid Hildenbrand				 node in addition to the whole system across all
5409e122cc1SDavid Hildenbrand				 NUMA nodes. The kernel default is "Y".
5419e122cc1SDavid Hildenbrand
5429e122cc1SDavid Hildenbrand				 Disabling NUMA awareness can be helpful when
5439e122cc1SDavid Hildenbrand				 dealing with NUMA nodes that should be
5449e122cc1SDavid Hildenbrand				 completely hotunpluggable, onlining the memory
5459e122cc1SDavid Hildenbrand				 completely to ZONE_MOVABLE automatically if
5469e122cc1SDavid Hildenbrand				 possible.
5479e122cc1SDavid Hildenbrand
5489e122cc1SDavid Hildenbrand				 Parameter availability depends on CONFIG_NUMA.
549ac3332c4SDavid Hildenbrand================================ ===============================================
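As a rough illustration of the ``online_policy`` and ``auto_movable_ratio``
parameters, the following shell sketch selects the ``auto-movable`` policy and
estimates how much memory a given ratio permits in ZONE_MOVABLE. It assumes the
parameters are exposed under ``/sys/module/memory_hotplug/parameters/`` and
that the caller has root privileges. The function names and the optional
directory argument are hypothetical and exist only for this example. The
estimate ignores the per-device and per-node accounting described in the table.

```shell
# Sketch only: function names and the optional directory argument are
# illustrative assumptions, not part of any kernel interface.

# set_auto_movable_policy [RATIO] [PARAM_DIR]
# Select the auto-movable online policy and set the MOVABLE:KERNEL ratio in %.
set_auto_movable_policy() {
    ratio=${1:-301}
    params=${2:-/sys/module/memory_hotplug/parameters}
    echo auto-movable > "$params/online_policy" &&
        echo "$ratio" > "$params/auto_movable_ratio"
}

# max_movable_mib KERNEL_MIB [RATIO]
# Upper bound (in MiB) of ZONE_MOVABLE memory for the given amount of
# kernel-zone memory, using integer arithmetic.
max_movable_mib() {
    echo $(( $1 * ${2:-301} / 100 ))
}
```

For an 8 GiB VM, ``max_movable_mib 8192`` prints ``24657``, slightly above the
24 GiB (24576 MiB) mentioned for the default ratio.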

ZONE_MOVABLE
============

ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
Further, having system RAM managed by ZONE_MOVABLE instead of one of the
kernel zones can increase the number of possible transparent huge pages and
dynamically allocated huge pages.

Most kernel allocations are unmovable. Important examples include the memory
map (usually 1/64th of memory), page tables, and kmalloc(). Such allocations
can only be served from the kernel zones.
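The 1/64th figure for the memory map can be reproduced with a small
calculation. The sketch below assumes 4 KiB base pages and 64 bytes of memory
map metadata per page, which is common on 64-bit systems but not guaranteed.
The function name is hypothetical.

```shell
# Sketch only: the defaults (4096-byte pages, 64 bytes of metadata per page)
# are assumptions that vary by architecture and kernel configuration.

# memmap_mib TOTAL_MIB [PAGE_BYTES] [METADATA_BYTES_PER_PAGE]
# Estimate the size of the memory map in MiB for TOTAL_MIB of system RAM.
memmap_mib() {
    total_mib=$1
    page_bytes=${2:-4096}
    meta_bytes=${3:-64}
    echo $(( total_mib * meta_bytes / page_bytes ))
}
```

With these defaults, ``memmap_mib 8192`` prints ``128``, that is, 128 MiB of
unmovable memory map for 8 GiB of RAM.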

Most user space pages, such as anonymous memory, and page cache pages are
movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones.

Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable
allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is
absolutely no guarantee whether a memory block can be offlined successfully.

Zone Imbalances
---------------

Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance,
which can harm the system or degrade performance. As one example, the kernel
might crash because it runs out of free memory for unmovable allocations,
although there is still plenty of free memory left in ZONE_MOVABLE.

Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1
are definitely impossible due to the overhead for the memory map.

Actual safe zone ratios depend on the workload. Extreme cases, like excessive
long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all.
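To get a feeling for the current MOVABLE:KERNEL ratio of a running system, the
present pages per zone can be summed up from ``/proc/zoneinfo``. The following
is a minimal sketch with a hypothetical function name. It treats every zone
other than ``Movable`` and ``Device`` as a kernel zone and, unlike the kernel's
own accounting for the ``auto-movable`` policy, does not count CMA memory as
MOVABLE.

```shell
# Sketch only: prints the system-wide MOVABLE:KERNEL ratio in percent,
# based on the "present" page counts of each zone.

# movable_kernel_ratio [ZONEINFO_FILE]
movable_kernel_ratio() {
    zoneinfo=${1:-/proc/zoneinfo}
    awk '
        # Zone headers look like: "Node 0, zone   Normal"
        $1 == "Node" && $3 == "zone" { zone = $4 }
        $1 == "present" {
            if (zone == "Movable")
                movable += $2
            else if (zone != "Device")
                kernel += $2
        }
        END {
            if (kernel == 0)
                exit 1
            printf "%d\n", movable * 100 / kernel
        }
    ' "$zoneinfo"
}
```

A result of ``300`` corresponds to a 3:1 ratio.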

.. note::

  CMA memory that is part of a kernel zone essentially behaves like memory in
  ZONE_MOVABLE and similar considerations apply, especially when combining
  CMA with ZONE_MOVABLE.

ZONE_MOVABLE Sizing Considerations
----------------------------------

We usually expect that a large portion of available system RAM will actually
be consumed by user space, either directly or indirectly via the page cache. In
the normal case, ZONE_MOVABLE can be used when allocating such pages just fine.

With that in mind, it makes sense that we can have a big portion of system RAM
managed by ZONE_MOVABLE. However, there are some things to consider when using
ZONE_MOVABLE, especially when fine-tuning zone ratios:

- Having a lot of offline memory blocks. Even offline memory blocks consume
  memory for metadata and page tables in the direct map; having a lot of offline
  memory blocks is not a typical case, though.

- Memory ballooning without balloon compaction is incompatible with
  ZONE_MOVABLE. Only some implementations, such as virtio-balloon and
  pseries CMM, fully support balloon compaction.

  Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be
  disabled. In that case, balloon inflation will only perform unmovable
  allocations and silently create a zone imbalance, usually triggered by
  inflation requests from the hypervisor.

- Gigantic pages are unmovable, resulting in user space consuming a
  lot of unmovable memory.

- Huge pages are unmovable when an architecture does not support huge
  page migration, resulting in a similar issue as with gigantic pages.

- Page tables are unmovable. Excessive swapping, mapping extremely large
  files or ZONE_DEVICE memory can be problematic, although only really relevant
  in corner cases. When we manage a lot of user space memory that has been
  swapped out or is served from a file/persistent memory/... we still need a lot
  of page tables to manage that memory once user space has accessed that memory.

- In certain DAX configurations the memory map for the device memory will be
  allocated from the kernel zones.

- KASAN can have a significant memory overhead, for example, consuming 1/8th of
  the total system memory size as (unmovable) tracking metadata.

- Long-term pinning of pages. Techniques that rely on long-term pinning
  (especially, RDMA and vfio/mdev) are fundamentally problematic with
  ZONE_MOVABLE, and therefore, memory offlining. Pinned pages cannot reside
  on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they
  have to be migrated off that zone while pinning. Pinning a page can fail
  even if there is plenty of free memory in ZONE_MOVABLE.

  In addition, using ZONE_MOVABLE might make page pinning more expensive,
  because of the page migration overhead.
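Whether balloon compaction is available can be checked against the kernel
configuration. The sketch below assumes the configuration of the running kernel
is installed as ``/boot/config-$(uname -r)``, which is a common distribution
convention but not guaranteed. The function name is hypothetical.

```shell
# Sketch only: the default config path is a distribution convention and
# may not exist on every system.

# has_balloon_compaction [CONFIG_FILE]
# Succeeds if CONFIG_BALLOON_COMPACTION is enabled in the given config file.
has_balloon_compaction() {
    config=${1:-/boot/config-$(uname -r)}
    grep -q '^CONFIG_BALLOON_COMPACTION=y$' "$config"
}
```

A caller could, for example, use ``has_balloon_compaction || echo "balloon
inflation will use unmovable allocations"`` before relying on a large
ZONE_MOVABLE together with memory ballooning.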

By default, all the memory configured at boot time is managed by the kernel
zones and ZONE_MOVABLE is not used.

To enable ZONE_MOVABLE to include the memory present at boot and to control the
ratio between movable and kernel zones, there are two command line options:
``kernelcore=`` and ``movablecore=``. See
Documentation/admin-guide/kernel-parameters.rst for their description.
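To see whether one of these options was passed to the running kernel, the
kernel command line can be inspected. This is a minimal sketch with a
hypothetical function name. It prints any ``kernelcore=`` or ``movablecore=``
option it finds, such as ``movablecore=4G``, and fails if neither is present.

```shell
# Sketch only: splits the kernel command line on spaces and prints the
# options that influence boot-time ZONE_MOVABLE sizing.

# zone_movable_boot_options [CMDLINE_FILE]
zone_movable_boot_options() {
    cmdline=${1:-/proc/cmdline}
    tr ' ' '\n' < "$cmdline" | grep -E '^(kernelcore|movablecore)='
}
```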

Memory Offlining and ZONE_MOVABLE
---------------------------------

Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
block might fail:

- Memory blocks with memory holes; this applies to memory blocks present during
  boot and can apply to memory blocks hotplugged via the XEN balloon and the
  Hyper-V balloon.

- Mixed NUMA nodes and mixed zones within a single memory block prevent memory
  offlining; this applies to memory blocks present during boot only.

- Special memory blocks prevented by the system from getting offlined. Examples
  include any memory available during boot on arm64 or memory blocks spanning
  the crashkernel area on s390x; this usually applies to memory blocks present
  during boot only.

- Memory blocks overlapping with CMA areas cannot be offlined; this applies to
  memory blocks present during boot only.

- Concurrent activity that operates on the same physical memory area, such as
  allocating gigantic pages, can result in temporary offlining failures.

- Out of memory when dissolving huge pages, especially when HugeTLB Vmemmap
  Optimization (HVO) is enabled.

  Offlining code may be able to migrate huge page contents, but may not be able
  to dissolve the source huge page because it fails allocating (unmovable) pages
  for the vmemmap, because the system might not have free memory in the kernel
  zones left.

  Users that depend on memory offlining to succeed for movable zones should
  carefully consider whether the memory savings gained from this feature are
  worth the risk of possibly not being able to offline memory in certain
  situations.
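Before attempting to offline memory, it can help to identify online memory
blocks that are not managed by ZONE_MOVABLE, as offlining those is less likely
to succeed. The sketch below relies on the ``state`` and ``valid_zones``
attributes of memory blocks under ``/sys/devices/system/memory/``. The function
name and the optional directory argument are hypothetical.

```shell
# Sketch only: lists online memory blocks whose valid_zones attribute is
# anything other than "Movable" (for example "Normal" or "none").

# blocks_outside_movable [MEMORY_SYSFS_DIR]
blocks_outside_movable() {
    root=${1:-/sys/devices/system/memory}
    for dir in "$root"/memory[0-9]*; do
        [ -r "$dir/state" ] || continue
        [ "$(cat "$dir/state")" = "online" ] || continue
        zones=$(cat "$dir/valid_zones" 2>/dev/null)
        [ "$zones" = "Movable" ] || echo "${dir##*/} $zones"
    done
}
```

Each output line names one memory block followed by the zone information
reported for it.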

Further, when running into out of memory situations while migrating pages, or
when still encountering permanently unmovable pages within ZONE_MOVABLE
(-> BUG), memory offlining will keep retrying until it eventually succeeds.

When offlining is triggered from user space, the offlining context can be
terminated by sending a signal. Timeout-based offlining can easily be
implemented via::

	% timeout $TIMEOUT offline_block | failure_handling
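A slightly more concrete variant of the command above could look as follows.
This is a sketch under assumptions: it uses ``timeout`` from GNU coreutils,
which sends a signal to the writing process and exits with status 124 when the
time limit is reached. The function name and the optional directory argument
are hypothetical. Writing to the ``state`` file requires root privileges.

```shell
# Sketch only: requests offlining of one memory block and gives up after
# the given number of seconds.

# offline_block_with_timeout BLOCK SECONDS [MEMORY_SYSFS_DIR]
# BLOCK is a memory block directory name such as memory42.
offline_block_with_timeout() {
    block=$1
    secs=$2
    root=${3:-/sys/devices/system/memory}
    timeout "$secs" sh -c 'echo offline > "$1"' sh "$root/$block/state"
}
```

A caller might run ``offline_block_with_timeout memory42 30 || echo "offlining
failed or timed out"`` and handle the failure as appropriate for the setup.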