.. _admin_guide_memory_hotplug:

==================
Memory Hot(Un)Plug
==================

This document describes generic Linux support for memory hot(un)plug with
a focus on System RAM, including ZONE_MOVABLE support.

.. contents:: :local:

Introduction
============

Memory hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime. In the simplest case, it consists of
physically plugging or unplugging a DIMM at runtime, coordinated with the
operating system.

Memory hot(un)plug is used for various purposes:

- The physical memory available to a machine can be adjusted at runtime, up- or
  downgrading the memory capacity. This dynamic memory resizing, sometimes
  referred to as "capacity on demand", is frequently used with virtual machines
  and logical partitions.

- Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One
  example is replacing failing memory modules.

- Reducing energy consumption either by physically unplugging memory modules or
  by logically unplugging (parts of) memory modules from Linux.

Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also
used to expose persistent memory, other performance-differentiated memory and
reserved memory regions as ordinary system RAM to Linux.

Linux only supports memory hot(un)plug on selected 64-bit architectures, such as
x86_64, arm64, ppc64, s390x and ia64.

Memory Hot(Un)Plug Granularity
------------------------------

Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the
physical memory address space into chunks of the same size: memory sections. The
size of a memory section is architecture dependent. For example, x86_64 uses
128 MiB and ppc64 uses 16 MiB.

Memory sections are combined into chunks referred to as "memory blocks". The
size of a memory block is architecture dependent and corresponds to the smallest
granularity that can be hot(un)plugged. The default size of a memory block is
the same as memory section size, unless an architecture specifies otherwise.

All memory blocks have the same size.
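
The memory block size of a running system can be read from the
``block_size_bytes`` sysfs file. The following is a minimal sketch, assuming
the kernel prints the value as a hexadecimal number without a ``0x`` prefix;
the function name and its optional directory argument are illustrative only
and not part of any kernel interface.

```shell
# Print the memory block size in MiB.
# $1: sysfs memory directory (hypothetical parameter, defaults to the real one).
block_size_mib() {
    local dir=${1:-/sys/devices/system/memory}
    local hex
    # block_size_bytes is assumed to hold a hex value such as "8000000".
    hex=$(cat "$dir/block_size_bytes") || return 1
    echo $(( 0x$hex / 1024 / 1024 ))
}
```

On a system with 128 MiB memory blocks, ``block_size_mib`` would be expected
to print ``128``.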

Phases of Memory Hotplug
------------------------

Memory hotplug consists of two phases:

(1) Adding the memory to Linux
(2) Onlining memory blocks

In the first phase, metadata, such as the memory map ("memmap") and page tables
for the direct mapping, is allocated and initialized, and memory blocks are
created; the latter also creates sysfs files for managing newly created memory
blocks.

In the second phase, added memory is exposed to the page allocator. After this
phase, the memory is visible in memory statistics, such as free and total
memory, of the system.

Phases of Memory Hotunplug
--------------------------

Memory hotunplug consists of two phases:

(1) Offlining memory blocks
(2) Removing the memory from Linux

In the first phase, memory is "hidden" from the page allocator again, for
example, by migrating busy memory to other memory locations and removing all
relevant free pages from the page allocator. After this phase, the memory is no
longer visible in memory statistics of the system.

In the second phase, the memory blocks are removed and metadata is freed.

Memory Hotplug Notifications
============================

There are various ways in which Linux is notified about memory hotplug events
such that it can start adding hotplugged memory. This description is limited to
systems that support ACPI; mechanisms specific to other firmware interfaces or
virtual machines are not described.

ACPI Notifications
------------------

Platforms that support ACPI, such as x86_64, can support memory hotplug
notifications via ACPI.

In general, a firmware supporting memory hotplug defines a memory class object
with the _HID "PNP0C80". When notified about hotplug of a new memory device, the
ACPI driver will hotplug the memory to Linux.

If the firmware supports hotplug of NUMA nodes, it defines an object with the
_HID "ACPI0004", "PNP0A05", or "PNP0A06". When notified about a hotplug event,
all assigned memory devices are added to Linux by the ACPI driver.

Similarly, Linux can be notified about requests to hotunplug a memory device or
a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory
blocks, and, if successful, hotunplug the memory from Linux.

Manual Probing
--------------

On some architectures, the firmware may not be able to notify the operating
system about a memory hotplug event. Instead, the memory has to be manually
probed from user space.

The probe interface is located at::

	/sys/devices/system/memory/probe

Only complete memory blocks can be probed. Individual memory blocks are probed
by providing the physical start address of the memory block::

	% echo addr > /sys/devices/system/memory/probe

This results in a memory block for the range [addr, addr + memory_block_size)
being created.

.. note::

  Using the probe interface is discouraged as it is easy to crash the kernel,
  because Linux cannot validate user input; this interface might be removed in
  the future.
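
Because only complete memory blocks can be probed, a script that still has to
use this interface may want to check the alignment of the address first. The
helper below is a hypothetical sketch; it checks alignment only and cannot
tell whether memory actually exists at the address.

```shell
# Succeed only if the address is a multiple of the memory block size.
# $1: physical start address, $2: memory block size in bytes
# (both may be given in decimal or in 0x... notation).
is_block_aligned() {
    local addr=$(( $1 )) block_size=$(( $2 ))
    (( block_size > 0 && addr % block_size == 0 ))
}

# Possible use (not executed here), with 128 MiB memory blocks:
#   is_block_aligned 0x100000000 0x8000000 &&
#       echo 0x100000000 > /sys/devices/system/memory/probe
```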

Onlining and Offlining Memory Blocks
====================================

After a memory block has been created, Linux has to be instructed to actually
make use of that memory: the memory block has to be "onlined".

Before a memory block can be removed, Linux has to stop using any memory that is
part of the memory block: the memory block has to be "offlined".

The Linux kernel can be configured to automatically online added memory blocks,
and drivers automatically trigger offlining of memory blocks when attempting to
hotunplug memory. Memory blocks can only be removed once offlining has
succeeded.

Onlining Memory Blocks Manually
-------------------------------

If auto-onlining of memory blocks isn't enabled, user space has to manually
trigger onlining of memory blocks. Often, udev rules are used to automate this
task in user space.

Onlining of a memory block can be triggered via::

	% echo online > /sys/devices/system/memory/memoryXXX/state

Or alternatively::

	% echo 1 > /sys/devices/system/memory/memoryXXX/online

The kernel will select the target zone automatically, usually defaulting to
``ZONE_NORMAL`` unless ``movable_node`` has been specified on the kernel
command line or the memory block would already intersect ZONE_MOVABLE.

One can explicitly request to associate an offline memory block with
ZONE_MOVABLE by::

	% echo online_movable > /sys/devices/system/memory/memoryXXX/state

Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by::

	% echo online_kernel > /sys/devices/system/memory/memoryXXX/state

In any case, if onlining succeeds, the state of the memory block is changed to
be "online". If it fails, the state of the memory block will remain unchanged
and the above commands will fail.
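
When several memory blocks are offline, the commands above can be applied in a
loop. The following sketch onlines every block whose ``state`` file reads
``offline``; the function name and the directory argument are assumptions made
for illustration and to keep the sketch testable.

```shell
# Online every memory block that is currently offline.
# $1: value to write: online, online_kernel or online_movable (default: online).
# $2: sysfs memory directory (hypothetical parameter, defaults to the real one).
online_offline_blocks() {
    local how=${1:-online}
    local dir=${2:-/sys/devices/system/memory}
    local state rc=0
    for state in "$dir"/memory*/state; do
        [ -e "$state" ] || continue
        if [ "$(cat "$state")" = "offline" ]; then
            # A failed write leaves the state of the block unchanged.
            echo "$how" > "$state" || { echo "failed: $state" >&2; rc=1; }
        fi
    done
    return "$rc"
}
```

For example, ``online_offline_blocks online_movable`` would request
ZONE_MOVABLE for all offline blocks and return non-zero if any request failed.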

Onlining Memory Blocks Automatically
------------------------------------

The kernel can be configured to try auto-onlining of newly added memory blocks.
If this feature is disabled, the memory blocks will stay offline until
explicitly onlined from user space.

The configured auto-online behavior can be observed via::

	% cat /sys/devices/system/memory/auto_online_blocks

Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
``online_movable`` to that file, like::

	% echo online > /sys/devices/system/memory/auto_online_blocks

Modifying the auto-online behavior affects only subsequently added memory
blocks.

.. note::

  In corner cases, auto-onlining can fail. The kernel won't retry. Note that
  auto-onlining is not expected to fail in default configurations.

.. note::

  DLPAR on ppc64 ignores the ``offline`` setting and will still online added
  memory blocks; if onlining fails, memory blocks are removed again.
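
Setting and verifying the auto-online behavior can be combined in a small
helper. This is a sketch only: the function name and the directory argument
are hypothetical, and it assumes that reading ``auto_online_blocks`` returns
the same string that was written.

```shell
# Set the auto-online behavior and confirm that the kernel accepted it.
# $1: offline, online, online_kernel or online_movable.
# $2: sysfs memory directory (hypothetical parameter, defaults to the real one).
set_auto_online() {
    local policy=$1
    local file=${2:-/sys/devices/system/memory}/auto_online_blocks
    case "$policy" in
        offline|online|online_kernel|online_movable) ;;
        *) echo "unsupported value: $policy" >&2; return 2 ;;
    esac
    echo "$policy" > "$file" || return 1
    [ "$(cat "$file")" = "$policy" ]
}
```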

Offlining Memory Blocks
-----------------------

In the current implementation, Linux's memory offlining will try migrating all
movable pages off the affected memory block. As most kernel allocations, such as
page tables, are unmovable, page migration can fail and, therefore, inhibit
memory offlining from succeeding.

Having the memory provided by a memory block managed by ZONE_MOVABLE
significantly increases memory offlining reliability; still, memory offlining
can fail in some corner cases.

Further, memory offlining might retry for a long time (or even forever), until
aborted by the user.

Offlining of a memory block can be triggered via::

	% echo offline > /sys/devices/system/memory/memoryXXX/state

Or alternatively::

	% echo 0 > /sys/devices/system/memory/memoryXXX/online

If offlining succeeds, the state of the memory block is changed to be "offline".
If it fails, the state of the memory block will remain unchanged and the above
commands will fail, for example, via::

	bash: echo: write error: Device or resource busy

or via::

	bash: echo: write error: Invalid argument
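
Since offlining might retry for a long time, scripts may want to bound the
attempt. The sketch below uses ``timeout`` from GNU coreutils and assumes that
the signal it sends to the writing process aborts the offlining attempt in the
same way a user interrupt would; the function name is hypothetical.

```shell
# Try to offline one memory block, giving up after a timeout.
# $1: path to the state file of the memory block
# $2: timeout in seconds (default: 30)
offline_with_timeout() {
    local state=$1 secs=${2:-30}
    # timeout(1) sends SIGTERM to the writer once the time is up
    # and then exits with status 124.
    timeout "$secs" sh -c 'echo offline > "$1"' sh "$state"
    local rc=$?
    [ "$rc" -eq 124 ] && echo "offlining timed out: $state" >&2
    return "$rc"
}
```

A possible invocation is
``offline_with_timeout /sys/devices/system/memory/memoryXXX/state 60``.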

Observing the State of Memory Blocks
------------------------------------

The state (online/offline/going-offline) of a memory block can be observed
either via::

	% cat /sys/devices/system/memory/memoryXXX/state

Or alternatively (1/0) via::

	% cat /sys/devices/system/memory/memoryXXX/online

For an online memory block, the managing zone can be observed via::

	% cat /sys/devices/system/memory/memoryXXX/valid_zones
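
To get an overview of all memory blocks at once, the individual ``state``
files can be aggregated. A minimal sketch, with a hypothetical function name
and an optional directory argument that exists only for illustration:

```shell
# Print how many memory blocks are in each state, one "count state" pair
# per line, sorted by state name.
# $1: sysfs memory directory (hypothetical parameter, defaults to the real one).
summarize_block_states() {
    local dir=${1:-/sys/devices/system/memory}
    cat "$dir"/memory*/state 2>/dev/null | sort | uniq -c
}
```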

Configuring Memory Hot(Un)Plug
==============================

There are various ways in which system administrators can configure memory
hot(un)plug and interact with memory blocks, especially to online them.

Memory Hot(Un)Plug Configuration via Sysfs
------------------------------------------

Some memory hot(un)plug properties can be configured or inspected via sysfs in::

	/sys/devices/system/memory/

The following files are currently defined:

====================== =========================================================
``auto_online_blocks`` read-write: set or get the default state of new memory
                       blocks; configure auto-onlining.

                       The default value depends on the
                       CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration
                       option.

                       See the ``state`` property of memory blocks for details.
``block_size_bytes``   read-only: the size in bytes of a memory block.
``probe``              write-only: add (probe) selected memory blocks manually
                       from user space by supplying the physical start address.

                       Availability depends on the CONFIG_ARCH_MEMORY_PROBE
                       kernel configuration option.
``uevent``             read-write: generic udev file for device subsystems.
====================== =========================================================

.. note::

  When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two
  additional files ``hard_offline_page`` and ``soft_offline_page`` are available
  to trigger hwpoisoning of pages, for example, for testing purposes. Note that
  this functionality is not really related to memory hot(un)plug or actual
  offlining of memory blocks.

Memory Block Configuration via Sysfs
------------------------------------

Each memory block is represented as a memory block device that can be
onlined or offlined. All memory blocks have their device information located in
sysfs. Each present memory block is listed under
``/sys/devices/system/memory`` as::

	/sys/devices/system/memory/memoryXXX

where XXX is the memory block id; the number of digits is variable.

A present memory block indicates that some memory in the range is present;
however, a memory block might span memory holes. A memory block spanning memory
holes cannot be offlined.

For example, assume a memory block size of 1 GiB. The device for memory starting
at 0x100000000 is ``/sys/devices/system/memory/memory4``::

	(0x100000000 / 1 GiB = 4)

This device covers the address range [0x100000000 ... 0x140000000).
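
The same arithmetic can be done in the shell. The sketch below derives the
memory block device name and the covered address range from a physical address
and the memory block size; the function name is hypothetical.

```shell
# Map a physical address to its memory block device and address range.
# $1: physical address, $2: memory block size in bytes
# (both may be given in decimal or in 0x... notation).
block_for_addr() {
    local addr=$(( $1 )) size=$(( $2 ))
    local id=$(( addr / size ))
    printf 'memory%d [0x%x ... 0x%x)\n' \
        "$id" $(( id * size )) $(( (id + 1) * size ))
}

# With 1 GiB memory blocks, as in the example above:
block_for_addr 0x100000000 $(( 1 << 30 ))
# prints: memory4 [0x100000000 ... 0x140000000)
```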

The following files are currently defined:

=================== ============================================================
``online``          read-write: simplified interface to trigger onlining /
                    offlining and to observe the state of a memory block.
                    When onlining, the zone is selected automatically.
``phys_device``     read-only: legacy interface only ever used on s390x to
                    expose the covered storage increment.
``phys_index``      read-only: the memory block id (XXX).
``removable``       read-only: legacy interface that indicated whether a memory
                    block was likely to be offlineable or not. Nowadays, the
                    kernel returns ``1`` if and only if it supports memory
                    offlining.
``state``           read-write: advanced interface to trigger onlining /
                    offlining and to observe the state of a memory block.

                    When writing, ``online``, ``offline``, ``online_kernel`` and
                    ``online_movable`` are supported.

                    ``online_movable`` specifies onlining to ZONE_MOVABLE.
                    ``online_kernel`` specifies onlining to the default kernel
                    zone for the memory block, such as ZONE_NORMAL.
                    ``online`` lets the kernel select the zone automatically.

                    When reading, ``online``, ``offline`` and ``going-offline``
                    may be returned.
``uevent``          read-write: generic uevent file for devices.
``valid_zones``     read-only: when a block is online, shows the zone it
                    belongs to; when a block is offline, shows what zone will
                    manage it when the block is onlined.

                    For online memory blocks, ``DMA``, ``DMA32``, ``Normal``,
                    ``Movable`` and ``none`` may be returned. ``none`` indicates
                    that memory provided by a memory block is managed by
                    multiple zones or spans multiple nodes; such memory blocks
                    cannot be offlined. ``Movable`` indicates ZONE_MOVABLE.
                    Other values indicate a kernel zone.

                    For offline memory blocks, the first column shows the
                    zone the kernel would select when onlining the memory block
                    right now without further specifying a zone.

                    Availability depends on the CONFIG_MEMORY_HOTREMOVE
                    kernel configuration option.
=================== ============================================================

.. note::

  If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/
  directories can also be accessed via symbolic links located in the
  ``/sys/devices/system/node/node*`` directories.

  For example::

	/sys/devices/system/node/node0/memory9 -> ../../memory/memory9

  A backlink will also be created::

	/sys/devices/system/memory/memory9/node0 -> ../../node/node0
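
These links make it possible to list the memory blocks that belong to one NUMA
node. A minimal sketch with a hypothetical function name; the glob is
restricted to names of the form ``memory<digits>`` on the assumption that a
node directory may contain other entries that start with ``memory``.

```shell
# List the ids of the memory blocks linked below one NUMA node directory.
# $1: node directory, for example /sys/devices/system/node/node0
node_block_ids() {
    local link
    for link in "$1"/memory[0-9]*; do
        [ -e "$link" ] || continue
        # Strip everything up to and including the "memory" prefix.
        echo "${link##*/memory}"
    done | sort -n
}
```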

Command Line Parameters
-----------------------

Some command line parameters affect memory hot(un)plug handling. The following
command line parameters are relevant:

======================== =======================================================
``memhp_default_state``  configure auto-onlining by essentially setting
                         ``/sys/devices/system/memory/auto_online_blocks``.
``movable_node``         configure automatic zone selection of the kernel. When
                         set, the kernel will default to ZONE_MOVABLE, unless
                         other zones can be kept contiguous.
======================== =======================================================

Module Parameters
-----------------

Instead of additional command line parameters or sysfs files, the
``memory_hotplug`` subsystem now provides a dedicated namespace for module
parameters. Module parameters can be set via the command line by prefixing
them with ``memory_hotplug.`` such as::

	memory_hotplug.memmap_on_memory=1

and they can be observed (and some even modified at runtime) via::

	/sys/module/memory_hotplug/parameters/

The following module parameters are currently defined:

======================== =======================================================
``memmap_on_memory``     read-write: Allocate memory for the memmap from the
                         added memory block itself. Even if enabled, actual
                         support depends on various other system properties and
                         should only be regarded as a hint whether the behavior
                         would be desired.

                         While allocating the memmap from the memory block
                         itself makes memory hotplug less likely to fail and
                         keeps the memmap on the same NUMA node in any case, it
                         can fragment physical memory in a way that huge pages
                         in bigger granularity cannot be formed on hotplugged
                         memory.
======================== =======================================================
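
The current value of a module parameter can be read from its file. The helper
below is a sketch under the assumption that the parameters are exposed below
``/sys/module/memory_hotplug/parameters/``, the usual sysfs location for module
parameters; the function name and the directory argument are hypothetical.

```shell
# Print the current value of a memory_hotplug module parameter.
# $1: parameter name, for example memmap_on_memory
# $2: parameters directory (hypothetical parameter, defaults to the real one).
memhp_param() {
    local dir=${2:-/sys/module/memory_hotplug/parameters}
    if [ -r "$dir/$1" ]; then
        cat "$dir/$1"
    else
        echo "parameter $1 is not available" >&2
        return 1
    fi
}
```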

ZONE_MOVABLE
============

ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
Further, having system RAM managed by ZONE_MOVABLE instead of one of the
kernel zones can increase the number of possible transparent huge pages and
dynamically allocated huge pages.

Most kernel allocations are unmovable. Important examples include the memory
map (usually 1/64th of memory), page tables, and kmalloc(). Such allocations
can only be served from the kernel zones.

Most user space pages, such as anonymous memory, and page cache pages are
movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones.

Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable
allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is
absolutely no guarantee that a memory block can be offlined successfully.

Zone Imbalances
---------------

Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance,
which can harm the system or degrade performance. As one example, the kernel
might crash because it runs out of free memory for unmovable allocations,
although there is still plenty of free memory left in ZONE_MOVABLE.

Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1
are definitely impossible due to the overhead for the memory map.
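
The 63:1 limit follows from the size of the memory map. As a rough
illustration, the sketch below compares the size of the kernel zones with the
size of the memory map for a given ratio. It assumes a 64-byte memory map entry
per 4 KiB page (1/64 of memory) and a memory map placed entirely in the kernel
zones; both are simplifying assumptions, and the function name is hypothetical.

```shell
# Compare the size of the kernel zones with the size of the memory map.
# $1: total system RAM in MiB
# $2: MOVABLE part of the MOVABLE:KERNEL ratio
# $3: KERNEL part of the MOVABLE:KERNEL ratio
kernel_zone_vs_memmap() {
    local total=$1 movable=$2 kernel=$3
    local kernel_mib=$(( total * kernel / (movable + kernel) ))
    # Assumption: 64 bytes of memory map per 4096-byte page.
    local memmap_mib=$(( total * 64 / 4096 ))
    echo "kernel zones: ${kernel_mib} MiB, memmap: ${memmap_mib} MiB"
}

kernel_zone_vs_memmap 65536 3 1
# prints: kernel zones: 16384 MiB, memmap: 1024 MiB
kernel_zone_vs_memmap 65536 63 1
# prints: kernel zones: 1024 MiB, memmap: 1024 MiB
```

Under these assumptions, a ratio of 63:1 leaves the kernel zones fully
occupied by the memory map, with nothing left for other unmovable allocations.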

Actual safe zone ratios depend on the workload. Extreme cases, like excessive
long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all.

.. note::

  CMA memory that is part of a kernel zone essentially behaves like memory in
  ZONE_MOVABLE and similar considerations apply, especially when combining
  CMA with ZONE_MOVABLE.
4706bf53999SMike Rapoport
ZONE_MOVABLE Sizing Considerations
----------------------------------

We usually expect that a large portion of available system RAM will actually
be consumed by user space, either directly or indirectly via the page cache. In
the normal case, ZONE_MOVABLE can be used when allocating such pages just fine.

With that in mind, it makes sense that we can have a big portion of system RAM
managed by ZONE_MOVABLE. However, there are some things to consider when using
ZONE_MOVABLE, especially when fine-tuning zone ratios:

- Having a lot of offline memory blocks. Even offline memory blocks consume
  memory for metadata and page tables in the direct map; having a lot of offline
  memory blocks is not a typical case, though.

- Memory ballooning without balloon compaction is incompatible with
  ZONE_MOVABLE. Only some implementations, such as virtio-balloon and
  pseries CMM, fully support balloon compaction.

  Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be
  disabled. In that case, balloon inflation will only perform unmovable
  allocations and silently create a zone imbalance, usually triggered by
  inflation requests from the hypervisor.

- Gigantic pages are unmovable, resulting in user space consuming a
  lot of unmovable memory.

- Huge pages are unmovable when an architecture does not support huge
  page migration, resulting in a similar issue as with gigantic pages.

- Page tables are unmovable. Excessive swapping, mapping extremely large
  files or ZONE_DEVICE memory can be problematic, although only really relevant
  in corner cases. When we manage a lot of user space memory that has been
  swapped out or is served from a file, persistent memory, or similar, we still
  need a lot of page tables to manage that memory once user space has accessed
  it.

- In certain DAX configurations the memory map for the device memory will be
  allocated from the kernel zones.

- KASAN can have a significant memory overhead, for example, consuming 1/8th of
  the total system memory size as (unmovable) tracking metadata.

- Long-term pinning of pages. Techniques that rely on long-term pinnings
  (especially, RDMA and vfio/mdev) are fundamentally problematic with
  ZONE_MOVABLE and, therefore, with memory offlining. Pinned pages cannot
  reside on ZONE_MOVABLE as that would turn these pages unmovable. Therefore,
  they have to be migrated off that zone while pinning. Pinning a page can fail
  even if there is plenty of free memory in ZONE_MOVABLE.

  In addition, using ZONE_MOVABLE might make page pinning more expensive,
  because of the page migration overhead.

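Some of the items above can be checked on a running system. The following is a
minimal sketch with made-up function names. It assumes that ``/proc/meminfo``
provides the ``PageTables`` field and that a copy of the kernel configuration
is available; its location (for example ``/boot/config-$(uname -r)``) varies
between distributions::

	# Memory currently used for page tables, in KiB (meminfo on stdin).
	pagetables_kib() {
		awk '$1 == "PageTables:" { print $2 }'
	}

	# Succeeds if balloon compaction is enabled (kernel config on stdin).
	balloon_compaction_enabled() {
		grep -q '^CONFIG_BALLOON_COMPACTION=y$'
	}

For example, ``pagetables_kib < /proc/meminfo`` prints the current page table
usage, and ``balloon_compaction_enabled < /boot/config-$(uname -r)`` returns
success only if the option is set in that configuration file.
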
By default, all the memory configured at boot time is managed by the kernel
zones and ZONE_MOVABLE is not used.

To enable ZONE_MOVABLE to include the memory present at boot and to control the
ratio between movable and kernel zones, there are two command line options:
``kernelcore=`` and ``movablecore=``. See
Documentation/admin-guide/kernel-parameters.rst for their description.

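As a rough illustration (the value is an example, not a recommendation),
``movablecore=`` accepts either an absolute size or a percentage of total
memory. A MOVABLE:KERNEL ratio of roughly 3:1 for boot memory could therefore
be requested by adding the following to the kernel command line::

	movablecore=75%

The kernel may adjust the requested split, so the resulting zone sizes are best
verified in ``/proc/zoneinfo`` after booting.
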
Memory Offlining and ZONE_MOVABLE
---------------------------------

Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
block might fail:

- Memory blocks with memory holes; this applies to memory blocks present during
  boot and can apply to memory blocks hotplugged via the XEN balloon and the
  Hyper-V balloon.

- Mixed NUMA nodes and mixed zones within a single memory block prevent memory
  offlining; this applies to memory blocks present during boot only.

- Special memory blocks prevented by the system from getting offlined. Examples
  include any memory available during boot on arm64 or memory blocks spanning
  the crashkernel area on s390x; this usually applies to memory blocks present
  during boot only.

- Memory blocks overlapping with CMA areas cannot be offlined; this applies to
  memory blocks present during boot only.

- Concurrent activity that operates on the same physical memory area, such as
  allocating gigantic pages, can result in temporary offlining failures.

- Out of memory when dissolving huge pages, especially when freeing unused
  vmemmap pages associated with each hugetlb page is enabled.

  Offlining code may be able to migrate huge page contents, but may not be able
  to dissolve the source huge page because it fails allocating (unmovable) pages
  for the vmemmap, because the system might not have free memory in the kernel
  zones left.

  Users that depend on memory offlining to succeed for movable zones should
  carefully consider whether the memory savings gained from this feature are
  worth the risk of possibly not being able to offline memory in certain
  situations.

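Memory blocks that mix zones or NUMA nodes can be spotted before attempting to
offline them. The sketch below assumes the ``state`` and ``valid_zones``
attributes of memory block devices under ``/sys/devices/system/memory/``, where
an online block reporting ``none`` in ``valid_zones`` cannot be offlined. The
function name is made up, and the optional argument exists only so that the
function can be pointed at a different directory for testing::

	# Print the names of online memory blocks that report "none" as zone.
	mixed_blocks() {
		root=${1:-/sys/devices/system/memory}
		for dir in "$root"/memory[0-9]*; do
			[ -r "$dir/state" ] && [ -r "$dir/valid_zones" ] || continue
			if [ "$(cat "$dir/state")" = "online" ] &&
			   [ "$(cat "$dir/valid_zones")" = "none" ]; then
				echo "${dir##*/}"
			fi
		done
	}

This only covers one of the corner cases listed above; an empty result does not
imply that offlining every remaining block will succeed.
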
Further, when running into out of memory situations while migrating pages, or
when still encountering permanently unmovable pages within ZONE_MOVABLE
(which indicates a bug), memory offlining will keep retrying until it
eventually succeeds.

When offlining is triggered from user space, the offlining context can be
terminated by sending a fatal signal. Timeout-based offlining can easily be
implemented via::

	% timeout $TIMEOUT offline_block || failure_handling

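A slightly more concrete variant of the command above might look as follows.
This is only a sketch: the function name and the error message are made up, and
it assumes that a memory block is offlined by writing ``offline`` to the
``state`` attribute of its sysfs directory::

	# $1: timeout in seconds, $2: sysfs directory of the memory block
	offline_block_timeout() {
		# The write runs in a child shell so that timeout can signal it.
		timeout "$1" sh -c 'echo offline > "$1/state"' sh "$2" || {
			echo "offlining $2 failed or timed out" >&2
			return 1
		}
	}

It could be called as
``offline_block_timeout 60 /sys/devices/system/memory/memoryXXX``, with
``memoryXXX`` standing for the memory block in question.
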