=========================
CPU hotplug in the Kernel
=========================

:Date: December, 2016
:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
          Rusty Russell <rusty@rustcorp.com.au>,
          Srivatsa Vaddagiri <vatsa@in.ibm.com>,
          Ashok Raj <ashok.raj@intel.com>,
          Joel Schopp <jschopp@austin.ibm.com>

Introduction
============

Modern advances in system architectures have introduced advanced error
reporting and correction capabilities in processors. There are a couple of
OEMs that also support hot-pluggable NUMA hardware, where physical node
insertion and removal require support for CPU hotplug.

Such advances require CPUs available to a kernel to be removed either for
provisioning reasons, or for RAS purposes to keep an offending CPU off the
system execution path. Hence the need for CPU hotplug support in the
Linux kernel.

A more novel use of CPU hotplug support is in suspend/resume support for SMP.
Dual-core and HT support makes even a laptop run SMP kernels, which previously
did not support these methods.


Command Line Switches
=====================
``maxcpus=n``
  Restrict boot time CPUs to *n*. For example, if you have four CPUs, using
  ``maxcpus=2`` will only boot two. You can choose to bring the
  other CPUs online later.

``nr_cpus=n``
  Restrict the total number of CPUs the kernel will support. If the number
  supplied here is lower than the number of physically available CPUs, then
  those CPUs cannot be brought online later.

``additional_cpus=n``
  Use this to limit hotpluggable CPUs. This option sets
  ``cpu_possible_mask = cpu_present_mask + additional_cpus``

  This option is limited to the IA64 architecture.

``possible_cpus=n``
  This option sets ``possible_cpus`` bits in ``cpu_possible_mask``.

  This option is limited to the X86 and S390 architectures.

``cede_offline={"off","on"}``
  Use this option to disable/enable putting offlined processors into an
  extended ``H_CEDE`` state on supported pseries platforms. If nothing is
  specified, ``cede_offline`` is set to "on".

  This option is limited to the PowerPC architecture.

``cpu0_hotplug``
  Allow CPU0 to be shut down.

  This option is limited to the X86 architecture.

CPU maps
========

``cpu_possible_mask``
  Bitmap of possible CPUs that can ever be available in the
  system. This is used to allocate some boot time memory for per_cpu variables
  that aren't designed to grow/shrink as CPUs are made available or removed.
  Once set during the boot time discovery phase, the map is static, i.e. no
  bits are added or removed at any time. Trimming it accurately for your
  system needs upfront can save some boot time memory.

``cpu_online_mask``
  Bitmap of all CPUs currently online. It is set in ``__cpu_up()``
  after a CPU is available for kernel scheduling and ready to receive
  interrupts from devices. It is cleared when a CPU is brought down using
  ``__cpu_disable()``, before which all OS services including interrupts are
  migrated to another target CPU.

``cpu_present_mask``
  Bitmap of CPUs currently present in the system. Not all
  of them may be online. When physical hotplug is processed by the relevant
  subsystem (e.g. ACPI), the map can change: a bit is either added to or
  removed from the map depending on whether the event is hot-add or
  hot-remove. There are currently no locking rules. Typical usage is to init
  topology during boot, at which time hotplug is disabled.

You really don't need to manipulate any of the system CPU maps. They should
be read-only for most uses. When setting up per-cpu resources almost always use
``cpu_possible_mask`` or ``for_each_possible_cpu()`` to iterate. The macro
``for_each_cpu()`` can be used to iterate over a custom CPU mask.

Never use anything other than ``cpumask_t`` to represent a bitmap of CPUs.


Using CPU hotplug
=================
The kernel option *CONFIG_HOTPLUG_CPU* needs to be enabled. It is currently
available on multiple architectures including ARM, MIPS, PowerPC and X86. The
configuration is done via the sysfs interface: ::

 $ ls -lh /sys/devices/system/cpu
 total 0
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu0
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu1
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu2
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu3
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu4
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu5
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu6
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu7
 drwxr-xr-x  2 root root    0 Dec 21 16:33 hotplug
 -r--r--r--  1 root root 4.0K Dec 21 16:33 offline
 -r--r--r--  1 root root 4.0K Dec 21 16:33 online
 -r--r--r--  1 root root 4.0K Dec 21 16:33 possible
 -r--r--r--  1 root root 4.0K Dec 21 16:33 present

The files *offline*, *online*, *possible*, *present* represent the CPU masks.
Each CPU folder contains an *online* file which controls the logical on (1) and
off (0) state. To logically shut down CPU4: ::

 $ echo 0 > /sys/devices/system/cpu/cpu4/online
 smpboot: CPU 4 is now offline

Once the CPU is shut down, it will be removed from */proc/interrupts* and
*/proc/cpuinfo* and should no longer be shown by the *top* command. To
bring CPU4 back online: ::

 $ echo 1 > /sys/devices/system/cpu/cpu4/online
 smpboot: Booting Node 0 Processor 4 APIC 0x1

The CPU is usable again. This should work on all CPUs. CPU0 is often special
and excluded from CPU hotplug. On X86 the kernel option
*CONFIG_BOOTPARAM_HOTPLUG_CPU0* has to be enabled in order to be able to
shut down CPU0. Alternatively the kernel command line option *cpu0_hotplug*
can be used. Some known dependencies of CPU0:

* Resume from hibernate/suspend. Hibernate/suspend will fail if CPU0 is offline.
* PIC interrupts. CPU0 can't be removed if a PIC interrupt is detected.

Please let Fenghua Yu <fenghua.yu@intel.com> know if you find any dependencies
on CPU0.
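
The mask files listed above are plain text in the kernel's CPU list format,
for example ``0-3,5``. The following is a minimal user-space sketch, not kernel
code, of how such a string could be turned into a bitmap. The helper name
``parse_cpu_list()`` and the limit of 64 CPUs are assumptions made for
illustration only: ::

  #include <stdint.h>
  #include <stdlib.h>

  /*
   * Hypothetical helper: parse a CPU list such as "0-3,5" into a bitmap.
   * Returns 0 on success, -1 on malformed input or CPU numbers >= 64.
   * An empty list (as an empty "offline" file would give) yields mask 0.
   */
  int parse_cpu_list(const char *s, uint64_t *mask)
  {
      *mask = 0;
      while (*s && *s != '\n') {
          char *end;
          unsigned long lo, hi;

          if (*s < '0' || *s > '9')
              return -1;
          lo = strtoul(s, &end, 10);
          hi = lo;
          if (*end == '-') {
              /* A range "lo-hi": parse the upper bound as well. */
              s = end + 1;
              if (*s < '0' || *s > '9')
                  return -1;
              hi = strtoul(s, &end, 10);
          }
          if (hi < lo || hi >= 64)
              return -1;
          for (; lo <= hi; lo++)
              *mask |= UINT64_C(1) << lo;
          if (*end == ',')
              end++;
          else if (*end && *end != '\n')
              return -1;
          s = end;
      }
      return 0;
  }

Reading ``/sys/devices/system/cpu/online`` with ``fgets()`` and passing the
buffer to this helper would give ``0xff`` for the content ``0-7``.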

The CPU hotplug coordination
============================

The offline case
----------------
Once a CPU has been logically shut down, the teardown callbacks of registered
hotplug states will be invoked, starting with ``CPUHP_ONLINE`` and terminating
at state ``CPUHP_OFFLINE``. This includes:

* If tasks are frozen due to a suspend operation then *cpuhp_tasks_frozen*
  will be set to true.
* All processes are migrated away from this outgoing CPU to new CPUs.
  The new CPU is chosen from each process' current cpuset, which may be
  a subset of all online CPUs.
* All interrupts targeted to this CPU are migrated to a new CPU.
* Timers are also migrated to a new CPU.
* Once all services are migrated, the kernel calls an arch specific routine
  ``__cpu_disable()`` to perform arch specific cleanup.

Using the hotplug API
---------------------
It is possible to receive notifications once a CPU is offlined or onlined. This
might be important to certain drivers which need to perform some kind of setup
or clean up functions based on the number of available CPUs: ::

  #include <linux/cpuhotplug.h>

  ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "X/Y:online",
                          Y_online, Y_prepare_down);

*X* is the subsystem and *Y* the particular driver. The *Y_online* callback
will be invoked during registration on all online CPUs. If an error
occurs during the online callback the *Y_prepare_down* callback will be
invoked on all CPUs on which the online callback was previously invoked.
After registration has completed, the *Y_online* callback will be invoked
once a CPU is brought online and *Y_prepare_down* will be invoked when a
CPU is shut down. All resources which were previously allocated in
*Y_online* should be released in *Y_prepare_down*.

The return value *ret* is negative if an error occurred during the
registration process. Otherwise a positive value is returned which
contains the allocated hotplug state for dynamically allocated states
(*CPUHP_AP_ONLINE_DYN*). It will return zero for predefined states.

The callback can be removed by invoking ``cpuhp_remove_state()``. In case of a
dynamically allocated state (*CPUHP_AP_ONLINE_DYN*) use the returned state.
During the removal of a hotplug state the teardown callback will be invoked.
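
As a rough sketch of how the calls above might fit together in a loadable
module: the names ``y_online``, ``y_prepare_down`` and ``y_hp_state`` as well
as the state name ``"x/y:online"`` are placeholders, and the callbacks only log
a message. It has to be built against a kernel tree and is not a complete
driver: ::

  #include <linux/cpuhotplug.h>
  #include <linux/init.h>
  #include <linux/module.h>
  #include <linux/printk.h>

  /* Dynamically allocated state, needed again for removal. */
  static enum cpuhp_state y_hp_state;

  /* Startup callback: a real driver would set up per-CPU resources here. */
  static int y_online(unsigned int cpu)
  {
      pr_info("y: CPU %u online\n", cpu);
      return 0;
  }

  /* Teardown callback: release whatever y_online() set up for this CPU. */
  static int y_prepare_down(unsigned int cpu)
  {
      pr_info("y: CPU %u going down\n", cpu);
      return 0;
  }

  static int __init y_init(void)
  {
      int ret;

      ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x/y:online",
                              y_online, y_prepare_down);
      if (ret < 0)
          return ret;

      /* A positive return value is the allocated dynamic state. */
      y_hp_state = ret;
      return 0;
  }

  static void __exit y_exit(void)
  {
      /* Removing the state invokes the teardown callback. */
      cpuhp_remove_state(y_hp_state);
  }

  module_init(y_init);
  module_exit(y_exit);
  MODULE_LICENSE("GPL");

With such a module loaded, offlining and onlining a CPU through sysfs should
trigger ``y_prepare_down()`` and ``y_online()`` respectively.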

Multiple instances
~~~~~~~~~~~~~~~~~~
If a driver has multiple instances and each instance needs to perform the
callback independently then it is likely that a "multi-state" should be used.
First a multi-state state needs to be registered: ::

  ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "X/Y:online",
                                Y_online, Y_prepare_down);
  Y_hp_online = ret;

The ``cpuhp_setup_state_multi()`` behaves similarly to ``cpuhp_setup_state()``
except it prepares the callbacks for a multi state and does not invoke
the callbacks. This is a one time setup.
Once a new instance is allocated, you need to register this new instance: ::

  ret = cpuhp_state_add_instance(Y_hp_online, &d->node);

This function will add this instance to your previously allocated
*Y_hp_online* state and invoke the previously registered callback
(*Y_online*) on all online CPUs. The *node* element is a ``struct
hlist_node`` member of your per-instance data structure.

On removal of the instance: ::

  cpuhp_state_remove_instance(Y_hp_online, &d->node);

should be invoked which will invoke the teardown callback on all online
CPUs.
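
The multi-instance calls could be combined roughly as follows. This is only a
sketch: ``struct y_dev``, its ``id`` field and the helpers
``y_register_state()``, ``y_add()`` and ``y_del()`` are hypothetical names, and
the callbacks merely log a message. Note that multi-state callbacks receive the
instance's ``struct hlist_node`` as a second argument: ::

  #include <linux/cpuhotplug.h>
  #include <linux/list.h>
  #include <linux/printk.h>

  /* Hypothetical per-instance data; only the hlist_node is required. */
  struct y_dev {
      struct hlist_node node;
      int id;
  };

  static enum cpuhp_state Y_hp_online;

  static int y_online(unsigned int cpu, struct hlist_node *node)
  {
      /* Recover the enclosing instance from the embedded node. */
      struct y_dev *d = hlist_entry(node, struct y_dev, node);

      pr_info("y%d: CPU %u online\n", d->id, cpu);
      return 0;
  }

  static int y_prepare_down(unsigned int cpu, struct hlist_node *node)
  {
      struct y_dev *d = hlist_entry(node, struct y_dev, node);

      pr_info("y%d: CPU %u going down\n", d->id, cpu);
      return 0;
  }

  /* One time setup, for instance from the driver's init function. */
  static int y_register_state(void)
  {
      int ret;

      ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "x/y:online",
                                    y_online, y_prepare_down);
      if (ret < 0)
          return ret;
      Y_hp_online = ret;
      return 0;
  }

  /* For every new instance: runs y_online() on all online CPUs. */
  static int y_add(struct y_dev *d)
  {
      return cpuhp_state_add_instance(Y_hp_online, &d->node);
  }

  /* When an instance goes away: runs y_prepare_down() on online CPUs. */
  static void y_del(struct y_dev *d)
  {
      cpuhp_state_remove_instance(Y_hp_online, &d->node);
  }

Embedding the node in the instance structure lets each callback find its own
instance without a separate lookup table.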

Manual setup
~~~~~~~~~~~~
Usually it is handy to invoke setup and teardown callbacks on registration or
removal of a state because usually the operation needs to be performed once a
CPU goes online (offline) and during initial setup (shutdown) of the driver.
However each registration and removal function is also available with a
``_nocalls`` suffix which does not invoke the provided callbacks if the
invocation of the callbacks is not desired. During the manual setup (or
teardown) the functions ``get_online_cpus()`` and ``put_online_cpus()`` should
be used to inhibit CPU hotplug operations.


The ordering of the events
--------------------------
The hotplug states are defined in ``include/linux/cpuhotplug.h``:

* The states *CPUHP_OFFLINE* … *CPUHP_AP_OFFLINE* are invoked before the
  CPU is up.
* The states *CPUHP_AP_OFFLINE* … *CPUHP_AP_ONLINE* are invoked
  just after the CPU has been brought up. The interrupts are off and
  the scheduler is not yet active on this CPU. Starting with *CPUHP_AP_OFFLINE*
  the callbacks are invoked on the target CPU.
* The states between *CPUHP_AP_ONLINE_DYN* and *CPUHP_AP_ONLINE_DYN_END* are
  reserved for the dynamic allocation.
* The states are invoked in the reverse order on CPU shutdown starting with
  *CPUHP_ONLINE* and stopping at *CPUHP_OFFLINE*. Here the callbacks are
  invoked on the CPU that will be shut down until *CPUHP_AP_OFFLINE*.

A dynamically allocated state via *CPUHP_AP_ONLINE_DYN* is often enough.
However if an earlier invocation during the bring up or shutdown is required
then an explicit state should be acquired. An explicit state might also be
required if the hotplug event requires specific ordering with respect to
another hotplug event.

Testing of hotplug states
=========================
One way to verify whether a custom state is working as expected or not is to
shut down a CPU and then put it online again. It is also possible to put the
CPU into a certain state (for instance *CPUHP_AP_ONLINE*) and then go back to
*CPUHP_ONLINE*. This would simulate an error one state after *CPUHP_AP_ONLINE*
which would lead to rollback to the online state.

All registered states are enumerated in ``/sys/devices/system/cpu/hotplug/states``: ::

 $ tail /sys/devices/system/cpu/hotplug/states
 138: mm/vmscan:online
 139: mm/vmstat:online
 140: lib/percpu_cnt:online
 141: acpi/cpu-drv:online
 142: base/cacheinfo:online
 143: virtio/net:online
 144: x86/mce:online
 145: printk:online
 168: sched:active
 169: online

To roll back CPU4 to ``lib/percpu_cnt:online`` and back online just issue: ::

  $ cat /sys/devices/system/cpu/cpu4/hotplug/state
  169
  $ echo 140 > /sys/devices/system/cpu/cpu4/hotplug/target
  $ cat /sys/devices/system/cpu/cpu4/hotplug/state
  140

It is important to note that the teardown callbacks of all states above 140
have been invoked. To bring the CPU back online: ::

  $ echo 169 > /sys/devices/system/cpu/cpu4/hotplug/target
  $ cat /sys/devices/system/cpu/cpu4/hotplug/state
  169

With trace events enabled, the individual steps are visible, too: ::

  #  TASK-PID   CPU#    TIMESTAMP  FUNCTION
  #     | |       |        |         |
      bash-394  [001]  22.976: cpuhp_enter: cpu: 0004 target: 140 step: 169 (cpuhp_kick_ap_work)
   cpuhp/4-31   [004]  22.977: cpuhp_enter: cpu: 0004 target: 140 step: 168 (sched_cpu_deactivate)
   cpuhp/4-31   [004]  22.990: cpuhp_exit:  cpu: 0004  state: 168 step: 168 ret: 0
   cpuhp/4-31   [004]  22.991: cpuhp_enter: cpu: 0004 target: 140 step: 144 (mce_cpu_pre_down)
   cpuhp/4-31   [004]  22.992: cpuhp_exit:  cpu: 0004  state: 144 step: 144 ret: 0
   cpuhp/4-31   [004]  22.993: cpuhp_multi_enter: cpu: 0004 target: 140 step: 143 (virtnet_cpu_down_prep)
   cpuhp/4-31   [004]  22.994: cpuhp_exit:  cpu: 0004  state: 143 step: 143 ret: 0
   cpuhp/4-31   [004]  22.995: cpuhp_enter: cpu: 0004 target: 140 step: 142 (cacheinfo_cpu_pre_down)
   cpuhp/4-31   [004]  22.996: cpuhp_exit:  cpu: 0004  state: 142 step: 142 ret: 0
      bash-394  [001]  22.997: cpuhp_exit:  cpu: 0004  state: 140 step: 169 ret: 0
      bash-394  [005]  95.540: cpuhp_enter: cpu: 0004 target: 169 step: 140 (cpuhp_kick_ap_work)
   cpuhp/4-31   [004]  95.541: cpuhp_enter: cpu: 0004 target: 169 step: 141 (acpi_soft_cpu_online)
   cpuhp/4-31   [004]  95.542: cpuhp_exit:  cpu: 0004  state: 141 step: 141 ret: 0
   cpuhp/4-31   [004]  95.543: cpuhp_enter: cpu: 0004 target: 169 step: 142 (cacheinfo_cpu_online)
   cpuhp/4-31   [004]  95.544: cpuhp_exit:  cpu: 0004  state: 142 step: 142 ret: 0
   cpuhp/4-31   [004]  95.545: cpuhp_multi_enter: cpu: 0004 target: 169 step: 143 (virtnet_cpu_online)
   cpuhp/4-31   [004]  95.546: cpuhp_exit:  cpu: 0004  state: 143 step: 143 ret: 0
   cpuhp/4-31   [004]  95.547: cpuhp_enter: cpu: 0004 target: 169 step: 144 (mce_cpu_online)
   cpuhp/4-31   [004]  95.548: cpuhp_exit:  cpu: 0004  state: 144 step: 144 ret: 0
   cpuhp/4-31   [004]  95.549: cpuhp_enter: cpu: 0004 target: 169 step: 145 (console_cpu_notify)
   cpuhp/4-31   [004]  95.550: cpuhp_exit:  cpu: 0004  state: 145 step: 145 ret: 0
   cpuhp/4-31   [004]  95.551: cpuhp_enter: cpu: 0004 target: 169 step: 168 (sched_cpu_activate)
   cpuhp/4-31   [004]  95.552: cpuhp_exit:  cpu: 0004  state: 168 step: 168 ret: 0
      bash-394  [005]  95.553: cpuhp_exit:  cpu: 0004  state: 169 step: 140 ret: 0

As can be seen, CPU4 went down until timestamp 22.996 and then back up until
95.552. All invoked callbacks including their return codes are visible in the
trace.

Architecture's requirements
===========================
The following functions and configurations are required:

``CONFIG_HOTPLUG_CPU``
  This entry needs to be enabled in Kconfig.

``__cpu_up()``
  Arch interface to bring up a CPU.

``__cpu_disable()``
  Arch interface to shut down a CPU. No more interrupts can be handled by the
  kernel after the routine returns. This includes the shutdown of the timer.

``__cpu_die()``
  This is supposed to ensure the death of the CPU. Look at some example code
  in other architectures that implement CPU hotplug. The processor is taken
  down from the ``idle()`` loop for that specific architecture. ``__cpu_die()``
  typically waits for some per_cpu state to be set, to positively ensure that
  the processor dead routine has been called.

User Space Notification
=======================
After a CPU has been successfully onlined or offlined, udev events are sent.
A udev rule like: ::

  SUBSYSTEM=="cpu", DRIVERS=="processor", DEVPATH=="/devices/system/cpu/*", RUN+="the_hotplug_receiver.sh"

will receive all events. A script like: ::

  #!/bin/sh

  if [ "${ACTION}" = "offline" ]
  then
      echo "CPU ${DEVPATH##*/} offline"

  elif [ "${ACTION}" = "online" ]
  then
      echo "CPU ${DEVPATH##*/} online"

  fi

can process the event further.

Kernel Inline Documentations Reference
======================================

.. kernel-doc:: include/linux/cpuhotplug.h