=========================
CPU hotplug in the Kernel
=========================

:Date: December, 2016
:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
          Rusty Russell <rusty@rustcorp.com.au>,
          Srivatsa Vaddagiri <vatsa@in.ibm.com>,
          Ashok Raj <ashok.raj@intel.com>,
          Joel Schopp <jschopp@austin.ibm.com>

Introduction
============

Modern advances in system architectures have introduced advanced error
reporting and correction capabilities in processors. A couple of OEMs also
support hot-pluggable NUMA hardware, where physical node insertion and
removal require support for CPU hotplug.

Such advances require CPUs available to a kernel to be removed, either for
provisioning reasons or for RAS purposes, to keep an offending CPU off the
system execution path. Hence the need for CPU hotplug support in the
Linux kernel.

A more novel use of CPU hotplug support is in suspend/resume support for SMP.
With dual-core and HT support, even laptops run SMP kernels, which previously
did not support these methods.


Command Line Switches
=====================
``maxcpus=n``
  Restrict boot time CPUs to *n*. For example, if you have four CPUs,
  ``maxcpus=2`` will boot only two. You can choose to bring the
  other CPUs online later.

``nr_cpus=n``
  Restrict the total number of CPUs the kernel will support. If the number
  supplied here is lower than the number of physically available CPUs, then
  the excess CPUs cannot be brought online later.

``additional_cpus=n``
  Use this to limit hotpluggable CPUs. This option sets
  ``cpu_possible_mask = cpu_present_mask + additional_cpus``.

  This option is limited to the IA64 architecture.

``possible_cpus=n``
  This option sets ``possible_cpus`` bits in ``cpu_possible_mask``.

  This option is limited to the X86 and S390 architectures.

``cpu0_hotplug``
  Allow CPU0 to be shut down.

  This option is limited to the X86 architecture.

CPU maps
========

``cpu_possible_mask``
  Bitmap of possible CPUs that can ever be available in the
  system. This is used to allocate some boot time memory for per_cpu variables
  that aren't designed to grow/shrink as CPUs are made available or removed.
  Once set during the boot time discovery phase, the map is static, i.e. no
  bits are added or removed at any time. Trimming it accurately for your
  system needs upfront can save some boot time memory.

``cpu_online_mask``
  Bitmap of all CPUs currently online. It is set in ``__cpu_up()``
  after a CPU is available for kernel scheduling and ready to receive
  interrupts from devices. It is cleared when a CPU is brought down using
  ``__cpu_disable()``, before which all OS services including interrupts are
  migrated to another target CPU.

``cpu_present_mask``
  Bitmap of CPUs currently present in the system. Not all
  of them may be online. When physical hotplug is processed by the relevant
  subsystem (e.g. ACPI), the map can change: a bit is either added to or
  removed from the map depending on whether the event is a hot-add or a
  hot-remove. There are currently no locking rules. Typical usage is to init
  topology during boot, at which time hotplug is disabled.

You really don't need to manipulate any of the system CPU maps. They should
be read-only for most uses. When setting up per-cpu resources almost always use
``cpu_possible_mask`` or ``for_each_possible_cpu()`` to iterate. The macro
``for_each_cpu()`` can be used to iterate over a custom CPU mask.

Never use anything other than ``cpumask_t`` to represent a bitmap of CPUs.
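
As an illustration of iterating with ``for_each_possible_cpu()``, the
following minimal sketch sums a per-CPU counter over every possible CPU. The
names *example_count* and *example_count_sum()* are hypothetical and not part
of the kernel: ::

  #include <linux/cpumask.h>
  #include <linux/percpu.h>

  /* Hypothetical per-CPU counter; one slot exists for each possible CPU. */
  static DEFINE_PER_CPU(unsigned long, example_count);

  /* Sum the counter over all possible CPUs, whether online or not. */
  static unsigned long example_count_sum(void)
  {
          unsigned long sum = 0;
          unsigned int cpu;

          for_each_possible_cpu(cpu)
                  sum += per_cpu(example_count, cpu);

          return sum;
  }

Iterating over the possible mask rather than the online mask means that counts
accumulated on a CPU which has since gone offline are still included.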


Using CPU hotplug
=================
The kernel option *CONFIG_HOTPLUG_CPU* needs to be enabled. It is currently
available on multiple architectures including ARM, MIPS, PowerPC and X86. The
configuration is done via the sysfs interface: ::

 $ ls -lh /sys/devices/system/cpu
 total 0
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu0
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu1
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu2
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu3
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu4
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu5
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu6
 drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu7
 drwxr-xr-x  2 root root    0 Dec 21 16:33 hotplug
 -r--r--r--  1 root root 4.0K Dec 21 16:33 offline
 -r--r--r--  1 root root 4.0K Dec 21 16:33 online
 -r--r--r--  1 root root 4.0K Dec 21 16:33 possible
 -r--r--r--  1 root root 4.0K Dec 21 16:33 present

The files *offline*, *online*, *possible*, *present* represent the CPU masks.
Each CPU folder contains an *online* file which controls the logical on (1) and
off (0) state. To logically shut down CPU4: ::

 $ echo 0 > /sys/devices/system/cpu/cpu4/online
 smpboot: CPU 4 is now offline

Once the CPU is shut down, it will be removed from */proc/interrupts* and
*/proc/cpuinfo*, and should no longer be shown by the *top* command. To
bring CPU4 back online: ::

 $ echo 1 > /sys/devices/system/cpu/cpu4/online
 smpboot: Booting Node 0 Processor 4 APIC 0x1

The CPU is usable again. This should work on all CPUs. CPU0 is often special
and excluded from CPU hotplug. On X86 the kernel option
*CONFIG_BOOTPARAM_HOTPLUG_CPU0* has to be enabled in order to be able to
shut down CPU0. Alternatively the kernel command line option *cpu0_hotplug*
can be used. Some known dependencies of CPU0:

* Resume from hibernate/suspend. Hibernate/suspend will fail if CPU0 is offline.
* PIC interrupts. CPU0 can't be removed if a PIC interrupt is detected.

Please let Fenghua Yu <fenghua.yu@intel.com> know if you find any dependencies
on CPU0.

The CPU hotplug coordination
============================

The offline case
----------------
Once a CPU has been logically shut down, the teardown callbacks of registered
hotplug states will be invoked, starting with ``CPUHP_ONLINE`` and terminating
at state ``CPUHP_OFFLINE``. This includes:

* If tasks are frozen due to a suspend operation then *cpuhp_tasks_frozen*
  will be set to true.
* All processes are migrated away from this outgoing CPU to new CPUs.
  The new CPU is chosen from each process' current cpuset, which may be
  a subset of all online CPUs.
* All interrupts targeted to this CPU are migrated to a new CPU.
* Timers are also migrated to a new CPU.
* Once all services are migrated, the kernel calls an arch specific routine
  ``__cpu_disable()`` to perform arch specific cleanup.

Using the hotplug API
---------------------
It is possible to receive notifications when a CPU is taken offline or brought
online. This might be important to certain drivers which need to perform some
kind of setup or cleanup functions based on the number of available CPUs: ::

  #include <linux/cpuhotplug.h>

  ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "X/Y:online",
                          Y_online, Y_prepare_down);

*X* is the subsystem and *Y* the particular driver. The *Y_online* callback
will be invoked during registration on all online CPUs. If an error
occurs during the online callback the *Y_prepare_down* callback will be
invoked on all CPUs on which the online callback was previously invoked.
After registration has completed, the *Y_online* callback will be invoked
once a CPU is brought online and *Y_prepare_down* will be invoked when a
CPU is shut down. All resources which were previously allocated in
*Y_online* should be released in *Y_prepare_down*.
The return value *ret* is negative if an error occurred during the
registration process. Otherwise a positive value is returned which
contains the allocated hotplug state for dynamically allocated states
(*CPUHP_AP_ONLINE_DYN*). It will return zero for predefined states.

The callback can be removed by invoking ``cpuhp_remove_state()``. In case of a
dynamically allocated state (*CPUHP_AP_ONLINE_DYN*) use the returned state.
During the removal of a hotplug state the teardown callback will be invoked.
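
Putting these pieces together, a minimal sketch of a module which registers a
dynamically allocated state on load and removes it again on unload might look
like the following. The *example_* names and the ``"example/demo:online"``
string are placeholders chosen for illustration, and the callbacks only print
a message where a real driver would allocate or release per-CPU resources: ::

  #include <linux/cpuhotplug.h>
  #include <linux/module.h>
  #include <linux/printk.h>

  /* State number returned for CPUHP_AP_ONLINE_DYN, needed for removal. */
  static enum cpuhp_state example_hp_state;

  /* Startup callback: runs for each CPU that is or comes online. */
  static int example_online(unsigned int cpu)
  {
          pr_info("example: CPU %u online\n", cpu);
          return 0;
  }

  /* Teardown callback: runs before a CPU goes down and on state removal. */
  static int example_prepare_down(unsigned int cpu)
  {
          pr_info("example: CPU %u going down\n", cpu);
          return 0;
  }

  static int __init example_init(void)
  {
          int ret;

          ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "example/demo:online",
                                  example_online, example_prepare_down);
          if (ret < 0)
                  return ret;

          /* For a dynamic state the positive return value is the state. */
          example_hp_state = ret;
          return 0;
  }

  static void __exit example_exit(void)
  {
          /* Invokes example_prepare_down() on all online CPUs. */
          cpuhp_remove_state(example_hp_state);
  }

  module_init(example_init);
  module_exit(example_exit);
  MODULE_LICENSE("GPL");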

Multiple instances
~~~~~~~~~~~~~~~~~~
If a driver has multiple instances and each instance needs to perform the
callback independently then it is likely that a *multi-state* should be used.
First a multi-state state needs to be registered: ::

  ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "X/Y:online",
                                Y_online, Y_prepare_down);
  Y_hp_online = ret;

The ``cpuhp_setup_state_multi()`` function behaves similarly to
``cpuhp_setup_state()`` except that it prepares the callbacks for a multi state
and does not invoke the callbacks. This is a one time setup.
Once a new instance is allocated, you need to register this new instance: ::

  ret = cpuhp_state_add_instance(Y_hp_online, &d->node);

This function will add this instance to your previously allocated
*Y_hp_online* state and invoke the previously registered callback
(*Y_online*) on all online CPUs. The *node* element is a
``struct hlist_node`` member of your per-instance data structure.

On removal of the instance: ::

  cpuhp_state_remove_instance(Y_hp_online, &d->node);

should be invoked which will invoke the teardown callback on all online
CPUs.
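
The multi-instance flow can be sketched as the following driver fragment. It
is an illustration under assumptions, not code from an existing driver:
*struct example_dev* and all *example_* functions are hypothetical. Note that
the callbacks of a multi-state take the instance's ``struct hlist_node`` as a
second argument, from which the per-instance structure is recovered: ::

  #include <linux/cpuhotplug.h>
  #include <linux/list.h>
  #include <linux/printk.h>

  /* Hypothetical per-instance data with the embedded hotplug node. */
  struct example_dev {
          struct hlist_node node;
          const char *name;
  };

  /* Multi-state number, shared by all instances. */
  static enum cpuhp_state example_hp_online;

  static int example_online(unsigned int cpu, struct hlist_node *node)
  {
          struct example_dev *d = hlist_entry(node, struct example_dev, node);

          pr_info("%s: CPU %u online\n", d->name, cpu);
          return 0;
  }

  static int example_prepare_down(unsigned int cpu, struct hlist_node *node)
  {
          struct example_dev *d = hlist_entry(node, struct example_dev, node);

          pr_info("%s: CPU %u going down\n", d->name, cpu);
          return 0;
  }

  /* One time setup, for example from the module init function. */
  static int example_register_state(void)
  {
          int ret;

          ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
                                        "example/demo:online",
                                        example_online, example_prepare_down);
          if (ret < 0)
                  return ret;

          example_hp_online = ret;
          return 0;
  }

  /* Called once per instance: runs example_online() on all online CPUs. */
  static int example_dev_add(struct example_dev *d)
  {
          return cpuhp_state_add_instance(example_hp_online, &d->node);
  }

  /* Called once per instance: runs example_prepare_down() on online CPUs. */
  static void example_dev_del(struct example_dev *d)
  {
          cpuhp_state_remove_instance(example_hp_online, &d->node);
  }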

Manual setup
~~~~~~~~~~~~
Usually it is handy to invoke setup and teardown callbacks on registration or
removal of a state because usually the operation needs to be performed once a
CPU goes online (offline) and during initial setup (shutdown) of the driver.
However each registration and removal function is also available with a
``_nocalls`` suffix which does not invoke the provided callbacks if the
invocation of the callbacks is not desired. During the manual setup (or
teardown) the functions ``get_online_cpus()`` and ``put_online_cpus()`` should
be used to inhibit CPU hotplug operations.


The ordering of the events
--------------------------
The hotplug states are defined in ``include/linux/cpuhotplug.h``:

* The states *CPUHP_OFFLINE* … *CPUHP_AP_OFFLINE* are invoked before the
  CPU is up.
* The states *CPUHP_AP_OFFLINE* … *CPUHP_AP_ONLINE* are invoked
  just after the CPU has been brought up. The interrupts are off and
  the scheduler is not yet active on this CPU. Starting with *CPUHP_AP_OFFLINE*
  the callbacks are invoked on the target CPU.
* The states between *CPUHP_AP_ONLINE_DYN* and *CPUHP_AP_ONLINE_DYN_END* are
  reserved for the dynamic allocation.
* The states are invoked in the reverse order on CPU shutdown starting with
  *CPUHP_ONLINE* and stopping at *CPUHP_OFFLINE*. Here the callbacks are
  invoked on the CPU that will be shut down until *CPUHP_AP_OFFLINE*.

A dynamically allocated state via *CPUHP_AP_ONLINE_DYN* is often enough.
However if an earlier invocation during the bring up or shutdown is required
then an explicit state should be acquired. An explicit state might also be
required if the hotplug event requires specific ordering with respect to
another hotplug event.

Testing of hotplug states
=========================
One way to verify whether a custom state is working as expected or not is to
shut down a CPU and then put it online again. It is also possible to put the
CPU into a certain state (for instance *CPUHP_AP_ONLINE*) and then go back to
*CPUHP_ONLINE*. This would simulate an error one state after *CPUHP_AP_ONLINE*
which would lead to rollback to the online state.

All registered states are enumerated in ``/sys/devices/system/cpu/hotplug/states``: ::

 $ tail /sys/devices/system/cpu/hotplug/states
 138: mm/vmscan:online
 139: mm/vmstat:online
 140: lib/percpu_cnt:online
 141: acpi/cpu-drv:online
 142: base/cacheinfo:online
 143: virtio/net:online
 144: x86/mce:online
 145: printk:online
 168: sched:active
 169: online

To roll back CPU4 to ``lib/percpu_cnt:online`` and back online just issue: ::

  $ cat /sys/devices/system/cpu/cpu4/hotplug/state
  169
  $ echo 140 > /sys/devices/system/cpu/cpu4/hotplug/target
  $ cat /sys/devices/system/cpu/cpu4/hotplug/state
  140

It is important to note that the teardown callback of state 140 has been
invoked. And now get back online: ::

  $ echo 169 > /sys/devices/system/cpu/cpu4/hotplug/target
  $ cat /sys/devices/system/cpu/cpu4/hotplug/state
  169

With trace events enabled, the individual steps are visible, too: ::

  #  TASK-PID   CPU#    TIMESTAMP  FUNCTION
  #     | |       |        |         |
      bash-394  [001]  22.976: cpuhp_enter: cpu: 0004 target: 140 step: 169 (cpuhp_kick_ap_work)
   cpuhp/4-31   [004]  22.977: cpuhp_enter: cpu: 0004 target: 140 step: 168 (sched_cpu_deactivate)
   cpuhp/4-31   [004]  22.990: cpuhp_exit:  cpu: 0004  state: 168 step: 168 ret: 0
   cpuhp/4-31   [004]  22.991: cpuhp_enter: cpu: 0004 target: 140 step: 144 (mce_cpu_pre_down)
   cpuhp/4-31   [004]  22.992: cpuhp_exit:  cpu: 0004  state: 144 step: 144 ret: 0
   cpuhp/4-31   [004]  22.993: cpuhp_multi_enter: cpu: 0004 target: 140 step: 143 (virtnet_cpu_down_prep)
   cpuhp/4-31   [004]  22.994: cpuhp_exit:  cpu: 0004  state: 143 step: 143 ret: 0
   cpuhp/4-31   [004]  22.995: cpuhp_enter: cpu: 0004 target: 140 step: 142 (cacheinfo_cpu_pre_down)
   cpuhp/4-31   [004]  22.996: cpuhp_exit:  cpu: 0004  state: 142 step: 142 ret: 0
      bash-394  [001]  22.997: cpuhp_exit:  cpu: 0004  state: 140 step: 169 ret: 0
      bash-394  [005]  95.540: cpuhp_enter: cpu: 0004 target: 169 step: 140 (cpuhp_kick_ap_work)
   cpuhp/4-31   [004]  95.541: cpuhp_enter: cpu: 0004 target: 169 step: 141 (acpi_soft_cpu_online)
   cpuhp/4-31   [004]  95.542: cpuhp_exit:  cpu: 0004  state: 141 step: 141 ret: 0
   cpuhp/4-31   [004]  95.543: cpuhp_enter: cpu: 0004 target: 169 step: 142 (cacheinfo_cpu_online)
   cpuhp/4-31   [004]  95.544: cpuhp_exit:  cpu: 0004  state: 142 step: 142 ret: 0
   cpuhp/4-31   [004]  95.545: cpuhp_multi_enter: cpu: 0004 target: 169 step: 143 (virtnet_cpu_online)
   cpuhp/4-31   [004]  95.546: cpuhp_exit:  cpu: 0004  state: 143 step: 143 ret: 0
   cpuhp/4-31   [004]  95.547: cpuhp_enter: cpu: 0004 target: 169 step: 144 (mce_cpu_online)
   cpuhp/4-31   [004]  95.548: cpuhp_exit:  cpu: 0004  state: 144 step: 144 ret: 0
   cpuhp/4-31   [004]  95.549: cpuhp_enter: cpu: 0004 target: 169 step: 145 (console_cpu_notify)
   cpuhp/4-31   [004]  95.550: cpuhp_exit:  cpu: 0004  state: 145 step: 145 ret: 0
   cpuhp/4-31   [004]  95.551: cpuhp_enter: cpu: 0004 target: 169 step: 168 (sched_cpu_activate)
   cpuhp/4-31   [004]  95.552: cpuhp_exit:  cpu: 0004  state: 168 step: 168 ret: 0
      bash-394  [005]  95.553: cpuhp_exit:  cpu: 0004  state: 169 step: 140 ret: 0

As can be seen, CPU4 went down until timestamp 22.996 and then back up until
95.552. All invoked callbacks including their return codes are visible in the
trace.

Architecture's requirements
===========================
The following functions and configurations are required:

``CONFIG_HOTPLUG_CPU``
  This entry needs to be enabled in Kconfig.

``__cpu_up()``
  Arch interface to bring up a CPU.

``__cpu_disable()``
  Arch interface to shut down a CPU. No more interrupts can be handled by the
  kernel after the routine returns. This includes the shutdown of the timer.

``__cpu_die()``
  This is supposed to ensure the death of the CPU. Look at some example code
  in other architectures that implement CPU hotplug. The processor is taken
  down from the ``idle()`` loop for that specific architecture. ``__cpu_die()``
  typically waits for some per_cpu state to be set, to make sure that the
  processor's dead routine has been called.

User Space Notification
=======================
After a CPU has been successfully brought online or taken offline, udev events
are sent. A udev rule like: ::

  SUBSYSTEM=="cpu", DRIVERS=="processor", DEVPATH=="/devices/system/cpu/*", RUN+="the_hotplug_receiver.sh"

will receive all events. A script like: ::

  #!/bin/sh

  if [ "${ACTION}" = "offline" ]
  then
      echo "CPU ${DEVPATH##*/} offline"

  elif [ "${ACTION}" = "online" ]
  then
      echo "CPU ${DEVPATH##*/} online"

  fi

can process the event further.

Kernel Inline Documentations Reference
======================================

.. kernel-doc:: include/linux/cpuhotplug.h
