=========================
CPU hotplug in the Kernel
=========================

:Date: December, 2016
:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
          Rusty Russell <rusty@rustcorp.com.au>,
          Srivatsa Vaddagiri <vatsa@in.ibm.com>,
          Ashok Raj <ashok.raj@intel.com>,
          Joel Schopp <jschopp@austin.ibm.com>

Introduction
============

Modern advances in system architectures have introduced advanced error
reporting and correction capabilities in processors. A couple of OEMs also
support hot-pluggable NUMA hardware, where physical node insertion and
removal require support for CPU hotplug.

Such advances require CPUs available to a kernel to be removed either for
provisioning reasons, or for RAS purposes to keep an offending CPU off
system execution path. Hence the need for CPU hotplug support in the
Linux kernel.

A more novel use of CPU-hotplug support is its use today in suspend resume
support for SMP. Dual-core and HT support makes even a laptop run SMP kernels
which didn't support these methods.


Command Line Switches
=====================
``maxcpus=n``
  Restrict boot time CPUs to *n*. For example, if you have four CPUs, using
  ``maxcpus=2`` will only boot two. You can choose to bring the
  other CPUs online later.

``nr_cpus=n``
  Restrict the total number of CPUs the kernel will support. If the number
  supplied here is lower than the number of physically available CPUs, then
  those CPUs cannot be brought online later.

``additional_cpus=n``
  Use this to limit hotpluggable CPUs. This option sets
  ``cpu_possible_mask = cpu_present_mask + additional_cpus``

  This option is limited to the IA64 architecture.

``possible_cpus=n``
  This option sets ``possible_cpus`` bits in ``cpu_possible_mask``.

  This option is limited to the X86 and S390 architecture.

``cede_offline={"off","on"}``
  Use this option to disable/enable putting offlined processors to an extended
  ``H_CEDE`` state on supported pseries platforms. If nothing is specified,
  ``cede_offline`` is set to "on".

  This option is limited to the PowerPC architecture.

``cpu0_hotplug``
  Allow CPU0 to be shut down.

  This option is limited to the X86 architecture.

CPU maps
========

``cpu_possible_mask``
  Bitmap of possible CPUs that can ever be available in the
  system. This is used to allocate some boot time memory for per_cpu variables
  that aren't designed to grow/shrink as CPUs are made available or removed.
  Once set during boot time discovery phase, the map is static, i.e. no bits
  are added or removed at any time. Trimming it accurately for your system
  needs upfront can save some boot time memory.

``cpu_online_mask``
  Bitmap of all CPUs currently online. It is set in ``__cpu_up()``
  after a CPU is available for kernel scheduling and ready to receive
  interrupts from devices. It is cleared when a CPU is brought down using
  ``__cpu_disable()``, before which all OS services including interrupts are
  migrated to another target CPU.

``cpu_present_mask``
  Bitmap of CPUs currently present in the system. Not all
  of them may be online. When physical hotplug is processed by the relevant
  subsystem (e.g. ACPI), the map can change: a bit is either added to or
  removed from the map depending on whether the event is a hot-add or a
  hot-remove. There are currently no locking rules. Typical usage is to init
  topology during boot, at which time hotplug is disabled.

You really don't need to manipulate any of the system CPU maps. They should
be read-only for most use. When setting up per-cpu resources almost always use
``cpu_possible_mask`` or ``for_each_possible_cpu()`` to iterate. The macro
``for_each_cpu()`` can be used to iterate over a custom CPU mask.
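
From user space the same maps can be inspected through sysfs, where the kernel
prints them as CPU lists such as ``0-3,5``. The following is a minimal
user-space sketch, not kernel code, that turns such a list into a bitmap. The
helper name ``parse_cpulist()`` and the limit of 64 CPUs are assumptions made
for this example only: ::

  #include <stdint.h>
  #include <stdlib.h>

  /*
   * Parse a CPU list such as "0-3,5" into a 64-bit bitmap.
   * Returns 0 on success, -1 on malformed input or CPU numbers >= 64.
   * Hypothetical helper for illustration; in the kernel use cpumask_t.
   */
  int parse_cpulist(const char *s, uint64_t *mask)
  {
      *mask = 0;
      while (*s && *s != '\n') {
          char *end;
          long lo, hi;

          lo = hi = strtol(s, &end, 10);
          if (end == s)
              return -1;          /* no digits where a CPU was expected */
          if (*end == '-') {      /* a range "lo-hi" */
              s = end + 1;
              hi = strtol(s, &end, 10);
              if (end == s)
                  return -1;
          }
          if (lo < 0 || hi < lo || hi >= 64)
              return -1;
          for (long cpu = lo; cpu <= hi; cpu++)
              *mask |= UINT64_C(1) << cpu;
          if (*end == ',')
              end++;              /* skip the separator */
          else if (*end && *end != '\n')
              return -1;          /* unexpected character */
          s = end;
      }
      return 0;
  }

With this sketch the list ``0-3,5`` yields the bitmap ``0x2f``, and an empty
list yields an empty bitmap.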

Never use anything other than ``cpumask_t`` to represent a bitmap of CPUs.


Using CPU hotplug
=================
The kernel option *CONFIG_HOTPLUG_CPU* needs to be enabled. It is currently
available on multiple architectures including ARM, MIPS, PowerPC and X86. The
configuration is done via the sysfs interface: ::

  $ ls -lh /sys/devices/system/cpu
  total 0
  drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu0
  drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu1
  drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu2
  drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu3
  drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu4
  drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu5
  drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu6
  drwxr-xr-x  9 root root    0 Dec 21 16:33 cpu7
  drwxr-xr-x  2 root root    0 Dec 21 16:33 hotplug
  -r--r--r--  1 root root 4.0K Dec 21 16:33 offline
  -r--r--r--  1 root root 4.0K Dec 21 16:33 online
  -r--r--r--  1 root root 4.0K Dec 21 16:33 possible
  -r--r--r--  1 root root 4.0K Dec 21 16:33 present

The files *offline*, *online*, *possible*, *present* represent the CPU masks.
Each CPU folder contains an *online* file which controls the logical on (1) and
off (0) state. To logically shut down CPU4: ::

  $ echo 0 > /sys/devices/system/cpu/cpu4/online
   smpboot: CPU 4 is now offline

Once the CPU is shut down, it will be removed from */proc/interrupts*,
*/proc/cpuinfo* and should also no longer be shown by the *top* command. To
bring CPU4 back online: ::

  $ echo 1 > /sys/devices/system/cpu/cpu4/online
  smpboot: Booting Node 0 Processor 4 APIC 0x1

The CPU is usable again. This should work on all CPUs. CPU0 is often special
and excluded from CPU hotplug. On X86 the kernel option
*CONFIG_BOOTPARAM_HOTPLUG_CPU0* has to be enabled in order to be able to
shut down CPU0. Alternatively the kernel command option *cpu0_hotplug* can be
used.
Some known dependencies of CPU0:

* Resume from hibernate/suspend. Hibernate/suspend will fail if CPU0 is offline.
* PIC interrupts. CPU0 can't be removed if a PIC interrupt is detected.

Please let Fenghua Yu <fenghua.yu@intel.com> know if you find any dependencies
on CPU0.

The CPU hotplug coordination
============================

The offline case
----------------
Once a CPU has been logically shut down, the teardown callbacks of registered
hotplug states will be invoked, starting with ``CPUHP_ONLINE`` and terminating
at state ``CPUHP_OFFLINE``. This includes:

* If tasks are frozen due to a suspend operation then *cpuhp_tasks_frozen*
  will be set to true.
* All processes are migrated away from this outgoing CPU to new CPUs.
  The new CPU is chosen from each process' current cpuset, which may be
  a subset of all online CPUs.
* All interrupts targeted to this CPU are migrated to a new CPU.
* Timers are also migrated to a new CPU.
* Once all services are migrated, the kernel calls an arch specific routine
  ``__cpu_disable()`` to perform arch specific cleanup.

Using the hotplug API
---------------------
It is possible to receive notifications once a CPU is offlined or onlined. This
might be important to certain drivers which need to perform some kind of setup
or clean up functions based on the number of available CPUs: ::

  #include <linux/cpuhotplug.h>

  ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "X/Y:online",
                          Y_online, Y_prepare_down);

*X* is the subsystem and *Y* the particular driver. The *Y_online* callback
will be invoked during registration on all online CPUs. If an error
occurs during the online callback the *Y_prepare_down* callback will be
invoked on all CPUs on which the online callback was previously invoked.
After registration has completed, the *Y_online* callback will be invoked
once a CPU is brought online and *Y_prepare_down* will be invoked when a
CPU is shut down. All resources which were previously allocated in
*Y_online* should be released in *Y_prepare_down*.
The return value *ret* is negative if an error occurred during the
registration process. Otherwise a positive value is returned which
contains the allocated hotplug state for dynamically allocated states
(*CPUHP_AP_ONLINE_DYN*). It will return zero for predefined states.

The callback can be removed by invoking ``cpuhp_remove_state()``. In case of a
dynamically allocated state (*CPUHP_AP_ONLINE_DYN*) use the returned state.
During the removal of a hotplug state the teardown callback will be invoked.

Multiple instances
~~~~~~~~~~~~~~~~~~
If a driver has multiple instances and each instance needs to perform the
callback independently then it is likely that a "multi-state" should be used.
First a multi-state state needs to be registered: ::

  ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "X/Y:online",
                                Y_online, Y_prepare_down);
  Y_hp_online = ret;

The ``cpuhp_setup_state_multi()`` behaves similarly to ``cpuhp_setup_state()``
except it prepares the callbacks for a multi state and does not invoke
the callbacks. This is a one time setup.
Once a new instance is allocated, you need to register this new instance: ::

  ret = cpuhp_state_add_instance(Y_hp_online, &d->node);

This function will add this instance to your previously allocated
*Y_hp_online* state and invoke the previously registered callback
(*Y_online*) on all online CPUs. The *node* element is a ``struct
hlist_node`` member of your per-instance data structure.

On removal of the instance: ::

  cpuhp_state_remove_instance(Y_hp_online, &d->node)

should be invoked which will invoke the teardown callback on all online
CPUs.

Manual setup
~~~~~~~~~~~~
Usually it is handy to invoke setup and teardown callbacks on registration or
removal of a state because usually the operation needs to be performed once a
CPU goes online (offline) and during initial setup (shutdown) of the driver.
However each registration and removal function is also available with a
``_nocalls`` suffix which does not invoke the provided callbacks if the
invocation of the callbacks is not desired. During the manual setup (or
teardown) the functions ``get_online_cpus()`` and ``put_online_cpus()`` should
be used to inhibit CPU hotplug operations.


The ordering of the events
--------------------------
The hotplug states are defined in ``include/linux/cpuhotplug.h``:

* The states *CPUHP_OFFLINE* … *CPUHP_AP_OFFLINE* are invoked before the
  CPU is up.
* The states *CPUHP_AP_OFFLINE* … *CPUHP_AP_ONLINE* are invoked
  just after the CPU has been brought up. The interrupts are off and
  the scheduler is not yet active on this CPU. Starting with *CPUHP_AP_OFFLINE*
  the callbacks are invoked on the target CPU.
* The states between *CPUHP_AP_ONLINE_DYN* and *CPUHP_AP_ONLINE_DYN_END* are
  reserved for the dynamic allocation.
* The states are invoked in the reverse order on CPU shutdown starting with
  *CPUHP_ONLINE* and stopping at *CPUHP_OFFLINE*. Here the callbacks are
  invoked on the CPU that will be shut down until *CPUHP_AP_OFFLINE*.

A dynamically allocated state via *CPUHP_AP_ONLINE_DYN* is often enough.
However if an earlier invocation during the bring up or shutdown is required
then an explicit state should be acquired. An explicit state might also be
required if the hotplug event requires specific ordering with respect to
another hotplug event.
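
The ordering rules above can be modelled in a few lines of user-space C. This
is only a sketch under stated assumptions, not the kernel implementation:
``struct step``, ``bring_up()``, ``take_down()`` and the text log are
hypothetical names introduced for the example. Startup steps run in ascending
state order, teardown steps run in descending order, and a failing startup
step rolls back the states completed so far: ::

  #include <stdio.h>
  #include <string.h>

  /* One hotplug state of the model; 'fail' makes its startup step fail. */
  struct step {
      const char *name;
      int fail;
  };

  /* Append "<sign><name> " to the log, e.g. "+sched " or "-sched ". */
  static void log_step(char *log, size_t len, char sign, const char *name)
  {
      size_t used = strlen(log);

      if (used < len)
          snprintf(log + used, len - used, "%c%s ", sign, name);
  }

  /*
   * Tear down from state index 'from' to 'to' in descending order.
   * State index k means that steps[0..k-1] are currently set up.
   * Returns the state index reached.
   */
  size_t take_down(const struct step *steps, size_t from, size_t to,
                   char *log, size_t len)
  {
      while (from > to) {
          from--;
          log_step(log, len, '-', steps[from].name);
      }
      return from;
  }

  /*
   * Bring up from state index 'from' to 'to' in ascending order. If a
   * startup step fails, roll back to 'from'. Returns the state reached.
   */
  size_t bring_up(const struct step *steps, size_t from, size_t to,
                  char *log, size_t len)
  {
      for (size_t i = from; i < to; i++) {
          if (steps[i].fail)
              return take_down(steps, i, from, log, len);
          log_step(log, len, '+', steps[i].name);
      }
      return to;
  }

With three states ``a``, ``b`` and ``c``, bringing the model up from index 0
to 3 logs ``+a +b +c``, and taking it down from index 3 to 1 logs ``-c -b``.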

Testing of hotplug states
=========================
One way to verify whether a custom state is working as expected or not is to
shut down a CPU and then put it online again. It is also possible to put the
CPU into a certain state (for instance *CPUHP_AP_ONLINE*) and then go back to
*CPUHP_ONLINE*. This would simulate an error one state after *CPUHP_AP_ONLINE*
which would lead to rollback to the online state.

All registered states are enumerated in ``/sys/devices/system/cpu/hotplug/states``: ::

  $ tail /sys/devices/system/cpu/hotplug/states
  138: mm/vmscan:online
  139: mm/vmstat:online
  140: lib/percpu_cnt:online
  141: acpi/cpu-drv:online
  142: base/cacheinfo:online
  143: virtio/net:online
  144: x86/mce:online
  145: printk:online
  168: sched:active
  169: online

To rollback CPU4 to ``lib/percpu_cnt:online`` and back online just issue: ::

  $ cat /sys/devices/system/cpu/cpu4/hotplug/state
  169
  $ echo 140 > /sys/devices/system/cpu/cpu4/hotplug/target
  $ cat /sys/devices/system/cpu/cpu4/hotplug/state
  140

It is important to note that the teardown callback of state 140 has been
invoked.
And now get back online: ::

  $ echo 169 > /sys/devices/system/cpu/cpu4/hotplug/target
  $ cat /sys/devices/system/cpu/cpu4/hotplug/state
  169

With trace events enabled, the individual steps are visible, too: ::

  #  TASK-PID   CPU#    TIMESTAMP  FUNCTION
  #     | |       |        |         |
      bash-394  [001]  22.976: cpuhp_enter: cpu: 0004 target: 140 step: 169 (cpuhp_kick_ap_work)
   cpuhp/4-31   [004]  22.977: cpuhp_enter: cpu: 0004 target: 140 step: 168 (sched_cpu_deactivate)
   cpuhp/4-31   [004]  22.990: cpuhp_exit:  cpu: 0004  state: 168 step: 168 ret: 0
   cpuhp/4-31   [004]  22.991: cpuhp_enter: cpu: 0004 target: 140 step: 144 (mce_cpu_pre_down)
   cpuhp/4-31   [004]  22.992: cpuhp_exit:  cpu: 0004  state: 144 step: 144 ret: 0
   cpuhp/4-31   [004]  22.993: cpuhp_multi_enter: cpu: 0004 target: 140 step: 143 (virtnet_cpu_down_prep)
   cpuhp/4-31   [004]  22.994: cpuhp_exit:  cpu: 0004  state: 143 step: 143 ret: 0
   cpuhp/4-31   [004]  22.995: cpuhp_enter: cpu: 0004 target: 140 step: 142 (cacheinfo_cpu_pre_down)
   cpuhp/4-31   [004]  22.996: cpuhp_exit:  cpu: 0004  state: 142 step: 142 ret: 0
      bash-394  [001]  22.997: cpuhp_exit:  cpu: 0004  state: 140 step: 169 ret: 0
      bash-394  [005]  95.540: cpuhp_enter: cpu: 0004 target: 169 step: 140 (cpuhp_kick_ap_work)
   cpuhp/4-31   [004]  95.541: cpuhp_enter: cpu: 0004 target: 169 step: 141 (acpi_soft_cpu_online)
   cpuhp/4-31   [004]  95.542: cpuhp_exit:  cpu: 0004  state: 141 step: 141 ret: 0
   cpuhp/4-31   [004]  95.543: cpuhp_enter: cpu: 0004 target: 169 step: 142 (cacheinfo_cpu_online)
   cpuhp/4-31   [004]  95.544: cpuhp_exit:  cpu: 0004  state: 142 step: 142 ret: 0
   cpuhp/4-31   [004]  95.545: cpuhp_multi_enter: cpu: 0004 target: 169 step: 143 (virtnet_cpu_online)
   cpuhp/4-31   [004]  95.546: cpuhp_exit:  cpu: 0004  state: 143 step: 143 ret: 0
   cpuhp/4-31   [004]  95.547: cpuhp_enter: cpu: 0004 target: 169 step: 144 (mce_cpu_online)
   cpuhp/4-31   [004]  95.548: cpuhp_exit:  cpu: 0004  state: 144 step: 144 ret: 0
   cpuhp/4-31   [004]  95.549: cpuhp_enter: cpu: 0004 target: 169 step: 145 (console_cpu_notify)
   cpuhp/4-31   [004]  95.550: cpuhp_exit:  cpu: 0004  state: 145 step: 145 ret: 0
   cpuhp/4-31   [004]  95.551: cpuhp_enter: cpu: 0004 target: 169 step: 168 (sched_cpu_activate)
   cpuhp/4-31   [004]  95.552: cpuhp_exit:  cpu: 0004  state: 168 step: 168 ret: 0
      bash-394  [005]  95.553: cpuhp_exit:  cpu: 0004  state: 169 step: 140 ret: 0

As can be seen, CPU4 went down until timestamp 22.996 and then back up until
95.552. All invoked callbacks including their return codes are visible in the
trace.

Architecture's requirements
===========================
The following functions and configurations are required:

``CONFIG_HOTPLUG_CPU``
  This entry needs to be enabled in Kconfig

``__cpu_up()``
  Arch interface to bring up a CPU

``__cpu_disable()``
  Arch interface to shut down a CPU. No more interrupts can be handled by the
  kernel after the routine returns. This includes the shutdown of the timer.

``__cpu_die()``
  This is supposed to ensure the death of the CPU. Look at some example code
  in other architectures that implement CPU hotplug. The processor is taken
  down from the ``idle()`` loop for that specific architecture. ``__cpu_die()``
  typically waits for some per_cpu state to be set, to positively confirm
  that the processor's dead routine has been called.

User Space Notification
=======================
After a CPU has been successfully onlined or offlined, udev events are sent.
A udev rule like: ::

  SUBSYSTEM=="cpu", DRIVERS=="processor", DEVPATH=="/devices/system/cpu/*", RUN+="the_hotplug_receiver.sh"

will receive all events. A script like: ::

  #!/bin/sh

  if [ "${ACTION}" = "offline" ]
  then
      echo "CPU ${DEVPATH##*/} offline"

  elif [ "${ACTION}" = "online" ]
  then
      echo "CPU ${DEVPATH##*/} online"

  fi

can process the event further.

Kernel Inline Documentations Reference
======================================

.. kernel-doc:: include/linux/cpuhotplug.h