====================================================================
Interaction of Suspend code (S3) with the CPU hotplug infrastructure
====================================================================

(C) 2011 - 2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>


I. Differences between CPU hotplug and Suspend-to-RAM
======================================================

How does the regular CPU hotplug code differ from how the Suspend-to-RAM
infrastructure uses it internally? And where do they share common code?

Well, a picture is worth a thousand words... So ASCII art follows :-)

[This depicts the current design in the kernel, and focusses only on the
interactions involving the freezer and CPU hotplug and also tries to explain
the locking involved. It outlines the notifications involved as well.
But please note that here, only the call paths are illustrated, with the aim
of describing where they take different paths and where they share code.
What happens when regular CPU hotplug and Suspend-to-RAM race with each other
is not depicted here.]

On a high level, the suspend-resume cycle goes like this::

  |Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw |
  |tasks |    |     cpus      |    |          |    |     cpus     |    |tasks|


More details follow::

                                Suspend call path
                                -----------------

                                 Write 'mem' to
                                /sys/power/state
                                   sysfs file
                                        |
                                        v
                      Acquire system_transition_mutex lock
                                        |
                                        v
                             Send PM_SUSPEND_PREPARE
                                  notifications
                                        |
                                        v
                                  Freeze tasks
                                        |
                                        |
                                        v
                             disable_nonboot_cpus()
                                   /* start */
                                        |
                                        v
                           Acquire cpu_add_remove_lock
                                        |
                                        v
                             Iterate over CURRENTLY
                                   online CPUs
                                        |
                                        |
                                        |                ----------
                                        v                          | L
             ======>               _cpu_down()                     |
            |              [This takes cpuhotplug.lock             |
  Common    |               before taking down the CPU             |
   code     |               and releases it when done]             | O
            |            While it is at it, notifications          |
            |            are sent when notable events occur,       |
             ======>     by running all registered callbacks.      |
                                        |                          | O
                                        |                          |
                                        |                          |
                                        v                          |
                            Note down these cpus in                | P
                                frozen_cpus mask         ----------
                                        |
                                        v
                           Disable regular cpu hotplug
                        by increasing cpu_hotplug_disabled
                                        |
                                        v
                           Release cpu_add_remove_lock
                                        |
                                        v
                       /* disable_nonboot_cpus() complete */
                                        |
                                        v
                                   Do suspend



Resume follows the same pattern in reverse, with the counterparts being (in
the order of execution during resume):

* enable_nonboot_cpus() which involves::

   |  Acquire cpu_add_remove_lock
   |  Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug
   |  Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop]
   |  Release cpu_add_remove_lock
   v

* Thaw tasks
* Send PM_POST_SUSPEND notifications
* Release system_transition_mutex lock


Note that the system_transition_mutex lock is acquired at the very beginning,
when we are just starting out to suspend, and is released only after the
entire cycle is complete (i.e., suspend + resume).

::



                          Regular CPU hotplug call path
                          -----------------------------

                                Write 0 (or 1) to
                       /sys/devices/system/cpu/cpu*/online
                                   sysfs file
                                        |
                                        |
                                        v
                                   cpu_down()
                                        |
                                        v
                           Acquire cpu_add_remove_lock
                                        |
                                        v
                           If cpu_hotplug_disabled > 0
                                return gracefully
                                        |
                                        |
                                        v
             ======>               _cpu_down()
            |              [This takes cpuhotplug.lock
  Common    |               before taking down the CPU
   code     |               and releases it when done]
            |            While it is at it, notifications
            |            are sent when notable events occur,
             ======>     by running all registered callbacks.
                                        |
                                        |
                                        v
                           Release cpu_add_remove_lock
                                [That's it, for
                              regular CPU hotplug]



So, as can be seen from the two diagrams (the parts marked as "Common code"),
regular CPU hotplug and the suspend code path converge at the _cpu_down() and
_cpu_up() functions. They differ in the arguments passed to these functions:
during regular CPU hotplug, 0 is passed for the 'tasks_frozen' argument.
During suspend, however, the tasks are already frozen by the time the
non-boot CPUs are offlined or onlined, so the _cpu_*() functions are called
with the 'tasks_frozen' argument set to 1.
[See below for some known issues regarding this.]
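
The two call paths above can be modelled in a few lines of userspace C. This
is only a sketch for illustration, not kernel code. The sim_* names are
hypothetical stand-ins for the functions in the diagrams
(sim_common_cpu_down() and sim_common_cpu_up() play the roles of _cpu_down()
and _cpu_up()), CPU 0 is assumed to be the boot CPU, a pthread mutex stands
in for cpu_add_remove_lock, and "return gracefully" is assumed here to mean
returning -EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

#define NR_CPUS 4                     /* assumption: small fixed CPU count */

static pthread_mutex_t cpu_add_remove_lock = PTHREAD_MUTEX_INITIALIZER;
static int cpu_hotplug_disabled;      /* > 0: regular hotplug is refused */
static bool cpu_online[NR_CPUS] = { true, true, true, true };
static bool frozen_cpus[NR_CPUS];     /* CPUs taken down by the suspend path */
static int last_tasks_frozen = -1;    /* argument the common code saw last */

/* Common code: both paths end up here, differing only in tasks_frozen. */
static int sim_common_cpu_down(int cpu, int tasks_frozen)
{
	last_tasks_frozen = tasks_frozen;
	cpu_online[cpu] = false;
	return 0;
}

static int sim_common_cpu_up(int cpu, int tasks_frozen)
{
	last_tasks_frozen = tasks_frozen;
	cpu_online[cpu] = true;
	return 0;
}

/* Regular hotplug: always passes tasks_frozen = 0. */
static int sim_cpu_down(int cpu)
{
	int err;

	assert(cpu >= 0 && cpu < NR_CPUS);
	pthread_mutex_lock(&cpu_add_remove_lock);
	err = cpu_hotplug_disabled ? -EBUSY : sim_common_cpu_down(cpu, 0);
	pthread_mutex_unlock(&cpu_add_remove_lock);
	return err;
}

static int sim_cpu_up(int cpu)
{
	int err;

	assert(cpu >= 0 && cpu < NR_CPUS);
	pthread_mutex_lock(&cpu_add_remove_lock);
	err = cpu_hotplug_disabled ? -EBUSY : sim_common_cpu_up(cpu, 0);
	pthread_mutex_unlock(&cpu_add_remove_lock);
	return err;
}

/* Suspend path: tasks are already frozen, so tasks_frozen = 1. */
static void sim_disable_nonboot_cpus(void)
{
	pthread_mutex_lock(&cpu_add_remove_lock);
	for (int cpu = 1; cpu < NR_CPUS; cpu++) {  /* CPU 0: assumed boot CPU */
		if (cpu_online[cpu] && sim_common_cpu_down(cpu, 1) == 0)
			frozen_cpus[cpu] = true;   /* remember for resume */
	}
	cpu_hotplug_disabled++;                    /* block regular hotplug */
	pthread_mutex_unlock(&cpu_add_remove_lock);
}

/* Resume path: re-enable regular hotplug, then bring frozen CPUs back. */
static void sim_enable_nonboot_cpus(void)
{
	pthread_mutex_lock(&cpu_add_remove_lock);
	assert(cpu_hotplug_disabled > 0);
	cpu_hotplug_disabled--;
	for (int cpu = 1; cpu < NR_CPUS; cpu++) {
		if (frozen_cpus[cpu] && sim_common_cpu_up(cpu, 1) == 0)
			frozen_cpus[cpu] = false;
	}
	pthread_mutex_unlock(&cpu_add_remove_lock);
}
```

In this model, a caller that runs sim_disable_nonboot_cpus() and then
attempts sim_cpu_up(1) gets -EBUSY until sim_enable_nonboot_cpus() has run,
which mirrors the cpu_hotplug_disabled check in the regular call path.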


Important files and functions/entry points:
-------------------------------------------

- kernel/power/process.c : freeze_processes(), thaw_processes()
- kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish()
- kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus()



II. What are the issues involved in CPU hotplug?
================================================

There are some interesting situations involving CPU hotplug and microcode
update on the CPUs, as discussed below:

[Please bear in mind that the kernel requests the microcode images from
userspace, using the request_firmware() function defined in
drivers/base/firmware_loader/main.c]


a. When all the CPUs are identical:

   This is the most common situation and it is quite straightforward: we want
   to apply the same microcode revision to each of the CPUs.
   On x86, for example, the collect_cpu_info() function defined in
   arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU
   and thereby in applying the correct microcode revision to it.
   But note that the kernel does not maintain a common microcode image for
   all the CPUs, in order to handle case 'b' described below.


b. When some of the CPUs are different than the rest:

   In this case, since we probably need to apply different microcode revisions
   to different CPUs, the kernel maintains a copy of the correct microcode
   image for each CPU (after appropriate CPU type/model discovery using
   functions such as collect_cpu_info()).


c. When a CPU is physically hot-unplugged and a new (and possibly different
   type of) CPU is hot-plugged into the system:

   In the current design of the kernel, whenever a CPU is taken offline during
   a regular CPU hotplug operation, upon receiving the CPU_DEAD notification
   (which is sent by the CPU hotplug code), the microcode update driver's
   callback for that event reacts by freeing the kernel's copy of the
   microcode image for that CPU.

   Hence, when a new CPU is brought online, since the kernel finds that it
   doesn't have the microcode image, it does the CPU type/model discovery
   afresh and then asks userspace for the appropriate microcode image
   for that CPU, which is subsequently applied.

   For example, in x86, the mc_cpu_callback() function (which is the microcode
   update driver's callback registered for CPU hotplug events) calls
   microcode_update_cpu(), which in this case calls microcode_init_cpu()
   instead of microcode_resume_cpu(), because it finds that the kernel doesn't
   have a valid microcode image. This ensures that the CPU type/model
   discovery is performed and the right microcode is applied to the CPU after
   getting it from userspace.


d. Handling microcode update during suspend/hibernate:

   Strictly speaking, during a CPU hotplug operation which does not involve
   physically removing or inserting CPUs, the CPUs are not actually powered
   off during a CPU offline. They are just put into the lowest C-states
   possible. Hence, in such a case, it is not really necessary to re-apply
   microcode when the CPUs are brought back online, since they wouldn't have
   lost the image during the CPU offline operation.

   This is the usual scenario encountered during a resume after a suspend.
   However, in the case of hibernation, since all the CPUs are completely
   powered off, during restore it becomes necessary to apply the microcode
   images to all the CPUs.

   [Note that we don't expect someone to physically pull out nodes and insert
   nodes with a different type of CPUs in-between a suspend-resume or a
   hibernate/restore cycle.]

   In the current design of the kernel, however, during a CPU offline
   operation as part of the suspend/hibernate cycle (cpuhp_tasks_frozen is
   set), the existing copy of the microcode image in the kernel is not freed.
   And during the CPU online operations (during resume/restore), since the
   kernel finds that it already has copies of the microcode images for all the
   CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU
   type/model and the need for validating whether the microcode revisions are
   right for the CPUs or not (due to the above assumption that physical CPU
   hotplug will not be done in-between suspend/resume or hibernate/restore
   cycles).


III. Known problems
===================

Are there any known problems when regular CPU hotplug and suspend race
with each other?

Yes, they are listed below:

1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to
   the _cpu_down() and _cpu_up() functions is *always* 0.
   This might not reflect the true current state of the system, since the
   tasks could have been frozen by an out-of-band event such as a suspend
   operation in progress. Hence, the cpuhp_tasks_frozen variable will not
   reflect the frozen state and the CPU hotplug callbacks which evaluate
   that variable might execute the wrong code path.

2. If a regular CPU hotplug stress test happens to race with the freezer due
   to a suspend operation in progress at the same time, then we could hit the
   situation described below:

    * A regular cpu online operation continues its journey from userspace
      into the kernel, since the freezing has not yet begun.
    * Then the freezer gets to work and freezes userspace.
    * If cpu online has not yet completed the microcode update by now,
      it will now start waiting on the frozen userspace in the
      TASK_UNINTERRUPTIBLE state, in order to get the microcode image.
    * Now the freezer continues and tries to freeze the remaining tasks. But
      due to the wait mentioned above, the freezer won't be able to freeze
      the cpu online hotplug task and hence freezing of tasks fails.

   As a result of this task freezing failure, the suspend operation gets
   aborted.
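
The way the microcode handling from section II feeds into the two problems
above can be sketched in userspace C. This is a simplified model, not the x86
microcode driver: mc_cpu_offline(), mc_cpu_online(), struct mc_image,
mc_cache, userspace_frozen and the mc_action values are all hypothetical
names, and the uninterruptible wait of problem 2 is represented by an
MC_BLOCKED return value rather than by actually blocking.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4                      /* assumption: small fixed CPU count */

enum mc_action {
	MC_INIT,     /* discover CPU type, get image from userspace, apply */
	MC_RESUME,   /* re-apply the cached image without re-discovery */
	MC_BLOCKED,  /* image needed, but userspace is frozen */
};

struct mc_image {
	bool valid;      /* false once the kernel's copy has been freed */
	int cpu_type;    /* hypothetical CPU type/model identifier */
};

static struct mc_image mc_cache[NR_CPUS];  /* one image per CPU (case 'b') */
static bool userspace_frozen;              /* set by the freezer in this model */

/* Offline callback: the cached image survives only if tasks_frozen is set. */
static void mc_cpu_offline(int cpu, int tasks_frozen)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (!tasks_frozen)
		mc_cache[cpu].valid = false;   /* regular hotplug: free image */
}

/* Online callback: pick the resume path or the init path. */
static enum mc_action mc_cpu_online(int cpu, int discovered_type)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (mc_cache[cpu].valid)
		return MC_RESUME;              /* suspend/hibernate case 'd' */
	if (userspace_frozen)
		return MC_BLOCKED;             /* real code would wait here */
	mc_cache[cpu].valid = true;            /* case 'c': fetch a new image */
	mc_cache[cpu].cpu_type = discovered_type;
	return MC_INIT;
}
```

Calling mc_cpu_offline(cpu, 0) while userspace_frozen is set corresponds to
problem 1: the image is dropped even though tasks are frozen. The
mc_cpu_online() call that follows then returns MC_BLOCKED, which corresponds
to the wait on frozen userspace described in problem 2.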