====================================================================
Interaction of Suspend code (S3) with the CPU hotplug infrastructure
====================================================================

(C) 2011 - 2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>


I. Differences between CPU hotplug and Suspend-to-RAM
======================================================

How does the regular CPU hotplug code differ from how the Suspend-to-RAM
infrastructure uses it internally? And where do they share common code?

Well, a picture is worth a thousand words... So ASCII art follows :-)

[This depicts the current design in the kernel, and focusses only on the
interactions involving the freezer and CPU hotplug and also tries to explain
the locking involved. It outlines the notifications involved as well.
But please note that here, only the call paths are illustrated, with the aim
of describing where they take different paths and where they share code.
What happens when regular CPU hotplug and Suspend-to-RAM race with each other
is not depicted here.]

On a high level, the suspend-resume cycle goes like this::

  |Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw |
  |tasks |    |     cpus      |    |          |    |     cpus     |    |tasks|


More details follow::

                                Suspend call path
                                -----------------

                                  Write 'mem' to
                                /sys/power/state
                                    sysfs file
                                        |
                                        v
                               Acquire system_transition_mutex lock
                                        |
                                        v
                             Send PM_SUSPEND_PREPARE
                                   notifications
                                        |
                                        v
                                   Freeze tasks
                                        |
                                        |
                                        v
                              freeze_secondary_cpus()
                                   /* start */
                                        |
                                        v
                            Acquire cpu_add_remove_lock
                                        |
                                        v
                             Iterate over CURRENTLY
                                   online CPUs
                                        |
                                        |
                                        |                ----------
                                        v                          | L
             ======>               _cpu_down()                     |
            |              [This takes cpuhotplug.lock             |
  Common    |               before taking down the CPU             |
   code     |               and releases it when done]             | O
            |            While it is at it, notifications          |
            |            are sent when notable events occur,       |
             ======>     by running all registered callbacks.      |
                                        |                          | O
                                        |                          |
                                        |                          |
                                        v                          |
                            Note down these cpus in                | P
                                frozen_cpus mask         ----------
                                        |
                                        v
                           Disable regular cpu hotplug
                        by increasing cpu_hotplug_disabled
                                        |
                                        v
                            Release cpu_add_remove_lock
                                        |
                                        v
                       /* freeze_secondary_cpus() complete */
                                        |
                                        v
                                   Do suspend



Resuming back is likewise, with the counterparts being (in the order of
execution during resume):

* thaw_secondary_cpus() which involves::

   |  Acquire cpu_add_remove_lock
   |  Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug
   |  Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop]
   |  Release cpu_add_remove_lock
   v


* thaw tasks
* send PM_POST_SUSPEND notifications
* Release system_transition_mutex lock.


It is to be noted here that the system_transition_mutex lock is acquired at the
very beginning, when we are just starting out to suspend, and then released only
after the entire cycle is complete (i.e., suspend + resume).

::



                          Regular CPU hotplug call path
                          -----------------------------

                                Write 0 (or 1) to
                       /sys/devices/system/cpu/cpu*/online
                                    sysfs file
                                        |
                                        |
                                        v
                                    cpu_down()
                                        |
                                        v
                           Acquire cpu_add_remove_lock
                                        |
                                        v
                          If cpu_hotplug_disabled > 0
                                return gracefully
                                        |
                                        |
                                        v
             ======>               _cpu_down()
            |              [This takes cpuhotplug.lock
  Common    |               before taking down the CPU
   code     |               and releases it when done]
            |            While it is at it, notifications
            |            are sent when notable events occur,
             ======>     by running all registered callbacks.
                                        |
                                        |
                                        v
                          Release cpu_add_remove_lock
                               [That's it!, for
                              regular CPU hotplug]



So, as can be seen from the two diagrams (the parts marked as "Common code"),
regular CPU hotplug and the suspend code path converge at the _cpu_down() and
_cpu_up() functions. They differ in the arguments passed to these functions,
in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen'
argument. But during suspend, since the tasks are already frozen by the time
the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called
with the 'tasks_frozen' argument set to 1.
[See below for some known issues regarding this.]
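
The way the two call paths converge can be modelled in user space. What
follows is a minimal, single-threaded sketch in plain C, not kernel code.
It rests on several assumptions: do_cpu_down() and do_cpu_up() are
hypothetical stand-ins for _cpu_down() and _cpu_up(), the locks appear only as
comments, NR_CPUS and last_tasks_frozen are invented for illustration, and
returning -EBUSY is just one way to model "return gracefully".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NR_CPUS 4			/* hypothetical CPU count for the model */

bool cpu_online_map[NR_CPUS] = { true, true, true, true };
bool frozen_cpus[NR_CPUS];		/* CPUs taken down by the suspend path */
int cpu_hotplug_disabled;		/* > 0: regular hotplug is refused */
int last_tasks_frozen = -1;		/* argument last seen by the common code */

/* Hypothetical stand-in for _cpu_down(): shared by both call paths. */
int do_cpu_down(int cpu, int tasks_frozen)
{
	if (cpu < 0 || cpu >= NR_CPUS || !cpu_online_map[cpu])
		return -EINVAL;
	last_tasks_frozen = tasks_frozen;
	cpu_online_map[cpu] = false;
	return 0;
}

/* Hypothetical stand-in for _cpu_up(). */
int do_cpu_up(int cpu, int tasks_frozen)
{
	if (cpu < 0 || cpu >= NR_CPUS || cpu_online_map[cpu])
		return -EINVAL;
	last_tasks_frozen = tasks_frozen;
	cpu_online_map[cpu] = true;
	return 0;
}

/* Regular hotplug path: always passes tasks_frozen = 0. */
int cpu_down(int cpu)
{
	int ret;

	/* the kernel acquires cpu_add_remove_lock here */
	if (cpu_hotplug_disabled)
		ret = -EBUSY;		/* assumed error code for this model */
	else
		ret = do_cpu_down(cpu, 0);
	/* ... and releases cpu_add_remove_lock here */
	return ret;
}

/* Suspend path: tasks are already frozen, so tasks_frozen = 1. */
int freeze_secondary_cpus(int primary)
{
	int cpu, ret = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == primary || !cpu_online_map[cpu])
			continue;	/* only CURRENTLY online CPUs */
		ret = do_cpu_down(cpu, 1);
		if (ret)
			break;
		frozen_cpus[cpu] = true;
	}
	cpu_hotplug_disabled++;		/* disable regular CPU hotplug */
	return ret;
}

/* Resume path: re-enable regular hotplug, then online the noted CPUs. */
void thaw_secondary_cpus(void)
{
	int cpu;

	cpu_hotplug_disabled--;
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!frozen_cpus[cpu])
			continue;
		do_cpu_up(cpu, 1);
		frozen_cpus[cpu] = false;
	}
}
```

In this model, a CPU that was offlined through cpu_down() before the suspend
is never recorded in frozen_cpus, so thaw_secondary_cpus() leaves it offline.
A cpu_down() attempted between freeze_secondary_cpus() and
thaw_secondary_cpus() is refused because cpu_hotplug_disabled is non-zero.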


Important files and functions/entry points:
-------------------------------------------

- kernel/power/process.c : freeze_processes(), thaw_processes()
- kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish()
- kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](),
  freeze_secondary_cpus(), thaw_secondary_cpus()



II. What are the issues involved in CPU hotplug?
================================================

There are some interesting situations involving CPU hotplug and microcode
update on the CPUs, as discussed below:

[Please bear in mind that the kernel requests the microcode images from
userspace, using the request_firmware() function defined in
drivers/base/firmware_loader/main.c]


a. When all the CPUs are identical:

   This is the most common situation and it is quite straightforward: we want
   to apply the same microcode revision to each of the CPUs.
   To give an example of x86, the collect_cpu_info() function defined in
   arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU
   and thereby in applying the correct microcode revision to it.
   But note that the kernel does not maintain a common microcode image for
   all the CPUs, in order to handle case 'b' described below.


b. When some of the CPUs are different than the rest:

   In this case since we probably need to apply different microcode revisions
   to different CPUs, the kernel maintains a copy of the correct microcode
   image for each CPU (after appropriate CPU type/model discovery using
   functions such as collect_cpu_info()).


c. When a CPU is physically hot-unplugged and a new (and possibly different
   type of) CPU is hot-plugged into the system:

   In the current design of the kernel, whenever a CPU is taken offline during
   a regular CPU hotplug operation, upon receiving the CPU_DEAD notification
   (which is sent by the CPU hotplug code), the microcode update driver's
   callback for that event reacts by freeing the kernel's copy of the
   microcode image for that CPU.

   Hence, when a new CPU is brought online, since the kernel finds that it
   doesn't have the microcode image, it does the CPU type/model discovery
   afresh and then asks userspace for the appropriate microcode image
   for that CPU, which is subsequently applied.

   For example, in x86, the mc_cpu_callback() function (which is the microcode
   update driver's callback registered for CPU hotplug events) calls
   microcode_update_cpu() which would call microcode_init_cpu() in this case,
   instead of microcode_resume_cpu(), when it finds that the kernel doesn't
   have a valid microcode image. This ensures that the CPU type/model
   discovery is performed and the right microcode is applied to the CPU after
   getting it from userspace.


d. Handling microcode update during suspend/hibernate:

   Strictly speaking, during a CPU hotplug operation which does not involve
   physically removing or inserting CPUs, the CPUs are not actually powered
   off during a CPU offline. They are just put to the lowest C-states possible.
   Hence, in such a case, it is not really necessary to re-apply microcode
   when the CPUs are brought back online, since they wouldn't have lost the
   image during the CPU offline operation.

   This is the usual scenario encountered during a resume after a suspend.
   However, in the case of hibernation, since all the CPUs are completely
   powered off, during restore it becomes necessary to apply the microcode
   images to all the CPUs.

   [Note that we don't expect someone to physically pull out nodes and insert
   nodes with a different type of CPUs in-between a suspend-resume or a
   hibernate/restore cycle.]

   In the current design of the kernel however, during a CPU offline operation
   as part of the suspend/hibernate cycle (cpuhp_tasks_frozen is set),
   the existing copy of microcode image in the kernel is not freed up.
   And during the CPU online operations (during resume/restore), since the
   kernel finds that it already has copies of the microcode images for all the
   CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU
   type/model and the need for validating whether the microcode revisions are
   right for the CPUs or not (due to the above assumption that physical CPU
   hotplug will not be done in-between suspend/resume or hibernate/restore
   cycles).


III. Known problems
===================

Are there any known problems when regular CPU hotplug and suspend race
with each other?

Yes, they are listed below:

1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to
   the _cpu_down() and _cpu_up() functions is *always* 0.
   This might not reflect the true current state of the system, since the
   tasks could have been frozen by an out-of-band event such as a suspend
   operation in progress. Hence, the cpuhp_tasks_frozen variable will not
   reflect the frozen state and the CPU hotplug callbacks which evaluate
   that variable might execute the wrong code path.

2. If a regular CPU hotplug stress test happens to race with the freezer due
   to a suspend operation in progress at the same time, then we could hit the
   situation described below:

    * A regular cpu online operation continues its journey from userspace
      into the kernel, since the freezing has not yet begun.
    * Then freezer gets to work and freezes userspace.
    * If cpu online has not yet completed the microcode update stuff by now,
      it will now start waiting on the frozen userspace in the
      TASK_UNINTERRUPTIBLE state, in order to get the microcode image.
    * Now the freezer continues and tries to freeze the remaining tasks. But
      due to this wait mentioned above, the freezer won't be able to freeze
      the cpu online hotplug task and hence freezing of tasks fails.

   As a result of this task freezing failure, the suspend operation gets
   aborted.
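
The microcode handling from section II (cases c and d), together with
problem 1 above, can be illustrated with a small user-space model. This is a
sketch under stated assumptions, not the microcode driver's code: struct
ucode_cache, ucode_cpu_offline(), ucode_cpu_online(), enum ucode_action and
the revision numbers are all hypothetical, and the tasks_frozen parameter
plays the role of cpuhp_tasks_frozen.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2			/* hypothetical CPU count for the model */

/* Which path the online callback took (hypothetical names). */
enum ucode_action {
	UCODE_INIT,			/* rediscover CPU type, fetch a new image */
	UCODE_RESUME,			/* re-apply the cached image as-is */
};

struct ucode_cache {
	bool valid;			/* kernel holds an image for this CPU */
	int revision;			/* revision of the cached image */
};

struct ucode_cache ucode[NR_CPUS];

/*
 * Offline callback: during regular hotplug (tasks_frozen == false) the
 * cached image is dropped; during suspend/hibernate it is kept.
 */
void ucode_cpu_offline(int cpu, bool tasks_frozen)
{
	if (!tasks_frozen)
		ucode[cpu].valid = false;
}

/*
 * Online callback: with a valid cached image, re-apply it without any
 * rediscovery; otherwise rediscover and cache what was found. In the
 * kernel the second path needs userspace to supply the image.
 */
enum ucode_action ucode_cpu_online(int cpu, int discovered_revision)
{
	if (ucode[cpu].valid)
		return UCODE_RESUME;

	ucode[cpu].revision = discovered_revision;
	ucode[cpu].valid = true;
	return UCODE_INIT;
}
```

Problem 1 shows up in this model when a regular hotplug offline runs while a
suspend is in progress: it passes tasks_frozen as false, so the cached image
is dropped even though the tasks are in fact frozen. The next online then
takes the UCODE_INIT path, which in the kernel depends on userspace.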