====================================================================
Interaction of Suspend code (S3) with the CPU hotplug infrastructure
====================================================================

(C) 2011 - 2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>


I. Differences between CPU hotplug and Suspend-to-RAM
======================================================

How does the regular CPU hotplug code differ from how the Suspend-to-RAM
infrastructure uses it internally? And where do they share common code?

Well, a picture is worth a thousand words... So ASCII art follows :-)

[This depicts the current design in the kernel, and focuses only on the
interactions involving the freezer and CPU hotplug. It also tries to explain
the locking involved and outlines the notifications involved.
Please note that only the call paths are illustrated here, with the aim
of describing where they take different paths and where they share code.
What happens when regular CPU hotplug and Suspend-to-RAM race with each other
is not depicted here.]

On a high level, the suspend-resume cycle goes like this::

  |Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw |
  |tasks |    |     cpus      |    |          |    |     cpus     |    |tasks|


More details follow::

                                Suspend call path
                                -----------------

                                  Write 'mem' to
                                /sys/power/state
                                    sysfs file
                                        |
                                        v
                               Acquire system_transition_mutex lock
                                        |
                                        v
                             Send PM_SUSPEND_PREPARE
                                   notifications
                                        |
                                        v
                                   Freeze tasks
                                        |
                                        |
                                        v
                              disable_nonboot_cpus()
                                   /* start */
                                        |
                                        v
                            Acquire cpu_add_remove_lock
                                        |
                                        v
                             Iterate over CURRENTLY
                                   online CPUs
                                        |
                                        |
                                        |                ----------
                                        v                          | L
             ======>               _cpu_down()                     |
            |              [This takes cpuhotplug.lock             |
  Common    |               before taking down the CPU             |
   code     |               and releases it when done]             | O
            |            While it is at it, notifications          |
            |            are sent when notable events occur,       |
             ======>     by running all registered callbacks.      |
                                        |                          | O
                                        |                          |
                                        |                          |
                                        v                          |
                            Note down these cpus in                | P
                                frozen_cpus mask         ----------
                                        |
                                        v
                           Disable regular cpu hotplug
                        by increasing cpu_hotplug_disabled
                                        |
                                        v
                            Release cpu_add_remove_lock
                                        |
                                        v
                       /* disable_nonboot_cpus() complete */
                                        |
                                        v
                                   Do suspend



Resume proceeds likewise, with the counterpart steps being (in order of
execution during resume):

* enable_nonboot_cpus() which involves::

   |  Acquire cpu_add_remove_lock
   |  Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug
   |  Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop]
   |  Release cpu_add_remove_lock
   v

* Thaw tasks
* Send PM_POST_SUSPEND notifications
* Release system_transition_mutex lock.


Note that the system_transition_mutex lock is acquired at the very beginning,
when the suspend is just starting, and is released only after the entire
cycle (i.e., suspend + resume) is complete.

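The bookkeeping done by disable_nonboot_cpus() and enable_nonboot_cpus() can
be imitated in user space. The following is only a toy model under stated
assumptions, not the code in kernel/cpu.c: every identifier with the ``s3_``
prefix is hypothetical, CPU 0 is assumed to be the boot CPU, and neither
cpu_add_remove_lock nor the hotplug callbacks are modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define S3_NR_CPUS 4    /* hypothetical CPU count; CPU 0 plays the boot CPU */

/* Toy state imitating the kernel variables named in the text. */
bool s3_cpu_online[S3_NR_CPUS] = { true, true, true, true };
bool s3_frozen_cpus[S3_NR_CPUS];
int s3_cpu_hotplug_disabled;

/* Count how many CPUs the model currently considers online. */
int s3_count_online(void)
{
    int n = 0;

    for (int cpu = 0; cpu < S3_NR_CPUS; cpu++)
        n += s3_cpu_online[cpu] ? 1 : 0;
    return n;
}

/* Imitates disable_nonboot_cpus(): offline every non-boot CPU that is
 * CURRENTLY online, remember it, then refuse regular hotplug. */
void s3_disable_nonboot_cpus(void)
{
    for (int cpu = 1; cpu < S3_NR_CPUS; cpu++) {
        if (!s3_cpu_online[cpu])
            continue;                   /* was already offline: not recorded */
        s3_cpu_online[cpu] = false;     /* stands in for _cpu_down(cpu, 1) */
        s3_frozen_cpus[cpu] = true;     /* note it down for resume */
    }
    s3_cpu_hotplug_disabled++;
}

/* Imitates enable_nonboot_cpus(): re-enable regular hotplug, then bring
 * back exactly the CPUs recorded in the frozen mask. */
void s3_enable_nonboot_cpus(void)
{
    assert(s3_cpu_hotplug_disabled > 0);
    s3_cpu_hotplug_disabled--;
    for (int cpu = 1; cpu < S3_NR_CPUS; cpu++) {
        if (!s3_frozen_cpus[cpu])
            continue;
        s3_cpu_online[cpu] = true;      /* stands in for _cpu_up(cpu, 1) */
        s3_frozen_cpus[cpu] = false;
    }
}
```

The point of the frozen mask in this model is that a CPU which was already
offline before the suspend started is not brought online by the resume.
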
::



                          Regular CPU hotplug call path
                          -----------------------------

                                Write 0 (or 1) to
                       /sys/devices/system/cpu/cpu*/online
                                    sysfs file
                                        |
                                        |
                                        v
                                    cpu_down()
                                        |
                                        v
                           Acquire cpu_add_remove_lock
                                        |
                                        v
                          If cpu_hotplug_disabled > 0
                                return gracefully
                                        |
                                        |
                                        v
             ======>                _cpu_down()
            |              [This takes cpuhotplug.lock
  Common    |               before taking down the CPU
   code     |               and releases it when done]
            |            While it is at it, notifications
            |           are sent when notable events occur,
             ======>    by running all registered callbacks.
                                        |
                                        |
                                        v
                          Release cpu_add_remove_lock
                               [That's it!, for
                              regular CPU hotplug]



So, as can be seen from the two diagrams (the parts marked as "Common code"),
regular CPU hotplug and the suspend code path converge at the _cpu_down() and
_cpu_up() functions. They differ in the arguments passed to these functions,
in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen'
argument. But during suspend, since the tasks are already frozen by the time
the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called
with the 'tasks_frozen' argument set to 1.
[See below for some known issues regarding this.]


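How the two paths converge on common code while passing a different
'tasks_frozen' value can be sketched in user space as follows. This is an
illustrative model only: all ``hp_`` identifiers are hypothetical, locking is
left out, and the use of -EBUSY for the "return gracefully" step is an
assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define HP_NR_CPUS 4                    /* hypothetical CPU count */

bool hp_online[HP_NR_CPUS] = { true, true, true, true };
int hp_hotplug_disabled;                /* imitates cpu_hotplug_disabled */
int hp_last_tasks_frozen = -1;          /* value the last "callback" saw */

/* Common code: stands in for _cpu_down(); both paths end up here. */
int hp_cpu_down_common(int cpu, int tasks_frozen)
{
    assert(cpu >= 0 && cpu < HP_NR_CPUS);
    if (!hp_online[cpu])
        return -EINVAL;
    hp_last_tasks_frozen = tasks_frozen;    /* "notify" the callbacks */
    hp_online[cpu] = false;
    return 0;
}

/* Regular hotplug entry point: refused while hotplug is disabled,
 * otherwise always passes tasks_frozen = 0. */
int hp_regular_cpu_down(int cpu)
{
    if (hp_hotplug_disabled > 0)
        return -EBUSY;                  /* assumed error code, see lead-in */
    return hp_cpu_down_common(cpu, 0);
}

/* Suspend path: tasks are already frozen, so tasks_frozen = 1. */
int hp_suspend_cpu_down(int cpu)
{
    return hp_cpu_down_common(cpu, 1);
}
```

In this model the only difference a callback can observe between the two
paths is the value recorded in hp_last_tasks_frozen.
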
Important files and functions/entry points:
-------------------------------------------

- kernel/power/process.c : freeze_processes(), thaw_processes()
- kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish()
- kernel/cpu.c : cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus()



II. What are the issues involved in CPU hotplug?
================================================

There are some interesting situations involving CPU hotplug and microcode
update on the CPUs, as discussed below:

[Please bear in mind that the kernel requests the microcode images from
userspace, using the request_firmware() function defined in
drivers/base/firmware_loader/main.c]


a. When all the CPUs are identical:

   This is the most common situation and it is quite straightforward: we want
   to apply the same microcode revision to each of the CPUs.
   To give an example of x86, the collect_cpu_info() function defined in
   arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU
   and thereby in applying the correct microcode revision to it.
   But note that the kernel does not maintain a common microcode image for
   all the CPUs, in order to handle case 'b' described below.


b. When some of the CPUs are different from the rest:

   In this case, since we probably need to apply different microcode revisions
   to different CPUs, the kernel maintains a copy of the correct microcode
   image for each CPU (after appropriate CPU type/model discovery using
   functions such as collect_cpu_info()).


c. When a CPU is physically hot-unplugged and a new (and possibly different
   type of) CPU is hot-plugged into the system:

   In the current design of the kernel, whenever a CPU is taken offline during
   a regular CPU hotplug operation, upon receiving the CPU_DEAD notification
   (which is sent by the CPU hotplug code), the microcode update driver's
   callback for that event reacts by freeing the kernel's copy of the
   microcode image for that CPU.

   Hence, when a new CPU is brought online, since the kernel finds that it
   doesn't have the microcode image, it does the CPU type/model discovery
   afresh and then requests the appropriate microcode image for that CPU
   from userspace, which is subsequently applied.

   For example, in x86, the mc_cpu_callback() function (which is the microcode
   update driver's callback registered for CPU hotplug events) calls
   microcode_update_cpu(), which in this case calls microcode_init_cpu()
   instead of microcode_resume_cpu(), because it finds that the kernel doesn't
   have a valid microcode image. This ensures that the CPU type/model
   discovery is performed and the right microcode is applied to the CPU after
   getting it from userspace.


d. Handling microcode update during suspend/hibernate:

   Strictly speaking, during a CPU hotplug operation which does not involve
   physically removing or inserting CPUs, the CPUs are not actually powered
   off during a CPU offline. They are just put to the lowest C-states possible.
   Hence, in such a case, it is not really necessary to re-apply microcode
   when the CPUs are brought back online, since they wouldn't have lost the
   image during the CPU offline operation.

   This is the usual scenario encountered during a resume after a suspend.
   However, in the case of hibernation, since all the CPUs are completely
   powered off, during restore it becomes necessary to apply the microcode
   images to all the CPUs.

   [Note that we don't expect someone to physically pull out nodes and insert
   nodes with a different type of CPUs in-between a suspend-resume or a
   hibernate/restore cycle.]

   In the current design of the kernel, however, during a CPU offline
   operation as part of the suspend/hibernate cycle (cpuhp_tasks_frozen is
   set), the existing copy of the microcode image in the kernel is not freed.
   And during the CPU online operations (during resume/restore), since the
   kernel finds that it already has copies of the microcode images for all the
   CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU
   type/model and the need for validating whether the microcode revisions are
   right for the CPUs or not (due to the above assumption that physical CPU
   hotplug will not be done in-between suspend/resume or hibernate/restore
   cycles).


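The caching policy described in cases 'c' and 'd' can be summarised with a
small user-space model. It is a sketch under stated assumptions and not the
x86 microcode driver: every ``uc_`` identifier is hypothetical, the "image"
is reduced to a per-CPU flag, and the request to userspace is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define UC_NR_CPUS 4                    /* hypothetical CPU count */

/* Which path the online side takes in this model. */
enum uc_action {
    UC_INIT_CPU,        /* discover CPU type/model, fetch image afresh */
    UC_RESUME_CPU       /* re-apply the cached image, no discovery */
};

bool uc_image_cached[UC_NR_CPUS];       /* per-CPU copy of the image */

/* Offline side: drop the cached image only for regular hotplug. */
void uc_cpu_dead(int cpu, bool tasks_frozen)
{
    assert(cpu >= 0 && cpu < UC_NR_CPUS);
    if (!tasks_frozen)
        uc_image_cached[cpu] = false;   /* the CPU might be replaced */
    /* suspend/hibernate: keep the copy for resume/restore */
}

/* Online side: pick the path based on whether a copy is still held. */
enum uc_action uc_cpu_online(int cpu)
{
    assert(cpu >= 0 && cpu < UC_NR_CPUS);
    if (uc_image_cached[cpu])
        return UC_RESUME_CPU;
    uc_image_cached[cpu] = true;        /* image obtained after discovery */
    return UC_INIT_CPU;
}
```

In the model, an offline with tasks_frozen set is followed by the cheap
resume path, while a regular offline forces fresh discovery on the next
online.
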
III. Known problems
===================

Are there any known problems when regular CPU hotplug and suspend race
with each other?

Yes, they are listed below:

1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to
   the _cpu_down() and _cpu_up() functions is *always* 0.
   This might not reflect the true current state of the system, since the
   tasks could have been frozen by an out-of-band event such as a suspend
   operation in progress. Hence, the cpuhp_tasks_frozen variable will not
   reflect the frozen state and the CPU hotplug callbacks which evaluate
   that variable might execute the wrong code path.

2. If a regular CPU hotplug stress test happens to race with the freezer due
   to a suspend operation in progress at the same time, then we could hit the
   situation described below:

    * A regular cpu online operation continues its journey from userspace
      into the kernel, since the freezing has not yet begun.
    * Then the freezer gets to work and freezes userspace.
    * If the cpu online operation has not yet completed the microcode update
      by now, it will start waiting on the frozen userspace in the
      TASK_UNINTERRUPTIBLE state, in order to get the microcode image.
    * Now the freezer continues and tries to freeze the remaining tasks. But
      due to the wait mentioned above, the freezer won't be able to freeze
      the cpu online hotplug task and hence freezing of tasks fails.

   As a result of this task freezing failure, the suspend operation gets
   aborted.
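
The failure in problem 2 can be illustrated with a deliberately simplified
model of a single freezer pass. This is not the freezer in
kernel/power/process.c: the ``fz_`` names are hypothetical, retries and
timeouts are not modelled, and the -EBUSY return value is an assumption of
the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified task states for the model. */
enum fz_state { FZ_RUNNING, FZ_UNINTERRUPTIBLE, FZ_FROZEN };

struct fz_task {
    const char *name;
    enum fz_state state;
};

/* One freezer pass: freeze every task that can be frozen.
 * Returns 0 if all tasks froze, -EBUSY if at least one could not. */
int fz_try_to_freeze_tasks(struct fz_task *tasks, size_t n)
{
    int err = 0;

    for (size_t i = 0; i < n; i++) {
        if (tasks[i].state == FZ_UNINTERRUPTIBLE)
            err = -EBUSY;               /* stuck waiting, cannot freeze */
        else
            tasks[i].state = FZ_FROZEN;
    }
    return err;
}

/* Undo a freezing attempt, as happens when the suspend is aborted. */
void fz_thaw_tasks(struct fz_task *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].state == FZ_FROZEN)
            tasks[i].state = FZ_RUNNING;
    }
}
```

Here the task that is waiting for the microcode image stays in the
uninterruptible state, the pass reports failure, and the caller is expected
to thaw the tasks that did freeze.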