====================================================================
Interaction of Suspend code (S3) with the CPU hotplug infrastructure
====================================================================

(C) 2011 - 2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>


I. Differences between CPU hotplug and Suspend-to-RAM
======================================================

How does the regular CPU hotplug code differ from how the Suspend-to-RAM
infrastructure uses it internally? And where do they share common code?

Well, a picture is worth a thousand words... So ASCII art follows :-)

[This depicts the current design in the kernel, and focusses only on the
interactions involving the freezer and CPU hotplug and also tries to explain
the locking involved. It outlines the notifications involved as well.
But please note that here, only the call paths are illustrated, with the aim
of describing where they take different paths and where they share code.
What happens when regular CPU hotplug and Suspend-to-RAM race with each other
is not depicted here.]

On a high level, the suspend-resume cycle goes like this::

  |Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw |
  |tasks |    |     cpus      |    |          |    |     cpus     |    |tasks|


More details follow::

                                Suspend call path
                                -----------------

                                  Write 'mem' to
                                /sys/power/state
                                    sysfs file
                                        |
                                        v
                               Acquire system_transition_mutex lock
                                        |
                                        v
                             Send PM_SUSPEND_PREPARE
                                   notifications
                                        |
                                        v
                                   Freeze tasks
                                        |
                                        |
                                        v
                              freeze_secondary_cpus()
                                   /* start */
                                        |
                                        v
                            Acquire cpu_add_remove_lock
                                        |
                                        v
                             Iterate over CURRENTLY
                                   online CPUs
                                        |
                                        |
                                        |                ----------
                                        v                          | L
             ======>               _cpu_down()                     |
            |              [This takes cpuhotplug.lock             |
  Common    |               before taking down the CPU             |
   code     |               and releases it when done]             | O
            |            While it is at it, notifications          |
            |            are sent when notable events occur,       |
             ======>     by running all registered callbacks.      |
                                        |                          | O
                                        |                          |
                                        |                          |
                                        v                          |
                            Note down these cpus in                | P
                                frozen_cpus mask         ----------
                                        |
                                        v
                           Disable regular cpu hotplug
                        by increasing cpu_hotplug_disabled
                                        |
                                        v
                            Release cpu_add_remove_lock
                                        |
                                        v
                       /* freeze_secondary_cpus() complete */
                                        |
                                        v
                                   Do suspend

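The freeze_secondary_cpus() part of the diagram can be sketched as a small
userspace toy model. This is not kernel code: the Python names merely mirror
the identifiers in the diagram, the lock is an ordinary threading.Lock, and
error handling is left out entirely. CPU 0 is assumed to be the boot CPU.

```python
import threading

cpu_add_remove_lock = threading.Lock()   # stands in for the kernel mutex
online = {0, 1, 2, 3}                    # assumption: CPU 0 is the boot CPU
frozen_cpus = set()
cpu_hotplug_disabled = 0
events = []                              # log of (event, cpu, tasks_frozen)


def _cpu_down(cpu, tasks_frozen):
    """Common code shared with regular hotplug: notify, then offline the CPU."""
    events.append(("down", cpu, tasks_frozen))
    online.discard(cpu)


def freeze_secondary_cpus(primary=0):
    global cpu_hotplug_disabled
    with cpu_add_remove_lock:
        # sorted() takes a snapshot of the CURRENTLY online CPUs.
        for cpu in sorted(online):
            if cpu == primary:
                continue
            _cpu_down(cpu, tasks_frozen=1)   # tasks are already frozen here
            frozen_cpus.add(cpu)             # remember it for resume
        cpu_hotplug_disabled += 1            # block regular hotplug until resume


freeze_secondary_cpus()
```

After the call, only the boot CPU is left in ``online``, the other CPUs are
recorded in ``frozen_cpus``, and ``cpu_hotplug_disabled`` is 1.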


Resume mirrors the suspend path. The counterparts are (in the order of
execution during resume):

* thaw_secondary_cpus() which involves::

   |  Acquire cpu_add_remove_lock
   |  Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug
   |  Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop]
   |  Release cpu_add_remove_lock
   v

* Thaw tasks
* Send PM_POST_SUSPEND notifications
* Release system_transition_mutex lock

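The thaw_secondary_cpus() step listed above can be sketched in the same toy
style. Again, this is only a userspace model under stated assumptions: the
starting state is what a suspend with CPUs 1-3 taken down would leave behind,
and clearing frozen_cpus at the end is part of the model rather than something
the list above spells out.

```python
import threading

cpu_add_remove_lock = threading.Lock()   # stands in for the kernel mutex
online = {0}                             # state left behind by suspend
frozen_cpus = {1, 2, 3}
cpu_hotplug_disabled = 1
events = []                              # log of (event, cpu, tasks_frozen)


def _cpu_up(cpu, tasks_frozen):
    """Common code shared with regular hotplug: notify, then online the CPU."""
    events.append(("up", cpu, tasks_frozen))
    online.add(cpu)


def thaw_secondary_cpus():
    global cpu_hotplug_disabled
    with cpu_add_remove_lock:
        cpu_hotplug_disabled -= 1            # regular hotplug is allowed again
        for cpu in sorted(frozen_cpus):
            _cpu_up(cpu, tasks_frozen=1)     # tasks are still frozen here
        frozen_cpus.clear()                  # model assumption: mask is reset


thaw_secondary_cpus()
```

Note that tasks are thawed only after this step, which is why _cpu_up() still
sees tasks_frozen set to 1.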

Note that the system_transition_mutex lock is acquired at the very beginning
of suspend and released only after the entire cycle (i.e., suspend + resume)
is complete.

::



                          Regular CPU hotplug call path
                          -----------------------------

                                Write 0 (or 1) to
                       /sys/devices/system/cpu/cpu*/online
                                    sysfs file
                                        |
                                        |
                                        v
                                    cpu_down()
                                        |
                                        v
                           Acquire cpu_add_remove_lock
                                        |
                                        v
                          If cpu_hotplug_disabled > 0
                                return gracefully
                                        |
                                        |
                                        v
             ======>                _cpu_down()
            |              [This takes cpuhotplug.lock
  Common    |               before taking down the CPU
   code     |               and releases it when done]
            |            While it is at it, notifications
            |           are sent when notable events occur,
             ======>    by running all registered callbacks.
                                        |
                                        |
                                        v
                          Release cpu_add_remove_lock
                               [That's it!, for
                              regular CPU hotplug]


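The regular hotplug path, including the cpu_hotplug_disabled check, can be
modelled the same way. This is a userspace sketch, not kernel code. The
diagram only says "return gracefully"; returning -EBUSY here is an assumption
of the model.

```python
import errno
import threading

cpu_add_remove_lock = threading.Lock()   # stands in for the kernel mutex
online = {0, 1, 2, 3}
cpu_hotplug_disabled = 0
events = []                              # log of (event, cpu, tasks_frozen)


def _cpu_down(cpu, tasks_frozen):
    """Common code shared with the suspend path."""
    events.append(("down", cpu, tasks_frozen))
    online.discard(cpu)
    return 0


def cpu_down(cpu):
    with cpu_add_remove_lock:
        if cpu_hotplug_disabled > 0:
            return -errno.EBUSY              # assumed "graceful" error code
        return _cpu_down(cpu, tasks_frozen=0)    # regular hotplug passes 0


first = cpu_down(3)          # hotplug enabled: CPU 3 goes offline
cpu_hotplug_disabled += 1    # what the suspend path does before "Do suspend"
second = cpu_down(2)         # refused: suspend has disabled regular hotplug
```

The second request is turned away before it reaches the common code, so CPU 2
stays online and no callbacks run for it.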

So, as can be seen from the two diagrams (the parts marked as "Common code"),
regular CPU hotplug and the suspend code path converge at the _cpu_down() and
_cpu_up() functions. They differ in the arguments passed to these functions:
during regular CPU hotplug, 0 is passed for the 'tasks_frozen' argument,
whereas during suspend, since the tasks are already frozen by the time the
non-boot CPUs are offlined or onlined, the _cpu_*() functions are called with
the 'tasks_frozen' argument set to 1.
[See below for some known issues regarding this.]


Important files and functions/entry points:
-------------------------------------------

- kernel/power/process.c : freeze_processes(), thaw_processes()
- kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish()
- kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](),
  [freeze|thaw]_secondary_cpus()



II. What are the issues involved in CPU hotplug?
================================================

There are some interesting situations involving CPU hotplug and microcode
update on the CPUs, as discussed below:

[Please bear in mind that the kernel requests the microcode images from
userspace, using the request_firmware() function defined in
drivers/base/firmware_loader/main.c]


a. When all the CPUs are identical:

   This is the most common situation and it is quite straightforward: we want
   to apply the same microcode revision to each of the CPUs.
   To give an example of x86, the collect_cpu_info() function defined in
   arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU
   and thereby in applying the correct microcode revision to it.
   But note that the kernel does not maintain a common microcode image for
   all the CPUs, in order to handle case 'b' described below.


b. When some of the CPUs are different from the rest:

   In this case, since we probably need to apply different microcode revisions
   to different CPUs, the kernel maintains a copy of the correct microcode
   image for each CPU (after appropriate CPU type/model discovery using
   functions such as collect_cpu_info()).


c. When a CPU is physically hot-unplugged and a new (and possibly different
   type of) CPU is hot-plugged into the system:

   In the current design of the kernel, whenever a CPU is taken offline during
   a regular CPU hotplug operation, upon receiving the CPU_DEAD notification
   (which is sent by the CPU hotplug code), the microcode update driver's
   callback for that event reacts by freeing the kernel's copy of the
   microcode image for that CPU.

   Hence, when a new CPU is brought online, since the kernel finds that it
   doesn't have the microcode image, it does the CPU type/model discovery
   afresh and then asks userspace for the appropriate microcode image for
   that CPU, which is subsequently applied.

   For example, in x86, the mc_cpu_callback() function (which is the microcode
   update driver's callback registered for CPU hotplug events) calls
   microcode_update_cpu() which would call microcode_init_cpu() in this case,
   instead of microcode_resume_cpu(), when it finds that the kernel doesn't
   have a valid microcode image. This ensures that the CPU type/model
   discovery is performed and the right microcode is applied to the CPU after
   getting it from userspace.


d. Handling microcode update during suspend/hibernate:

   Strictly speaking, during a CPU hotplug operation which does not involve
   physically removing or inserting CPUs, the CPUs are not actually powered
   off during a CPU offline. They are just put into the lowest C-states
   possible. Hence, in such a case, it is not really necessary to re-apply
   microcode when the CPUs are brought back online, since they wouldn't have
   lost the image during the CPU offline operation.

   This is the usual scenario encountered during a resume after a suspend.
   However, in the case of hibernation, since all the CPUs are completely
   powered off, during restore it becomes necessary to apply the microcode
   images to all the CPUs.

   [Note that we don't expect someone to physically pull out nodes and insert
   nodes with a different type of CPUs in-between a suspend-resume or a
   hibernate/restore cycle.]

   In the current design of the kernel however, during a CPU offline operation
   as part of the suspend/hibernate cycle (cpuhp_tasks_frozen is set),
   the existing copy of the microcode image in the kernel is not freed up.
   And during the CPU online operations (during resume/restore), since the
   kernel finds that it already has copies of the microcode images for all the
   CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU
   type/model and the need for validating whether the microcode revisions are
   right for the CPUs or not (due to the above assumption that physical CPU
   hotplug will not be done in-between suspend/resume or hibernate/restore
   cycles).

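Cases 'c' and 'd' boil down to one decision at offline time (free the cached
image or keep it) and one at online time (rediscover or re-apply). A minimal
sketch of that logic follows. It is a toy model, not the x86 driver: the
function names cpu_offline(), cpu_online() and discover_and_request() are
hypothetical, and the image is just a string.

```python
images = {}   # cpu -> the kernel's cached copy of that CPU's microcode image
log = []      # which path each online operation took


def discover_and_request(cpu):
    """Stand-in for CPU type/model discovery plus a request to userspace."""
    log.append(("init", cpu))
    return f"ucode-for-cpu{cpu}"


def cpu_offline(cpu, tasks_frozen):
    # Regular hotplug frees the cached image; suspend/hibernate keeps it.
    if not tasks_frozen:
        images.pop(cpu, None)


def cpu_online(cpu):
    if cpu in images:
        log.append(("resume", cpu))          # re-apply cached image as-is
    else:
        images[cpu] = discover_and_request(cpu)
    return images[cpu]


cpu_online(1)                        # first online: discovery is needed
cpu_offline(1, tasks_frozen=True)    # offline as part of suspend
cpu_online(1)                        # resume: cached image is re-applied
cpu_offline(1, tasks_frozen=False)   # regular hotplug offline: image freed
cpu_online(1)                        # online again: discovery from scratch
```

The log ends up as init, resume, init, which matches the behaviour described
in 'c' and 'd'.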

III. Known problems
===================

Are there any known problems when regular CPU hotplug and suspend race
with each other?

Yes, they are listed below:

1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to
   the _cpu_down() and _cpu_up() functions is *always* 0.
   This might not reflect the true current state of the system, since the
   tasks could have been frozen by an out-of-band event such as a suspend
   operation in progress. Hence, the cpuhp_tasks_frozen variable will not
   reflect the frozen state and the CPU hotplug callbacks which evaluate
   that variable might execute the wrong code path.

2. If a regular CPU hotplug stress test happens to race with the freezer due
   to a suspend operation in progress at the same time, then we could hit the
   situation described below:

    * A regular cpu online operation continues its journey from userspace
      into the kernel, since the freezing has not yet begun.
    * Then the freezer gets to work and freezes userspace.
    * If cpu online has not yet completed the microcode update step by now,
      it will now start waiting on the frozen userspace in the
      TASK_UNINTERRUPTIBLE state, in order to get the microcode image.
    * Now the freezer continues and tries to freeze the remaining tasks. But
      due to the wait mentioned above, the freezer won't be able to freeze
      the cpu online hotplug task and hence freezing of tasks fails.

   As a result of this task freezing failure, the suspend operation gets
   aborted.
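
The sequence in problem 2 can be replayed with a deterministic toy model.
This is only a sketch: there are no real threads or timeouts, the task names
are hypothetical, and the single rule it encodes is the one stated above,
namely that a task waiting uninterruptibly cannot be frozen.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    frozen: bool = False
    uninterruptible: bool = False


def freeze(task):
    """A task sleeping uninterruptibly cannot be frozen."""
    if task.uninterruptible:
        return False
    task.frozen = True
    return True


def request_microcode(requester, helper):
    """Block uninterruptibly if the userspace helper cannot answer."""
    if helper.frozen:
        requester.uninterruptible = True
        return None
    return "image"


def suspend_attempt():
    helper = Task("firmware-helper")       # hypothetical userspace helper
    hotplug = Task("cpu-online-writer")    # task that wrote 1 to .../online
    freeze(helper)                               # freezer freezes userspace
    image = request_microcode(hotplug, helper)   # cpu online needs microcode
    if not freeze(hotplug):                      # freezer reaches hotplug task
        helper.frozen = False                    # failure: thaw and give up
        return "aborted", image
    return "suspended", image


result = suspend_attempt()
```

Here the hotplug task never gets its image and the freezer cannot freeze it,
so the attempt ends as "aborted".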