1=========================
2Capacity Aware Scheduling
3=========================
4
51. CPU Capacity
6===============
7
81.1 Introduction
9----------------
10
11Conventional, homogeneous SMP platforms are composed of purely identical
12CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
13different performance characteristics - on such platforms, not all CPUs can be
14considered equal.
15
16CPU capacity is a measure of the performance a CPU can reach, normalized against
17the most performant CPU in the system. Heterogeneous systems are also called
18asymmetric CPU capacity systems, as they contain CPUs of different capacities.
19
20Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
21from two factors:
22
23- not all CPUs may have the same microarchitecture (µarch).
24- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
25  physically able to attain the higher Operating Performance Points (OPP).
26
27Arm big.LITTLE systems are an example of both. The big CPUs are more
28performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
29smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
30can.
31
32CPU performance is usually expressed in Millions of Instructions Per Second
33(MIPS), which can also be expressed as a given amount of instructions attainable
34per Hz, leading to::
35
36  capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)
37
381.2 Scheduler terms
39-------------------
40
41Two different capacity values are used within the scheduler. A CPU's
42``capacity_orig`` is its maximum attainable capacity, i.e. its maximum
43attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to
44which some loss of available performance (e.g. time spent handling IRQs) is
45subtracted.
46
47Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
48while ``capacity_orig`` is class-agnostic. The rest of this document will use
49the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of
50brevity.
51
521.3 Platform examples
53---------------------
54
551.3.1 Identical OPPs
56~~~~~~~~~~~~~~~~~~~~
57
58Consider an hypothetical dual-core asymmetric CPU capacity system where
59
60- work_per_hz(CPU0) = W
61- work_per_hz(CPU1) = W/2
62- all CPUs are running at the same fixed frequency
63
64By the above definition of capacity:
65
66- capacity(CPU0) = C
67- capacity(CPU1) = C/2
68
69To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
70be a LITTLE.
71
72With a workload that periodically does a fixed amount of work, you will get an
73execution trace like so::
74
75 CPU0 work ^
76           |     ____                ____                ____
77           |    |    |              |    |              |    |
78           +----+----+----+----+----+----+----+----+----+----+-> time
79
80 CPU1 work ^
81           |     _________           _________           ____
82           |    |         |         |         |         |
83           +----+----+----+----+----+----+----+----+----+----+-> time
84
85CPU0 has the highest capacity in the system (C), and completes a fixed amount of
86work W in T units of time. On the other hand, CPU1 has half the capacity of
87CPU0, and thus only completes W/2 in T.
88
891.3.2 Different max OPPs
90~~~~~~~~~~~~~~~~~~~~~~~~
91
92Usually, CPUs of different capacity values also have different maximum
93OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:
94
95- max_freq(CPU0) = F
96- max_freq(CPU1) = 2/3 * F
97
98This yields:
99
100- capacity(CPU0) = C
101- capacity(CPU1) = C/3
102
103Executing the same workload as described in 1.3.1, which each CPU running at its
104maximum frequency results in::
105
106 CPU0 work ^
107           |     ____                ____                ____
108           |    |    |              |    |              |    |
109           +----+----+----+----+----+----+----+----+----+----+-> time
110
111                            workload on CPU1
112 CPU1 work ^
113           |     ______________      ______________      ____
114           |    |              |    |              |    |
115           +----+----+----+----+----+----+----+----+----+----+-> time
116
1171.4 Representation caveat
118-------------------------
119
120It should be noted that having a *single* value to represent differences in CPU
121performance is somewhat of a contentious point. The relative performance
122difference between two different µarchs could be X% on integer operations, Y% on
123floating point operations, Z% on branches, and so on. Still, results using this
124simple approach have been satisfactory for now.
125
1262. Task utilization
127===================
128
1292.1 Introduction
130----------------
131
132Capacity aware scheduling requires an expression of a task's requirements with
133regards to CPU capacity. Each scheduler class can express this differently, and
134while task utilization is specific to CFS, it is convenient to describe it here
135in order to introduce more generic concepts.
136
137Task utilization is a percentage meant to represent the throughput requirements
138of a task. A simple approximation of it is the task's duty cycle, i.e.::
139
140  task_util(p) = duty_cycle(p)
141
142On an SMP system with fixed frequencies, 100% utilization suggests the task is a
143busy loop. Conversely, 10% utilization hints it is a small periodic task that
144spends more time sleeping than executing. Variable CPU frequencies and
145asymmetric CPU capacities complexify this somewhat; the following sections will
146expand on these.
147
1482.2 Frequency invariance
149------------------------
150
151One issue that needs to be taken into account is that a workload's duty cycle is
152directly impacted by the current OPP the CPU is running at. Consider running a
153periodic workload at a given frequency F::
154
155  CPU work ^
156           |     ____                ____                ____
157           |    |    |              |    |              |    |
158           +----+----+----+----+----+----+----+----+----+----+-> time
159
160This yields duty_cycle(p) == 25%.
161
162Now, consider running the *same* workload at frequency F/2::
163
164  CPU work ^
165           |     _________           _________           ____
166           |    |         |         |         |         |
167           +----+----+----+----+----+----+----+----+----+----+-> time
168
169This yields duty_cycle(p) == 50%, despite the task having the exact same
170behaviour (i.e. executing the same amount of work) in both executions.
171
172The task utilization signal can be made frequency invariant using the following
173formula::
174
175  task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))
176
177Applying this formula to the two examples above yields a frequency invariant
178task utilization of 25%.
179
1802.3 CPU invariance
181------------------
182
183CPU capacity has a similar effect on task utilization in that running an
184identical workload on CPUs of different capacity values will yield different
185duty cycles.
186
187Consider the system described in 1.3.2., i.e.::
188
189- capacity(CPU0) = C
190- capacity(CPU1) = C/3
191
192Executing a given periodic workload on each CPU at their maximum frequency would
193result in::
194
195 CPU0 work ^
196           |     ____                ____                ____
197           |    |    |              |    |              |    |
198           +----+----+----+----+----+----+----+----+----+----+-> time
199
200 CPU1 work ^
201           |     ______________      ______________      ____
202           |    |              |    |              |    |
203           +----+----+----+----+----+----+----+----+----+----+-> time
204
205IOW,
206
207- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
208- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency
209
210The task utilization signal can be made CPU invariant using the following
211formula::
212
213  task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)
214
215with ``max_capacity`` being the highest CPU capacity value in the
216system. Applying this formula to the above example above yields a CPU
217invariant task utilization of 25%.
218
2192.4 Invariant task utilization
220------------------------------
221
222Both frequency and CPU invariance need to be applied to task utilization in
223order to obtain a truly invariant signal. The pseudo-formula for a task
224utilization that is both CPU and frequency invariant is thus, for a given
225task p::
226
227                                     curr_frequency(cpu)   capacity(cpu)
228  task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
229                                     max_frequency(cpu)    max_capacity
230
231In other words, invariant task utilization describes the behaviour of a task as
232if it were running on the highest-capacity CPU in the system, running at its
233maximum frequency.
234
235Any mention of task utilization in the following sections will imply its
236invariant form.
237
2382.5 Utilization estimation
239--------------------------
240
241Without a crystal ball, task behaviour (and thus task utilization) cannot
242accurately be predicted the moment a task first becomes runnable. The CFS class
243maintains a handful of CPU and task signals based on the Per-Entity Load
244Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
245opposed to instantaneous).
246
247This means that while the capacity aware scheduling criteria will be written
248considering a "true" task utilization (using a crystal ball), the implementation
249will only ever be able to use an estimator thereof.
250
2513. Capacity aware scheduling requirements
252=========================================
253
2543.1 CPU capacity
255----------------
256
257Linux cannot currently figure out CPU capacity on its own, this information thus
258needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
259for that purpose.
260
261The arm and arm64 architectures directly map this to the arch_topology driver
262CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
263Documentation/devicetree/bindings/arm/cpu-capacity.txt.
264
2653.2 Frequency invariance
266------------------------
267
268As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
269utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
270purpose.
271
272Implementing this function requires figuring out at which frequency each CPU
273have been running at. One way to implement this is to leverage hardware counters
274whose increment rate scale with a CPU's current frequency (APERF/MPERF on x86,
275AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
276when the kernel is aware of the switched-to frequency (also employed by
277arm/arm64).
278
2794. Scheduler topology
280=====================
281
282During the construction of the sched domains, the scheduler will figure out
283whether the system exhibits asymmetric CPU capacities. Should that be the
284case:
285
286- The sched_asym_cpucapacity static key will be enabled.
287- The SD_ASYM_CPUCAPACITY flag will be set at the lowest sched_domain level that
288  spans all unique CPU capacity values.
289
290The sched_asym_cpucapacity static key is intended to guard sections of code that
291cater to asymmetric CPU capacity systems. Do note however that said key is
292*system-wide*. Imagine the following setup using cpusets::
293
294  capacity    C/2          C
295            ________    ________
296           /        \  /        \
297  CPUs     0  1  2  3  4  5  6  7
298           \__/  \______________/
299  cpusets   cs0         cs1
300
301Which could be created via:
302
303.. code-block:: sh
304
305  mkdir /sys/fs/cgroup/cpuset/cs0
306  echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
307  echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems
308
309  mkdir /sys/fs/cgroup/cpuset/cs1
310  echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
311  echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems
312
313  echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
314
315Since there *is* CPU capacity asymmetry in the system, the
316sched_asym_cpucapacity static key will be enabled. However, the sched_domain
317hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
318set in that hierarchy, it describes an SMP island and should be treated as such.
319
320Therefore, the 'canonical' pattern for protecting codepaths that cater to
321asymmetric CPU capacities is to:
322
323- Check the sched_asym_cpucapacity static key
324- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
325  the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
326  CPU or group thereof)
327
3285. Capacity aware scheduling implementation
329===========================================
330
3315.1 CFS
332-------
333
3345.1.1 Capacity fitness
335~~~~~~~~~~~~~~~~~~~~~~
336
337The main capacity scheduling criterion of CFS is::
338
339  task_util(p) < capacity(task_cpu(p))
340
341This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
342task "fits" on its CPU. If it is violated, the task will need to achieve more
343work than what its CPU can provide: it will be CPU-bound.
344
345Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
346value for a task, either via sched_setattr() or via the cgroup interface (see
347Documentation/admin-guide/cgroup-v2.rst). As its name imply, this can be used to
348clamp task_util() in the previous criterion.
349
3505.1.2 Wakeup CPU selection
351~~~~~~~~~~~~~~~~~~~~~~~~~~
352
353CFS task wakeup CPU selection follows the capacity fitness criterion described
354above. On top of that, uclamp is used to clamp the task utilization values,
355which lets userspace have more leverage over the CPU selection of CFS
356tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::
357
358  clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)
359
360By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
361on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
362periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
363giving it a high uclamp.min value.
364
365.. note::
366
367  Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
368  (EAS), which is described in Documentation/scheduling/sched-energy.rst.
369
3705.1.3 Load balancing
371~~~~~~~~~~~~~~~~~~~~
372
373A pathological case in the wakeup CPU selection occurs when a task rarely
374sleeps, if at all - it thus rarely wakes up, if at all. Consider::
375
376  w == wakeup event
377
378  capacity(CPU0) = C
379  capacity(CPU1) = C / 3
380
381                           workload on CPU0
382  CPU work ^
383           |     _________           _________           ____
384           |    |         |         |         |         |
385           +----+----+----+----+----+----+----+----+----+----+-> time
386                w                   w                   w
387
388                           workload on CPU1
389  CPU work ^
390           |     ____________________________________________
391           |    |
392           +----+----+----+----+----+----+----+----+----+----+->
393                w
394
395This workload should run on CPU0, but if the task either:
396
397- was improperly scheduled from the start (inaccurate initial
398  utilization estimation)
399- was properly scheduled from the start, but suddenly needs more
400  processing power
401
402then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
403the CPU capacity scheduling criterion is violated, and there may not be any more
404wakeup event to fix this up via wakeup CPU selection.
405
406Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
407put in place to handle this shares the same name. Misfit task migration
408leverages the CFS load balancer, more specifically the active load balance part
409(which caters to migrating currently running tasks). When load balance happens,
410a misfit active load balance will be triggered if a misfit task can be migrated
411to a CPU with more capacity than its current one.
412
4135.2 RT
414------
415
4165.2.1 Wakeup CPU selection
417~~~~~~~~~~~~~~~~~~~~~~~~~~
418
419RT task wakeup CPU selection searches for a CPU that satisfies::
420
421  task_uclamp_min(p) <= capacity(task_cpu(cpu))
422
423while still following the usual priority constraints. If none of the candidate
424CPUs can satisfy this capacity criterion, then strict priority based scheduling
425is followed and CPU capacities are ignored.
426
4275.3 DL
428------
429
4305.3.1 Wakeup CPU selection
431~~~~~~~~~~~~~~~~~~~~~~~~~~
432
433DL task wakeup CPU selection searches for a CPU that satisfies::
434
435  task_bandwidth(p) < capacity(task_cpu(p))
436
437while still respecting the usual bandwidth and deadline constraints. If
438none of the candidate CPUs can satisfy this capacity criterion, then the
439task will remain on its current CPU.
440