=======================
Intel Powerclamp Driver
=======================

By:
  - Arjan van de Ven <arjan@linux.intel.com>
  - Jacob Pan <jacob.jun.pan@linux.intel.com>

.. Contents:

	(*) Introduction
	    - Goals and Objectives

	(*) Theory of Operation
	    - Idle Injection
	    - Calibration

	(*) Performance Analysis
	    - Effectiveness and Limitations
	    - Power vs Performance
	    - Scalability
	    - Calibration
	    - Comparison with Alternative Techniques

	(*) Usage and Interfaces
	    - Generic Thermal Layer (sysfs)
	    - Kernel APIs (TBD)

	(*) Module Parameters

INTRODUCTION
============

Consider the situation where a system’s power consumption must be
reduced at runtime, due to power budget, thermal constraint, or noise
level, and where active cooling is not preferred. Software-managed
passive power reduction must be performed to prevent the hardware
actions that are designed for catastrophic scenarios.

Currently, P-states, T-states (clock modulation), and CPU offlining
are used for CPU throttling.

On Intel CPUs, C-states provide effective power reduction, but so far
they’re only used opportunistically, based on workload. With the
development of the intel_powerclamp driver, a method of synchronizing
idle injection across all online CPU threads was introduced. The goal
is to achieve forced and controllable C-state residency.

Tests and analysis have been carried out in the areas of power,
performance, scalability, and user experience. In many cases, a clear
advantage is shown over taking the CPU offline or modulating the CPU
clock.


THEORY OF OPERATION
===================

Idle Injection
--------------

On modern Intel processors (Nehalem or later), package level C-state
residency is available in MSRs, thus also available to the kernel.

These MSRs are::

      #define MSR_PKG_C2_RESIDENCY      0x60D
      #define MSR_PKG_C3_RESIDENCY      0x3F8
      #define MSR_PKG_C6_RESIDENCY      0x3F9
      #define MSR_PKG_C7_RESIDENCY      0x3FA

If the kernel can also inject idle time into the system, then a
closed-loop control system can be established that manages package
level C-states. The intel_powerclamp driver is conceived as such a
control system, where the target set point is a user-selected idle
ratio (based on power reduction), and the error is the difference
between the actual package level C-state residency ratio and the
target idle ratio.
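
The feedback quantity described above can be sketched in a few lines of
Python. This is only an illustration, not the driver's code: the
function names are hypothetical, the residency counters are assumed to
advance at the TSC rate, and the sign convention of the error (target
minus actual) is an assumption. How the counters are read from the
MSRs is left out; the samples are passed in as plain integers.

.. code-block:: python

    # Package C-state residency MSR addresses listed above.
    PKG_CSTATE_MSRS = {
        "C2": 0x60D,
        "C3": 0x3F8,
        "C6": 0x3F9,
        "C7": 0x3FA,
    }


    def residency_ratio(prev, curr, tsc_prev, tsc_curr):
        """Return package C-state residency in percent for one window.

        prev and curr map a C-state name to its counter value at the
        start and end of the window (assumption: same rate as the TSC).
        """
        tsc_delta = tsc_curr - tsc_prev
        if tsc_delta <= 0:
            return 0.0
        idle_delta = sum(curr[name] - prev[name] for name in PKG_CSTATE_MSRS)
        return 100.0 * idle_delta / tsc_delta


    def control_error(target_ratio, actual_ratio):
        """Error term of the loop (assumed sign: positive = too little idle)."""
        return target_ratio - actual_ratio

For example, if the four counters advance by a combined 250 ticks while
the TSC advances by 1000, the residency ratio is 25.0 percent, and a
target of 30 percent leaves an error of 5.0.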

Injection is controlled by high priority kernel threads, spawned for
each online CPU.

These kernel threads, with SCHED_FIFO class, are created to perform
clamping actions of controlled duty ratio and duration. Each per-CPU
thread synchronizes its idle time and duration, based on the rounding
of jiffies, so that accumulated errors do not build up into a jittery
effect. Threads are also bound to the CPU such that they cannot be
migrated, unless the CPU is taken offline. In this case, threads
belonging to the offlined CPUs will be terminated immediately.
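
The jiffies-based alignment can be illustrated with a small Python
sketch. It assumes that every thread rounds the current jiffies value
up to the next multiple of a shared interval; the function names are
hypothetical and the arithmetic mirrors the usual integer round-up
idiom rather than quoting the driver.

.. code-block:: python

    def roundup(value, multiple):
        """Round value up to the next multiple (integer arithmetic)."""
        return ((value + multiple - 1) // multiple) * multiple


    def next_injection_start(jiffies, interval):
        """Jiffy at which a per-CPU thread starts its next idle period."""
        return roundup(jiffies, interval)

With an interval of 24 jiffies, threads that wake at jiffies 1001 and
1005 both compute 1008 as the start of their next idle period, so small
per-CPU wakeup differences do not accumulate from one period to the
next.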

Running as SCHED_FIFO at a relatively high priority also allows this
scheme to work for both preemptible and non-preemptible kernels.
Alignment of idle time around jiffies ensures scalability for HZ
values. This effect can be better visualized using a Perf timechart.
The following diagram shows the behavior of kernel thread
kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
for a given "duration", then relinquishes the CPU to other tasks,
until the next time interval.

The NOHZ scheduler tick is disabled during idle time, but interrupts
are not masked. Tests show that the extra wakeups from the scheduler
tick have a dramatic impact on the effectiveness of the powerclamp
driver on large scale systems (Westmere system with 80 processors).

::

  CPU0
		    ____________          ____________
  kidle_inject/0   |   sleep    |  mwait |  sleep     |
	  _________|            |________|            |_______
				 duration
  CPU1
		    ____________          ____________
  kidle_inject/1   |   sleep    |  mwait |  sleep     |
	  _________|            |________|            |_______
				^
				|
				|
				roundup(jiffies, interval)

Only one CPU is allowed to collect statistics and update global
control parameters. This CPU is referred to as the controlling CPU in
this document. The controlling CPU is elected at runtime, with a
policy that favors the BSP (bootstrap processor), taking into account
the possibility of a CPU hot-plug.

In terms of dynamics of the idle control system, package level idle
time is considered largely as a non-causal system where its behavior
cannot be based on the past or current input. Therefore, the
intel_powerclamp driver attempts to enforce the desired idle time
instantly as given input (target idle ratio). After injection,
powerclamp monitors the actual idle for a given time window and adjusts
the next injection accordingly to avoid over/under correction.

When used in a causal control system, such as a temperature control,
it is up to the user of this driver to implement algorithms where
past samples and outputs are included in the feedback. For example, a
PID-based thermal controller can use the powerclamp driver to
maintain a desired target temperature, based on integral and
derivative gains of the past samples.
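
A userspace controller of this kind might look like the following
Python sketch. It is a minimal textbook PID loop under stated
assumptions, not a tuned or tested thermal policy: the class name, the
gains, and the choice of clamping the output to a 0..50 idle percentage
are all illustrative, and reading the temperature sensor is left to the
caller.

.. code-block:: python

    class PidIdleController:
        """PID loop that maps a temperature error to an idle percentage."""

        def __init__(self, kp, ki, kd, target_temp, max_idle=50):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.target_temp = target_temp
            self.max_idle = max_idle      # assumed cooling device max_state
            self.integral = 0.0
            self.prev_error = None

        def update(self, temp, dt):
            """Return the idle percentage to request for this sample."""
            error = temp - self.target_temp      # positive when too hot
            self.integral += error * dt
            if self.prev_error is None:
                derivative = 0.0
            else:
                derivative = (error - self.prev_error) / dt
            self.prev_error = error
            output = (self.kp * error + self.ki * self.integral
                      + self.kd * derivative)
            # Clamp to the range the cooling device accepts.
            return int(min(max(output, 0), self.max_idle))

The value returned by ``update()`` would then be written to the cooling
device's cur_state file on every sampling period. A real controller
would likely also need protection against integral windup, which this
sketch omits.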



Calibration
-----------
During scalability testing, it is observed that synchronized actions
among CPUs become challenging as the number of cores grows. This is
also true for the ability of a system to enter package level C-states.

To make sure the intel_powerclamp driver scales well, online
calibration is implemented. The goals for doing such a calibration
are:

a) determine the effective range of idle injection ratio
b) determine the amount of compensation needed at each target ratio

Compensation for each target ratio consists of two parts:

	a) steady state error compensation

	   This is to offset the error occurring when the system can
	   enter idle without extra wakeups (such as external interrupts).

	b) dynamic error compensation

	   When an excessive number of wakeups occurs during idle, an
	   additional idle ratio can be added to quiet interrupts, by
	   slowing down CPU activities.

A debugfs file is provided for the user to examine compensation
progress and results, such as on a Westmere system::

  [jacob@nex01 ~]$ cat /sys/kernel/debug/intel_powerclamp/powerclamp_calib
  controlling cpu: 0
  pct confidence steady dynamic (compensation)
  0       0       0       0
  1       1       0       0
  2       1       1       0
  3       3       1       0
  4       3       1       0
  5       3       1       0
  6       3       1       0
  7       3       1       0
  8       3       1       0
  ...
  30      3       2       0
  31      3       2       0
  32      3       1       0
  33      3       2       0
  34      3       1       0
  35      3       2       0
  36      3       1       0
  37      3       2       0
  38      3       1       0
  39      3       2       0
  40      3       3       0
  41      3       1       0
  42      3       2       0
  43      3       1       0
  44      3       1       0
  45      3       2       0
  46      3       3       0
  47      3       0       0
  48      3       2       0
  49      3       3       0

Calibration occurs during runtime. No offline method is available.
Steady state compensation is used only when the confidence levels of
all adjacent ratios have reached a satisfactory level. A confidence
level is accumulated based on clean data collected at runtime. Data
collected during a period without extra interrupts is considered
clean.

To compensate for an excessive number of wakeups during idle,
additional idle time is injected when such a condition is detected.
Currently, a simple algorithm doubles the injection ratio. A possible
enhancement might be to throttle the offending IRQ, such as delaying
EOI for level triggered interrupts. But it is a challenge to be
non-intrusive to the scheduler or the IRQ core code.
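
The two compensation rules can be sketched as follows. This Python
fragment is an illustration under several assumptions rather than the
driver's implementation: the confidence threshold of 3 is taken from
the saturated values in the table above, averaging the steady state
values of the neighbouring ratios is an assumed way of combining them,
and keeping the total below the 50 percent cap is likewise assumed.

.. code-block:: python

    CONFIDENCE_OK = 3        # assumed threshold, see the table above
    MAX_TARGET_RATIO = 50    # injection cap described in this document


    def get_compensation(cal, ratio, excessive_wakeups):
        """Return the extra idle ratio to add to a target ratio.

        cal maps a ratio to a (confidence, steady_comp) tuple.
        """
        comp = 0
        neighbours = [ratio - 1, ratio, ratio + 1]
        # Steady state: only trusted if all adjacent ratios are confident.
        if all(n in cal and cal[n][0] >= CONFIDENCE_OK for n in neighbours):
            comp = sum(cal[n][1] for n in neighbours) // len(neighbours)
        # Dynamic: doubling the injection means adding the ratio itself.
        if excessive_wakeups:
            comp = ratio
        # Assumed safeguard: stay below the overall cap.
        if comp + ratio >= MAX_TARGET_RATIO:
            comp = MAX_TARGET_RATIO - ratio - 1
        return comp

With calibration entries of (3, 1), (3, 2) and (3, 3) for ratios 9, 10
and 11, a target of 10 receives a steady state compensation of 2. If
excessive wakeups are detected, the compensation becomes 10 instead,
which doubles the injected ratio.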


CPU Online/Offline
------------------
Per-CPU kernel threads are started/stopped upon receiving
notifications of CPU hotplug activities. The intel_powerclamp driver
keeps track of clamping kernel threads, even after they are migrated
to other CPUs, after a CPU offline event.


Performance Analysis
====================
This section describes the general performance data collected on
multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).

Effectiveness and Limitations
-----------------------------
The idle injection ratio is capped at 50 percent. As mentioned
earlier, since interrupts are allowed during forced idle time,
excessive interrupts could result in less effectiveness. The extreme
case would be doing a ping -f to generate a flood of network
interrupts without much CPU acknowledgement. In this case, little can
be done from the idle injection threads. In most normal cases, such
as copying a large file with scp, applications can be throttled by
the powerclamp driver, since slowing down the CPU also slows down
network protocol processing, which in turn reduces interrupts.

When control parameters are changed at runtime by the controlling CPU,
it may take an additional period for the rest of the CPUs to catch up
with the changes. During this time, idle injection is out of sync,
and the system is thus not able to enter package C-states at the
expected ratio. But this effect is minor, in that in most cases the
target ratio is updated much less frequently than the idle injection
frequency.

Scalability
-----------
Tests also show a minor, but measurable, difference between the 4P/8P
Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
More compensation is needed on Westmere for the same target idle
ratio. The compensation also increases as the idle ratio gets larger.
These observations are the reason the calibration code is needed.

On the IVB 8P system, compared to taking CPUs offline, powerclamp can
achieve up to 40% better performance per watt (measured by a spin
counter summed over per-CPU counting threads spawned for all running
CPUs).

Usage and Interfaces
====================
The powerclamp driver is registered to the generic thermal layer as a
cooling device. Currently, it’s not bound to any thermal zones::

  jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
  cur_state:0
  max_state:50
  type:intel_powerclamp

cur_state allows the user to set the desired idle percentage. Writing 0
to cur_state will stop idle injection. Writing a value between 1 and
max_state will start the idle injection. Reading cur_state returns the
actual and current idle percentage. This may not be the same value
set by the user, in that the current idle percentage depends on the
workload and includes natural idle. When idle injection is disabled,
reading cur_state returns -1 instead of 0, to avoid confusing the
100% busy state with the disabled state.
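
Because the cooling device index differs from system to system, a
script may prefer to locate the device by its type file. The Python
sketch below shows one way to do that and to apply the cur_state
semantics described above. The helper names are hypothetical, the
``root`` parameter exists only so the functions can be exercised
against a scratch directory, and writing to the real sysfs files
requires root privileges.

.. code-block:: python

    from pathlib import Path

    THERMAL_ROOT = Path("/sys/class/thermal")


    def find_powerclamp(root=THERMAL_ROOT):
        """Return the cooling device directory of type intel_powerclamp."""
        for dev in sorted(root.glob("cooling_device*")):
            type_file = dev / "type"
            if (type_file.is_file()
                    and type_file.read_text().strip() == "intel_powerclamp"):
                return dev
        return None


    def set_idle_percent(dev, percent):
        """Request an idle percentage; 0 stops idle injection."""
        max_state = int((dev / "max_state").read_text())
        if not 0 <= percent <= max_state:
            raise ValueError(f"percent must be within 0..{max_state}")
        (dev / "cur_state").write_text(f"{percent}\n")


    def get_idle_percent(dev):
        """Return the current idle percentage, or None when disabled."""
        value = int((dev / "cur_state").read_text())
        return None if value < 0 else value

Mapping the -1 reading to ``None`` keeps the disabled state distinct
from a genuine 0 percent idle reading in calling code.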

Example usage:

- To inject 25% idle time::

	$ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state"

If the system is not busy and has more than 25% idle time already,
then the powerclamp driver will not start idle injection. Using top
will not show idle injection kernel threads.

If the system is busy (spin test below) and has less than 25% natural
idle time, powerclamp kernel threads will do idle injection. Forced
idle time is accounted as normal idle, in that the same code path is
taken as for the idle task.

In this example, 24.1% idle is shown. This helps the system admin or
user determine the cause of a slowdown when the powerclamp driver is
in action::

  Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
  Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
  Swap:  4087804k total,        0k used,  4087804k free,   945336k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
   3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
   3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
   3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
   3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
   2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
   1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
   2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz

Tests have shown that by using the powerclamp driver as a cooling
device, a PID-based userspace thermal controller can control CPU
temperature effectively when no other thermal influence is added. For
example, an UltraBook user can compile the kernel while staying under
a certain temperature (below most active trip points).

Module Parameters
=================

``cpumask`` (RW)
	A bit mask of CPUs to inject idle. The format of the bitmask is the
	same as used in other subsystems, such as /proc/irq/\*/smp_affinity.
	The mask consists of comma separated 32-bit groups. Each CPU is one
	bit. For example, for a 256 CPU system the full mask is:
	ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff

	The rightmost mask is for CPUs 0-31.

``max_idle`` (RW)
	Maximum ratio of injected idle time to total CPU time, in percent,
	in the range from 1 to 100. Even if the cooling device max_state is
	always 100 (100%), this parameter allows adding a maximum idle
	percent limit. The default is 50, to match the current
	implementation of the powerclamp driver. Values above 75 are not
	allowed if the cpumask includes every CPU present in the system.
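
The ``cpumask`` format can be generated from a list of CPU numbers with
a short helper. The Python sketch below is illustrative only: the
function name is hypothetical, and it simply packs one bit per CPU into
32-bit groups with the group for the lowest-numbered CPUs printed last,
as described above.

.. code-block:: python

    def cpumask_string(cpus, nr_cpus):
        """Format CPU numbers as comma separated 32-bit hex groups."""
        mask = 0
        for cpu in cpus:
            if not 0 <= cpu < nr_cpus:
                raise ValueError(f"CPU {cpu} is outside 0..{nr_cpus - 1}")
            mask |= 1 << cpu
        groups = (nr_cpus + 31) // 32
        words = [(mask >> (32 * i)) & 0xFFFFFFFF for i in range(groups)]
        # The rightmost group holds the lowest-numbered CPUs.
        return ",".join(f"{word:08x}" for word in reversed(words))

For a 64 CPU system, selecting CPUs 0 and 33 gives
``00000002,00000001``: bit 0 of the rightmost group is CPU 0, and bit 1
of the next group is CPU 33.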