=======================
Intel Powerclamp Driver
=======================

By:
  - Arjan van de Ven <arjan@linux.intel.com>
  - Jacob Pan <jacob.jun.pan@linux.intel.com>

.. Contents:

	(*) Introduction
	    - Goals and Objectives

	(*) Theory of Operation
	    - Idle Injection
	    - Calibration

	(*) Performance Analysis
	    - Effectiveness and Limitations
	    - Power vs Performance
	    - Scalability
	    - Calibration
	    - Comparison with Alternative Techniques

	(*) Usage and Interfaces
	    - Generic Thermal Layer (sysfs)
	    - Kernel APIs (TBD)

	(*) Module Parameters

INTRODUCTION
============

Consider the situation where a system's power consumption must be
reduced at runtime, due to power budget, thermal constraint, or noise
level, and where active cooling is not preferred. Software-managed
passive power reduction must then be performed to avoid triggering the
hardware actions that are designed for catastrophic scenarios.

Currently, P-states, T-states (clock modulation), and CPU offlining
are used for CPU throttling.

On Intel CPUs, C-states provide effective power reduction, but so far
they are only used opportunistically, based on workload. The
intel_powerclamp driver introduces a method of synchronizing idle
injection across all online CPU threads. The goal is to achieve forced
and controllable C-state residency.

Tests and analysis have been carried out in the areas of power,
performance, scalability, and user experience. In many cases, a clear
advantage is shown over taking the CPU offline or modulating the CPU
clock.


THEORY OF OPERATION
===================

Idle Injection
--------------

On modern Intel processors (Nehalem or later), package-level C-state
residency is available in MSRs, and is thus also available to the kernel.

These MSRs are::

	#define MSR_PKG_C2_RESIDENCY	0x60D
	#define MSR_PKG_C3_RESIDENCY	0x3F8
	#define MSR_PKG_C6_RESIDENCY	0x3F9
	#define MSR_PKG_C7_RESIDENCY	0x3FA

If the kernel can also inject idle time into the system, then a
closed-loop control system can be established that manages package-level
C-states. The intel_powerclamp driver is conceived as such a
control system, where the target set point is a user-selected idle
ratio (based on power reduction), and the error is the difference
between the actual package-level C-state residency ratio and the target
idle ratio.

Injection is controlled by high priority kernel threads, spawned for
each online CPU.

These kernel threads, with SCHED_FIFO class, are created to perform
clamping actions of controlled duty ratio and duration. Each per-CPU
thread synchronizes its idle time and duration based on the rounding
of jiffies, so that errors do not accumulate and cause jitter.
Threads are also bound to the CPU such that they cannot be
migrated, unless the CPU is taken offline. In this case, threads
belonging to the offlined CPUs are terminated immediately.

Running as SCHED_FIFO at a relatively high priority also allows this
scheme to work for both preemptible and non-preemptible kernels.
Alignment of idle time around jiffies ensures scalability across HZ
values. This effect can be better visualized using a Perf timechart.
The diagram below shows the behavior of the kernel thread
kidle_inject/cpu. During idle injection, it runs monitor/mwait idle
for a given "duration", then relinquishes the CPU to other tasks
until the next time interval.

The NOHZ scheduler tick is disabled during idle time, but interrupts
are not masked. Tests show that the extra wakeups from the scheduler tick
have a dramatic impact on the effectiveness of the powerclamp driver
on large scale systems (Westmere system with 80 processors).
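
The jiffies alignment described above can be illustrated with a small
Python sketch. It is not the driver's code: the function names, and the
use of plain integers in place of the kernel's jiffies counter, are
assumptions made for illustration. The sketch shows how an interval
follows from an idle duration and a target ratio, and how threads that
wake at slightly different times within the same interval still compute
the same injection start.

```python
def roundup(value, step):
    """Round value up to the next multiple of step (unchanged if already aligned)."""
    return ((value + step - 1) // step) * step


def interval_for(duration_jiffies, target_ratio_pct):
    """Interval length for which duration / interval matches the target idle ratio.

    Hypothetical helper: integer division truncates, as it would in
    kernel-style integer arithmetic.
    """
    if not 0 < target_ratio_pct <= 100:
        raise ValueError("target ratio must be in 1..100")
    return duration_jiffies * 100 // target_ratio_pct


def next_injection_start(now_jiffies, interval_jiffies):
    """Every thread rounds its own 'now' up to the same interval boundary."""
    return roundup(now_jiffies, interval_jiffies)


# 6 jiffies of idle at a 25% target ratio gives a 24-jiffy interval.
interval = interval_for(6, 25)

# Two threads observing different jiffies values agree on the start time.
start_cpu0 = next_injection_start(97, interval)
start_cpu1 = next_injection_start(110, interval)
```

Because every thread derives its start time from the shared counter
rather than from its own previous wakeup, a late wakeup on one CPU does
not shift that CPU's later injections relative to the others.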

::

  CPU0
                    ____________          ____________
  kidle_inject/0   |   sleep    |  mwait |  sleep     |
          _________|            |________|            |_______
                                 duration
  CPU1
                    ____________          ____________
  kidle_inject/1   |   sleep    |  mwait |  sleep     |
          _________|            |________|            |_______
                                ^
                                |
                                |
                                roundup(jiffies, interval)

Only one CPU is allowed to collect statistics and update global
control parameters. This CPU is referred to as the controlling CPU in
this document. The controlling CPU is elected at runtime, with a
policy that favors the BSP (bootstrap processor), taking into account
the possibility of a CPU hot-plug.

In terms of the dynamics of the idle control system, package-level idle
time is considered largely as a non-causal system, whose behavior
cannot be based on past or current input. Therefore, the
intel_powerclamp driver attempts to enforce the desired idle time
instantly as given input (target idle ratio). After injection,
powerclamp monitors the actual idle for a given time window and adjusts
the next injection accordingly to avoid over/under correction.

When used in a causal control system, such as temperature control,
it is up to the user of this driver to implement algorithms where
past samples and outputs are included in the feedback. For example, a
PID-based thermal controller can use the powerclamp driver to
maintain a desired target temperature, based on integral and
derivative gains of the past samples.



Calibration
-----------
During scalability testing, it was observed that synchronized actions
among CPUs become challenging as the number of cores grows. This is
also true for the ability of a system to enter package-level C-states.

To make sure the intel_powerclamp driver scales well, online
calibration is implemented.
The goals for doing such a calibration
are:

a) determine the effective range of idle injection ratio
b) determine the amount of compensation needed at each target ratio

Compensation to each target ratio consists of two parts:

	a) steady state error compensation
	   This is to offset the error occurring when the system can
	   enter idle without extra wakeups (such as external interrupts).

	b) dynamic error compensation
	   When an excessive number of wakeups occurs during idle, an
	   additional idle ratio can be added to quiet interrupts, by
	   slowing down CPU activities.

A debugfs file is provided for the user to examine compensation
progress and results, such as on a Westmere system::

  [jacob@nex01 ~]$ cat /sys/kernel/debug/intel_powerclamp/powerclamp_calib
  controlling cpu: 0
  pct confidence steady dynamic (compensation)
  0       0       0       0
  1       1       0       0
  2       1       1       0
  3       3       1       0
  4       3       1       0
  5       3       1       0
  6       3       1       0
  7       3       1       0
  8       3       1       0
  ...
  30      3       2       0
  31      3       2       0
  32      3       1       0
  33      3       2       0
  34      3       1       0
  35      3       2       0
  36      3       1       0
  37      3       2       0
  38      3       1       0
  39      3       2       0
  40      3       3       0
  41      3       1       0
  42      3       2       0
  43      3       1       0
  44      3       1       0
  45      3       2       0
  46      3       3       0
  47      3       0       0
  48      3       2       0
  49      3       3       0

Calibration occurs during runtime. No offline method is available.
Steady state compensation is used only when the confidence levels of all
adjacent ratios have reached a satisfactory level. A confidence level
is accumulated based on clean data collected at runtime. Data
collected during a period without extra interrupts is considered
clean.

To compensate for an excessive number of wakeups during idle, additional
idle time is injected when such a condition is detected. Currently,
a simple algorithm doubles the injection ratio. A possible
enhancement might be to throttle the offending IRQ, such as delaying
EOI for level triggered interrupts.
However, it is a challenge to do this without being intrusive to the
scheduler or the IRQ core code.


CPU Online/Offline
------------------
Per-CPU kernel threads are started/stopped upon receiving
notifications of CPU hotplug activities. The intel_powerclamp driver
keeps track of clamping kernel threads, even after they are migrated
to other CPUs following a CPU offline event.


Performance Analysis
====================
This section describes the general performance data collected on
multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P).

Effectiveness and Limitations
-----------------------------
The idle injection ratio is capped at 50 percent by default (see the
max_idle module parameter). As mentioned earlier, since interrupts are
allowed during forced idle time, excessive interrupts could reduce
effectiveness. The extreme case would be doing a ping -f to generate
a flood of network interrupts without much CPU acknowledgement. In this
case, little can be done from the idle injection threads. In most
normal cases, such as copying a large file with scp, applications can be
throttled by the powerclamp driver, since slowing down the CPU also slows
down network protocol processing, which in turn reduces interrupts.

When the controlling CPU changes control parameters at runtime, it
may take an additional period for the rest of the CPUs to catch up
with the changes. During this time, idle injection is out of sync,
and the system is thus not able to enter package C-states at the
expected ratio. This effect is minor, because in most cases the target
ratio changes much less frequently than idle is injected.

Scalability
-----------
Tests also show a minor, but measurable, difference between the 4P/8P
Ivy Bridge system and the 80P Westmere server under 50% idle ratio.
More compensation is needed on Westmere for the same
target idle ratio.
The compensation also increases as the idle ratio
gets larger. These observations are the reason the calibration code
is needed.

On the Ivy Bridge 8P system, compared to taking CPUs offline, powerclamp
can achieve up to 40% better performance per watt (measured by a spin
counter summed over the per-CPU counting threads spawned for all running
CPUs).

Usage and Interfaces
====================
The powerclamp driver is registered to the generic thermal layer as a
cooling device. Currently, it is not bound to any thermal zones::

  jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . *
  cur_state:0
  max_state:50
  type:intel_powerclamp

cur_state allows the user to set the desired idle percentage. Writing 0 to
cur_state will stop idle injection. Writing a value between 1 and
max_state will start the idle injection. Reading cur_state returns the
actual and current idle percentage. This may not be the same value
set by the user, in that the current idle percentage depends on workload
and includes natural idle. When idle injection is disabled, reading
cur_state returns -1 instead of 0, to avoid confusing the
100% busy state with the disabled state.

Example usage:

- To inject 25% idle time::

	$ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state"

If the system is not busy and already has more than 25% idle time,
then the powerclamp driver will not start idle injection. Running top
will not show the idle injection kernel threads.

If the system is busy (spin test below) and has less than 25% natural
idle time, powerclamp kernel threads will do idle injection. Forced
idle time is accounted as normal idle, because it takes the same code
path as the idle task.

In this example, 24.1% idle is shown.
This helps the system administrator or
user determine the cause of a slowdown when the powerclamp driver is in
action::


  Tasks: 197 total,   1 running, 196 sleeping,   0 stopped,   0 zombie
  Cpu(s): 71.2%us,  4.7%sy,  0.0%ni, 24.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  Mem:   3943228k total,  1689632k used,  2253596k free,    74960k buffers
  Swap:  4087804k total,        0k used,  4087804k free,   945336k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   3352 jacob     20   0  262m  644  428 S  286  0.0   0:17.16 spin
   3341 root     -51   0     0    0    0 D   25  0.0   0:01.62 kidle_inject/0
   3344 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/3
   3342 root     -51   0     0    0    0 D   25  0.0   0:01.61 kidle_inject/1
   3343 root     -51   0     0    0    0 D   25  0.0   0:01.60 kidle_inject/2
   2935 jacob     20   0  696m 125m  35m S    5  3.3   0:31.11 firefox
   1546 root      20   0  158m  20m 6640 S    3  0.5   0:26.97 Xorg
   2100 jacob     20   0 1223m  88m  30m S    3  2.3   0:23.68 compiz

Tests have shown that by using the powerclamp driver as a cooling
device, a PID-based userspace thermal controller can
control CPU temperature effectively, when no other thermal influence
is added. For example, an UltraBook user can compile the kernel while
staying under a certain temperature (below most active trip points).

Module Parameters
=================

``cpumask`` (RW)
        A bitmask of the CPUs on which to inject idle. The format of the
        bitmask is the same as that used in other subsystems, such as in
        ``/proc/irq/*/smp_affinity``. The mask consists of comma-separated
        32-bit groups. Each CPU is one bit. For example, for a 256
        CPU system the full mask is:
        ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff

        The rightmost group is for CPUs 0-31.

``max_idle`` (RW)
        Maximum ratio of injected idle time to total CPU time, in percent,
        in the range from 1 to 100. Even though the cooling device max_state
        is always 100 (100%), this parameter allows a maximum idle
        percentage limit to be added.
        The default is 50,
        to match the current implementation of the powerclamp driver. Values
        above 75 are not allowed if the cpumask includes every CPU present in
        the system.
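
As an illustration of the ``cpumask`` format described above, the
following Python sketch builds the mask string from a list of CPU
numbers. The helper name ``build_cpumask`` is hypothetical and not part
of the driver. The sketch assumes only the layout stated above:
comma-separated 32-bit hexadecimal groups, one bit per CPU, with CPU 0
in the rightmost group. Zero-padding each group to eight digits is a
choice made by this sketch.

```python
def build_cpumask(cpus, nr_cpus):
    """Return a mask string of comma-separated 32-bit hex groups.

    cpus    -- iterable of CPU numbers to include
    nr_cpus -- total number of CPUs, which fixes the number of groups
    """
    if nr_cpus <= 0:
        raise ValueError("nr_cpus must be positive")

    bits = 0
    for cpu in cpus:
        if not 0 <= cpu < nr_cpus:
            raise ValueError(f"CPU {cpu} is outside 0..{nr_cpus - 1}")
        bits |= 1 << cpu

    # One group per 32 CPUs, rounded up.
    ngroups = (nr_cpus + 31) // 32
    groups = [
        f"{(bits >> (32 * i)) & 0xFFFFFFFF:08x}" for i in range(ngroups)
    ]
    # Group 0 (CPUs 0-31) goes last, i.e. rightmost.
    return ",".join(reversed(groups))


full_mask = build_cpumask(range(256), 256)
first_four = build_cpumask([0, 1, 2, 3], 64)   # "00000000,0000000f"
only_cpu32 = build_cpumask([32], 64)           # "00000001,00000000"
```

The resulting string could then be written to the ``cpumask``
parameter, assuming it is exposed at the usual module parameter
location, /sys/module/intel_powerclamp/parameters/cpumask.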