.. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>`
.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`

=======================
CPU Performance Scaling
=======================

::

 Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>

The Concept of CPU Performance Scaling
======================================

The majority of modern processors are capable of operating in a number of
different clock frequency and voltage configurations, often referred to as
Operating Performance Points or P-states (in ACPI terminology). As a rule,
the higher the clock frequency and the higher the voltage, the more instructions
can be retired by the CPU over a unit of time, but also the higher the clock
frequency and the higher the voltage, the more energy is consumed over a unit of
time (or the more power is drawn) by the CPU in the given P-state. Therefore
there is a natural tradeoff between the CPU capacity (the number of instructions
that can be executed over a unit of time) and the power drawn by the CPU.

In some situations it is desirable or even necessary to run the program as fast
as possible and then there is no reason to use any P-states different from the
highest one (i.e. the highest-performance frequency/voltage configuration
available). In some other cases, however, it may not be necessary to execute
instructions so quickly and maintaining the highest available CPU capacity for a
relatively long time without utilizing it entirely may be regarded as wasteful.
It also may not be physically possible to maintain maximum CPU capacity for too
long for thermal or power supply capacity reasons or similar. To cover those
cases, there are hardware interfaces allowing CPUs to be switched between
different frequency/voltage configurations or (in the ACPI terminology) to be
put into different P-states.

Typically, they are used along with algorithms to estimate the required CPU
capacity, so as to decide which P-states to put the CPUs into. Of course, since
the utilization of the system generally changes over time, that has to be done
repeatedly on a regular basis. The activity by which this happens is referred
to as CPU performance scaling or CPU frequency scaling (because it involves
adjusting the CPU clock frequency).


CPU Performance Scaling in Linux
================================

The Linux kernel supports CPU performance scaling by means of the ``CPUFreq``
(CPU Frequency scaling) subsystem that consists of three layers of code: the
core, scaling governors and scaling drivers.

The ``CPUFreq`` core provides the common code infrastructure and user space
interfaces for all platforms that support CPU performance scaling. It defines
the basic framework in which the other components operate.

Scaling governors implement algorithms to estimate the required CPU capacity.
As a rule, each governor implements one, possibly parametrized, scaling
algorithm.

Scaling drivers talk to the hardware. They provide scaling governors with
information on the available P-states (or P-state ranges in some cases) and
access platform-specific hardware interfaces to change CPU P-states as requested
by scaling governors.

In principle, all available scaling governors can be used with every scaling
driver. That design is based on the observation that the information used by
performance scaling algorithms for P-state selection can be represented in a
platform-independent form in the majority of cases, so it should be possible
to use the same performance scaling algorithm implemented in exactly the same
way regardless of which scaling driver is used. Consequently, the same set of
scaling governors should be suitable for every supported platform.

However, that observation may not hold for performance scaling algorithms
based on information provided by the hardware itself, for example through
feedback registers, as that information is typically specific to the hardware
interface it comes from and may not be easily represented in an abstract,
platform-independent way. For this reason, ``CPUFreq`` allows scaling drivers
to bypass the governor layer and implement their own performance scaling
algorithms. That is done by the |intel_pstate| scaling driver.


``CPUFreq`` Policy Objects
==========================

In some cases the hardware interface for P-state control is shared by multiple
CPUs. That is, for example, the same register (or set of registers) is used to
control the P-state of multiple CPUs at the same time and writing to it affects
all of those CPUs simultaneously.

Sets of CPUs sharing hardware P-state control interfaces are represented by
``CPUFreq`` as |struct cpufreq_policy| objects. For consistency,
|struct cpufreq_policy| is also used when there is only one CPU in the given
set.

The ``CPUFreq`` core maintains a pointer to a |struct cpufreq_policy| object for
every CPU in the system, including CPUs that are currently offline. If multiple
CPUs share the same hardware P-state control interface, all of the pointers
corresponding to them point to the same |struct cpufreq_policy| object.

``CPUFreq`` uses |struct cpufreq_policy| as its basic data type and the design
of its user space interface is based on the policy concept.


CPU Initialization
==================

First of all, a scaling driver has to be registered for ``CPUFreq`` to work.
It is only possible to register one scaling driver at a time, so the scaling
driver is expected to be able to handle all CPUs in the system.

The scaling driver may be registered before or after CPU registration.
If CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to
take note of all of the already registered CPUs during the registration of the
scaling driver. In turn, if any CPUs are registered after the registration of
the scaling driver, the ``CPUFreq`` core will be invoked to take note of them
at their registration time.

In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it
has not seen so far as soon as it is ready to handle that CPU. [Note that the
logical CPU may be a physical single-core processor, or a single core in a
multicore processor, or a hardware thread in a physical processor or processor
core. In what follows "CPU" always means "logical CPU" unless explicitly stated
otherwise and the word "processor" is used to refer to the physical part
possibly including multiple logical CPUs.]

Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set
for the given CPU and if so, it skips the policy object creation. Otherwise,
a new policy object is created and initialized, which involves the creation of
a new policy directory in ``sysfs``, and the policy pointer corresponding to
the given CPU is set to the new policy object's address in memory.

Next, the scaling driver's ``->init()`` callback is invoked with the policy
pointer of the new CPU passed to it as the argument.
That callback is expected
to initialize the performance scaling hardware interface for the given CPU (or,
more precisely, for the set of CPUs sharing the hardware interface it belongs
to, represented by its policy object) and, if the policy object it has been
called for is new, to set parameters of the policy, like the minimum and maximum
frequencies supported by the hardware, the table of available frequencies (if
the set of supported P-states is not a continuous range), and the mask of CPUs
that belong to the same policy (including both online and offline CPUs). That
mask is then used by the core to populate the policy pointers for all of the
CPUs in it.

The next major initialization step for a new policy object is to attach a
scaling governor to it (to begin with, that is the default scaling governor
determined by the kernel configuration, but it may be changed later
via ``sysfs``). First, a pointer to the new policy object is passed to the
governor's ``->init()`` callback which is expected to initialize all of the
data structures necessary to handle the given policy and, possibly, to add
a governor ``sysfs`` interface to it. Next, the governor is started by
invoking its ``->start()`` callback.

That callback is expected to register per-CPU utilization update callbacks for
all of the online CPUs belonging to the given policy with the CPU scheduler.
The utilization update callbacks will be invoked by the CPU scheduler on
important events, like task enqueue and dequeue, on every iteration of the
scheduler tick or generally whenever the CPU utilization may change (from the
scheduler's perspective). They are expected to carry out computations needed
to determine the P-state to use for the given policy going forward and to
invoke the scaling driver to make changes to the hardware in accordance with
the P-state selection.
The scaling driver may be invoked directly from
scheduler context or asynchronously, via a kernel thread or workqueue, depending
on the configuration and capabilities of the scaling driver and the governor.

Similar steps are taken for policy objects that are not new, but were "inactive"
previously, meaning that all of the CPUs belonging to them were offline. The
only practical difference in that case is that the ``CPUFreq`` core will attempt
to use the scaling governor previously used with the policy that became
"inactive" (and is re-initialized now) instead of the default governor.

In turn, if a previously offline CPU is being brought back online, but some
other CPUs sharing the policy object with it are online already, there is no
need to re-initialize the policy object at all. In that case, it only is
necessary to restart the scaling governor so that it can take the new online CPU
into account. That is achieved by invoking the governor's ``->stop()`` and
``->start()`` callbacks, in this order, for the entire policy.

As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
Consequently, if |intel_pstate| is used, scaling governors are not attached to
new policy objects. Instead, the driver's ``->setpolicy()`` callback is invoked
to register per-CPU utilization update callbacks for each policy. These
callbacks are invoked by the CPU scheduler in the same way as for scaling
governors, but in the |intel_pstate| case they both determine the P-state to
use and change the hardware configuration accordingly in one go from scheduler
context.
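
The policy bookkeeping described earlier in this section can be illustrated with
a small user-space model. It is only a sketch in Python, not kernel code: the
``Policy`` class, the ``add_cpu()`` helper and the ``driver_init`` callable are
hypothetical stand-ins for the policy object, the ``CPUFreq`` core and the
scaling driver's ``->init()`` callback, respectively, and the governor
attachment steps are left out::

    class Policy:
        """Toy stand-in for a policy object (not the kernel data structure)."""

        def __init__(self, related_cpus):
            # Mask of all CPUs (online and offline) sharing one P-state interface.
            self.related_cpus = frozenset(related_cpus)


    def add_cpu(policy_of, cpu, driver_init):
        """Model the core taking note of one CPU.

        policy_of maps CPU numbers to Policy objects (the per-CPU policy
        pointers).  driver_init is a hypothetical callable returning the set
        of CPUs that share the hardware interface with the given CPU.
        Returns the policy for the CPU and whether it had to be created.
        """
        if cpu in policy_of:
            # The policy pointer is already set: skip the policy object creation.
            return policy_of[cpu], False
        policy = Policy(driver_init(cpu))
        # Populate the policy pointers for all of the CPUs in the mask.
        for related in policy.related_cpus:
            policy_of[related] = policy
        return policy, True

With a ``driver_init`` that reports CPUs 0 and 1 as sharing one interface,
calling ``add_cpu()`` for CPU 0 creates a policy, and calling it for CPU 1
afterwards returns the same object without creating another one.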

The policy objects created during CPU initialization and other data structures
associated with them are torn down when the scaling driver is unregistered
(which happens when the kernel module containing it is unloaded, for example) or
when the last CPU belonging to the given policy is unregistered.


Policy Interface in ``sysfs``
=============================

During the initialization of the kernel, the ``CPUFreq`` core creates a
``sysfs`` directory (kobject) called ``cpufreq`` under
:file:`/sys/devices/system/cpu/`.

That directory contains a ``policyX`` subdirectory (where ``X`` represents an
integer number) for every policy object maintained by the ``CPUFreq`` core.
Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links
under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer
that may be different from the one represented by ``X``) for all of the CPUs
associated with (or belonging to) the given policy. The ``policyX`` directories
in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific
attributes (files) to control ``CPUFreq`` behavior for the corresponding policy
objects (that is, for all of the CPUs associated with them).

Some of those attributes are generic. They are created by the ``CPUFreq`` core
and their behavior generally does not depend on what scaling driver is in use
and what scaling governor is attached to the given policy. Some scaling drivers
also add driver-specific attributes to the policy directories in ``sysfs`` to
control policy-specific aspects of driver behavior.

The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/`
are the following:

``affected_cpus``
    List of online CPUs belonging to this policy (i.e. sharing the hardware
    performance scaling interface represented by the ``policyX`` policy
    object).

``bios_limit``
    If the platform firmware (BIOS) tells the OS to apply an upper limit to
    CPU frequencies, that limit will be reported through this attribute (if
    present).

    The existence of the limit may be a result of some (often unintentional)
    BIOS settings, restrictions coming from a service processor or other
    BIOS/HW-based mechanisms.

    This does not cover ACPI thermal limitations which can be discovered
    through a generic thermal driver.

    This attribute is not present if the scaling driver in use does not
    support it.

``cpuinfo_max_freq``
    Maximum possible operating frequency the CPUs belonging to this policy
    can run at (in kHz).

``cpuinfo_min_freq``
    Minimum possible operating frequency the CPUs belonging to this policy
    can run at (in kHz).

``cpuinfo_transition_latency``
    The time it takes to switch the CPUs belonging to this policy from one
    P-state to another, in nanoseconds.

    If unknown or if known to be so high that the scaling driver does not
    work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`)
    will be returned by reads from this attribute.

``related_cpus``
    List of all (online and offline) CPUs belonging to this policy.

``scaling_available_governors``
    List of ``CPUFreq`` scaling governors present in the kernel that can
    be attached to this policy or (if the |intel_pstate| scaling driver is
    in use) list of scaling algorithms provided by the driver that can be
    applied to this policy.

    [Note that some governors are modular and it may be necessary to load a
    kernel module for the governor held by it to become available and be
    listed by this attribute.]

``scaling_cur_freq``
    Current frequency of all of the CPUs belonging to this policy (in kHz).

    In the majority of cases, this is the frequency of the last P-state
    requested by the scaling driver from the hardware using the scaling
    interface provided by it, which may or may not reflect the frequency
    the CPU is actually running at (due to hardware design and other
    limitations).

    Some architectures (e.g. ``x86``) may attempt to provide information
    more precisely reflecting the current CPU frequency through this
    attribute, but that still may not be the exact current CPU frequency as
    seen by the hardware at the moment.

``scaling_driver``
    The scaling driver currently in use.

``scaling_governor``
    The scaling governor currently attached to this policy or (if the
    |intel_pstate| scaling driver is in use) the scaling algorithm
    provided by the driver that is currently applied to this policy.

    This attribute is read-write and writing to it will cause a new scaling
    governor to be attached to this policy or a new scaling algorithm
    provided by the scaling driver to be applied to it (in the
    |intel_pstate| case), as indicated by the string written to this
    attribute (which must be one of the names listed by the
    ``scaling_available_governors`` attribute described above).

``scaling_max_freq``
    Maximum frequency the CPUs belonging to this policy are allowed to be
    running at (in kHz).

    This attribute is read-write and writing a string representing an
    integer to it will cause a new limit to be set (it must not be lower
    than the value of the ``scaling_min_freq`` attribute).

``scaling_min_freq``
    Minimum frequency the CPUs belonging to this policy are allowed to be
    running at (in kHz).

    This attribute is read-write and writing a string representing a
    non-negative integer to it will cause a new limit to be set (it must not
    be higher than the value of the ``scaling_max_freq`` attribute).

``scaling_setspeed``
    This attribute is functional only if the `userspace`_ scaling governor
    is attached to the given policy.

    It returns the last frequency requested by the governor (in kHz) or can
    be written to in order to set a new frequency for the policy.


Generic Scaling Governors
=========================

``CPUFreq`` provides generic scaling governors that can be used with all
scaling drivers. As stated before, each of them implements a single, possibly
parametrized, performance scaling algorithm.

Scaling governors are attached to policy objects and different policy objects
can be handled by different scaling governors at the same time (although that
may lead to suboptimal results in some cases).

The scaling governor for a given policy object can be changed at any time with
the help of the ``scaling_governor`` policy attribute in ``sysfs``.

Some governors expose ``sysfs`` attributes to control or fine-tune the scaling
algorithms implemented by them. Those attributes, referred to as governor
tunables, can be either global (system-wide) or per-policy, depending on the
scaling driver in use. If the driver requires governor tunables to be
per-policy, they are located in a subdirectory of each policy directory.
Otherwise, they are located in a subdirectory under
:file:`/sys/devices/system/cpu/cpufreq/`. In either case the name of the
subdirectory containing the governor tunables is the name of the governor
providing them.

``performance``
---------------

When attached to a policy object, this governor causes the highest frequency,
within the ``scaling_max_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.

``powersave``
-------------

When attached to a policy object, this governor causes the lowest frequency,
within the ``scaling_min_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.

``userspace``
-------------

This governor does not do anything by itself. Instead, it allows user space
to set the CPU frequency for the policy it is attached to by writing to the
``scaling_setspeed`` attribute of that policy.

``schedutil``
-------------

This governor uses CPU utilization data available from the CPU scheduler. It
generally is regarded as a part of the CPU scheduler, so it can access the
scheduler's internal data structures directly.

It runs entirely in scheduler context, although in some cases it may need to
invoke the scaling driver asynchronously when it decides that the CPU frequency
should be changed for a given policy (that depends on whether or not the driver
is capable of changing the CPU frequency from scheduler context).

The actions of this governor for a particular CPU depend on the scheduling class
invoking its utilization update callback for that CPU. If it is invoked by the
RT or deadline scheduling classes, the governor will increase the frequency to
the allowed maximum (that is, the ``scaling_max_freq`` policy limit). In turn,
if it is invoked by the CFS scheduling class, the governor will use the
Per-Entity Load Tracking (PELT) metric for the root control group of the
given CPU as the CPU utilization estimate (see the `Per-entity load tracking`_
LWN.net article for a description of the PELT mechanism).
Then, the new
CPU frequency to apply is computed in accordance with the formula

    f = 1.25 * ``f_0`` * ``util`` / ``max``

where ``util`` is the PELT number, ``max`` is the theoretical maximum of
``util``, and ``f_0`` is either the maximum possible CPU frequency for the given
policy (if the PELT number is frequency-invariant), or the current CPU frequency
(otherwise).

This governor also employs a mechanism allowing it to temporarily bump up the
CPU frequency for tasks that have been waiting on I/O most recently, called
"IO-wait boosting". That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag
is passed by the scheduler to the governor callback which causes the frequency
to go up to the allowed maximum immediately and then draw back to the value
returned by the above formula over time.

This governor exposes only one tunable:

``rate_limit_us``
    Minimum time (in microseconds) that has to pass between two consecutive
    runs of governor computations (default: 1000 times the scaling driver's
    transition latency).

    The purpose of this tunable is to reduce the scheduler context overhead
    of the governor which might be excessive without it.

This governor generally is regarded as a replacement for the older `ondemand`_
and `conservative`_ governors (described below), as it is simpler and more
tightly integrated with the CPU scheduler, its overhead in terms of CPU context
switches and similar is less significant, and it uses the scheduler's own CPU
utilization metric, so in principle its decisions should not contradict the
decisions made by the other parts of the scheduler.

``ondemand``
------------

This governor uses CPU load as a CPU frequency selection metric.

In order to estimate the current CPU load, it measures the time elapsed between
consecutive invocations of its worker routine and computes the fraction of that
time in which the given CPU was not idle. The ratio of the non-idle (active)
time to the total CPU time is taken as an estimate of the load.

If this governor is attached to a policy shared by multiple CPUs, the load is
estimated for all of them and the greatest result is taken as the load estimate
for the entire policy.

The worker routine of this governor has to run in process context, so it is
invoked asynchronously (via a workqueue) and CPU P-states are updated from
there if necessary. As a result, the scheduler context overhead from this
governor is minimal, but it causes additional CPU context switches to happen
relatively often and the CPU P-state updates triggered by it can be relatively
irregular. Also, it affects its own CPU load metric by running code that
reduces the CPU idle time (even though the CPU idle time is only reduced very
slightly by it).

It generally selects CPU frequencies proportional to the estimated load, so that
the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of
1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute
corresponds to the load of 0, unless the load exceeds a (configurable)
speedup threshold, in which case it will go straight for the highest frequency
it is allowed to use (the ``scaling_max_freq`` policy limit).

This governor exposes the following tunables:

``sampling_rate``
    This is how often the governor's worker routine should run, in
    microseconds.

    Typically, it is set to values of the order of 10000 (10 ms).
    Its
    default value is equal to the value of ``cpuinfo_transition_latency``
    for each policy this governor is attached to (but since the unit here
    is greater by 1000, this means that the time represented by
    ``sampling_rate`` is 1000 times greater than the transition latency by
    default).

    If this tunable is per-policy, the following shell command sets the time
    represented by it to be 750 times as high as the transition latency::

        # echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate


``min_sampling_rate``
    The minimum value of ``sampling_rate``.

    Equal to 10000 (10 ms) if :c:macro:`CONFIG_NO_HZ_COMMON` and
    :c:data:`tick_nohz_active` are both set or to 20 times the value of
    :c:data:`jiffies` in microseconds otherwise.

``up_threshold``
    If the estimated CPU load is above this value (in percent), the governor
    will set the frequency to the maximum value allowed for the policy.
    Otherwise, the selected frequency will be proportional to the estimated
    CPU load.

``ignore_nice_load``
    If set to 1 (default 0), it will cause the CPU load estimation code to
    treat the CPU time spent on executing tasks with "nice" levels greater
    than 0 as CPU idle time.

    This may be useful if there are tasks in the system that should not be
    taken into account when deciding what frequency to run the CPUs at.
    Then, to make that happen it is sufficient to increase the "nice" level
    of those tasks above 0 and set this attribute to 1.

``sampling_down_factor``
    Temporary multiplier, between 1 (default) and 100 inclusive, to apply to
    the ``sampling_rate`` value if the CPU load goes above ``up_threshold``.

    This causes the next execution of the governor's worker routine (after
    setting the frequency to the allowed maximum) to be delayed, so the
    frequency stays at the maximum level for a longer time.

    Frequency fluctuations in some bursty workloads may be avoided this way
    at the cost of additional energy spent on maintaining the maximum CPU
    capacity.

``powersave_bias``
    Reduction factor to apply to the original frequency target of the
    governor (including the maximum value used when the ``up_threshold``
    value is exceeded by the estimated CPU load) or sensitivity threshold
    for the AMD frequency sensitivity powersave bias driver
    (:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000
    inclusive.

    If the AMD frequency sensitivity powersave bias driver is not loaded,
    the effective frequency to apply is given by

        f * (1 - ``powersave_bias`` / 1000)

    where f is the governor's original frequency target. The default value
    of this attribute is 0 in that case.

    If the AMD frequency sensitivity powersave bias driver is loaded, the
    value of this attribute is 400 by default and it is used in a different
    way.

    On Family 16h (and later) AMD processors there is a mechanism to get a
    measured workload sensitivity, between 0 and 100% inclusive, from the
    hardware. That value can be used to estimate how the performance of the
    workload running on a CPU will change in response to frequency changes.

    The performance of a workload with the sensitivity of 0 (memory-bound or
    IO-bound) is not expected to increase at all as a result of increasing
    the CPU frequency, whereas workloads with the sensitivity of 100%
    (CPU-bound) are expected to perform much better if the CPU frequency is
    increased.

    If the workload sensitivity is less than the threshold represented by
    the ``powersave_bias`` value, the sensitivity powersave bias driver
    will cause the governor to select a frequency lower than its original
    target, so as to avoid over-provisioning workloads that will not benefit
    from running at higher CPU frequencies.
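
The frequency selection rules of the ``ondemand`` governor can be summarized
with a short sketch. It is written in Python for illustration only and is not
the kernel implementation: the function names and parameters are hypothetical,
loads and ``up_threshold`` are expressed in percent, frequencies are in kHz, the
use of integer arithmetic is an assumption of the sketch, the result is not
clamped to the policy limits, and the AMD frequency sensitivity case is not
modelled::

    def estimate_load(busy_times, interval):
        """Policy load in percent: the greatest per-CPU non-idle share.

        busy_times holds the non-idle time of each CPU in the policy over the
        last sampling interval; interval is the length of that interval.
        """
        return max(100 * busy // interval for busy in busy_times)


    def ondemand_target(load, cpuinfo_min, cpuinfo_max, scaling_max,
                        up_threshold, powersave_bias=0):
        """Frequency (kHz) selected for the given load estimate (percent)."""
        if load > up_threshold:
            # Above the speedup threshold: go straight for the allowed maximum.
            freq = scaling_max
        else:
            # Otherwise proportional: load 0 -> cpuinfo_min, 100 -> cpuinfo_max.
            freq = cpuinfo_min + (cpuinfo_max - cpuinfo_min) * load // 100
        # Apply f * (1 - powersave_bias / 1000) using integer arithmetic.
        return freq * (1000 - powersave_bias) // 1000

For instance, with a minimum of 800000 kHz, a maximum of 3000000 kHz and a load
of 50 percent, the proportional rule in this sketch yields 1900000 kHz, and a
``powersave_bias`` of 100 reduces that target by 10 percent.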

``conservative``
----------------

This governor uses CPU load as a CPU frequency selection metric.

It estimates the CPU load in the same way as the `ondemand`_ governor described
above, but the CPU frequency selection algorithm implemented by it is different.

Namely, it avoids changing the frequency significantly over short time intervals
which may not be suitable for systems with limited power supply capacity (e.g.
battery-powered). To achieve that, it changes the frequency in relatively
small steps, one step at a time, up or down - depending on whether or not a
(configurable) threshold has been exceeded by the estimated CPU load.

This governor exposes the following tunables:

``freq_step``
    Frequency step in percent of the maximum frequency the governor is
    allowed to set (the ``scaling_max_freq`` policy limit), between 0 and
    100 (5 by default).

    This is how much the frequency is allowed to change in one go. Setting
    it to 0 will cause the default frequency step (5 percent) to be used
    and setting it to 100 effectively causes the governor to periodically
    switch the frequency between the ``scaling_min_freq`` and
    ``scaling_max_freq`` policy limits.

``down_threshold``
    Threshold value (in percent, 20 by default) used to determine the
    frequency change direction.

    If the estimated CPU load is greater than this value, the frequency will
    go up (by ``freq_step``). If the load is less than this value (and the
    ``sampling_down_factor`` mechanism is not in effect), the frequency will
    go down. Otherwise, the frequency will not be changed.

``sampling_down_factor``
    Frequency decrease deferral factor, between 1 (default) and 10
    inclusive.

    It effectively causes the frequency to go down ``sampling_down_factor``
    times slower than it ramps up.
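
The stepwise behavior of the ``conservative`` governor can be sketched as
follows. This is a hypothetical Python helper, not the kernel code: it follows
the wording of the ``freq_step`` and ``down_threshold`` descriptions above
literally, uses integer kHz values as an assumption, and does not model the
``sampling_down_factor`` deferral::

    def conservative_next(cur, load, scaling_min, scaling_max,
                          down_threshold=20, freq_step=5):
        """Next frequency (kHz) after one governor run; load is in percent."""
        if freq_step == 0:
            # A freq_step of 0 selects the default step of 5 percent.
            freq_step = 5
        # The step is a percentage of the scaling_max_freq policy limit.
        step = scaling_max * freq_step // 100
        if load > down_threshold:
            # Go up by one step, but never above the policy maximum.
            return min(cur + step, scaling_max)
        if load < down_threshold:
            # Go down by one step, but never below the policy minimum.
            return max(cur - step, scaling_min)
        return cur

With ``scaling_max_freq`` equal to 2000000 kHz and the default ``freq_step``,
each run of this sketch moves the frequency by 100000 kHz at most, which is why
repeated runs are needed to travel between the two policy limits.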


Frequency Boost Support
=======================

Background
----------

Some processors support a mechanism to raise the operating frequency of some
cores in a multicore package temporarily (and above the sustainable frequency
threshold for the whole package) under certain conditions, for example if the
whole chip is not fully utilized and below its intended thermal or power budget.

Different names are used by different vendors to refer to this functionality.
For Intel processors it is referred to as "Turbo Boost", AMD calls it
"Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on.
As a rule, it also is implemented differently by different vendors. The simple
term "frequency boost" is used here for brevity to refer to all of those
implementations.

The frequency boost mechanism may be either hardware-based or software-based.
If it is hardware-based (e.g. on x86), the decision to trigger the boosting is
made by the hardware (although in general it requires the hardware to be put
into a special state in which it can control the CPU frequency within certain
limits). If it is software-based (e.g. on ARM), the scaling driver decides
whether or not to trigger boosting and when to do that.

The ``boost`` File in ``sysfs``
-------------------------------

This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls
the "boost" setting for the whole system. It is not present if the underlying
scaling driver does not support the frequency boost mechanism (or supports it,
but provides a driver-specific interface for controlling it, like
|intel_pstate|).

If the value in this file is 1, the frequency boost mechanism is enabled.
This
means that either the hardware can be put into states in which it is able to
trigger boosting (in the hardware-based case), or the software is allowed to
trigger boosting (in the software-based case). It does not mean that boosting
is actually in use at the moment on any CPUs in the system. It only means a
permission to use the frequency boost mechanism (which still may never be used
for other reasons).

If the value in this file is 0, the frequency boost mechanism is disabled and
cannot be used at all.

The only values that can be written to this file are 0 and 1.

Rationale for Boost Control Knob
--------------------------------

The frequency boost mechanism is generally intended to help to achieve optimum
CPU performance on time scales below software resolution (e.g. below the
scheduler tick interval) and it is demonstrably suitable for many workloads, but
it may lead to problems in certain situations.

For this reason, many systems make it possible to disable the frequency boost
mechanism in the platform firmware (BIOS) setup, but that requires the system to
be restarted for the setting to be adjusted as desired, which may not be
practical at least in some cases. For example:

  1. Boosting means overclocking the processor, although under controlled
     conditions. Generally, the processor's energy consumption increases
     as a result of increasing its frequency and voltage, even temporarily.
     That may not be desirable on systems that switch to power sources of
     limited capacity, such as batteries, so the ability to disable the boost
     mechanism while the system is running may help there (but that depends on
     the workload too).

  2. In some situations deterministic behavior is more important than
     performance or energy consumption (or both) and the ability to disable
     boosting while the system is running may be useful then.

  3. To examine the impact of the frequency boost mechanism itself, it is useful
     to be able to run tests with and without boosting, preferably without
     restarting the system in the meantime.

  4. Reproducible results are important when running benchmarks. Since
     the boosting functionality depends on the load of the whole package,
     single-thread performance may vary because of it which may lead to
     unreproducible results sometimes. That can be avoided by disabling the
     frequency boost mechanism before running benchmarks sensitive to that
     issue.

Legacy AMD ``cpb`` Knob
-----------------------

The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to
the global ``boost`` one. It is used for disabling/enabling the "Core
Performance Boost" feature of some AMD processors.

If present, that knob is located in every ``CPUFreq`` policy directory in
``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called
``cpb``, which indicates a more fine grained control interface. The actual
implementation, however, works on the system-wide basis and setting that knob
for one policy causes the same value of it to be set for all of the other
policies at the same time.

That knob is still supported on AMD processors that support its underlying
hardware feature, but it may be configured out of the kernel (via the
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global
``boost`` knob is present regardless. Thus it is always possible to use the
``boost`` knob instead of the ``cpb`` one which is highly recommended, as that
is more consistent with what all of the other systems do (and the ``cpb`` knob
may not be supported any more in the future).

The ``cpb`` knob is never present for any processors without the underlying
hardware feature (e.g. all Intel ones), even if the
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set.
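
For scripts that need to toggle the global ``boost`` knob around a benchmark
run, a minimal Python sketch is given below. The helper names are hypothetical,
the ``path`` parameter exists only to make the helpers easy to try on an
ordinary file, writing to the real attribute is assumed to require sufficient
privileges, and a missing file is reported as ``None`` because the knob is not
present with every scaling driver::

    from pathlib import Path

    # Location of the global knob as described in this document.
    BOOST_PATH = Path("/sys/devices/system/cpu/cpufreq/boost")


    def boost_enabled(path=BOOST_PATH):
        """Return True for 1, False for 0, or None if the file is absent."""
        path = Path(path)
        if not path.exists():
            return None
        return path.read_text().strip() == "1"


    def set_boost(enabled, path=BOOST_PATH):
        """Write 1 or 0, the only values the file accepts."""
        Path(path).write_text("1" if enabled else "0")

A benchmark wrapper could record the value returned by ``boost_enabled()``,
call ``set_boost(False)``, run the workload and then restore the recorded
setting.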


.. _Per-entity load tracking: https://lwn.net/Articles/531853/