.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

.. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>`
.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`

=======================
CPU Performance Scaling
=======================

:Copyright: |copy| 2017 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


The Concept of CPU Performance Scaling
======================================

The majority of modern processors are capable of operating in a number of
different clock frequency and voltage configurations, often referred to as
Operating Performance Points or P-states (in ACPI terminology). As a rule,
the higher the clock frequency and the higher the voltage, the more instructions
can be retired by the CPU over a unit of time, but also the higher the clock
frequency and the higher the voltage, the more energy is consumed over a unit of
time (or the more power is drawn) by the CPU in the given P-state. Therefore
there is a natural tradeoff between the CPU capacity (the number of instructions
that can be executed over a unit of time) and the power drawn by the CPU.

In some situations it is desirable or even necessary to run the program as fast
as possible and then there is no reason to use any P-states different from the
highest one (i.e. the highest-performance frequency/voltage configuration
available). In some other cases, however, it may not be necessary to execute
instructions so quickly and maintaining the highest available CPU capacity for a
relatively long time without utilizing it entirely may be regarded as wasteful.
It also may not be physically possible to maintain maximum CPU capacity for too
long for thermal or power supply capacity reasons or similar.
To cover those
cases, there are hardware interfaces allowing CPUs to be switched between
different frequency/voltage configurations or (in the ACPI terminology) to be
put into different P-states.

Typically, they are used along with algorithms to estimate the required CPU
capacity, so as to decide which P-states to put the CPUs into. Of course, since
the utilization of the system generally changes over time, that has to be done
repeatedly on a regular basis. The activity by which this happens is referred
to as CPU performance scaling or CPU frequency scaling (because it involves
adjusting the CPU clock frequency).


CPU Performance Scaling in Linux
================================

The Linux kernel supports CPU performance scaling by means of the ``CPUFreq``
(CPU Frequency scaling) subsystem that consists of three layers of code: the
core, scaling governors and scaling drivers.

The ``CPUFreq`` core provides the common code infrastructure and user space
interfaces for all platforms that support CPU performance scaling. It defines
the basic framework in which the other components operate.

Scaling governors implement algorithms to estimate the required CPU capacity.
As a rule, each governor implements one, possibly parametrized, scaling
algorithm.

Scaling drivers talk to the hardware. They provide scaling governors with
information on the available P-states (or P-state ranges in some cases) and
access platform-specific hardware interfaces to change CPU P-states as requested
by scaling governors.

In principle, all available scaling governors can be used with every scaling
driver.
That design is based on the observation that the information used by
performance scaling algorithms for P-state selection can be represented in a
platform-independent form in the majority of cases, so it should be possible
to use the same performance scaling algorithm implemented in exactly the same
way regardless of which scaling driver is used. Consequently, the same set of
scaling governors should be suitable for every supported platform.

However, that observation may not hold for performance scaling algorithms
based on information provided by the hardware itself, for example through
feedback registers, as that information is typically specific to the hardware
interface it comes from and may not be easily represented in an abstract,
platform-independent way. For this reason, ``CPUFreq`` allows scaling drivers
to bypass the governor layer and implement their own performance scaling
algorithms. That is done by the |intel_pstate| scaling driver.


``CPUFreq`` Policy Objects
==========================

In some cases the hardware interface for P-state control is shared by multiple
CPUs. That is, for example, the same register (or set of registers) is used to
control the P-state of multiple CPUs at the same time and writing to it affects
all of those CPUs simultaneously.

Sets of CPUs sharing hardware P-state control interfaces are represented by
``CPUFreq`` as |struct cpufreq_policy| objects. For consistency,
|struct cpufreq_policy| is also used when there is only one CPU in the given
set.

The ``CPUFreq`` core maintains a pointer to a |struct cpufreq_policy| object for
every CPU in the system, including CPUs that are currently offline. If multiple
CPUs share the same hardware P-state control interface, all of the pointers
corresponding to them point to the same |struct cpufreq_policy| object.
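
From user space, this sharing is visible through the ``cpufreq`` symbolic links
that ``sysfs`` exposes for each CPU (described later in this document). The
following is a minimal sketch only: the ``policy_of_cpu`` helper name and its
optional base-directory argument are assumptions made for illustration and are
not part of ``CPUFreq`` or of any existing tool.

.. code-block:: sh

    # Hypothetical helper: print the name of the policy directory that
    # CPU $1 belongs to (for example "policy0"), or fail if the CPU has no
    # cpufreq link. $2 is the sysfs CPU directory (defaults to the usual
    # location).
    policy_of_cpu() {
        base=${2:-/sys/devices/system/cpu}
        link="$base/cpu$1/cpufreq"
        [ -L "$link" ] || return 1
        # The link target ends in the policy directory name.
        basename "$(readlink "$link")"
    }

CPUs for which ``policy_of_cpu`` prints the same name share one policy object.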

``CPUFreq`` uses |struct cpufreq_policy| as its basic data type and the design
of its user space interface is based on the policy concept.


CPU Initialization
==================

First of all, a scaling driver has to be registered for ``CPUFreq`` to work.
It is only possible to register one scaling driver at a time, so the scaling
driver is expected to be able to handle all CPUs in the system.

The scaling driver may be registered before or after CPU registration. If
CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to
take note of all of the already registered CPUs during the registration of the
scaling driver. In turn, if any CPUs are registered after the registration of
the scaling driver, the ``CPUFreq`` core will be invoked to take note of them
at their registration time.

In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it
has not seen so far as soon as it is ready to handle that CPU. [Note that the
logical CPU may be a physical single-core processor, or a single core in a
multicore processor, or a hardware thread in a physical processor or processor
core. In what follows "CPU" always means "logical CPU" unless explicitly stated
otherwise and the word "processor" is used to refer to the physical part
possibly including multiple logical CPUs.]

Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set
for the given CPU and if so, it skips the policy object creation. Otherwise,
a new policy object is created and initialized, which involves the creation of
a new policy directory in ``sysfs``, and the policy pointer corresponding to
the given CPU is set to the new policy object's address in memory.

Next, the scaling driver's ``->init()`` callback is invoked with the policy
pointer of the new CPU passed to it as the argument.
That callback is expected
to initialize the performance scaling hardware interface for the given CPU (or,
more precisely, for the set of CPUs sharing the hardware interface it belongs
to, represented by its policy object) and, if the policy object it has been
called for is new, to set parameters of the policy, like the minimum and maximum
frequencies supported by the hardware, the table of available frequencies (if
the set of supported P-states is not a continuous range), and the mask of CPUs
that belong to the same policy (including both online and offline CPUs). That
mask is then used by the core to populate the policy pointers for all of the
CPUs in it.

The next major initialization step for a new policy object is to attach a
scaling governor to it (to begin with, that is the default scaling governor
determined by the kernel configuration, but it may be changed later
via ``sysfs``). First, a pointer to the new policy object is passed to the
governor's ``->init()`` callback which is expected to initialize all of the
data structures necessary to handle the given policy and, possibly, to add
a governor ``sysfs`` interface to it. Next, the governor is started by
invoking its ``->start()`` callback.

That callback is expected to register per-CPU utilization update callbacks for
all of the online CPUs belonging to the given policy with the CPU scheduler.
The utilization update callbacks will be invoked by the CPU scheduler on
important events, like task enqueue and dequeue, on every iteration of the
scheduler tick or generally whenever the CPU utilization may change (from the
scheduler's perspective). They are expected to carry out computations needed
to determine the P-state to use for the given policy going forward and to
invoke the scaling driver to make changes to the hardware in accordance with
the P-state selection.
The scaling driver may be invoked directly from
scheduler context or asynchronously, via a kernel thread or workqueue, depending
on the configuration and capabilities of the scaling driver and the governor.

Similar steps are taken for policy objects that are not new, but were "inactive"
previously, meaning that all of the CPUs belonging to them were offline. The
only practical difference in that case is that the ``CPUFreq`` core will attempt
to use the scaling governor previously used with the policy that became
"inactive" (and is re-initialized now) instead of the default governor.

In turn, if a previously offline CPU is being brought back online, but some
other CPUs sharing the policy object with it are online already, there is no
need to re-initialize the policy object at all. In that case, it is only
necessary to restart the scaling governor so that it can take the new online CPU
into account. That is achieved by invoking the governor's ``->stop()`` and
``->start()`` callbacks, in this order, for the entire policy.

As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
Consequently, if |intel_pstate| is used, scaling governors are not attached to
new policy objects. Instead, the driver's ``->setpolicy()`` callback is invoked
to register per-CPU utilization update callbacks for each policy. These
callbacks are invoked by the CPU scheduler in the same way as for scaling
governors, but in the |intel_pstate| case they both determine the P-state to
use and change the hardware configuration accordingly in one go from scheduler
context.

The policy objects created during CPU initialization and other data structures
associated with them are torn down when the scaling driver is unregistered
(which happens when the kernel module containing it is unloaded, for example) or
when the last CPU belonging to the given policy is unregistered.


Policy Interface in ``sysfs``
=============================

During the initialization of the kernel, the ``CPUFreq`` core creates a
``sysfs`` directory (kobject) called ``cpufreq`` under
:file:`/sys/devices/system/cpu/`.

That directory contains a ``policyX`` subdirectory (where ``X`` represents an
integer number) for every policy object maintained by the ``CPUFreq`` core.
Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links
under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer
that may be different from the one represented by ``X``) for all of the CPUs
associated with (or belonging to) the given policy. The ``policyX`` directories
in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific
attributes (files) to control ``CPUFreq`` behavior for the corresponding policy
objects (that is, for all of the CPUs associated with them).

Some of those attributes are generic. They are created by the ``CPUFreq`` core
and their behavior generally does not depend on what scaling driver is in use
and what scaling governor is attached to the given policy. Some scaling drivers
also add driver-specific attributes to the policy directories in ``sysfs`` to
control policy-specific aspects of driver behavior.

The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/`
are the following:

``affected_cpus``
        List of online CPUs belonging to this policy (i.e. sharing the hardware
        performance scaling interface represented by the ``policyX`` policy
        object).

``bios_limit``
        If the platform firmware (BIOS) tells the OS to apply an upper limit to
        CPU frequencies, that limit will be reported through this attribute (if
        present).

        The existence of the limit may be a result of some (often unintentional)
        BIOS settings, restrictions coming from a service processor or other
        BIOS/HW-based mechanisms.

        This does not cover ACPI thermal limitations which can be discovered
        through a generic thermal driver.

        This attribute is not present if the scaling driver in use does not
        support it.

``cpuinfo_cur_freq``
        Current frequency of the CPUs belonging to this policy as obtained from
        the hardware (in kHz).

        This is expected to be the frequency the hardware actually runs at.
        If that frequency cannot be determined, this attribute should not
        be present.

``cpuinfo_max_freq``
        Maximum possible operating frequency the CPUs belonging to this policy
        can run at (in kHz).

``cpuinfo_min_freq``
        Minimum possible operating frequency the CPUs belonging to this policy
        can run at (in kHz).

``cpuinfo_transition_latency``
        The time it takes to switch the CPUs belonging to this policy from one
        P-state to another, in nanoseconds.

        If unknown or if known to be so high that the scaling driver does not
        work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`)
        will be returned by reads from this attribute.

``related_cpus``
        List of all (online and offline) CPUs belonging to this policy.

``scaling_available_governors``
        List of ``CPUFreq`` scaling governors present in the kernel that can
        be attached to this policy or (if the |intel_pstate| scaling driver is
        in use) list of scaling algorithms provided by the driver that can be
        applied to this policy.

        [Note that some governors are modular and it may be necessary to load a
        kernel module for the governor held by it to become available and be
        listed by this attribute.]

``scaling_cur_freq``
        Current frequency of all of the CPUs belonging to this policy (in kHz).

        In the majority of cases, this is the frequency of the last P-state
        requested by the scaling driver from the hardware using the scaling
        interface provided by it, which may or may not reflect the frequency
        the CPU is actually running at (due to hardware design and other
        limitations).

        Some architectures (e.g. ``x86``) may attempt to provide information
        more precisely reflecting the current CPU frequency through this
        attribute, but that still may not be the exact current CPU frequency as
        seen by the hardware at the moment.

``scaling_driver``
        The scaling driver currently in use.

``scaling_governor``
        The scaling governor currently attached to this policy or (if the
        |intel_pstate| scaling driver is in use) the scaling algorithm
        provided by the driver that is currently applied to this policy.

        This attribute is read-write and writing to it will cause a new scaling
        governor to be attached to this policy or a new scaling algorithm
        provided by the scaling driver to be applied to it (in the
        |intel_pstate| case), as indicated by the string written to this
        attribute (which must be one of the names listed by the
        ``scaling_available_governors`` attribute described above).

``scaling_max_freq``
        Maximum frequency the CPUs belonging to this policy are allowed to be
        running at (in kHz).

        This attribute is read-write and writing a string representing an
        integer to it will cause a new limit to be set (it must not be lower
        than the value of the ``scaling_min_freq`` attribute).

``scaling_min_freq``
        Minimum frequency the CPUs belonging to this policy are allowed to be
        running at (in kHz).

        This attribute is read-write and writing a string representing a
        non-negative integer to it will cause a new limit to be set (it must not
        be higher than the value of the ``scaling_max_freq`` attribute).

``scaling_setspeed``
        This attribute is functional only if the `userspace`_ scaling governor
        is attached to the given policy.

        It returns the last frequency requested by the governor (in kHz) or can
        be written to in order to set a new frequency for the policy.


Generic Scaling Governors
=========================

``CPUFreq`` provides generic scaling governors that can be used with all
scaling drivers. As stated before, each of them implements a single, possibly
parametrized, performance scaling algorithm.

Scaling governors are attached to policy objects and different policy objects
can be handled by different scaling governors at the same time (although that
may lead to suboptimal results in some cases).

The scaling governor for a given policy object can be changed at any time with
the help of the ``scaling_governor`` policy attribute in ``sysfs``.

Some governors expose ``sysfs`` attributes to control or fine-tune the scaling
algorithms implemented by them. Those attributes, referred to as governor
tunables, can be either global (system-wide) or per-policy, depending on the
scaling driver in use. If the driver requires governor tunables to be
per-policy, they are located in a subdirectory of each policy directory.
Otherwise, they are located in a subdirectory under
:file:`/sys/devices/system/cpu/cpufreq/`. In either case the name of the
subdirectory containing the governor tunables is the name of the governor
providing them.
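
The lookup rule for governor tunables given above can be sketched in shell as
follows. This is an illustrative helper only: the ``tunables_dir`` function
name and its optional base-directory argument are assumptions made for this
example and do not correspond to any existing tool.

.. code-block:: sh

    # Hypothetical helper: print the directory holding the tunables of
    # governor $2 as seen from policy $1 (for example "policy0").
    # $3 is the cpufreq sysfs directory (defaults to the usual location).
    tunables_dir() {
        base=${3:-/sys/devices/system/cpu/cpufreq}
        if [ -d "$base/$1/$2" ]; then
            # Per-policy tunables: a subdirectory of the policy directory.
            echo "$base/$1/$2"
        elif [ -d "$base/$2" ]; then
            # Global tunables: a subdirectory of the cpufreq directory.
            echo "$base/$2"
        else
            # The governor exposes no tunables or is not in use.
            return 1
        fi
    }

For instance, ``tunables_dir policy0 ondemand`` would print the directory
containing the `ondemand`_ tunables that apply to ``policy0``, provided that
the governor is in use and exposes any.
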

``performance``
---------------

When attached to a policy object, this governor causes the highest frequency,
within the ``scaling_max_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.

``powersave``
-------------

When attached to a policy object, this governor causes the lowest frequency,
within the ``scaling_min_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.

``userspace``
-------------

This governor does not do anything by itself. Instead, it allows user space
to set the CPU frequency for the policy it is attached to by writing to the
``scaling_setspeed`` attribute of that policy.

``schedutil``
-------------

This governor uses CPU utilization data available from the CPU scheduler. It
generally is regarded as a part of the CPU scheduler, so it can access the
scheduler's internal data structures directly.

It runs entirely in scheduler context, although in some cases it may need to
invoke the scaling driver asynchronously when it decides that the CPU frequency
should be changed for a given policy (that depends on whether or not the driver
is capable of changing the CPU frequency from scheduler context).

The actions of this governor for a particular CPU depend on the scheduling class
invoking its utilization update callback for that CPU. If it is invoked by the
RT or deadline scheduling classes, the governor will increase the frequency to
the allowed maximum (that is, the ``scaling_max_freq`` policy limit).
In turn,
if it is invoked by the CFS scheduling class, the governor will use the
Per-Entity Load Tracking (PELT) metric for the root control group of the
given CPU as the CPU utilization estimate (see the *Per-entity load tracking*
LWN.net article [1]_ for a description of the PELT mechanism). Then, the new
CPU frequency to apply is computed in accordance with the formula

        f = 1.25 * ``f_0`` * ``util`` / ``max``

where ``util`` is the PELT number, ``max`` is the theoretical maximum of
``util``, and ``f_0`` is either the maximum possible CPU frequency for the given
policy (if the PELT number is frequency-invariant), or the current CPU frequency
(otherwise).

This governor also employs a mechanism allowing it to temporarily bump up the
CPU frequency for tasks that have been waiting on I/O most recently, called
"IO-wait boosting". That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag
is passed by the scheduler to the governor callback which causes the frequency
to go up to the allowed maximum immediately and then draw back to the value
returned by the above formula over time.

This governor exposes only one tunable:

``rate_limit_us``
        Minimum time (in microseconds) that has to pass between two consecutive
        runs of governor computations (default: 1000 times the scaling driver's
        transition latency).

        The purpose of this tunable is to reduce the scheduler context overhead
        of the governor which might be excessive without it.
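
As a worked example of the frequency formula given earlier in this section, the
sketch below evaluates it in integer shell arithmetic. The ``schedutil_target``
name is hypothetical, the result is truncated to an integer, and the policy
limits are deliberately not applied, so this only illustrates the formula and
is not a model of the governor as a whole.

.. code-block:: sh

    # Hypothetical helper: f = 1.25 * f_0 * util / max, with 1.25 written
    # as 5/4 so that integer arithmetic can be used.
    # $1: f_0 (in kHz), $2: util, $3: max (nonzero, same scale as util)
    schedutil_target() {
        echo $(( ($1 * 5 * $2) / (4 * $3) ))
    }

With ``f_0`` equal to 1000000 kHz, ``schedutil_target 1000000 800 1000``
prints 1000000, which shows that a utilization of 80% of ``max`` already maps
to ``f_0`` because of the 1.25 factor.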

This governor generally is regarded as a replacement for the older `ondemand`_
and `conservative`_ governors (described below), as it is simpler and more
tightly integrated with the CPU scheduler, its overhead in terms of CPU context
switches and similar is less significant, and it uses the scheduler's own CPU
utilization metric, so in principle its decisions should not contradict the
decisions made by the other parts of the scheduler.

``ondemand``
------------

This governor uses CPU load as a CPU frequency selection metric.

In order to estimate the current CPU load, it measures the time elapsed between
consecutive invocations of its worker routine and computes the fraction of that
time in which the given CPU was not idle. The ratio of the non-idle (active)
time to the total CPU time is taken as an estimate of the load.

If this governor is attached to a policy shared by multiple CPUs, the load is
estimated for all of them and the greatest result is taken as the load estimate
for the entire policy.

The worker routine of this governor has to run in process context, so it is
invoked asynchronously (via a workqueue) and CPU P-states are updated from
there if necessary. As a result, the scheduler context overhead from this
governor is minimal, but it causes additional CPU context switches to happen
relatively often and the CPU P-state updates triggered by it can be relatively
irregular. Also, it affects its own CPU load metric by running code that
reduces the CPU idle time (even though the CPU idle time is only reduced very
slightly by it).
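
The load estimate described above (the active time divided by the total time,
with the greatest per-CPU value used for the whole policy) can be illustrated
with the following sketch. The function names are hypothetical and the values
are passed in as arguments, whereas the governor obtains them from kernel
accounting data.

.. code-block:: sh

    # Hypothetical helper: load of one CPU in percent over one sampling
    # interval.
    # $1: non-idle (active) time, $2: total elapsed time (same unit, nonzero)
    cpu_load_pct() {
        echo $(( 100 * $1 / $2 ))
    }

    # Hypothetical helper: load estimate for a policy, that is the greatest
    # of the per-CPU loads passed as arguments.
    policy_load_pct() {
        max=0
        for load in "$@"; do
            if [ "$load" -gt "$max" ]; then
                max=$load
            fi
        done
        echo "$max"
    }

For a CPU that was active for 30 ms out of a 120 ms interval,
``cpu_load_pct 30 120`` prints 25, and ``policy_load_pct 25 70 40`` prints 70.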

The governor generally selects CPU frequencies proportional to the estimated
load, so that the value of the ``cpuinfo_max_freq`` policy attribute corresponds
to the load of 1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy
attribute corresponds to the load of 0, unless the load exceeds a (configurable)
speedup threshold, in which case it will go straight for the highest frequency
it is allowed to use (the ``scaling_max_freq`` policy limit).

This governor exposes the following tunables:

``sampling_rate``
        This is how often the governor's worker routine should run, in
        microseconds.

        Typically, it is set to values of the order of 10000 (10 ms). Its
        default value is equal to the value of ``cpuinfo_transition_latency``
        for each policy this governor is attached to (but since the unit here
        is greater by 1000, this means that the time represented by
        ``sampling_rate`` is 1000 times greater than the transition latency by
        default).

        If this tunable is per-policy, the following shell command sets the time
        represented by it to be 750 times as high as the transition latency::

                # echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate

``up_threshold``
        If the estimated CPU load is above this value (in percent), the governor
        will set the frequency to the maximum value allowed for the policy.
        Otherwise, the selected frequency will be proportional to the estimated
        CPU load.

``ignore_nice_load``
        If set to 1 (default 0), it will cause the CPU load estimation code to
        treat the CPU time spent on executing tasks with "nice" levels greater
        than 0 as CPU idle time.

        This may be useful if there are tasks in the system that should not be
        taken into account when deciding what frequency to run the CPUs at.
        Then, to make that happen it is sufficient to increase the "nice" level
        of those tasks above 0 and set this attribute to 1.

``sampling_down_factor``
        Temporary multiplier, between 1 (default) and 100 inclusive, to apply to
        the ``sampling_rate`` value if the CPU load goes above ``up_threshold``.

        This causes the next execution of the governor's worker routine (after
        setting the frequency to the allowed maximum) to be delayed, so the
        frequency stays at the maximum level for a longer time.

        Frequency fluctuations in some bursty workloads may be avoided this way
        at the cost of additional energy spent on maintaining the maximum CPU
        capacity.

``powersave_bias``
        Reduction factor to apply to the original frequency target of the
        governor (including the maximum value used when the ``up_threshold``
        value is exceeded by the estimated CPU load) or sensitivity threshold
        for the AMD frequency sensitivity powersave bias driver
        (:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000
        inclusive.

        If the AMD frequency sensitivity powersave bias driver is not loaded,
        the effective frequency to apply is given by

                f * (1 - ``powersave_bias`` / 1000)

        where f is the governor's original frequency target. The default value
        of this attribute is 0 in that case.

        If the AMD frequency sensitivity powersave bias driver is loaded, the
        value of this attribute is 400 by default and it is used in a different
        way.

        On Family 16h (and later) AMD processors there is a mechanism to get a
        measured workload sensitivity, between 0 and 100% inclusive, from the
        hardware. That value can be used to estimate how the performance of the
        workload running on a CPU will change in response to frequency changes.

        The performance of a workload with the sensitivity of 0 (memory-bound or
        IO-bound) is not expected to increase at all as a result of increasing
        the CPU frequency, whereas workloads with the sensitivity of 100%
        (CPU-bound) are expected to perform much better if the CPU frequency is
        increased.

        If the workload sensitivity is less than the threshold represented by
        the ``powersave_bias`` value, the sensitivity powersave bias driver
        will cause the governor to select a frequency lower than its original
        target, so as to avoid over-provisioning workloads that will not benefit
        from running at higher CPU frequencies.

``conservative``
----------------

This governor uses CPU load as a CPU frequency selection metric.

It estimates the CPU load in the same way as the `ondemand`_ governor described
above, but the CPU frequency selection algorithm implemented by it is different.

Namely, it avoids changing the frequency significantly over short time intervals
which may not be suitable for systems with limited power supply capacity (e.g.
battery-powered). To achieve that, it changes the frequency in relatively
small steps, one step at a time, up or down - depending on whether or not a
(configurable) threshold has been exceeded by the estimated CPU load.

This governor exposes the following tunables:

``freq_step``
        Frequency step in percent of the maximum frequency the governor is
        allowed to set (the ``scaling_max_freq`` policy limit), between 0 and
        100 (5 by default).

        This is how much the frequency is allowed to change in one go. Setting
        it to 0 will cause the default frequency step (5 percent) to be used
        and setting it to 100 effectively causes the governor to periodically
        switch the frequency between the ``scaling_min_freq`` and
        ``scaling_max_freq`` policy limits.

``down_threshold``
        Threshold value (in percent, 20 by default) used to determine the
        frequency change direction.

        If the estimated CPU load is greater than this value, the frequency will
        go up (by ``freq_step``). If the load is less than this value (and the
        ``sampling_down_factor`` mechanism is not in effect), the frequency will
        go down. Otherwise, the frequency will not be changed.

``sampling_down_factor``
        Frequency decrease deferral factor, between 1 (default) and 10
        inclusive.

        It effectively causes the frequency to go down ``sampling_down_factor``
        times slower than it ramps up.


Frequency Boost Support
=======================

Background
----------

Some processors support a mechanism to raise the operating frequency of some
cores in a multicore package temporarily (and above the sustainable frequency
threshold for the whole package) under certain conditions, for example if the
whole chip is not fully utilized and below its intended thermal or power budget.

Different names are used by different vendors to refer to this functionality.
For Intel processors it is referred to as "Turbo Boost", AMD calls it
"Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on.
As a rule, it also is implemented differently by different vendors. The simple
term "frequency boost" is used here for brevity to refer to all of those
implementations.

The frequency boost mechanism may be either hardware-based or software-based.
If it is hardware-based (e.g. on x86), the decision to trigger the boosting is
made by the hardware (although in general it requires the hardware to be put
into a special state in which it can control the CPU frequency within certain
limits). If it is software-based (e.g. on ARM), the scaling driver decides
whether or not to trigger boosting and when to do that.

The ``boost`` File in ``sysfs``
-------------------------------

This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls
the "boost" setting for the whole system. It is not present if the underlying
scaling driver does not support the frequency boost mechanism (or supports it,
but provides a driver-specific interface for controlling it, like
|intel_pstate|).

If the value in this file is 1, the frequency boost mechanism is enabled. This
means that either the hardware can be put into states in which it is able to
trigger boosting (in the hardware-based case), or the software is allowed to
trigger boosting (in the software-based case). It does not mean that boosting
is actually in use at the moment on any CPUs in the system. It only means a
permission to use the frequency boost mechanism (which still may never be used
for other reasons).

If the value in this file is 0, the frequency boost mechanism is disabled and
cannot be used at all.

The only values that can be written to this file are 0 and 1.

Rationale for Boost Control Knob
--------------------------------

The frequency boost mechanism is generally intended to help to achieve optimum
CPU performance on time scales below software resolution (e.g. below the
scheduler tick interval) and it is demonstrably suitable for many workloads, but
it may lead to problems in certain situations.

For this reason, many systems make it possible to disable the frequency boost
mechanism in the platform firmware (BIOS) setup, but that requires the system to
be restarted for the setting to be adjusted as desired, which may not be
practical at least in some cases. For example:

 1. Boosting means overclocking the processor, although under controlled
    conditions. Generally, the processor's energy consumption increases
    as a result of increasing its frequency and voltage, even temporarily.
    That may not be desirable on systems that switch to power sources of
    limited capacity, such as batteries, so the ability to disable the boost
    mechanism while the system is running may help there (but that depends on
    the workload too).

 2. In some situations deterministic behavior is more important than
    performance or energy consumption (or both) and the ability to disable
    boosting while the system is running may be useful then.

 3. To examine the impact of the frequency boost mechanism itself, it is useful
    to be able to run tests with and without boosting, preferably without
    restarting the system in the meantime.

 4. Reproducible results are important when running benchmarks.  Since
    the boosting functionality depends on the load of the whole package,
    single-thread performance may vary because of it, which may sometimes lead
    to unreproducible results.  That can be avoided by disabling the
    frequency boost mechanism before running benchmarks sensitive to that
    issue.

Legacy AMD ``cpb`` Knob
-----------------------

The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to
the global ``boost`` one.  It is used for disabling/enabling the "Core
Performance Boost" feature of some AMD processors.

If present, that knob is located in every ``CPUFreq`` policy directory in
``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called
``cpb``, which suggests a more fine-grained control interface.  The actual
implementation, however, works on a system-wide basis and setting that knob
for one policy causes the same value of it to be set for all of the other
policies at the same time.
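
As an illustration only, the following Python sketch reads and writes the
global ``boost`` file and lists the policy directories that expose the ``cpb``
knob.  The helper names (``read_boost``, ``write_boost`` and
``policies_with_cpb``) are hypothetical and not part of the kernel or of any
library.  The directory is passed as a parameter so that the code can be tried
against a scratch directory, and writing to the real file is assumed to
require root privileges.

```python
from pathlib import Path

# Location of the global CPUFreq directory, as given in the text above.
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpufreq")


def read_boost(cpufreq_dir=CPUFREQ_DIR):
    """Return True (1) or False (0) for the boost setting.

    Returns None if the file is absent, for example when the scaling
    driver does not support boosting or uses its own interface.
    """
    try:
        value = (Path(cpufreq_dir) / "boost").read_text().strip()
    except FileNotFoundError:
        return None
    return value == "1"


def write_boost(enabled, cpufreq_dir=CPUFREQ_DIR):
    """Write 1 or 0, the only values the boost file accepts."""
    (Path(cpufreq_dir) / "boost").write_text("1" if enabled else "0")


def policies_with_cpb(cpufreq_dir=CPUFREQ_DIR):
    """Return the names of the policy directories containing a cpb knob."""
    return sorted(p.parent.name for p in Path(cpufreq_dir).glob("policy*/cpb"))
```

For example, a benchmark wrapper could call ``read_boost()`` to record the
current setting, call ``write_boost(False)`` before the run and restore the
recorded value afterwards.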

That knob is still supported on AMD processors that support its underlying
hardware feature, but it may be configured out of the kernel (via the
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global
``boost`` knob is present regardless.  Thus it is always possible to use the
``boost`` knob instead of the ``cpb`` one, which is highly recommended, as that
is more consistent with what all of the other systems do (and the ``cpb`` knob
may not be supported any more in the future).

The ``cpb`` knob is never present for any processors without the underlying
hardware feature (e.g. all Intel ones), even if the
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set.


References
==========

.. [1] Jonathan Corbet, *Per-entity load tracking*,
       https://lwn.net/Articles/531853/