.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`

=======================
CPU Performance Scaling
=======================

:Copyright: |copy| 2017 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


The Concept of CPU Performance Scaling
======================================

The majority of modern processors are capable of operating in a number of
different clock frequency and voltage configurations, often referred to as
Operating Performance Points or P-states (in ACPI terminology).  As a rule,
the higher the clock frequency and the higher the voltage, the more instructions
can be retired by the CPU over a unit of time, but also the higher the clock
frequency and the higher the voltage, the more energy is consumed over a unit of
time (or the more power is drawn) by the CPU in the given P-state.  Therefore
there is a natural tradeoff between the CPU capacity (the number of instructions
that can be executed over a unit of time) and the power drawn by the CPU.

In some situations it is desirable or even necessary to run the program as fast
as possible and then there is no reason to use any P-states different from the
highest one (i.e. the highest-performance frequency/voltage configuration
available).  In some other cases, however, it may not be necessary to execute
instructions so quickly and maintaining the highest available CPU capacity for a
relatively long time without utilizing it entirely may be regarded as wasteful.
It also may not be physically possible to maintain maximum CPU capacity for too
long for thermal or power supply capacity reasons or similar.  To cover those
cases, there are hardware interfaces allowing CPUs to be switched between
different frequency/voltage configurations or (in the ACPI terminology) to be
put into different P-states.

Typically, they are used along with algorithms to estimate the required CPU
capacity, so as to decide which P-states to put the CPUs into.  Of course, since
the utilization of the system generally changes over time, that has to be done
repeatedly on a regular basis.  The activity by which this happens is referred
to as CPU performance scaling or CPU frequency scaling (because it involves
adjusting the CPU clock frequency).


CPU Performance Scaling in Linux
================================

The Linux kernel supports CPU performance scaling by means of the ``CPUFreq``
(CPU Frequency scaling) subsystem that consists of three layers of code: the
core, scaling governors and scaling drivers.

The ``CPUFreq`` core provides the common code infrastructure and user space
interfaces for all platforms that support CPU performance scaling.  It defines
the basic framework in which the other components operate.

Scaling governors implement algorithms to estimate the required CPU capacity.
As a rule, each governor implements one, possibly parametrized, scaling
algorithm.

Scaling drivers talk to the hardware.  They provide scaling governors with
information on the available P-states (or P-state ranges in some cases) and
access platform-specific hardware interfaces to change CPU P-states as requested
by scaling governors.

In principle, all available scaling governors can be used with every scaling
driver.  That design is based on the observation that the information used by
performance scaling algorithms for P-state selection can be represented in a
platform-independent form in the majority of cases, so it should be possible
to use the same performance scaling algorithm implemented in exactly the same
way regardless of which scaling driver is used.  Consequently, the same set of
scaling governors should be suitable for every supported platform.

However, that observation may not hold for performance scaling algorithms
based on information provided by the hardware itself, for example through
feedback registers, as that information is typically specific to the hardware
interface it comes from and may not be easily represented in an abstract,
platform-independent way.  For this reason, ``CPUFreq`` allows scaling drivers
to bypass the governor layer and implement their own performance scaling
algorithms.  That is done by the |intel_pstate| scaling driver.


``CPUFreq`` Policy Objects
==========================

In some cases the hardware interface for P-state control is shared by multiple
CPUs.  That is, for example, the same register (or set of registers) is used to
control the P-state of multiple CPUs at the same time and writing to it affects
all of those CPUs simultaneously.

Sets of CPUs sharing hardware P-state control interfaces are represented by
``CPUFreq`` as struct cpufreq_policy objects.  For consistency,
struct cpufreq_policy is also used when there is only one CPU in the given
set.

The ``CPUFreq`` core maintains a pointer to a struct cpufreq_policy object for
every CPU in the system, including CPUs that are currently offline.  If multiple
CPUs share the same hardware P-state control interface, all of the pointers
corresponding to them point to the same struct cpufreq_policy object.

``CPUFreq`` uses struct cpufreq_policy as its basic data type and the design
of its user space interface is based on the policy concept.
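
The relationship between CPUs and policy objects can be pictured with a small
user space model.  The sketch below is only an illustration: the ``Policy``
class and the ``build_policy_pointers()`` helper are hypothetical names and are
not part of the kernel.

```python
from dataclasses import dataclass


@dataclass
class Policy:
    """Toy stand-in for struct cpufreq_policy (the field name is illustrative)."""
    cpus: frozenset  # every CPU sharing one P-state control interface


def build_policy_pointers(shared_sets):
    """Map each CPU number to the Policy object of the set it belongs to."""
    pointers = {}
    for cpu_set in shared_sets:
        policy = Policy(cpus=frozenset(cpu_set))
        for cpu in policy.cpus:
            pointers[cpu] = policy  # CPUs in one set share a single object
    return pointers


# Assumed topology: CPUs 0-1 and 2-3 share an interface, CPU 4 stands alone.
pointers = build_policy_pointers([{0, 1}, {2, 3}, {4}])
```

In this model ``pointers[0]`` and ``pointers[1]`` refer to the very same
object, while the single-CPU set still gets a policy object of its own.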


CPU Initialization
==================

First of all, a scaling driver has to be registered for ``CPUFreq`` to work.
It is only possible to register one scaling driver at a time, so the scaling
driver is expected to be able to handle all CPUs in the system.

The scaling driver may be registered before or after CPU registration.  If
CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to
take a note of all of the already registered CPUs during the registration of the
scaling driver.  In turn, if any CPUs are registered after the registration of
the scaling driver, the ``CPUFreq`` core will be invoked to take note of them
at their registration time.

In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it
has not seen so far as soon as it is ready to handle that CPU.  [Note that the
logical CPU may be a physical single-core processor, or a single core in a
multicore processor, or a hardware thread in a physical processor or processor
core.  In what follows "CPU" always means "logical CPU" unless explicitly stated
otherwise and the word "processor" is used to refer to the physical part
possibly including multiple logical CPUs.]

Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set
for the given CPU and if so, it skips the policy object creation.  Otherwise,
a new policy object is created and initialized, which involves the creation of
a new policy directory in ``sysfs``, and the policy pointer corresponding to
the given CPU is set to the new policy object's address in memory.

Next, the scaling driver's ``->init()`` callback is invoked with the policy
pointer of the new CPU passed to it as the argument.  That callback is expected
to initialize the performance scaling hardware interface for the given CPU (or,
more precisely, for the set of CPUs sharing the hardware interface it belongs
to, represented by its policy object) and, if the policy object it has been
called for is new, to set parameters of the policy, like the minimum and maximum
frequencies supported by the hardware, the table of available frequencies (if
the set of supported P-states is not a continuous range), and the mask of CPUs
that belong to the same policy (including both online and offline CPUs).  That
mask is then used by the core to populate the policy pointers for all of the
CPUs in it.
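
The order of the steps described so far can be sketched in user space as
follows.  This is a simplified model, not kernel code: ``ToyDriver``, its
pairing of CPUs, the frequency values and the dictionary-based policy are all
assumptions made for illustration.

```python
class ToyDriver:
    """Hypothetical scaling driver in which CPUs are grouped in pairs."""

    def init(self, policy, cpu):
        base = cpu - cpu % 2
        policy["cpus"] = {base, base + 1}  # online and offline CPUs alike
        policy["min_khz"] = 800_000        # made-up hardware limits
        policy["max_khz"] = 3_000_000


def note_cpu(policy_ptr, driver, cpu):
    """Mimic the core taking note of a CPU it has not seen so far."""
    if cpu in policy_ptr:          # pointer already set: skip creation
        return policy_ptr[cpu]
    policy = {}                    # new policy object
    driver.init(policy, cpu)       # the driver fills in limits and CPU mask
    for member in policy["cpus"]:  # the core populates pointers from the mask
        policy_ptr[member] = policy
    return policy


policy_ptr = {}
driver = ToyDriver()
first = note_cpu(policy_ptr, driver, 0)
second = note_cpu(policy_ptr, driver, 1)  # reuses the policy created for CPU 0
```

The second call finds the pointer for CPU 1 already populated from the mask, so
no new policy object is created for it.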

The next major initialization step for a new policy object is to attach a
scaling governor to it (to begin with, that is the default scaling governor
determined by the kernel command line or configuration, but it may be changed
later via ``sysfs``).  First, a pointer to the new policy object is passed to
the governor's ``->init()`` callback which is expected to initialize all of the
data structures necessary to handle the given policy and, possibly, to add
a governor ``sysfs`` interface to it.  Next, the governor is started by
invoking its ``->start()`` callback.

That callback is expected to register per-CPU utilization update callbacks for
all of the online CPUs belonging to the given policy with the CPU scheduler.
The utilization update callbacks will be invoked by the CPU scheduler on
important events, like task enqueue and dequeue, on every iteration of the
scheduler tick or generally whenever the CPU utilization may change (from the
scheduler's perspective).  They are expected to carry out computations needed
to determine the P-state to use for the given policy going forward and to
invoke the scaling driver to make changes to the hardware in accordance with
the P-state selection.  The scaling driver may be invoked directly from
scheduler context or asynchronously, via a kernel thread or workqueue, depending
on the configuration and capabilities of the scaling driver and the governor.

Similar steps are taken for policy objects that are not new, but were "inactive"
previously, meaning that all of the CPUs belonging to them were offline.  The
only practical difference in that case is that the ``CPUFreq`` core will attempt
to use the scaling governor previously used with the policy that became
"inactive" (and is re-initialized now) instead of the default governor.

In turn, if a previously offline CPU is being brought back online, but some
other CPUs sharing the policy object with it are online already, there is no
need to re-initialize the policy object at all.  In that case, it only is
necessary to restart the scaling governor so that it can take the new online CPU
into account.  That is achieved by invoking the governor's ``->stop()`` and
``->start()`` callbacks, in this order, for the entire policy.

As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
Consequently, if |intel_pstate| is used, scaling governors are not attached to
new policy objects.  Instead, the driver's ``->setpolicy()`` callback is invoked
to register per-CPU utilization update callbacks for each policy.  These
callbacks are invoked by the CPU scheduler in the same way as for scaling
governors, but in the |intel_pstate| case they both determine the P-state to
use and change the hardware configuration accordingly in one go from scheduler
context.

The policy objects created during CPU initialization and other data structures
associated with them are torn down when the scaling driver is unregistered
(which happens when the kernel module containing it is unloaded, for example) or
when the last CPU belonging to the given policy is unregistered.


Policy Interface in ``sysfs``
=============================

During the initialization of the kernel, the ``CPUFreq`` core creates a
``sysfs`` directory (kobject) called ``cpufreq`` under
:file:`/sys/devices/system/cpu/`.

That directory contains a ``policyX`` subdirectory (where ``X`` represents an
integer number) for every policy object maintained by the ``CPUFreq`` core.
Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links
under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer
that may be different from the one represented by ``X``) for all of the CPUs
associated with (or belonging to) the given policy.  The ``policyX`` directories
in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific
attributes (files) to control ``CPUFreq`` behavior for the corresponding policy
objects (that is, for all of the CPUs associated with them).
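
The layout described above can be explored from user space by resolving the
``cpufreq`` symbolic links.  The following is a minimal sketch; the
``policies_by_cpu()`` helper and its ``cpu_root`` parameter are hypothetical
names introduced here so that the function can also be pointed at a test
directory.

```python
import os
import re


def policies_by_cpu(cpu_root="/sys/devices/system/cpu"):
    """Return {"cpuY": "policyX"} by resolving each cpuY/cpufreq symlink."""
    mapping = {}
    for entry in sorted(os.listdir(cpu_root)):
        if not re.fullmatch(r"cpu\d+", entry):
            continue               # skip cpufreq, cpuidle and similar entries
        link = os.path.join(cpu_root, entry, "cpufreq")
        if os.path.exists(link):   # no link if the CPU has no policy
            mapping[entry] = os.path.basename(os.path.realpath(link))
    return mapping
```

CPUs that share a policy object map to the same ``policyX`` name in the
returned dictionary.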

Some of those attributes are generic.  They are created by the ``CPUFreq`` core
and their behavior generally does not depend on what scaling driver is in use
and what scaling governor is attached to the given policy.  Some scaling drivers
also add driver-specific attributes to the policy directories in ``sysfs`` to
control policy-specific aspects of driver behavior.

The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/`
are the following:

``affected_cpus``
	List of online CPUs belonging to this policy (i.e. sharing the hardware
	performance scaling interface represented by the ``policyX`` policy
	object).

``bios_limit``
	If the platform firmware (BIOS) tells the OS to apply an upper limit to
	CPU frequencies, that limit will be reported through this attribute (if
	present).

	The existence of the limit may be a result of some (often unintentional)
	BIOS settings, restrictions coming from a service processor or other
	BIOS/HW-based mechanisms.

	This does not cover ACPI thermal limitations which can be discovered
	through a generic thermal driver.

	This attribute is not present if the scaling driver in use does not
	support it.

``cpuinfo_cur_freq``
	Current frequency of the CPUs belonging to this policy as obtained from
	the hardware (in kHz).

	This is expected to be the frequency the hardware actually runs at.
	If that frequency cannot be determined, this attribute should not
	be present.

``cpuinfo_max_freq``
	Maximum possible operating frequency the CPUs belonging to this policy
	can run at (in kHz).

``cpuinfo_min_freq``
	Minimum possible operating frequency the CPUs belonging to this policy
	can run at (in kHz).

``cpuinfo_transition_latency``
	The time it takes to switch the CPUs belonging to this policy from one
	P-state to another, in nanoseconds.

	If unknown or if known to be so high that the scaling driver does not
	work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`)
	will be returned by reads from this attribute.

``related_cpus``
	List of all (online and offline) CPUs belonging to this policy.

``scaling_available_governors``
	List of ``CPUFreq`` scaling governors present in the kernel that can
	be attached to this policy or (if the |intel_pstate| scaling driver is
	in use) list of scaling algorithms provided by the driver that can be
	applied to this policy.

	[Note that some governors are modular and it may be necessary to load a
	kernel module for the governor held by it to become available and be
	listed by this attribute.]

``scaling_cur_freq``
	Current frequency of all of the CPUs belonging to this policy (in kHz).

	In the majority of cases, this is the frequency of the last P-state
	requested by the scaling driver from the hardware using the scaling
	interface provided by it, which may or may not reflect the frequency
	the CPU is actually running at (due to hardware design and other
	limitations).

	Some architectures (e.g. ``x86``) may attempt to provide information
	more precisely reflecting the current CPU frequency through this
	attribute, but that still may not be the exact current CPU frequency as
	seen by the hardware at the moment.

``scaling_driver``
	The scaling driver currently in use.

``scaling_governor``
	The scaling governor currently attached to this policy or (if the
	|intel_pstate| scaling driver is in use) the scaling algorithm
	provided by the driver that is currently applied to this policy.

	This attribute is read-write and writing to it will cause a new scaling
	governor to be attached to this policy or a new scaling algorithm
	provided by the scaling driver to be applied to it (in the
	|intel_pstate| case), as indicated by the string written to this
	attribute (which must be one of the names listed by the
	``scaling_available_governors`` attribute described above).

``scaling_max_freq``
	Maximum frequency the CPUs belonging to this policy are allowed to be
	running at (in kHz).

	This attribute is read-write and writing a string representing an
	integer to it will cause a new limit to be set (it must not be lower
	than the value of the ``scaling_min_freq`` attribute).

``scaling_min_freq``
	Minimum frequency the CPUs belonging to this policy are allowed to be
	running at (in kHz).

	This attribute is read-write and writing a string representing a
	non-negative integer to it will cause a new limit to be set (it must not
	be higher than the value of the ``scaling_max_freq`` attribute).

``scaling_setspeed``
	This attribute is functional only if the `userspace`_ scaling governor
	is attached to the given policy.

	It returns the last frequency requested by the governor (in kHz) or can
	be written to in order to set a new frequency for the policy.
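
Because several of the attributes listed above may be absent (``bios_limit``
and ``cpuinfo_cur_freq``, for example), user space code reading them should
tolerate missing files.  The sketch below shows one way of doing that; the
``read_policy_attrs()`` helper is a hypothetical name and the values are
returned as raw strings without any interpretation.

```python
import os


def read_policy_attrs(policy_dir, names):
    """Read the named attributes; attributes that are absent map to None."""
    values = {}
    for name in names:
        try:
            with open(os.path.join(policy_dir, name)) as attr:
                values[name] = attr.read().strip()
        except OSError:            # e.g. bios_limit without driver support
            values[name] = None
    return values
```

It might be called with
:file:`/sys/devices/system/cpu/cpufreq/policy0` and a list such as
``["scaling_cur_freq", "bios_limit"]``, provided that ``policy0`` exists on the
system in question.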


Generic Scaling Governors
=========================

``CPUFreq`` provides generic scaling governors that can be used with all
scaling drivers.  As stated before, each of them implements a single, possibly
parametrized, performance scaling algorithm.

Scaling governors are attached to policy objects and different policy objects
can be handled by different scaling governors at the same time (although that
may lead to suboptimal results in some cases).

The scaling governor for a given policy object can be changed at any time with
the help of the ``scaling_governor`` policy attribute in ``sysfs``.
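
A governor change from user space might look like the sketch below.  The
``set_governor()`` helper is a hypothetical name, it assumes that
``scaling_available_governors`` holds a whitespace-separated list of names, and
writing to the real attribute generally requires sufficient privileges.

```python
import os


def set_governor(policy_dir, governor):
    """Attach `governor` to a policy after checking that it is listed."""
    listing = os.path.join(policy_dir, "scaling_available_governors")
    with open(listing) as attr:
        available = attr.read().split()
    if governor not in available:
        raise ValueError(f"{governor!r} is not one of {available}")
    with open(os.path.join(policy_dir, "scaling_governor"), "w") as attr:
        attr.write(governor)
```

Checking the name first gives a clearer error than a failed write, and it also
covers modular governors that have not been loaded yet.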

Some governors expose ``sysfs`` attributes to control or fine-tune the scaling
algorithms implemented by them.  Those attributes, referred to as governor
tunables, can be either global (system-wide) or per-policy, depending on the
scaling driver in use.  If the driver requires governor tunables to be
per-policy, they are located in a subdirectory of each policy directory.
Otherwise, they are located in a subdirectory under
:file:`/sys/devices/system/cpu/cpufreq/`.  In either case the name of the
subdirectory containing the governor tunables is the name of the governor
providing them.
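
Since the location of the tunables depends on the scaling driver, a script may
simply probe both places.  The following sketch does that; ``tunables_dir()``
and its parameters are hypothetical names, and ``policy0`` is only an example
policy directory.

```python
import os

CPUFREQ_ROOT = "/sys/devices/system/cpu/cpufreq"


def tunables_dir(governor, policy="policy0", root=CPUFREQ_ROOT):
    """Return the governor tunables directory, or None if none is found."""
    per_policy = os.path.join(root, policy, governor)
    system_wide = os.path.join(root, governor)
    for candidate in (per_policy, system_wide):
        if os.path.isdir(candidate):
            return candidate
    return None
```

The per-policy location is probed first, so the global directory is returned
only when the policy directory has no subdirectory named after the governor.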

``performance``
---------------

When attached to a policy object, this governor causes the highest frequency,
within the ``scaling_max_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.

``powersave``
-------------

When attached to a policy object, this governor causes the lowest frequency,
within the ``scaling_min_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.
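
The behavior of the ``performance`` and ``powersave`` governors amounts to
picking one end of the range allowed by the policy limits.  The sketch below
models that choice; ``static_target_khz()`` is a hypothetical helper and not a
kernel function.

```python
def static_target_khz(governor, scaling_min_khz, scaling_max_khz):
    """Frequency requested by the two static governors for given limits."""
    if governor == "performance":
        return scaling_max_khz   # highest frequency within the limits
    if governor == "powersave":
        return scaling_min_khz   # lowest frequency within the limits
    raise ValueError(f"not a static governor: {governor!r}")
```

In this model the function would be evaluated again whenever either of the two
limits changes, which mirrors the conditions under which the request is
repeated.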

``userspace``
-------------

This governor does not do anything by itself.  Instead, it allows user space
to set the CPU frequency for the policy it is attached to by writing to the
``scaling_setspeed`` attribute of that policy.

``schedutil``
-------------

This governor uses CPU utilization data available from the CPU scheduler.  It
generally is regarded as a part of the CPU scheduler, so it can access the
scheduler's internal data structures directly.

It runs entirely in scheduler context, although in some cases it may need to
invoke the scaling driver asynchronously when it decides that the CPU frequency
should be changed for a given policy (that depends on whether or not the driver
is capable of changing the CPU frequency from scheduler context).

The actions of this governor for a particular CPU depend on the scheduling class
invoking its utilization update callback for that CPU.  If it is invoked by the
RT or deadline scheduling classes, the governor will increase the frequency to
the allowed maximum (that is, the ``scaling_max_freq`` policy limit).  In turn,
if it is invoked by the CFS scheduling class, the governor will use the
Per-Entity Load Tracking (PELT) metric for the root control group of the
given CPU as the CPU utilization estimate (see the *Per-entity load tracking*
LWN.net article [1]_ for a description of the PELT mechanism).  Then, the new
CPU frequency to apply is computed in accordance with the formula

	f = 1.25 * ``f_0`` * ``util`` / ``max``

where ``util`` is the PELT number, ``max`` is the theoretical maximum of
``util``, and ``f_0`` is either the maximum possible CPU frequency for the given
policy (if the PELT number is frequency-invariant), or the current CPU frequency
(otherwise).
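
The formula above can be evaluated directly, as in the following sketch.  The
function name and its parameters are hypothetical, and clamping the result to
the policy limits is an assumption made here for illustration rather than
something stated by the formula itself.

```python
def schedutil_next_freq(util, max_util, f0_khz, min_khz, max_khz):
    """Evaluate f = 1.25 * f_0 * util / max, clamped to the policy limits."""
    raw_khz = 1.25 * f0_khz * util / max_util
    return int(min(max(raw_khz, min_khz), max_khz))


# Half-utilized CPU, f_0 = 2 GHz, limits of 800 MHz and 2 GHz (made-up values).
half_loaded = schedutil_next_freq(512, 1024, 2_000_000, 800_000, 2_000_000)
```

With these inputs the unclamped result is 1.25 * 2000000 * 512 / 1024 = 1250000
kHz, which lies within the limits.  The 1.25 factor means that the formula
reaches ``f_0`` already at 80% utilization.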

This governor also employs a mechanism allowing it to temporarily bump up the
CPU frequency for tasks that have been waiting on I/O most recently, called
"IO-wait boosting".  That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag
is passed by the scheduler to the governor callback, which causes the frequency
to go up to the allowed maximum immediately and then draw back to the value
returned by the above formula over time.
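
The shape of that behavior can be sketched in Python as follows.  The text only
says that the boost is withdrawn "over time", so the halving of the boost on
every update without the flag is an assumption made for illustration, and the
function name ``iowait_boost_update`` is hypothetical.

```python
def iowait_boost_update(boost, formula_freq, max_freq, iowait_flag):
    """Return (frequency to apply, boost value to carry to the next update)."""
    if iowait_flag:
        # Flag passed by the scheduler: go to the allowed maximum immediately.
        boost = max_freq
    else:
        # ASSUMPTION for illustration: halve the boost on every update
        # without the flag, so that it is withdrawn gradually.
        boost //= 2
    return max(formula_freq, boost), boost


# Three consecutive updates; only the first one carries the IO-wait flag.
boost = 0
for flag in (True, False, False):
    freq, boost = iowait_boost_update(boost, 800_000, 2_000_000, flag)
    print(freq)
```

Taking the greater of the boost and the formula value ensures that the applied
frequency never drops below what the utilization formula asks for.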

This governor exposes only one tunable:

``rate_limit_us``
	Minimum time (in microseconds) that has to pass between two consecutive
	runs of governor computations (default: 1000 times the scaling driver's
	transition latency).

	The purpose of this tunable is to reduce the scheduler context overhead
	of the governor which might be excessive without it.
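
The effect of ``rate_limit_us`` can be pictured with the minimal Python sketch
below.  The function ``governor_may_run`` and its timestamp arguments are
hypothetical; the sketch only shows the comparison implied by the description
of the tunable.

```python
def governor_may_run(now_us, last_run_us, rate_limit_us):
    """Return True if at least rate_limit_us microseconds have elapsed
    since the previous run of the governor computations."""
    return now_us - last_run_us >= rate_limit_us


# With a 1000 us limit, an update arriving 500 us after the last run is skipped.
print(governor_may_run(1_500, 1_000, 1_000))
```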

This governor generally is regarded as a replacement for the older `ondemand`_
and `conservative`_ governors (described below), as it is simpler and more
tightly integrated with the CPU scheduler, its overhead in terms of CPU context
switches and similar is less significant, and it uses the scheduler's own CPU
utilization metric, so in principle its decisions should not contradict the
decisions made by the other parts of the scheduler.

``ondemand``
------------

This governor uses CPU load as a CPU frequency selection metric.

In order to estimate the current CPU load, it measures the time elapsed between
consecutive invocations of its worker routine and computes the fraction of that
time in which the given CPU was not idle.  The ratio of the non-idle (active)
time to the total CPU time is taken as an estimate of the load.

If this governor is attached to a policy shared by multiple CPUs, the load is
estimated for all of them and the greatest result is taken as the load estimate
for the entire policy.
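
The load estimate described above can be written down as a short Python
sketch.  The function names and the ``(elapsed_us, idle_us)`` sample layout are
hypothetical and chosen only for illustration.

```python
def cpu_load(elapsed_us, idle_us):
    """Fraction of the sampling interval in which the CPU was not idle."""
    if elapsed_us <= 0:
        raise ValueError("elapsed_us must be positive")
    return (elapsed_us - idle_us) / elapsed_us


def policy_load(samples):
    """Greatest per-CPU load among the CPUs sharing one policy.

    samples is an iterable of (elapsed_us, idle_us) pairs, one per CPU.
    """
    return max(cpu_load(elapsed, idle) for elapsed, idle in samples)


# Two CPUs sharing a policy: one mostly idle, one mostly busy.
print(policy_load([(10_000, 7_500), (10_000, 2_000)]))
```

Taking the maximum means that a single busy CPU is enough to raise the
frequency for every CPU sharing the policy.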

The worker routine of this governor has to run in process context, so it is
invoked asynchronously (via a workqueue) and CPU P-states are updated from
there if necessary.  As a result, the scheduler context overhead from this
governor is minimal, but it causes additional CPU context switches to happen
relatively often and the CPU P-state updates triggered by it can be relatively
irregular.  Also, it affects its own CPU load metric by running code that
reduces the CPU idle time (even though the CPU idle time is only reduced very
slightly by it).

It generally selects CPU frequencies proportional to the estimated load, so that
the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of
1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute
corresponds to the load of 0, unless the load exceeds a (configurable)
speedup threshold, in which case it will go straight for the highest frequency
it is allowed to use (the ``scaling_max_freq`` policy limit).
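
A minimal Python sketch of that selection rule is shown below.  The function
name ``ondemand_target_freq`` is hypothetical, the load and the threshold are
expressed in percent, and the sketch assumes a plain linear mapping between
``cpuinfo_min_freq`` and ``cpuinfo_max_freq``; any further clamping to the
policy limits is omitted.

```python
def ondemand_target_freq(load_pct, up_threshold_pct,
                         cpuinfo_min_freq, cpuinfo_max_freq,
                         scaling_max_freq):
    """Pick a frequency (kHz) for an estimated load given in percent."""
    if load_pct > up_threshold_pct:
        # Above the speedup threshold: go straight to the policy maximum.
        return scaling_max_freq
    # Otherwise map load 0 to cpuinfo_min_freq and 100 to cpuinfo_max_freq.
    span = cpuinfo_max_freq - cpuinfo_min_freq
    return cpuinfo_min_freq + load_pct * span // 100


# 50% load on a CPU that ranges from 800 MHz to 2.4 GHz.
print(ondemand_target_freq(50, 80, 800_000, 2_400_000, 2_400_000))
```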

This governor exposes the following tunables:

``sampling_rate``
	This is how often the governor's worker routine should run, in
	microseconds.

	Typically, it is set to values of the order of 10000 (10 ms).  Its
	default value is equal to the value of ``cpuinfo_transition_latency``
	for each policy this governor is attached to (but since
	``sampling_rate`` is expressed in microseconds and
	``cpuinfo_transition_latency`` in nanoseconds, this means that the time
	represented by ``sampling_rate`` is 1000 times greater than the
	transition latency by default).

	If this tunable is per-policy, the following shell command sets the time
	represented by it to be 750 times as high as the transition latency::

		# echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate

``up_threshold``
	If the estimated CPU load is above this value (in percent), the governor
	will set the frequency to the maximum value allowed for the policy.
	Otherwise, the selected frequency will be proportional to the estimated
	CPU load.

``ignore_nice_load``
	If set to 1 (default 0), it will cause the CPU load estimation code to
	treat the CPU time spent on executing tasks with "nice" levels greater
	than 0 as CPU idle time.

	This may be useful if there are tasks in the system that should not be
	taken into account when deciding what frequency to run the CPUs at.
	Then, to make that happen it is sufficient to increase the "nice" level
	of those tasks above 0 and set this attribute to 1.

``sampling_down_factor``
	Temporary multiplier, between 1 (default) and 100 inclusive, to apply to
	the ``sampling_rate`` value if the CPU load goes above ``up_threshold``.

	This causes the next execution of the governor's worker routine (after
	setting the frequency to the allowed maximum) to be delayed, so the
	frequency stays at the maximum level for a longer time.

	Frequency fluctuations in some bursty workloads may be avoided this way
	at the cost of additional energy spent on maintaining the maximum CPU
	capacity.

``powersave_bias``
	Reduction factor to apply to the original frequency target of the
	governor (including the maximum value used when the ``up_threshold``
	value is exceeded by the estimated CPU load) or sensitivity threshold
	for the AMD frequency sensitivity powersave bias driver
	(:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000
	inclusive.

	If the AMD frequency sensitivity powersave bias driver is not loaded,
	the effective frequency to apply is given by

		f * (1 - ``powersave_bias`` / 1000)

	where f is the governor's original frequency target.  The default value
	of this attribute is 0 in that case.

	If the AMD frequency sensitivity powersave bias driver is loaded, the
	value of this attribute is 400 by default and it is used in a different
	way.

	On Family 16h (and later) AMD processors there is a mechanism to get a
	measured workload sensitivity, between 0 and 100% inclusive, from the
	hardware.  That value can be used to estimate how the performance of the
	workload running on a CPU will change in response to frequency changes.

	The performance of a workload with the sensitivity of 0 (memory-bound or
	IO-bound) is not expected to increase at all as a result of increasing
	the CPU frequency, whereas workloads with the sensitivity of 100%
	(CPU-bound) are expected to perform much better if the CPU frequency is
	increased.

	If the workload sensitivity is less than the threshold represented by
	the ``powersave_bias`` value, the sensitivity powersave bias driver
	will cause the governor to select a frequency lower than its original
	target, so as to avoid over-provisioning workloads that will not benefit
	from running at higher CPU frequencies.
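
For the case in which the AMD frequency sensitivity powersave bias driver is
not loaded, the ``powersave_bias`` formula given above can be illustrated with
the following Python sketch.  The function name ``biased_freq`` is hypothetical
and integer arithmetic is used only to keep the example simple.

```python
def biased_freq(f, powersave_bias):
    """Apply f * (1 - powersave_bias / 1000) to a frequency target in kHz."""
    if not 0 <= powersave_bias <= 1000:
        raise ValueError("powersave_bias must be between 0 and 1000")
    return f * (1000 - powersave_bias) // 1000


# A bias of 100 lowers a 2 GHz target by 10%.
print(biased_freq(2_000_000, 100))
```

With the default value of 0 the original frequency target is left unchanged.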

``conservative``
----------------

This governor uses CPU load as a CPU frequency selection metric.

It estimates the CPU load in the same way as the `ondemand`_ governor described
above, but the CPU frequency selection algorithm implemented by it is different.

Namely, it avoids changing the frequency significantly over short time
intervals, as such changes may not be suitable for systems with limited power
supply capacity (e.g. battery-powered ones).  To achieve that, it changes the
frequency in relatively small steps, one step at a time, up or down - depending
on whether or not a (configurable) threshold has been exceeded by the estimated
CPU load.

This governor exposes the following tunables:

``freq_step``
	Frequency step in percent of the maximum frequency the governor is
	allowed to set (the ``scaling_max_freq`` policy limit), between 0 and
	100 (5 by default).

	This is how much the frequency is allowed to change in one go.  Setting
	it to 0 will cause the default frequency step (5 percent) to be used
	and setting it to 100 effectively causes the governor to periodically
	switch the frequency between the ``scaling_min_freq`` and
	``scaling_max_freq`` policy limits.

``down_threshold``
	Threshold value (in percent, 20 by default) used to determine the
	frequency change direction.

	If the estimated CPU load is greater than the governor's
	``up_threshold`` value (80 percent by default), the frequency will go
	up (by ``freq_step``).  If the load is less than ``down_threshold``
	(and the ``sampling_down_factor`` mechanism is not in effect), the
	frequency will go down.  Otherwise, the frequency will not be changed.

``sampling_down_factor``
	Frequency decrease deferral factor, between 1 (default) and 10
	inclusive.

	It effectively causes the frequency to go down ``sampling_down_factor``
	times slower than it ramps up.
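
The stepping behavior controlled by these tunables can be sketched in Python as
follows.  The function name ``conservative_next_freq`` is hypothetical, the
upward threshold is passed separately as ``up_threshold`` (an assumption of this
sketch), and the ``sampling_down_factor`` deferral is not modeled.

```python
def conservative_next_freq(cur_freq, load_pct, up_threshold, down_threshold,
                           freq_step, scaling_min_freq, scaling_max_freq):
    """Move the frequency (kHz) by at most one step per invocation."""
    if freq_step == 0:
        freq_step = 5  # 0 selects the default step of 5 percent
    step = freq_step * scaling_max_freq // 100
    if load_pct > up_threshold:
        return min(cur_freq + step, scaling_max_freq)
    if load_pct < down_threshold:
        return max(cur_freq - step, scaling_min_freq)
    return cur_freq  # load between the thresholds: no change


# High load raises a 1 GHz CPU by 5% of the 2 GHz policy maximum.
print(conservative_next_freq(1_000_000, 90, 80, 20, 5, 800_000, 2_000_000))
```

Since the step is a percentage of ``scaling_max_freq``, its absolute size stays
the same regardless of the current frequency.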


Frequency Boost Support
=======================

Background
----------

Some processors support a mechanism to raise the operating frequency of some
cores in a multicore package temporarily (and above the sustainable frequency
threshold for the whole package) under certain conditions, for example if the
whole chip is not fully utilized and below its intended thermal or power budget.

Different names are used by different vendors to refer to this functionality.
For Intel processors it is referred to as "Turbo Boost", AMD calls it
"Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on.
As a rule, it also is implemented differently by different vendors.  The simple
term "frequency boost" is used here for brevity to refer to all of those
implementations.

The frequency boost mechanism may be either hardware-based or software-based.
If it is hardware-based (e.g. on x86), the decision to trigger the boosting is
made by the hardware (although in general it requires the hardware to be put
into a special state in which it can control the CPU frequency within certain
limits).  If it is software-based (e.g. on ARM), the scaling driver decides
whether or not to trigger boosting and when to do that.

The ``boost`` File in ``sysfs``
-------------------------------

This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls
the "boost" setting for the whole system.  It is not present if the underlying
scaling driver does not support the frequency boost mechanism (or supports it,
but provides a driver-specific interface for controlling it, like
|intel_pstate|).

If the value in this file is 1, the frequency boost mechanism is enabled.  This
means that either the hardware can be put into states in which it is able to
trigger boosting (in the hardware-based case), or the software is allowed to
trigger boosting (in the software-based case).  It does not mean that boosting
is actually in use at the moment on any CPUs in the system.  It only means a
permission to use the frequency boost mechanism (which still may never be used
for other reasons).

If the value in this file is 0, the frequency boost mechanism is disabled and
cannot be used at all.

The only values that can be written to this file are 0 and 1.
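
A minimal Python sketch of how a user space tool might read and change this
setting is shown below.  The helper names ``read_boost`` and ``write_boost`` are
hypothetical, and writing to the file is expected to require sufficient
privileges.

```python
from pathlib import Path

# Location of the global knob as described above.
BOOST_PATH = Path("/sys/devices/system/cpu/cpufreq/boost")


def read_boost(path=BOOST_PATH):
    """Return True if the file contains 1, False if it contains 0, or None
    if the file is absent (no boost support via this interface)."""
    try:
        return path.read_text().strip() == "1"
    except FileNotFoundError:
        return None


def write_boost(enabled, path=BOOST_PATH):
    """Write 1 or 0, the only values accepted by the file."""
    path.write_text("1" if enabled else "0")
```

For example, a benchmark harness could call ``write_boost(False)`` before a run
and restore the previous value returned by ``read_boost()`` afterwards.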

Rationale for Boost Control Knob
--------------------------------

The frequency boost mechanism is generally intended to help to achieve optimum
CPU performance on time scales below software resolution (e.g. below the
scheduler tick interval) and it is demonstrably suitable for many workloads, but
it may lead to problems in certain situations.

For this reason, many systems make it possible to disable the frequency boost
mechanism in the platform firmware (BIOS) setup, but that requires the system to
be restarted for the setting to be adjusted as desired, which may not be
practical at least in some cases.  For example:

  1. Boosting means overclocking the processor, although under controlled
     conditions.  Generally, the processor's energy consumption increases
     as a result of increasing its frequency and voltage, even temporarily.
     That may not be desirable on systems that switch to power sources of
     limited capacity, such as batteries, so the ability to disable the boost
     mechanism while the system is running may help there (but that depends on
     the workload too).

  2. In some situations deterministic behavior is more important than
     performance or energy consumption (or both) and the ability to disable
     boosting while the system is running may be useful then.

  3. To examine the impact of the frequency boost mechanism itself, it is useful
     to be able to run tests with and without boosting, preferably without
     restarting the system in the meantime.

  4. Reproducible results are important when running benchmarks.  Since
     the boosting functionality depends on the load of the whole package,
     single-thread performance may vary because of it, which may lead to
     unreproducible results sometimes.  That can be avoided by disabling the
     frequency boost mechanism before running benchmarks sensitive to that
     issue.

Legacy AMD ``cpb`` Knob
-----------------------

The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to
the global ``boost`` one.  It is used for disabling/enabling the "Core
Performance Boost" feature of some AMD processors.

If present, that knob is located in every ``CPUFreq`` policy directory in
``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called
``cpb``, which indicates a more fine-grained control interface.  The actual
implementation, however, works on the system-wide basis and setting that knob
for one policy causes the same value of it to be set for all of the other
policies at the same time.

That knob is still supported on AMD processors that support its underlying
hardware feature, but it may be configured out of the kernel (via the
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global
``boost`` knob is present regardless.  Thus it is always possible to use the
``boost`` knob instead of the ``cpb`` one, which is highly recommended, as that
is more consistent with what all of the other systems do (and the ``cpb`` knob
may not be supported any more in the future).

The ``cpb`` knob is never present for any processors without the underlying
hardware feature (e.g. all Intel ones), even if the
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set.


References
==========

.. [1] Jonathan Corbet, *Per-entity load tracking*,
       https://lwn.net/Articles/531853/