.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

.. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>`
.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`

=======================
CPU Performance Scaling
=======================

:Copyright: |copy| 2017 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


The Concept of CPU Performance Scaling
======================================

The majority of modern processors are capable of operating in a number of
different clock frequency and voltage configurations, often referred to as
Operating Performance Points or P-states (in ACPI terminology).  As a rule,
the higher the clock frequency and the higher the voltage, the more instructions
can be retired by the CPU over a unit of time, but also the higher the clock
frequency and the higher the voltage, the more energy is consumed over a unit of
time (or the more power is drawn) by the CPU in the given P-state.  Therefore
there is a natural tradeoff between the CPU capacity (the number of instructions
that can be executed over a unit of time) and the power drawn by the CPU.

In some situations it is desirable or even necessary to run the program as fast
as possible and then there is no reason to use any P-states different from the
highest one (i.e. the highest-performance frequency/voltage configuration
available).  In some other cases, however, it may not be necessary to execute
instructions so quickly and maintaining the highest available CPU capacity for a
relatively long time without utilizing it entirely may be regarded as wasteful.
It also may not be physically possible to maintain maximum CPU capacity for too
long for thermal or power supply capacity reasons or similar.  To cover those
cases, there are hardware interfaces allowing CPUs to be switched between
different frequency/voltage configurations or (in the ACPI terminology) to be
put into different P-states.

Typically, they are used along with algorithms to estimate the required CPU
capacity, so as to decide which P-states to put the CPUs into.  Of course, since
the utilization of the system generally changes over time, that has to be done
repeatedly on a regular basis.  The activity by which this happens is referred
to as CPU performance scaling or CPU frequency scaling (because it involves
adjusting the CPU clock frequency).


CPU Performance Scaling in Linux
================================

The Linux kernel supports CPU performance scaling by means of the ``CPUFreq``
(CPU Frequency scaling) subsystem that consists of three layers of code: the
core, scaling governors and scaling drivers.

The ``CPUFreq`` core provides the common code infrastructure and user space
interfaces for all platforms that support CPU performance scaling.  It defines
the basic framework in which the other components operate.

Scaling governors implement algorithms to estimate the required CPU capacity.
As a rule, each governor implements one, possibly parametrized, scaling
algorithm.

Scaling drivers talk to the hardware.  They provide scaling governors with
information on the available P-states (or P-state ranges in some cases) and
access platform-specific hardware interfaces to change CPU P-states as requested
by scaling governors.

In principle, all available scaling governors can be used with every scaling
driver.  That design is based on the observation that the information used by
performance scaling algorithms for P-state selection can be represented in a
platform-independent form in the majority of cases, so it should be possible
to use the same performance scaling algorithm implemented in exactly the same
way regardless of which scaling driver is used.  Consequently, the same set of
scaling governors should be suitable for every supported platform.

However, that observation may not hold for performance scaling algorithms
based on information provided by the hardware itself, for example through
feedback registers, as that information is typically specific to the hardware
interface it comes from and may not be easily represented in an abstract,
platform-independent way.  For this reason, ``CPUFreq`` allows scaling drivers
to bypass the governor layer and implement their own performance scaling
algorithms.  That is done by the |intel_pstate| scaling driver.


``CPUFreq`` Policy Objects
==========================

In some cases the hardware interface for P-state control is shared by multiple
CPUs.  That is, for example, the same register (or set of registers) is used to
control the P-state of multiple CPUs at the same time and writing to it affects
all of those CPUs simultaneously.

Sets of CPUs sharing hardware P-state control interfaces are represented by
``CPUFreq`` as |struct cpufreq_policy| objects.  For consistency,
|struct cpufreq_policy| is also used when there is only one CPU in the given
set.

The ``CPUFreq`` core maintains a pointer to a |struct cpufreq_policy| object for
every CPU in the system, including CPUs that are currently offline.  If multiple
CPUs share the same hardware P-state control interface, all of the pointers
corresponding to them point to the same |struct cpufreq_policy| object.

``CPUFreq`` uses |struct cpufreq_policy| as its basic data type and the design
of its user space interface is based on the policy concept.


CPU Initialization
==================

First of all, a scaling driver has to be registered for ``CPUFreq`` to work.
It is only possible to register one scaling driver at a time, so the scaling
driver is expected to be able to handle all CPUs in the system.

The scaling driver may be registered before or after CPU registration.  If
CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to
take a note of all of the already registered CPUs during the registration of the
scaling driver.  In turn, if any CPUs are registered after the registration of
the scaling driver, the ``CPUFreq`` core will be invoked to take note of them
at their registration time.

In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it
has not seen so far as soon as it is ready to handle that CPU.  [Note that the
logical CPU may be a physical single-core processor, or a single core in a
multicore processor, or a hardware thread in a physical processor or processor
core.  In what follows "CPU" always means "logical CPU" unless explicitly stated
otherwise and the word "processor" is used to refer to the physical part
possibly including multiple logical CPUs.]

Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set
for the given CPU and if so, it skips the policy object creation.  Otherwise,
a new policy object is created and initialized, which involves the creation of
a new policy directory in ``sysfs``, and the policy pointer corresponding to
the given CPU is set to the new policy object's address in memory.

Next, the scaling driver's ``->init()`` callback is invoked with the policy
pointer of the new CPU passed to it as the argument.  That callback is expected
to initialize the performance scaling hardware interface for the given CPU (or,
more precisely, for the set of CPUs sharing the hardware interface it belongs
to, represented by its policy object) and, if the policy object it has been
called for is new, to set parameters of the policy, like the minimum and maximum
frequencies supported by the hardware, the table of available frequencies (if
the set of supported P-states is not a continuous range), and the mask of CPUs
that belong to the same policy (including both online and offline CPUs).  That
mask is then used by the core to populate the policy pointers for all of the
CPUs in it.
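
The policy creation steps described above can be modelled in a few lines of
user space Python.  This is only an illustrative sketch, not kernel code: the
``Policy`` class, ``toy_driver_init()`` and ``add_cpu()`` are hypothetical
names, and the toy driver simply assumes that every pair of adjacent CPUs
shares one hardware interface.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Stand-in for struct cpufreq_policy (illustrative fields only)."""
    cpu: int                                   # CPU the policy was created for
    min_khz: int = 0
    max_khz: int = 0
    related_cpus: set = field(default_factory=set)


def toy_driver_init(policy):
    """Hypothetical ->init(): assume each pair of CPUs shares one interface."""
    first = policy.cpu - policy.cpu % 2
    policy.related_cpus = {first, first + 1}   # mask of online and offline CPUs
    policy.min_khz, policy.max_khz = 800_000, 3_000_000


def add_cpu(policy_of, cpu, driver_init=toy_driver_init):
    """Model of the core taking note of a CPU it has not seen so far."""
    if cpu in policy_of:                       # policy pointer already set
        return policy_of[cpu]                  # skip policy object creation
    policy = Policy(cpu=cpu)
    driver_init(policy)                        # driver sets limits and the mask
    for sibling in policy.related_cpus:
        policy_of[sibling] = policy            # populate pointers from the mask
    return policy


policy_of = {}
for cpu in range(4):
    add_cpu(policy_of, cpu)
```

In this model CPUs 0 and 1 end up pointing to one ``Policy`` object and CPUs 2
and 3 to another, which mirrors the way policy pointers are shared between CPUs
that use the same hardware interface.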

The next major initialization step for a new policy object is to attach a
scaling governor to it (to begin with, that is the default scaling governor
determined by the kernel configuration, but it may be changed later
via ``sysfs``).  First, a pointer to the new policy object is passed to the
governor's ``->init()`` callback which is expected to initialize all of the
data structures necessary to handle the given policy and, possibly, to add
a governor ``sysfs`` interface to it.  Next, the governor is started by
invoking its ``->start()`` callback.

That callback is expected to register per-CPU utilization update callbacks for
all of the online CPUs belonging to the given policy with the CPU scheduler.
The utilization update callbacks will be invoked by the CPU scheduler on
important events, like task enqueue and dequeue, on every iteration of the
scheduler tick or generally whenever the CPU utilization may change (from the
scheduler's perspective).  They are expected to carry out computations needed
to determine the P-state to use for the given policy going forward and to
invoke the scaling driver to make changes to the hardware in accordance with
the P-state selection.  The scaling driver may be invoked directly from
scheduler context or asynchronously, via a kernel thread or workqueue, depending
on the configuration and capabilities of the scaling driver and the governor.

Similar steps are taken for policy objects that are not new, but were "inactive"
previously, meaning that all of the CPUs belonging to them were offline.  The
only practical difference in that case is that the ``CPUFreq`` core will attempt
to use the scaling governor previously used with the policy that became
"inactive" (and is re-initialized now) instead of the default governor.

In turn, if a previously offline CPU is being brought back online, but some
other CPUs sharing the policy object with it are online already, there is no
need to re-initialize the policy object at all.  In that case, it is only
necessary to restart the scaling governor so that it can take the new online CPU
into account.  That is achieved by invoking the governor's ``->stop()`` and
``->start()`` callbacks, in this order, for the entire policy.

As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
Consequently, if |intel_pstate| is used, scaling governors are not attached to
new policy objects.  Instead, the driver's ``->setpolicy()`` callback is invoked
to register per-CPU utilization update callbacks for each policy.  These
callbacks are invoked by the CPU scheduler in the same way as for scaling
governors, but in the |intel_pstate| case they both determine the P-state to
use and change the hardware configuration accordingly in one go from scheduler
context.

The policy objects created during CPU initialization and other data structures
associated with them are torn down when the scaling driver is unregistered
(which happens when the kernel module containing it is unloaded, for example) or
when the last CPU belonging to the given policy is unregistered.


Policy Interface in ``sysfs``
=============================

During the initialization of the kernel, the ``CPUFreq`` core creates a
``sysfs`` directory (kobject) called ``cpufreq`` under
:file:`/sys/devices/system/cpu/`.

That directory contains a ``policyX`` subdirectory (where ``X`` represents an
integer number) for every policy object maintained by the ``CPUFreq`` core.
Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links
under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer
that may be different from the one represented by ``X``) for all of the CPUs
associated with (or belonging to) the given policy.  The ``policyX`` directories
in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific
attributes (files) to control ``CPUFreq`` behavior for the corresponding policy
objects (that is, for all of the CPUs associated with them).
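
The directory layout described above can be inspected from user space.  The
following Python sketch follows the ``cpufreq`` symbolic link of every ``cpuY``
directory and reports which ``policyX`` directory it points to.  The function
name ``policies_by_cpu()`` and its ``cpu_root`` parameter are made up for this
example; the parameter exists only so that the function can also be pointed at
a directory tree other than the real ``sysfs`` one.

```python
import os
from pathlib import Path


def policies_by_cpu(cpu_root="/sys/devices/system/cpu"):
    """Map each cpuY directory name to the policyX directory it links to."""
    mapping = {}
    # "cpu[0-9]*" matches cpu0, cpu1, ... but not cpufreq or cpuidle.
    for cpu_dir in sorted(Path(cpu_root).glob("cpu[0-9]*")):
        link = cpu_dir / "cpufreq"
        if link.is_symlink():                  # CPUs without a policy are skipped
            mapping[cpu_dir.name] = Path(os.path.realpath(link)).name
    return mapping
```

On a system where two CPUs share one policy, both of their entries in the
returned dictionary carry the same ``policyX`` name.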

Some of those attributes are generic.  They are created by the ``CPUFreq`` core
and their behavior generally does not depend on what scaling driver is in use
and what scaling governor is attached to the given policy.  Some scaling drivers
also add driver-specific attributes to the policy directories in ``sysfs`` to
control policy-specific aspects of driver behavior.

The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/`
are the following:

``affected_cpus``
	List of online CPUs belonging to this policy (i.e. sharing the hardware
	performance scaling interface represented by the ``policyX`` policy
	object).

``bios_limit``
	If the platform firmware (BIOS) tells the OS to apply an upper limit to
	CPU frequencies, that limit will be reported through this attribute (if
	present).

	The existence of the limit may be a result of some (often unintentional)
	BIOS settings, restrictions coming from a service processor or other
	BIOS/HW-based mechanisms.

	This does not cover ACPI thermal limitations which can be discovered
	through a generic thermal driver.

	This attribute is not present if the scaling driver in use does not
	support it.

``cpuinfo_cur_freq``
	Current frequency of the CPUs belonging to this policy as obtained from
	the hardware (in kHz).

	This is expected to be the frequency the hardware actually runs at.
	If that frequency cannot be determined, this attribute should not
	be present.

``cpuinfo_max_freq``
	Maximum possible operating frequency the CPUs belonging to this policy
	can run at (in kHz).

``cpuinfo_min_freq``
	Minimum possible operating frequency the CPUs belonging to this policy
	can run at (in kHz).

``cpuinfo_transition_latency``
	The time it takes to switch the CPUs belonging to this policy from one
	P-state to another, in nanoseconds.

	If unknown or if known to be so high that the scaling driver does not
	work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`)
	will be returned by reads from this attribute.

``related_cpus``
	List of all (online and offline) CPUs belonging to this policy.

``scaling_available_governors``
	List of ``CPUFreq`` scaling governors present in the kernel that can
	be attached to this policy or (if the |intel_pstate| scaling driver is
	in use) list of scaling algorithms provided by the driver that can be
	applied to this policy.

	[Note that some governors are modular and it may be necessary to load
	the kernel module holding a governor for that governor to become
	available and be listed by this attribute.]

``scaling_cur_freq``
	Current frequency of all of the CPUs belonging to this policy (in kHz).

	In the majority of cases, this is the frequency of the last P-state
	requested by the scaling driver from the hardware using the scaling
	interface provided by it, which may or may not reflect the frequency
	the CPU is actually running at (due to hardware design and other
	limitations).

	Some architectures (e.g. ``x86``) may attempt to provide information
	more precisely reflecting the current CPU frequency through this
	attribute, but that still may not be the exact current CPU frequency as
	seen by the hardware at the moment.

``scaling_driver``
	The scaling driver currently in use.

``scaling_governor``
	The scaling governor currently attached to this policy or (if the
	|intel_pstate| scaling driver is in use) the scaling algorithm
	provided by the driver that is currently applied to this policy.

	This attribute is read-write and writing to it will cause a new scaling
	governor to be attached to this policy or a new scaling algorithm
	provided by the scaling driver to be applied to it (in the
	|intel_pstate| case), as indicated by the string written to this
	attribute (which must be one of the names listed by the
	``scaling_available_governors`` attribute described above).

``scaling_max_freq``
	Maximum frequency the CPUs belonging to this policy are allowed to be
	running at (in kHz).

	This attribute is read-write and writing a string representing an
	integer to it will cause a new limit to be set (it must not be lower
	than the value of the ``scaling_min_freq`` attribute).

``scaling_min_freq``
	Minimum frequency the CPUs belonging to this policy are allowed to be
	running at (in kHz).

	This attribute is read-write and writing a string representing a
	non-negative integer to it will cause a new limit to be set (it must not
	be higher than the value of the ``scaling_max_freq`` attribute).

``scaling_setspeed``
	This attribute is functional only if the `userspace`_ scaling governor
	is attached to the given policy.

	It returns the last frequency requested by the governor (in kHz) or can
	be written to in order to set a new frequency for the policy.
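
As an illustration of how the generic attributes listed above might be consumed,
the Python sketch below collects them from one ``policyX`` directory into a
dictionary.  The helper name ``read_policy()`` is hypothetical, and the sketch
assumes that the CPU list attributes contain whitespace-separated CPU numbers.
Attributes that may be missing (or unreadable for the calling user) are simply
left out of the result.

```python
from pathlib import Path

KHZ_ATTRS = ("cpuinfo_min_freq", "cpuinfo_max_freq", "scaling_min_freq",
             "scaling_max_freq", "scaling_cur_freq",
             "bios_limit", "cpuinfo_cur_freq")   # the last two are optional
CPU_LIST_ATTRS = ("affected_cpus", "related_cpus")
NAME_ATTRS = ("scaling_driver", "scaling_governor")


def read_policy(policy_dir):
    """Return the generic attributes of one policyX directory as a dict."""
    policy_dir = Path(policy_dir)
    info = {}
    for name in KHZ_ATTRS:
        try:
            info[name] = int((policy_dir / name).read_text())   # value in kHz
        except OSError:
            continue            # attribute not present or not readable
    for name in CPU_LIST_ATTRS:
        # Assumption: CPU numbers are separated by whitespace.
        text = (policy_dir / name).read_text()
        info[name] = [int(cpu) for cpu in text.split()]
    for name in NAME_ATTRS:
        info[name] = (policy_dir / name).read_text().strip()
    return info
```

A caller would typically pass a path such as
``/sys/devices/system/cpu/cpufreq/policy0`` and could, for instance, compare
``affected_cpus`` with ``related_cpus`` to find out which CPUs of the policy are
currently offline.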


Generic Scaling Governors
=========================

``CPUFreq`` provides generic scaling governors that can be used with all
scaling drivers.  As stated before, each of them implements a single, possibly
parametrized, performance scaling algorithm.

Scaling governors are attached to policy objects and different policy objects
can be handled by different scaling governors at the same time (although that
may lead to suboptimal results in some cases).

The scaling governor for a given policy object can be changed at any time with
the help of the ``scaling_governor`` policy attribute in ``sysfs``.

Some governors expose ``sysfs`` attributes to control or fine-tune the scaling
algorithms implemented by them.  Those attributes, referred to as governor
tunables, can be either global (system-wide) or per-policy, depending on the
scaling driver in use.  If the driver requires governor tunables to be
per-policy, they are located in a subdirectory of each policy directory.
Otherwise, they are located in a subdirectory under
:file:`/sys/devices/system/cpu/cpufreq/`.  In either case the name of the
subdirectory containing the governor tunables is the name of the governor
providing them.
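
Because the location of governor tunables depends on the scaling driver, a user
space tool cannot hard-code it.  The Python sketch below shows one possible way
to look them up: it checks the per-policy location first and then falls back to
the global one.  The function name ``tunables_dir()`` and its parameters are
illustrative only.

```python
from pathlib import Path


def tunables_dir(governor, policy_dir,
                 cpufreq_root="/sys/devices/system/cpu/cpufreq"):
    """Return the directory holding the tunables of `governor`, or None."""
    # Per-policy tunables live in policyX/<governor>/, global ones in
    # cpufreq/<governor>/; the subdirectory is named after the governor.
    for base in (Path(policy_dir), Path(cpufreq_root)):
        candidate = base / governor
        if candidate.is_dir():
            return candidate
    return None                 # governor not in use or it has no tunables
```

For example, ``tunables_dir("schedutil", ".../cpufreq/policy0")`` would return
the per-policy directory on a system whose driver requires per-policy tunables.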

``performance``
---------------

When attached to a policy object, this governor causes the highest frequency,
within the ``scaling_max_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.

``powersave``
-------------

When attached to a policy object, this governor causes the lowest frequency,
within the ``scaling_min_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.
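
The behavior of the ``performance`` and ``powersave`` governors is simple
enough to be expressed as a single function.  The sketch below is a user space
model with a made-up name, ``static_governor_target()``, and is not taken from
the kernel sources; it only restates which policy limit each governor requests.

```python
def static_governor_target(governor, scaling_min_khz, scaling_max_khz):
    """Frequency (in kHz) requested by one of the two static governors."""
    if governor == "performance":
        return scaling_max_khz      # highest frequency within the policy limit
    if governor == "powersave":
        return scaling_min_khz      # lowest frequency within the policy limit
    raise ValueError(f"not a static governor: {governor}")
```

In this model the function would be evaluated again whenever one of the two
limits changes, which corresponds to the repeated requests mentioned above.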

``userspace``
-------------

This governor does not do anything by itself.  Instead, it allows user space
to set the CPU frequency for the policy it is attached to by writing to the
``scaling_setspeed`` attribute of that policy.
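
A minimal Python sketch of a user space frequency setter is shown below.  The
function name ``set_speed()`` is hypothetical.  It refuses to write unless the
``userspace`` governor is attached to the policy, because ``scaling_setspeed``
is functional only in that case, and writing to the real ``sysfs`` attribute is
expected to require sufficient privileges.

```python
from pathlib import Path


def set_speed(policy_dir, freq_khz):
    """Request `freq_khz` for a policy handled by the userspace governor."""
    policy_dir = Path(policy_dir)
    governor = (policy_dir / "scaling_governor").read_text().strip()
    if governor != "userspace":
        raise RuntimeError(f"governor is {governor!r}, not 'userspace'")
    # The attribute takes the new frequency in kHz as a decimal string.
    (policy_dir / "scaling_setspeed").write_text(f"{freq_khz}\n")
```

A call such as ``set_speed("/sys/devices/system/cpu/cpufreq/policy0", 1800000)``
would then ask for 1.8 GHz on that policy.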

``schedutil``
-------------

This governor uses CPU utilization data available from the CPU scheduler.  It
generally is regarded as a part of the CPU scheduler, so it can access the
scheduler's internal data structures directly.

It runs entirely in scheduler context, although in some cases it may need to
invoke the scaling driver asynchronously when it decides that the CPU frequency
should be changed for a given policy (that depends on whether or not the driver
is capable of changing the CPU frequency from scheduler context).

The actions of this governor for a particular CPU depend on the scheduling class
invoking its utilization update callback for that CPU.  If it is invoked by the
RT or deadline scheduling classes, the governor will increase the frequency to
the allowed maximum (that is, the ``scaling_max_freq`` policy limit).  In turn,
if it is invoked by the CFS scheduling class, the governor will use the
Per-Entity Load Tracking (PELT) metric for the root control group of the
given CPU as the CPU utilization estimate (see the *Per-entity load tracking*
LWN.net article [1]_ for a description of the PELT mechanism).  Then, the new
CPU frequency to apply is computed in accordance with the formula

	f = 1.25 * ``f_0`` * ``util`` / ``max``

where ``util`` is the PELT number, ``max`` is the theoretical maximum of
``util``, and ``f_0`` is either the maximum possible CPU frequency for the given
policy (if the PELT number is frequency-invariant), or the current CPU frequency
(otherwise).
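
The formula above can be evaluated directly.  The following Python sketch does
so in floating point for clarity; the function and parameter names are chosen
for this example only, and details such as clamping the result to the policy
limits are deliberately left out.

```python
def schedutil_next_freq(util, max_util, max_freq_khz, cur_freq_khz,
                        freq_invariant):
    """Evaluate f = 1.25 * f_0 * util / max (frequencies in kHz)."""
    # f_0 is the maximum possible frequency if the PELT number is
    # frequency-invariant and the current frequency otherwise.
    f_0 = max_freq_khz if freq_invariant else cur_freq_khz
    return 1.25 * f_0 * util / max_util
```

One consequence of the 1.25 factor is that the result reaches ``f_0`` when
``util`` is 80% of ``max``, so some headroom is left before the CPU becomes
fully utilized at the selected frequency.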

This governor also employs a mechanism allowing it to temporarily bump up the
CPU frequency for tasks that have been waiting on I/O most recently, called
"IO-wait boosting".  That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag
is passed by the scheduler to the governor callback, which causes the frequency
to go up to the allowed maximum immediately and then fall back over time to the
value returned by the above formula.

This governor exposes only one tunable:

``rate_limit_us``
	Minimum time (in microseconds) that has to pass between two consecutive
	runs of governor computations (default: 1000 times the scaling driver's
	transition latency).

	The purpose of this tunable is to reduce the scheduler context overhead
	of the governor, which might be excessive without it.

This governor generally is regarded as a replacement for the older `ondemand`_
and `conservative`_ governors (described below), as it is simpler and more
tightly integrated with the CPU scheduler, its overhead in terms of CPU context
switches and similar is less significant, and it uses the scheduler's own CPU
utilization metric, so in principle its decisions should not contradict the
decisions made by the other parts of the scheduler.

``ondemand``
------------

This governor uses CPU load as a CPU frequency selection metric.

In order to estimate the current CPU load, it measures the time elapsed between
consecutive invocations of its worker routine and computes the fraction of that
time in which the given CPU was not idle.  The ratio of the non-idle (active)
time to the total CPU time is taken as an estimate of the load.

If this governor is attached to a policy shared by multiple CPUs, the load is
estimated for all of them and the greatest result is taken as the load estimate
for the entire policy.
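
The load estimate described above can be sketched in shell as follows.  The
helper names and the sample numbers are hypothetical, and the elapsed and idle
times are assumed to be given in the same unit (for example, microseconds) for
one sampling interval.

.. code-block:: shell

	# Hypothetical helper: load of one CPU in percent, i.e. the non-idle
	# share of the time elapsed between two runs of the worker routine.
	cpu_load_pct() (
		elapsed=$1 idle=$2
		echo $(( 100 * (elapsed - idle) / elapsed ))
	)

	# Hypothetical helper: the policy load is the greatest per-CPU load.
	policy_load_pct() (
		max=0
		for load in "$@"; do
			if [ "$load" -gt "$max" ]; then
				max=$load
			fi
		done
		echo "$max"
	)

	# Two CPUs sharing a policy, sampled over 10000 us each.
	policy_load_pct "$(cpu_load_pct 10000 7500)" "$(cpu_load_pct 10000 2000)"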

The worker routine of this governor has to run in process context, so it is
invoked asynchronously (via a workqueue) and CPU P-states are updated from
there if necessary.  As a result, the scheduler context overhead from this
governor is minimal, but it causes additional CPU context switches to happen
relatively often and the CPU P-state updates triggered by it can be relatively
irregular.  Also, it affects its own CPU load metric by running code that
reduces the CPU idle time (even though the CPU idle time is only reduced very
slightly by it).

It generally selects CPU frequencies proportional to the estimated load, so that
the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of
1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute
corresponds to the load of 0, unless the load exceeds a (configurable)
speedup threshold, in which case it will go straight for the highest frequency
it is allowed to use (the ``scaling_max_freq`` policy limit).
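
A minimal shell sketch of the selection rule described above follows.  The
function name and its arguments are hypothetical, and the linear interpolation
between ``cpuinfo_min_freq`` and ``cpuinfo_max_freq`` is only one
straightforward reading of "proportional to the estimated load"; the kernel code
may round and clamp differently.

.. code-block:: shell

	# Hypothetical helper.  Arguments: load (percent), speedup threshold
	# (percent), cpuinfo_min_freq, cpuinfo_max_freq, scaling_max_freq (kHz).
	ondemand_target_khz() (
		load=$1 threshold=$2 min=$3 max=$4 limit=$5
		if [ "$load" -gt "$threshold" ]; then
			# Above the threshold: go straight to the policy limit.
			f=$limit
		else
			# Load 0 maps to min and load 100 maps to max.
			f=$(( min + load * (max - min) / 100 ))
		fi
		# Assumption: the policy limit is treated as an upper bound.
		if [ "$f" -gt "$limit" ]; then
			f=$limit
		fi
		echo "$f"
	)

	# Sample values only (the threshold of 95 is not a claimed default).
	ondemand_target_khz 50 95 800000 3000000 2800000
	ondemand_target_khz 97 95 800000 3000000 2800000

With these sample values the first call selects 1900000 kHz and the second one
selects the 2800000 kHz policy limit.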

This governor exposes the following tunables:

``sampling_rate``
	This is how often the governor's worker routine should run, in
	microseconds.

	Typically, it is set to values of the order of 10000 (10 ms).  Its
	default value is equal to the value of ``cpuinfo_transition_latency``
	for each policy this governor is attached to (but since the unit here
	is greater by 1000, this means that the time represented by
	``sampling_rate`` is 1000 times greater than the transition latency by
	default).

	If this tunable is per-policy, the following shell command sets the time
	represented by it to be 750 times as high as the transition latency::

		# echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate

``up_threshold``
	If the estimated CPU load is above this value (in percent), the governor
	will set the frequency to the maximum value allowed for the policy.
	Otherwise, the selected frequency will be proportional to the estimated
	CPU load.

``ignore_nice_load``
	If set to 1 (default 0), it will cause the CPU load estimation code to
	treat the CPU time spent on executing tasks with "nice" levels greater
	than 0 as CPU idle time.

	This may be useful if there are tasks in the system that should not be
	taken into account when deciding what frequency to run the CPUs at.
	Then, to make that happen it is sufficient to increase the "nice" level
	of those tasks above 0 and set this attribute to 1.

``sampling_down_factor``
	Temporary multiplier, between 1 (default) and 100 inclusive, to apply to
	the ``sampling_rate`` value if the CPU load goes above ``up_threshold``.

	This causes the next execution of the governor's worker routine (after
	setting the frequency to the allowed maximum) to be delayed, so the
	frequency stays at the maximum level for a longer time.

	Frequency fluctuations in some bursty workloads may be avoided this way
	at the cost of additional energy spent on maintaining the maximum CPU
	capacity.

``powersave_bias``
	Reduction factor to apply to the original frequency target of the
	governor (including the maximum value used when the ``up_threshold``
	value is exceeded by the estimated CPU load) or sensitivity threshold
	for the AMD frequency sensitivity powersave bias driver
	(:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000
	inclusive.

	If the AMD frequency sensitivity powersave bias driver is not loaded,
	the effective frequency to apply is given by

		f * (1 - ``powersave_bias`` / 1000)

	where f is the governor's original frequency target.  The default value
	of this attribute is 0 in that case.

	If the AMD frequency sensitivity powersave bias driver is loaded, the
	value of this attribute is 400 by default and it is used in a different
	way.

	On Family 16h (and later) AMD processors there is a mechanism to get a
	measured workload sensitivity, between 0 and 100% inclusive, from the
	hardware.  That value can be used to estimate how the performance of the
	workload running on a CPU will change in response to frequency changes.

	The performance of a workload with the sensitivity of 0 (memory-bound or
	IO-bound) is not expected to increase at all as a result of increasing
	the CPU frequency, whereas workloads with the sensitivity of 100%
	(CPU-bound) are expected to perform much better if the CPU frequency is
	increased.

	If the workload sensitivity is less than the threshold represented by
	the ``powersave_bias`` value, the sensitivity powersave bias driver
	will cause the governor to select a frequency lower than its original
	target, so as to avoid over-provisioning workloads that will not benefit
	from running at higher CPU frequencies.
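
For the case in which the AMD frequency sensitivity powersave bias driver is not
loaded, the ``powersave_bias`` formula above works out as in the following shell
sketch.  The function name is hypothetical and the arithmetic is rearranged to
use integers only.

.. code-block:: shell

	# Hypothetical helper: f * (1 - powersave_bias / 1000), with f in kHz
	# and powersave_bias between 0 and 1000 inclusive.
	powersave_bias_target_khz() (
		f=$1 bias=$2
		echo $(( f * (1000 - bias) / 1000 ))
	)

	# Sample values only: a bias of 100 lowers the target by 10%.
	powersave_bias_target_khz 2000000 100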

``conservative``
----------------

This governor uses CPU load as a CPU frequency selection metric.

It estimates the CPU load in the same way as the `ondemand`_ governor described
above, but the CPU frequency selection algorithm implemented by it is different.

Namely, it avoids changing the frequency significantly over short time
intervals, because such changes may not be suitable for systems with limited
power supply capacity (e.g. battery-powered).  To achieve that, it changes the
frequency in relatively small steps, one step at a time, up or down - depending
on whether or not a (configurable) threshold has been exceeded by the estimated
CPU load.

This governor exposes the following tunables:

``freq_step``
	Frequency step in percent of the maximum frequency the governor is
	allowed to set (the ``scaling_max_freq`` policy limit), between 0 and
	100 (5 by default).

	This is how much the frequency is allowed to change in one go.  Setting
	it to 0 will cause the default frequency step (5 percent) to be used
	and setting it to 100 effectively causes the governor to periodically
	switch the frequency between the ``scaling_min_freq`` and
	``scaling_max_freq`` policy limits.

``down_threshold``
	Threshold value (in percent, 20 by default) used to determine the
	frequency change direction.

	If the estimated CPU load is greater than this value, the frequency will
	go up (by ``freq_step``).  If the load is less than this value (and the
	``sampling_down_factor`` mechanism is not in effect), the frequency will
	go down.  Otherwise, the frequency will not be changed.

``sampling_down_factor``
	Frequency decrease deferral factor, between 1 (default) and 10
	inclusive.

	It effectively causes the frequency to go down ``sampling_down_factor``
	times slower than it ramps up.
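
The effect of ``freq_step`` on a single frequency change can be sketched in
shell as follows.  The function name and its arguments are hypothetical, the
direction is passed in by the caller instead of being derived from the load, and
``sampling_down_factor`` is left out, so this is an illustration of the step
size only and not the kernel's implementation.

.. code-block:: shell

	# Hypothetical helper.  Arguments: current frequency, direction ("up"
	# or "down"), freq_step (percent), scaling_min_freq, scaling_max_freq
	# (frequencies in kHz).
	conservative_step_khz() (
		cur=$1 direction=$2 step_pct=$3 min=$4 max=$5
		# A freq_step of 0 selects the default step of 5 percent.
		if [ "$step_pct" -eq 0 ]; then
			step_pct=5
		fi
		# The step is a percentage of the scaling_max_freq limit.
		step=$(( max * step_pct / 100 ))
		case $direction in
		up) next=$(( cur + step )) ;;
		down) next=$(( cur - step )) ;;
		*) next=$cur ;;
		esac
		# Stay within the policy limits.
		if [ "$next" -gt "$max" ]; then next=$max; fi
		if [ "$next" -lt "$min" ]; then next=$min; fi
		echo "$next"
	)

	# Sample values only.
	conservative_step_khz 1500000 up 5 800000 3000000
	conservative_step_khz 850000 down 5 800000 3000000

With a ``freq_step`` of 100 every step lands on one of the two policy limits,
which matches the switching behavior described for that value.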


Frequency Boost Support
=======================

Background
----------

Some processors support a mechanism to raise the operating frequency of some
cores in a multicore package temporarily (and above the sustainable frequency
threshold for the whole package) under certain conditions, for example if the
whole chip is not fully utilized and below its intended thermal or power budget.

Different names are used by different vendors to refer to this functionality.
For Intel processors it is referred to as "Turbo Boost", AMD calls it
"Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on.
As a rule, it also is implemented differently by different vendors.  The simple
term "frequency boost" is used here for brevity to refer to all of those
implementations.

The frequency boost mechanism may be either hardware-based or software-based.
If it is hardware-based (e.g. on x86), the decision to trigger the boosting is
made by the hardware (although in general it requires the hardware to be put
into a special state in which it can control the CPU frequency within certain
limits).  If it is software-based (e.g. on ARM), the scaling driver decides
whether or not to trigger boosting and when to do that.

The ``boost`` File in ``sysfs``
-------------------------------

This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls
the "boost" setting for the whole system.  It is not present if the underlying
scaling driver does not support the frequency boost mechanism (or supports it,
but provides a driver-specific interface for controlling it, like
|intel_pstate|).

If the value in this file is 1, the frequency boost mechanism is enabled.  This
means that either the hardware can be put into states in which it is able to
trigger boosting (in the hardware-based case), or the software is allowed to
trigger boosting (in the software-based case).  It does not mean that boosting
is actually in use at the moment on any CPUs in the system.  It only grants
permission to use the frequency boost mechanism (which still may never be used
for other reasons).

If the value in this file is 0, the frequency boost mechanism is disabled and
cannot be used at all.

The only values that can be written to this file are 0 and 1.
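
A short shell sketch of reading and changing this setting follows.  The helper
names and the optional directory argument are hypothetical (the argument only
exists so that the helpers can be pointed at a different location), and writing
to the file generally requires root privileges.

.. code-block:: shell

	# Location of the global boost file, as described above.
	CPUFREQ_DIR=/sys/devices/system/cpu/cpufreq

	# Hypothetical helper: print "enabled", "disabled" or "absent".
	boost_status() (
		dir=${1:-$CPUFREQ_DIR}
		if [ ! -e "$dir/boost" ]; then
			echo absent
		elif [ "$(cat "$dir/boost")" = "1" ]; then
			echo enabled
		else
			echo disabled
		fi
	)

	# Hypothetical helper: write 0 or 1 to the file, rejecting anything else.
	boost_set() (
		value=$1 dir=${2:-$CPUFREQ_DIR}
		case $value in
		0|1) echo "$value" > "$dir/boost" ;;
		*) echo "boost_set: value must be 0 or 1" >&2; exit 1 ;;
		esac
	)

	boost_status

Here "absent" covers both cases mentioned above: a scaling driver without
frequency boost support and one that provides its own interface for it.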

Rationale for Boost Control Knob
--------------------------------

The frequency boost mechanism is generally intended to help to achieve optimum
CPU performance on time scales below software resolution (e.g. below the
scheduler tick interval) and it is demonstrably suitable for many workloads, but
it may lead to problems in certain situations.

For this reason, many systems make it possible to disable the frequency boost
mechanism in the platform firmware (BIOS) setup, but that requires the system to
be restarted for the setting to be adjusted as desired, which may not be
practical at least in some cases.  For example:

  1. Boosting means overclocking the processor, although under controlled
     conditions.  Generally, the processor's energy consumption increases
     as a result of increasing its frequency and voltage, even temporarily.
     That may not be desirable on systems that switch to power sources of
     limited capacity, such as batteries, so the ability to disable the boost
     mechanism while the system is running may help there (but that depends on
     the workload too).

  2. In some situations deterministic behavior is more important than
     performance or energy consumption (or both) and the ability to disable
     boosting while the system is running may be useful then.

  3. To examine the impact of the frequency boost mechanism itself, it is useful
     to be able to run tests with and without boosting, preferably without
     restarting the system in the meantime.

  4. Reproducible results are important when running benchmarks.  Since
     the boosting functionality depends on the load of the whole package,
     single-thread performance may vary because of it, which may lead to
     unreproducible results sometimes.  That can be avoided by disabling the
     frequency boost mechanism before running benchmarks sensitive to that
     issue.

Legacy AMD ``cpb`` Knob
-----------------------

The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to
the global ``boost`` one.  It is used for disabling/enabling the "Core
Performance Boost" feature of some AMD processors.

If present, that knob is located in every ``CPUFreq`` policy directory in
``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called
``cpb``, which suggests a more fine-grained control interface.  The actual
implementation, however, works on a system-wide basis and setting that knob
for one policy causes the same value of it to be set for all of the other
policies at the same time.

That knob is still supported on AMD processors that support its underlying
hardware feature, but it may be configured out of the kernel (via the
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global
``boost`` knob is present regardless.  Thus it is always possible to use the
``boost`` knob instead of the ``cpb`` one, which is highly recommended, as that
is more consistent with what all of the other systems do (and the ``cpb`` knob
may not be supported any more in the future).

The ``cpb`` knob is never present for any processors without the underlying
hardware feature (e.g. all Intel ones), even if the
:c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set.


References
==========

.. [1] Jonathan Corbet, *Per-entity load tracking*,
       https://lwn.net/Articles/531853/