.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

==============================================
``intel_idle`` CPU Idle Time Management Driver
==============================================

:Copyright: |copy| 2020 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


General Information
===================

``intel_idle`` is a part of the
:doc:`CPU idle time management subsystem <cpuidle>` in the Linux kernel
(``CPUIdle``).  It is the default CPU idle time management driver for the
Nehalem and later generations of Intel processors, but the level of support for
a particular processor model in it depends on whether or not it recognizes that
processor model and may also depend on information coming from the platform
firmware.  [To understand ``intel_idle`` it is necessary to know how ``CPUIdle``
works in general, so this is the time to get familiar with
Documentation/admin-guide/pm/cpuidle.rst if you have not done that yet.]

``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the
logical CPU executing it is idle and so it may be possible to put some of the
processor's functional blocks into low-power states.  That instruction takes two
arguments (passed in the ``EAX`` and ``ECX`` registers of the target CPU), the
first of which, referred to as a *hint*, can be used by the processor to
determine what can be done (for details refer to Intel Software Developer’s
Manual [1]_).  Accordingly, ``intel_idle`` refuses to work with processors in
which the support for the ``MWAIT`` instruction has been disabled (for example,
via the platform firmware configuration menu) or which do not support that
instruction at all.

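As a rough user-space illustration of the ``MWAIT`` requirement, the kernel
reports ``MONITOR``/``MWAIT`` support through the ``monitor`` flag in
``/proc/cpuinfo``.  The Python sketch below only looks for that flag in text
passed to it; the helper name ``cpu_has_mwait`` is hypothetical and this is not
how the driver itself performs the check.

.. code-block:: python

    def cpu_has_mwait(cpuinfo_text):
        """Return True if any "flags" line lists the "monitor" flag."""
        for line in cpuinfo_text.splitlines():
            key, _, value = line.partition(":")
            # Only "flags" lines carry CPU feature names.
            if key.strip() == "flags" and "monitor" in value.split():
                return True
        return False

It could be applied to the contents of ``/proc/cpuinfo`` read from a running
system.
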
``intel_idle`` is not modular, so it cannot be unloaded, which means that the
only way to pass early-configuration-time parameters to it is via the kernel
command line.


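Parameters of built-in code are passed in the kernel command line with the
module name as a prefix (for example, ``intel_idle.max_cstate=1``).  The
following Python sketch, with the hypothetical helper name
``intel_idle_params``, shows one way to pick such options out of a command
line string (for instance, the contents of ``/proc/cmdline``).  It is an
illustration only and does not validate the parameter names or values.

.. code-block:: python

    PREFIX = "intel_idle."

    def intel_idle_params(cmdline):
        """Collect intel_idle.<name>[=<value>] options from a command line."""
        params = {}
        for token in cmdline.split():
            if token.startswith(PREFIX):
                # A bare option without "=<value>" yields an empty string.
                name, _, value = token[len(PREFIX):].partition("=")
                params[name] = value
        return params
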
.. _intel-idle-enumeration-of-states:

Enumeration of Idle States
==========================

Each ``MWAIT`` hint value is interpreted by the processor as a license to
reconfigure itself in a certain way in order to save energy.  The processor
configurations (with reduced power draw) resulting from that are referred to
as C-states (in the ACPI terminology) or idle states.  The list of meaningful
``MWAIT`` hint values and idle states (i.e. low-power configurations of the
processor) corresponding to them depends on the processor model and it may also
depend on the configuration of the platform.

In order to create a list of available idle states required by the ``CPUIdle``
subsystem (see :ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst),
``intel_idle`` can use two sources of information: static tables of idle states
for different processor models included in the driver itself and the ACPI tables
of the system.  The former are always used if the processor model at hand is
recognized by ``intel_idle`` and the latter are used if that is required for
the given processor model (which is the case for all server processor models
recognized by ``intel_idle``) or if the processor model is not recognized.
[There is a module parameter that can be used to make the driver use the ACPI
tables with any processor model recognized by it; see
`below <intel-idle-parameters_>`_.]

If the ACPI tables are going to be used for building the list of available idle
states, ``intel_idle`` first looks for a ``_CST`` object under one of the ACPI
objects corresponding to the CPUs in the system (refer to the ACPI specification
[2]_ for the description of ``_CST`` and its output package).  Because the
``CPUIdle`` subsystem expects that the list of idle states supplied by the
driver will be suitable for all of the CPUs handled by it and ``intel_idle`` is
registered as the ``CPUIdle`` driver for all of the CPUs in the system, the
driver looks for the first ``_CST`` object returning at least one valid idle
state description and such that all of the idle states included in its return
package are of the FFH (Functional Fixed Hardware) type, which means that the
``MWAIT`` instruction is expected to be used to tell the processor that it can
enter one of them.  The return package of that ``_CST`` is then assumed to be
applicable to all of the other CPUs in the system and the idle state
descriptions extracted from it are stored in a preliminary list of idle states
coming from the ACPI tables.  [This step is skipped if ``intel_idle`` is
configured to ignore the ACPI tables; see `below <intel-idle-parameters_>`_.]

Next, the first (index 0) entry in the list of available idle states is
initialized to represent a "polling idle state" (a pseudo-idle state in which
the target CPU continuously fetches and executes instructions), and the
subsequent (real) idle state entries are populated as follows.

If the processor model at hand is recognized by ``intel_idle``, there is a
(static) table of idle state descriptions for it in the driver.  In that case,
the "internal" table is the primary source of information on idle states and the
information from it is copied to the final list of available idle states.  If
using the ACPI tables for the enumeration of idle states is not required
(depending on the processor model), all of the listed idle states are enabled by
default (so all of them will be taken into consideration by ``CPUIdle``
governors during CPU idle state selection).  Otherwise, some of the listed idle
states may not be enabled by default if there are no matching entries in the
preliminary list of idle states coming from the ACPI tables.  In that case user
space can still enable them later (on a per-CPU basis) with the help of
the ``disable`` idle state attribute in ``sysfs`` (see
:ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst).  This basically means that
the idle states "known" to the driver may not be enabled by default if they have
not been exposed by the platform firmware (through the ACPI tables).

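The default-enable rule for recognized processor models can be sketched in
Python as follows.  This is a simplified model, not the driver code: the helper
name, the tuple layout and the state names and hint values in the example are
made up for illustration, and treating two entries as "matching" when their
``MWAIT`` hint values are equal is an assumption.

.. code-block:: python

    def mark_enabled_by_default(internal_table, acpi_hints, acpi_required):
        """Return (name, hint, enabled_by_default) for each internal entry.

        internal_table: list of (name, mwait_hint) pairs (illustrative layout).
        acpi_hints: set of MWAIT hints found in the preliminary ACPI list.
        acpi_required: True if the processor model requires the ACPI tables.
        """
        result = []
        for name, hint in internal_table:
            # Without the ACPI requirement every listed state is enabled.
            enabled = (not acpi_required) or (hint in acpi_hints)
            result.append((name, hint, enabled))
        return result

For instance, with the hypothetical table
``[("C1", 0x00), ("C1E", 0x01), ("C6", 0x20)]`` and the ACPI tables exposing
only hints ``0x00`` and ``0x20``, the ``C1E`` entry would start out disabled
when the ACPI tables are required and enabled otherwise.
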
If the given processor model is not recognized by ``intel_idle``, but it
supports ``MWAIT``, the preliminary list of idle states coming from the ACPI
tables is used for building the final list that will be supplied to the
``CPUIdle`` core during driver registration.  For each idle state in that list,
the description, ``MWAIT`` hint and exit latency are copied to the corresponding
entry in the final list of idle states.  The name of the idle state represented
by it (to be returned by the ``name`` idle state attribute in ``sysfs``) is
"CX_ACPI", where X is the index of that idle state in the final list (note that
the minimum value of X is 1, because 0 is reserved for the "polling" state), and
its target residency is based on the exit latency value.  Specifically, for
C1-type idle states the exit latency value is also used as the target residency
(for compatibility with the majority of the "internal" tables of idle states for
various processor models recognized by ``intel_idle``) and for the other idle
state types (C2 and C3) the target residency value is 3 times the exit latency
(again, that is because it reflects the target residency to exit latency ratio
in the majority of cases for the processor models recognized by ``intel_idle``).
All of the idle states in the final list are enabled by default in this case.


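The naming and target residency rules for the ACPI-derived list can be
summarized with a short Python sketch.  The dictionary keys, the helper name
and the placeholder used for the index 0 entry are assumptions made for the
sake of the example; only the "CX_ACPI" naming and the 1x/3x residency rule
come from the description above.

.. code-block:: python

    def build_final_list(acpi_states):
        """Build the final list from the preliminary ACPI list.

        Each input entry is a dict with "type" (1, 2 or 3), "desc",
        "mwait_hint" and "exit_latency" keys (hypothetical layout).
        """
        final = [{"name": "POLL"}]  # index 0: placeholder for the polling state
        for cx in acpi_states:
            index = len(final)  # the first real state gets index 1
            factor = 1 if cx["type"] == 1 else 3
            final.append({
                "name": "C%d_ACPI" % index,
                "desc": cx["desc"],
                "mwait_hint": cx["mwait_hint"],
                "exit_latency": cx["exit_latency"],
                "target_residency": factor * cx["exit_latency"],
                "enabled": True,  # everything is enabled by default here
            })
        return final
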
.. _intel-idle-initialization:

Initialization
==============

The initialization of ``intel_idle`` starts with checking if the kernel command
line options forbid the use of the ``MWAIT`` instruction.  If that is the case,
an error code is returned right away.

The next step is to check whether or not the processor model is known to the
driver, which determines the idle states enumeration method (see
`above <intel-idle-enumeration-of-states_>`_), and whether or not the processor
supports ``MWAIT`` (the initialization fails if that is not the case).  Then,
the ``MWAIT`` support in the processor is enumerated through ``CPUID`` and the
driver initialization fails if the level of support is not as expected (for
example, if the total number of ``MWAIT`` substates returned is 0).

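To make the substate check more concrete, the Python sketch below decodes a
substate count from a ``CPUID`` result.  It assumes the layout given in the
Intel Software Developer's Manual for ``CPUID`` leaf 05H, where ``EDX`` holds a
4-bit substate count per C-state (starting with C0) and bits 7:4 of the
``MWAIT`` hint select the target C-state with 0 standing for C1.  The helper
names are hypothetical and the register value has to be supplied by the
caller, because ``CPUID`` cannot be executed from Python directly.

.. code-block:: python

    def num_substates(cpuid5_edx, mwait_hint):
        """Number of MWAIT substates reported for the C-state of a hint."""
        cstate = (mwait_hint >> 4) & 0xF  # 0 corresponds to C1
        # Skip the C0 field, then pick the 4-bit field for this C-state.
        return (cpuid5_edx >> ((cstate + 1) * 4)) & 0xF

    def any_substates(cpuid5_edx):
        """False models the "total number of substates is 0" failure case."""
        return cpuid5_edx != 0
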
Next, if the driver is not configured to ignore the ACPI tables (see
`below <intel-idle-parameters_>`_), the idle states information provided by the
platform firmware is extracted from them.

Then, ``CPUIdle`` device objects are allocated for all CPUs and the list of
available idle states is created as explained
`above <intel-idle-enumeration-of-states_>`_.

Finally, ``intel_idle`` is registered with the help of cpuidle_register_driver()
as the ``CPUIdle`` driver for all CPUs in the system and a CPU online callback
for configuring individual CPUs is registered via cpuhp_setup_state(), which
(among other things) causes the callback routine to be invoked for all of the
CPUs present in the system at that time (each CPU executes its own instance of
the callback routine).  That routine registers a ``CPUIdle`` device for the CPU
running it (which enables the ``CPUIdle`` subsystem to operate that CPU) and
optionally performs some CPU-specific initialization actions that may be
required for the given processor model.


.. _intel-idle-parameters:

Kernel Command Line Options and Module Parameters
=================================================

The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
and ``idle=nomwait``.  If any of them is present in the kernel command line, the
``MWAIT`` instruction is not allowed to be used, so the initialization of
``intel_idle`` will fail.

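A minimal Python sketch of that rule, operating on a command line string such
as the contents of ``/proc/cmdline``, could look as follows (the helper name
is hypothetical and the kernel's own option parsing is more involved).

.. code-block:: python

    MWAIT_BLOCKING_OPTIONS = {"idle=poll", "idle=halt", "idle=nomwait"}

    def mwait_forbidden(cmdline):
        """Return True if the command line rules out the use of MWAIT."""
        return any(tok in MWAIT_BLOCKING_OPTIONS for tok in cmdline.split())
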
Apart from that there are four module parameters recognized by ``intel_idle``
itself that can be set via the kernel command line (they cannot be updated via
sysfs, so that is the only way to change their values).

The ``max_cstate`` parameter value is the maximum idle state index in the list
of idle states supplied to the ``CPUIdle`` core during the registration of the
driver.  It is also the maximum number of regular (non-polling) idle states that
can be used by ``intel_idle``, so the enumeration of idle states is terminated
after finding that number of usable idle states (the other idle states that
potentially might have been used if ``max_cstate`` had been greater are not
taken into consideration at all).  Setting ``max_cstate`` can prevent
``intel_idle`` from exposing idle states that are regarded as "too deep" for
some reason to the ``CPUIdle`` core, but it does so by making them effectively
invisible until the system is shut down and started again, which may not always
be desirable.  In practice, it is only really necessary to do that if the idle
states in question cannot be enabled during system startup, because in the
working state of the system the CPU power management quality of service (PM
QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states
even if they have been enumerated (see :ref:`cpu-pm-qos` in
Documentation/admin-guide/pm/cpuidle.rst).
Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.

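The effect of ``max_cstate`` on the enumeration can be modeled with the Python
sketch below.  It is an illustration only: the helper name, the use of plain
strings for idle states and the ``"POLL"`` placeholder for the index 0 entry
are assumptions, and every candidate passed in is taken to be usable.

.. code-block:: python

    def enumerate_states(candidates, max_cstate):
        """Return the list handed to the CPUIdle core, or None on failure."""
        if max_cstate == 0:
            return None  # initialization fails
        final = ["POLL"]  # index 0 is the polling state
        for state in candidates:
            # Stop once max_cstate regular states have been collected.
            if len(final) > max_cstate:
                break
            final.append(state)
        return final

With three candidates and ``max_cstate`` equal to 2, the third candidate is
never looked at, so the largest index in the resulting list is 2.
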
The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle``
if the kernel has been configured with ACPI support) can be set to make the
driver ignore the system's ACPI tables entirely or use them for all of the
recognized processor models, respectively (both are unset by default, and
``use_acpi`` has no effect if ``no_acpi`` is set).

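How these two parameters combine with the per-model rules from the enumeration
section can be written down as a small decision function.  This Python sketch
is a reading of the text rather than the driver's logic verbatim, and the
function and argument names are hypothetical.

.. code-block:: python

    def acpi_tables_consulted(no_acpi, use_acpi, model_recognized,
                              model_requires_acpi):
        """Return True if the ACPI tables take part in the enumeration."""
        if no_acpi:
            return False  # use_acpi has no effect in this case
        if not model_recognized:
            return True  # ACPI is the only source of idle states
        return use_acpi or model_requires_acpi
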
The value of the ``states_off`` module parameter (0 by default) represents a
list of idle states to be disabled by default in the form of a bitmask.

Namely, the positions of the bits that are set in the ``states_off`` value are
the indices of idle states to be disabled by default (as reflected by the names
of the corresponding idle state directories in ``sysfs``, :file:`state0`,
:file:`state1` ... :file:`state<i>` ..., where ``<i>`` is the index of the given
idle state; see :ref:`idle-states-representation` in
Documentation/admin-guide/pm/cpuidle.rst).

For example, if ``states_off`` is equal to 3, the driver will disable idle
states 0 and 1 by default, and if it is equal to 8, idle state 3 will be
disabled by default and so on (bit positions beyond the maximum idle state index
are ignored).

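The bitmask interpretation can be checked with a few lines of Python.  The
helper name and the ``num_states`` argument (the number of entries in the list
of idle states) are introduced here for illustration only.

.. code-block:: python

    def states_off_indices(states_off, num_states):
        """Indices of idle states disabled by default for a bitmask value."""
        # Bits at positions num_states and above are ignored.
        return [i for i in range(num_states) if states_off & (1 << i)]

With four idle states, a value of 3 yields indices 0 and 1, and a value of 8
yields index 3, in line with the example given in the text.
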
The idle states disabled this way can be enabled (on a per-CPU basis) from user
space via ``sysfs``.


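As a sketch of what enabling a state from user space may look like, the Python
helpers below build the path of the ``disable`` attribute for a given CPU and
idle state and write 0 to it.  The helper names and the ``root`` argument are
hypothetical, the directory layout is the usual ``CPUIdle`` one under
``/sys/devices/system/cpu``, and writing to the real attribute requires
sufficient privileges.

.. code-block:: python

    import os

    SYSFS_CPU = "/sys/devices/system/cpu"

    def disable_attr_path(cpu, state, root=SYSFS_CPU):
        """Path of the "disable" attribute of one idle state of one CPU."""
        return os.path.join(root, "cpu%d" % cpu, "cpuidle",
                            "state%d" % state, "disable")

    def enable_state(cpu, state, root=SYSFS_CPU):
        """Enable an idle state by writing 0 to its "disable" attribute."""
        with open(disable_attr_path(cpu, state, root), "w") as f:
            f.write("0")
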
.. _intel-idle-core-and-package-idle-states:

Core and Package Levels of Idle States
======================================

Typically, in a processor supporting the ``MWAIT`` instruction there are (at
least) two levels of idle states (or C-states).  One level, referred to as
"core C-states", covers individual cores in the processor, whereas the other
level, referred to as "package C-states", covers the entire processor package
and it may also involve other components of the system (GPUs, memory
controllers, I/O hubs etc.).

Some of the ``MWAIT`` hint values allow the processor to use core C-states only
(most importantly, that is the case for the ``MWAIT`` hint value corresponding
to the ``C1`` idle state), but the majority of them give it a license to put
the target core (i.e. the core containing the logical CPU executing ``MWAIT``
with the given hint value) into a specific core C-state and then (if possible)
to enter a specific package C-state at the deeper level.  For example, the
``MWAIT`` hint value representing the ``C3`` idle state allows the processor to
put the target core into the low-power state referred to as "core ``C3``" (or
``CC3``), which happens if all of the logical CPUs (SMT siblings) in that core
have executed ``MWAIT`` with the ``C3`` hint value (or with a hint value
representing a deeper idle state), and in addition to that (in the majority of
cases) it gives the processor a license to put the entire package (possibly
including some non-CPU components such as a GPU or a memory controller) into the
low-power state referred to as "package ``C3``" (or ``PC3``), which happens if
all of the cores have gone into the ``CC3`` state and (possibly) some additional
conditions are satisfied (for instance, if the GPU is covered by ``PC3``, it may
be required to be in a certain GPU-specific low-power state for ``PC3`` to be
reachable).

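The relationship between the requests made by logical CPUs and the resulting
core and package C-states can be approximated by a toy model.  The Python
sketch below is a deliberate simplification and not a description of real
hardware: C-states are represented by their numbers (0 meaning "not idle" for
a CPU or core, and "no package C-state" for the package), only ``C1`` is
treated as core-level only, and all names are hypothetical.

.. code-block:: python

    CORE_ONLY = {1}  # C-states whose MWAIT hints are core-level only

    def core_cstate(sibling_requests):
        """A core goes only as deep as its shallowest SMT sibling allows."""
        return min(sibling_requests)

    def package_cstate(cores, other_conditions_met=True):
        """Package C-state for a list of per-core sibling request lists."""
        shallowest = min(core_cstate(siblings) for siblings in cores)
        if shallowest in CORE_ONLY or not other_conditions_met:
            return 0
        return shallowest

For instance, ``package_cstate([[3, 6], [3, 3]])`` gives 3 in this model,
whereas a single non-idle logical CPU, as in ``[[3, 0], [3, 3]]``, keeps the
result at 0.
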
As a rule, there is no simple way to make the processor use core C-states only
if the conditions for entering the corresponding package C-states are met, so
the logical CPU executing ``MWAIT`` with a hint value that is not core-level
only (like for ``C1``) must always assume that this may cause the processor to
enter a package C-state.  [That is why the exit latency and target residency
values corresponding to the majority of ``MWAIT`` hint values in the "internal"
tables of idle states in ``intel_idle`` reflect the properties of package
C-states.]  If using package C-states is not desirable at all, either
:ref:`PM QoS <cpu-pm-qos>` or the ``max_cstate`` module parameter of
``intel_idle`` described `above <intel-idle-parameters_>`_ must be used to
restrict the range of permissible idle states to the ones with core-level only
``MWAIT`` hint values (like ``C1``).


References
==========

.. [1] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2B*,
       https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2b-manual.html

.. [2] *Advanced Configuration and Power Interface (ACPI) Specification*,
       https://uefi.org/specifications