.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

==============================================
``intel_idle`` CPU Idle Time Management Driver
==============================================

:Copyright: |copy| 2020 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


General Information
===================

``intel_idle`` is a part of the
:doc:`CPU idle time management subsystem <cpuidle>` in the Linux kernel
(``CPUIdle``).  It is the default CPU idle time management driver for the
Nehalem and later generations of Intel processors, but the level of support for
a particular processor model in it depends on whether or not it recognizes that
processor model and may also depend on information coming from the platform
firmware.  [To understand ``intel_idle`` it is necessary to know how ``CPUIdle``
works in general, so this is the time to get familiar with :doc:`cpuidle` if you
have not done that yet.]

``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the
logical CPU executing it is idle and so it may be possible to put some of the
processor's functional blocks into low-power states.  That instruction takes two
arguments (passed in the ``EAX`` and ``ECX`` registers of the target CPU), the
first of which, referred to as a *hint*, can be used by the processor to
determine what can be done (for details refer to Intel Software Developer’s
Manual [1]_).  Accordingly, ``intel_idle`` refuses to work with processors in
which the support for the ``MWAIT`` instruction has been disabled (for example,
via the platform firmware configuration menu) or which do not support that
instruction at all.
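As a rough illustration of what a hint value carries, the sketch below splits
a hint into a C-state number and a sub-state.  It assumes the commonly used
layout in which bits 7:4 hold the target C-state minus one and bits 3:0 hold
the sub-state; the function name and the sample values are made up for this
example, and the Software Developer's Manual remains the authoritative source
for the encoding.

```python
def decode_mwait_hint(hint: int) -> tuple[int, int]:
    """Split an MWAIT hint (the EAX argument) into (C-state number, sub-state).

    Assumed layout: bits 7:4 hold the target C-state minus one (so the
    value 0xF wraps around to C0) and bits 3:0 hold the sub-state.
    """
    cstate = (((hint >> 4) & 0xF) + 1) & 0xF
    substate = hint & 0xF
    return cstate, substate


# Sample hint values chosen for illustration only.
for hint in (0x00, 0x01, 0x20):
    cstate, substate = decode_mwait_hint(hint)
    print(f"hint {hint:#04x}: C{cstate}, sub-state {substate}")
```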

``intel_idle`` is not modular, so it cannot be unloaded, which means that the
only way to pass early-configuration-time parameters to it is via the kernel
command line.


.. _intel-idle-enumeration-of-states:

Enumeration of Idle States
==========================

Each ``MWAIT`` hint value is interpreted by the processor as a license to
reconfigure itself in a certain way in order to save energy.  The processor
configurations (with reduced power draw) resulting from that are referred to
as C-states (in the ACPI terminology) or idle states.  The list of meaningful
``MWAIT`` hint values and idle states (i.e. low-power configurations of the
processor) corresponding to them depends on the processor model and it may also
depend on the configuration of the platform.

In order to create a list of available idle states required by the ``CPUIdle``
subsystem (see :ref:`idle-states-representation` in :doc:`cpuidle`),
``intel_idle`` can use two sources of information: static tables of idle states
for different processor models included in the driver itself and the ACPI tables
of the system.  The former are always used if the processor model at hand is
recognized by ``intel_idle`` and the latter are used if that is required for
the given processor model (which is the case for all server processor models
recognized by ``intel_idle``) or if the processor model is not recognized.

If the ACPI tables are going to be used for building the list of available idle
states, ``intel_idle`` first looks for a ``_CST`` object under one of the ACPI
objects corresponding to the CPUs in the system (refer to the ACPI specification
[2]_ for the description of ``_CST`` and its output package).  Because the
``CPUIdle`` subsystem expects that the list of idle states supplied by the
driver will be suitable for all of the CPUs handled by it and ``intel_idle`` is
registered as the ``CPUIdle`` driver for all of the CPUs in the system, the
driver looks for the first ``_CST`` object returning at least one valid idle
state description and such that all of the idle states included in its return
package are of the FFH (Functional Fixed Hardware) type, which means that the
``MWAIT`` instruction is expected to be used to tell the processor that it can
enter one of them.  The return package of that ``_CST`` is then assumed to be
applicable to all of the other CPUs in the system and the idle state
descriptions extracted from it are stored in a preliminary list of idle states
coming from the ACPI tables.  [This step is skipped if ``intel_idle`` is
configured to ignore the ACPI tables; see `below <intel-idle-parameters_>`_.]
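The ``_CST`` selection rule described above can be modelled roughly as
follows.  This is only a sketch: the ``CstState`` type, its fields and
``pick_cst()`` are hypothetical names invented for the example, and a package
is treated as "valid" simply when it is non-empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CstState:
    ctype: int        # ACPI C-state type (1, 2 or 3)
    latency_us: int   # exit latency reported by the firmware
    mwait_hint: int   # hint taken from the FFH register description
    is_ffh: bool      # True if the state is entered through FFH (MWAIT)


def pick_cst(packages: list[list[CstState]]) -> Optional[list[CstState]]:
    """Return the first _CST package that is non-empty and FFH-only."""
    for states in packages:
        if states and all(state.is_ffh for state in states):
            return states
    return None  # no usable _CST, so the preliminary list stays empty


# Made-up firmware data: the first CPU exposes a non-FFH state.
cpu0 = [CstState(1, 1, 0x00, True), CstState(2, 50, 0x10, False)]
cpu1 = [CstState(1, 1, 0x00, True), CstState(2, 50, 0x10, True)]
print(pick_cst([cpu0, cpu1]) is cpu1)
```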

Next, the first (index 0) entry in the list of available idle states is
initialized to represent a "polling idle state" (a pseudo-idle state in which
the target CPU continuously fetches and executes instructions), and the
subsequent (real) idle state entries are populated as follows.

If the processor model at hand is recognized by ``intel_idle``, there is a
(static) table of idle state descriptions for it in the driver.  In that case,
the "internal" table is the primary source of information on idle states and the
information from it is copied to the final list of available idle states.  If
using the ACPI tables for the enumeration of idle states is not required
(depending on the processor model), all of the listed idle states are enabled by
default (so all of them will be taken into consideration by ``CPUIdle``
governors during CPU idle state selection).  Otherwise, some of the listed idle
states may not be enabled by default if there are no matching entries in the
preliminary list of idle states coming from the ACPI tables.  In that case user
space can still enable them later (on a per-CPU basis) with the help of
the ``disable`` idle state attribute in ``sysfs`` (see
:ref:`idle-states-representation` in :doc:`cpuidle`).  This basically means that
the idle states "known" to the driver may not be enabled by default if they have
not been exposed by the platform firmware (through the ACPI tables).
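A minimal sketch of inspecting and flipping the ``disable`` attribute from
user space is shown below.  It assumes the per-CPU ``sysfs`` layout described
in :doc:`cpuidle` (one ``stateN`` directory per idle state, each with ``name``
and ``disable`` files); the helper names are made up for this example, and
writing the attribute on a real system requires root privileges.

```python
from pathlib import Path

# Default location for CPU0; other CPUs have analogous directories.
CPU0_CPUIDLE = "/sys/devices/system/cpu/cpu0/cpuidle"


def list_idle_states(cpuidle_dir: str = CPU0_CPUIDLE) -> list[tuple[str, str, bool]]:
    """Return (directory name, state name, disabled) for each idle state."""
    dirs = sorted(Path(cpuidle_dir).glob("state[0-9]*"),
                  key=lambda p: int(p.name[len("state"):]))
    result = []
    for state in dirs:
        name = (state / "name").read_text().strip()
        disabled = (state / "disable").read_text().strip() != "0"
        result.append((state.name, name, disabled))
    return result


def set_disabled(state_dir: str, disabled: bool) -> None:
    """Write 1 (disable) or 0 (enable) to the state's ``disable`` attribute."""
    Path(state_dir, "disable").write_text("1" if disabled else "0")


for entry in list_idle_states():
    print(entry)
```

On a system without the ``cpuidle`` directory the listing is simply empty.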

If the given processor model is not recognized by ``intel_idle``, but it
supports ``MWAIT``, the preliminary list of idle states coming from the ACPI
tables is used for building the final list that will be supplied to the
``CPUIdle`` core during driver registration.  For each idle state in that list,
the description, ``MWAIT`` hint and exit latency are copied to the corresponding
entry in the final list of idle states.  The name of the idle state represented
by it (to be returned by the ``name`` idle state attribute in ``sysfs``) is
"CX_ACPI", where X is the index of that idle state in the final list (note that
the minimum value of X is 1, because 0 is reserved for the "polling" state), and
its target residency is based on the exit latency value.  Specifically, for
C1-type idle states the exit latency value is also used as the target residency
(for compatibility with the majority of the "internal" tables of idle states for
various processor models recognized by ``intel_idle``) and for the other idle
state types (C2 and C3) the target residency value is 3 times the exit latency
(again, that is because it reflects the target residency to exit latency ratio
in the majority of cases for the processor models recognized by ``intel_idle``).
All of the idle states in the final list are enabled by default in this case.
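The naming and target residency rules for this ACPI-only case can be worked
through with a short sketch.  The input tuples, the dictionary keys and the
``"POLL"`` placeholder used for index 0 are assumptions made for the example,
not the driver's data structures.

```python
def build_from_acpi(acpi_states: list[tuple[int, int, int]]) -> list[dict]:
    """Build a final state list from (ACPI type, MWAIT hint, exit latency in us)."""
    final = [{"name": "POLL"}]  # index 0 is reserved for the polling state
    for index, (ctype, hint, latency) in enumerate(acpi_states, start=1):
        # C1-type: residency equals latency; C2 and C3: three times the latency.
        residency = latency if ctype == 1 else 3 * latency
        final.append({
            "name": f"C{index}_ACPI",
            "mwait_hint": hint,
            "exit_latency": latency,
            "target_residency": residency,
            "enabled": True,  # everything is enabled by default in this case
        })
    return final


# Made-up preliminary list: one C1-type and one C2-type state.
for state in build_from_acpi([(1, 0x00, 1), (2, 0x10, 50)]):
    print(state)
```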


.. _intel-idle-initialization:

Initialization
==============

The initialization of ``intel_idle`` starts with checking if the kernel command
line options forbid the use of the ``MWAIT`` instruction.  If that is the case,
an error code is returned right away.

The next step is to check whether or not the processor model is known to the
driver, which determines the idle states enumeration method (see
`above <intel-idle-enumeration-of-states_>`_), and whether or not the processor
supports ``MWAIT`` (the initialization fails if that is not the case).  Then,
the ``MWAIT`` support in the processor is enumerated through ``CPUID`` and the
driver initialization fails if the level of support is not as expected (for
example, if the total number of ``MWAIT`` substates returned is 0).
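To make the substate check more concrete, the sketch below decodes a register
value laid out the way the ``EDX`` output of the ``CPUID`` ``MONITOR``/``MWAIT``
leaf (leaf 5) is generally documented: one 4-bit field per C-state, starting
with C0 in bits 3:0.  That layout, the function name and the sample value are
assumptions of this example; the exact checks performed by the driver are not
reproduced here.

```python
def mwait_substates(edx: int, highest: int = 7) -> dict[int, int]:
    """Map C-state number -> number of MWAIT sub-states (4 bits per C-state)."""
    return {c: (edx >> (4 * c)) & 0xF for c in range(highest + 1)}


def has_any_substate(edx: int) -> bool:
    """False corresponds to the 'total number of substates is 0' failure case."""
    return sum(mwait_substates(edx).values()) > 0


sample_edx = 0x1120  # made-up value: C1 has 2 sub-states, C2 and C3 have 1 each
print(mwait_substates(sample_edx))
print(has_any_substate(sample_edx))
```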

Next, if the driver is not configured to ignore the ACPI tables (see
`below <intel-idle-parameters_>`_), the idle states information provided by the
platform firmware is extracted from them.

Then, ``CPUIdle`` device objects are allocated for all CPUs and the list of
available idle states is created as explained
`above <intel-idle-enumeration-of-states_>`_.

Finally, ``intel_idle`` is registered with the help of cpuidle_register_driver()
as the ``CPUIdle`` driver for all CPUs in the system and a CPU online callback
for configuring individual CPUs is registered via cpuhp_setup_state(), which
(among other things) causes the callback routine to be invoked for all of the
CPUs present in the system at that time (each CPU executes its own instance of
the callback routine).  That routine registers a ``CPUIdle`` device for the CPU
running it (which enables the ``CPUIdle`` subsystem to operate that CPU) and
optionally performs some CPU-specific initialization actions that may be
required for the given processor model.


.. _intel-idle-parameters:

Kernel Command Line Options and Module Parameters
=================================================

The *x86* architecture support code recognizes three kernel command line
options related to CPU idle time management: ``idle=poll``, ``idle=halt``,
and ``idle=nomwait``.  If any of them is present in the kernel command line, the
``MWAIT`` instruction is not allowed to be used, so the initialization of
``intel_idle`` will fail.
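A quick way to check a command line for these options is sketched below.  The
helper name is made up, and the sample string stands in for the contents of
``/proc/cmdline`` on a running system.

```python
BLOCKING_OPTIONS = {"idle=poll", "idle=halt", "idle=nomwait"}


def mwait_forbidden(cmdline: str) -> bool:
    """True if the command line carries idle=poll, idle=halt or idle=nomwait."""
    return any(token in BLOCKING_OPTIONS for token in cmdline.split())


# Sample command line for illustration; on a live system the string could be
# read from /proc/cmdline instead.
sample = "root=/dev/sda1 ro quiet idle=nomwait"
print(mwait_forbidden(sample))
```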

Apart from that there are two module parameters recognized by ``intel_idle``
itself that can be set via the kernel command line (they cannot be updated via
sysfs, so that is the only way to change their values).

The ``max_cstate`` parameter value is the maximum idle state index in the list
of idle states supplied to the ``CPUIdle`` core during the registration of the
driver.  It is also the maximum number of regular (non-polling) idle states that
can be used by ``intel_idle``, so the enumeration of idle states is terminated
after finding that number of usable idle states (the other idle states that
potentially might have been used if ``max_cstate`` had been greater are not
taken into consideration at all).  Setting ``max_cstate`` can prevent
``intel_idle`` from exposing idle states that are regarded as "too deep" for
some reason to the ``CPUIdle`` core, but it does so by making them effectively
invisible until the system is shut down and started again, which may not always
be desirable.  In practice, it is only really necessary to do that if the idle
states in question cannot be enabled during system startup, because in the
working state of the system the CPU power management quality of service (PM
QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states
even if they have been enumerated (see :ref:`cpu-pm-qos` in :doc:`cpuidle`).
Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail.
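The effect of ``max_cstate`` on the final list can be sketched as a simple
truncation.  The function and the state names below are illustrative only; on
the kernel command line the parameter would be passed in the usual module
parameter form, for example ``intel_idle.max_cstate=1``.

```python
def apply_max_cstate(states: list[str], max_cstate: int) -> list[str]:
    """Keep the polling state (index 0) plus at most max_cstate real states."""
    # The text only covers the value 0; rejecting negative values as well is
    # a choice made for this sketch.
    if max_cstate <= 0:
        raise ValueError("intel_idle initialization fails for max_cstate=0")
    return states[: max_cstate + 1]


# Hypothetical list of enumerated states, shallowest first.
enumerated = ["POLL", "C1", "C1E", "C6"]
print(apply_max_cstate(enumerated, 2))
```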

The ``noacpi`` module parameter (which is recognized by ``intel_idle`` if the
kernel has been configured with ACPI support) can be set to make the driver
ignore the system's ACPI tables entirely (it is unset by default).


.. _intel-idle-core-and-package-idle-states:

Core and Package Levels of Idle States
======================================

Typically, in a processor supporting the ``MWAIT`` instruction there are (at
least) two levels of idle states (or C-states).  One level, referred to as
"core C-states", covers individual cores in the processor, whereas the other
level, referred to as "package C-states", covers the entire processor package
and it may also involve other components of the system (GPUs, memory
controllers, I/O hubs etc.).

Some of the ``MWAIT`` hint values allow the processor to use core C-states only
(most importantly, that is the case for the ``MWAIT`` hint value corresponding
to the ``C1`` idle state), but the majority of them give it a license to put
the target core (i.e. the core containing the logical CPU executing ``MWAIT``
with the given hint value) into a specific core C-state and then (if possible)
to enter a specific package C-state at the deeper level.  For example, the
``MWAIT`` hint value representing the ``C3`` idle state allows the processor to
put the target core into the low-power state referred to as "core ``C3``" (or
``CC3``), which happens if all of the logical CPUs (SMT siblings) in that core
have executed ``MWAIT`` with the ``C3`` hint value (or with a hint value
representing a deeper idle state), and in addition to that (in the majority of
cases) it gives the processor a license to put the entire package (possibly
including some non-CPU components such as a GPU or a memory controller) into the
low-power state referred to as "package ``C3``" (or ``PC3``), which happens if
all of the cores have gone into the ``CC3`` state and (possibly) some additional
conditions are satisfied (for instance, if the GPU is covered by ``PC3``, it may
be required to be in a certain GPU-specific low-power state for ``PC3`` to be
reachable).
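The ``C3`` example above boils down to two nested "all of them" conditions,
which the following toy model spells out.  Each logical CPU is represented by
the depth of the idle state it requested (3 for ``C3``, a larger number for a
deeper state); the function names and the ``extra_conditions_met`` flag are
inventions of this sketch, and real hardware decides these transitions on its
own.

```python
def core_in_cc3(sibling_requests: list[int]) -> bool:
    """A core may enter CC3 once every SMT sibling asked for C3 or deeper."""
    return all(depth >= 3 for depth in sibling_requests)


def package_may_enter_pc3(cores: list[list[int]],
                          extra_conditions_met: bool = True) -> bool:
    """The package may enter PC3 once all cores are in CC3 and any additional
    conditions (for example a GPU low-power state) are satisfied."""
    return extra_conditions_met and all(core_in_cc3(core) for core in cores)


# Two cores with two SMT siblings each; one sibling only requested C1.
print(package_may_enter_pc3([[3, 6], [3, 1]]))
# All siblings requested C3 or deeper.
print(package_may_enter_pc3([[3, 6], [3, 3]]))
```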

As a rule, there is no simple way to make the processor use core C-states only
if the conditions for entering the corresponding package C-states are met, so
the logical CPU executing ``MWAIT`` with a hint value that is not core-level
only (like for ``C1``) must always assume that this may cause the processor to
enter a package C-state.  [That is why the exit latency and target residency
values corresponding to the majority of ``MWAIT`` hint values in the "internal"
tables of idle states in ``intel_idle`` reflect the properties of package
C-states.]  If using package C-states is not desirable at all, either
:ref:`PM QoS <cpu-pm-qos>` or the ``max_cstate`` module parameter of
``intel_idle`` described `above <intel-idle-parameters_>`_ must be used to
restrict the range of permissible idle states to the ones with core-level only
``MWAIT`` hint values (like ``C1``).
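One way to apply the PM QoS option from user space is through the
``/dev/cpu_dma_latency`` special file described in :doc:`cpuidle`: a process
writes a latency limit to it and the limit stays in force for as long as the
file is kept open.  The context manager below is a minimal sketch of that
pattern; its name is made up, it needs sufficient privileges to open the
device, and the limit that leaves only core-level states available depends on
the exit latencies reported on the system in question.

```python
import os
import struct
from contextlib import contextmanager


@contextmanager
def cpu_latency_limit(limit_us: int, path: str = "/dev/cpu_dma_latency"):
    """Hold a CPU latency limit (in microseconds) while the block runs."""
    fd = os.open(path, os.O_WRONLY)
    try:
        # The limit is written as a binary signed 32-bit integer.
        os.write(fd, struct.pack("=i", limit_us))
        yield
    finally:
        os.close(fd)  # closing the file drops the request
```

A latency-sensitive section would then be wrapped as
``with cpu_latency_limit(1): ...``, where the value 1 is only a placeholder.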


References
==========

.. [1] *Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2B*,
       https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-2b-manual.html

.. [2] *Advanced Configuration and Power Interface (ACPI) Specification*,
       https://uefi.org/specifications