History log of /openbmc/linux/drivers/cpufreq/intel_pstate.c (Results 1 – 25 of 1302)
Revision tags: v6.6.25, v6.6.24, v6.6.23
# 1a868273 17-Feb-2024 Doug Smythies <dsmythies@telus.net>

cpufreq: intel_pstate: fix pstate limits enforcement for adjust_perf call back

[ Upstream commit f0a0fc10abb062d122db5ac4ed42f6d1ca342649 ]

There is a loophole in P-state limit clamping for the intel_cpufreq CPU
frequency scaling driver (intel_pstate in passive mode) when it is used
with the schedutil CPU frequency scaling governor and HWP (Hardware
P-states) control enabled, and the adjust_perf callback path is taken.

Fix it.
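A minimal sketch of the kind of clamping the fix enforces, assuming hypothetical names (`struct perf_limits`, `clamp_adjust_perf()`, `cap_pstate`); it is not the driver's actual code. Both the minimum and the target P-state requested through the adjust_perf path are forced into the policy's [min, max] range, and the upper bound also respects the hardware capacity.

```c
#include <assert.h>

/* Hypothetical policy limits, in HWP performance-level units. */
struct perf_limits {
	int min_perf_ratio; /* lowest level allowed by the policy */
	int max_perf_ratio; /* highest level allowed by the policy */
};

/* Clamp v into [lo, hi]; requires lo <= hi. */
static int clamp_int(int v, int lo, int hi)
{
	assert(lo <= hi);
	return v < lo ? lo : (v > hi ? hi : v);
}

/*
 * Hypothetical adjust_perf-style clamping: the requested minimum and
 * target levels must both respect the policy limits, and the upper
 * bound must not exceed the hardware capacity (cap_pstate) unless that
 * would put it below the clamped minimum.
 */
static void clamp_adjust_perf(const struct perf_limits *lim, int cap_pstate,
			      int *min_pstate, int *target_pstate)
{
	int max_pstate = cap_pstate < lim->max_perf_ratio ? cap_pstate
							   : lim->max_perf_ratio;

	*min_pstate = clamp_int(*min_pstate, lim->min_perf_ratio,
				lim->max_perf_ratio);
	if (max_pstate < *min_pstate)
		max_pstate = *min_pstate;
	*target_pstate = clamp_int(*target_pstate, *min_pstate, max_pstate);
}
```

With limits of [10, 20] and a capacity of 45, a request of min 5 and target 40 comes out as min 10 and target 20.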

Fixes: a365ab6b9dfb ("cpufreq: intel_pstate: Implement the ->adjust_perf() callback")
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>



Revision tags: v6.6.16, v6.6.15, v6.6.14
# 212b6868 22-Jan-2024 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Refine computation of P-state for given frequency

commit 192cdb1c907fd8df2d764c5bb17496e415e59391 upstream.

On systems using HWP, if a given frequency is equal to the maximum turbo
frequency or the maximum non-turbo frequency, the HWP performance level
corresponding to it is already known and can be used directly without
any computation.

Accordingly, adjust the code to use the known HWP performance levels in
the cases mentioned above.
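The shortcut can be sketched as follows. The struct, the helper name and the round-up rule for the general case are assumptions made for illustration, not the driver's code; frequencies are in kHz and `scaling` is kHz per HWP performance level.

```c
#include <assert.h>

/* Hypothetical subset of per-CPU P-state data (frequencies in kHz). */
struct pstate_info {
	int max_pstate;          /* HWP level of the max non-turbo frequency */
	int turbo_pstate;        /* HWP level of the max turbo frequency */
	unsigned int max_freq;   /* max non-turbo frequency, kHz */
	unsigned int turbo_freq; /* max turbo frequency, kHz */
	unsigned int scaling;    /* kHz per HWP performance level */
};

/*
 * Map a frequency to an HWP performance level.  The two frequencies whose
 * levels are already known are returned directly; anything else is
 * computed from the scaling factor (rounding up, an assumption here).
 */
static int freq_to_hwp_perf(const struct pstate_info *p, unsigned int freq)
{
	if (freq == p->turbo_freq)
		return p->turbo_pstate;
	if (freq == p->max_freq)
		return p->max_pstate;
	return (int)((freq + p->scaling - 1) / p->scaling);
}
```

If the BIOS reports a turbo frequency of 4560000 kHz for level 60 while the kernel assumes 78741 kHz per level, pure computation would give level 58 (4560000 / 78741 is about 57.9, rounded up), capping the CPU below its real turbo level; the shortcut returns 60.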

This also helps to avoid limiting CPU capacity artificially in some
cases when the BIOS produces the HWP_CAP numbers using a different
E-core-to-P-core performance scaling factor than expected by the kernel.

Fixes: f5c8cf2a4992 ("cpufreq: intel_pstate: hybrid: Use known scaling factor for P-cores")
Cc: 6.1+ <stable@vger.kernel.org> # 6.1+
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>



Revision tags: v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48
# d51847ac 20-Aug-2023 Doug Smythies <dsmythies@telus.net>

cpufreq: intel_pstate: set stale CPU frequency to minimum

The intel_pstate CPU frequency scaling driver does not
use policy->cur, so it remains 0.
When the sampled CPU frequency is stale, arch_freq_get_on_cpu()
falls back to the nominal clock frequency, because its call to
cpufreq_quick_get() returns the never-updated policy->cur value of 0.
Thus, the listed frequency might be outside of currently
set limits. Some users are complaining about the high
reported frequency, albeit stale, when their system is
idle and/or it is above the reduced maximum they have set.

This patch makes the intel_pstate driver maintain policy->cur
at the current minimum CPU frequency.
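The reporting fallback described above can be modelled as a one-line choice. This is a simplified sketch, not the x86 code itself; the helper name `stale_freq_khz()` and the example frequencies are hypothetical.

```c
#include <assert.h>

/*
 * Simplified model of the stale-sample fallback: the cached policy->cur
 * value is reported unless it is 0, in which case the nominal frequency
 * is reported instead.
 */
static unsigned int stale_freq_khz(unsigned int policy_cur_khz,
				   unsigned int nominal_khz)
{
	assert(nominal_khz != 0); /* the nominal frequency is always known */
	return policy_cur_khz ? policy_cur_khz : nominal_khz;
}
```

Before the change, policy->cur is 0, so an idle CPU reports the nominal frequency (for example 2400000 kHz); with policy->cur kept at the minimum (for example 800000 kHz), that minimum is reported instead.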

Reported-by: Yang Jie <yang.jie@linux.intel.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217597
Signed-off-by: Doug Smythies <dsmythies@telus.net>
[ rjw: White space damage fixes and comment adjustment ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



Revision tags: v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37
# 0fcfc9e5 29-Jun-2023 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Fix scaling for hybrid-capable systems with disabled E-cores

Some system BIOS configurations may provide an option to disable E-cores.
In that case, the CPUID hybrid feature bit (Leaf 7 sub-leaf 0, EDX[15])
may not be set, but the HWP performance limits still use a scaling
factor like on any other hybrid-enabled system.

The current check for applying the scaling factor fails when the hybrid
CPUID feature is not set, so the only way to determine whether scaling
should be applied is to check the CPPC nominal frequency and nominal
performance.

First, for systems predating Alder Lake, the CPPC nominal frequency and
nominal performance are 0, which can be used to distinguish those
systems from hybrid systems with disabled E-cores.

Second, if the CPPC nominal frequency and nominal performance are
defined, which indicates the need to use a special scaling factor, and
the nominal performance value multiplied by 100 is not equal to the
nominal frequency, use the hybrid scaling factor.

This can be done for all HWP systems without additional CPU model check.
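The two-step decision can be sketched as a predicate. The helper name is hypothetical, and the sketch assumes the CPPC nominal frequency is given in MHz, so that a non-hybrid part satisfies nominal_perf * 100 == nominal_freq (100 MHz per performance unit).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether the hybrid scaling factor is needed, using only the
 * CPPC nominal values (nominal_freq_mhz in MHz, an assumption here).
 */
static bool use_hybrid_scaling(unsigned int nominal_perf,
			       unsigned int nominal_freq_mhz)
{
	/* Systems predating Alder Lake report 0 for these values. */
	if (!nominal_perf || !nominal_freq_mhz)
		return false;

	/* 100 MHz per performance unit means the usual scaling applies. */
	return nominal_perf * 100 != nominal_freq_mhz;
}
```

For example, nominal performance 24 with nominal frequency 2400 MHz keeps the regular factor, whereas nominal performance 30 with 2400 MHz selects the hybrid one.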

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject and changelog edits, removal of unneeded parens, comment
edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



Revision tags: v6.1.36, v6.4, v6.1.35
# 03f44ffb 21-Jun-2023 Tero Kristo <tero.kristo@linux.intel.com>

cpufreq: intel_pstate: Fix energy_performance_preference for passive

If the intel_pstate driver is set to passive mode, then writing the
same value to the energy_performance_preference sysfs attribute twice
fails. This is caused by returning the wrong value (the index of the
matched energy_perf_string) instead of the length of the passed-in
parameter. Fix this by forcing the internal return value to zero when
the same preference is passed in by the user. The same issue is not
present when the driver is used in active mode.
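A user-space sketch of the bug pattern, with hypothetical names (`pref_strings`, `match_pref()`, `apply_pref()`, `store_pref()`); it assumes the buffer has already had its trailing newline stripped. A sysfs store handler must return the number of bytes consumed on success, so a local variable that briefly holds the matched index has to be reset to 0 on every success path.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical preference table, ordered like energy_perf_strings[]. */
static const char *const pref_strings[] = {
	"default", "performance", "balance_performance", "balance_power", "power",
};

/* Index of the matching preference string, or -1 if none matches. */
static int match_pref(const char *buf)
{
	for (size_t i = 0; i < sizeof(pref_strings) / sizeof(pref_strings[0]); i++)
		if (strcmp(buf, pref_strings[i]) == 0)
			return (int)i;
	return -1;
}

/* Stand-in for the hardware update; returns 0 or a negative error. */
static int apply_pref(int index, int *cur_index)
{
	*cur_index = index;
	return 0;
}

/* sysfs-style store: return count on success, a negative errno on error. */
static ssize_t store_pref(const char *buf, size_t count, int *cur_index)
{
	int ret = match_pref(buf); /* ret temporarily holds the index */

	if (ret < 0)
		return -EINVAL;

	if (ret != *cur_index)
		ret = apply_pref(ret, cur_index);
	else
		ret = 0; /* the fix: otherwise the index is returned */

	return ret ? ret : (ssize_t)count;
}
```

Without the `else` branch, writing "performance" a second time would return 1 (its index), which user space sees as a short write.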

Fixes: f6ebbcf08f37 ("cpufreq: intel_pstate: Implement passive mode with HWP enabled")
Reported-by: Niklas Neronin <niklas.neronin@intel.com>
Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



Revision tags: v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15
# 1f5e62f5 02-Mar-2023 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Enable HWP IO boost for all servers

The HWP IO boost results in slight improvements for IO performance on
both Ice Lake and Sapphire Rapids servers.

Currently there is a CPU model check for Skylake desktop and server along
with the ACPI PM profile for performance and enterprise servers to enable
IO boost.

Remove the CPU model check, so that all current server models enable HWP
IO boost by default.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



# 2744a63c 13-Mar-2023 Greg Kroah-Hartman <gregkh@linuxfoundation.org>

cpufreq: move to use bus_get_dev_root()

Direct access to the struct bus_type dev_root pointer is going away soon
so replace that with a call to bus_get_dev_root() instead, which is what
it is there for.

Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: linux-pm@vger.kernel.org
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Link: https://lore.kernel.org/r/20230313182918.1312597-3-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>



Revision tags: v6.1.14
# 5bd289f6 24-Feb-2023 Nick Alcock <nick.alcock@oracle.com>

cpufreq: intel_pstate: remove MODULE_LICENSE in non-modules

Since commit 8b41fc4454e ("kbuild: create modules.builtin without
Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations
are used to identify modules. As a consequence, uses of the macro
in non-modules will cause modprobe to misidentify their containing
object file as a module when it is not (false positives), and modprobe
might succeed rather than failing with a suitable error message.

So remove it in the files in this commit, none of which can be built as
modules.

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Suggested-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



Revision tags: v6.1.13
# 60675225 22-Feb-2023 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Adjust balance_performance EPP for Sapphire Rapids

While the majority of server OS distributions are deployed with the
"performance" governor as the default, some distributions like Ubuntu use
the "powersave" governor by default.

While using the "powersave" governor in its default configuration on
Sapphire Rapids systems leads to much lower power consumption, the
performance is lower by more than 25% for several workloads relative to
the "performance" governor.

A 37% difference has been reported by www.Phoronix.com [1].

This is a consequence of using a relatively high EPP value in the
default configuration of the "powersave" governor and the performance
can be made much closer to the "performance" governor's level by
adjusting the default EPP value. Based on experiments, with EPP of 0x00,
0x10, 0x20, the performance delta between the "powersave" governor and
the "performance" one is around 12%. However, the EPP of 0x20 reduces
average power by 18% with respect to the lower EPP values.

[Note that raising min_perf_pct in sysfs as high as 50% in addition to
adjusting EPP does not improve the performance any further.]

For this reason, change the EPP value corresponding to the default
balance_performance setting for Sapphire Rapids to 0x20, which is
straightforward, because an analogous default EPP adjustment has been
applied to Alder Lake and intel_pstate already has a way to set the
balance_performance EPP value based on the processor model.

The goal here is to limit the mean performance delta between the
"powersave" governor in the default configuration and the "performance"
governor for a wide variety of server workloads to around 10-12%. For
some bursty workloads, this delta can still be large, because the
frequency ramp-up lags when the "powersave" governor is in use,
irrespective of the EPP setting, while the "performance" governor always
requests the maximum possible frequency.

Link: https://www.phoronix.com/review/centos-clear-spr/6 # [1]
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



Revision tags: v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9, v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16
# e8a0e30b 28-Dec-2022 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Drop ACPI _PSS states table patching

After making acpi_processor_get_platform_limit() use the "no limit"
value for its frequency QoS request when _PPC returns 0, it is not
necessary to replace the frequency corresponding to the first _PSS
return package entry with the maximum turbo frequency of the given
CPU in intel_pstate_init_acpi_perf_limits() any more, so drop the
code doing that along with the comment explaining it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



Revision tags: v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80
# df51f287 21-Nov-2022 Giovanni Gherdovich <ggherdovich@suse.cz>

cpufreq: intel_pstate: Add Sapphire Rapids support in no-HWP mode

Users may disable HWP in firmware, in which case intel_pstate wouldn't load
unless the CPU model is explicitly supported.

See also the following past commits:

commit d8de7a44e11f ("cpufreq: intel_pstate: Add Skylake servers support")
commit 706c5328851d ("cpufreq: intel_pstate: Add Cometlake support in
no-HWP mode")
commit fbdc21e9b038 ("cpufreq: intel_pstate: Add Icelake servers support in
no-HWP mode")
commit 71bb5c82aaae ("cpufreq: intel_pstate: Add Tigerlake support in
no-HWP mode")

Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



Revision tags: v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6
# 21cdb6c1 27-Oct-2022 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Allow EPP 0x80 setting by the firmware

With commit 3d13058ed2a6 ("cpufreq: intel_pstate: Use firmware default
EPP"), the firmware can set an EPP, which the driver will not overwrite.
However, the driver applies a valid range check with a strict upper
bound (firmware EPP < 0x80), so the firmware can't specify an EPP of 0x80.

If the firmware doesn't specify an EPP in the valid range, the driver
uses a hard-coded EPP of 102. However, some Chrome hardware vendors
don't want this overwrite and want to boot with the chipset default EPP
of 0x80, as this improves battery life.

In this case, they want to be able to specify an EPP of 0x80 via the
firmware, which requires the valid range to include 0x80 as well.
However, the valid range can't simply be extended to include 0x80,
because it is the chipset default EPP: even without any firmware
specifying an EPP, the chipset always boots with an EPP of 0x80.

To make sure that an EPP of 0x80 was specified by the firmware and is
not just the chipset default, an additional check is needed to verify
that HWP was enabled by the firmware before boot. The only way the
firmware can update the EPP is to enable HWP and write the EPP via
MSR_HWP_REQUEST.

The driver already checks whether HWP was enabled by the firmware.
Use the same flag and extend the valid range to include 0x80.
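One way to express the extended check, as a sketch with hypothetical names rather than the driver's code. The exclusive lower bound of 0 follows the earlier "Use firmware default EPP" description (more performance-oriented than 0x80 "but not 0"); applying the firmware-HWP condition only to 0x80 is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define EPP_PERFORMANCE         0x00 /* most performance-oriented EPP */
#define EPP_BALANCE_PERFORMANCE 0x80 /* chipset default EPP */

/*
 * Should a power-up EPP be adopted as the balance_performance default?
 * 0x80 is accepted only if the firmware enabled HWP before boot, since
 * otherwise it is indistinguishable from the chipset default.
 */
static bool use_firmware_epp(unsigned int epp, bool hwp_enabled_by_fw)
{
	if (epp == EPP_PERFORMANCE)
		return false;
	if (epp < EPP_BALANCE_PERFORMANCE)
		return true;
	return epp == EPP_BALANCE_PERFORMANCE && hwp_enabled_by_fw;
}
```

Under these assumptions, a firmware EPP of 102 (0x66) is always accepted, while 0x80 is accepted only when HWP was already enabled by the firmware.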

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



Revision tags: v6.0.5, v5.15.75, v6.0.4
# f5c8cf2a 24-Oct-2022 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: hybrid: Use known scaling factor for P-cores

Commit 46573fd6369f ("cpufreq: intel_pstate: hybrid: Rework HWP
calibration") attempted to use the information from CPPC (the nominal
performance in particular) to obtain the scaling factor allowing the
frequency to be computed if the HWP performance level of the given CPU
is known or vice versa.

However, it turns out that on some platforms this doesn't work, because
the CPPC information on them does not align with the contents of the
MSR_HWP_CAPABILITIES registers.

This basically means that the only way to make intel_pstate work on all
of the hybrid platforms to date is to use the observation that on all
of them the scaling factor between the HWP performance levels and
frequency for P-cores is 78741 (approximately 100000/1.27). For
E-cores it is 100000, which is the same as for all of the non-hybrid
"core" platforms and does not require any changes.

Accordingly, make intel_pstate use 78741 as the scaling factor between
HWP performance levels and frequency for P-cores on all hybrid platforms
and drop the dependency of the HWP calibration code on CPPC.
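The arithmetic behind the two scaling factors, as a sketch assuming they are expressed in kHz per HWP performance level (which the non-hybrid value of 100000, i.e. 100 MHz per level, suggests). The helper names are hypothetical; 78741 is 100000/1.27 (about 78740.2) rounded up.

```c
#include <assert.h>

#define CORE_SCALING   100000 /* kHz per HWP level: E-cores, non-hybrid */
#define HYBRID_SCALING  78741 /* kHz per HWP level: hybrid P-cores */

/* Frequency (kHz) corresponding to an HWP performance level. */
static unsigned int hwp_perf_to_khz(unsigned int perf, unsigned int scaling)
{
	assert(scaling != 0);
	return perf * scaling;
}

/* Highest HWP level whose frequency does not exceed freq_khz. */
static unsigned int khz_to_hwp_perf(unsigned int freq_khz, unsigned int scaling)
{
	assert(scaling != 0);
	return freq_khz / scaling;
}
```

A P-core level of 48 corresponds to 48 × 78741 = 3779568 kHz (about 3.78 GHz), while the same level on an E-core maps to 4.8 GHz.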

Fixes: 46573fd6369f ("cpufreq: intel_pstate: hybrid: Rework HWP calibration")
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 5.15+ <stable@vger.kernel.org> # 5.15+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



# 8dbab94d 24-Oct-2022 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Read all MSRs on the target CPU

Some of the MSR accesses in intel_pstate are carried out on the CPU
that is running the code, but the values coming from them are used
for the performance scaling of the other CPUs.

This is problematic, for example, on hybrid platforms where
MSR_TURBO_RATIO_LIMIT for P-cores and E-cores is different, so the
values read from it on a P-core are generally not applicable to E-cores
and the other way around.

For this reason, make the driver access all MSRs on the target CPU on
platforms using the "core" pstate_funcs callbacks which is the case for
all of the hybrid platforms released to date. For this purpose, pass
a CPU argument to the ->get_max(), ->get_max_physical(), ->get_min()
and ->get_turbo() pstate_funcs callbacks and from there pass it to
rdmsrl_on_cpu() or rdmsrl_safe_on_cpu() to access the MSR on the target
CPU.

Fixes: 46573fd6369f ("cpufreq: intel_pstate: hybrid: Rework HWP calibration")
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 5.15+ <stable@vger.kernel.org> # 5.15+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



Revision tags: v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66
# 71bb5c82 06-Sep-2022 Doug Smythies <dsmythies@telus.net>

cpufreq: intel_pstate: Add Tigerlake support in no-HWP mode

Users may disable HWP in firmware, in which case intel_pstate wouldn't load
unless the CPU model is explicitly supported.

Add TIGERLAKE to the list of CPUs that can register intel_pstate while not
advertising the HWP capability. Without this change, a TIGERLAKE in no-HWP
mode could only use the acpi_cpufreq frequency scaling driver.

See also commits:
d8de7a44e11f: cpufreq: intel_pstate: Add Skylake servers support
fbdc21e9b038: cpufreq: intel_pstate: Add Icelake servers support in no-HWP mode
706c5328851d: cpufreq: intel_pstate: Add Cometlake support in no-HWP mode

Reported by: M. Cargi Ari <cagriari@pm.me>
Signed-off-by: Doug Smythies <dsmythies@telus.net>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



Revision tags: v5.15.65, v5.15.64, v5.15.63, v5.15.62, v5.15.61, v5.15.60, v5.15.59, v5.19, v5.15.58, v5.15.57, v5.15.56, v5.15.55, v5.15.54, v5.15.53, v5.15.52, v5.15.51, v5.15.50, v5.15.49, v5.15.48, v5.15.47, v5.15.46, v5.15.45, v5.15.44, v5.15.43, v5.15.42, v5.18, v5.15.41, v5.15.40, v5.15.39, v5.15.38
# bbd67f1b 02-May-2022 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Support Sapphire Rapids OOB mode

Prevent intel_pstate from loading when OOB (Out Of Band) P-states mode is
enabled on Sapphire Rapids. The OOB identifying bits are the same as on
prior generation CPUs like Ice Lake servers, so also add Sapphire Rapids
to the intel_pstate_cpu_oob_ids list.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



Revision tags: v5.15.37, v5.15.36, v5.15.35, v5.15.34, v5.15.33
# addca285 07-Apr-2022 Chen Yu <yu.c.chen@intel.com>

cpufreq: intel_pstate: Handle no_turbo in frequency invariance

Problem statement:

Once the user has disabled turbo frequency by

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

the cfs_rq's util_avg becomes quite small when compared with
CPU capacity.

Steps to reproduce:

# echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

# ./x86_cpuload --count 1 --start 3 --timeout 100 --busy 99

launches 1 thread bound to CPU3 that runs for 100 seconds
with a CPU utilization of 99% [1].

top result:
%Cpu3 : 98.4 us, 0.0 sy, 0.0 ni, 1.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

check util_avg:
cat /sys/kernel/debug/sched/debug | grep "cfs_rq\[3\]" -A 20 | grep util_avg
.util_avg : 611

So util_avg / CPU capacity is 611/1024 (about 60%), which is much
smaller than the 98.4% shown in the top result.

This might impact some logic in the scheduler. For example,
group_is_overloaded() compares group_capacity and group_util
in the sched group to check whether this sched group is overloaded.
With this gap, even when there is a nearly 100% workload, the sched
group will not be regarded as overloaded. Besides group_is_overloaded(),
there are other victims. There is ongoing work that aims to
optimize task wakeup in an LLC domain. The main idea is to stop
searching for idle CPUs if the sched domain is overloaded [2]. This
proposal also relies on util_avg / CPU capacity to decide whether the
LLC domain is overloaded.

Analysis:

CPU frequency invariance causes this difference. In summary,
when CPU frequency invariance is enabled, the util_sum of the cfs_rq
decays quite fast while the CPU is idle.

The details are as follows:

As depicted in update_rq_clock_pelt(), when frequency invariance
is enabled, there are two clock variables on each rq, clock_task
and clock_pelt:

The clock_pelt scales the time to reflect the effective amount of
computation done during the running delta time but then syncs back to
clock_task when rq is idle.

absolute time    | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
@ max frequency  ------******---------------******---------------
@ half frequency ------************---------************---------
clock pelt       | 1| 2|    3|    4| 7| 8| 9|   10|   11|14|15|16

The fast decay of util_sum during idle is due to:

1. rq->clock_pelt is always behind rq->clock_task.
2. rq->last_update is updated to rq->clock_pelt' after invoking
___update_load_sum().
3. Then, when the CPU becomes idle, rq->clock_pelt' is suddenly
increased a lot, up to rq->clock_task.
4. When ___update_load_sum() is entered again, the idle period is
calculated as rq->clock_task - rq->last_update, i.e.
rq->clock_task - rq->clock_pelt'. The lower the CPU frequency, the
larger the delta rq->clock_task - rq->clock_pelt' is. Since the idle
period is only used to decay util_sum, util_sum drops significantly
during the idle period.
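The effect in steps 1-4 can be reproduced with a deliberately simplified model of the two clocks, in the abstract time units of the diagram. The struct and helper names are hypothetical, and the model ignores everything the real PELT code does beyond the clock bookkeeping described above.

```c
#include <assert.h>

/* Simplified per-runqueue clocks, in abstract time units. */
struct rq_clocks {
	double clock_task;  /* unscaled task time */
	double clock_pelt;  /* frequency-scaled PELT time */
	double last_update; /* PELT time of the last load-sum update */
};

/* Run for `delta` units at freq_ratio = current / max frequency. */
static void run_for(struct rq_clocks *rq, double delta, double freq_ratio)
{
	assert(freq_ratio > 0.0 && freq_ratio <= 1.0);
	rq->clock_task += delta;
	rq->clock_pelt += delta * freq_ratio; /* less work at lower frequency */
	rq->last_update = rq->clock_pelt;     /* load sum updated on PELT time */
}

/* Idle for `delta` units; returns the idle period used for decay. */
static double idle_for(struct rq_clocks *rq, double delta)
{
	rq->clock_task += delta;
	rq->clock_pelt = rq->clock_task;      /* PELT clock syncs back when idle */
	return rq->clock_pelt - rq->last_update;
}
```

Starting both clocks at 2 and running for 4 units at half frequency leaves clock_pelt at 4 while clock_task is 6; one unit of real idle time then yields an idle period of 3 (the jump from 4 to 7 in the diagram). Running at full frequency for 2 units and idling for 1 gives an idle period of 1.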

Proposal:

This symptom is not only caused by disabling turbo frequency; it
also appears if the user limits the max frequency at runtime.

This is because, if the frequency is always lower than the max
frequency, CPU frequency invariance decays util_sum quite fast during
idle.

As some end users disable turbo after boot, this patch aims to
prevent this symptom and deals with the turbo scenario for now.

Ideally, CPU frequency invariance would become aware of the
(user-specified) max CPU frequency at runtime in the future.

Link: https://github.com/yu-chen-surf/x86_cpuload.git #1
Link: https://lore.kernel.org/lkml/20220310005228.11737-1-yu.c.chen@intel.com/ #2
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



Revision tags: v5.15.32, v5.15.31, v5.17, v5.15.30, v5.15.29, v5.15.28
# 3d13058e 10-Mar-2022 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Use firmware default EPP

For some specific platforms (e.g. AlderLake), the balance_performance
EPP is updated from the hard-coded value in the driver. This acts as
both the default and the balance_performance EPP. The purpose of this
EPP update is to reach the maximum 1-core turbo frequency (when
possible) out of the box.

Although this objective can be achieved with a hard-coded value in the
driver, other EPP values may be better in terms of power. But that is
very subjective, depending on the platform and use case, and it is not
practical to hard-code a per-platform default in the driver.

If a platform wants to specify a default EPP, it can be set in the firmware.
If this EPP is not the chipset default of 0x80 (the balance_perf EPP unless
the driver changed it) and is more performance-oriented but not 0, the
driver can use it as the default and balance_perf EPP. This way, no
driver update is required for every new platform with a new default EPP.

If the firmware didn't update the EPP from the chipset default then
the hard coded value is used as per existing implementation.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



Revision tags: v5.15.27, v5.15.26, v5.15.25, v5.15.24, v5.15.23, v5.15.22, v5.15.21, v5.15.20, v5.15.19, v5.15.18, v5.15.17, v5.4.173, v5.15.16, v5.15.15, v5.16
# dfeeedc1 17-Dec-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Update cpuinfo.max_freq on HWP_CAP changes

With HWP enabled, when the turbo range of performance levels is
disabled by the platform firmware, the CPU capacity is given by
the "guaranteed performance" field in MSR_HWP_CAPABILITIES which
is generally dynamic. When it changes, the kernel receives an HWP
notification interrupt handled by notify_hwp_interrupt().

When the "guaranteed performance" value changes in the above
configuration, the CPU performance scaling needs to be adjusted so
as to use the new CPU capacity in computations, which means that
the cpuinfo.max_freq value needs to be updated for that CPU.

Accordingly, modify intel_pstate_notify_work() to read
MSR_HWP_CAPABILITIES and update cpuinfo.max_freq to reflect the
new configuration (this update can be carried out even if the
configuration doesn't actually change, because it simply doesn't
matter then and it takes less time to update it than to do extra
checks to decide whether or not a change has really occurred).
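A sketch of the recomputation, decoding MSR_HWP_CAPABILITIES according to its architectural layout (highest performance in bits 7:0, guaranteed performance in bits 15:8). The macro names mirror those in the kernel's msr-index.h but are redefined locally; `max_freq_khz()` and its `turbo_disabled` flag are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Field layout of MSR_HWP_CAPABILITIES (IA32_HWP_CAPABILITIES). */
#define HWP_HIGHEST_PERF(x)    ((unsigned int)(((x) >> 0) & 0xff))
#define HWP_GUARANTEED_PERF(x) ((unsigned int)(((x) >> 8) & 0xff))

/*
 * Recompute the maximum frequency (kHz) from a fresh HWP_CAP value.
 * With turbo disabled by the firmware, capacity is the guaranteed
 * level; otherwise it is the highest level.
 */
static unsigned int max_freq_khz(uint64_t hwp_cap, bool turbo_disabled,
				 unsigned int scaling)
{
	unsigned int perf = turbo_disabled ? HWP_GUARANTEED_PERF(hwp_cap)
					   : HWP_HIGHEST_PERF(hwp_cap);

	assert(scaling != 0);
	return perf * scaling;
}
```

With an HWP_CAP value of 0x01182030 and 100000 kHz per level, the guaranteed level 0x20 gives 3200000 kHz, while the highest level 0x30 gives 4800000 kHz.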

Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



Revision tags: v5.15.10
# b6e6f8be 16-Dec-2021 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Update EPP for AlderLake mobile

There is an expectation from users that they can get the frequency specified
by cpufreq/cpuinfo_max_freq when conditions permit. But with AlderLake
mobile this may not be possible, because the frequency may be clipped
based on the system power-up EPP value. In this case, users can update
cpufreq/energy_performance_preference to a more performance-oriented EPP
to limit the clipping of frequencies.

To get the same out-of-box behavior as on prior generations of CPUs,
update the EPP for AlderLake mobile CPUs on boot. On prior generations of
CPUs, EPP = 128 was enough to get the maximum frequency, but with AlderLake
mobile the equivalent EPP is 102. Since EPP is model-specific, its values
may have a different meaning on each generation of CPU.

The current EPP string "balance_performance" corresponds to EPP = 128.
Change the EPP corresponding to "balance_performance" to 102 only for
AlderLake mobile CPUs and update it on each CPU during boot.

To implement this, reuse the epp_values[] array and update the modified
EPP at the index for BALANCE_PERFORMANCE. Add a dummy EPP_INDEX_DEFAULT to
epp_values[] to match the indexes in energy_perf_strings[].
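The parallel-array arrangement can be sketched as below. Only epp_values[], energy_perf_strings[], EPP_INDEX_DEFAULT, the value 128 for balance_performance and the AlderLake mobile value 102 come from the text; the remaining index names, strings and EPP values, and the `update_epp_defaults()` hook, are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum epp_index {
	EPP_INDEX_DEFAULT,             /* dummy entry matching "default" */
	EPP_INDEX_PERFORMANCE,
	EPP_INDEX_BALANCE_PERFORMANCE,
	EPP_INDEX_BALANCE_POWERSAVE,
	EPP_INDEX_POWERSAVE,
};

static const char *const energy_perf_strings[] = {
	[EPP_INDEX_DEFAULT] = "default",
	[EPP_INDEX_PERFORMANCE] = "performance",
	[EPP_INDEX_BALANCE_PERFORMANCE] = "balance_performance",
	[EPP_INDEX_BALANCE_POWERSAVE] = "balance_power",
	[EPP_INDEX_POWERSAVE] = "power",
};

/* Not const, so one entry can be redefined at boot. */
static unsigned int epp_values[] = {
	[EPP_INDEX_DEFAULT] = 0, /* unused, keeps the indexes aligned */
	[EPP_INDEX_PERFORMANCE] = 0x00,
	[EPP_INDEX_BALANCE_PERFORMANCE] = 0x80, /* 128 */
	[EPP_INDEX_BALANCE_POWERSAVE] = 0xC0,
	[EPP_INDEX_POWERSAVE] = 0xFF,
};

/* Hypothetical boot-time hook: redefine balance_performance on ADL mobile. */
static void update_epp_defaults(bool is_alderlake_mobile)
{
	if (is_alderlake_mobile)
		epp_values[EPP_INDEX_BALANCE_PERFORMANCE] = 102;
}
```

Because both arrays are indexed by the same enum, a string matched in energy_perf_strings[] can be turned into an EPP with a single lookup, and the model-specific override touches exactly one entry.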

After HWP PM is enabled, also update the EPP when "balance_performance"
is redefined, the very first time after boot on each CPU. On subsequent
suspend/resume or offline/online cycles the old EPP is restored, so no
specific action is needed.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>



Revision tags: v5.15.9, v5.15.8
# 458b03f8 10-Dec-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Drop redundant intel_pstate_get_hwp_cap() call

It is not necessary to call intel_pstate_get_hwp_cap() from
intel_pstate_update_perf_limits(), because it gets called from
intel_pstate_verify_cpu_policy() which is either invoked directly
right before intel_pstate_update_perf_limits(), in
intel_cpufreq_verify_policy() in the passive mode, or called
from driver callbacks in a sequence that causes it to be followed
by an immediate intel_pstate_update_perf_limits().

Namely, in the active mode intel_pstate_verify_cpu_policy() is called
by intel_pstate_verify_policy(), which is the ->verify() callback
routine of intel_pstate and gets called by the cpufreq core right
before intel_pstate_set_policy(), which is the driver's ->setpolicy()
callback routine, where intel_pstate_update_perf_limits() is called.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
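
The ordering argument above can be checked with a toy model. The
functions below only borrow the shape of the real call chain (they touch
no MSRs), and the names passive_verify(), active_verify() and
active_setpolicy() are hypothetical stand-ins for the driver callbacks.
The assertion in update_perf_limits() encodes the claim that HWP_CAP has
always just been refreshed by the verify step.

```c
#include <assert.h>
#include <stdbool.h>

struct cpu_model {
	unsigned int hwp_cap_reads; /* how many times HWP_CAP was read */
	bool cap_fresh;             /* read since the last limits update */
};

static void get_hwp_cap(struct cpu_model *c)
{
	c->hwp_cap_reads++;
	c->cap_fresh = true;
}

/* The verify step is where HWP_CAP gets refreshed. */
static void verify_cpu_policy(struct cpu_model *c)
{
	get_hwp_cap(c);
}

/* No get_hwp_cap() here any more: the caller has just verified. */
static void update_perf_limits(struct cpu_model *c)
{
	assert(c->cap_fresh);
	c->cap_fresh = false;
}

/* Passive mode: ->verify() runs both steps back to back. */
static void passive_verify(struct cpu_model *c)
{
	verify_cpu_policy(c);
	update_perf_limits(c);
}

/* Active mode: the core calls ->verify(), then ->setpolicy(). */
static void active_verify(struct cpu_model *c) { verify_cpu_policy(c); }
static void active_setpolicy(struct cpu_model *c) { update_perf_limits(c); }
```

In both modes each limits update consumes exactly one HWP_CAP read, which
is why the extra call inside intel_pstate_update_perf_limits() was
redundant.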


Revision tags: v5.15.7, v5.15.6, v5.15.5, v5.15.4
# 03c83982 18-Nov-2021 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: ITMT support for overclocked system

On systems with overclocking enabled, CPPC Highest Performance can be
hard coded to 0xff. In this case, even if cores have different highest
performance levels, ITMT can't be enabled, as the current implementation
depends on CPPC Highest Performance.

On such systems we can use the MSR_HWP_CAPABILITIES maximum performance
field when CPPC Highest Performance is 0xff.

For legacy reasons, we can't depend solely on MSR_HWP_CAPABILITIES, as
on some older systems CPPC Highest Performance is the only way to
identify cores with different performance.

Reported-by: Michael Larabel <Michael@MichaelLarabel.com>
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Tested-by: Michael Larabel <Michael@MichaelLarabel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
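
The fallback rule above can be sketched as follows. This assumes the
highest-performance field sits in bits 7:0 of MSR_HWP_CAPABILITIES; the
function names are hypothetical, and itmt_possible() is a simplification
of "ITMT needs cores with different highest performance".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CPPC_HIGHEST_UNUSABLE 0xffu /* hard-coded value on overclocked parts */

/* Assumed layout: highest performance in bits 7:0 of HWP_CAPABILITIES. */
static unsigned int hwp_cap_highest(uint64_t hwp_cap)
{
	return (unsigned int)(hwp_cap & 0xff);
}

/* Prefer CPPC (older systems may only expose it); fall back on 0xff. */
static unsigned int core_highest_perf(unsigned int cppc_highest, uint64_t hwp_cap)
{
	if (cppc_highest == CPPC_HIGHEST_UNUSABLE)
		return hwp_cap_highest(hwp_cap);
	return cppc_highest;
}

/* ITMT only helps when at least two cores rank differently. */
static bool itmt_possible(const unsigned int *cppc, const uint64_t *hwp_cap,
			  int ncores)
{
	assert(ncores > 0);
	for (int i = 1; i < ncores; i++) {
		if (core_highest_perf(cppc[i], hwp_cap[i]) !=
		    core_highest_perf(cppc[0], hwp_cap[0]))
			return true;
	}
	return false;
}
```

Keeping CPPC as the primary source and consulting HWP_CAP only for the
0xff case preserves the behavior on the older systems mentioned above.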


Revision tags: v5.15.3
# ed38eb49 17-Nov-2021 Rafael J. Wysocki <rafael.j.wysocki@intel.com>

cpufreq: intel_pstate: Fix active mode offline/online EPP handling

After commit 4adcf2e5829f ("cpufreq: intel_pstate: Add ->offline and
->online callbacks") the EPP value set by the "performance" scaling
algorithm in the active mode is not restored after an offline/online
cycle which replaces it with the saved EPP value coming from user
space.

Address this issue by forcing intel_pstate_hwp_set() to set a new
EPP value when it runs for the first time after the CPU goes online.

Fixes: 4adcf2e5829f ("cpufreq: intel_pstate: Add ->offline and ->online callbacks")
Link: https://lore.kernel.org/linux-pm/adc7132c8655bd4d1c8b6129578e931a14fe1db2.camel@linux.intel.com/
Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: 5.9+ <stable@vger.kernel.org> # 5.9+
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
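
One way to express the idea above is a "policy the EPP was written for"
field that the set path compares against, reset to an unknown value on
offline so the next set cannot be skipped. This is a simplified model,
not the patch itself; struct cpu_state, hwp_set() and hwp_offline() are
hypothetical, and EPP 0 is assumed to be the "performance" value.

```c
#include <assert.h>

enum policy_kind { POLICY_UNKNOWN, POLICY_POWERSAVE, POLICY_PERFORMANCE };

#define EPP_PERFORMANCE 0 /* assumed EPP used by the "performance" algorithm */

struct cpu_state {
	enum policy_kind epp_policy; /* policy the current EPP was written for */
	int epp_cached;              /* EPP last requested by user space */
	int epp_hw;                  /* simulated EPP field in HWP_REQUEST */
};

/* Skip the EPP write when the policy has not changed. */
static void hwp_set(struct cpu_state *c, enum policy_kind policy)
{
	assert(policy != POLICY_UNKNOWN);
	if (policy == c->epp_policy)
		return;
	c->epp_policy = policy;
	c->epp_hw = (policy == POLICY_PERFORMANCE) ? EPP_PERFORMANCE
						   : c->epp_cached;
}

/* Offline restores the user's EPP and forgets the policy, forcing the
 * first hwp_set() after online to write EPP again. */
static void hwp_offline(struct cpu_state *c)
{
	c->epp_hw = c->epp_cached;
	c->epp_policy = POLICY_UNKNOWN;
}
```

Without the reset in hwp_offline(), the first hwp_set() after online would
see an unchanged policy, skip the write, and leave the saved user-space EPP
in place, which is the bug described above.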


Revision tags: v5.15.2
# cd23f02f 12-Nov-2021 Adamos Ttofari <attofari@amazon.de>

cpufreq: intel_pstate: Add Ice Lake server to out-of-band IDs

Commit fbdc21e9b038 ("cpufreq: intel_pstate: Add Icelake servers
support in no-HWP mode") enabled the use of the Intel P-State driver
for Ice Lake servers.

But it doesn't cover the case when the OS can't control P-States.

Therefore, on Ice Lake servers, if bit 8 or bit 18 of MSR_MISC_PWR_MGMT
is set, the Intel P-State driver should exit, as the OS can't
control P-States.

Fixes: fbdc21e9b038 ("cpufreq: intel_pstate: Add Icelake servers support in no-HWP mode")
Signed-off-by: Adamos Ttofari <attofari@amazon.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
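
The check described above reduces to a bit test on the raw
MSR_MISC_PWR_MGMT value. The sketch below takes that value as a plain
argument (reading the MSR is out of scope); the macro and function names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits 8 and 18 of MSR_MISC_PWR_MGMT, as named in the commit message. */
#define MISC_PWR_MGMT_OOB_MASK ((UINT64_C(1) << 8) | (UINT64_C(1) << 18))
static_assert(MISC_PWR_MGMT_OOB_MASK == 0x40100, "bits 8 and 18");

/* True when either bit is set, i.e. P-states are controlled out of band
 * and the driver should not take over. */
static bool pstates_out_of_band(uint64_t misc_pwr_mgmt)
{
	return (misc_pwr_mgmt & MISC_PWR_MGMT_OOB_MASK) != 0;
}
```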


Revision tags: v5.15.1
# 074d0cdf 04-Nov-2021 Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

cpufreq: intel_pstate: Clear HWP Status during HWP Interrupt enable

It is possible that some performance excursions happened before the OS
booted or enabled HWP interrupts. So clear the MSR_HWP_STATUS bits when
enabling the HWP interrupt. This way, the next excursion will result in
an HWP interrupt.

The status bits of MSR_HWP_STATUS must be cleared (0) by software so
that a new status condition change will cause the hardware to set the
bit again and issue the notification.

Fixes: 57577c996d73 ("cpufreq: intel_pstate: Process HWP Guaranteed change notification")
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
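
The sticky-bit behavior described above can be modeled in a few lines.
This is a toy model of the rule stated in the commit (hardware only
notifies when it sets a previously clear bit), not the MSR interface; the
struct, functions and the choice of bit 0 are hypothetical and purely for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STATUS_BIT (UINT64_C(1) << 0) /* illustrative bit position only */

struct hwp_model {
	uint64_t status;         /* sticky bits, cleared only by software */
	bool irq_enabled;
	unsigned int interrupts; /* notifications delivered */
};

/* Hardware side: a status change notifies only if the bit was clear. */
static void hw_excursion(struct hwp_model *m, uint64_t bit)
{
	assert(bit != 0);
	bool was_set = (m->status & bit) != 0;

	m->status |= bit;
	if (!was_set && m->irq_enabled)
		m->interrupts++;
}

/* Driver side: optionally clear stale status when enabling the IRQ,
 * analogous to writing 0 to MSR_HWP_STATUS. */
static void enable_hwp_interrupt(struct hwp_model *m, bool clear_status)
{
	m->irq_enabled = true;
	if (clear_status)
		m->status = 0;
}
```

If a pre-boot excursion left the bit set and it is not cleared, later
excursions never produce an interrupt; clearing it at enable time restores
notifications.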
