.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===============================================
``amd-pstate`` CPU Performance Scaling Driver
===============================================

:Copyright: |copy| 2021 Advanced Micro Devices, Inc.

:Author: Huang Rui <ray.huang@amd.com>


Introduction
===================

``amd-pstate`` is the AMD CPU performance scaling driver. It introduces a
new CPU frequency control mechanism for modern AMD APU and CPU series in
the Linux kernel. The new mechanism is based on Collaborative Processor
Performance Control (CPPC), which provides finer-grained frequency
management than legacy ACPI hardware P-States. Current AMD CPU/APU platforms
use the ACPI P-states driver to manage CPU frequency and clocks, switching
among only 3 P-states. CPPC replaces the ACPI P-states controls and provides
a flexible, low-latency interface for the Linux kernel to communicate
performance hints directly to the hardware.

``amd-pstate`` leverages the Linux kernel governors, such as ``schedutil``
and ``ondemand``, to manage the performance hints provided by the CPPC
hardware functionality, which internally follows the hardware
specification (for details, refer to AMD64 Architecture Programmer's Manual
Volume 2: System Programming [1]_). Currently, ``amd-pstate`` supports basic
frequency control according to the kernel governors on some Zen2 and Zen3
processors. More AMD-specific functions will be implemented in the future,
once they have been verified on the hardware and SBIOS.


AMD CPPC Overview
=======================

The Collaborative Processor Performance Control (CPPC) interface enumerates
a continuous, abstract, and unit-less performance value on a scale that is
not tied to a specific performance state or frequency. This is an ACPI
standard [2]_ with which software can specify application performance goals
and hints as a relative target to the infrastructure limits. AMD processors
provide a low-latency register model (MSR) instead of an AML code
interpreter for performance adjustments. ``amd-pstate`` initializes a
``struct cpufreq_driver`` instance, ``amd_pstate_driver``, with the
callbacks that manage each performance update behavior. ::

 Highest Perf ------>+-----------------------+                         +-----------------------+
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |          Max Perf  ---->|                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
 Nominal Perf ------>+-----------------------+                         +-----------------------+
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |      Desired Perf  ---->|                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
  Lowest non-        |                       |                         |                       |
  linear perf ------>+-----------------------+                         +-----------------------+
                     |                       |                         |                       |
                     |                       |       Lowest perf  ---->|                       |
                     |                       |                         |                       |
  Lowest perf ------>+-----------------------+                         +-----------------------+
                     |                       |                         |                       |
                     |                       |                         |                       |
                     |                       |                         |                       |
          0   ------>+-----------------------+                         +-----------------------+

                                     AMD P-States Performance Scale


.. _perf_cap:

AMD CPPC Performance Capability
--------------------------------

Highest Performance (RO)
.........................

It is the absolute maximum performance an individual processor may reach,
assuming ideal conditions. This performance level may not be sustainable
for long durations and may only be achievable if other platform components
are in a specific state; for example, it may require other processors to be
in an idle state. This is equivalent to the highest frequency supported by
the processor.

Nominal (Guaranteed) Performance (RO)
......................................

It is the maximum sustained performance level of the processor, assuming
ideal operating conditions. In the absence of external constraints (power,
thermal, etc.), this is the performance level the processor is expected to
be able to maintain continuously. All cores/processors are expected to be
able to sustain their nominal performance state simultaneously.

Lowest non-linear Performance (RO)
...................................

It is the lowest performance level at which nonlinear power savings are
achieved, for example, due to the combined effects of voltage and frequency
scaling. Above this threshold, lower performance levels should generally be
more energy efficient than higher performance levels. This register
effectively conveys the most efficient performance level to ``amd-pstate``.

Lowest Performance (RO)
........................

It is the absolute lowest performance level of the processor. Selecting a
performance level lower than the lowest nonlinear performance level may
cause an efficiency penalty but should reduce the instantaneous power
consumption of the processor.
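
The four capability levels can be modelled as an ordered set of points on the
abstract performance scale. The following is a minimal sketch, not driver
code: the ``PerfCaps`` class, its method names, and the region labels are
hypothetical, and the ordering check is an assumption drawn from the scale
diagram. The example values are the ones reported in the ``cpupower`` section
of this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfCaps:
    """Read-only CPPC capability levels on the abstract performance scale."""
    lowest: int
    lowest_nonlinear: int
    nominal: int
    highest: int

    def is_ordered(self) -> bool:
        # Assumption taken from the scale diagram:
        # lowest <= lowest non-linear <= nominal <= highest.
        return self.lowest <= self.lowest_nonlinear <= self.nominal <= self.highest

    def region(self, perf: int) -> str:
        """Classify a performance value against the capability levels."""
        if perf < self.lowest or perf > self.highest:
            return "out of range"
        if perf < self.lowest_nonlinear:
            # Below the lowest non-linear level an efficiency penalty may apply.
            return "below lowest non-linear"
        if perf <= self.nominal:
            # Up to nominal, the level is expected to be sustainable.
            return "sustainable range"
        # Above nominal, the level may not be sustainable for long durations.
        return "above nominal"


# Example values from the cpupower output in this document.
caps = PerfCaps(lowest=15, lowest_nonlinear=39, nominal=117, highest=166)
```

With these values, ``caps.region(85)`` falls in the sustainable range, while
``caps.region(166)`` is above nominal and therefore depends on operating
conditions.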

AMD CPPC Performance Control
------------------------------

``amd-pstate`` passes performance goals through these registers. The
registers drive the behavior of the desired performance target.

Minimum requested performance (RW)
...................................

``amd-pstate`` specifies the minimum allowed performance level.

Maximum requested performance (RW)
...................................

``amd-pstate`` specifies a limit on the maximum performance that is expected
to be supplied by the hardware.

Desired performance target (RW)
...................................

``amd-pstate`` specifies a desired target on the CPPC performance scale as
a relative number. This can be expressed as a percentage of nominal
performance (infrastructure max). Below the nominal sustained performance
level, desired performance expresses the average performance level of the
processor, subject to hardware. Above the nominal performance level, the
processor must provide at least the nominal performance requested and go
higher if current operating conditions allow.
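
As an illustration of how the three control values relate, the sketch below
expresses a desired value as a percentage of nominal performance and keeps it
inside the requested minimum and maximum. Both function names are
hypothetical, and clamping the desired value into the ``[min, max]`` window is
an assumption made for the example, not a statement about what the hardware
or the driver does.

```python
def percent_of_nominal(des_perf: int, nominal_perf: int) -> float:
    """Express a desired performance value as a percentage of nominal."""
    if nominal_perf <= 0:
        raise ValueError("nominal_perf must be positive")
    return 100.0 * des_perf / nominal_perf


def clamp_desired(des_perf: int, min_perf: int, max_perf: int) -> int:
    """Keep the desired value inside the requested window (assumed policy)."""
    if min_perf > max_perf:
        raise ValueError("min_perf must not exceed max_perf")
    return max(min_perf, min(des_perf, max_perf))
```

For example, with a nominal performance of 117, a desired value of 85 is
about 72.6% of nominal, and a desired value of 166 is about 141.9%, which is
above the sustained level.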

Energy Performance Preference (EPP) (RW)
.........................................

Provides a hint to the hardware on whether software wants to bias toward
performance (0x0) or energy efficiency (0xff).


Key Governors Support
=======================

``amd-pstate`` can be used with all the (generic) scaling governors listed
by the ``scaling_available_governors`` policy attribute in ``sysfs``. It is
responsible for configuring the policy objects corresponding to CPUs and
for providing the ``CPUFreq`` core (and the scaling governors attached to
the policy objects) with accurate information on the maximum and minimum
operating frequencies supported by the hardware. Users can check the
current frequency through ``scaling_cur_freq``, which is provided by the
``CPUFreq`` core.

``amd-pstate`` mainly supports ``schedutil`` and ``ondemand`` for dynamic
frequency control. It fine-tunes the processor configuration for
``schedutil`` with the CPU CFS scheduler. ``amd-pstate`` registers the
``adjust_perf`` callback to implement performance update behavior similar
to CPPC. The callback is set up by ``sugov_start``, which populates the
CPU's ``update_util_data`` pointer and assigns ``sugov_update_single_perf``
as the utilization update callback in the CPU scheduler. The CPU scheduler
then calls ``cpufreq_update_util`` and assigns the target performance
according to the ``struct sugov_cpu`` that the utilization update belongs
to. Finally, ``amd-pstate`` updates the desired performance according to
the value assigned by the CPU scheduler.
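
The utilization-to-performance step can be pictured with a simplified model.
The sketch below is not the driver's ``adjust_perf`` implementation: the
function and parameter names are hypothetical, and both the proportional
scaling against the CPU capacity and the floor at the lowest non-linear level
are assumptions chosen for illustration.

```python
def div_round_up(numerator: int, denominator: int) -> int:
    """Integer division that rounds up, for non-negative inputs."""
    return -(-numerator // denominator)


def scale_util_to_perf(min_util, target_util, capacity,
                       highest_perf, lowest_nonlinear_perf):
    """Map scheduler utilization hints to (min_perf, des_perf, max_perf).

    Simplified model: utilization is scaled proportionally onto the
    CPPC performance scale, whose top is highest_perf.
    """
    max_perf = highest_perf

    # A hint at or above full capacity maps to the top of the scale.
    des_perf = highest_perf
    if target_util < capacity:
        des_perf = div_round_up(highest_perf * target_util, capacity)

    min_perf = highest_perf
    if min_util < capacity:
        min_perf = div_round_up(highest_perf * min_util, capacity)

    # Assumed policy: never request less than the most efficient level.
    min_perf = max(min_perf, lowest_nonlinear_perf)

    # Keep the desired value inside the [min_perf, max_perf] window.
    des_perf = min(max(des_perf, min_perf), max_perf)
    return min_perf, des_perf, max_perf
```

With a capacity of 1024, a highest performance of 166, and a lowest
non-linear performance of 39, a target utilization of 512 maps to a desired
performance of 83 in this model.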


Processor Support
=======================

The ``amd-pstate`` initialization fails if the _CPC entry in the ACPI SBIOS
does not exist for the detected processor; it uses ``acpi_cpc_valid`` to
check for the existence of _CPC. All Zen-based processors support the
legacy ACPI hardware P-States function, so when ``amd-pstate`` fails to
initialize, the kernel falls back to initializing the ``acpi-cpufreq``
driver.

There are two types of hardware implementations for ``amd-pstate``: one is
`Full MSR Support <perf_cap_>`_ and the other is `Shared Memory Support
<perf_cap_>`_. The :c:macro:`X86_FEATURE_CPPC` feature flag (for details,
refer to Processor Programming Reference (PPR) for AMD Family 19h Model
51h, Revision A1 Processors [3]_) distinguishes the two types.
``amd-pstate`` registers different ``static_call`` instances for the
different hardware implementations.

Currently, some Zen2 and Zen3 processors support ``amd-pstate``. Support
for more AMD processors will be added in the future.

Full MSR Support
-----------------

Some new Zen3 processors, such as Cezanne, provide the MSR registers
directly when the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is set.
``amd-pstate`` can handle the MSR registers to implement the fast switch
function in ``CPUFreq``, which reduces the latency of frequency control in
interrupt context. The functions with the ``pstate_xxx`` prefix represent
the operations on MSR registers.

Shared Memory Support
----------------------

If the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is not set, the
processor supports the shared memory solution. In this case, ``amd-pstate``
uses the ``cppc_acpi`` helper methods to implement the callback functions
that are defined on ``static_call``. The functions with the ``cppc_xxx``
prefix represent the operations of the ACPI CPPC helpers for the shared
memory solution.


AMD P-States and ACPI hardware P-States can always both be supported in one
processor. However, AMD P-States has the higher priority: if it is enabled
with :c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, the processor
responds to the requests from AMD P-States.


User Space Interface in ``sysfs``
==================================

``amd-pstate`` exposes several global attributes (files) in ``sysfs`` to
control its functionality at the system level. They are located in the
``/sys/devices/system/cpu/cpufreq/policyX/`` directory and affect all CPUs. ::

 root@hr-test1:/home/ray# ls /sys/devices/system/cpu/cpufreq/policy0/*amd*
 /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_highest_perf
 /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_lowest_nonlinear_freq
 /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_max_freq


``amd_pstate_highest_perf / amd_pstate_max_freq``

Maximum CPPC performance and CPU frequency that the driver is allowed to
set, in percent of the maximum supported CPPC performance level (the highest
performance supported in `AMD CPPC Performance Capability <perf_cap_>`_).
On some ASICs, the highest CPPC performance is not the one in the _CPC
table, so it needs to be exposed in ``sysfs``. If boost is not active but
is supported, this maximum frequency will be larger than the one in
``cpuinfo``.
This attribute is read-only.

``amd_pstate_lowest_nonlinear_freq``

The lowest non-linear CPPC CPU frequency that the driver is allowed to set,
in percent of the maximum supported CPPC performance level (see the lowest
non-linear performance in `AMD CPPC Performance Capability
<perf_cap_>`_).
This attribute is read-only.

Other performance and frequency values can be read back from
``/sys/devices/system/cpu/cpuX/acpi_cppc/``; see :ref:`cppc_sysfs`.
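
The attributes listed above are plain text files, so a script can collect them
for one policy directory. This is a minimal sketch: the function name and the
``ATTRS`` tuple are hypothetical, and it assumes that each file holds a single
integer. Attributes that are missing or unreadable are reported as ``None``
rather than raising.

```python
from pathlib import Path

# Attribute file names as shown in the directory listing above.
ATTRS = (
    "amd_pstate_highest_perf",
    "amd_pstate_max_freq",
    "amd_pstate_lowest_nonlinear_freq",
)


def read_amd_pstate_attrs(policy_dir):
    """Return {attribute name: int or None} for one cpufreq policy directory."""
    values = {}
    for name in ATTRS:
        path = Path(policy_dir) / name
        try:
            # Assumption: the file contains one integer followed by a newline.
            values[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Missing file, permission problem, or unexpected content.
            values[name] = None
    return values
```

On a system running ``amd-pstate``, the function would be called with a path
such as ``/sys/devices/system/cpu/cpufreq/policy0``.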


``amd-pstate`` vs ``acpi-cpufreq``
======================================

On the majority of AMD platforms supported by ``acpi-cpufreq``, the ACPI
tables provided by the platform firmware are used for CPU performance
scaling, but they only provide 3 P-states on AMD processors.
However, on modern AMD APU and CPU series, the hardware provides
Collaborative Processor Performance Control according to the ACPI protocol,
customized for AMD platforms. This offers a fine-grained, continuous
frequency range instead of the legacy hardware P-states. ``amd-pstate`` is
the kernel module that supports the new AMD P-States mechanism on most
future AMD platforms. The AMD P-States mechanism is the more performance-
and energy-efficient frequency management method on AMD processors.

Kernel Module Options for ``amd-pstate``
=========================================

``shared_mem``
Use a module parameter (``shared_mem``) to enable related processors
manually with **amd_pstate.shared_mem=1**.
Due to a performance issue on the processors with `Shared Memory Support
<perf_cap_>`_, it is disabled for the moment and will be enabled by default
once the performance issue with this solution is addressed.

To check whether the current processor has `Full MSR Support <perf_cap_>`_
or `Shared Memory Support <perf_cap_>`_: ::

  ray@hr-test1:~$ lscpu | grep cppc
  Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm

If the CPU flags include ``cppc``, the processor has `Full MSR Support
<perf_cap_>`_. Otherwise, it has `Shared Memory Support <perf_cap_>`_.
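
The same check can be scripted. The sketch below looks for ``cppc`` as a whole
word on the first flags line of text in ``lscpu`` or ``/proc/cpuinfo``
format. Both function names are hypothetical; the returned strings simply
reuse the two section titles from this document.

```python
def has_cppc_flag(cpu_text: str) -> bool:
    """Return True if the first flags line lists 'cppc' as a separate flag."""
    for line in cpu_text.splitlines():
        key, sep, value = line.partition(":")
        # lscpu prints "Flags:", /proc/cpuinfo prints "flags\t\t:".
        if sep and key.strip().lower() == "flags":
            # Split on whitespace so that e.g. "cppc_foo" does not match.
            return "cppc" in value.split()
    return False


def hardware_type(cpu_text: str) -> str:
    """Name the amd-pstate hardware implementation implied by the flags."""
    if has_cppc_flag(cpu_text):
        return "Full MSR Support"
    return "Shared Memory Support"
```

A caller would typically pass the contents of ``/proc/cpuinfo`` or the output
of ``lscpu`` as ``cpu_text``.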


``cpupower`` tool support for ``amd-pstate``
===============================================

``amd-pstate`` is supported by the ``cpupower`` tool, which can be used to
dump frequency information. Support for more operations on the new
``amd-pstate`` module is in progress for this tool. ::

 root@hr-test1:/home/ray# cpupower frequency-info
 analyzing CPU 0:
   driver: amd-pstate
   CPUs which run at the same hardware frequency: 0
   CPUs which need to have their frequency coordinated by software: 0
   maximum transition latency: 131 us
   hardware limits: 400 MHz - 4.68 GHz
   available cpufreq governors: ondemand conservative powersave userspace performance schedutil
   current policy: frequency should be within 400 MHz and 4.68 GHz.
                   The governor "schedutil" may decide which speed to use
                   within this range.
   current CPU frequency: Unable to call hardware
   current CPU frequency: 4.02 GHz (asserted by call to kernel)
   boost state support:
     Supported: yes
     Active: yes
     AMD PSTATE Highest Performance: 166. Maximum Frequency: 4.68 GHz.
     AMD PSTATE Nominal Performance: 117. Nominal Frequency: 3.30 GHz.
     AMD PSTATE Lowest Non-linear Performance: 39. Lowest Non-linear Frequency: 1.10 GHz.
     AMD PSTATE Lowest Performance: 15. Lowest Frequency: 400 MHz.
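
Most of the performance and frequency pairs in the output above are
consistent with a linear mapping anchored at the nominal point. The sketch
below is an illustration under that assumption only: the function name is
hypothetical, and the document does not state how the driver derives these
frequencies.

```python
def perf_to_freq_mhz(perf: int, nominal_perf: int, nominal_freq_mhz: int) -> int:
    """Scale a CPPC performance value to MHz, assuming a linear relationship
    through the nominal point (an assumption, not a documented formula)."""
    if nominal_perf <= 0:
        raise ValueError("nominal_perf must be positive")
    return perf * nominal_freq_mhz // nominal_perf


# Nominal point from the cpupower output: performance 117 at 3.30 GHz.
NOMINAL_PERF = 117
NOMINAL_FREQ_MHZ = 3300

highest_mhz = perf_to_freq_mhz(166, NOMINAL_PERF, NOMINAL_FREQ_MHZ)
lowest_nonlinear_mhz = perf_to_freq_mhz(39, NOMINAL_PERF, NOMINAL_FREQ_MHZ)
lowest_mhz = perf_to_freq_mhz(15, NOMINAL_PERF, NOMINAL_FREQ_MHZ)
```

Under this model the highest performance (166) maps to 4682 MHz and the lowest
non-linear performance (39) to 1100 MHz, matching the reported 4.68 GHz and
1.10 GHz. The lowest performance (15) maps to 423 MHz rather than the reported
400 MHz, so the lowest frequency evidently does not follow this simple
scaling.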


Diagnostics and Tuning
=======================

Trace Events
--------------

There are two static trace events that can be used for ``amd-pstate``
diagnostics. One of them is the ``cpu_frequency`` trace event generally used
by ``CPUFreq``, and the other one is the ``amd_pstate_perf`` trace event
specific to ``amd-pstate``. The following sequence of shell commands can
be used to enable them and see their output (if the kernel is generally
configured to support event tracing). ::

 root@hr-test1:/home/ray# cd /sys/kernel/tracing/
 root@hr-test1:/sys/kernel/tracing# echo 1 > events/amd_cpu/enable
 root@hr-test1:/sys/kernel/tracing# cat trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 47827/42233061   #P:2
 #
 #                                _-----=> irqs-off
 #                               / _----=> need-resched
 #                              | / _---=> hardirq/softirq
 #                              || / _--=> preempt-depth
 #                              ||| /     delay
 #           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
 #              | |         |   ||||      |         |
          <idle>-0       [015] dN...  4995.979886: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=15 changed=false fast_switch=true
          <idle>-0       [007] d.h..  4995.979893: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=7 changed=false fast_switch=true
             cat-2161    [000] d....  4995.980841: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=0 changed=false fast_switch=true
            sshd-2125    [004] d.s..  4995.980968: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=4 changed=false fast_switch=true
          <idle>-0       [007] d.s..  4995.980968: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=7 changed=false fast_switch=true
          <idle>-0       [003] d.s..  4995.980971: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=3 changed=false fast_switch=true
          <idle>-0       [011] d.s..  4995.980996: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=11 changed=false fast_switch=true

The ``cpu_frequency`` trace event is triggered either by the ``schedutil``
scaling governor (for the policies it is attached to) or by the ``CPUFreq``
core (for the policies with other scaling governors).
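
For offline analysis, the ``amd_pstate_perf`` lines in the trace output above
can be turned into dictionaries. This is a minimal sketch with a hypothetical
function name; it assumes the ``key=value`` layout shown in the sample output
and makes no claim about other kernel versions.

```python
import re

# Matches the key=value pairs that follow the event name.
FIELD_RE = re.compile(r"(\w+)=(\w+)")
EVENT_MARKER = "amd_pstate_perf:"


def parse_amd_pstate_perf(line: str):
    """Parse one trace line; return a dict of fields, or None if the line
    is not an amd_pstate_perf event."""
    if EVENT_MARKER not in line:
        return None
    payload = line.split(EVENT_MARKER, 1)[1]
    fields = {}
    for key, value in FIELD_RE.findall(payload):
        if value in ("true", "false"):
            fields[key] = value == "true"
        else:
            try:
                fields[key] = int(value)
            except ValueError:
                # Keep unexpected values as strings instead of failing.
                fields[key] = value
    return fields
```

Applied to the first event line of the sample, the result contains
``amd_des_perf`` as the integer 85 and ``fast_switch`` as ``True``.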


Reference
===========

.. [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming,
       https://www.amd.com/system/files/TechDocs/24593.pdf

.. [2] Advanced Configuration and Power Interface Specification,
       https://uefi.org/sites/default/files/resources/ACPI_Spec_6_4_Jan22.pdf

.. [3] Processor Programming Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors,
       https://www.amd.com/system/files/TechDocs/56569-A1-PUB.zip