.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

===============================================
``amd-pstate`` CPU Performance Scaling Driver
===============================================

:Copyright: |copy| 2021 Advanced Micro Devices, Inc.

:Author: Huang Rui <ray.huang@amd.com>


Introduction
===================

``amd-pstate`` is the AMD CPU performance scaling driver that introduces a
new CPU frequency control mechanism on modern AMD APU and CPU series in the
Linux kernel. The new mechanism is based on Collaborative Processor
Performance Control (CPPC), which provides finer-grained frequency management
than legacy ACPI hardware P-States. Current AMD CPU/APU platforms use
the ACPI P-states driver to manage CPU frequency and clocks, switching
among only 3 P-states. CPPC replaces the ACPI P-states controls and provides a
flexible, low-latency interface for the Linux kernel to directly
communicate performance hints to hardware.

``amd-pstate`` leverages the Linux kernel governors such as ``schedutil``,
``ondemand``, etc. to manage the performance hints which are provided by
CPPC hardware functionality that internally follows the hardware
specification (for details refer to AMD64 Architecture Programmer's Manual
Volume 2: System Programming [1]_). Currently, ``amd-pstate`` supports basic
frequency control function according to kernel governors on some of the
Zen2 and Zen3 processors, and we will implement more AMD specific functions
in future after we verify them on the hardware and SBIOS.


AMD CPPC Overview
=======================

Collaborative Processor Performance Control (CPPC) interface enumerates a
continuous, abstract, and unit-less performance value in a scale that is
not tied to a specific performance state / frequency.
This is an ACPI
standard [2]_ with which software can specify application performance goals and
hints as a relative target to the infrastructure limits. AMD processors
provide the low latency register model (MSR) instead of an AML code
interpreter for performance adjustments. ``amd-pstate`` will initialize a
``struct cpufreq_driver`` instance, ``amd_pstate_driver``, with the callbacks
to manage each performance update behavior. ::

  Highest Perf ------>+-----------------------+                         +-----------------------+
                      |                       |                         |                       |
                      |                       |                         |                       |
                      |                       |          Max Perf  ---->|                       |
                      |                       |                         |                       |
                      |                       |                         |                       |
  Nominal Perf ------>+-----------------------+                         +-----------------------+
                      |                       |                         |                       |
                      |                       |                         |                       |
                      |                       |                         |                       |
                      |                       |                         |                       |
                      |                       |                         |                       |
                      |                       |                         |                       |
                      |                       |      Desired Perf  ---->|                       |
                      |                       |                         |                       |
                      |                       |                         |                       |
                      |                       |                         |                       |
                      |                       |                         |                       |
                      |                       |                         |                       |
                      |                       |                         |                       |
                      |                       |                         |                       |
                      |                       |                         |                       |
                      |                       |                         |                       |
   Lowest non-        |                       |                         |                       |
   linear perf ------>+-----------------------+                         +-----------------------+
                      |                       |                         |                       |
                      |                       |       Lowest perf  ---->|                       |
                      |                       |                         |                       |
   Lowest perf ------>+-----------------------+                         +-----------------------+
                      |                       |                         |                       |
                      |                       |                         |                       |
                      |                       |                         |                       |
           0   ------>+-----------------------+                         +-----------------------+

                                      AMD P-States Performance Scale


.. _perf_cap:

AMD CPPC Performance Capability
--------------------------------

Highest Performance (RO)
.........................

This is the absolute maximum performance an individual processor may reach,
assuming ideal conditions. This performance level may not be sustainable
for long durations and may only be achievable if other platform components
are in a specific state; for example, it may require other processors to be in
an idle state. This would be equivalent to the highest frequencies
supported by the processor.

Nominal (Guaranteed) Performance (RO)
......................................

This is the maximum sustained performance level of the processor, assuming
ideal operating conditions.
In the absence of an external constraint (power,
thermal, etc.), this is the performance level the processor is expected to
be able to maintain continuously. All cores/processors are expected to be
able to sustain their nominal performance state simultaneously.

Lowest non-linear Performance (RO)
...................................

This is the lowest performance level at which nonlinear power savings are
achieved, for example, due to the combined effects of voltage and frequency
scaling. Above this threshold, lower performance levels should generally be
more energy efficient than higher performance levels. This register
effectively conveys the most efficient performance level to ``amd-pstate``.

Lowest Performance (RO)
........................

This is the absolute lowest performance level of the processor. Selecting a
performance level lower than the lowest nonlinear performance level may
cause an efficiency penalty but should reduce the instantaneous power
consumption of the processor.

AMD CPPC Performance Control
------------------------------

``amd-pstate`` passes performance goals through these registers. The
registers drive the behavior of the desired performance target.

Minimum requested performance (RW)
...................................

``amd-pstate`` specifies the minimum allowed performance level.

Maximum requested performance (RW)
...................................

``amd-pstate`` specifies a limit on the maximum performance that is expected
to be supplied by the hardware.

Desired performance target (RW)
...................................

``amd-pstate`` specifies a desired target in the CPPC performance scale as
a relative number. This can be expressed as a percentage of nominal
performance (infrastructure max).
Below the nominal sustained performance
level, desired performance expresses the average performance level of the
processor subject to hardware. Above the nominal performance level,
the processor must provide at least the nominal performance requested and go
higher if current operating conditions allow.

Energy Performance Preference (EPP) (RW)
.........................................

This attribute provides a hint to the hardware on whether software wants to
bias toward performance (0x0) or energy efficiency (0xff).


Key Governors Support
=======================

``amd-pstate`` can be used with all the (generic) scaling governors listed
by the ``scaling_available_governors`` policy attribute in ``sysfs``. Then,
it is responsible for the configuration of policy objects corresponding to
CPUs and provides the ``CPUFreq`` core (and the scaling governors attached
to the policy objects) with accurate information on the maximum and minimum
operating frequencies supported by the hardware. Users can check the
``scaling_cur_freq`` information, which comes from the ``CPUFreq`` core.

``amd-pstate`` mainly supports ``schedutil`` and ``ondemand`` for dynamic
frequency control. The processor configuration in ``amd-pstate`` is fine-tuned
for ``schedutil`` with the CPU CFS scheduler. ``amd-pstate``
registers the ``adjust_perf`` callback to implement performance update behavior
similar to CPPC. It is initialized by ``sugov_start`` and then populates the
CPU's ``update_util_data`` pointer to assign ``sugov_update_single_perf`` as the
utilization update callback function in the CPU scheduler. The CPU scheduler
calls ``cpufreq_update_util`` and assigns the target performance according
to the ``struct sugov_cpu`` that the utilization update belongs to.
Then, ``amd-pstate`` updates the desired performance according to the target
assigned by the CPU scheduler.

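
As a rough illustration of how a utilization update can turn into a desired
performance value, the following sketch scales a utilization sample onto the
CPPC performance scale and clamps it to the requested minimum and maximum.
The function name ``desired_perf``, the proportional mapping, and the capacity
value of 1024 are assumptions made for this example; it is not the driver's
implementation. The sample limits (39 and 166) are the lowest non-linear and
highest performance values from the ``cpupower frequency-info`` output shown
in this document.

.. code-block:: python

   def desired_perf(util, capacity, min_perf, max_perf):
       """Map a utilization sample onto the CPPC performance scale.

       Hypothetical helper: assumes the target is proportional to
       util / capacity and is clamped to [min_perf, max_perf].
       """
       if capacity <= 0:
           raise ValueError("capacity must be positive")
       if min_perf > max_perf:
           raise ValueError("min_perf must not exceed max_perf")
       target = (max_perf * util) // capacity
       # Clamp so the request never leaves the allowed window.
       return max(min_perf, min(max_perf, target))


   # Assumed example inputs: capacity 1024, limits 39..166.
   for util in (0, 512, 1024):
       print(util, desired_perf(util, 1024, 39, 166))

A low utilization sample is raised to the minimum, and a sample at or above
the full capacity is capped at the maximum.
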

Processor Support
=======================

The ``amd-pstate`` initialization will fail if the ``_CPC`` entry in the ACPI
SBIOS does not exist for the detected processor. It uses ``acpi_cpc_valid``
to check the existence of ``_CPC``. All Zen based processors support the legacy
ACPI hardware P-States function, so when ``amd-pstate`` fails initialization,
the kernel will fall back to initialize the ``acpi-cpufreq`` driver.

There are two types of hardware implementations for ``amd-pstate``: one is
`Full MSR Support <perf_cap_>`_ and another is `Shared Memory Support
<perf_cap_>`_. The :c:macro:`X86_FEATURE_CPPC` feature flag
indicates which type is present. (For details, refer to the Processor Programming
Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors [3]_.)
``amd-pstate`` registers different ``static_call`` instances for different
hardware implementations.

Currently, some of the Zen2 and Zen3 processors support ``amd-pstate``. In the
future, it will be supported on more and more AMD processors.

Full MSR Support
-----------------

Some new Zen3 processors such as Cezanne provide the MSR registers directly
while the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is set.
``amd-pstate`` can handle the MSR register to implement the fast switch
function in ``CPUFreq`` that can reduce the latency of frequency control in
interrupt context. The functions with a ``pstate_xxx`` prefix represent the
operations on MSR registers.

Shared Memory Support
----------------------

If the :c:macro:`X86_FEATURE_CPPC` CPU feature flag is not set, the
processor supports the shared memory solution. In this case, ``amd-pstate``
uses the ``cppc_acpi`` helper methods to implement the callback functions
that are defined on ``static_call``.
The functions with the ``cppc_xxx`` prefix
represent the operations of ACPI CPPC helpers for the shared memory solution.


AMD P-States and ACPI hardware P-States can always both be supported in one
processor. However, AMD P-States has the higher priority: if it is enabled
with :c:macro:`MSR_AMD_CPPC_ENABLE` or ``cppc_set_enable``, the processor will
respond to the requests from AMD P-States.


User Space Interface in ``sysfs``
==================================

``amd-pstate`` exposes several global attributes (files) in ``sysfs`` to
control its functionality at the system level. They are located in the
``/sys/devices/system/cpu/cpufreq/policyX/`` directory and affect all CPUs. ::

 root@hr-test1:/home/ray# ls /sys/devices/system/cpu/cpufreq/policy0/*amd*
 /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_highest_perf
 /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_lowest_nonlinear_freq
 /sys/devices/system/cpu/cpufreq/policy0/amd_pstate_max_freq


``amd_pstate_highest_perf / amd_pstate_max_freq``

Maximum CPPC performance and CPU frequency that the driver is allowed to
set, in percent of the maximum supported CPPC performance level (the highest
performance supported in `AMD CPPC Performance Capability <perf_cap_>`_).
In some ASICs, the highest CPPC performance is not the one in the ``_CPC``
table, so we need to expose it to sysfs. If boost is not active, but
still supported, this maximum frequency will be larger than the one in
``cpuinfo``.
This attribute is read-only.

``amd_pstate_lowest_nonlinear_freq``

The lowest non-linear CPPC CPU frequency that the driver is allowed to set,
in percent of the maximum supported CPPC performance level. (Please see the
lowest non-linear performance in `AMD CPPC Performance Capability
<perf_cap_>`_.)
This attribute is read-only.

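
The attributes above can also be collected from a script. The following is a
minimal sketch, assuming that each ``amd_pstate_*`` file holds a single
integer; the helper name ``read_amd_pstate_attrs`` is made up for this
example, and ``policy0`` is just the policy directory used in the listing
above.

.. code-block:: python

   from pathlib import Path


   def read_amd_pstate_attrs(policy_dir):
       """Return {attribute name: integer value} for amd_pstate_* files.

       Hypothetical helper: assumes every attribute is a single integer.
       """
       attrs = {}
       for path in sorted(Path(policy_dir).glob("amd_pstate_*")):
           try:
               attrs[path.name] = int(path.read_text().strip())
           except (OSError, ValueError):
               # Skip attributes that are unreadable or not plain integers.
               continue
       return attrs


   if __name__ == "__main__":
       policy = "/sys/devices/system/cpu/cpufreq/policy0"
       for name, value in read_amd_pstate_attrs(policy).items():
           print(f"{name}: {value}")

On a system where the driver is not loaded, the directory contains no
matching files and the helper returns an empty dictionary.
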

Other performance and frequency values can be read back from
``/sys/devices/system/cpu/cpuX/acpi_cppc/``, see :ref:`cppc_sysfs`.


``amd-pstate`` vs ``acpi-cpufreq``
======================================

On the majority of AMD platforms supported by ``acpi-cpufreq``, the ACPI tables
provided by the platform firmware are used for CPU performance scaling, but
only provide 3 P-states on AMD processors.
However, on modern AMD APU and CPU series, hardware provides the Collaborative
Processor Performance Control according to the ACPI protocol and customizes this
for AMD platforms. That is, it provides fine-grained and continuous frequency
ranges instead of the legacy hardware P-states. ``amd-pstate`` is the kernel
module which supports the new AMD P-States mechanism on most of the future AMD
platforms. The AMD P-States mechanism is the more performance- and
energy-efficient frequency management method on AMD processors.

Kernel Module Options for ``amd-pstate``
=========================================

``shared_mem``
Use a module param (shared_mem) to enable related processors manually with
**amd_pstate.shared_mem=1**.
Due to a performance issue on the processors with `Shared Memory Support
<perf_cap_>`_, we disable it presently and will re-enable this by default
once we address the performance issue with this solution.


To check whether the current processor is using `Full MSR Support <perf_cap_>`_
or `Shared Memory Support <perf_cap_>`_ : ::

  ray@hr-test1:~$ lscpu | grep cppc
  Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm

If the CPU flags have ``cppc``, then this processor supports `Full MSR Support
<perf_cap_>`_. Otherwise, it supports `Shared Memory Support <perf_cap_>`_.


``cpupower`` tool support for ``amd-pstate``
===============================================

``amd-pstate`` is supported by the ``cpupower`` tool, which can be used to dump
frequency information. Development is in progress to support more and more
operations for the new ``amd-pstate`` module with this tool.
::

 root@hr-test1:/home/ray# cpupower frequency-info
 analyzing CPU 0:
   driver: amd-pstate
   CPUs which run at the same hardware frequency: 0
   CPUs which need to have their frequency coordinated by software: 0
   maximum transition latency: 131 us
   hardware limits: 400 MHz - 4.68 GHz
   available cpufreq governors: ondemand conservative powersave userspace performance schedutil
   current policy: frequency should be within 400 MHz and 4.68 GHz.
                   The governor "schedutil" may decide which speed to use
                   within this range.
   current CPU frequency: Unable to call hardware
   current CPU frequency: 4.02 GHz (asserted by call to kernel)
   boost state support:
     Supported: yes
     Active: yes
     AMD PSTATE Highest Performance: 166. Maximum Frequency: 4.68 GHz.
     AMD PSTATE Nominal Performance: 117. Nominal Frequency: 3.30 GHz.
     AMD PSTATE Lowest Non-linear Performance: 39. Lowest Non-linear Frequency: 1.10 GHz.
     AMD PSTATE Lowest Performance: 15. Lowest Frequency: 400 MHz.


Diagnostics and Tuning
=======================

Trace Events
--------------

There are two static trace events that can be used for ``amd-pstate``
diagnostics. One of them is the ``cpu_frequency`` trace event generally used
by ``CPUFreq``, and the other one is the ``amd_pstate_perf`` trace event
specific to ``amd-pstate``. The following sequence of shell commands can
be used to enable them and see their output (if the kernel is
configured to support event tracing).
::

 root@hr-test1:/home/ray# cd /sys/kernel/tracing/
 root@hr-test1:/sys/kernel/tracing# echo 1 > events/amd_cpu/enable
 root@hr-test1:/sys/kernel/tracing# cat trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 47827/42233061   #P:2
 #
 #                                _-----=> irqs-off
 #                               / _----=> need-resched
 #                              | / _---=> hardirq/softirq
 #                              || / _--=> preempt-depth
 #                              ||| /     delay
 #           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
 #              | |         |   ||||      |         |
          <idle>-0       [015] dN...  4995.979886: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=15 changed=false fast_switch=true
          <idle>-0       [007] d.h..  4995.979893: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=7 changed=false fast_switch=true
             cat-2161    [000] d....  4995.980841: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=0 changed=false fast_switch=true
            sshd-2125    [004] d.s..  4995.980968: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=4 changed=false fast_switch=true
          <idle>-0       [007] d.s..  4995.980968: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=7 changed=false fast_switch=true
          <idle>-0       [003] d.s..  4995.980971: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=3 changed=false fast_switch=true
          <idle>-0       [011] d.s..  4995.980996: amd_pstate_perf: amd_min_perf=85 amd_des_perf=85 amd_max_perf=166 cpu_id=11 changed=false fast_switch=true

The ``cpu_frequency`` trace event will be triggered either by the ``schedutil`` scaling
governor (for the policies it is attached to), or by the ``CPUFreq`` core (for the
policies with other scaling governors).


Tracer Tool
-------------

``amd_pstate_trace.py`` can record and parse the ``amd-pstate`` trace log, then
generate performance plots. This utility can be used to debug and tune the
performance of the ``amd-pstate`` driver.
The tracer tool needs to import the intel
pstate tracer.

The tracer tool is located in ``linux/tools/power/x86/amd_pstate_tracer``. It
can be used in two ways. If a trace file is available, then directly parse the
file with the command ::

 ./amd_pstate_trace.py [-c cpus] -t <trace_file> -n <test_name>

Or generate a trace file with root privilege, then parse and plot with the
command ::

 sudo ./amd_pstate_trace.py [-c cpus] -n <test_name> -i <interval> [-m kbytes]

The test result can be found in ``results/test_name``. The following is an
example of part of the output. ::

 common_cpu  common_secs  common_usecs  min_perf  des_perf  max_perf  freq    mperf   apef    tsc       load   duration_ms  sample_num  elapsed_time  common_comm
 CPU_005     712          116384        39        49        166       0.7565  9645075 2214891 38431470  25.1   11.646       469         2.496         kworker/5:0-40
 CPU_006     712          116408        39        49        166       0.6769  8950227 1839034 37192089  24.06  11.272       470         2.496         kworker/6:0-1264


Reference
===========

.. [1] AMD64 Architecture Programmer's Manual Volume 2: System Programming,
       https://www.amd.com/system/files/TechDocs/24593.pdf

.. [2] Advanced Configuration and Power Interface Specification,
       https://uefi.org/sites/default/files/resources/ACPI_Spec_6_4_Jan22.pdf

.. [3] Processor Programming Reference (PPR) for AMD Family 19h Model 51h, Revision A1 Processors
       https://www.amd.com/system/files/TechDocs/56569-A1-PUB.zip