================
RAPL MSR support
================

The RAPL interface (Running Average Power Limit) advertises the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).

The consumption is reported via MSRs (model specific registers) like
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64-bit
registers that represent the accumulated energy consumption in micro-joules.

Thanks to the MSR Filtering patch [#a]_, not all MSRs are handled by KVM. Some
of them can now be handled by userspace (QEMU). This relies on a mechanism
called "MSR filtering": a list of MSRs is given to KVM when the VM is
initialized, so that a callback is put in place for them. The design of this
feature uses only this mechanism for handling the MSRs between guest and host.

At the moment the following MSRs are involved:

.. code:: C

    #define MSR_RAPL_POWER_UNIT             0x00000606
    #define MSR_PKG_POWER_LIMIT             0x00000610
    #define MSR_PKG_ENERGY_STATUS           0x00000611
    #define MSR_PKG_POWER_INFO              0x00000614

The ``*_POWER_UNIT``, ``*_POWER_LIMIT`` and ``*_POWER_INFO`` MSRs are part of
the RAPL spec. They specify the power limit of the package, provide the range
of its parameters (min power, max power, ...) and give the multiplier for the
energy counter that is needed to calculate the power. Those MSRs are populated
once at the beginning by reading the host CPU MSRs and are given back to the
guest 1:1 when requested.

MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount of
energy consumed since the last time the register was cleared. Multiplying
it by the energy unit provided above gives the energy in micro-joules. This
counter only ever increases, and it increases faster or slower depending on
the power consumption of the package. The counter is expected to overflow
(wrap around) at some point.
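
The unit conversion and the overflow handling can be sketched as follows. This
is a minimal illustration, not the QEMU implementation: the helper names are
hypothetical, and the bit layout (energy status units in bits 12:8 of
MSR_RAPL_POWER_UNIT, a 32-bit wide counter in MSR_PKG_ENERGY_STATUS) is taken
from the Intel SDM rather than from this document.

.. code:: C

    #include <stdint.h>

    /* Assumed layout: Energy Status Units live in bits 12:8 of
     * MSR_RAPL_POWER_UNIT; one raw count is 1 / 2^shift joules. */
    static inline unsigned rapl_energy_unit_shift(uint64_t power_unit_msr)
    {
        return (unsigned)((power_unit_msr >> 8) & 0x1f);
    }

    /* Delta between two raw samples. The subtraction is done modulo 2^32,
     * so a single counter wrap between the samples is tolerated. */
    static inline uint32_t rapl_raw_delta(uint64_t before, uint64_t after)
    {
        return (uint32_t)after - (uint32_t)before;
    }

    /* Convert raw counts to micro-joules using the decoded unit shift. */
    static inline uint64_t rapl_raw_to_uj(uint32_t raw, unsigned shift)
    {
        return ((uint64_t)raw * 1000000ULL) >> shift;
    }

With a unit shift of 14, one raw count is roughly 61 micro-joules, so the
counter has to be sampled often enough that it wraps at most once between two
reads.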

Each core belonging to the same package that reads MSR_PKG_ENERGY_STATUS (i.e.
"rdmsr 0x611") retrieves the same value, which represents the energy for the
whole package. A core that belongs to PKG-0 cannot read the value of PKG-1,
and vice versa.

High level implementation
-------------------------

In order to update the value of the virtual MSR, a QEMU thread is created.
The thread is basically just an infinite loop that does the following:

1. Snapshot the time metrics of all QEMU threads (time spent scheduled in
   userspace and in system mode).

2. Snapshot the actual MSR_PKG_ENERGY_STATUS counter of all packages on which
   the QEMU threads are running.

3. Sleep for 1 second. During this pause the vCPU and other non-vCPU threads
   do what they have to do, and so the energy counter increases.

4. Take both snapshots again (steps 1. and 2.) and calculate the delta of
   every metric representing the time spent scheduled for each QEMU thread
   *and* the energy spent by the packages during the pause.

5. Separate the vCPU threads from the non-vCPU threads.

6. Retrieve the topology of the virtual machine. This helps identify which
   vCPU is running on which virtual package.

7. Divide the total energy spent by the non-vCPU threads by the number
   of vCPU threads, so that each vCPU thread gets an equal part of the
   energy spent by the QEMU workers.

8. Calculate the ratio of energy spent per vCPU thread.

9. Calculate the energy for each virtual package.

10. Update the virtual MSRs for each virtual package. Every vCPU that
    belongs to the same package returns the same value when accessing
    the MSR.

11. Loop back to 1.
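
Steps 7 and 9 can be sketched as below, assuming the energy attributed to each
vCPU thread over the interval has already been computed. The function and
parameter names are hypothetical and the code is only an illustration of the
accounting, not the QEMU implementation.

.. code:: C

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Hypothetical helper:
     *   vcpu_energy[i]  : energy attributed to vCPU thread i in the interval
     *   vcpu_pkg[i]     : virtual package of vCPU i (0 .. n_pkgs - 1)
     *   non_vcpu_energy : total energy attributed to the non-vCPU threads
     *   pkg_energy[p]   : output, energy accounted to virtual package p
     */
    static void vpkg_energy(const uint64_t *vcpu_energy,
                            const unsigned *vcpu_pkg, size_t n_vcpus,
                            uint64_t non_vcpu_energy,
                            uint64_t *pkg_energy, size_t n_pkgs)
    {
        /* Step 7: every vCPU thread gets an equal part of the worker energy */
        uint64_t worker_share = n_vcpus ? non_vcpu_energy / n_vcpus : 0;

        for (size_t p = 0; p < n_pkgs; p++) {
            pkg_energy[p] = 0;
        }
        /* Step 9: sum the per-vCPU energy inside each virtual package */
        for (size_t i = 0; i < n_vcpus; i++) {
            if (vcpu_pkg[i] < n_pkgs) {
                pkg_energy[vcpu_pkg[i]] += vcpu_energy[i] + worker_share;
            }
        }
    }

The resulting per-package value is what would be added to the virtual
MSR_PKG_ENERGY_STATUS of that package in step 10.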

Ratio calculation
-----------------

In Linux, a process has an execution time associated with it. The scheduler
divides time into clock ticks. The number of clock ticks per second can be
obtained with the sysconf() function (``_SC_CLK_TCK``). A typical value is
100 clock ticks per second, so a core can run a process for at most 100 ticks
per second. If a package has 4 cores, at most 400 ticks can be scheduled on
all the cores of the package during a period of 1 second.

``/proc/[pid]/stat`` [#b]_ is a procfs file that gives the execution time of
the process whose process ID is [pid]. It reports the number of ticks the
process has been scheduled in userspace (utime) and in kernel space (stime).
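
As an illustration, utime and stime can be extracted from the content of a
stat file as shown below. This is a sketch with a hypothetical helper name;
the field positions (14 and 15) are those documented in proc(5).

.. code:: C

    #include <stdio.h>
    #include <string.h>

    /*
     * Extract utime and stime (fields 14 and 15) from the content of a stat
     * file. Returns 0 on success and -1 on malformed input.
     */
    static int parse_stat_ticks(const char *stat_line,
                                unsigned long *utime, unsigned long *stime)
    {
        /* comm (field 2) may contain spaces or ')': skip to the last ')' */
        const char *p = strrchr(stat_line, ')');

        if (!p) {
            return -1;
        }
        /* state (3), then fields 4..13 are skipped, then utime and stime */
        if (sscanf(p + 1,
                   " %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
                   utime, stime) != 2) {
            return -1;
        }
        return 0;
    }

Sampling the file twice and subtracting the values gives the number of ticks
the process was scheduled for during the interval.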

By reading those metrics for a thread, one can calculate the ratio of time the
package has spent executing the thread.

Example:

A 4-core package can schedule a maximum of 400 ticks per second, with 100 ticks
per second per core. If a thread was scheduled for 100 ticks during one second
on this package, the thread has used 1/4 of the scheduling capacity of the
whole package. The energy attributed to the thread on this package during
that second is therefore 1/4 of the total energy spent by the package.
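
The same arithmetic can be written as a small function. This is only a sketch
under the assumptions of the example above (a 1 second sampling interval);
the function and parameter names are illustrative.

.. code:: C

    #include <stdint.h>

    /*
     * Share of the package energy attributed to one thread over a 1 second
     * interval.
     *   thread_ticks  : utime + stime delta of the thread
     *   ticks_per_sec : clock ticks per second, e.g. sysconf(_SC_CLK_TCK)
     *   n_cores       : number of cores in the package
     *   pkg_energy    : energy delta of the package over the interval
     */
    static uint64_t thread_energy_share(uint64_t thread_ticks,
                                        uint64_t ticks_per_sec,
                                        unsigned n_cores,
                                        uint64_t pkg_energy)
    {
        uint64_t max_ticks = ticks_per_sec * n_cores;

        if (max_ticks == 0) {
            return 0;
        }
        if (thread_ticks > max_ticks) {
            thread_ticks = max_ticks; /* clamp against sampling jitter */
        }
        /* Multiply first to keep integer precision; fine for per-second
         * deltas, which are far from overflowing 64 bits. */
        return pkg_energy * thread_ticks / max_ticks;
    }

For the example above, ``thread_energy_share(100, 100, 4, E)`` evaluates to
``E / 4``.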

Usage
-----

Currently this feature only works on an Intel CPU for which the RAPL driver is
loaded and available in sysfs. If not, QEMU fails at start-up.

This feature is activated with
``-accel kvm,rapl=true,rapl-helper-socket=/path/sock.sock``.

It is important that the socket path is the same as the one
:program:`qemu-vmsr-helper` is listening on.

qemu-vmsr-helper
----------------

qemu-vmsr-helper works very much like qemu-pr-helper. Instead of making
persistent reservations, qemu-vmsr-helper is there to work around the fix for
CVE-2020-8694, which removed user access to the RAPL MSR attributes.

A socket connection is established between each QEMU process that has RAPL
MSR support activated and qemu-vmsr-helper. A systemd service and a socket
activation unit are provided in
contrib/systemd/qemu-vmsr-helper.(service/socket).

The systemd socket uses mode 600, like contrib/systemd/qemu-pr-helper.socket.
The socket can be passed via SCM_RIGHTS by libvirt, or its permissions can be
changed (e.g. 660 and root:kvm on a Debian system). Libvirt could also start
a separate helper if needed. All in all, the policy is left to the user.

See the qemu-pr-helper documentation or manpage for further details.

Current Limitations
-------------------

- Works only on Intel host CPUs because AMD CPUs use different MSR
  addresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the
  moment.

References
----------

.. [#a] https://patchwork.kernel.org/project/kvm/patch/20200916202951.23760-7-graf@amazon.com/
.. [#b] https://man7.org/linux/man-pages/man5/proc.5.html