================
RAPL MSR support
================

The RAPL (Running Average Power Limit) interface advertises the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).

The consumption is reported via MSRs (model-specific registers) such as
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64-bit
registers that represent the accumulated energy consumption in micro-joules.

Thanks to the MSR filtering patch [#a]_, not all MSRs are handled by KVM. Some
of them can now be handled by userspace (QEMU). With this mechanism, called
"MSR filtering", a list of MSRs is given to KVM when the VM is initialized so
that a callback is put in place for them. The design described here uses only
this mechanism to handle the MSRs between guest and host.

At the moment the following MSRs are involved:

.. code:: C

    #define MSR_RAPL_POWER_UNIT             0x00000606
    #define MSR_PKG_POWER_LIMIT             0x00000610
    #define MSR_PKG_ENERGY_STATUS           0x00000611
    #define MSR_PKG_POWER_INFO              0x00000614

The ``*_POWER_UNIT``, ``*_POWER_LIMIT`` and ``*_POWER_INFO`` MSRs are part of
the RAPL spec. They specify the power limit of the package, provide the range
of its parameters (min power, max power, ...) and give the multiplier that is
applied to the energy counter to calculate the power. Those MSRs are populated
once at the beginning by reading the host CPU MSRs and are given back to the
guest 1:1 when requested.

MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount of energy
consumed since the last time the register was cleared. Multiplying it by the
unit provided above gives the energy in micro-joules. This counter only ever
increases, faster or slower depending on the consumption of the package, and
it is expected to overflow (wrap around) at some point.
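
The unit conversion and the overflow handling described above can be sketched
as follows. This is only an illustration under stated assumptions: the bit
layout (energy status unit in bits 12:8 of MSR_RAPL_POWER_UNIT, energy counter
in the low 32 bits of MSR_PKG_ENERGY_STATUS) is taken from the Intel SDM rather
than from this document, and the function names are hypothetical, not QEMU's.

.. code:: C

    #include <stdint.h>

    /*
     * Assumption (Intel SDM layout): bits 12:8 of MSR_RAPL_POWER_UNIT hold
     * the Energy Status Unit (ESU); one counter increment is worth
     * 1 / 2^ESU joules. Returns the value of one increment in micro-joules.
     */
    double rapl_energy_unit_uj(uint64_t power_unit_msr)
    {
        unsigned esu = (unsigned)((power_unit_msr >> 8) & 0x1f);

        return 1e6 / (double)(1ULL << esu);
    }

    /*
     * Assumption: only the low 32 bits of MSR_PKG_ENERGY_STATUS count
     * energy. Unsigned 32-bit subtraction yields the correct delta even
     * if the counter wrapped around once between the two reads.
     */
    uint32_t rapl_counter_delta(uint64_t before, uint64_t after)
    {
        return (uint32_t)((uint32_t)after - (uint32_t)before);
    }

    /* Energy consumed between two raw counter reads, in micro-joules. */
    double rapl_delta_uj(uint64_t before, uint64_t after,
                         uint64_t power_unit_msr)
    {
        return (double)rapl_counter_delta(before, after) *
               rapl_energy_unit_uj(power_unit_msr);
    }

With an ESU of 16, one increment is worth about 15.3 micro-joules. Working on
deltas rather than absolute values keeps the result meaningful when the host
counter wraps, as long as it wraps at most once between two reads.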

Every core of a package that reads MSR_PKG_ENERGY_STATUS (i.e.
``rdmsr 0x611``) retrieves the same value, which represents the energy of the
whole package. A core that belongs to PKG-0 cannot read the value of PKG-1,
and vice versa.

High level implementation
-------------------------

In order to update the value of the virtual MSR, a QEMU thread is created.
The thread is basically just an infinite loop that does the following:

1. Snapshot the time metrics of all QEMU threads (time spent scheduled in
   user and system mode).

2. Snapshot the actual MSR_PKG_ENERGY_STATUS counter of all packages on which
   the QEMU threads are running.

3. Sleep for 1 second. During this pause the vCPU and other non-vCPU threads
   do what they have to do, and so the energy counter increases.

4. Repeat 1. and 2. and calculate the delta of every metric representing the
   time spent scheduled for each QEMU thread *and* the energy spent by the
   packages during the pause.

5. Separate the vCPU threads from the non-vCPU threads.

6. Retrieve the topology of the virtual machine. This helps identify which
   vCPU is running on which virtual package.

7. The total energy spent by the non-vCPU threads is divided by the number
   of vCPU threads so that each vCPU thread gets an equal share of the
   energy spent by the QEMU workers.

8. Calculate the ratio of energy spent per vCPU thread.

9. Calculate the energy for each virtual package.

10. The virtual MSRs are updated for each virtual package. Each vCPU that
    belongs to the same package will return the same value when accessing
    the MSR.

11. Loop back to 1.
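
Steps 7 and 9 above can be sketched as follows. This is a minimal
illustration, not QEMU's actual code: the ``struct vcpu_sample`` type, its
fields and the function name are hypothetical, and the per-vCPU energy of
step 8 is assumed to have been computed already.

.. code:: C

    #include <stddef.h>

    /* Hypothetical per-vCPU record; the names are illustrative only. */
    struct vcpu_sample {
        unsigned vpkg;      /* virtual package of this vCPU (step 6) */
        double energy_uj;   /* energy attributed to the vCPU thread (step 8) */
    };

    /*
     * Step 7: split the energy of the non-vCPU threads evenly between the
     * vCPU threads. Step 9: add each vCPU's total to its virtual package.
     * vpkg_energy_uj must hold nr_vpkgs entries; vCPUs that point to an
     * out-of-range package are skipped.
     */
    void accumulate_vpkg_energy(const struct vcpu_sample *vcpus,
                                size_t nr_vcpus,
                                double non_vcpu_energy_uj,
                                double *vpkg_energy_uj, size_t nr_vpkgs)
    {
        if (nr_vcpus == 0) {
            return;
        }

        double share = non_vcpu_energy_uj / (double)nr_vcpus;

        for (size_t i = 0; i < nr_vcpus; i++) {
            if (vcpus[i].vpkg < nr_vpkgs) {
                vpkg_energy_uj[vcpus[i].vpkg] += vcpus[i].energy_uj + share;
            }
        }
    }

The per-package totals accumulate across iterations of the loop, which matches
the behaviour of an ever-increasing energy counter.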

Ratio calculation
-----------------

In Linux, a process has an execution time associated with it. The scheduler
divides time into clock ticks. The number of clock ticks per second can be
obtained with the ``sysconf()`` function. A typical value of clock ticks per
second is 100, so a core can run a process for at most 100 ticks per second.
If a package has 4 cores, at most 400 ticks can be scheduled on all the cores
of the package over a period of 1 second.

``/proc/[pid]/stat`` [#b]_ is a procfs file that gives the execution time of
the process whose ID is ``[pid]``. It reports the number of ticks the process
has been scheduled in userspace (utime) and in kernel space (stime).
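
As an illustration, utime and stime can be extracted from a stat line as
shown below. This is a sketch, not QEMU's implementation: the function names
are hypothetical, and the field positions (utime is field 14, stime is
field 15) come from the proc(5) manual page.

.. code:: C

    #include <stdio.h>
    #include <string.h>

    /*
     * Parse utime and stime from the content of a stat file. The command
     * name (field 2) is enclosed in parentheses and may itself contain
     * spaces or parentheses, so parsing starts after the last ')'.
     * Returns 0 on success, -1 on malformed input.
     */
    int parse_stat_ticks(const char *stat_line,
                         unsigned long long *utime,
                         unsigned long long *stime)
    {
        const char *p = strrchr(stat_line, ')');

        if (p == NULL) {
            return -1;
        }

        /* Skip state (field 3) and fields 4 to 13, then read 14 and 15. */
        int n = sscanf(p + 1,
                       " %*c %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %llu %llu",
                       utime, stime);

        return n == 2 ? 0 : -1;
    }

    /* Read a stat file such as /proc/[pid]/stat and parse it. */
    int read_stat_ticks(const char *path,
                        unsigned long long *utime,
                        unsigned long long *stime)
    {
        char buf[1024];
        FILE *f = fopen(path, "r");

        if (f == NULL) {
            return -1;
        }

        char *line = fgets(buf, sizeof(buf), f);

        fclose(f);
        return line != NULL ? parse_stat_ticks(buf, utime, stime) : -1;
    }

Per-thread values are exposed in the same format under
``/proc/[pid]/task/[tid]/stat``.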

By reading those metrics for a thread, one can calculate the fraction of time
the package has spent executing that thread.

Example:

A 4-core package can schedule a maximum of 400 ticks per second, with 100
ticks per second per core. If a thread was scheduled for 100 ticks during one
second on this package, the thread has used 1/4 of the whole package. The
energy spent by the thread on this package during that second is therefore
1/4 of the total energy spent by the package.
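
The example above translates into the following sketch. The function names
are hypothetical, a one-second sampling window is assumed, and clamping the
ratio to 1 is a defensive choice of this sketch rather than something stated
in this document.

.. code:: C

    #include <unistd.h>

    /* Number of scheduler clock ticks per second on this host. */
    long clock_ticks_per_second(void)
    {
        return sysconf(_SC_CLK_TCK);
    }

    /*
     * Energy attributed to one thread over a one-second window:
     * (ticks used by the thread / ticks the package can schedule)
     * multiplied by the energy spent by the package.
     */
    double thread_energy_uj(unsigned long long thread_ticks,
                            unsigned nr_cores, long ticks_per_sec,
                            double pkg_energy_uj)
    {
        double max_ticks = (double)nr_cores * (double)ticks_per_sec;

        if (max_ticks <= 0.0) {
            return 0.0;
        }

        double ratio = (double)thread_ticks / max_ticks;

        if (ratio > 1.0) {
            ratio = 1.0;    /* sampling jitter must not exceed the package */
        }

        return ratio * pkg_energy_uj;
    }

For the 4-core package above, a thread that used 100 ticks while the package
spent 2 J is attributed ``thread_energy_uj(100, 4, 100, 2e6)``, that is
500000 micro-joules.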

Usage
-----

Currently this feature only works on an Intel CPU that has the RAPL driver
mounted and available in sysfs. If not, QEMU fails at start-up.

This feature is activated with
``-accel kvm,rapl=true,rapl-helper-socket=/path/sock.sock``

It is important that the socket path is the same as the one
:program:`qemu-vmsr-helper` is listening to.

qemu-vmsr-helper
----------------

qemu-vmsr-helper works very much like qemu-pr-helper. Instead of making
persistent reservations, qemu-vmsr-helper is there to work around the fix for
CVE-2020-8694, which removed user access to the RAPL MSR attributes.

A socket connection is established between each QEMU process that has RAPL
MSR support activated and qemu-vmsr-helper. A systemd service and a socket
activation unit are provided in contrib/systemd/qemu-vmsr-helper.(service/socket).

The systemd socket uses mode 600, like contrib/systemd/qemu-pr-helper.socket.
The socket can be passed via SCM_RIGHTS by libvirt, or its permissions can be
changed (e.g. 660 and root:kvm on a Debian system). Libvirt could also start a
separate helper if needed. All in all, the policy is left to the user.

See the qemu-pr-helper documentation or manpage for further details.

Current Limitations
-------------------

- Works only on Intel host CPUs because AMD CPUs are using different MSR
  addresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the
  moment.

References
----------

.. [#a] https://patchwork.kernel.org/project/kvm/patch/20200916202951.23760-7-graf@amazon.com/
.. [#b] https://man7.org/linux/man-pages/man5/proc.5.html