================
RAPL MSR support
================

The RAPL (Running Average Power Limit) interface advertises the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).

The consumption is reported via MSRs (model-specific registers) like
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64-bit
registers that represent the accumulated energy consumption in micro-joules.

Thanks to KVM's `MSR filtering <msr-filter-patch_>`__ functionality,
not all MSRs are handled by KVM. Some of them can now be handled by
userspace (QEMU): a list of MSRs is given to KVM at VM creation time, and
a userspace exit occurs when they are accessed.

.. _msr-filter-patch: https://patchwork.kernel.org/project/kvm/patch/20200916202951.23760-7-graf@amazon.com/

At the moment the following MSRs are involved:

.. code:: C

    #define MSR_RAPL_POWER_UNIT             0x00000606
    #define MSR_PKG_POWER_LIMIT             0x00000610
    #define MSR_PKG_ENERGY_STATUS           0x00000611
    #define MSR_PKG_POWER_INFO              0x00000614

The ``*_POWER_UNIT``, ``*_POWER_LIMIT`` and ``*_POWER_INFO`` MSRs are part of
the RAPL spec. They specify the power limit of the package, provide a range of
parameters (min power, max power, ...) and give the multiplier for the energy
counter. Those MSRs are populated once at the beginning by reading the host
CPU MSRs and are given back to the guest 1:1 when requested.

MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount of
energy consumed since the last time the register was cleared. If you multiply
it by the unit provided above you get the energy in micro-joules. This
counter is always increasing, at a rate that depends on the consumption of
the package. It is expected to overflow at some point.

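
As a minimal sketch of how a raw reading relates to micro-joules, the helpers
below assume the register layout documented in the Intel SDM: the energy unit
exponent (ESU) sits in bits 12:8 of MSR_RAPL_POWER_UNIT, one counter tick is
1/2^ESU joules, and only the lower 32 bits of MSR_PKG_ENERGY_STATUS hold the
counter. The function names are hypothetical and are not taken from the QEMU
sources.

.. code:: C

    #include <stdint.h>

    /* Assumption: Energy Status Units live in bits 12:8 of MSR_RAPL_POWER_UNIT. */
    static inline unsigned rapl_energy_unit_shift(uint64_t power_unit_msr)
    {
        return (unsigned)((power_unit_msr >> 8) & 0x1f);
    }

    /* Convert counter ticks to micro-joules: ticks * 10^6 / 2^ESU. */
    static inline uint64_t rapl_ticks_to_uj(uint64_t ticks, unsigned esu)
    {
        return (ticks * 1000000ULL) >> esu;
    }

    /*
     * Difference between two readings of the counter. Only the lower 32 bits
     * are used, and unsigned arithmetic makes a single wraparound between the
     * two readings harmless.
     */
    static inline uint32_t rapl_counter_delta(uint64_t before, uint64_t after)
    {
        return (uint32_t)after - (uint32_t)before;
    }

With an ESU field of 14, for instance, 16384 ticks convert to 1000000
micro-joules (one joule). ``rapl_counter_delta`` only tolerates one overflow
between two readings, so the sampling period has to stay short enough for
that to hold.
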

Each core belonging to the same package that reads MSR_PKG_ENERGY_STATUS (i.e.
"rdmsr 0x611") retrieves the same value, which represents the energy for the
whole package. A core that belongs to PKG-0 cannot get the value of PKG-1, and
vice versa.

High level implementation
-------------------------

In order to update the value of the virtual MSR, a QEMU thread is created.
The thread is basically just an infinite loop that does the following:

1. Snapshot the time metrics of all QEMU threads (time spent scheduled in
   userspace and in the kernel).

2. Snapshot the actual MSR_PKG_ENERGY_STATUS counter of all packages on which
   the QEMU threads are running.

3. Sleep for 1 second. During this pause the vCPU and other non-vCPU threads
   do what they have to do, so the energy counter increases.

4. Repeat 1. and 2. and calculate the delta of every metric representing the
   time spent scheduled for each QEMU thread *and* the energy spent by the
   packages during the pause.

5. Separate the vCPU threads from the non-vCPU threads.

6. Retrieve the topology of the virtual machine. This helps identify which
   vCPU is running on which virtual package.

7. The total energy spent by the non-vCPU threads is divided by the number
   of vCPU threads so that each vCPU thread gets an equal part of the
   energy spent by the QEMU workers.

8. Calculate the ratio of energy spent per vCPU thread.

9. Calculate the energy for each virtual package.

10. The virtual MSRs are updated for each virtual package. Each vCPU that
    belongs to the same package returns the same value when accessing
    the MSR.

11. Loop back to 1.

Ratio calculation
-----------------

In Linux, a process has an execution time associated with it. The scheduler
divides time into clock ticks.
The number of clock ticks per second can be
found with the ``sysconf`` function (``_SC_CLK_TCK``). A typical value is 100
clock ticks per second, so a core can run a process for at most 100 ticks per
second. If a package has 4 cores, at most 400 ticks can be scheduled on all
the cores of the package over a period of 1 second.

`/proc/[pid]/stat <stat_>`__ is a procfs file that gives the execution
time of the process whose process ID is [pid]. It reports the number
of ticks the process has been scheduled in userspace (utime) and in kernel
space (stime).

.. _stat: https://man7.org/linux/man-pages/man5/proc.5.html

By reading those metrics for a thread, one can calculate the ratio of time the
package has spent executing the thread.

Example:

A 4-core package can schedule a maximum of 400 ticks per second, with 100 ticks
per second per core. If a thread was scheduled for 100 ticks during one second
on this package, it used 1/4 of the package's capacity. The energy attributed
to the thread for that second is therefore 1/4 of the total energy spent by
the package.

Usage
-----

Currently this feature only works on Intel CPUs that have the RAPL driver
loaded and available in sysfs. If not, QEMU fails at start-up.

This feature is activated with
``-accel kvm,rapl=true,rapl-helper-socket=/path/sock.sock``

It is important that the socket path is the same as the one
:program:`qemu-vmsr-helper` is listening to.

qemu-vmsr-helper
----------------

qemu-vmsr-helper works very much like qemu-pr-helper. Instead of
making persistent reservations, qemu-vmsr-helper is here to work around the
fix for CVE-2020-8694, which removed user access to the RAPL MSR attributes.

A socket connection is established between the QEMU processes that have RAPL
MSR support activated and qemu-vmsr-helper.
A systemd service and a socket
activation unit are provided in contrib/systemd/qemu-vmsr-helper.(service/socket).

The systemd socket uses mode 600, like contrib/systemd/qemu-pr-helper.socket.
The socket can be passed via SCM_RIGHTS by libvirt, or its permissions can be
changed (e.g. 660 and root:kvm on a Debian system). Libvirt could
also start a separate helper if needed. All in all, the policy is left to the
user.

See the qemu-pr-helper documentation or manpage for further details.

Current Limitations
-------------------

- Works only on Intel host CPUs because AMD CPUs use different MSR
  addresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the
  moment.
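
Returning to the arithmetic of the "Ratio calculation" section, the following
is a minimal sketch of how a share of the package energy could be attributed to
one thread. The function names and the clamping behaviour are assumptions made
for illustration only; they are not the QEMU implementation.

.. code:: C

    #include <stdint.h>
    #include <unistd.h>

    /* Maximum number of ticks a package can schedule in interval_s seconds. */
    static uint64_t pkg_max_ticks(unsigned cores, unsigned interval_s)
    {
        long hz = sysconf(_SC_CLK_TCK);    /* typically 100 */

        if (hz <= 0) {
            return 0;                      /* tick rate unavailable */
        }
        return (uint64_t)cores * (uint64_t)hz * interval_s;
    }

    /*
     * Energy attributed to a thread: the package energy scaled by the
     * fraction of the package's ticks the thread was scheduled for
     * (delta of utime + stime over the interval).
     */
    static uint64_t thread_energy_uj(uint64_t pkg_energy_uj,
                                     uint64_t thread_ticks,
                                     uint64_t max_ticks)
    {
        if (max_ticks == 0) {
            return 0;
        }
        if (thread_ticks > max_ticks) {
            thread_ticks = max_ticks;      /* never attribute more than 100% */
        }
        return pkg_energy_uj * thread_ticks / max_ticks;
    }

With the numbers from the example, ``thread_energy_uj(8000000, 100, 400)``
attributes 2000000 micro-joules, a quarter of the package energy, to the
thread.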