================
RAPL MSR support
================

The RAPL (Running Average Power Limit) interface advertises the accumulated
energy consumption of various power domains (e.g. CPU packages, DRAM, etc.).

The consumption is reported via MSRs (model specific registers) like
MSR_PKG_ENERGY_STATUS for the CPU package power domain. These MSRs are 64-bit
registers that represent the accumulated energy consumption in micro Joules.

Thanks to the MSR Filtering patch [#a]_ not all MSRs are handled by KVM. Some
of them can now be handled by userspace (QEMU). This relies on a mechanism
called "MSR filtering", where a list of MSRs is given to KVM when the VM is
initialized so that a callback is put in place. The design of this patch uses
only this mechanism for handling the MSRs between guest and host.

At the moment the following MSRs are involved:

.. code:: C

    #define MSR_RAPL_POWER_UNIT             0x00000606
    #define MSR_PKG_POWER_LIMIT             0x00000610
    #define MSR_PKG_ENERGY_STATUS           0x00000611
    #define MSR_PKG_POWER_INFO              0x00000614

The ``*_POWER_UNIT``, ``*_POWER_LIMIT`` and ``*_POWER_INFO`` MSRs are part of
the RAPL spec. They specify the power limit of the package, provide a range of
parameters (min power, max power, ...) and also give the multiplier that is
applied to the energy counter. Those MSRs are populated once at the beginning
by reading the host CPU MSRs and are given back to the guest 1:1 when
requested.

The MSR_PKG_ENERGY_STATUS is a counter; it represents the total amount of
energy consumed since the last time the register was cleared. If you multiply
it by the UNIT provided above you'll get the energy in micro-joules. This
counter is always increasing, and it increases faster or slower depending on
the consumption of the package. The counter is expected to overflow at some
point.

Each core belonging to the same package that reads MSR_PKG_ENERGY_STATUS (i.e.
``rdmsr 0x611``) retrieves the same value, which represents the energy for the
whole package. A core that belongs to PKG-0 cannot read the value of PKG-1 and
vice-versa.

High level implementation
-------------------------

In order to update the value of the virtual MSR, a QEMU thread is created.
The thread is basically just an infinite loop that does:

1. Snapshot of the time metrics of all QEMU threads (time spent scheduled in
   userspace and system).

2. Snapshot of the actual MSR_PKG_ENERGY_STATUS counter of all packages where
   the QEMU threads are running.

3. Sleep for 1 second. During this pause the vcpu and other non-vcpu threads
   will do what they have to do, and so the energy counter will increase.

4. Repeat 1. and 2. and calculate the delta of every metric representing the
   time spent scheduled for each QEMU thread *and* the energy spent by the
   packages during the pause.

5. Filter the vcpu threads and the non-vcpu threads.

6. Retrieve the topology of the Virtual Machine. This helps identify which
   vCPU is running on which virtual package.

7. The total energy spent by the non-vcpu threads is divided by the number
   of vcpu threads so that each vcpu thread will get an equal part of the
   energy spent by the QEMU workers.

8. Calculate the ratio of energy spent per vcpu thread.

9. Calculate the energy for each virtual package.

10. The virtual MSRs are updated for each virtual package. Each vCPU that
    belongs to the same package will return the same value when accessing
    the MSR.

11. Loop back to 1.

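Steps 7 to 9 can be illustrated with a minimal sketch. It is not the QEMU
implementation: the function ``vpkg_energy`` and all of its parameters are
hypothetical names, and the sketch assumes a single host package, a single
virtual package, and tick and energy deltas that were already computed in
step 4.

.. code:: C

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /*
     * Hypothetical helper (not QEMU code): energy attributed to one virtual
     * package over one sampling interval.
     *
     * vcpu_ticks   - ticks each vCPU thread was scheduled during the interval
     * n_vcpus      - number of vCPU threads in the virtual package
     * worker_ticks - ticks used by all non-vCPU QEMU threads together
     * max_ticks    - ticks the host package can schedule in the interval
     * pkg_energy   - energy spent by the host package in the interval
     */
    double vpkg_energy(const uint64_t *vcpu_ticks, size_t n_vcpus,
                       uint64_t worker_ticks, uint64_t max_ticks,
                       double pkg_energy)
    {
        assert(n_vcpus > 0 && max_ticks > 0);

        /* Step 7: non-vCPU energy, split equally among the vCPU threads. */
        double worker_energy =
            pkg_energy * (double)worker_ticks / (double)max_ticks;
        double worker_share = worker_energy / (double)n_vcpus;

        double total = 0.0;
        for (size_t i = 0; i < n_vcpus; i++) {
            /* Step 8: share of the package time used by this vCPU thread. */
            double ratio = (double)vcpu_ticks[i] / (double)max_ticks;

            /* Step 9: accumulate the energy of the virtual package. */
            total += ratio * pkg_energy + worker_share;
        }
        return total;
    }

With two vCPU threads scheduled for 100 and 50 ticks, 50 ticks of non-vCPU
work, a 400-tick package budget and 8 units of package energy, the sketch
attributes 2.5 and 1.5 units to the two vCPUs, so 4 units to the virtual
package.
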
Ratio calculation
-----------------

In Linux, a process has an execution time associated with it. The scheduler
divides time into clock ticks. The number of clock ticks per second can be
queried with ``sysconf(_SC_CLK_TCK)``. A typical value is 100, so a core can
run a process for at most 100 ticks per second. If a package has 4 cores, at
most 400 ticks can be scheduled on all the cores of the package over a period
of 1 second.

The /proc/[pid]/stat [#b]_ file is a procfs file that gives the execution time
of the process whose process ID is [pid]. It reports the number of ticks the
process has been scheduled in userspace (utime) and kernel space (stime).

By reading those metrics for a thread, one can calculate the ratio of time the
package has spent executing the thread.

Example:

A 4-core package can schedule a maximum of 400 ticks per second, with 100
ticks per second per core. If a thread was scheduled for 100 ticks during one
second on this package, it has been scheduled for 1/4 of the whole package.
The energy spent by the thread on this package during that second is therefore
estimated as 1/4 of the total energy spent by the package.

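The ratio described above can be sketched in C as follows. This is only an
illustration under stated assumptions, not the code used by QEMU: the function
names ``parse_stat_ticks`` and ``package_time_ratio`` are hypothetical, and
the parser relies on utime and stime being fields 14 and 15 of the stat line,
as documented in proc(5).

.. code:: C

    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Hypothetical helper: extract utime and stime (fields 14 and 15) from
     * the content of a /proc/[pid]/stat file. Returns 0 on success and -1
     * if the line cannot be parsed.
     */
    int parse_stat_ticks(const char *line, unsigned long *utime,
                         unsigned long *stime)
    {
        /*
         * Field 2 (comm) is parenthesised and may contain spaces or ')',
         * so parsing resumes after the last ')'.
         */
        const char *p = strrchr(line, ')');
        if (p == NULL) {
            return -1;
        }

        /* Skip fields 3 to 13, then read utime and stime. */
        int n = sscanf(p + 1,
                       "%*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %lu %lu",
                       utime, stime);
        return n == 2 ? 0 : -1;
    }

    /*
     * Hypothetical helper: fraction of the package time used by a thread
     * that was scheduled for delta_ticks during one second.
     */
    double package_time_ratio(unsigned long delta_ticks,
                              unsigned long ticks_per_sec,
                              unsigned int cores)
    {
        assert(ticks_per_sec > 0 && cores > 0);
        return (double)delta_ticks / ((double)ticks_per_sec * cores);
    }

For the example above, ``package_time_ratio(100, 100, 4)`` gives 0.25. The
energy attributed to the thread is then this ratio multiplied by the energy
delta of the package over the same second.
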
Usage
-----

Currently this feature only works on an Intel CPU that has the RAPL driver
loaded and available in sysfs. If not, QEMU fails at start-up.

This feature is activated with
``-accel kvm,rapl=true,rapl-helper-socket=/path/sock.sock``.

It is important that the socket path is the same as the one
:program:`qemu-vmsr-helper` is listening to.

qemu-vmsr-helper
----------------

The qemu-vmsr-helper works very much like the qemu-pr-helper. Instead of
making persistent reservations, qemu-vmsr-helper is here to work around the
fix for CVE-2020-8694, which removed user access to the RAPL MSR attributes.

A socket connection is established between each QEMU process that has RAPL
MSR support activated and the qemu-vmsr-helper. A systemd service and socket
activation unit are provided in
contrib/systemd/qemu-vmsr-helper.(service/socket).

The systemd socket uses mode 600, like contrib/systemd/qemu-pr-helper.socket.
The socket can be passed via SCM_RIGHTS by libvirt, or its permissions can be
changed (e.g. 660 and root:kvm on a Debian system). Libvirt could also start a
separate helper if needed. All in all, the policy is left to the user.

See the qemu-pr-helper documentation or manpage for further details.

Current Limitations
-------------------

- Works only on Intel host CPUs because AMD CPUs are using different MSR
  addresses.

- Only the Package Power-Plane (MSR_PKG_ENERGY_STATUS) is reported at the
  moment.

References
----------

.. [#a] https://patchwork.kernel.org/project/kvm/patch/20200916202951.23760-7-graf@amazon.com/
.. [#b] https://man7.org/linux/man-pages/man5/proc.5.html