================================
PSI - Pressure Stall Information
================================

:Date: April, 2018
:Author: Johannes Weiner <hannes@cmpxchg.org>

When CPU, memory or IO devices are contended, workloads experience
latency spikes, throughput losses, and run the risk of OOM kills.

Without an accurate measure of such contention, users are forced to
either play it safe and under-utilize their hardware resources, or
roll the dice and frequently suffer the disruptions resulting from
excessive overcommit.

The psi feature identifies and quantifies the disruptions caused by
such resource crunches and the time impact they have on complex
workloads or even entire systems.

Having an accurate measure of productivity losses caused by resource
scarcity aids users in sizing workloads to hardware--or provisioning
hardware according to workload demand.

As psi aggregates this information in realtime, systems can be managed
dynamically using techniques such as load shedding, migrating jobs to
other systems or data centers, or strategically pausing or killing low
priority or restartable batch jobs.

This allows maximizing hardware utilization without sacrificing
workload health or risking major disruptions such as OOM kills.

Pressure interface
==================

Pressure information for each resource is exported through the
respective file in /proc/pressure/ -- cpu, memory, and io.

The format for CPU is as such::

    some avg10=0.00 avg60=0.00 avg300=0.00 total=0

and for memory and IO::

    some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    full avg10=0.00 avg60=0.00 avg300=0.00 total=0

The "some" line indicates the share of time in which at least some
tasks are stalled on a given resource.

The "full" line indicates the share of time in which all non-idle
tasks are stalled on a given resource simultaneously.
In this state actual CPU cycles are going to waste, and a workload that
spends extended time in this state is considered to be thrashing. This has
a severe impact on performance, and it's useful to distinguish this
situation from a state where some tasks are stalled but the CPU is
still doing productive work. As such, time spent in this subset of the
stall state is tracked separately and exported in the "full" averages.

The ratios (in %) are tracked as recent trends over ten, sixty, and
three hundred second windows, which gives insight into short term events
as well as medium and long term trends. The total absolute stall time
(in us) is tracked and exported as well, to allow detection of latency
spikes which wouldn't necessarily make a dent in the time averages,
or to average trends over custom time frames.

Monitoring for pressure thresholds
==================================

Users can register triggers and use poll() to be woken up when resource
pressure exceeds certain thresholds.

A trigger describes the maximum cumulative stall time over a specific
time window, e.g. 100ms of total stall time within any 500ms window to
generate a wakeup event.

To register a trigger, the user has to open the psi interface file under
/proc/pressure/ representing the resource to be monitored and write the
desired threshold and time window. The open file descriptor should be
used to wait for trigger events using select(), poll() or epoll().
The following format is used::

    <some|full> <stall amount in us> <time window in us>

For example, writing "some 150000 1000000" into /proc/pressure/memory
would add a 150ms threshold for partial memory stall measured within
a 1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
would add a 50ms threshold for full io stall measured within a 1sec
time window.
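
Because the trigger format takes microseconds while thresholds are usually
reasoned about in milliseconds, a small helper can reduce unit mistakes.
The following is a minimal sketch, not part of any kernel or libc API: the
name psi_format_trigger and its signature are hypothetical, and the check
that the stall amount does not exceed the window is an assumption about
what makes a sensible trigger.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build a psi trigger string of the form
 * "<some|full> <stall amount in us> <time window in us>" from
 * millisecond values. Returns the string length on success, or -1 on
 * invalid input or if buf is too small.
 */
int psi_format_trigger(char *buf, size_t size, const char *kind,
                       unsigned int stall_ms, unsigned int window_ms)
{
    int len;

    /* Only the two line types named in the format are accepted. */
    if (strcmp(kind, "some") != 0 && strcmp(kind, "full") != 0)
        return -1;

    /* Assumption: a threshold larger than its window cannot be met. */
    if (stall_ms == 0 || stall_ms > window_ms)
        return -1;

    /* Convert milliseconds to the microseconds the interface expects. */
    len = snprintf(buf, size, "%s %llu %llu", kind,
                   (unsigned long long)stall_ms * 1000ULL,
                   (unsigned long long)window_ms * 1000ULL);
    if (len < 0 || (size_t)len >= size)
        return -1;

    return len;
}
```

Calling psi_format_trigger(buf, sizeof(buf), "some", 150, 1000) fills buf
with "some 150000 1000000", the first example string above, which can then
be written to the opened pressure file.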

Triggers can be set on more than one psi metric, and more than one trigger
for the same psi metric can be specified. However, each trigger requires a
separate file descriptor so that it can be polled separately from the
others; therefore a separate open() syscall should be made for each
trigger, even when opening the same psi interface file.

Monitors activate only when the system enters a stall state for the
monitored psi metric and deactivate upon exit from the stall state. While
the system is in the stall state, psi signal growth is monitored at a rate
of 10 times per tracking window.

The kernel accepts window sizes ranging from 500ms to 10s; therefore the
minimum monitoring update interval is 50ms and the maximum is 1s. The
minimum limit is set to prevent overly frequent polling. The maximum limit
is chosen as a high enough number after which monitors are most likely not
needed and psi averages can be used instead.

When activated, a psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when the system
is bouncing in and out of the stall state.

Notifications to userspace are rate-limited to one per tracking window.

The trigger will de-register when the file descriptor used to define the
trigger is closed.

Userspace monitor usage example
===============================

::

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <poll.h>
    #include <string.h>
    #include <unistd.h>

    /*
     * Monitor memory partial stall with 1s tracking window size
     * and 150ms threshold.
     */
    int main(void) {
        const char trig[] = "some 150000 1000000";
        struct pollfd fds;
        int n;

        fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
        if (fds.fd < 0) {
            printf("/proc/pressure/memory open error: %s\n",
                strerror(errno));
            return 1;
        }
        fds.events = POLLPRI;

        /* Register the trigger by writing it to the open file. */
        if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
            printf("/proc/pressure/memory write error: %s\n",
                strerror(errno));
            return 1;
        }

        printf("waiting for events...\n");
        while (1) {
            n = poll(&fds, 1, -1);
            if (n < 0) {
                printf("poll error: %s\n", strerror(errno));
                return 1;
            }
            if (fds.revents & POLLERR) {
                printf("got POLLERR, event source is gone\n");
                return 0;
            }
            if (fds.revents & POLLPRI) {
                printf("event triggered!\n");
            } else {
                printf("unknown event received: 0x%x\n", fds.revents);
                return 1;
            }
        }

        return 0;
    }

Cgroup2 interface
=================

In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
mounted, pressure stall information is also tracked for tasks grouped
into cgroups. Each subdirectory in the cgroupfs mountpoint contains
cpu.pressure, memory.pressure, and io.pressure files; the format is
the same as the /proc/pressure/ files.

Per-cgroup psi monitors can be specified and used the same way as
system-wide ones.
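
Since the per-cgroup files share the line format of the /proc/pressure/
files, one parser can serve both. The following is a minimal sketch under
stated assumptions: struct psi_line, psi_parse_line() and psi_stall_pct()
are hypothetical names introduced only for illustration. The second helper
shows how two samples of the "total" field could be turned into a stall
percentage over a custom time frame.

```c
#include <stdio.h>

/* Hypothetical container for one line of a pressure file. */
struct psi_line {
    char kind[8];                   /* "some" or "full" */
    double avg10, avg60, avg300;    /* stall share in percent */
    unsigned long long total_us;    /* cumulative stall time in us */
};

/*
 * Parse a line such as
 *   "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
 * Returns 0 on success, -1 if the line does not match the format.
 */
int psi_parse_line(const char *line, struct psi_line *out)
{
    int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   out->kind, &out->avg10, &out->avg60, &out->avg300,
                   &out->total_us);

    return n == 5 ? 0 : -1;
}

/*
 * Stall share in percent between two samples of "total" taken
 * elapsed_us microseconds apart. Returns 0.0 for unusable input.
 */
double psi_stall_pct(unsigned long long prev_total_us,
                     unsigned long long cur_total_us,
                     unsigned long long elapsed_us)
{
    if (elapsed_us == 0 || cur_total_us < prev_total_us)
        return 0.0;

    return 100.0 * (double)(cur_total_us - prev_total_us) /
           (double)elapsed_us;
}
```

A caller would read each line with fgets() from the file of interest and
pass it to psi_parse_line(). The path of a cgroup's pressure file depends
on where the cgroup2 filesystem is mounted on the system in question, so
it is left to the caller here.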