12fb75e1bSLiu Xinpeng // SPDX-License-Identifier: GPL-2.0 2eb414681SJohannes Weiner /* 3eb414681SJohannes Weiner * Pressure stall information for CPU, memory and IO 4eb414681SJohannes Weiner * 5eb414681SJohannes Weiner * Copyright (c) 2018 Facebook, Inc. 6eb414681SJohannes Weiner * Author: Johannes Weiner <hannes@cmpxchg.org> 7eb414681SJohannes Weiner * 80e94682bSSuren Baghdasaryan * Polling support by Suren Baghdasaryan <surenb@google.com> 90e94682bSSuren Baghdasaryan * Copyright (c) 2018 Google, Inc. 100e94682bSSuren Baghdasaryan * 11eb414681SJohannes Weiner * When CPU, memory and IO are contended, tasks experience delays that 12eb414681SJohannes Weiner * reduce throughput and introduce latencies into the workload. Memory 13eb414681SJohannes Weiner * and IO contention, in addition, can cause a full loss of forward 14eb414681SJohannes Weiner * progress in which the CPU goes idle. 15eb414681SJohannes Weiner * 16eb414681SJohannes Weiner * This code aggregates individual task delays into resource pressure 17eb414681SJohannes Weiner * metrics that indicate problems with both workload health and 18eb414681SJohannes Weiner * resource utilization. 19eb414681SJohannes Weiner * 20eb414681SJohannes Weiner * Model 21eb414681SJohannes Weiner * 22eb414681SJohannes Weiner * The time in which a task can execute on a CPU is our baseline for 23eb414681SJohannes Weiner * productivity. Pressure expresses the amount of time in which this 24eb414681SJohannes Weiner * potential cannot be realized due to resource contention. 25eb414681SJohannes Weiner * 26eb414681SJohannes Weiner * This concept of productivity has two components: the workload and 27eb414681SJohannes Weiner * the CPU. To measure the impact of pressure on both, we define two 28eb414681SJohannes Weiner * contention states for a resource: SOME and FULL. 29eb414681SJohannes Weiner * 30eb414681SJohannes Weiner * In the SOME state of a given resource, one or more tasks are 31eb414681SJohannes Weiner * delayed on that resource. This affects the workload's ability to 32eb414681SJohannes Weiner * perform work, but the CPU may still be executing other tasks. 33eb414681SJohannes Weiner * 34eb414681SJohannes Weiner * In the FULL state of a given resource, all non-idle tasks are 35eb414681SJohannes Weiner * delayed on that resource such that nobody is advancing and the CPU 36eb414681SJohannes Weiner * goes idle. This leaves both workload and CPU unproductive. 37eb414681SJohannes Weiner * 38eb414681SJohannes Weiner * SOME = nr_delayed_tasks != 0 39cb0e52b7SBrian Chen * FULL = nr_delayed_tasks != 0 && nr_productive_tasks == 0 40cb0e52b7SBrian Chen * 41cb0e52b7SBrian Chen * What it means for a task to be productive is defined differently 42cb0e52b7SBrian Chen * for each resource. For IO, productive means a running task. For 43cb0e52b7SBrian Chen * memory, productive means a running task that isn't a reclaimer. For 44cb0e52b7SBrian Chen * CPU, productive means an oncpu task. 45cb0e52b7SBrian Chen * 46cb0e52b7SBrian Chen * Naturally, the FULL state doesn't exist for the CPU resource at the 47cb0e52b7SBrian Chen * system level, but exist at the cgroup level. At the cgroup level, 48cb0e52b7SBrian Chen * FULL means all non-idle tasks in the cgroup are delayed on the CPU 49cb0e52b7SBrian Chen * resource which is being used by others outside of the cgroup or 50cb0e52b7SBrian Chen * throttled by the cgroup cpu.max configuration. 51eb414681SJohannes Weiner * 52eb414681SJohannes Weiner * The percentage of wallclock time spent in those compound stall 53eb414681SJohannes Weiner * states gives pressure numbers between 0 and 100 for each resource, 54eb414681SJohannes Weiner * where the SOME percentage indicates workload slowdowns and the FULL 55eb414681SJohannes Weiner * percentage indicates reduced CPU utilization: 56eb414681SJohannes Weiner * 57eb414681SJohannes Weiner * %SOME = time(SOME) / period 58eb414681SJohannes Weiner * %FULL = time(FULL) / period 59eb414681SJohannes Weiner * 60eb414681SJohannes Weiner * Multiple CPUs 61eb414681SJohannes Weiner * 62eb414681SJohannes Weiner * The more tasks and available CPUs there are, the more work can be 63eb414681SJohannes Weiner * performed concurrently. This means that the potential that can go 64eb414681SJohannes Weiner * unrealized due to resource contention *also* scales with non-idle 65eb414681SJohannes Weiner * tasks and CPUs. 66eb414681SJohannes Weiner * 67eb414681SJohannes Weiner * Consider a scenario where 257 number crunching tasks are trying to 68eb414681SJohannes Weiner * run concurrently on 256 CPUs. If we simply aggregated the task 69eb414681SJohannes Weiner * states, we would have to conclude a CPU SOME pressure number of 70eb414681SJohannes Weiner * 100%, since *somebody* is waiting on a runqueue at all 71eb414681SJohannes Weiner * times. However, that is clearly not the amount of contention the 723b03706fSIngo Molnar * workload is experiencing: only one out of 256 possible execution 73eb414681SJohannes Weiner * threads will be contended at any given time, or about 0.4%. 74eb414681SJohannes Weiner * 75eb414681SJohannes Weiner * Conversely, consider a scenario of 4 tasks and 4 CPUs where at any 76eb414681SJohannes Weiner * given time *one* of the tasks is delayed due to a lack of memory. 77eb414681SJohannes Weiner * Again, looking purely at the task state would yield a memory FULL 78eb414681SJohannes Weiner * pressure number of 0%, since *somebody* is always making forward 79eb414681SJohannes Weiner * progress. But again this wouldn't capture the amount of execution 80eb414681SJohannes Weiner * potential lost, which is 1 out of 4 CPUs, or 25%. 81eb414681SJohannes Weiner * 82eb414681SJohannes Weiner * To calculate wasted potential (pressure) with multiple processors, 83eb414681SJohannes Weiner * we have to base our calculation on the number of non-idle tasks in 84eb414681SJohannes Weiner * conjunction with the number of available CPUs, which is the number 85eb414681SJohannes Weiner * of potential execution threads. SOME becomes then the proportion of 863b03706fSIngo Molnar * delayed tasks to possible threads, and FULL is the share of possible 87eb414681SJohannes Weiner * threads that are unproductive due to delays: 88eb414681SJohannes Weiner * 89eb414681SJohannes Weiner * threads = min(nr_nonidle_tasks, nr_cpus) 90eb414681SJohannes Weiner * SOME = min(nr_delayed_tasks / threads, 1) 91cb0e52b7SBrian Chen * FULL = (threads - min(nr_productive_tasks, threads)) / threads 92eb414681SJohannes Weiner * 93eb414681SJohannes Weiner * For the 257 number crunchers on 256 CPUs, this yields: 94eb414681SJohannes Weiner * 95eb414681SJohannes Weiner * threads = min(257, 256) 96eb414681SJohannes Weiner * SOME = min(1 / 256, 1) = 0.4% 97cb0e52b7SBrian Chen * FULL = (256 - min(256, 256)) / 256 = 0% 98eb414681SJohannes Weiner * 99eb414681SJohannes Weiner * For the 1 out of 4 memory-delayed tasks, this yields: 100eb414681SJohannes Weiner * 101eb414681SJohannes Weiner * threads = min(4, 4) 102eb414681SJohannes Weiner * SOME = min(1 / 4, 1) = 25% 103eb414681SJohannes Weiner * FULL = (4 - min(3, 4)) / 4 = 25% 104eb414681SJohannes Weiner * 105eb414681SJohannes Weiner * [ Substitute nr_cpus with 1, and you can see that it's a natural 106eb414681SJohannes Weiner * extension of the single-CPU model. ] 107eb414681SJohannes Weiner * 108eb414681SJohannes Weiner * Implementation 109eb414681SJohannes Weiner * 110eb414681SJohannes Weiner * To assess the precise time spent in each such state, we would have 111eb414681SJohannes Weiner * to freeze the system on task changes and start/stop the state 112eb414681SJohannes Weiner * clocks accordingly. Obviously that doesn't scale in practice. 113eb414681SJohannes Weiner * 114eb414681SJohannes Weiner * Because the scheduler aims to distribute the compute load evenly 115eb414681SJohannes Weiner * among the available CPUs, we can track task state locally to each 116eb414681SJohannes Weiner * CPU and, at much lower frequency, extrapolate the global state for 117eb414681SJohannes Weiner * the cumulative stall times and the running averages. 118eb414681SJohannes Weiner * 119eb414681SJohannes Weiner * For each runqueue, we track: 120eb414681SJohannes Weiner * 121eb414681SJohannes Weiner * tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0) 122cb0e52b7SBrian Chen * tFULL[cpu] = time(nr_delayed_tasks[cpu] && !nr_productive_tasks[cpu]) 123eb414681SJohannes Weiner * tNONIDLE[cpu] = time(nr_nonidle_tasks[cpu] != 0) 124eb414681SJohannes Weiner * 125eb414681SJohannes Weiner * and then periodically aggregate: 126eb414681SJohannes Weiner * 127eb414681SJohannes Weiner * tNONIDLE = sum(tNONIDLE[i]) 128eb414681SJohannes Weiner * 129eb414681SJohannes Weiner * tSOME = sum(tSOME[i] * tNONIDLE[i]) / tNONIDLE 130eb414681SJohannes Weiner * tFULL = sum(tFULL[i] * tNONIDLE[i]) / tNONIDLE 131eb414681SJohannes Weiner * 132eb414681SJohannes Weiner * %SOME = tSOME / period 133eb414681SJohannes Weiner * %FULL = tFULL / period 134eb414681SJohannes Weiner * 135eb414681SJohannes Weiner * This gives us an approximation of pressure that is practical 136eb414681SJohannes Weiner * cost-wise, yet way more sensitive and accurate than periodic 137eb414681SJohannes Weiner * sampling of the aggregate task states would be. 138eb414681SJohannes Weiner */ 139eb414681SJohannes Weiner 140eb414681SJohannes Weiner static int psi_bug __read_mostly; 141eb414681SJohannes Weiner 142e0c27447SJohannes Weiner DEFINE_STATIC_KEY_FALSE(psi_disabled); 1433958e2d0SSuren Baghdasaryan DEFINE_STATIC_KEY_TRUE(psi_cgroups_enabled); 144e0c27447SJohannes Weiner 145e0c27447SJohannes Weiner #ifdef CONFIG_PSI_DEFAULT_DISABLED 1469289c5e6SSuren Baghdasaryan static bool psi_enable; 147e0c27447SJohannes Weiner #else 1489289c5e6SSuren Baghdasaryan static bool psi_enable = true; 149e0c27447SJohannes Weiner #endif 150e0c27447SJohannes Weiner static int __init setup_psi(char *str) 151e0c27447SJohannes Weiner { 152e0c27447SJohannes Weiner return kstrtobool(str, &psi_enable) == 0; 153e0c27447SJohannes Weiner } 154e0c27447SJohannes Weiner __setup("psi=", setup_psi); 155eb414681SJohannes Weiner 156eb414681SJohannes Weiner /* Running averages - we need to be higher-res than loadavg */ 157eb414681SJohannes Weiner #define PSI_FREQ (2*HZ+1) /* 2 sec intervals */ 158eb414681SJohannes Weiner #define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */ 159eb414681SJohannes Weiner #define EXP_60s 1981 /* 1/exp(2s/60s) */ 160eb414681SJohannes Weiner #define EXP_300s 2034 /* 1/exp(2s/300s) */ 161eb414681SJohannes Weiner 1620e94682bSSuren Baghdasaryan /* PSI trigger definitions */ 1630e94682bSSuren Baghdasaryan #define WINDOW_MIN_US 500000 /* Min window size is 500ms */ 1640e94682bSSuren Baghdasaryan #define WINDOW_MAX_US 10000000 /* Max window size is 10s */ 1650e94682bSSuren Baghdasaryan #define UPDATES_PER_WINDOW 10 /* 10 updates per window */ 1660e94682bSSuren Baghdasaryan 167eb414681SJohannes Weiner /* Sampling frequency in nanoseconds */ 168eb414681SJohannes Weiner static u64 psi_period __read_mostly; 169eb414681SJohannes Weiner 170eb414681SJohannes Weiner /* System-level pressure and stall tracking */ 171eb414681SJohannes Weiner static DEFINE_PER_CPU(struct psi_group_cpu, system_group_pcpu); 172df5ba5beSDan Schatzberg struct psi_group psi_system = { 173eb414681SJohannes Weiner .pcpu = &system_group_pcpu, 174eb414681SJohannes Weiner }; 175eb414681SJohannes Weiner 176bcc78db6SSuren Baghdasaryan static void psi_avgs_work(struct work_struct *work); 177eb414681SJohannes Weiner 1788f91efd8SZhaoyang Huang static void poll_timer_fn(struct timer_list *t); 1798f91efd8SZhaoyang Huang 180eb414681SJohannes Weiner static void group_init(struct psi_group *group) 181eb414681SJohannes Weiner { 182eb414681SJohannes Weiner int cpu; 183eb414681SJohannes Weiner 184eb414681SJohannes Weiner for_each_possible_cpu(cpu) 185eb414681SJohannes Weiner seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq); 1863dfbe25cSJohannes Weiner group->avg_last_update = sched_clock(); 1873dfbe25cSJohannes Weiner group->avg_next_update = group->avg_last_update + psi_period; 188bcc78db6SSuren Baghdasaryan INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work); 189bcc78db6SSuren Baghdasaryan mutex_init(&group->avgs_lock); 1900e94682bSSuren Baghdasaryan /* Init trigger-related members */ 1910e94682bSSuren Baghdasaryan mutex_init(&group->trigger_lock); 1920e94682bSSuren Baghdasaryan INIT_LIST_HEAD(&group->triggers); 1930e94682bSSuren Baghdasaryan group->poll_min_period = U32_MAX; 1940e94682bSSuren Baghdasaryan group->polling_next_update = ULLONG_MAX; 1958f91efd8SZhaoyang Huang init_waitqueue_head(&group->poll_wait); 1968f91efd8SZhaoyang Huang timer_setup(&group->poll_timer, poll_timer_fn, 0); 197461daba0SSuren Baghdasaryan rcu_assign_pointer(group->poll_task, NULL); 198eb414681SJohannes Weiner } 199eb414681SJohannes Weiner 200eb414681SJohannes Weiner void __init psi_init(void) 201eb414681SJohannes Weiner { 202e0c27447SJohannes Weiner if (!psi_enable) { 203e0c27447SJohannes Weiner static_branch_enable(&psi_disabled); 204e2ad8ab0SChengming Zhou static_branch_disable(&psi_cgroups_enabled); 205eb414681SJohannes Weiner return; 206e0c27447SJohannes Weiner } 207eb414681SJohannes Weiner 2083958e2d0SSuren Baghdasaryan if (!cgroup_psi_enabled()) 2093958e2d0SSuren Baghdasaryan static_branch_disable(&psi_cgroups_enabled); 2103958e2d0SSuren Baghdasaryan 211eb414681SJohannes Weiner psi_period = jiffies_to_nsecs(PSI_FREQ); 212eb414681SJohannes Weiner group_init(&psi_system); 213eb414681SJohannes Weiner } 214eb414681SJohannes Weiner 215eb414681SJohannes Weiner static bool test_state(unsigned int *tasks, enum psi_states state) 216eb414681SJohannes Weiner { 217eb414681SJohannes Weiner switch (state) { 218eb414681SJohannes Weiner case PSI_IO_SOME: 219fddc8babSJohannes Weiner return unlikely(tasks[NR_IOWAIT]); 220eb414681SJohannes Weiner case PSI_IO_FULL: 221fddc8babSJohannes Weiner return unlikely(tasks[NR_IOWAIT] && !tasks[NR_RUNNING]); 222eb414681SJohannes Weiner case PSI_MEM_SOME: 223fddc8babSJohannes Weiner return unlikely(tasks[NR_MEMSTALL]); 224eb414681SJohannes Weiner case PSI_MEM_FULL: 225cb0e52b7SBrian Chen return unlikely(tasks[NR_MEMSTALL] && 226cb0e52b7SBrian Chen tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]); 227eb414681SJohannes Weiner case PSI_CPU_SOME: 228fddc8babSJohannes Weiner return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]); 229e7fcd762SChengming Zhou case PSI_CPU_FULL: 230fddc8babSJohannes Weiner return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]); 231eb414681SJohannes Weiner case PSI_NONIDLE: 232eb414681SJohannes Weiner return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] || 233eb414681SJohannes Weiner tasks[NR_RUNNING]; 234eb414681SJohannes Weiner default: 235eb414681SJohannes Weiner return false; 236eb414681SJohannes Weiner } 237eb414681SJohannes Weiner } 238eb414681SJohannes Weiner 2390e94682bSSuren Baghdasaryan static void get_recent_times(struct psi_group *group, int cpu, 2400e94682bSSuren Baghdasaryan enum psi_aggregators aggregator, u32 *times, 241333f3017SSuren Baghdasaryan u32 *pchanged_states) 242eb414681SJohannes Weiner { 243eb414681SJohannes Weiner struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu); 244eb414681SJohannes Weiner u64 now, state_start; 24533b2d630SSuren Baghdasaryan enum psi_states s; 246eb414681SJohannes Weiner unsigned int seq; 24733b2d630SSuren Baghdasaryan u32 state_mask; 248eb414681SJohannes Weiner 249333f3017SSuren Baghdasaryan *pchanged_states = 0; 250333f3017SSuren Baghdasaryan 251eb414681SJohannes Weiner /* Snapshot a coherent view of the CPU state */ 252eb414681SJohannes Weiner do { 253eb414681SJohannes Weiner seq = read_seqcount_begin(&groupc->seq); 254eb414681SJohannes Weiner now = cpu_clock(cpu); 255eb414681SJohannes Weiner memcpy(times, groupc->times, sizeof(groupc->times)); 25633b2d630SSuren Baghdasaryan state_mask = groupc->state_mask; 257eb414681SJohannes Weiner state_start = groupc->state_start; 258eb414681SJohannes Weiner } while (read_seqcount_retry(&groupc->seq, seq)); 259eb414681SJohannes Weiner 260eb414681SJohannes Weiner /* Calculate state time deltas against the previous snapshot */ 261eb414681SJohannes Weiner for (s = 0; s < NR_PSI_STATES; s++) { 262eb414681SJohannes Weiner u32 delta; 263eb414681SJohannes Weiner /* 264eb414681SJohannes Weiner * In addition to already concluded states, we also 265eb414681SJohannes Weiner * incorporate currently active states on the CPU, 266eb414681SJohannes Weiner * since states may last for many sampling periods. 267eb414681SJohannes Weiner * 268eb414681SJohannes Weiner * This way we keep our delta sampling buckets small 269eb414681SJohannes Weiner * (u32) and our reported pressure close to what's 270eb414681SJohannes Weiner * actually happening. 271eb414681SJohannes Weiner */ 27233b2d630SSuren Baghdasaryan if (state_mask & (1 << s)) 273eb414681SJohannes Weiner times[s] += now - state_start; 274eb414681SJohannes Weiner 2750e94682bSSuren Baghdasaryan delta = times[s] - groupc->times_prev[aggregator][s]; 2760e94682bSSuren Baghdasaryan groupc->times_prev[aggregator][s] = times[s]; 277eb414681SJohannes Weiner 278eb414681SJohannes Weiner times[s] = delta; 279333f3017SSuren Baghdasaryan if (delta) 280333f3017SSuren Baghdasaryan *pchanged_states |= (1 << s); 281eb414681SJohannes Weiner } 282eb414681SJohannes Weiner } 283eb414681SJohannes Weiner 284eb414681SJohannes Weiner static void calc_avgs(unsigned long avg[3], int missed_periods, 285eb414681SJohannes Weiner u64 time, u64 period) 286eb414681SJohannes Weiner { 287eb414681SJohannes Weiner unsigned long pct; 288eb414681SJohannes Weiner 289eb414681SJohannes Weiner /* Fill in zeroes for periods of no activity */ 290eb414681SJohannes Weiner if (missed_periods) { 291eb414681SJohannes Weiner avg[0] = calc_load_n(avg[0], EXP_10s, 0, missed_periods); 292eb414681SJohannes Weiner avg[1] = calc_load_n(avg[1], EXP_60s, 0, missed_periods); 293eb414681SJohannes Weiner avg[2] = calc_load_n(avg[2], EXP_300s, 0, missed_periods); 294eb414681SJohannes Weiner } 295eb414681SJohannes Weiner 296eb414681SJohannes Weiner /* Sample the most recent active period */ 297eb414681SJohannes Weiner pct = div_u64(time * 100, period); 298eb414681SJohannes Weiner pct *= FIXED_1; 299eb414681SJohannes Weiner avg[0] = calc_load(avg[0], EXP_10s, pct); 300eb414681SJohannes Weiner avg[1] = calc_load(avg[1], EXP_60s, pct); 301eb414681SJohannes Weiner avg[2] = calc_load(avg[2], EXP_300s, pct); 302eb414681SJohannes Weiner } 303eb414681SJohannes Weiner 3040e94682bSSuren Baghdasaryan static void collect_percpu_times(struct psi_group *group, 3050e94682bSSuren Baghdasaryan enum psi_aggregators aggregator, 3060e94682bSSuren Baghdasaryan u32 *pchanged_states) 307eb414681SJohannes Weiner { 308eb414681SJohannes Weiner u64 deltas[NR_PSI_STATES - 1] = { 0, }; 309eb414681SJohannes Weiner unsigned long nonidle_total = 0; 310333f3017SSuren Baghdasaryan u32 changed_states = 0; 311eb414681SJohannes Weiner int cpu; 312eb414681SJohannes Weiner int s; 313eb414681SJohannes Weiner 314eb414681SJohannes Weiner /* 315eb414681SJohannes Weiner * Collect the per-cpu time buckets and average them into a 316eb414681SJohannes Weiner * single time sample that is normalized to wallclock time. 317eb414681SJohannes Weiner * 318eb414681SJohannes Weiner * For averaging, each CPU is weighted by its non-idle time in 319eb414681SJohannes Weiner * the sampling period. This eliminates artifacts from uneven 320eb414681SJohannes Weiner * loading, or even entirely idle CPUs. 321eb414681SJohannes Weiner */ 322eb414681SJohannes Weiner for_each_possible_cpu(cpu) { 323eb414681SJohannes Weiner u32 times[NR_PSI_STATES]; 324eb414681SJohannes Weiner u32 nonidle; 325333f3017SSuren Baghdasaryan u32 cpu_changed_states; 326eb414681SJohannes Weiner 3270e94682bSSuren Baghdasaryan get_recent_times(group, cpu, aggregator, times, 328333f3017SSuren Baghdasaryan &cpu_changed_states); 329333f3017SSuren Baghdasaryan changed_states |= cpu_changed_states; 330eb414681SJohannes Weiner 331eb414681SJohannes Weiner nonidle = nsecs_to_jiffies(times[PSI_NONIDLE]); 332eb414681SJohannes Weiner nonidle_total += nonidle; 333eb414681SJohannes Weiner 334eb414681SJohannes Weiner for (s = 0; s < PSI_NONIDLE; s++) 335eb414681SJohannes Weiner deltas[s] += (u64)times[s] * nonidle; 336eb414681SJohannes Weiner } 337eb414681SJohannes Weiner 338eb414681SJohannes Weiner /* 339eb414681SJohannes Weiner * Integrate the sample into the running statistics that are 340eb414681SJohannes Weiner * reported to userspace: the cumulative stall times and the 341eb414681SJohannes Weiner * decaying averages. 342eb414681SJohannes Weiner * 343eb414681SJohannes Weiner * Pressure percentages are sampled at PSI_FREQ. We might be 344eb414681SJohannes Weiner * called more often when the user polls more frequently than 345eb414681SJohannes Weiner * that; we might be called less often when there is no task 346eb414681SJohannes Weiner * activity, thus no data, and clock ticks are sporadic. The 347eb414681SJohannes Weiner * below handles both. 348eb414681SJohannes Weiner */ 349eb414681SJohannes Weiner 350eb414681SJohannes Weiner /* total= */ 351eb414681SJohannes Weiner for (s = 0; s < NR_PSI_STATES - 1; s++) 3520e94682bSSuren Baghdasaryan group->total[aggregator][s] += 3530e94682bSSuren Baghdasaryan div_u64(deltas[s], max(nonidle_total, 1UL)); 354eb414681SJohannes Weiner 355333f3017SSuren Baghdasaryan if (pchanged_states) 356333f3017SSuren Baghdasaryan *pchanged_states = changed_states; 3577fc70a39SSuren Baghdasaryan } 3587fc70a39SSuren Baghdasaryan 3597fc70a39SSuren Baghdasaryan static u64 update_averages(struct psi_group *group, u64 now) 3607fc70a39SSuren Baghdasaryan { 3617fc70a39SSuren Baghdasaryan unsigned long missed_periods = 0; 3627fc70a39SSuren Baghdasaryan u64 expires, period; 3637fc70a39SSuren Baghdasaryan u64 avg_next_update; 3647fc70a39SSuren Baghdasaryan int s; 3657fc70a39SSuren Baghdasaryan 366eb414681SJohannes Weiner /* avgX= */ 367bcc78db6SSuren Baghdasaryan expires = group->avg_next_update; 3684e37504dSJohannes Weiner if (now - expires >= psi_period) 369eb414681SJohannes Weiner missed_periods = div_u64(now - expires, psi_period); 370eb414681SJohannes Weiner 371eb414681SJohannes Weiner /* 372eb414681SJohannes Weiner * The periodic clock tick can get delayed for various 373eb414681SJohannes Weiner * reasons, especially on loaded systems. To avoid clock 374eb414681SJohannes Weiner * drift, we schedule the clock in fixed psi_period intervals. 375eb414681SJohannes Weiner * But the deltas we sample out of the per-cpu buckets above 376eb414681SJohannes Weiner * are based on the actual time elapsing between clock ticks. 377eb414681SJohannes Weiner */ 3787fc70a39SSuren Baghdasaryan avg_next_update = expires + ((1 + missed_periods) * psi_period); 379bcc78db6SSuren Baghdasaryan period = now - (group->avg_last_update + (missed_periods * psi_period)); 380bcc78db6SSuren Baghdasaryan group->avg_last_update = now; 381eb414681SJohannes Weiner 382eb414681SJohannes Weiner for (s = 0; s < NR_PSI_STATES - 1; s++) { 383eb414681SJohannes Weiner u32 sample; 384eb414681SJohannes Weiner 3850e94682bSSuren Baghdasaryan sample = group->total[PSI_AVGS][s] - group->avg_total[s]; 386eb414681SJohannes Weiner /* 387eb414681SJohannes Weiner * Due to the lockless sampling of the time buckets, 388eb414681SJohannes Weiner * recorded time deltas can slip into the next period, 389eb414681SJohannes Weiner * which under full pressure can result in samples in 390eb414681SJohannes Weiner * excess of the period length. 391eb414681SJohannes Weiner * 392eb414681SJohannes Weiner * We don't want to report non-sensical pressures in 393eb414681SJohannes Weiner * excess of 100%, nor do we want to drop such events 394eb414681SJohannes Weiner * on the floor. Instead we punt any overage into the 395eb414681SJohannes Weiner * future until pressure subsides. By doing this we 396eb414681SJohannes Weiner * don't underreport the occurring pressure curve, we 397eb414681SJohannes Weiner * just report it delayed by one period length. 398eb414681SJohannes Weiner * 399eb414681SJohannes Weiner * The error isn't cumulative. As soon as another 400eb414681SJohannes Weiner * delta slips from a period P to P+1, by definition 401eb414681SJohannes Weiner * it frees up its time T in P. 402eb414681SJohannes Weiner */ 403eb414681SJohannes Weiner if (sample > period) 404eb414681SJohannes Weiner sample = period; 405bcc78db6SSuren Baghdasaryan group->avg_total[s] += sample; 406eb414681SJohannes Weiner calc_avgs(group->avg[s], missed_periods, sample, period); 407eb414681SJohannes Weiner } 4087fc70a39SSuren Baghdasaryan 4097fc70a39SSuren Baghdasaryan return avg_next_update; 410eb414681SJohannes Weiner } 411eb414681SJohannes Weiner 412bcc78db6SSuren Baghdasaryan static void psi_avgs_work(struct work_struct *work) 413eb414681SJohannes Weiner { 414eb414681SJohannes Weiner struct delayed_work *dwork; 415eb414681SJohannes Weiner struct psi_group *group; 416333f3017SSuren Baghdasaryan u32 changed_states; 417eb414681SJohannes Weiner bool nonidle; 4187fc70a39SSuren Baghdasaryan u64 now; 419eb414681SJohannes Weiner 420eb414681SJohannes Weiner dwork = to_delayed_work(work); 421bcc78db6SSuren Baghdasaryan group = container_of(dwork, struct psi_group, avgs_work); 422eb414681SJohannes Weiner 4237fc70a39SSuren Baghdasaryan mutex_lock(&group->avgs_lock); 4247fc70a39SSuren Baghdasaryan 4257fc70a39SSuren Baghdasaryan now = sched_clock(); 4267fc70a39SSuren Baghdasaryan 4270e94682bSSuren Baghdasaryan collect_percpu_times(group, PSI_AVGS, &changed_states); 428333f3017SSuren Baghdasaryan nonidle = changed_states & (1 << PSI_NONIDLE); 429eb414681SJohannes Weiner /* 430eb414681SJohannes Weiner * If there is task activity, periodically fold the per-cpu 431eb414681SJohannes Weiner * times and feed samples into the running averages. If things 432eb414681SJohannes Weiner * are idle and there is no data to process, stop the clock. 433eb414681SJohannes Weiner * Once restarted, we'll catch up the running averages in one 434eb414681SJohannes Weiner * go - see calc_avgs() and missed_periods. 435eb414681SJohannes Weiner */ 4367fc70a39SSuren Baghdasaryan if (now >= group->avg_next_update) 4377fc70a39SSuren Baghdasaryan group->avg_next_update = update_averages(group, now); 438eb414681SJohannes Weiner 439eb414681SJohannes Weiner if (nonidle) { 4407fc70a39SSuren Baghdasaryan schedule_delayed_work(dwork, nsecs_to_jiffies( 4417fc70a39SSuren Baghdasaryan group->avg_next_update - now) + 1); 442eb414681SJohannes Weiner } 4437fc70a39SSuren Baghdasaryan 4447fc70a39SSuren Baghdasaryan mutex_unlock(&group->avgs_lock); 445eb414681SJohannes Weiner } 446eb414681SJohannes Weiner 4473b03706fSIngo Molnar /* Trigger tracking window manipulations */ 4480e94682bSSuren Baghdasaryan static void window_reset(struct psi_window *win, u64 now, u64 value, 4490e94682bSSuren Baghdasaryan u64 prev_growth) 4500e94682bSSuren Baghdasaryan { 4510e94682bSSuren Baghdasaryan win->start_time = now; 4520e94682bSSuren Baghdasaryan win->start_value = value; 4530e94682bSSuren Baghdasaryan win->prev_growth = prev_growth; 4540e94682bSSuren Baghdasaryan } 4550e94682bSSuren Baghdasaryan 4560e94682bSSuren Baghdasaryan /* 4570e94682bSSuren Baghdasaryan * PSI growth tracking window update and growth calculation routine. 4580e94682bSSuren Baghdasaryan * 4590e94682bSSuren Baghdasaryan * This approximates a sliding tracking window by interpolating 4600e94682bSSuren Baghdasaryan * partially elapsed windows using historical growth data from the 4610e94682bSSuren Baghdasaryan * previous intervals. This minimizes memory requirements (by not storing 4620e94682bSSuren Baghdasaryan * all the intermediate values in the previous window) and simplifies 4630e94682bSSuren Baghdasaryan * the calculations. It works well because PSI signal changes only in 4640e94682bSSuren Baghdasaryan * positive direction and over relatively small window sizes the growth 4650e94682bSSuren Baghdasaryan * is close to linear. 4660e94682bSSuren Baghdasaryan */ 4670e94682bSSuren Baghdasaryan static u64 window_update(struct psi_window *win, u64 now, u64 value) 4680e94682bSSuren Baghdasaryan { 4690e94682bSSuren Baghdasaryan u64 elapsed; 4700e94682bSSuren Baghdasaryan u64 growth; 4710e94682bSSuren Baghdasaryan 4720e94682bSSuren Baghdasaryan elapsed = now - win->start_time; 4730e94682bSSuren Baghdasaryan growth = value - win->start_value; 4740e94682bSSuren Baghdasaryan /* 4750e94682bSSuren Baghdasaryan * After each tracking window passes win->start_value and 4760e94682bSSuren Baghdasaryan * win->start_time get reset and win->prev_growth stores 4770e94682bSSuren Baghdasaryan * the average per-window growth of the previous window. 4780e94682bSSuren Baghdasaryan * win->prev_growth is then used to interpolate additional 4790e94682bSSuren Baghdasaryan * growth from the previous window assuming it was linear. 4800e94682bSSuren Baghdasaryan */ 4810e94682bSSuren Baghdasaryan if (elapsed > win->size) 4820e94682bSSuren Baghdasaryan window_reset(win, now, value, growth); 4830e94682bSSuren Baghdasaryan else { 4840e94682bSSuren Baghdasaryan u32 remaining; 4850e94682bSSuren Baghdasaryan 4860e94682bSSuren Baghdasaryan remaining = win->size - elapsed; 487c3466952SJohannes Weiner growth += div64_u64(win->prev_growth * remaining, win->size); 4880e94682bSSuren Baghdasaryan } 4890e94682bSSuren Baghdasaryan 4900e94682bSSuren Baghdasaryan return growth; 4910e94682bSSuren Baghdasaryan } 4920e94682bSSuren Baghdasaryan 4930e94682bSSuren Baghdasaryan static void init_triggers(struct psi_group *group, u64 now) 4940e94682bSSuren Baghdasaryan { 4950e94682bSSuren Baghdasaryan struct psi_trigger *t; 4960e94682bSSuren Baghdasaryan 4970e94682bSSuren Baghdasaryan list_for_each_entry(t, &group->triggers, node) 4980e94682bSSuren Baghdasaryan window_reset(&t->win, now, 4990e94682bSSuren Baghdasaryan group->total[PSI_POLL][t->state], 0); 5000e94682bSSuren Baghdasaryan memcpy(group->polling_total, group->total[PSI_POLL], 5010e94682bSSuren Baghdasaryan sizeof(group->polling_total)); 5020e94682bSSuren Baghdasaryan group->polling_next_update = now + group->poll_min_period; 5030e94682bSSuren Baghdasaryan } 5040e94682bSSuren Baghdasaryan 5050e94682bSSuren Baghdasaryan static u64 update_triggers(struct psi_group *group, u64 now) 5060e94682bSSuren Baghdasaryan { 5070e94682bSSuren Baghdasaryan struct psi_trigger *t; 508e6df4eadSZhaoyang Huang bool update_total = false; 5090e94682bSSuren Baghdasaryan u64 *total = group->total[PSI_POLL]; 5100e94682bSSuren Baghdasaryan 5110e94682bSSuren Baghdasaryan /* 5120e94682bSSuren Baghdasaryan * On subsequent updates, calculate growth deltas and let 5130e94682bSSuren Baghdasaryan * watchers know when their specified thresholds are exceeded. 5140e94682bSSuren Baghdasaryan */ 5150e94682bSSuren Baghdasaryan list_for_each_entry(t, &group->triggers, node) { 5160e94682bSSuren Baghdasaryan u64 growth; 517e6df4eadSZhaoyang Huang bool new_stall; 5180e94682bSSuren Baghdasaryan 519e6df4eadSZhaoyang Huang new_stall = group->polling_total[t->state] != total[t->state]; 520e6df4eadSZhaoyang Huang 521e6df4eadSZhaoyang Huang /* Check for stall activity or a previous threshold breach */ 522e6df4eadSZhaoyang Huang if (!new_stall && !t->pending_event) 5230e94682bSSuren Baghdasaryan continue; 524e6df4eadSZhaoyang Huang /* 525e6df4eadSZhaoyang Huang * Check for new stall activity, as well as deferred 526e6df4eadSZhaoyang Huang * events that occurred in the last window after the 527e6df4eadSZhaoyang Huang * trigger had already fired (we want to ratelimit 528e6df4eadSZhaoyang Huang * events without dropping any). 529e6df4eadSZhaoyang Huang */ 530e6df4eadSZhaoyang Huang if (new_stall) { 5310e94682bSSuren Baghdasaryan /* 5320e94682bSSuren Baghdasaryan * Multiple triggers might be looking at the same state, 5330e94682bSSuren Baghdasaryan * remember to update group->polling_total[] once we've 5340e94682bSSuren Baghdasaryan * been through all of them. Also remember to extend the 5350e94682bSSuren Baghdasaryan * polling time if we see new stall activity. 5360e94682bSSuren Baghdasaryan */ 537e6df4eadSZhaoyang Huang update_total = true; 5380e94682bSSuren Baghdasaryan 5390e94682bSSuren Baghdasaryan /* Calculate growth since last update */ 5400e94682bSSuren Baghdasaryan growth = window_update(&t->win, now, total[t->state]); 5410e94682bSSuren Baghdasaryan if (growth < t->threshold) 5420e94682bSSuren Baghdasaryan continue; 5430e94682bSSuren Baghdasaryan 544e6df4eadSZhaoyang Huang t->pending_event = true; 545e6df4eadSZhaoyang Huang } 5460e94682bSSuren Baghdasaryan /* Limit event signaling to once per window */ 5470e94682bSSuren Baghdasaryan if (now < t->last_event_time + t->win.size) 5480e94682bSSuren Baghdasaryan continue; 5490e94682bSSuren Baghdasaryan 5500e94682bSSuren Baghdasaryan /* Generate an event */ 5510e94682bSSuren Baghdasaryan if (cmpxchg(&t->event, 0, 1) == 0) 5520e94682bSSuren Baghdasaryan wake_up_interruptible(&t->event_wait); 5530e94682bSSuren Baghdasaryan t->last_event_time = now; 554e6df4eadSZhaoyang Huang /* Reset threshold breach flag once event got generated */ 555e6df4eadSZhaoyang Huang t->pending_event = false; 5560e94682bSSuren Baghdasaryan } 5570e94682bSSuren Baghdasaryan 558e6df4eadSZhaoyang Huang if (update_total) 5590e94682bSSuren Baghdasaryan memcpy(group->polling_total, total, 5600e94682bSSuren Baghdasaryan sizeof(group->polling_total)); 5610e94682bSSuren Baghdasaryan 5620e94682bSSuren Baghdasaryan return now + group->poll_min_period; 5630e94682bSSuren Baghdasaryan } 5640e94682bSSuren Baghdasaryan 565461daba0SSuren Baghdasaryan /* Schedule polling if it's not already scheduled. */ 5660e94682bSSuren Baghdasaryan static void psi_schedule_poll_work(struct psi_group *group, unsigned long delay) 5670e94682bSSuren Baghdasaryan { 568461daba0SSuren Baghdasaryan struct task_struct *task; 5690e94682bSSuren Baghdasaryan 570461daba0SSuren Baghdasaryan /* 571461daba0SSuren Baghdasaryan * Do not reschedule if already scheduled. 572461daba0SSuren Baghdasaryan * Possible race with a timer scheduled after this check but before 573461daba0SSuren Baghdasaryan * mod_timer below can be tolerated because group->polling_next_update 574461daba0SSuren Baghdasaryan * will keep updates on schedule. 575461daba0SSuren Baghdasaryan */ 576461daba0SSuren Baghdasaryan if (timer_pending(&group->poll_timer)) 5770e94682bSSuren Baghdasaryan return; 5780e94682bSSuren Baghdasaryan 5790e94682bSSuren Baghdasaryan rcu_read_lock(); 5800e94682bSSuren Baghdasaryan 581461daba0SSuren Baghdasaryan task = rcu_dereference(group->poll_task); 5820e94682bSSuren Baghdasaryan /* 5830e94682bSSuren Baghdasaryan * kworker might be NULL in case psi_trigger_destroy races with 5840e94682bSSuren Baghdasaryan * psi_task_change (hotpath) which can't use locks 5850e94682bSSuren Baghdasaryan */ 586461daba0SSuren Baghdasaryan if (likely(task)) 587461daba0SSuren Baghdasaryan mod_timer(&group->poll_timer, jiffies + delay); 5880e94682bSSuren Baghdasaryan 5890e94682bSSuren Baghdasaryan rcu_read_unlock(); 5900e94682bSSuren Baghdasaryan } 5910e94682bSSuren Baghdasaryan 592461daba0SSuren Baghdasaryan static void psi_poll_work(struct psi_group *group) 5930e94682bSSuren Baghdasaryan { 5940e94682bSSuren Baghdasaryan u32 changed_states; 5950e94682bSSuren Baghdasaryan u64 now; 5960e94682bSSuren Baghdasaryan 5970e94682bSSuren Baghdasaryan mutex_lock(&group->trigger_lock); 5980e94682bSSuren Baghdasaryan 5990e94682bSSuren Baghdasaryan now = sched_clock(); 6000e94682bSSuren Baghdasaryan 6010e94682bSSuren Baghdasaryan collect_percpu_times(group, PSI_POLL, &changed_states); 6020e94682bSSuren Baghdasaryan 6030e94682bSSuren Baghdasaryan if (changed_states & group->poll_states) { 6040e94682bSSuren Baghdasaryan /* Initialize trigger windows when entering polling mode */ 6050e94682bSSuren Baghdasaryan if (now > group->polling_until) 6060e94682bSSuren Baghdasaryan init_triggers(group, now); 6070e94682bSSuren Baghdasaryan 6080e94682bSSuren Baghdasaryan /* 6090e94682bSSuren Baghdasaryan * Keep the monitor active for at least the duration of the 6100e94682bSSuren Baghdasaryan * minimum tracking window as long as monitor states are 6110e94682bSSuren Baghdasaryan * changing. 6120e94682bSSuren Baghdasaryan */ 6130e94682bSSuren Baghdasaryan group->polling_until = now + 6140e94682bSSuren Baghdasaryan group->poll_min_period * UPDATES_PER_WINDOW; 6150e94682bSSuren Baghdasaryan } 6160e94682bSSuren Baghdasaryan 6170e94682bSSuren Baghdasaryan if (now > group->polling_until) { 6180e94682bSSuren Baghdasaryan group->polling_next_update = ULLONG_MAX; 6190e94682bSSuren Baghdasaryan goto out; 6200e94682bSSuren Baghdasaryan } 6210e94682bSSuren Baghdasaryan 6220e94682bSSuren Baghdasaryan if (now >= group->polling_next_update) 6230e94682bSSuren Baghdasaryan group->polling_next_update = update_triggers(group, now); 6240e94682bSSuren Baghdasaryan 6250e94682bSSuren Baghdasaryan psi_schedule_poll_work(group, 6260e94682bSSuren Baghdasaryan nsecs_to_jiffies(group->polling_next_update - now) + 1); 6270e94682bSSuren Baghdasaryan 6280e94682bSSuren Baghdasaryan out: 6290e94682bSSuren Baghdasaryan mutex_unlock(&group->trigger_lock); 6300e94682bSSuren Baghdasaryan } 6310e94682bSSuren Baghdasaryan 632461daba0SSuren Baghdasaryan static int psi_poll_worker(void *data) 633461daba0SSuren Baghdasaryan { 634461daba0SSuren Baghdasaryan struct psi_group *group = (struct psi_group *)data; 635461daba0SSuren Baghdasaryan 6362cca5426SPeter Zijlstra sched_set_fifo_low(current); 637461daba0SSuren Baghdasaryan 638461daba0SSuren Baghdasaryan while (true) { 639461daba0SSuren Baghdasaryan wait_event_interruptible(group->poll_wait, 640461daba0SSuren Baghdasaryan atomic_cmpxchg(&group->poll_wakeup, 1, 0) || 641461daba0SSuren Baghdasaryan kthread_should_stop()); 642461daba0SSuren Baghdasaryan if (kthread_should_stop()) 643461daba0SSuren Baghdasaryan break; 644461daba0SSuren Baghdasaryan 645461daba0SSuren Baghdasaryan psi_poll_work(group); 646461daba0SSuren Baghdasaryan } 647461daba0SSuren Baghdasaryan return 0; 648461daba0SSuren Baghdasaryan } 649461daba0SSuren Baghdasaryan 650461daba0SSuren Baghdasaryan static void poll_timer_fn(struct timer_list *t) 651461daba0SSuren Baghdasaryan { 652461daba0SSuren Baghdasaryan struct psi_group *group = from_timer(group, t, poll_timer); 653461daba0SSuren Baghdasaryan 654461daba0SSuren Baghdasaryan atomic_set(&group->poll_wakeup, 1); 655461daba0SSuren Baghdasaryan wake_up_interruptible(&group->poll_wait); 656461daba0SSuren Baghdasaryan } 657461daba0SSuren Baghdasaryan 658df774306SShakeel Butt static void record_times(struct psi_group_cpu *groupc, u64 now) 659eb414681SJohannes Weiner { 660eb414681SJohannes Weiner u32 delta; 661eb414681SJohannes Weiner 662eb414681SJohannes Weiner delta = now - groupc->state_start; 663eb414681SJohannes Weiner groupc->state_start = now; 664eb414681SJohannes Weiner 66533b2d630SSuren Baghdasaryan if (groupc->state_mask & (1 << PSI_IO_SOME)) { 666eb414681SJohannes Weiner groupc->times[PSI_IO_SOME] += delta; 66733b2d630SSuren Baghdasaryan if (groupc->state_mask & (1 << PSI_IO_FULL)) 668eb414681SJohannes Weiner groupc->times[PSI_IO_FULL] += delta; 669eb414681SJohannes Weiner } 670eb414681SJohannes Weiner 67133b2d630SSuren Baghdasaryan if (groupc->state_mask & (1 << PSI_MEM_SOME)) { 672eb414681SJohannes Weiner groupc->times[PSI_MEM_SOME] += delta; 67333b2d630SSuren Baghdasaryan if (groupc->state_mask & (1 << PSI_MEM_FULL)) 674eb414681SJohannes Weiner groupc->times[PSI_MEM_FULL] += delta; 675eb414681SJohannes Weiner } 676eb414681SJohannes Weiner 677e7fcd762SChengming Zhou if (groupc->state_mask & (1 << PSI_CPU_SOME)) { 678eb414681SJohannes Weiner groupc->times[PSI_CPU_SOME] += delta; 679e7fcd762SChengming Zhou if (groupc->state_mask & (1 << PSI_CPU_FULL)) 680e7fcd762SChengming Zhou groupc->times[PSI_CPU_FULL] += delta; 681e7fcd762SChengming Zhou } 682eb414681SJohannes Weiner 68333b2d630SSuren Baghdasaryan if (groupc->state_mask & (1 << PSI_NONIDLE)) 684eb414681SJohannes Weiner groupc->times[PSI_NONIDLE] += delta; 685eb414681SJohannes Weiner } 686eb414681SJohannes Weiner 68736b238d5SJohannes Weiner static void psi_group_change(struct psi_group *group, int cpu, 688df774306SShakeel Butt unsigned int clear, unsigned int set, u64 now, 68936b238d5SJohannes Weiner bool wake_clock) 690eb414681SJohannes Weiner { 691eb414681SJohannes Weiner struct psi_group_cpu *groupc; 69236b238d5SJohannes Weiner u32 state_mask = 0; 693eb414681SJohannes Weiner unsigned int t, m; 69433b2d630SSuren Baghdasaryan enum psi_states s; 695eb414681SJohannes Weiner 696eb414681SJohannes Weiner groupc = per_cpu_ptr(group->pcpu, cpu); 697eb414681SJohannes Weiner 698eb414681SJohannes Weiner /* 699eb414681SJohannes Weiner * First we assess the aggregate resource states this CPU's 700eb414681SJohannes Weiner * tasks have been in since the last change, and account any 701eb414681SJohannes Weiner * SOME and FULL time these may have resulted in. 702eb414681SJohannes Weiner * 703eb414681SJohannes Weiner * Then we update the task counts according to the state 704eb414681SJohannes Weiner * change requested through the @clear and @set bits. 705eb414681SJohannes Weiner */ 706eb414681SJohannes Weiner write_seqcount_begin(&groupc->seq); 707eb414681SJohannes Weiner 708df774306SShakeel Butt record_times(groupc, now); 709eb414681SJohannes Weiner 710eb414681SJohannes Weiner for (t = 0, m = clear; m; m &= ~(1 << t), t++) { 711eb414681SJohannes Weiner if (!(m & (1 << t))) 712eb414681SJohannes Weiner continue; 7139d10a13dSCharan Teja Reddy if (groupc->tasks[t]) { 7149d10a13dSCharan Teja Reddy groupc->tasks[t]--; 7159d10a13dSCharan Teja Reddy } else if (!psi_bug) { 716cb0e52b7SBrian Chen printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u %u] clear=%x set=%x\n", 717eb414681SJohannes Weiner cpu, t, groupc->tasks[0], 718eb414681SJohannes Weiner groupc->tasks[1], groupc->tasks[2], 719cb0e52b7SBrian Chen groupc->tasks[3], groupc->tasks[4], 720cb0e52b7SBrian Chen clear, set); 721eb414681SJohannes Weiner psi_bug = 1; 722eb414681SJohannes Weiner } 723eb414681SJohannes Weiner } 724eb414681SJohannes Weiner 725eb414681SJohannes Weiner for (t = 0; set; set &= ~(1 << t), t++) 726eb414681SJohannes Weiner if (set & (1 << t)) 727eb414681SJohannes Weiner groupc->tasks[t]++; 728eb414681SJohannes Weiner 72933b2d630SSuren Baghdasaryan /* Calculate state mask representing active states */ 73033b2d630SSuren Baghdasaryan for (s = 0; s < NR_PSI_STATES; s++) { 73133b2d630SSuren Baghdasaryan if (test_state(groupc->tasks, s)) 73233b2d630SSuren Baghdasaryan state_mask |= (1 << s); 73333b2d630SSuren Baghdasaryan } 7347fae6c81SChengming Zhou 7357fae6c81SChengming Zhou /* 7367fae6c81SChengming Zhou * Since we care about lost potential, a memstall is FULL 7377fae6c81SChengming Zhou * when there are no other working tasks, but also when 7387fae6c81SChengming Zhou * the CPU is actively reclaiming and nothing productive 7397fae6c81SChengming Zhou * could run even if it were runnable. So when the current 7407fae6c81SChengming Zhou * task in a cgroup is in_memstall, the corresponding groupc 7417fae6c81SChengming Zhou * on that cpu is in PSI_MEM_FULL state. 7427fae6c81SChengming Zhou */ 743fddc8babSJohannes Weiner if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)) 7447fae6c81SChengming Zhou state_mask |= (1 << PSI_MEM_FULL); 7457fae6c81SChengming Zhou 74633b2d630SSuren Baghdasaryan groupc->state_mask = state_mask; 74733b2d630SSuren Baghdasaryan 748eb414681SJohannes Weiner write_seqcount_end(&groupc->seq); 7490e94682bSSuren Baghdasaryan 75036b238d5SJohannes Weiner if (state_mask & group->poll_states) 75136b238d5SJohannes Weiner psi_schedule_poll_work(group, 1); 75236b238d5SJohannes Weiner 75336b238d5SJohannes Weiner if (wake_clock && !delayed_work_pending(&group->avgs_work)) 75436b238d5SJohannes Weiner schedule_delayed_work(&group->avgs_work, PSI_FREQ); 755eb414681SJohannes Weiner } 756eb414681SJohannes Weiner 7572ce7135aSJohannes Weiner static struct psi_group *iterate_groups(struct task_struct *task, void **iter) 7582ce7135aSJohannes Weiner { 7593958e2d0SSuren Baghdasaryan if (*iter == &psi_system) 7603958e2d0SSuren Baghdasaryan return NULL; 7613958e2d0SSuren Baghdasaryan 7622ce7135aSJohannes Weiner #ifdef CONFIG_CGROUPS 7633958e2d0SSuren Baghdasaryan if (static_branch_likely(&psi_cgroups_enabled)) { 7642ce7135aSJohannes Weiner struct cgroup *cgroup = NULL; 7652ce7135aSJohannes Weiner 7662ce7135aSJohannes Weiner if (!*iter) 7672ce7135aSJohannes Weiner cgroup = task->cgroups->dfl_cgrp; 7682ce7135aSJohannes Weiner else 7692ce7135aSJohannes Weiner cgroup = cgroup_parent(*iter); 7702ce7135aSJohannes Weiner 7712ce7135aSJohannes Weiner if (cgroup && cgroup_parent(cgroup)) { 7722ce7135aSJohannes Weiner *iter = cgroup; 7732ce7135aSJohannes Weiner return cgroup_psi(cgroup); 7742ce7135aSJohannes Weiner } 7753958e2d0SSuren Baghdasaryan } 7762ce7135aSJohannes Weiner #endif 7772ce7135aSJohannes Weiner *iter = &psi_system; 7782ce7135aSJohannes Weiner return &psi_system; 7792ce7135aSJohannes Weiner } 7802ce7135aSJohannes Weiner 78136b238d5SJohannes Weiner static void psi_flags_change(struct task_struct *task, int clear, int set) 78236b238d5SJohannes Weiner { 78336b238d5SJohannes Weiner if (((task->psi_flags & set) || 78436b238d5SJohannes Weiner (task->psi_flags & clear) != clear) && 78536b238d5SJohannes Weiner !psi_bug) { 78636b238d5SJohannes Weiner printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n", 78736b238d5SJohannes Weiner task->pid, task->comm, task_cpu(task), 78836b238d5SJohannes Weiner task->psi_flags, clear, set); 78936b238d5SJohannes Weiner psi_bug = 1; 79036b238d5SJohannes Weiner } 79136b238d5SJohannes Weiner 79236b238d5SJohannes Weiner task->psi_flags &= ~clear; 79336b238d5SJohannes Weiner task->psi_flags |= set; 79436b238d5SJohannes Weiner } 79536b238d5SJohannes Weiner 796eb414681SJohannes Weiner void psi_task_change(struct task_struct *task, int clear, int set) 797eb414681SJohannes Weiner { 798eb414681SJohannes Weiner int cpu = task_cpu(task); 7992ce7135aSJohannes Weiner struct psi_group *group; 8002ce7135aSJohannes Weiner void *iter = NULL; 801df774306SShakeel Butt u64 now; 802eb414681SJohannes Weiner 803eb414681SJohannes Weiner if (!task->pid) 804eb414681SJohannes Weiner return; 805eb414681SJohannes Weiner 80636b238d5SJohannes Weiner psi_flags_change(task, clear, set); 807eb414681SJohannes Weiner 808df774306SShakeel Butt now = cpu_clock(cpu); 8091b69ac6bSJohannes Weiner 81036b238d5SJohannes Weiner while ((group = iterate_groups(task, &iter))) 811c530a3c7SChengming Zhou psi_group_change(group, cpu, clear, set, now, true); 81236b238d5SJohannes Weiner } 8130e94682bSSuren Baghdasaryan 81436b238d5SJohannes Weiner void psi_task_switch(struct task_struct *prev, struct task_struct *next, 81536b238d5SJohannes Weiner bool sleep) 81636b238d5SJohannes Weiner { 81736b238d5SJohannes Weiner struct psi_group *group, *common = NULL; 81836b238d5SJohannes Weiner int cpu = task_cpu(prev); 81936b238d5SJohannes Weiner void *iter; 820df774306SShakeel Butt u64 now = cpu_clock(cpu); 8210e94682bSSuren Baghdasaryan 82236b238d5SJohannes Weiner if (next->pid) { 82336b238d5SJohannes Weiner psi_flags_change(next, 0, TSK_ONCPU); 82436b238d5SJohannes Weiner /* 825*65176f59SChengming Zhou * Set TSK_ONCPU on @next's cgroups. If @next shares any 826*65176f59SChengming Zhou * ancestors with @prev, those will already have @prev's 827*65176f59SChengming Zhou * TSK_ONCPU bit set, and we can stop the iteration there. 82836b238d5SJohannes Weiner */ 82936b238d5SJohannes Weiner iter = NULL; 83036b238d5SJohannes Weiner while ((group = iterate_groups(next, &iter))) { 831*65176f59SChengming Zhou if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { 83236b238d5SJohannes Weiner common = group; 83336b238d5SJohannes Weiner break; 83436b238d5SJohannes Weiner } 83536b238d5SJohannes Weiner 836df774306SShakeel Butt psi_group_change(group, cpu, 0, TSK_ONCPU, now, true); 83736b238d5SJohannes Weiner } 83836b238d5SJohannes Weiner } 83936b238d5SJohannes Weiner 84036b238d5SJohannes Weiner if (prev->pid) { 8414117cebfSChengming Zhou int clear = TSK_ONCPU, set = 0; 842c530a3c7SChengming Zhou bool wake_clock = true; 8434117cebfSChengming Zhou 8444117cebfSChengming Zhou /* 845cb0e52b7SBrian Chen * When we're going to sleep, psi_dequeue() lets us 846cb0e52b7SBrian Chen * handle TSK_RUNNING, TSK_MEMSTALL_RUNNING and 847cb0e52b7SBrian Chen * TSK_IOWAIT here, where we can combine it with 848cb0e52b7SBrian Chen * TSK_ONCPU and save walking common ancestors twice. 8494117cebfSChengming Zhou */ 8504117cebfSChengming Zhou if (sleep) { 8514117cebfSChengming Zhou clear |= TSK_RUNNING; 852cb0e52b7SBrian Chen if (prev->in_memstall) 853cb0e52b7SBrian Chen clear |= TSK_MEMSTALL_RUNNING; 8544117cebfSChengming Zhou if (prev->in_iowait) 8554117cebfSChengming Zhou set |= TSK_IOWAIT; 856c530a3c7SChengming Zhou 857c530a3c7SChengming Zhou /* 858c530a3c7SChengming Zhou * Periodic aggregation shuts off if there is a period of no 859c530a3c7SChengming Zhou * task changes, so we wake it back up if necessary. However, 860c530a3c7SChengming Zhou * don't do this if the task change is the aggregation worker 861c530a3c7SChengming Zhou * itself going to sleep, or we'll ping-pong forever. 862c530a3c7SChengming Zhou */ 863c530a3c7SChengming Zhou if (unlikely((prev->flags & PF_WQ_WORKER) && 864c530a3c7SChengming Zhou wq_worker_last_func(prev) == psi_avgs_work)) 865c530a3c7SChengming Zhou wake_clock = false; 8664117cebfSChengming Zhou } 8674117cebfSChengming Zhou 8684117cebfSChengming Zhou psi_flags_change(prev, clear, set); 86936b238d5SJohannes Weiner 87036b238d5SJohannes Weiner iter = NULL; 87136b238d5SJohannes Weiner while ((group = iterate_groups(prev, &iter)) && group != common) 872c530a3c7SChengming Zhou psi_group_change(group, cpu, clear, set, now, wake_clock); 8734117cebfSChengming Zhou 8744117cebfSChengming Zhou /* 875*65176f59SChengming Zhou * TSK_ONCPU is handled up to the common ancestor. If there are 876*65176f59SChengming Zhou * any other differences between the two tasks (e.g. prev goes 877*65176f59SChengming Zhou * to sleep, or only one task is memstall), finish propagating 878*65176f59SChengming Zhou * those differences all the way up to the root. 8794117cebfSChengming Zhou */ 880*65176f59SChengming Zhou if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) { 8814117cebfSChengming Zhou clear &= ~TSK_ONCPU; 8824117cebfSChengming Zhou for (; group; group = iterate_groups(prev, &iter)) 883c530a3c7SChengming Zhou psi_group_change(group, cpu, clear, set, now, wake_clock); 8844117cebfSChengming Zhou } 8851b69ac6bSJohannes Weiner } 886eb414681SJohannes Weiner } 887eb414681SJohannes Weiner 888eb414681SJohannes Weiner /** 889eb414681SJohannes Weiner * psi_memstall_enter - mark the beginning of a memory stall section 890eb414681SJohannes Weiner * @flags: flags to handle nested sections 891eb414681SJohannes Weiner * 892eb414681SJohannes Weiner * Marks the calling task as being stalled due to a lack of memory, 893eb414681SJohannes Weiner * such as waiting for a refault or performing reclaim. 894eb414681SJohannes Weiner */ 895eb414681SJohannes Weiner void psi_memstall_enter(unsigned long *flags) 896eb414681SJohannes Weiner { 897eb414681SJohannes Weiner struct rq_flags rf; 898eb414681SJohannes Weiner struct rq *rq; 899eb414681SJohannes Weiner 900e0c27447SJohannes Weiner if (static_branch_likely(&psi_disabled)) 901eb414681SJohannes Weiner return; 902eb414681SJohannes Weiner 9031066d1b6SYafang Shao *flags = current->in_memstall; 904eb414681SJohannes Weiner if (*flags) 905eb414681SJohannes Weiner return; 906eb414681SJohannes Weiner /* 9071066d1b6SYafang Shao * in_memstall setting & accounting needs to be atomic wrt 908eb414681SJohannes Weiner * changes to the task's scheduling state, otherwise we can 909eb414681SJohannes Weiner * race with CPU migration. 910eb414681SJohannes Weiner */ 911eb414681SJohannes Weiner rq = this_rq_lock_irq(&rf); 912eb414681SJohannes Weiner 9131066d1b6SYafang Shao current->in_memstall = 1; 914cb0e52b7SBrian Chen psi_task_change(current, 0, TSK_MEMSTALL | TSK_MEMSTALL_RUNNING); 915eb414681SJohannes Weiner 916eb414681SJohannes Weiner rq_unlock_irq(rq, &rf); 917eb414681SJohannes Weiner } 918eb414681SJohannes Weiner 919eb414681SJohannes Weiner /** 920eb414681SJohannes Weiner * psi_memstall_leave - mark the end of an memory stall section 921eb414681SJohannes Weiner * @flags: flags to handle nested memdelay sections 922eb414681SJohannes Weiner * 923eb414681SJohannes Weiner * Marks the calling task as no longer stalled due to lack of memory. 924eb414681SJohannes Weiner */ 925eb414681SJohannes Weiner void psi_memstall_leave(unsigned long *flags) 926eb414681SJohannes Weiner { 927eb414681SJohannes Weiner struct rq_flags rf; 928eb414681SJohannes Weiner struct rq *rq; 929eb414681SJohannes Weiner 930e0c27447SJohannes Weiner if (static_branch_likely(&psi_disabled)) 931eb414681SJohannes Weiner return; 932eb414681SJohannes Weiner 933eb414681SJohannes Weiner if (*flags) 934eb414681SJohannes Weiner return; 935eb414681SJohannes Weiner /* 9361066d1b6SYafang Shao * in_memstall clearing & accounting needs to be atomic wrt 937eb414681SJohannes Weiner * changes to the task's scheduling state, otherwise we could 938eb414681SJohannes Weiner * race with CPU migration. 939eb414681SJohannes Weiner */ 940eb414681SJohannes Weiner rq = this_rq_lock_irq(&rf); 941eb414681SJohannes Weiner 9421066d1b6SYafang Shao current->in_memstall = 0; 943cb0e52b7SBrian Chen psi_task_change(current, TSK_MEMSTALL | TSK_MEMSTALL_RUNNING, 0); 944eb414681SJohannes Weiner 945eb414681SJohannes Weiner rq_unlock_irq(rq, &rf); 946eb414681SJohannes Weiner } 947eb414681SJohannes Weiner 9482ce7135aSJohannes Weiner #ifdef CONFIG_CGROUPS 9492ce7135aSJohannes Weiner int psi_cgroup_alloc(struct cgroup *cgroup) 9502ce7135aSJohannes Weiner { 951e2ad8ab0SChengming Zhou if (!static_branch_likely(&psi_cgroups_enabled)) 9522ce7135aSJohannes Weiner return 0; 9532ce7135aSJohannes Weiner 9542b97cf76SHao Jia cgroup->psi = kzalloc(sizeof(struct psi_group), GFP_KERNEL); 9555f69a657SChen Wandun if (!cgroup->psi) 9562ce7135aSJohannes Weiner return -ENOMEM; 9575f69a657SChen Wandun 9585f69a657SChen Wandun cgroup->psi->pcpu = alloc_percpu(struct psi_group_cpu); 9595f69a657SChen Wandun if (!cgroup->psi->pcpu) { 9605f69a657SChen Wandun kfree(cgroup->psi); 9615f69a657SChen Wandun return -ENOMEM; 9625f69a657SChen Wandun } 9635f69a657SChen Wandun group_init(cgroup->psi); 9642ce7135aSJohannes Weiner return 0; 9652ce7135aSJohannes Weiner } 9662ce7135aSJohannes Weiner 9672ce7135aSJohannes Weiner void psi_cgroup_free(struct cgroup *cgroup) 9682ce7135aSJohannes Weiner { 969e2ad8ab0SChengming Zhou if (!static_branch_likely(&psi_cgroups_enabled)) 9702ce7135aSJohannes Weiner return; 9712ce7135aSJohannes Weiner 9725f69a657SChen Wandun cancel_delayed_work_sync(&cgroup->psi->avgs_work); 9735f69a657SChen Wandun free_percpu(cgroup->psi->pcpu); 9740e94682bSSuren Baghdasaryan /* All triggers must be removed by now */ 9755f69a657SChen Wandun WARN_ONCE(cgroup->psi->poll_states, "psi: trigger leak\n"); 9765f69a657SChen Wandun kfree(cgroup->psi); 9772ce7135aSJohannes Weiner } 9782ce7135aSJohannes Weiner 9792ce7135aSJohannes Weiner /** 9802ce7135aSJohannes Weiner * cgroup_move_task - move task to a different cgroup 9812ce7135aSJohannes Weiner * @task: the task 9822ce7135aSJohannes Weiner * @to: the target css_set 9832ce7135aSJohannes Weiner * 9842ce7135aSJohannes Weiner * Move task to a new cgroup and safely migrate its associated stall 9852ce7135aSJohannes Weiner * state between the different groups. 9862ce7135aSJohannes Weiner * 9872ce7135aSJohannes Weiner * This function acquires the task's rq lock to lock out concurrent 9882ce7135aSJohannes Weiner * changes to the task's scheduling state and - in case the task is 9892ce7135aSJohannes Weiner * running - concurrent changes to its stall state. 9902ce7135aSJohannes Weiner */ 9912ce7135aSJohannes Weiner void cgroup_move_task(struct task_struct *task, struct css_set *to) 9922ce7135aSJohannes Weiner { 993d583d360SJohannes Weiner unsigned int task_flags; 9942ce7135aSJohannes Weiner struct rq_flags rf; 9952ce7135aSJohannes Weiner struct rq *rq; 9962ce7135aSJohannes Weiner 997e2ad8ab0SChengming Zhou if (!static_branch_likely(&psi_cgroups_enabled)) { 9988fcb2312SOlof Johansson /* 9998fcb2312SOlof Johansson * Lame to do this here, but the scheduler cannot be locked 10008fcb2312SOlof Johansson * from the outside, so we move cgroups from inside sched/. 10018fcb2312SOlof Johansson */ 10028fcb2312SOlof Johansson rcu_assign_pointer(task->cgroups, to); 10038fcb2312SOlof Johansson return; 10048fcb2312SOlof Johansson } 10058fcb2312SOlof Johansson 10062ce7135aSJohannes Weiner rq = task_rq_lock(task, &rf); 10072ce7135aSJohannes Weiner 1008d583d360SJohannes Weiner /* 1009d583d360SJohannes Weiner * We may race with schedule() dropping the rq lock between 1010d583d360SJohannes Weiner * deactivating prev and switching to next. Because the psi 1011d583d360SJohannes Weiner * updates from the deactivation are deferred to the switch 1012d583d360SJohannes Weiner * callback to save cgroup tree updates, the task's scheduling 1013d583d360SJohannes Weiner * state here is not coherent with its psi state: 1014d583d360SJohannes Weiner * 1015d583d360SJohannes Weiner * schedule() cgroup_move_task() 1016d583d360SJohannes Weiner * rq_lock() 1017d583d360SJohannes Weiner * deactivate_task() 1018d583d360SJohannes Weiner * p->on_rq = 0 1019d583d360SJohannes Weiner * psi_dequeue() // defers TSK_RUNNING & TSK_IOWAIT updates 1020d583d360SJohannes Weiner * pick_next_task() 1021d583d360SJohannes Weiner * rq_unlock() 1022d583d360SJohannes Weiner * rq_lock() 1023d583d360SJohannes Weiner * psi_task_change() // old cgroup 1024d583d360SJohannes Weiner * task->cgroups = to 1025d583d360SJohannes Weiner * psi_task_change() // new cgroup 1026d583d360SJohannes Weiner * rq_unlock() 1027d583d360SJohannes Weiner * rq_lock() 1028d583d360SJohannes Weiner * psi_sched_switch() // does deferred updates in new cgroup 1029d583d360SJohannes Weiner * 1030d583d360SJohannes Weiner * Don't rely on the scheduling state. Use psi_flags instead. 1031d583d360SJohannes Weiner */ 1032d583d360SJohannes Weiner task_flags = task->psi_flags; 10332ce7135aSJohannes Weiner 10342ce7135aSJohannes Weiner if (task_flags) 10352ce7135aSJohannes Weiner psi_task_change(task, task_flags, 0); 10362ce7135aSJohannes Weiner 10378fcb2312SOlof Johansson /* See comment above */ 10382ce7135aSJohannes Weiner rcu_assign_pointer(task->cgroups, to); 10392ce7135aSJohannes Weiner 10402ce7135aSJohannes Weiner if (task_flags) 10412ce7135aSJohannes Weiner psi_task_change(task, 0, task_flags); 10422ce7135aSJohannes Weiner 10432ce7135aSJohannes Weiner task_rq_unlock(rq, task, &rf); 10442ce7135aSJohannes Weiner } 10452ce7135aSJohannes Weiner #endif /* CONFIG_CGROUPS */ 10462ce7135aSJohannes Weiner 10472ce7135aSJohannes Weiner int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) 1048eb414681SJohannes Weiner { 1049eb414681SJohannes Weiner int full; 10507fc70a39SSuren Baghdasaryan u64 now; 1051eb414681SJohannes Weiner 1052e0c27447SJohannes Weiner if (static_branch_likely(&psi_disabled)) 1053eb414681SJohannes Weiner return -EOPNOTSUPP; 1054eb414681SJohannes Weiner 10557fc70a39SSuren Baghdasaryan /* Update averages before reporting them */ 10567fc70a39SSuren Baghdasaryan mutex_lock(&group->avgs_lock); 10577fc70a39SSuren Baghdasaryan now = sched_clock(); 10580e94682bSSuren Baghdasaryan collect_percpu_times(group, PSI_AVGS, NULL); 10597fc70a39SSuren Baghdasaryan if (now >= group->avg_next_update) 10607fc70a39SSuren Baghdasaryan group->avg_next_update = update_averages(group, now); 10617fc70a39SSuren Baghdasaryan mutex_unlock(&group->avgs_lock); 1062eb414681SJohannes Weiner 1063e7fcd762SChengming Zhou for (full = 0; full < 2; full++) { 1064890d550dSChengming Zhou unsigned long avg[3] = { 0, }; 1065890d550dSChengming Zhou u64 total = 0; 1066eb414681SJohannes Weiner int w; 1067eb414681SJohannes Weiner 1068890d550dSChengming Zhou /* CPU FULL is undefined at the system level */ 1069890d550dSChengming Zhou if (!(group == &psi_system && res == PSI_CPU && full)) { 1070eb414681SJohannes Weiner for (w = 0; w < 3; w++) 1071eb414681SJohannes Weiner avg[w] = group->avg[res * 2 + full][w]; 10720e94682bSSuren Baghdasaryan total = div_u64(group->total[PSI_AVGS][res * 2 + full], 10730e94682bSSuren Baghdasaryan NSEC_PER_USEC); 1074890d550dSChengming Zhou } 1075eb414681SJohannes Weiner 1076eb414681SJohannes Weiner seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n", 1077eb414681SJohannes Weiner full ? "full" : "some", 1078eb414681SJohannes Weiner LOAD_INT(avg[0]), LOAD_FRAC(avg[0]), 1079eb414681SJohannes Weiner LOAD_INT(avg[1]), LOAD_FRAC(avg[1]), 1080eb414681SJohannes Weiner LOAD_INT(avg[2]), LOAD_FRAC(avg[2]), 1081eb414681SJohannes Weiner total); 1082eb414681SJohannes Weiner } 1083eb414681SJohannes Weiner 1084eb414681SJohannes Weiner return 0; 1085eb414681SJohannes Weiner } 1086eb414681SJohannes Weiner 10870e94682bSSuren Baghdasaryan struct psi_trigger *psi_trigger_create(struct psi_group *group, 108876b079efSHao Jia char *buf, enum psi_res res) 10890e94682bSSuren Baghdasaryan { 10900e94682bSSuren Baghdasaryan struct psi_trigger *t; 10910e94682bSSuren Baghdasaryan enum psi_states state; 10920e94682bSSuren Baghdasaryan u32 threshold_us; 10930e94682bSSuren Baghdasaryan u32 window_us; 10940e94682bSSuren Baghdasaryan 10950e94682bSSuren Baghdasaryan if (static_branch_likely(&psi_disabled)) 10960e94682bSSuren Baghdasaryan return ERR_PTR(-EOPNOTSUPP); 10970e94682bSSuren Baghdasaryan 10980e94682bSSuren Baghdasaryan if (sscanf(buf, "some %u %u", &threshold_us, &window_us) == 2) 10990e94682bSSuren Baghdasaryan state = PSI_IO_SOME + res * 2; 11000e94682bSSuren Baghdasaryan else if (sscanf(buf, "full %u %u", &threshold_us, &window_us) == 2) 11010e94682bSSuren Baghdasaryan state = PSI_IO_FULL + res * 2; 11020e94682bSSuren Baghdasaryan else 11030e94682bSSuren Baghdasaryan return ERR_PTR(-EINVAL); 11040e94682bSSuren Baghdasaryan 11050e94682bSSuren Baghdasaryan if (state >= PSI_NONIDLE) 11060e94682bSSuren Baghdasaryan return ERR_PTR(-EINVAL); 11070e94682bSSuren Baghdasaryan 11080e94682bSSuren Baghdasaryan if (window_us < WINDOW_MIN_US || 11090e94682bSSuren Baghdasaryan window_us > WINDOW_MAX_US) 11100e94682bSSuren Baghdasaryan return ERR_PTR(-EINVAL); 11110e94682bSSuren Baghdasaryan 11120e94682bSSuren Baghdasaryan /* Check threshold */ 11130e94682bSSuren Baghdasaryan if (threshold_us == 0 || threshold_us > window_us) 11140e94682bSSuren Baghdasaryan return ERR_PTR(-EINVAL); 11150e94682bSSuren Baghdasaryan 11160e94682bSSuren Baghdasaryan t = kmalloc(sizeof(*t), GFP_KERNEL); 11170e94682bSSuren Baghdasaryan if (!t) 11180e94682bSSuren Baghdasaryan return ERR_PTR(-ENOMEM); 11190e94682bSSuren Baghdasaryan 11200e94682bSSuren Baghdasaryan t->group = group; 11210e94682bSSuren Baghdasaryan t->state = state; 11220e94682bSSuren Baghdasaryan t->threshold = threshold_us * NSEC_PER_USEC; 11230e94682bSSuren Baghdasaryan t->win.size = window_us * NSEC_PER_USEC; 1124915a087eSHailong Liu window_reset(&t->win, sched_clock(), 1125915a087eSHailong Liu group->total[PSI_POLL][t->state], 0); 11260e94682bSSuren Baghdasaryan 11270e94682bSSuren Baghdasaryan t->event = 0; 11280e94682bSSuren Baghdasaryan t->last_event_time = 0; 11290e94682bSSuren Baghdasaryan init_waitqueue_head(&t->event_wait); 1130e6df4eadSZhaoyang Huang t->pending_event = false; 11310e94682bSSuren Baghdasaryan 11320e94682bSSuren Baghdasaryan mutex_lock(&group->trigger_lock); 11330e94682bSSuren Baghdasaryan 1134461daba0SSuren Baghdasaryan if (!rcu_access_pointer(group->poll_task)) { 1135461daba0SSuren Baghdasaryan struct task_struct *task; 11360e94682bSSuren Baghdasaryan 1137461daba0SSuren Baghdasaryan task = kthread_create(psi_poll_worker, group, "psimon"); 1138461daba0SSuren Baghdasaryan if (IS_ERR(task)) { 11390e94682bSSuren Baghdasaryan kfree(t); 11400e94682bSSuren Baghdasaryan mutex_unlock(&group->trigger_lock); 1141461daba0SSuren Baghdasaryan return ERR_CAST(task); 11420e94682bSSuren Baghdasaryan } 1143461daba0SSuren Baghdasaryan atomic_set(&group->poll_wakeup, 0); 1144461daba0SSuren Baghdasaryan wake_up_process(task); 1145461daba0SSuren Baghdasaryan rcu_assign_pointer(group->poll_task, task); 11460e94682bSSuren Baghdasaryan } 11470e94682bSSuren Baghdasaryan 11480e94682bSSuren Baghdasaryan list_add(&t->node, &group->triggers); 11490e94682bSSuren Baghdasaryan group->poll_min_period = min(group->poll_min_period, 11500e94682bSSuren Baghdasaryan div_u64(t->win.size, UPDATES_PER_WINDOW)); 11510e94682bSSuren Baghdasaryan group->nr_triggers[t->state]++; 11520e94682bSSuren Baghdasaryan group->poll_states |= (1 << t->state); 11530e94682bSSuren Baghdasaryan 11540e94682bSSuren Baghdasaryan mutex_unlock(&group->trigger_lock); 11550e94682bSSuren Baghdasaryan 11560e94682bSSuren Baghdasaryan return t; 11570e94682bSSuren Baghdasaryan } 11580e94682bSSuren Baghdasaryan 1159a06247c6SSuren Baghdasaryan void psi_trigger_destroy(struct psi_trigger *t) 11600e94682bSSuren Baghdasaryan { 1161a06247c6SSuren Baghdasaryan struct psi_group *group; 1162461daba0SSuren Baghdasaryan struct task_struct *task_to_destroy = NULL; 11630e94682bSSuren Baghdasaryan 1164a06247c6SSuren Baghdasaryan /* 1165a06247c6SSuren Baghdasaryan * We do not check psi_disabled since it might have been disabled after 1166a06247c6SSuren Baghdasaryan * the trigger got created. 1167a06247c6SSuren Baghdasaryan */ 1168a06247c6SSuren Baghdasaryan if (!t) 11690e94682bSSuren Baghdasaryan return; 11700e94682bSSuren Baghdasaryan 1171a06247c6SSuren Baghdasaryan group = t->group; 11720e94682bSSuren Baghdasaryan /* 11730e94682bSSuren Baghdasaryan * Wakeup waiters to stop polling. Can happen if cgroup is deleted 11740e94682bSSuren Baghdasaryan * from under a polling process. 11750e94682bSSuren Baghdasaryan */ 11760e94682bSSuren Baghdasaryan wake_up_interruptible(&t->event_wait); 11770e94682bSSuren Baghdasaryan 11780e94682bSSuren Baghdasaryan mutex_lock(&group->trigger_lock); 11790e94682bSSuren Baghdasaryan 11800e94682bSSuren Baghdasaryan if (!list_empty(&t->node)) { 11810e94682bSSuren Baghdasaryan struct psi_trigger *tmp; 11820e94682bSSuren Baghdasaryan u64 period = ULLONG_MAX; 11830e94682bSSuren Baghdasaryan 11840e94682bSSuren Baghdasaryan list_del(&t->node); 11850e94682bSSuren Baghdasaryan group->nr_triggers[t->state]--; 11860e94682bSSuren Baghdasaryan if (!group->nr_triggers[t->state]) 11870e94682bSSuren Baghdasaryan group->poll_states &= ~(1 << t->state); 11880e94682bSSuren Baghdasaryan /* reset min update period for the remaining triggers */ 11890e94682bSSuren Baghdasaryan list_for_each_entry(tmp, &group->triggers, node) 11900e94682bSSuren Baghdasaryan period = min(period, div_u64(tmp->win.size, 11910e94682bSSuren Baghdasaryan UPDATES_PER_WINDOW)); 11920e94682bSSuren Baghdasaryan group->poll_min_period = period; 1193461daba0SSuren Baghdasaryan /* Destroy poll_task when the last trigger is destroyed */ 11940e94682bSSuren Baghdasaryan if (group->poll_states == 0) { 11950e94682bSSuren Baghdasaryan group->polling_until = 0; 1196461daba0SSuren Baghdasaryan task_to_destroy = rcu_dereference_protected( 1197461daba0SSuren Baghdasaryan group->poll_task, 11980e94682bSSuren Baghdasaryan lockdep_is_held(&group->trigger_lock)); 1199461daba0SSuren Baghdasaryan rcu_assign_pointer(group->poll_task, NULL); 12008f91efd8SZhaoyang Huang del_timer(&group->poll_timer); 12010e94682bSSuren Baghdasaryan } 12020e94682bSSuren Baghdasaryan } 12030e94682bSSuren Baghdasaryan 12040e94682bSSuren Baghdasaryan mutex_unlock(&group->trigger_lock); 12050e94682bSSuren Baghdasaryan 12060e94682bSSuren Baghdasaryan /* 1207a06247c6SSuren Baghdasaryan * Wait for psi_schedule_poll_work RCU to complete its read-side 1208a06247c6SSuren Baghdasaryan * critical section before destroying the trigger and optionally the 1209a06247c6SSuren Baghdasaryan * poll_task. 12100e94682bSSuren Baghdasaryan */ 12110e94682bSSuren Baghdasaryan synchronize_rcu(); 12120e94682bSSuren Baghdasaryan /* 12138f91efd8SZhaoyang Huang * Stop kthread 'psimon' after releasing trigger_lock to prevent a 12140e94682bSSuren Baghdasaryan * deadlock while waiting for psi_poll_work to acquire trigger_lock 12150e94682bSSuren Baghdasaryan */ 1216461daba0SSuren Baghdasaryan if (task_to_destroy) { 12177b2b55daSJason Xing /* 12187b2b55daSJason Xing * After the RCU grace period has expired, the worker 1219461daba0SSuren Baghdasaryan * can no longer be found through group->poll_task. 12207b2b55daSJason Xing */ 1221461daba0SSuren Baghdasaryan kthread_stop(task_to_destroy); 12220e94682bSSuren Baghdasaryan } 12230e94682bSSuren Baghdasaryan kfree(t); 12240e94682bSSuren Baghdasaryan } 12250e94682bSSuren Baghdasaryan 12260e94682bSSuren Baghdasaryan __poll_t psi_trigger_poll(void **trigger_ptr, 12270e94682bSSuren Baghdasaryan struct file *file, poll_table *wait) 12280e94682bSSuren Baghdasaryan { 12290e94682bSSuren Baghdasaryan __poll_t ret = DEFAULT_POLLMASK; 12300e94682bSSuren Baghdasaryan struct psi_trigger *t; 12310e94682bSSuren Baghdasaryan 12320e94682bSSuren Baghdasaryan if (static_branch_likely(&psi_disabled)) 12330e94682bSSuren Baghdasaryan return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI; 12340e94682bSSuren Baghdasaryan 1235a06247c6SSuren Baghdasaryan t = smp_load_acquire(trigger_ptr); 1236a06247c6SSuren Baghdasaryan if (!t) 12370e94682bSSuren Baghdasaryan return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI; 12380e94682bSSuren Baghdasaryan 12390e94682bSSuren Baghdasaryan poll_wait(file, &t->event_wait, wait); 12400e94682bSSuren Baghdasaryan 12410e94682bSSuren Baghdasaryan if (cmpxchg(&t->event, 1, 0) == 1) 12420e94682bSSuren Baghdasaryan ret |= EPOLLPRI; 12430e94682bSSuren Baghdasaryan 12440e94682bSSuren Baghdasaryan return ret; 12450e94682bSSuren Baghdasaryan } 12460e94682bSSuren Baghdasaryan 12475102bb1cSSuren Baghdasaryan #ifdef CONFIG_PROC_FS 12485102bb1cSSuren Baghdasaryan static int psi_io_show(struct seq_file *m, void *v) 12495102bb1cSSuren Baghdasaryan { 12505102bb1cSSuren Baghdasaryan return psi_show(m, &psi_system, PSI_IO); 12515102bb1cSSuren Baghdasaryan } 12525102bb1cSSuren Baghdasaryan 12535102bb1cSSuren Baghdasaryan static int psi_memory_show(struct seq_file *m, void *v) 12545102bb1cSSuren Baghdasaryan { 12555102bb1cSSuren Baghdasaryan return psi_show(m, &psi_system, PSI_MEM); 12565102bb1cSSuren Baghdasaryan } 12575102bb1cSSuren Baghdasaryan 12585102bb1cSSuren Baghdasaryan static int psi_cpu_show(struct seq_file *m, void *v) 12595102bb1cSSuren Baghdasaryan { 12605102bb1cSSuren Baghdasaryan return psi_show(m, &psi_system, PSI_CPU); 12615102bb1cSSuren Baghdasaryan } 12625102bb1cSSuren Baghdasaryan 12635102bb1cSSuren Baghdasaryan static int psi_open(struct file *file, int (*psi_show)(struct seq_file *, void *)) 12645102bb1cSSuren Baghdasaryan { 12655102bb1cSSuren Baghdasaryan if (file->f_mode & FMODE_WRITE && !capable(CAP_SYS_RESOURCE)) 12665102bb1cSSuren Baghdasaryan return -EPERM; 12675102bb1cSSuren Baghdasaryan 12685102bb1cSSuren Baghdasaryan return single_open(file, psi_show, NULL); 12695102bb1cSSuren Baghdasaryan } 12705102bb1cSSuren Baghdasaryan 12715102bb1cSSuren Baghdasaryan static int psi_io_open(struct inode *inode, struct file *file) 12725102bb1cSSuren Baghdasaryan { 12735102bb1cSSuren Baghdasaryan return psi_open(file, psi_io_show); 12745102bb1cSSuren Baghdasaryan } 12755102bb1cSSuren Baghdasaryan 12765102bb1cSSuren Baghdasaryan static int psi_memory_open(struct inode *inode, struct file *file) 12775102bb1cSSuren Baghdasaryan { 12785102bb1cSSuren Baghdasaryan return psi_open(file, psi_memory_show); 12795102bb1cSSuren Baghdasaryan } 12805102bb1cSSuren Baghdasaryan 12815102bb1cSSuren Baghdasaryan static int psi_cpu_open(struct inode *inode, struct file *file) 12825102bb1cSSuren Baghdasaryan { 12835102bb1cSSuren Baghdasaryan return psi_open(file, psi_cpu_show); 12845102bb1cSSuren Baghdasaryan } 12855102bb1cSSuren Baghdasaryan 12860e94682bSSuren Baghdasaryan static ssize_t psi_write(struct file *file, const char __user *user_buf, 12870e94682bSSuren Baghdasaryan size_t nbytes, enum psi_res res) 12880e94682bSSuren Baghdasaryan { 12890e94682bSSuren Baghdasaryan char buf[32]; 12900e94682bSSuren Baghdasaryan size_t buf_size; 12910e94682bSSuren Baghdasaryan struct seq_file *seq; 12920e94682bSSuren Baghdasaryan struct psi_trigger *new; 12930e94682bSSuren Baghdasaryan 12940e94682bSSuren Baghdasaryan if (static_branch_likely(&psi_disabled)) 12950e94682bSSuren Baghdasaryan return -EOPNOTSUPP; 12960e94682bSSuren Baghdasaryan 12976fcca0faSSuren Baghdasaryan if (!nbytes) 12986fcca0faSSuren Baghdasaryan return -EINVAL; 12996fcca0faSSuren Baghdasaryan 13004adcdceaSMiles Chen buf_size = min(nbytes, sizeof(buf)); 13010e94682bSSuren Baghdasaryan if (copy_from_user(buf, user_buf, buf_size)) 13020e94682bSSuren Baghdasaryan return -EFAULT; 13030e94682bSSuren Baghdasaryan 13040e94682bSSuren Baghdasaryan buf[buf_size - 1] = '\0'; 13050e94682bSSuren Baghdasaryan 13060e94682bSSuren Baghdasaryan seq = file->private_data; 1307a06247c6SSuren Baghdasaryan 13080e94682bSSuren Baghdasaryan /* Take seq->lock to protect seq->private from concurrent writes */ 13090e94682bSSuren Baghdasaryan mutex_lock(&seq->lock); 1310a06247c6SSuren Baghdasaryan 1311a06247c6SSuren Baghdasaryan /* Allow only one trigger per file descriptor */ 1312a06247c6SSuren Baghdasaryan if (seq->private) { 1313a06247c6SSuren Baghdasaryan mutex_unlock(&seq->lock); 1314a06247c6SSuren Baghdasaryan return -EBUSY; 1315a06247c6SSuren Baghdasaryan } 1316a06247c6SSuren Baghdasaryan 131776b079efSHao Jia new = psi_trigger_create(&psi_system, buf, res); 1318a06247c6SSuren Baghdasaryan if (IS_ERR(new)) { 1319a06247c6SSuren Baghdasaryan mutex_unlock(&seq->lock); 1320a06247c6SSuren Baghdasaryan return PTR_ERR(new); 1321a06247c6SSuren Baghdasaryan } 1322a06247c6SSuren Baghdasaryan 1323a06247c6SSuren Baghdasaryan smp_store_release(&seq->private, new); 13240e94682bSSuren Baghdasaryan mutex_unlock(&seq->lock); 13250e94682bSSuren Baghdasaryan 13260e94682bSSuren Baghdasaryan return nbytes; 13270e94682bSSuren Baghdasaryan } 13280e94682bSSuren Baghdasaryan 13290e94682bSSuren Baghdasaryan static ssize_t psi_io_write(struct file *file, const char __user *user_buf, 13300e94682bSSuren Baghdasaryan size_t nbytes, loff_t *ppos) 13310e94682bSSuren Baghdasaryan { 13320e94682bSSuren Baghdasaryan return psi_write(file, user_buf, nbytes, PSI_IO); 13330e94682bSSuren Baghdasaryan } 13340e94682bSSuren Baghdasaryan 13350e94682bSSuren Baghdasaryan static ssize_t psi_memory_write(struct file *file, const char __user *user_buf, 13360e94682bSSuren Baghdasaryan size_t nbytes, loff_t *ppos) 13370e94682bSSuren Baghdasaryan { 13380e94682bSSuren Baghdasaryan return psi_write(file, user_buf, nbytes, PSI_MEM); 13390e94682bSSuren Baghdasaryan } 13400e94682bSSuren Baghdasaryan 13410e94682bSSuren Baghdasaryan static ssize_t psi_cpu_write(struct file *file, const char __user *user_buf, 13420e94682bSSuren Baghdasaryan size_t nbytes, loff_t *ppos) 13430e94682bSSuren Baghdasaryan { 13440e94682bSSuren Baghdasaryan return psi_write(file, user_buf, nbytes, PSI_CPU); 13450e94682bSSuren Baghdasaryan } 13460e94682bSSuren Baghdasaryan 13470e94682bSSuren Baghdasaryan static __poll_t psi_fop_poll(struct file *file, poll_table *wait) 13480e94682bSSuren Baghdasaryan { 13490e94682bSSuren Baghdasaryan struct seq_file *seq = file->private_data; 13500e94682bSSuren Baghdasaryan 13510e94682bSSuren Baghdasaryan return psi_trigger_poll(&seq->private, file, wait); 13520e94682bSSuren Baghdasaryan } 13530e94682bSSuren Baghdasaryan 13540e94682bSSuren Baghdasaryan static int psi_fop_release(struct inode *inode, struct file *file) 13550e94682bSSuren Baghdasaryan { 13560e94682bSSuren Baghdasaryan struct seq_file *seq = file->private_data; 13570e94682bSSuren Baghdasaryan 1358a06247c6SSuren Baghdasaryan psi_trigger_destroy(seq->private); 13590e94682bSSuren Baghdasaryan return single_release(inode, file); 13600e94682bSSuren Baghdasaryan } 13610e94682bSSuren Baghdasaryan 136297a32539SAlexey Dobriyan static const struct proc_ops psi_io_proc_ops = { 136397a32539SAlexey Dobriyan .proc_open = psi_io_open, 136497a32539SAlexey Dobriyan .proc_read = seq_read, 136597a32539SAlexey Dobriyan .proc_lseek = seq_lseek, 136697a32539SAlexey Dobriyan .proc_write = psi_io_write, 136797a32539SAlexey Dobriyan .proc_poll = psi_fop_poll, 136897a32539SAlexey Dobriyan .proc_release = psi_fop_release, 1369eb414681SJohannes Weiner }; 1370eb414681SJohannes Weiner 137197a32539SAlexey Dobriyan static const struct proc_ops psi_memory_proc_ops = { 137297a32539SAlexey Dobriyan .proc_open = psi_memory_open, 137397a32539SAlexey Dobriyan .proc_read = seq_read, 137497a32539SAlexey Dobriyan .proc_lseek = seq_lseek, 137597a32539SAlexey Dobriyan .proc_write = psi_memory_write, 137697a32539SAlexey Dobriyan .proc_poll = psi_fop_poll, 137797a32539SAlexey Dobriyan .proc_release = psi_fop_release, 1378eb414681SJohannes Weiner }; 1379eb414681SJohannes Weiner 138097a32539SAlexey Dobriyan static const struct proc_ops psi_cpu_proc_ops = { 138197a32539SAlexey Dobriyan .proc_open = psi_cpu_open, 138297a32539SAlexey Dobriyan .proc_read = seq_read, 138397a32539SAlexey Dobriyan .proc_lseek = seq_lseek, 138497a32539SAlexey Dobriyan .proc_write = psi_cpu_write, 138597a32539SAlexey Dobriyan .proc_poll = psi_fop_poll, 138697a32539SAlexey Dobriyan .proc_release = psi_fop_release, 1387eb414681SJohannes Weiner }; 1388eb414681SJohannes Weiner 1389eb414681SJohannes Weiner static int __init psi_proc_init(void) 1390eb414681SJohannes Weiner { 13913d817689SWang Long if (psi_enable) { 1392eb414681SJohannes Weiner proc_mkdir("pressure", NULL); 13936db12ee0SJosh Hunt proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops); 13946db12ee0SJosh Hunt proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops); 13956db12ee0SJosh Hunt proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops); 13963d817689SWang Long } 1397eb414681SJohannes Weiner return 0; 1398eb414681SJohannes Weiner } 1399eb414681SJohannes Weiner module_init(psi_proc_init); 14005102bb1cSSuren Baghdasaryan 14015102bb1cSSuren Baghdasaryan #endif /* CONFIG_PROC_FS */ 1402