#
448a2500 |
| 18-Jun-2024 |
John Stultz <jstultz@google.com> |
sched: Move psi_account_irqtime() out of update_rq_clock_task() hotpath
commit ddae0ca2a8fe12d0e24ab10ba759c3fbd755ada8 upstream.
It was reported that in moving to 6.1, a larger than 10% regression was seen in the performance of clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...).
Using a simple reproducer, I found:

5.10:
  100000000 calls in 24345994193 ns => 243.460 ns per call
  100000000 calls in 24288172050 ns => 242.882 ns per call
  100000000 calls in 24289135225 ns => 242.891 ns per call

6.1:
  100000000 calls in 28248646742 ns => 282.486 ns per call
  100000000 calls in 28227055067 ns => 282.271 ns per call
  100000000 calls in 28177471287 ns => 281.775 ns per call
The cause of this was finally narrowed down to the addition of psi_account_irqtime() in update_rq_clock_task(), in commit 52b1364ba0b1 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure").
In my initial attempt to resolve this, I leaned towards moving all accounting work out of the clock_gettime() call path, but it wasn't very pretty, so it will have to wait for a later deeper rework. Instead, Peter shared this approach:
Rework psi_account_irqtime() to use its own psi_irq_time base for accounting, and move it out of the hotpath, calling it instead from sched_tick() and __schedule().
In testing this, we found the importance of ensuring psi_account_irqtime() is run under the rq_lock, which Johannes Weiner helpfully explained, so also add some lockdep annotations to make that requirement clear.
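A minimal userspace sketch of that flow (the psi_irq_time base and the lock requirement follow the description above; the rq type, lock_held field and irq_time_read() stub are stand-ins, not kernel code):

    #include <assert.h>
    #include <stdint.h>

    struct rq {
            int lock_held;          /* stand-in for the real rq lock state */
            uint64_t psi_irq_time;  /* per-rq base: IRQ time already accounted */
    };

    /* stub for irq_time_read(cpu): monotonically growing IRQ time */
    static uint64_t irq_time_read(void) { static uint64_t t; return t += 100; }

    static void account_irqtime(struct rq *rq)
    {
            assert(rq->lock_held);  /* models lockdep_assert_rq_held(rq) */

            uint64_t irq = irq_time_read();
            int64_t delta = (int64_t)(irq - rq->psi_irq_time);

            if (delta < 0)
                    return;
            rq->psi_irq_time = irq; /* advance the private base */
            /* ... charge 'delta' as PSI_IRQ_FULL to curr's cgroups ... */
    }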
With this change the performance is back in line with 5.10:

6.1+fix:
  100000000 calls in 24297324597 ns => 242.973 ns per call
  100000000 calls in 24318869234 ns => 243.189 ns per call
  100000000 calls in 24291564588 ns => 242.916 ns per call
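The reproducer itself is not included in the commit; a plausible equivalent, which prints per-call latency in the same format as the numbers above, is:

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    int main(void)
    {
            enum { CALLS = 100000000 };
            struct timespec ts, start, end;

            clock_gettime(CLOCK_MONOTONIC, &start);
            for (int i = 0; i < CALLS; i++)
                    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
            clock_gettime(CLOCK_MONOTONIC, &end);

            int64_t ns = (int64_t)(end.tv_sec - start.tv_sec) * 1000000000
                       + (end.tv_nsec - start.tv_nsec);
            printf("%d calls in %lld ns => %.3f ns per call\n",
                   CALLS, (long long)ns, (double)ns / CALLS);
            return 0;
    }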
Reported-by: Jimmy Shiu <jimmyshiu@google.com> Originally-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Qais Yousef <qyousef@layalina.io> Link: https://lore.kernel.org/r/20240618215909.4099720-1-jstultz@google.com Fixes: 52b1364ba0b1 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure") [jstultz: Fixed up minor collisions w/ 6.6-stable] Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
52b33d87 |
| 26-Sep-2022 |
Chengming Zhou <zhouchengming@bytedance.com> |
sched/psi: Use task->psi_flags to clear in CPU migration
The commit d583d360a620 ("psi: Fix psi state corruption when schedule() races with cgroup move") fixed a race problem by making cgroup_move_task() use task->psi_flags instead of looking at the scheduler state.
We can extend task->psi_flags usage to CPU migration, which should be a minor optimization for performance and code simplicity.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220926081931.45420-1-zhouchengming@bytedance.com
|
#
52b1364b |
| 25-Aug-2022 |
Chengming Zhou <zhouchengming@bytedance.com> |
sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure
PSI already tracks workload pressure stall information for CPU, memory and IO. Apart from these, IRQ/SOFTIRQ can also have an obvious impact on the productivity of some workloads, such as web services.
When CONFIG_IRQ_TIME_ACCOUNTING is enabled, we can get the IRQ/SOFTIRQ time delta from update_rq_clock_task(), and record that delta against the CPU's current task's cgroups as PSI_IRQ_FULL status.
Note we don't use PSI_IRQ_SOME: IRQ/SOFTIRQ work always interrupts the current task on the CPU, so nothing productive can run even if it is runnable; hence only PSI_IRQ_FULL is used.
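A toy model of that FULL-only accounting (the names here are illustrative, not the kernel's):

    #include <stdint.h>

    enum psi_irq_states { PSI_IRQ_FULL, NR_IRQ_STATES }; /* no PSI_IRQ_SOME */

    struct psi_group_cpu {
            uint64_t times[NR_IRQ_STATES];  /* stall time per state, in ns */
    };

    /* IRQ/SOFTIRQ time preempts whatever is current, so the whole delta
     * is FULL stall: nothing productive could have run during it. */
    static void account_irq_delta(struct psi_group_cpu *gc, uint64_t delta_ns)
    {
            gc->times[PSI_IRQ_FULL] += delta_ns;
    }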
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com
|
#
d79ddb06 |
| 25-Aug-2022 |
Chengming Zhou <zhouchengming@bytedance.com> |
sched/psi: Move private helpers to sched/stats.h
This patch moves the psi_task_change/psi_task_switch declarations out of the public PSI header, since they are only needed to implement the PSI stats tracking in sched/stats.h.
psi_task_switch is an obvious case; psi_task_change can't be a public helper either, since it doesn't check the psi_disabled static key, and it has no other users now, so put it in sched/stats.h too.
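The private helpers sit behind static-key-checked wrappers in sched/stats.h; the shape is roughly this (a sketch, with the wrapper body abbreviated):

    /* kernel-style fragment: public entry points bail out via the
     * psi_disabled static key before reaching the private helper. */
    static inline void psi_enqueue(struct task_struct *p, bool wakeup)
    {
            if (static_branch_likely(&psi_disabled))
                    return;

            /* ... work out which PSI flags to set/clear ... */
            psi_task_change(p, /* clear */ 0, /* set */ TSK_RUNNING);
    }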
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20220825164111.29534-5-zhouchengming@bytedance.com
|
#
4ff8f2ca |
| 22-Feb-2022 |
Ingo Molnar <mingo@kernel.org> |
sched/headers: Reorganize, clean up and optimize kernel/sched/sched.h dependencies
Remove all headers, except the ones required to make this header build standalone.
Also include stats.h in sched.h explicitly - dependencies already require this.
Summary of the build speedup gained through the last ~15 scheduler build & header dependency patches:
Cumulative scheduler (kernel/sched/) build time speedup on a Linux distribution's config, which enables all scheduler features, compared to the vanilla kernel:
 _____________________________________________________________________________
|
|  Vanilla kernel (v5.13-rc7):
|_____________________________________________________________________________
|
|  Performance counter stats for 'make -j96 kernel/sched/' (3 runs):
|
|    126,975,564,374 instructions        #  1.45  insn per cycle   ( +- 0.00% )
|     87,637,847,671 cycles              #  3.959 GHz              ( +- 0.30% )
|          22,136.96 msec cpu-clock      #  7.499 CPUs utilized    ( +- 0.29% )
|
|             2.9520 +- 0.0169 seconds time elapsed                ( +- 0.57% )
|_____________________________________________________________________________
|
|  Patched kernel:
|_____________________________________________________________________________
|
|  Performance counter stats for 'make -j96 kernel/sched/' (3 runs):
|
|     50,420,496,914 instructions        #  1.47  insn per cycle   ( +- 0.00% )
|     34,234,322,038 cycles              #  3.946 GHz              ( +- 0.31% )
|           8,675.81 msec cpu-clock      #  3.053 CPUs utilized    ( +- 0.45% )
|
|             2.8420 +- 0.0181 seconds time elapsed                ( +- 0.64% )
|_____________________________________________________________________________
Summary:
- CPU time used to build the scheduler dropped by 60.9%, a reduction from 22.1 clock-seconds to 8.7 clock-seconds.
- Wall-clock time to build the scheduler dropped by 3.9%, a reduction from 2.95 seconds to 2.84 seconds.
Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org>
|
#
b9e9c6ca |
| 13-Feb-2022 |
Ingo Molnar <mingo@kernel.org> |
sched/headers: Standardize kernel/sched/sched.h header dependencies
kernel/sched/sched.h is a weird mix of ad-hoc headers included in the middle of the header.
Two of them rely on being included in the middle of kernel/sched/sched.h, due to definitions they require:
- "stat.h" needs the rq definitions. - "autogroup.h" needs the task_group definition.
Move the inclusion of these two files out of kernel/sched/sched.h, and include them in all files that require them.
Move the rest of the header dependencies to the top of the kernel/sched/sched.h file.
Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org>
|
#
d90a2f16 |
| 20-Nov-2021 |
Ingo Molnar <mingo@kernel.org> |
sched/headers: Add header guard to kernel/sched/stats.h and kernel/sched/autogroup.h
Protect against multiple inclusion.
Also include "sched.h" in "stat.h", as it relies on it.
Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Peter Zijlstra <peterz@infradead.org>
|
#
cb0e52b7 |
| 10-Nov-2021 |
Brian Chen <brianchen118@gmail.com> |
psi: Fix PSI_MEM_FULL state when tasks are in memstall and doing reclaim
We've noticed cases where tasks in a cgroup are stalled on memory but there is little memory FULL pressure since tasks stay on the runqueue in reclaim.
A simple example involves a single threaded program that keeps leaking and touching large amounts of memory. It runs in a cgroup with swap enabled, memory.high set at 10M and cpu.max ratio set at 5%. Though there is significant CPU pressure and memory SOME, there is barely any memory FULL since the task enters reclaim and stays on the runqueue. However, this memory-bound task is effectively stalled on memory and we expect memory FULL to match memory SOME in this scenario.
The code is confused about memstall && running, thinking there is a stalled task and a productive task when there's only one task: a reclaimer that's counted as both. To fix this, we redefine the condition for PSI_MEM_FULL to check that all running tasks are in an active memstall instead of checking that there are no running tasks.
	case PSI_MEM_FULL:
-		return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
+		return unlikely(tasks[NR_MEMSTALL] &&
+				tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
This will capture reclaimers. It will also capture tasks that called psi_memstall_enter() and are about to sleep, but this should be negligible noise.
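To make the new condition concrete, here is a small self-contained check of the predicate (the counter names come from the patch; the harness itself is illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    enum { NR_RUNNING, NR_MEMSTALL, NR_MEMSTALL_RUNNING, NR_COUNTS };

    /* New PSI_MEM_FULL test: every running task is an active reclaimer. */
    static bool mem_full(const unsigned int tasks[NR_COUNTS])
    {
            return tasks[NR_MEMSTALL] &&
                   tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING];
    }

    int main(void)
    {
            /* single leaking task stuck in reclaim: now counted as FULL */
            unsigned int reclaimer_only[NR_COUNTS] = { 1, 1, 1 };
            /* a productive task runs alongside the stall: not FULL */
            unsigned int mixed[NR_COUNTS] = { 2, 1, 1 };

            printf("%d %d\n", mem_full(reclaimer_only), mem_full(mixed)); /* 1 0 */
            return 0;
    }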
Signed-off-by: Brian Chen <brianchen118@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lore.kernel.org/r/20211110213312.310243-1-brianchen118@gmail.com
|
#
60f2415e |
| 05-Sep-2021 |
Yafang Shao <laoar.shao@gmail.com> |
sched: Make schedstats helpers independent of fair sched class
The original prototype of the schedstats helpers is
update_stats_wait_*(struct cfs_rq *cfs_rq, struct sched_entity *se)
The cfs_rq in these helpers is used to get the rq_clock, and the se is used to get the struct sched_statistics and the struct task_struct. In order to make these helpers available to all sched classes, we can pass the rq, sched_statistics and task_struct directly.
Then the new helpers are
update_stats_wait_*(struct rq *rq, struct task_struct *p, struct sched_statistics *stats)
which are independent of fair sched class.
To avoid vmlinux growing too large or introducing overhead when !schedstat_enabled(), some new helpers gated by schedstat_enabled() are also introduced, as suggested by Mel. These helpers are in sched/stats.c:
__update_stats_wait_*(struct rq *rq, struct task_struct *p, struct sched_statistics *stats)
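The split between the inline wrapper and the out-of-line __-prefixed helper follows the usual schedstats pattern, roughly like this (a sketch of the shape, not the exact fair-class code):

    /* kernel-style fragment: the cheap static-branch test stays inline
     * at every call site; the real work lives in sched/stats.c. */
    static inline void
    update_stats_wait_start(struct rq *rq, struct task_struct *p)
    {
            if (!schedstat_enabled())
                    return;

            __update_stats_wait_start(rq, p, &p->stats);
    }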
The size of vmlinux is as follows:

                      Before       After
  Size of vmlinux     826308552    826304640

The size is a little smaller as some functions are no longer inlined after the change.
I also compared the sched performance with 'perf bench sched pipe', as suggested by Mel. The results are as follows (in usecs/op):

                               Before     After
  kernel.sched_schedstats=0    5.2~5.4    5.2~5.4
  kernel.sched_schedstats=1    5.3~5.5    5.3~5.5
[These numbers differ a little from the previous version because my old test machine was destroyed, so I had to use a different one.] Almost no difference.
No functional change.
[lkp@intel.com: reported build failure in prev version]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20210905143547.4668-4-laoar.shao@gmail.com
|
#
ceeadb83 |
| 05-Sep-2021 |
Yafang Shao <laoar.shao@gmail.com> |
sched: Make struct sched_statistics independent of fair sched class
If we want to use the schedstats facility to trace other sched classes, we should make it independent of the fair sched class. struct sched_statistics holds the scheduler statistics of a task_struct or a task_group, so we can move it into struct task_struct and struct task_group to achieve that goal.
After the patch, schedstats are organized as follows:
	struct task_struct {
		...
		struct sched_entity		se;
		struct sched_rt_entity		rt;
		struct sched_dl_entity		dl;
		...
		struct sched_statistics		stats;
		...
	};
Regarding the task group, schedstats is only supported for fair group sched, and a new struct sched_entity_stats is introduced, suggested by Peter -
	struct sched_entity_stats {
		struct sched_entity	se;
		struct sched_statistics	stats;
	} __no_randomize_layout;
Then with the se in a task_group, we can easily get the stats.
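That lookup is a container_of() step; a compilable miniature of the idea (struct members stubbed down):

    #include <stddef.h>
    #include <stdio.h>

    struct sched_entity     { int load; };
    struct sched_statistics { unsigned long long wait_start; };

    /* Layout from the commit: stats sits right behind the embedded se. */
    struct sched_entity_stats {
            struct sched_entity     se;
            struct sched_statistics stats;
    };

    #define container_of(ptr, type, member) \
            ((type *)((char *)(ptr) - offsetof(type, member)))

    /* What the kernel does for group entities (task entities return
     * &task_of(se)->stats instead): */
    static struct sched_statistics *stats_from_se(struct sched_entity *se)
    {
            return &container_of(se, struct sched_entity_stats, se)->stats;
    }

    int main(void)
    {
            struct sched_entity_stats ses = { .stats.wait_start = 42 };
            printf("%llu\n", stats_from_se(&ses.se)->wait_start); /* 42 */
            return 0;
    }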
The sched_statistics members may be modified frequently when schedstats is enabled; to avoid impacting unrelated data that might share their cacheline, struct sched_statistics is defined as cacheline aligned.
As this patch changes a core struct of the scheduler, I verified its performance impact with 'perf bench sched pipe', as suggested by Mel. Below are the results, with all values in usecs/op:

                               Before     After
  kernel.sched_schedstats=0    5.2~5.4    5.2~5.4
  kernel.sched_schedstats=1    5.3~5.5    5.3~5.5

[These numbers differ a little from the earlier version because my old test machine was destroyed, so I had to use a different one.]
Almost no impact on the sched performance.
No functional change.
[lkp@intel.com: reported build failure in earlier version]
Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lore.kernel.org/r/20210905143547.4668-3-laoar.shao@gmail.com
|
#
b03fbd4f |
| 11-Jun-2021 |
Peter Zijlstra <peterz@infradead.org> |
sched: Introduce task_is_running()
Replace a bunch of 'p->state == TASK_RUNNING' with a new helper: task_is_running(p).
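The helper is a one-line predicate; modeled in plain C (the kernel wraps the load in READ_ONCE(), and the field was later renamed __state):

    #include <stdio.h>

    #define TASK_RUNNING 0x0000

    struct task_struct { unsigned int state; };

    static inline int task_is_running(const struct task_struct *p)
    {
            return p->state == TASK_RUNNING; /* kernel: READ_ONCE(p->state) */
    }

    int main(void)
    {
            struct task_struct p = { .state = TASK_RUNNING };
            printf("%d\n", task_is_running(&p)); /* 1 */
            return 0;
    }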
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org
|
#
90a0ff4e |
| 12-May-2021 |
Peter Zijlstra <peterz@infradead.org> |
sched,stats: Further simplify sched_info
There's no point doing delta==0 updates.
Suggested-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
|
#
c5895d3f |
| 04-May-2021 |
Peter Zijlstra <peterz@infradead.org> |
sched: Simplify sched_info_on()
The situation around sched_info is somewhat complicated: it is used by sched_stats and delayacct and, indirectly, kvm.
If SCHEDSTATS=Y (but disabled by default) sched_info_on() is unconditionally true -- this is the case for all distro kernel configs I checked.
If for some reason SCHEDSTATS=N but TASK_DELAY_ACCT=Y, then sched_info_on() can return false when delayacct is disabled, presumably because there would be no other users left; except kvm still is one.
Instead of complicating matters further by accurately accounting sched_stat and kvm state, simply unconditionally enable when SCHED_INFO=Y, matching the common distro case.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20210505111525.121458839@infradead.org
|
#
4e29fb70 |
| 04-May-2021 |
Peter Zijlstra <peterz@infradead.org> |
sched: Rename sched_info_{queued,dequeued}
For consistency, rename {queued,dequeued} to {enqueue,dequeue}.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Balbir Singh <bsingharora@gmail.com> Link: https://lkml.kernel.org/r/20210505111525.061402904@infradead.org
|
#
4117cebf |
| 02-Mar-2021 |
Chengming Zhou <zhouchengming@bytedance.com> |
psi: Optimize task switch inside shared cgroups
Commit 36b238d57172 ("psi: Optimize switching tasks inside shared cgroups") made task switches update only the cgroups whose state actually changes, but only in the preempt case, not in the sleep case.
In the sleep case, we don't actually need to clear and set the TSK_ONCPU state for the common cgroups of the next and prev tasks; skipping that saves many psi_group_change() calls, especially when most activity comes from one leaf cgroup.
sleep before:
	psi_dequeue()
		while ((group = iterate_groups(prev)))	# all ancestors
			psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
	psi_task_switch()
		while ((group = iterate_groups(next)))	# all ancestors
			psi_group_change(next, .set=TSK_ONCPU)

sleep after:
	psi_dequeue()
		nop
	psi_task_switch()
		while ((group = iterate_groups(next)))	# until (prev & next)
			psi_group_change(next, .set=TSK_ONCPU)
		while ((group = iterate_groups(prev)))	# all ancestors
			psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)
When a voluntary sleep switches to another task, we remove one call of psi_group_change() for every common cgroup ancestor of the two tasks.
Co-developed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20210303034659.91735-5-zhouchengming@bytedance.com
|
#
7fae6c81 |
| 02-Mar-2021 |
Chengming Zhou <zhouchengming@bytedance.com> |
psi: Use ONCPU state tracking machinery to detect reclaim
Move the reclaim detection from the timer tick to the task state tracking machinery, using the recently added ONCPU state. We also add checking for task psi_flags changes in the psi_task_switch() optimization, so the parents are updated properly.
In terms of performance and cost, this ONCPU task state tracking is not cheaper than the previous timer tick in aggregate. But the code is simpler and shorter this way, so it's a maintainability win. Johannes also did some testing with perf bench; the performance and cost changes would be acceptable for real workloads.
Thanks to Johannes Weiner for pointing out the psi_task_switch() optimization opportunities and for the clearer changelog.
Co-developed-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20210303034659.91735-3-zhouchengming@bytedance.com
|
#
1066d1b6 |
| 16-Mar-2020 |
Yafang Shao <laoar.shao@gmail.com> |
psi: Move PF_MEMSTALL out of task->flags
task->flags is a 32-bit field of which 31 bits have already been consumed, so it is hard to introduce another new per-process flag there. There is still enough space in the bit-field section of task_struct, however, so we can define the memstall state as a single bit there instead. This patch also removes an out-of-date comment pointed out by Matthew.
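In miniature, the move looks like this (a layout sketch; in_memstall is the bit the patch actually adds):

    struct task_struct_sketch {
            unsigned int flags;             /* PF_* word: 31 of 32 bits used */

            /* bit-field section: plenty of room left */
            unsigned int in_memstall:1;     /* stalled due to lack of memory */
    };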
Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/1584408485-1921-1-git-send-email-laoar.shao@gmail.com
|
#
36b238d5 |
| 16-Mar-2020 |
Johannes Weiner <hannes@cmpxchg.org> |
psi: Optimize switching tasks inside shared cgroups
When switching tasks running on a CPU, the psi state of a cgroup containing both of these tasks does not change. Right now, we don't exploit that, and can perform many unnecessary state changes in nested hierarchies, especially when most activity comes from one leaf cgroup.
This patch implements an optimization where we only update cgroups whose state actually changes during a task switch. These are all cgroups that contain one task but not the other, up to the first shared ancestor. When both tasks are in the same group, we don't need to update anything at all.
We can identify the first shared ancestor by walking the groups of the incoming task until we see TSK_ONCPU set on the local CPU; that's the first group that also contains the outgoing task.
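A compilable miniature of that walk (single-CPU model; the kernel checks the per-CPU tasks[NR_ONCPU] count rather than a plain flag):

    #include <stddef.h>

    struct psi_group {
            struct psi_group *parent;
            unsigned int nr_oncpu;  /* models the per-CPU TSK_ONCPU count */
    };

    /* Walk the incoming task's groups leaf-to-root, setting ONCPU until a
     * group already has it: that is the first ancestor shared with the
     * outgoing task, and it and everything above it needs no update. */
    static struct psi_group *set_oncpu(struct psi_group *g)
    {
            for (; g; g = g->parent) {
                    if (g->nr_oncpu)        /* first shared ancestor: stop */
                            break;
                    g->nr_oncpu = 1;        /* psi_group_change(.set=TSK_ONCPU) */
            }
            return g;                       /* NULL if nothing was shared */
    }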
The new psi_task_switch() is similar to psi_task_change(). To allow code reuse, move the task flag maintenance code into a new function and the poll/avg worker wakeups into the shared psi_group_change().
Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200316191333.115523-3-hannes@cmpxchg.org
|
#
b05e75d6 |
| 16-Mar-2020 |
Johannes Weiner <hannes@cmpxchg.org> |
psi: Fix cpu.pressure for cpu.max and competing cgroups
For simplicity, cpu pressure is defined as having more than one runnable task on a given CPU. This works on the system-level, but it has limitations in a cgrouped reality: When cpu.max is in use, it doesn't capture the time in which a task is not executing on the CPU due to throttling. Likewise, it doesn't capture the time in which a competing cgroup is occupying the CPU - meaning it only reflects cgroup-internal competitive pressure, not outside pressure.
Enable tracking of currently executing tasks, and then change the definition of cpu pressure in a cgroup from
NR_RUNNING > 1
to
NR_RUNNING > ON_CPU
which will capture the effects of cpu.max as well as competition from outside the cgroup.
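As a predicate, the redefinition is simply this (an illustrative fragment; counter names as in the text above):

    #include <stdbool.h>

    /* Per-cgroup, per-CPU counts. SOME cpu pressure now also covers time
     * where runnable tasks are kept off the CPU by throttling or by
     * competing cgroups, not just cgroup-internal competition. */
    static bool cpu_pressured(unsigned int nr_running, unsigned int nr_oncpu)
    {
            return nr_running > nr_oncpu;   /* was: nr_running > 1 */
    }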
After this patch, a cgroup running `stress -c 1` with a cpu.max setting of 5000 10000 shows ~50% continuous CPU pressure.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200316191333.115523-2-hannes@cmpxchg.org
|
#
65d74e91 |
| 04-Jul-2019 |
Yi Wang <wang.yi59@zte.com.cn> |
sched/stats: Fix unlikely() use of sched_info_on()
sched_info_on() is called with an unlikely() hint; however, the test evaluates to the constant 1 in a defconfig build, where the compiler does nothing with the hint, so remove it.
Also, fix a lack of {}.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: up2wing@gmail.com Cc: wang.liang82@zte.com.cn Cc: xue.zhihong@zte.com.cn Link: https://lkml.kernel.org/r/1562301307-43002-1-git-send-email-wang.yi59@zte.com.cn Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
e0c27447 |
| 30-Nov-2018 |
Johannes Weiner <hannes@cmpxchg.org> |
psi: make disabling/enabling easier for vendor kernels
Mel Gorman reports a hackbench regression with psi that would prohibit shipping the suse kernel with it default-enabled, but he'd still like users to be able to opt in at little to no cost to others.
With the current combination of CONFIG_PSI and the psi_disabled bool set from the commandline, this is a challenge. Do the following things to make it easier:
1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros to enable CONFIG_PSI in their kernel but leave the feature disabled unless a user requests it at boot-time.
To avoid double negatives, rename psi_disabled= to psi=.
2. Make psi_disabled a static branch to eliminate any branch costs when the feature is disabled.
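Point 2 uses the kernel's jump-label machinery; the shape of the change is roughly this (an abbreviated kernel-style fragment):

    #include <linux/jump_label.h>

    DEFINE_STATIC_KEY_FALSE(psi_disabled);

    void psi_task_change_sketch(void)
    {
            /* when the feature is off, this is a patched branch rather
             * than a load-and-test of a bool on every call */
            if (static_branch_likely(&psi_disabled))
                    return;

            /* ... PSI state accounting ... */
    }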
In terms of numbers before and after this patch, Mel says:
: The following is a comparison using CONFIG_PSI=n as a baseline against
: your patch and a vanilla kernel:
:
:                          4.20.0-rc4            4.20.0-rc4            4.20.0-rc4
:                 kconfigdisable-v1r1               vanilla       psidisable-v1r1
: Amean     1        1.3100 (   0.00%)     1.3923 (  -6.28%)     1.3427 (  -2.49%)
: Amean     3        3.8860 (   0.00%)     4.1230 *  -6.10%*     3.8860 (  -0.00%)
: Amean     5        6.8847 (   0.00%)     8.0390 * -16.77%*     6.7727 (   1.63%)
: Amean     7        9.9310 (   0.00%)    10.8367 *  -9.12%*     9.9910 (  -0.60%)
: Amean     12      16.6577 (   0.00%)    18.2363 *  -9.48%*    17.1083 (  -2.71%)
: Amean     18      26.5133 (   0.00%)    27.8833 *  -5.17%*    25.7663 (   2.82%)
: Amean     24      34.3003 (   0.00%)    34.6830 (  -1.12%)    32.0450 (   6.58%)
: Amean     30      40.0063 (   0.00%)    40.5800 (  -1.43%)    41.5087 (  -3.76%)
: Amean     32      40.1407 (   0.00%)    41.2273 (  -2.71%)    39.9417 (   0.50%)
:
: It's showing that the vanilla kernel takes a hit (as the bisection
: indicated it would) and that disabling PSI by default is reasonably
: close in terms of performance for this particular workload on this
: particular machine.
Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
eb414681 |
| 26-Oct-2018 |
Johannes Weiner <hannes@cmpxchg.org> |
psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard to tell exactly the impact this has on workload productivity, or how close the system is to lockups and OOM kills. In particular, when machines work multiple jobs concurrently, the impact of overcommit in terms of latency and throughput on the individual job can be enormous.
In order to maximize hardware utilization without sacrificing individual job health or risk complete machine lockups, this patch implements a way to quantify resource pressure in the system.
A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that expose the percentage of time the system is stalled on CPU, memory, or IO, respectively. Stall states are aggregate versions of the per-task delay accounting delays:
	   cpu: some tasks are runnable but not executing on a CPU
	memory: tasks are reclaiming, or waiting for swapin or thrashing cache
	    io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages, and they give a general sense of system health and productivity loss incurred by resource overcommit. They can also indicate when the system is approaching lockup scenarios and OOMs.
To do this, psi keeps track of the task states associated with each CPU and samples the time they spend in stall states. Every 2 seconds, the samples are averaged across CPUs - weighted by the CPUs' non-idle time to eliminate artifacts from unused CPUs - and translated into percentages of walltime. A running average of those percentages is maintained over 10s, 1m, and 5m periods (similar to the loadaverage).
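A toy model of that averaging (the constants and update rule here are illustrative only; the kernel uses fixed-point, loadavg-style math):

    #include <stdio.h>

    /* Fold one sampling period into a running average over 'win_s'
     * seconds: the older average decays toward the new sample. */
    static double update_avg(double avg, double sample_pct,
                             double period_s, double win_s)
    {
            double w = period_s / win_s;    /* weight of the new sample */
            return avg + w * (sample_pct - avg);
    }

    int main(void)
    {
            double avg10 = 0.0;

            /* a CPU 50% stalled, sampled every 2 seconds for 30 seconds */
            for (int t = 0; t < 30; t += 2)
                    avg10 = update_avg(avg10, 50.0, 2.0, 10.0);

            printf("avg10 ~ %.1f%%\n", avg10);      /* approaches 50% */
            return 0;
    }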
[hannes@cmpxchg.org: doc fixlet, per Randy] Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org [hannes@cmpxchg.org: code optimization] Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter] Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org [hannes@cmpxchg.org: fix build] Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Daniel Drake <drake@endlessm.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Cc: Christopher Lameter <cl@linux.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Enderborg <peter.enderborg@sony.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
#
97fb7a0a |
| 03-Mar-2018 |
Ingo Molnar <mingo@kernel.org> |
sched: Clean up and harmonize the coding style of the scheduler code base
A good number of small style inconsistencies have accumulated in the scheduler core, so do a pass over them to harmonize all these details:
- fix spelling in comments,
- use curly braces for multi-line statements,
- remove unnecessary parentheses from integer literals,
- capitalize consistently,
- remove stray newlines,
- add comments where necessary,
- remove invalid/unnecessary comments,
- align structure definitions and other data types vertically,
- add missing newlines for increased readability,
- fix vertical tabulation where it's misaligned,
- harmonize preprocessor conditional block labeling and vertical alignment,
- remove line-breaks where they uglify the code,
- add newline after local variable definitions,
No change in functionality:
md5:
   1191fa0a890cfa8132156d2959d7e9e2  built-in.o.before.asm
   1191fa0a890cfa8132156d2959d7e9e2  built-in.o.after.asm
Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
2ed41a55 |
| 23-Jan-2018 |
Peter Zijlstra <peterz@infradead.org> |
sched/core: Optimize update_stats_*()
These functions are already gated by schedstats_enabled(); there is no point in then issuing another static_branch for every individual update in them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
#
b85c8b71 |
| 16-Jan-2018 |
Peter Zijlstra <peterz@infradead.org> |
sched/core: Optimize ttwu_stat()
The whole of ttwu_stat() is guarded by a single schedstat_enabled(); there is absolutely no point in then issuing another static_branch for every single schedstat_inc() in there.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
|