Revision tags: v6.6.30, v6.6.29, v6.6.28

# 7fce9f0f | 15-Apr-2024 | Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
sched: Add missing memory barrier in switch_mm_cid
commit fe90f3967bdb3e13f133e5f44025e15f943a99c5 upstream.
Many architectures' switch_mm() (e.g. arm64) do not have an smp_mb() which the core scheduler code has depended upon since commit:
commit 223baf9d17f25 ("sched: Fix performance regression introduced by mm_cid")
If switch_mm() doesn't call smp_mb(), sched_mm_cid_remote_clear() can unset the actively used cid when it fails to observe active task after it sets lazy_put.
There *is* a memory barrier between storing to rq->curr and _return to userspace_ (as required by membarrier), but the rseq mm_cid has stricter requirements: the barrier needs to be issued between store to rq->curr and switch_mm_cid(), which happens earlier than:
 - spin_unlock(),
 - switch_to().
So it's fine when the architecture switch_mm() happens to have that barrier already, but less so when the architecture only provides the full barrier in switch_to() or spin_unlock().
It is a bug in the rseq switch_mm_cid() implementation. All architectures that don't have memory barriers in switch_mm(), but rather have the full barrier either in finish_lock_switch() or switch_to() have them too late for the needs of switch_mm_cid().
Introduce a new smp_mb__after_switch_mm(), defined as smp_mb() in the generic barrier.h header, and use it in switch_mm_cid() for scheduler transitions where switch_mm() is expected to provide a memory barrier.
Architectures can override smp_mb__after_switch_mm() if their switch_mm() implementation provides an implicit memory barrier. Override it with a no-op on x86, which implicitly provides this memory barrier by writing to CR3.
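For illustration, a minimal sketch of how the helper can be wired up (a sketch, not the verbatim upstream hunk; it assumes the generic fallback sits in include/asm-generic/barrier.h and that the x86 override can be a plain compiler barrier):

  /* include/asm-generic/barrier.h: default to a full memory barrier */
  #ifndef smp_mb__after_switch_mm
  # define smp_mb__after_switch_mm()  smp_mb()
  #endif

  /* arch/x86/include/asm/barrier.h: switch_mm() writes CR3, which already
   * implies a full memory barrier, so a compiler barrier is assumed to be
   * enough here. */
  #define smp_mb__after_switch_mm()   barrier()

The scheduler then issues smp_mb__after_switch_mm() from switch_mm_cid() on the transitions where switch_mm() is expected to provide the barrier.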
Fixes: 223baf9d17f2 ("sched: Fix performance regression introduced by mm_cid")
Reported-by: levi.yun <yeoreum.yun@arm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> # for arm64
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> # for x86
Cc: <stable@vger.kernel.org> # 6.4.x
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240415152114.59122-2-mathieu.desnoyers@efficios.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Revision tags: v6.6.27, v6.6.26, v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43
# 4eb054f9 | 01-Aug-2023 | Peter Zijlstra <peterz@infradead.org>
sched: Simplify wake_up_if_idle()
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211812.032678917@infradead.org
# 5bb76f1d | 01-Aug-2023 | Peter Zijlstra <peterz@infradead.org>
sched: Simplify: migrate_swap_stop()
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211811.964370836@infradead.org
Revision tags: v6.1.42, v6.1.41, v6.1.40, v6.1.39
# 88c56cfe | 12-Jul-2023 | Phil Auld <pauld@redhat.com>
sched/fair: Block nohz tick_stop when cfs bandwidth in use
CFS bandwidth limits and NOHZ full don't play well together. Tasks can easily run well past their quotas before a remote tick does accounting. This leads to long, multi-period stalls before such tasks can run again. Currently, when presented with these conflicting requirements the scheduler is favoring nohz_full and letting the tick be stopped. However, nohz tick stopping is already best-effort, there are a number of conditions that can prevent it, whereas cfs runtime bandwidth is expected to be enforced.
Make the scheduler favor bandwidth over stopping the tick by setting TICK_DEP_BIT_SCHED when the only running task is a cfs task with runtime limit enabled. We use cfs_b->hierarchical_quota to determine if the task requires the tick.
Add check in pick_next_task_fair() as well since that is where we have a handle on the task that is actually going to be running.
Add check in sched_can_stop_tick() to cover some edge cases such as nr_running going from 2->1 and the 1 remains the running task.
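As a rough illustration of the condition described above (function and parameter names are illustrative, not the exact kernel helpers):

  #define NO_BW_LIMIT (-1LL)  /* stand-in for the kernel's RUNTIME_INF */

  /* Sketch: keep the tick when the lone runnable task is a CFS task whose
   * hierarchy carries a bandwidth limit, so the quota keeps being enforced
   * even on a nohz_full CPU. */
  static bool keep_tick_for_bandwidth(unsigned int nr_running, bool is_cfs_task,
                                      long long hierarchical_quota)
  {
          return nr_running == 1 && is_cfs_task &&
                 hierarchical_quota != NO_BW_LIMIT;
  }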
Reviewed-By: Ben Segall <bsegall@google.com>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230712133357.381137-3-pauld@redhat.com
# c98c1827 | 14-Jul-2023 | Phil Auld <pauld@redhat.com>
sched, cgroup: Restore meaning to hierarchical_quota
In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task groups due to the previous fix simply taking the min. It should reflect a limit imposed at that level or by an ancestor. Even though cgroupv2 does not require child quota to be less than or equal to that of its ancestors, the task group will still be constrained by such a quota, so this should be shown here. Cgroupv1 continues to set this correctly.
In both cases, add initialization when a new task group is created based on the current parent's value (or RUNTIME_INF in the case of root_task_group). Otherwise, the field is wrong until a quota is changed after creation and __cfs_schedulable() is called.
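A small sketch of the initialization described above (illustrative only; the real code lives in the bandwidth init path and uses the kernel's RUNTIME_INF constant):

  /* A new task group starts from its parent's hierarchical_quota; the root
   * task group has no parent and starts unlimited (-1 stands in for
   * RUNTIME_INF here). */
  static long long init_hierarchical_quota(int has_parent, long long parent_quota)
  {
          return has_parent ? parent_quota : -1LL;
  }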
Fixes: c53593e5cb69 ("sched, cgroup: Don't reject lower cpu.max on ancestors")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230714125746.812891-1-pauld@redhat.com
# 22dc02f8 | 24-Jul-2023 | Peter Zijlstra <peterz@infradead.org>
Revert "sched/fair: Move unused stub functions to header"
Revert commit 7aa55f2a5902 ("sched/fair: Move unused stub functions to header"), for while it has the right Changelog, the actual patch content is a revert of the previous 4 patches:
  f7df852ad6db ("sched: Make task_vruntime_update() prototype visible")
  c0bdfd72fbfb ("sched/fair: Hide unused init_cfs_bandwidth() stub")
  378be384e01f ("sched: Add schedule_user() declaration")
  d55ebae3f312 ("sched: Hide unused sched_update_scaling()")
So in effect this is a revert of a revert and re-applies those patches.
Fixes: 7aa55f2a5902 ("sched/fair: Move unused stub functions to header")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Revision tags: v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32
# d07f09a1 | 31-May-2023 | Peter Zijlstra <peterz@infradead.org>
sched/fair: Propagate enqueue flags into place_entity()
This allows place_entity() to consider ENQUEUE_WAKEUP and ENQUEUE_MIGRATED.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.274010996@infradead.org
# e4ec3318 | 31-May-2023 | Peter Zijlstra <peterz@infradead.org>
sched/debug: Rename sysctl_sched_min_granularity to sysctl_sched_base_slice
EEVDF uses this tunable as the base request/slice -- make sure the name reflects this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.205287511@infradead.org
# 5e963f2b | 31-May-2023 | Peter Zijlstra <peterz@infradead.org>
sched/fair: Commit to EEVDF
EEVDF is a better defined scheduling policy; as a result it has fewer heuristics/tunables. There is no compelling reason to keep CFS around.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.137187212@infradead.org
# 147f3efa | 31-May-2023 | Peter Zijlstra <peterz@infradead.org>
sched/fair: Implement an EEVDF-like scheduling policy
CFS is currently a WFQ based scheduler with only a single knob, the weight. The addition of a second, latency oriented parameter makes something like WF2Q or EEVDF based scheduling a much better fit.
Specifically, EEVDF does EDF like scheduling in the left half of the tree -- those entities that are owed service. Except because this is a virtual time scheduler, the deadlines are in virtual time as well, which is what allows over-subscription.
EEVDF has two parameters:
- weight, or time-slope: which is mapped to nice just as before
- request size, or slice length: which is used to compute the virtual deadline as: vd_i = ve_i + r_i/w_i
Basically, by setting a smaller slice, the deadline will be earlier and the task will be more eligible and ran earlier.
Tick driven preemption is driven by request/slice completion; while wakeup preemption is driven by the deadline.
Because the tree is now effectively an interval tree, and the selection is no longer 'leftmost', over-scheduling is less of a problem.
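To make the deadline rule above concrete, a toy sketch (plain integer arithmetic is an assumption made for readability; the kernel uses fixed-point, NICE_0_LOAD-scaled weights):

  /* vd_i = ve_i + r_i / w_i: a smaller requested slice gives an earlier
   * virtual deadline, so the task is picked sooner once it is eligible. */
  static unsigned long long eevdf_vdeadline(unsigned long long ve,
                                            unsigned long long slice,
                                            unsigned long weight)
  {
          /* weight is assumed non-zero */
          return ve + slice / weight;
  }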
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org
# af4cf404 | 31-May-2023 | Peter Zijlstra <peterz@infradead.org>
sched/fair: Add cfs_rq::avg_vruntime
In order to move to an eligibility based scheduling policy, we need to have a better approximation of the ideal scheduler.
Specifically, for a virtual time weighted fair queueing based scheduler the ideal scheduler will be the weighted average of the individual virtual runtimes (math in the comment).
As such, compute the weighted average to approximate the ideal scheduler -- note that the approximation is in the individual task behaviour, which isn't strictly conformant.
Specifically consider adding a task with a vruntime left of center, in this case the average will move backwards in time -- something the ideal scheduler would of course never do.
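A compact sketch of the averaging described above (not the kernel's incremental implementation; this is just the plain load-weighted average used as the approximation of the ideal scheduler):

  struct ent { unsigned long weight; long long vruntime; };

  /* Weighted average of the entities' virtual runtimes. */
  static long long avg_vruntime(const struct ent *e, int nr)
  {
          long long wsum = 0, vsum = 0;

          for (int i = 0; i < nr; i++) {
                  wsum += e[i].weight;
                  vsum += (long long)e[i].weight * e[i].vruntime;
          }
          return wsum ? vsum / wsum : 0;
  }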
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.654144274@infradead.org
Revision tags: v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16
# ab83f455 | 08-Mar-2023 | Peter Oskolkov <posk@google.com>
sched: add WF_CURRENT_CPU and externise ttwu
Add WF_CURRENT_CPU wake flag that advises the scheduler to move the wakee to the current CPU. This is useful for fast on-CPU context switching use cases.
In addition, make ttwu external rather than static so that the flag could be passed to it from outside of sched/core.c.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-3-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
# d24cb0d9 | 07-Jul-2023 | Tim C Chen <tim.c.chen@linux.intel.com>
sched/topology: Record number of cores in sched group
When balancing sibling domains that have different numbers of cores, the tasks in the respective sibling domains should be proportional to the number of cores in each domain. In preparation for implementing such a policy, record the number of cores in a scheduling group.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/04641eeb0e95c21224352f5743ecb93dfac44654.1688770494.git.tim.c.chen@linux.intel.com
# 677ea015 | 20-Jun-2023 | Josh Don <joshdon@google.com>
sched: add throttled time stat for throttled children
We currently export the total throttled time for cgroups that are given a bandwidth limit. This patch extends this accounting to also account the total time that each child cgroup has been throttled.
This is useful to understand the degree to which children have been affected by the throttling control. Children which are not runnable during the entire throttled period, for example, will not show any self-throttling time during this period.
Expose this in a new interface, 'cpu.stat.local', which is similar to how non-hierarchical events are accounted in 'memory.events.local'.
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230620183247.737942-2-joshdon@google.com
# ebb83d84 | 13-Jun-2023 | Hao Jia <jiahao.os@bytedance.com>
sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()
After commit 8ad075c2eb1f ("sched: Async unthrottling for cfs bandwidth"), we may update the rq clock multiple times in the loop of __cfsb_csd_unthrottle().
A prior (although less common) instance of this problem exists in unthrottle_offline_cfs_rqs().
Cure both by ensuring update_rq_clock() is called before the loop and setting RQCF_ACT_SKIP during the loop, to suppress further updates. The alternative would be pulling update_rq_clock() out of unthrottle_cfs_rq(), but that gives an even bigger mess.
Fixes: 8ad075c2eb1f ("sched: Async unthrottling for cfs bandwidth")
Reviewed-By: Ben Segall <bsegall@google.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20230613082012.49615-4-jiahao.os@bytedance.com
# 6a9d623a | 30-May-2023 | Vineeth Pillai <vineeth@bitbyteword.org>
sched/deadline: Fix bandwidth reclaim equation in GRUB
According to the GRUB[1] rule, the runtime is depreciated as: "dq = -max{u, (1 - Uinact - Uextra)} dt" (1)
To guarantee that deadline tasks don't starve lower class tasks, we do not allocate the full bandwidth of the cpu to deadline tasks. Maximum bandwidth usable by deadline tasks is denoted by "Umax". Considering Umax, equation (1) becomes: "dq = -(max{u, (Umax - Uinact - Uextra)} / Umax) dt" (2)
Current implementation has a minor bug in equation (2), which this patch fixes.
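For clarity, a toy sketch of the corrected depreciation step from equation (2) (floating point and names are illustrative; the kernel works with scaled fixed-point bandwidth values):

  /* dq = -(max{u, (Umax - Uinact - Uextra)} / Umax) dt */
  static double grub_dq(double dt, double u, double u_max,
                        double u_inact, double u_extra)
  {
          double grant = u_max - u_inact - u_extra;

          if (u > grant)
                  grant = u;
          return -(grant / u_max) * dt;
  }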
The reclamation logic is verified by a sample program which creates multiple deadline threads and observing their utilization. The tests were run on an isolated cpu(isolcpus=3) on a 4 cpu system.
Tests on 6.3.0
==============

RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95%
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.33
TID[693]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 93.35

RUN 2: runtime=1ms, deadline=period=100ms, RT capacity = 95%
TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69
TID[708]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 16.69

RUN 3: 2 tasks
Task 1: runtime=1ms, deadline=period=10ms
Task 2: runtime=1ms, deadline=period=100ms
TID[631]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 62.67
TID[632]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 6.37
TID[631]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 62.38
TID[632]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 6.23
As seen above, the reclamation doesn't reclaim the maximum allowed bandwidth and as the bandwidth of tasks gets smaller, the reclaimed bandwidth also comes down.
Tests with this patch applied
=============================

RUN 1: runtime=7ms, deadline=period=10ms, RT capacity = 95%
TID[608]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.19
TID[608]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.16

RUN 2: runtime=1ms, deadline=period=100ms, RT capacity = 95%
TID[616]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.27
TID[616]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.21

RUN 3: 2 tasks
Task 1: runtime=1ms, deadline=period=10ms
Task 2: runtime=1ms, deadline=period=100ms
TID[620]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 86.64
TID[621]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 8.66
TID[620]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 86.45
TID[621]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 8.73

Running tasks on all cpus allowing for migration also showed that the utilization is reclaimed to the maximum. Running 10 tasks on 3 cpus SCHED_FLAG_RECLAIM - top shows:
%Cpu0 : 94.6 us, 0.0 sy, 0.0 ni, 5.4 id, 0.0 wa
%Cpu1 : 95.2 us, 0.0 sy, 0.0 ni, 4.8 id, 0.0 wa
%Cpu2 : 95.8 us, 0.0 sy, 0.0 ni, 4.2 id, 0.0 wa
[1]: Abeni, Luca & Lipari, Giuseppe & Parri, Andrea & Sun, Youcheng. (2015). Parallel and sequential reclaiming in multicore real-time global scheduling.
Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20230530135526.2385378-1-vineeth@bitbyteword.org
# 7d0583cf | 15-May-2023 | Dietmar Eggemann <dietmar.eggemann@arm.com>
sched/fair, cpufreq: Introduce 'runnable boosting'
The responsiveness of the Per Entity Load Tracking (PELT) util_avg in mobile devices is still considered too low for utilization changes during task ramp-up.
In Android this manifests in the fact that the first frames of a UI activity are very prone to be jankframes (a frame which doesn't meet the required frame rendering time, e.g. 16ms@60Hz) since the CPU frequency is normally low at this point and has to ramp up quickly.
The beginning of a UI activity is also characterized by the occurrence of CPU contention, especially on little CPUs. Current little CPUs can have an original CPU capacity of only ~150, which means that the actual CPU capacity at lower frequency can even be much smaller.
Schedutil maps CPU util_avg into CPU frequency request via:
  util = effective_cpu_util(..., cpu_util_cfs(cpu), ...)
    -> util = map_util_perf(util)
    -> freq = map_util_freq(util, ...)
CPU contention for CFS tasks can be detected by 'CPU runnable > CPU utilization' in cpu_util_cfs_boost() -> cpu_util(..., boost = 1). Schedutil uses 'runnable boosting' by calling cpu_util_cfs_boost().
To be in sync with schedutil's CPU frequency selection, Energy Aware Scheduling (EAS) also calls cpu_util(..., boost = 1) during max util detection.
Moreover, 'runnable boosting' is also used in load-balance for busiest CPU selection when the migration type is 'migrate_util', i.e. only at sched domains which don't have the SD_SHARE_PKG_RESOURCES flag set.
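A rough sketch of the boosting idea (function and parameter names are illustrative, not the exact kernel API):

  /* When boosting is requested and runnable exceeds utilization, i.e. tasks
   * are contending for the CPU, report the larger runnable signal so the
   * frequency/energy estimates ramp the CPU up sooner. */
  static unsigned long cpu_util_boosted(unsigned long util_avg,
                                        unsigned long runnable_avg, int boost)
  {
          if (boost && runnable_avg > util_avg)
                  return runnable_avg;
          return util_avg;
  }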
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230515115735.296329-3-dietmar.eggemann@arm.com
# 3eb6d6ec | 15-May-2023 | Dietmar Eggemann <dietmar.eggemann@arm.com>
sched/fair: Refactor CPU utilization functions
There is a lot of code duplication in cpu_util_next() & cpu_util_cfs().
Remove this by allowing cpu_util_next() to be called with p = NULL. Rename cpu_util_next() to cpu_util() since the '_next' suffix is no longer necessary to distinguish cpu utilization related functions. Implement cpu_util_cfs(cpu) as cpu_util(cpu, p = NULL, -1).
This allows future cpu util related changes to be coded in only one place, namely in cpu_util().
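A sketch of the resulting call shape (signatures abbreviated and assumed, not copied from the tree):

  #include <stddef.h>

  struct task_struct;
  unsigned long cpu_util(int cpu, struct task_struct *p, int dst_cpu);

  /* cpu_util_cfs() becomes a thin wrapper: "no task, no destination CPU". */
  static inline unsigned long cpu_util_cfs(int cpu)
  {
          return cpu_util(cpu, NULL, -1);
  }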
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230515115735.296329-2-dietmar.eggemann@arm.com
# 3f4bf7aa | 24-May-2023 | Miaohe Lin <linmiaohe@huawei.com>
sched/deadline: remove unused dl_bandwidth
The default deadline bandwidth control structure has been removed since commit eb77cf1c151c ("sched/deadline: Remove unused def_dl_bandwidth") leading to unused init_dl_bandwidth() and struct dl_bandwidth. Remove them to clean up the code.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20230524102514.407486-1-linmiaohe@huawei.com
# 7aa55f2a | 22-May-2023 | Arnd Bergmann <arnd@arndb.de>
sched/fair: Move unused stub functions to header
These four functions have a normal definition for CONFIG_FAIR_GROUP_SCHED, and an empty one that is only referenced when FAIR_GROUP_SCHED is disabled but CGROUP_SCHED is still enabled. If both are turned off, the functions are still defined but the missing prototype causes a W=1 warning:
  kernel/sched/fair.c:12544:6: error: no previous prototype for 'free_fair_sched_group'
  kernel/sched/fair.c:12546:5: error: no previous prototype for 'alloc_fair_sched_group'
  kernel/sched/fair.c:12553:6: error: no previous prototype for 'online_fair_sched_group'
  kernel/sched/fair.c:12555:6: error: no previous prototype for 'unregister_fair_sched_group'
Move the alternatives into the header as static inline functions with the correct combination of #ifdef checks to avoid the warning without adding even more complexity.
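As an illustration of the pattern (simplified to a single function and a single #ifdef; the real change covers all four functions and both config options):

  struct task_group;

  #ifdef CONFIG_FAIR_GROUP_SCHED
  void free_fair_sched_group(struct task_group *tg);
  #else
  /* Stub visible to every includer, so no out-of-line definition is left
   * without a prototype when the feature is compiled out. */
  static inline void free_fair_sched_group(struct task_group *tg) { }
  #endif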
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230522195021.3456768-6-arnd@kernel.org
# f7df852a | 22-May-2023 | Arnd Bergmann <arnd@arndb.de>
sched: Make task_vruntime_update() prototype visible
Having the prototype next to the caller but not visible to the callee causes a W=1 warning:
kernel/sched/fair.c:11985:6: error: no previous prototype for 'task_vruntime_update' [-Werror=missing-prototypes]
Move this to a header, as we do for all other function declarations.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20230522195021.3456768-5-arnd@kernel.org