Searched hist:d03b4817436286eeaa1ecb3606064fca31095fd2 (Results 1 – 4 of 4) sorted by relevance
/openbmc/linux/kernel/sched/
rt.c | diff d03b4817436286eeaa1ecb3606064fca31095fd2 Tue Oct 10 13:57:39 CDT 2023 Peter Zijlstra <peterz@infradead.org> sched: Fix stop_one_cpu_nowait() vs hotplug
[ Upstream commit f0498d2a54e7966ce23cd7c7ff42c64fa0059b07 ]
Kuyo reported sporadic failures on a sched_setaffinity() vs CPU hotplug stress-test -- notably affine_move_task() remains stuck in wait_for_completion(), leading to a hung-task detector warning.
Specifically, it was reported that stop_one_cpu_nowait(.fn = migration_cpu_stop) returns false -- this stopper is responsible for the matching complete().
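For context, stop_one_cpu_nowait() reports whether the work was actually queued, and that decision hinges on stopper->enabled, which stop_machine_park() clears during cpu-down. A simplified sketch of the queueing path, paraphrasing kernel/stop_machine.c rather than quoting it (locking and wakeup details may differ from the exact upstream code):

	static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
	{
		struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
		DEFINE_WAKE_Q(wakeq);
		unsigned long flags;
		bool enabled;

		raw_spin_lock_irqsave(&stopper->lock, flags);
		enabled = stopper->enabled;	/* cleared by stop_machine_park() */
		if (enabled)
			__cpu_stop_queue_work(stopper, work, &wakeq);
		raw_spin_unlock_irqrestore(&stopper->lock, flags);

		wake_up_q(&wakeq);

		/*
		 * false: the work was dropped, so migration_cpu_stop() never
		 * runs and the matching complete() never happens.
		 */
		return enabled;
	}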
The race scenario is:
	CPU0                                    CPU1

	                                        // doing _cpu_down()

  __set_cpus_allowed_ptr()
    task_rq_lock();
	                                        takedown_cpu()
	                                          stop_machine_cpuslocked(take_cpu_down..)

	                                        <PREEMPT: cpu_stopper_thread()
	                                          MULTI_STOP_PREPARE
	                                          ...
    __set_cpus_allowed_ptr_locked()
      affine_move_task()
        task_rq_unlock();

  <PREEMPT: cpu_stopper_thread()\>
    ack_state()
	                                          MULTI_STOP_RUN
	                                            take_cpu_down()
	                                              __cpu_disable();
	                                              stop_machine_park();
	                                                stopper->enabled = false;
	                                         />
   />
	stop_one_cpu_nowait(.fn = migration_cpu_stop);
	  if (stopper->enabled) // false!!!
That is, by doing stop_one_cpu_nowait() after dropping rq-lock, the stopper thread gets a chance to preempt and allows the cpu-down for the target CPU to complete.
OTOH, since stop_one_cpu_nowait() / cpu_stop_queue_work() needs to issue a wakeup, it must not be run under the scheduler locks.
Solve this apparent contradiction by keeping preemption disabled over the unlock + queue_stopper combination:
	preempt_disable();
	task_rq_unlock(...);
	if (!stop_pending)
	  stop_one_cpu_nowait(...)
	preempt_enable();
This respects the lock ordering constraints while still avoiding the above race. That is, if we find the CPU is online under rq-lock, the targeted stop_one_cpu_nowait() must succeed.
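Concretely, the fixed ordering in affine_move_task() plausibly reads as follows (a sketch paraphrased from the patch; surrounding context, error paths, and the pending bookkeeping are elided):

	/*
	 * The target CPU was seen online while rq->lock was held. With
	 * preemption disabled, the local stopper thread cannot run, so the
	 * in-flight stop_machine() for cpu-down cannot make progress and
	 * the target cannot reach stop_machine_park() before the work is
	 * queued below.
	 */
	preempt_disable();
	task_rq_unlock(rq, p, rf);
	if (!stop_pending) {
		stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
				    &pending->arg, &pending->stop_work);
	}
	preempt_enable();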
Apply this pattern to all similar stop_one_cpu_nowait() invocations.
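For instance, in the fair.c active load balance path the same shape presumably wraps the busiest-rq unlock (a sketch in the style of kernel/sched/fair.c; the exact hunk may differ):

	preempt_disable();
	raw_spin_rq_unlock_irqrestore(busiest, flags);
	if (active_balance) {
		stop_one_cpu_nowait(cpu_of(busiest),
				    active_load_balance_cpu_stop, busiest,
				    &busiest->active_balance_work);
	}
	preempt_enable();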
Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Reported-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Kuyo Chang (張建文)" <Kuyo.Chang@mediatek.com>
Link: https://lkml.kernel.org/r/20231010200442.GA16515@noisy.programming.kicks-ass.net
Signed-off-by: Sasha Levin <sashal@kernel.org>
deadline.c | diff d03b4817436286eeaa1ecb3606064fca31095fd2 Tue Oct 10 13:57:39 CDT 2023 Peter Zijlstra <peterz@infradead.org> sched: Fix stop_one_cpu_nowait() vs hotplug
fair.c | diff d03b4817436286eeaa1ecb3606064fca31095fd2 Tue Oct 10 13:57:39 CDT 2023 Peter Zijlstra <peterz@infradead.org> sched: Fix stop_one_cpu_nowait() vs hotplug
core.c | diff d03b4817436286eeaa1ecb3606064fca31095fd2 Tue Oct 10 13:57:39 CDT 2023 Peter Zijlstra <peterz@infradead.org> sched: Fix stop_one_cpu_nowait() vs hotplug