# 07edfece | 01-Dec-2021 | Frederic Weisbecker <frederic@kernel.org>
workqueue: Fix unbind_workers() VS wq_worker_running() race
At CPU-hotplug time, unbind_workers() may preempt a worker while it is waking up. In that case the following scenario can happen:
    unbind_workers()                      wq_worker_running()
    ----------------                      -------------------
                                          if (!(worker->flags & WORKER_NOT_RUNNING))
                                              //PREEMPTED by unbind_workers
    worker->flags |= WORKER_UNBOUND;
    [...]
    atomic_set(&pool->nr_running, 0);
                                          //resume to worker
                                          atomic_inc(&worker->pool->nr_running);
After unbind_workers() resets pool->nr_running, the value is expected to remain 0 until the pool gets rebound, should cpu_up() ever be called on the target CPU in the future. But here the race leaves pool->nr_running with a value of 1, triggering the following warning when the worker goes idle:
WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0
Modules linked in:
CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: 0x0 (rcu_par_gp)
RIP: 0010:worker_enter_idle+0x95/0xc0
Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93 0
RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086
RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140
RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080
R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20
R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140
FS:  0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 worker_thread+0x89/0x3d0
 ? process_one_work+0x400/0x400
 kthread+0x162/0x190
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
 </TASK>
Also, due to this incorrect "nr_running == 1", further queued work may end up not being served, because no worker is woken at work insertion time. This causes rcutorture writer stalls, for example.
Fix this by disabling preemption in the right place in wq_worker_running().
It's worth noting that if the worker migrates and runs concurrently with unbind_workers(), it is guaranteed to see the WORKER_UNBOUND flag update due to set_cpus_allowed_ptr() acquiring/releasing rq->lock.
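A minimal sketch of the fixed wq_worker_running(), assuming the 5.16-era worker/pool layout (field names taken from the scenario above; the exact hunk may differ):

    void wq_worker_running(struct task_struct *task)
    {
            struct worker *worker = kthread_data(task);

            if (!worker->sleeping)
                    return;

            /*
             * Disable preemption so unbind_workers() cannot run between the
             * WORKER_NOT_RUNNING test and the nr_running increment and reset
             * pool->nr_running to 0 inside that window.
             */
            preempt_disable();
            if (!(worker->flags & WORKER_NOT_RUNNING))
                    atomic_inc(&worker->pool->nr_running);
            preempt_enable();
            worker->sleeping = 0;
    }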
Fixes: 6d25be5782e4 ("sched/core, workqueues: Distangle worker accounting from rq lock")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Revision tags: v5.15.6
# 443378f0 | 30-Nov-2021 | Paul E. McKenney <paulmck@kernel.org>
workqueue: Upgrade queue_work_on() comment
The current queue_work_on() docbook comment says that the caller must ensure that the specified CPU can't go away, but does not spell out the consequences, which turn out to be quite mild. Therefore expand this comment to explicitly say that the penalty for failing to nail down the specified CPU is that the workqueue handler might find itself executing on some other CPU.
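A self-contained usage sketch of the documented contract; my_work_fn() and queue_on_cpu() are hypothetical names:

    #include <linux/cpu.h>
    #include <linux/workqueue.h>

    static void my_work_fn(struct work_struct *work)
    {
            /* Runs on the requested CPU only if that CPU stayed online. */
    }
    static DECLARE_WORK(my_work, my_work_fn);

    static void queue_on_cpu(int cpu)
    {
            cpus_read_lock();       /* pin hotplug so @cpu cannot go away */
            if (cpu_online(cpu))
                    queue_work_on(cpu, system_wq, &my_work);
            cpus_read_unlock();
    }

Without the cpus_read_lock() pinning, the only penalty is the mild one spelled out above: the handler may execute on some other CPU.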
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Revision tags: v5.15.5, v5.15.4, v5.15.3, v5.15.2, v5.15.1
# f70da745 | 05-Nov-2021 | Marco Elver <elver@google.com>
workqueue, kasan: avoid alloc_pages() when recording stack
Shuah Khan reported:
| When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
| kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
| it tries to allocate memory attempting to acquire spinlock in page
| allocation code while holding workqueue pool raw_spinlock.
|
| There are several instances of this problem when block layer tries
| to __queue_work(). Call trace from one of these instances is below:
|
|     kblockd_mod_delayed_work_on()
|     mod_delayed_work_on()
|     __queue_delayed_work()
|     __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
|     insert_work()
|     kasan_record_aux_stack()
|     kasan_save_stack()
|     stack_depot_save()
|     alloc_pages()
|     __alloc_pages()
|     get_page_from_freelist()
|     rm_queue()
|     rm_queue_pcplist()
|     local_lock_irqsave(&pagesets.lock, flags);
|     [ BUG: Invalid wait context triggered ]
The default kasan_record_aux_stack() calls stack_depot_save() with GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...). In general, however, it is not possible to use either GFP_ATOMIC or GFP_NOWAIT in certain non-preemptive contexts, including under raw_spin_locks (see gfp.h and commit ab00db216c9c7).
Fix it by instructing stackdepot not to expand the stack storage via alloc_pages() in case it runs out, by using kasan_record_aux_stack_noalloc().
While there is an increased risk of failing to insert the stack trace, this is typically unlikely, especially if the same insertion had already succeeded previously (stack depot hit).
For frequent calls from the same location, it therefore becomes extremely unlikely that kasan_record_aux_stack_noalloc() fails.
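A sketch of the substitution at the call site named in the trace above (insert_work() in kernel/workqueue.c); the surrounding hunk is abbreviated:

    	/* with pool->lock (a raw spinlock) held */
    -	kasan_record_aux_stack(work);
    +	kasan_record_aux_stack_noalloc(work);	/* never calls alloc_pages() */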
Link: https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
Link: https://lkml.kernel.org/r/20210913112609.2651084-7-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Shuah Khan <skhan@linuxfoundation.org>
Tested-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Taras Madan <tarasmadan@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revision tags: v5.15, v5.14.14
# 55df0933 | 19-Oct-2021 | Imran Khan <imran.f.khan@oracle.com>
workqueue: Introduce show_one_worker_pool and show_one_workqueue.
Currently, show_workqueue_state() shows the state of all workqueues and of all worker pools. In certain cases we may need to dump the state of only a specific workqueue or worker pool. For example, destroy_workqueue() only needs to show the state of the workqueue that is getting destroyed.
So rename show_workqueue_state() to show_all_workqueues() (to signify that it dumps the state of all busy workqueues) and divide it into more granular functions (show_one_workqueue() and show_one_worker_pool()) that show the state of individual workqueues and worker pools and can be used in cases such as the one mentioned above.
Also, as mentioned earlier, make destroy_workqueue() dump data pertaining only to the workqueue being destroyed, and make users of the earlier interface (show_workqueue_state()) use the new interface (show_all_workqueues()).
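A sketch of the resulting interface, inferred from the description above (the exact prototypes are assumptions):

    void show_all_workqueues(void);    /* renamed show_workqueue_state() */
    void show_one_workqueue(struct workqueue_struct *wq);
    void show_one_worker_pool(struct worker_pool *pool);

destroy_workqueue() can then call show_one_workqueue(wq) on just the workqueue it is tearing down.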
Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
# d25302e4 | 17-Oct-2021 | Menglong Dong <imagedong@tencent.com>
workqueue: make sysfs of unbound kworker cpumask more clever
Some unfriendly components, such as dpdk, write the same mask to the unbound kworker cpumask again and again. Every write to this interface queues work on CPUs, even when the mask is the same as the original one.
So fix it by returning success and doing nothing if the new cpumask is equal to the old one.
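A sketch of the early return, assuming it sits in the handler behind the unbound cpumask sysfs attribute:

    /* nothing changed: report success without requeueing any work */
    if (cpumask_equal(cpumask, wq_unbound_cpumask)) {
            ret = 0;
            goto out_unlock;
    }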
Signed-off-by: Mengen Sun <mengensun@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
# c71ec39b | 28-Jul-2022 | Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
workqueue: don't skip lockdep work dependency in cancel_work_sync()
[ Upstream commit c0feea594e058223973db94c1c32a830c9807c86 ]
Like Hillf Danton mentioned
    syzbot should have been able to catch cancel_work_sync() in work
    context by checking lockdep_map in __flush_work() for both flush and
    cancel.
in [1], being unable to report the obvious deadlock scenario shown below is a regression. From a locking-dependency perspective, the sync version of a cancel request should behave like a flush request, for it waits for completion of the work if that work has already started executing.
----------
#include <linux/module.h>
#include <linux/sched.h>

static DEFINE_MUTEX(mutex);

static void work_fn(struct work_struct *work)
{
        schedule_timeout_uninterruptible(HZ / 5);
        mutex_lock(&mutex);
        mutex_unlock(&mutex);
}
static DECLARE_WORK(work, work_fn);

static int __init test_init(void)
{
        schedule_work(&work);
        schedule_timeout_uninterruptible(HZ / 10);
        mutex_lock(&mutex);
        cancel_work_sync(&work);
        mutex_unlock(&mutex);
        return -EINVAL;
}
module_init(test_init);
MODULE_LICENSE("GPL");
----------
The check this patch restores was added by commit 0976dfc1d0cd80a4 ("workqueue: Catch more locking problems with flush_work()").
Then, lockdep's crossrelease feature was added by commit b09be676e0ff25bd ("locking/lockdep: Implement the 'crossrelease' feature"). As a result, this check was once removed by commit fd1a5b04dfb899f8 ("workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes").
But lockdep's crossrelease feature was removed by commit e966eaeeb623f099 ("locking/lockdep: Remove the cross-release locking checks"). At this point, this check should have been restored.
Then, commit d6e89786bed977f3 ("workqueue: skip lockdep wq dependency in cancel_work_sync()") introduced a boolean flag in order to distinguish flush_work() and cancel_work_sync(), for checking "struct workqueue_struct" dependency when called from cancel_work_sync() was causing false positives.
Then, commit 87915adc3f0acdf0 ("workqueue: re-add lockdep dependencies for flushing") tried to restore the "struct work_struct" dependency check, but mistakenly checked this boolean flag. As the example shown above indicates, the "struct work_struct" dependency needs to be checked for both flush_work() and cancel_work_sync().
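A sketch of the restored hunk in __flush_work(): the work item's own lockdep_map is acquired and released for both flush and cancel, while only the workqueue_struct dependency stays conditional on the boolean flag:

    -	if (!from_cancel) {
    -		lock_map_acquire(&work->lockdep_map);
    -		lock_map_release(&work->lockdep_map);
    -	}
    +	lock_map_acquire(&work->lockdep_map);
    +	lock_map_release(&work->lockdep_map);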
Link: https://lkml.kernel.org/r/20220504044800.4966-1-hdanton@sina.com [1]
Reported-by: Hillf Danton <hdanton@sina.com>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Fixes: 87915adc3f0acdf0 ("workqueue: re-add lockdep dependencies for flushing")
Cc: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
# cf5b6bd2 | 01-Dec-2021 | Frederic Weisbecker <frederic@kernel.org>
workqueue: Fix unbind_workers() VS wq_worker_running() race
commit 07edfece8bcb0580a1828d939e6f8d91a8603eb2 upstream.
At CPU-hotplug time, unbind_workers() may preempt a worker while it is waking up. In that case the following scenario can happen:
    unbind_workers()                      wq_worker_running()
    ----------------                      -------------------
                                          if (!(worker->flags & WORKER_NOT_RUNNING))
                                              //PREEMPTED by unbind_workers
    worker->flags |= WORKER_UNBOUND;
    [...]
    atomic_set(&pool->nr_running, 0);
                                          //resume to worker
                                          atomic_inc(&worker->pool->nr_running);
After unbind_workers() resets pool->nr_running, the value is expected to remain 0 until the pool gets rebound, should cpu_up() ever be called on the target CPU in the future. But here the race leaves pool->nr_running with a value of 1, triggering the following warning when the worker goes idle:
WARNING: CPU: 3 PID: 34 at kernel/workqueue.c:1823 worker_enter_idle+0x95/0xc0
Modules linked in:
CPU: 3 PID: 34 Comm: kworker/3:0 Not tainted 5.16.0-rc1+ #34
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
Workqueue: 0x0 (rcu_par_gp)
RIP: 0010:worker_enter_idle+0x95/0xc0
Code: 04 85 f8 ff ff ff 39 c1 7f 09 48 8b 43 50 48 85 c0 74 1b 83 e2 04 75 99 8b 43 34 39 43 30 75 91 8b 83 00 03 00 00 85 c0 74 87 <0f> 0b 5b c3 48 8b 35 70 f1 37 01 48 8d 7b 48 48 81 c6 e0 93 0
RSP: 0000:ffff9b7680277ed0 EFLAGS: 00010086
RAX: 00000000ffffffff RBX: ffff93465eae9c00 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff9346418a0000 RDI: ffff934641057140
RBP: ffff934641057170 R08: 0000000000000001 R09: ffff9346418a0080
R10: ffff9b768027fdf0 R11: 0000000000002400 R12: ffff93465eae9c20
R13: ffff93465eae9c20 R14: ffff93465eae9c70 R15: ffff934641057140
FS:  0000000000000000(0000) GS:ffff93465eac0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000001cc0c000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 worker_thread+0x89/0x3d0
 ? process_one_work+0x400/0x400
 kthread+0x162/0x190
 ? set_kthread_struct+0x40/0x40
 ret_from_fork+0x22/0x30
 </TASK>
Also, due to this incorrect "nr_running == 1", further queued work may end up not being served, because no worker is woken at work insertion time. This causes rcutorture writer stalls, for example.
Fix this by disabling preemption in the right place in wq_worker_running().
It's worth noting that if the worker migrates and runs concurrently with unbind_workers(), it is guaranteed to see the WORKER_UNBOUND flag update due to set_cpus_allowed_ptr() acquiring/releasing rq->lock.
Fixes: 6d25be5782e4 ("sched/core, workqueues: Distangle worker accounting from rq lock")
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
# b09a201b | 17-Oct-2021 | Menglong Dong <imagedong@tencent.com>
workqueue: make sysfs of unbound kworker cpumask more clever
[ Upstream commit d25302e46592c97d29f70ccb1be558df31a9a360 ]
Some unfriendly components, such as dpdk, write the same mask to the unbound kworker cpumask again and again. Every write to this interface queues work on CPUs, even when the mask is the same as the original one.
So fix it by returning success and doing nothing if the new cpumask is equal to the old one.
Signed-off-by: Mengen Sun <mengensun@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Revision tags: v5.14.13, v5.14.12, v5.14.11, v5.14.10
# 57116ce1 | 06-Oct-2021 | Johan Hovold <johan@kernel.org>
workqueue: fix state-dump console deadlock
Console drivers often queue work while holding locks also taken in their console write paths, something which can lead to deadlocks on SMP when dumping workqueue state (e.g. sysrq-t or on suspend failures).
For serial console drivers this could look like:
    CPU0                      CPU1
    ----                      ----

    show_workqueue_state();
      lock(&pool->lock);
                              <IRQ>
                                lock(&port->lock);
                                schedule_work();
                                  lock(&pool->lock);
      printk();
        lock(console_owner);
        lock(&port->lock);
where workqueues are, for example, used to push data to the line discipline, process break signals and handle modem-status changes. Line disciplines and serdev drivers can also queue work on write-wakeup notifications, etc.
Reworking every console driver to avoid queuing work while holding locks also taken in their write paths would complicate drivers and is neither desirable nor feasible.
Instead use the deferred-printk mechanism to avoid printing while holding pool locks when dumping workqueue state.
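A sketch of the pattern, assuming show_pwq() as the pr_info()-heavy dumper (call sites abbreviated):

    raw_spin_lock_irqsave(&pool->lock, flags);
    printk_deferred_enter();        /* console output deferred past pool->lock */
    show_pwq(pwq);
    printk_deferred_exit();
    raw_spin_unlock_irqrestore(&pool->lock, flags);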
Note that there are a few WARN_ON() assertions in the workqueue code which could potentially also trigger a deadlock. Hopefully the ongoing printk rework will provide a general solution for this eventually.
This was originally reported after a lockdep splat when executing sysrq-t with the imx serial driver.
Fixes: 3494fc30846d ("workqueue: dump workqueues on sysrq-t")
Cc: stable@vger.kernel.org	# 4.0
Reported-by: Fabio Estevam <festevam@denx.de>
Tested-by: Fabio Estevam <festevam@denx.de>
Signed-off-by: Johan Hovold <johan@kernel.org>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Revision tags: v5.14.9, v5.14.8, v5.14.7, v5.14.6, v5.10.67, v5.10.66, v5.14.5, v5.14.4, v5.10.65, v5.14.3, v5.10.64, v5.14.2, v5.10.63, v5.14.1, v5.10.62, v5.14, v5.10.61, v5.10.60
# d812796e | 16-Aug-2021 | Lai Jiangshan <laijs@linux.alibaba.com>
workqueue: Assign a color to barrier work items
There was no strong reason to flush or not to flush barrier work items in flush_workqueue(). And we have to make barrier work items not participate in nr_active, so we had been using WORK_NO_COLOR for them, which also made them unflushable by flush_workqueue().
And the users of flush_workqueue() often do not intend to wait for barrier work items issued by flush_work(). That made the choice sound perfect.
But barrier work items hold a reference to an internal structure (pool_workqueue), and the worker thread(s) is/are still busy for the workqueue user while the barrier work items are not done. So it is reasonable to make flush_workqueue() also watch for flush_work(), to make it more robust.
And a problem[1] reported by Li Zhe shows that we need such robustness. The warning logs are listed below:
WARNING: CPU: 0 PID: 19336 at kernel/workqueue.c:4430 destroy_workqueue+0x11a/0x2f0
*****
destroy_workqueue: test_workqueue9 has the following busy pwq
  pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=0/1 refcnt=2
      in-flight: 5658:wq_barrier_func
Showing busy workqueues and worker pools:
*****
It shows that even after drain_workqueue() returns, the barrier work item is still in flight and the pwq (and a worker) is still busy on it.
The problem is caused by flush_workqueue() not watching flush_work():
    Thread A                                 Worker
                                             /* normal work item with linked */
                                             process_scheduled_works()
    destroy_workqueue()                        process_one_work()
      drain_workqueue()                        /* run normal work item */
                                       /--    pwq_dec_nr_in_flight()
        flush_workqueue()          <---/
                /* the last normal work item is done */
      sanity_check                             process_one_work()
                                       /--    raw_spin_unlock_irq(&pool->lock)
        raw_spin_lock_irq(&pool->lock) <-/      /* maybe preempt */
        *WARNING*                              wq_barrier_func()
                                               /* maybe preempt by cond_resched() */
Thread A can get the pool lock after the Worker unlocks the pool lock before running wq_barrier_func(). And if there is any preemption happen around wq_barrier_func(), destroy_workqueue()'s sanity check is more likely to get the lock and catch it. (Note: preemption is not necessary to cause the bug, the unlocking is enough to possibly trigger the WARNING.)
A simple solution might be just executing all linked barrier work items once without releasing pool lock after the head work item's pwq_dec_nr_in_flight(). But this solution has two problems:
1) the head work item might also be a barrier work item when the user-queued work item is cancelled. For example:

    thread 1:                       thread 2:
    queue_work(wq, &my_work)
                                    flush_work(&my_work)
    cancel_work_sync(&my_work);
    /* Neither my_work nor the barrier work is scheduled. */
    destroy_workqueue(wq);
    /* This is an easier way to catch the WARNING. */
2) there might be too much linked barrier work items and running them all once without releasing pool lock just causes trouble.
The only solution is to make flush_workqueue() also watch barrier work items. So we have to assign a color to these barrier work items, which is the color of the head (user-queued) work item.
Assigning a color doesn't cause any problem in active management, because the previous patch made barrier work items not participate in nr_active via WORK_STRUCT_INACTIVE rather than relying on the (old) WORK_NO_COLOR.
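A sketch of the idea in insert_wq_barrier(); the exact helpers for extracting the head item's color are assumptions:

    unsigned int work_flags;

    /* the barrier inherits the flush color of the work item it trails ... */
    work_flags = work_color_to_flags(get_work_color(*work_data_bits(target)));
    /* ... but still stays out of nr_active accounting */
    work_flags |= WORK_STRUCT_INACTIVE;
    insert_work(pwq, &barr->work, head, work_flags);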
[1]: https://lore.kernel.org/lkml/20210812083814.32453-1-lizhe.67@bytedance.com/
Reported-by: Li Zhe <lizhe.67@bytedance.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
# 018f3a13 | 16-Aug-2021 | Lai Jiangshan <laijs@linux.alibaba.com>
workqueue: Mark barrier work with WORK_STRUCT_INACTIVE
Currently, WORK_NO_COLOR has two meanings:
    Not participate in flushing
    Not participate in nr_active
And only non-barrier work items are marked with WORK_STRUCT_INACTIVE when they are in the inactive_works list. The barrier work items are not marked INACTIVE even when linked in the inactive_works list, since these tail items are always moved together with the head work item.
These definitions are simple, clean and practical. (Except a small blemish that only the first meaning of WORK_NO_COLOR is documented in include/linux/workqueue.h while both meanings are in workqueue.c)
But the dual-purpose WORK_NO_COLOR used for barrier work items has proven to be problematic[1]. Only the second purpose is obligatory. So we plan to make barrier work items participate in flushing but keep them still not participating in nr_active.
So the plan is to mark barrier work items inactive without using WORK_NO_COLOR in this patch so that we can assign a flushing color to them in next patch.
The reasonable way is to add or reuse a bit in work data of the work item. But adding a bit will double the size of pool_workqueue.
Currently, WORK_STRUCT_INACTIVE is only used in try_to_grab_pending() for user-queued work items, and try_to_grab_pending() can't work for barrier work items. So we extend WORK_STRUCT_INACTIVE to also mark barrier work items, no matter which list they are in, because we don't need to determine which list a barrier work item is in.
So the meaning of WORK_STRUCT_INACTIVE becomes just "the work items don't participate in nr_active" (no matter whether it is a barrier work item or a user-queued work item). And WORK_STRUCT_INACTIVE for user-queued work items means they are in inactive_works list.
This patch does it by setting WORK_STRUCT_INACTIVE for barrier work items in insert_wq_barrier() and checking WORK_STRUCT_INACTIVE first in pwq_dec_nr_in_flight(). And the meaning of WORK_NO_COLOR is reduced to only "not participating in flushing".
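A sketch of the two touch points just described (hunks abbreviated):

    /* insert_wq_barrier(): barriers never participate in nr_active */
    work_flags |= WORK_STRUCT_INACTIVE;

    /* pwq_dec_nr_in_flight(): check the INACTIVE bit first */
    if (!(work_data & WORK_STRUCT_INACTIVE))
            pwq->nr_active--;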
There is no functional change intended in this patch, because WORK_NO_COLOR+WORK_STRUCT_INACTIVE represents the previous WORK_NO_COLOR in meaning, and try_to_grab_pending() isn't used for barrier work items and so avoids being confused by this extended WORK_STRUCT_INACTIVE.
A bunch of comments for nr_active & WORK_STRUCT_INACTIVE are also added, documenting how WORK_STRUCT_INACTIVE works in nr_active management.
[1]: https://lore.kernel.org/lkml/20210812083814.32453-1-lizhe.67@bytedance.com/
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
# d21cece0 | 16-Aug-2021 | Lai Jiangshan <laijs@linux.alibaba.com>
workqueue: Change the code of calculating work_flags in insert_wq_barrier()
Add a local variable @work_flags to calculate the work flags step by step, so that we don't need to squeeze several flags into only the last line of code.
Prepare for the next patch to add a bit to the barrier work item's flags. Not squashing this into the next patch makes it clear what will have changed.
No functional change intended.
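For illustration, a sketch of the reshaped flow (conditions elided; flag names as in the surrounding patches):

    unsigned int work_flags = 0;

    /* each flag is OR'd in where its condition is decided ... */
    if (linked)                         /* hypothetical condition */
            work_flags |= WORK_STRUCT_LINKED;
    /* ... instead of being squeezed into the final call */
    work_flags |= work_color_to_flags(WORK_NO_COLOR);
    insert_work(pwq, &barr->work, head, work_flags);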
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
# c4560c2c | 16-Aug-2021 | Lai Jiangshan <laijs@linux.alibaba.com>
workqueue: Change arguement of pwq_dec_nr_in_flight()
Make pwq_dec_nr_in_flight() use work_data rather than just work_color.
Prepare for a later patch that gets the WORK_STRUCT_INACTIVE bit from work_data in pwq_dec_nr_in_flight().
No functional change intended.
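A sketch of the signature change (the matching get_work_color() adaptation is an assumption):

    -static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq, int color)
    +static void pwq_dec_nr_in_flight(struct pool_workqueue *pwq,
    +				 unsigned long work_data)
     {
    +	int color = get_work_color(work_data);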
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
# f97a4a1a | 16-Aug-2021 | Lai Jiangshan <laijs@linux.alibaba.com>
workqueue: Rename "delayed" (delayed by active management) to "inactive"
There are two kinds of "delayed" work items in the workqueue subsystem.
One kind is for timer-delayed work items, which are visible to workqueue users. The other kind is for work items delayed by active management, which are not directly visible to workqueue users. We mixed the word "delayed" for both kinds, causing some ambiguity.
This patch renames the latter kind (delayed by active management) to "inactive", because it is used for workqueue active management and most of its related symbols are named with "active" or "activate".
All "delayed" and "DELAYED" are carefully checked and renamed one by one to avoid accidentally changing the name of the other kind for timer-delayed.
No functional change intended.
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
# ffd8bea8 | 03-Aug-2021 | Sebastian Andrzej Siewior <bigeasy@linutronix.de>
workqueue: Replace deprecated CPU-hotplug functions.
The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version. The behavior remains unchanged.
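The mapping is a one-for-one substitution; a representative hunk looks like:

    -	get_online_cpus();
    +	cpus_read_lock();
    	/* ... */
    -	put_online_cpus();
    +	cpus_read_unlock();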
Cc: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
# e441b56f | 03-Aug-2021 | Zhen Lei <thunder.leizhen@huawei.com>
workqueue: Replace deprecated ida_simple_*() with ida_alloc()/ida_free()
Replace ida_simple_get() with ida_alloc() and ida_simple_remove() with ida_free(); the latter are more concise and intuitive.
In addition, if ida_alloc() fails, NULL is returned directly. This eliminates unnecessary initialization of two local variables and an 'if' check.
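A sketch of the converted pattern, assuming the worker-id allocation site in create_worker():

    int id;

    /* ida_alloc() returns a negative errno on failure */
    id = ida_alloc(&pool->worker_ida, GFP_KERNEL);
    if (id < 0)
            return NULL;
    /* ... */
    ida_free(&pool->worker_ida, id);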
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
# 67dc8325 | 30-Jul-2021 | Cai Huoqing <caihuoqing@baidu.com>
workqueue: Fix typo in comments
Fix typo:
    *assing  ==> assign
    *alloced ==> allocated
    *Retun   ==> Return
    *excute  ==> execute
v1->v2:
    *reverse 'iff'
    *update changelog
Signed-off-by: Cai Huoqing <caihuoqing@baidu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Revision tags: v5.10.53
# f728c4a9 | 21-Jul-2021 | Zhen Lei <thunder.leizhen@huawei.com>
workqueue: Fix possible memory leaks in wq_numa_init()
In the error handling branch "if (WARN_ON(node == NUMA_NO_NODE))", the previously allocated memory is not released. Performing this check before allocating memory eliminates the leak.
tj: Note that the condition only occurs when the arch code is pretty broken and the WARN_ON might as well be BUG_ON().
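A sketch of the reordering in wq_numa_init(), following the description above: the sanity check now runs before any allocation, so nothing needs unwinding:

    for_each_possible_cpu(cpu) {
            if (WARN_ON(cpu_to_node(cpu) == NUMA_NO_NODE)) {
                    pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
                    return;
            }
    }

    /* allocations happen only once the mapping is known-good */
    tbl = kcalloc(nr_node_ids, sizeof(tbl[0]), GFP_KERNEL);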
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Revision tags: v5.10.52, v5.10.51, v5.10.50
# b42b0bdd | 14-Jul-2021 | Yang Yingliang <yangyingliang@huawei.com>
workqueue: fix UAF in pwq_unbound_release_workfn()
I got a UAF report when doing fuzz testing:
[ 152.880091][ T8030] ==================================================================
[ 152.881240][ T8030] BUG: KASAN: use-after-free in pwq_unbound_release_workfn+0x50/0x190
[ 152.882442][ T8030] Read of size 4 at addr ffff88810d31bd00 by task kworker/3:2/8030
[ 152.883578][ T8030]
[ 152.883932][ T8030] CPU: 3 PID: 8030 Comm: kworker/3:2 Not tainted 5.13.0+ #249
[ 152.885014][ T8030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 152.886442][ T8030] Workqueue: events pwq_unbound_release_workfn
[ 152.887358][ T8030] Call Trace:
[ 152.887837][ T8030]  dump_stack_lvl+0x75/0x9b
[ 152.888525][ T8030]  ? pwq_unbound_release_workfn+0x50/0x190
[ 152.889371][ T8030]  print_address_description.constprop.10+0x48/0x70
[ 152.890326][ T8030]  ? pwq_unbound_release_workfn+0x50/0x190
[ 152.891163][ T8030]  ? pwq_unbound_release_workfn+0x50/0x190
[ 152.891999][ T8030]  kasan_report.cold.15+0x82/0xdb
[ 152.892740][ T8030]  ? pwq_unbound_release_workfn+0x50/0x190
[ 152.893594][ T8030]  __asan_load4+0x69/0x90
[ 152.894243][ T8030]  pwq_unbound_release_workfn+0x50/0x190
[ 152.895057][ T8030]  process_one_work+0x47b/0x890
[ 152.895778][ T8030]  worker_thread+0x5c/0x790
[ 152.896439][ T8030]  ? process_one_work+0x890/0x890
[ 152.897163][ T8030]  kthread+0x223/0x250
[ 152.897747][ T8030]  ? set_kthread_struct+0xb0/0xb0
[ 152.898471][ T8030]  ret_from_fork+0x1f/0x30
[ 152.899114][ T8030]
[ 152.899446][ T8030] Allocated by task 8884:
[ 152.900084][ T8030]  kasan_save_stack+0x21/0x50
[ 152.900769][ T8030]  __kasan_kmalloc+0x88/0xb0
[ 152.901416][ T8030]  __kmalloc+0x29c/0x460
[ 152.902014][ T8030]  alloc_workqueue+0x111/0x8e0
[ 152.902690][ T8030]  __btrfs_alloc_workqueue+0x11e/0x2a0
[ 152.903459][ T8030]  btrfs_alloc_workqueue+0x6d/0x1d0
[ 152.904198][ T8030]  scrub_workers_get+0x1e8/0x490
[ 152.904929][ T8030]  btrfs_scrub_dev+0x1b9/0x9c0
[ 152.905599][ T8030]  btrfs_ioctl+0x122c/0x4e50
[ 152.906247][ T8030]  __x64_sys_ioctl+0x137/0x190
[ 152.906916][ T8030]  do_syscall_64+0x34/0xb0
[ 152.907535][ T8030]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 152.908365][ T8030]
[ 152.908688][ T8030] Freed by task 8884:
[ 152.909243][ T8030]  kasan_save_stack+0x21/0x50
[ 152.909893][ T8030]  kasan_set_track+0x20/0x30
[ 152.910541][ T8030]  kasan_set_free_info+0x24/0x40
[ 152.911265][ T8030]  __kasan_slab_free+0xf7/0x140
[ 152.911964][ T8030]  kfree+0x9e/0x3d0
[ 152.912501][ T8030]  alloc_workqueue+0x7d7/0x8e0
[ 152.913182][ T8030]  __btrfs_alloc_workqueue+0x11e/0x2a0
[ 152.913949][ T8030]  btrfs_alloc_workqueue+0x6d/0x1d0
[ 152.914703][ T8030]  scrub_workers_get+0x1e8/0x490
[ 152.915402][ T8030]  btrfs_scrub_dev+0x1b9/0x9c0
[ 152.916077][ T8030]  btrfs_ioctl+0x122c/0x4e50
[ 152.916729][ T8030]  __x64_sys_ioctl+0x137/0x190
[ 152.917414][ T8030]  do_syscall_64+0x34/0xb0
[ 152.918034][ T8030]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 152.918872][ T8030]
[ 152.919203][ T8030] The buggy address belongs to the object at ffff88810d31bc00
[ 152.919203][ T8030]  which belongs to the cache kmalloc-512 of size 512
[ 152.921155][ T8030] The buggy address is located 256 bytes inside of
[ 152.921155][ T8030]  512-byte region [ffff88810d31bc00, ffff88810d31be00)
[ 152.922993][ T8030] The buggy address belongs to the page:
[ 152.923800][ T8030] page:ffffea000434c600 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10d318
[ 152.925249][ T8030] head:ffffea000434c600 order:2 compound_mapcount:0 compound_pincount:0
[ 152.926399][ T8030] flags: 0x57ff00000010200(slab|head|node=1|zone=2|lastcpupid=0x7ff)
[ 152.927515][ T8030] raw: 057ff00000010200 dead000000000100 dead000000000122 ffff888009c42c80
[ 152.928716][ T8030] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
[ 152.929890][ T8030] page dumped because: kasan: bad access detected
[ 152.930759][ T8030]
[ 152.931076][ T8030] Memory state around the buggy address:
[ 152.931851][ T8030]  ffff88810d31bc00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 152.932967][ T8030]  ffff88810d31bc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 152.934068][ T8030] >ffff88810d31bd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 152.935189][ T8030]                    ^
[ 152.935763][ T8030]  ffff88810d31bd80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 152.936847][ T8030]  ffff88810d31be00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 152.937940][ T8030] ==================================================================
If apply_wqattrs_prepare() fails in alloc_workqueue(), it will call put_pwq(), which invokes a work item to call pwq_unbound_release_workfn(), which in turn uses the 'wq'. The 'wq' allocated in alloc_workqueue() is freed in the error path when apply_wqattrs_prepare() fails, leading to a UAF.
    CPU0                                          CPU1
    alloc_workqueue()
      alloc_and_link_pwqs()
        apply_wqattrs_prepare() fails
        apply_wqattrs_cleanup()
          schedule_work(&pwq->unbound_release_work)
      kfree(wq)
                                                  worker_thread()
                                                    pwq_unbound_release_workfn() <- trigger uaf here
If apply_wqattrs_prepare() fails, the new pwq is not linked and doesn't hold any reference to the 'wq', so the 'wq' is invalid to access in the worker. Add a check for whether the pwq is linked to fix this.
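A sketch of the guard in pwq_unbound_release_workfn(), following the reasoning above:

    /*
     * If @pwq was never linked, it holds no reference to @wq and
     * @wq may already have been freed: don't touch it.
     */
    if (!list_empty(&pwq->pwqs_node)) {
            mutex_lock(&wq->mutex);
            list_del_rcu(&pwq->pwqs_node);
            is_last = list_empty(&wq->pwqs);
            mutex_unlock(&wq->mutex);
    }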
Fixes: 2d5f0764b526 ("workqueue: split apply_workqueue_attrs() into 3 stages")
Cc: stable@vger.kernel.org	# v4.2+
Reported-by: Hulk Robot <hulkci@huawei.com>
Suggested-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Tested-by: Pavel Skripkin <paskripkin@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Revision tags: v5.10.49, v5.13, v5.10.46, v5.10.43, v5.10.42, v5.10.41, v5.10.40, v5.10.39
# 940d71c6 | 20-May-2021 | Sergey Senozhatsky <senozhatsky@chromium.org>
wq: handle VM suspension in stall detection
If a VCPU is suspended (VM suspend) in wq_watchdog_timer_fn(), then once this VCPU resumes it will see the new jiffies value, while it may take a while before the IRQ detects PVCLOCK_GUEST_STOPPED on this VCPU and updates all the watchdogs via pvclock_touch_watchdogs(). There is a small chance of misreported WQ stalls in the meantime, because the new jiffies is time_after() the old 'ts + thresh'.
    wq_watchdog_timer_fn()
    {
            for_each_pool(pool, pi) {
                    if (time_after(jiffies, ts + thresh)) {
                            pr_emerg("BUG: workqueue lockup - pool");
                    }
            }
    }
Save jiffies at the beginning of this function and use that value for stall detection. If VM gets suspended then we continue using "old" jiffies value and old WQ touch timestamps. If IRQ at some point restarts the stall detection cycle (pvclock_touch_watchdogs()) then old jiffies will always be before new 'ts + thresh'.
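A sketch of the snapshotting, with the function's surrounding shape abbreviated:

    static void wq_watchdog_timer_fn(struct timer_list *unused)
    {
            unsigned long thresh = READ_ONCE(wq_watchdog_thresh) * HZ;
            unsigned long now = jiffies;    /* one snapshot for the whole scan */
            struct worker_pool *pool;
            int pi;

            for_each_pool(pool, pi) {
                    unsigned long ts = READ_ONCE(pool->watchdog_ts);

                    if (time_after(now, ts + thresh))
                            pr_emerg("BUG: workqueue lockup - pool ...\n");
            }
    }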
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Revision tags: v5.4.119, v5.10.36, v5.10.35, v5.10.34, v5.4.116, v5.10.33, v5.12, v5.10.32, v5.10.31, v5.10.30
# 98173112 | 08-Apr-2021 | Sami Tolvanen <samitolvanen@google.com>
workqueue: use WARN_ON_FUNCTION_MISMATCH
With CONFIG_CFI_CLANG, a callback function passed to __queue_delayed_work from a module points to a jump table entry defined in the module instead of the one used in the core kernel, which breaks function address equality in this check:
WARN_ON_ONCE(timer->function != delayed_work_timer_fn);
Use WARN_ON_FUNCTION_MISMATCH() instead to disable the warning when CFI and modules are both enabled.
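The check becomes, as a one-line sketch:

    -	WARN_ON_ONCE(timer->function != delayed_work_timer_fn);
    +	WARN_ON_FUNCTION_MISMATCH(timer->function, delayed_work_timer_fn);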
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-6-samitolvanen@google.com
Revision tags: v5.10.27, v5.10.26
# 89e28ce6 | 24-Mar-2021 | Wang Qing <wangqing@vivo.com>
workqueue/watchdog: Make unbound workqueues aware of touch_softlockup_watchdog()
There are two workqueue-specific watchdog timestamps:
+ @wq_watchdog_touched_cpu (per-CPU) updated by touch_softlockup_watchdog()
+ @wq_watchdog_touched (global) updated by touch_all_softlockup_watchdogs()
watchdog_timer_fn() checks only the global @wq_watchdog_touched for unbound workqueues. As a result, unbound workqueues are not aware of touch_softlockup_watchdog(). The watchdog might report a stall even when the unbound workqueues are blocked by a known slow code.
Solution: touch_softlockup_watchdog() must touch also the global @wq_watchdog_touched timestamp.
The global timestamp can no longer be used for bound workqueues because it is now updated from all CPUs. Instead, bound workqueues have to check only @wq_watchdog_touched_cpu and these timestamps have to be updated for all CPUs in touch_all_softlockup_watchdogs().
Beware: The change might cause the opposite problem. An unbound workqueue might get blocked on CPU A because of a real softlockup. The workqueue watchdog would miss it when the timestamp got touched on CPU B.
It is acceptable because softlockups are detected by softlockup watchdog. The workqueue watchdog is there to detect stalls where a work never finishes, for example, because of dependencies of works queued into the same workqueue.
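A sketch of wq_watchdog_touch() after the change, assuming it previously updated the global stamp only for cpu < 0:

    void wq_watchdog_touch(int cpu)
    {
            if (cpu >= 0)
                    per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;

            /* always touch the global stamp so unbound pools see it too */
            wq_watchdog_touched = jiffies;
    }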
V3: - Modified the commit message according to Petr's suggestion.
Signed-off-by: Wang Qing <wangqing@vivo.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Revision tags: v5.10.25, v5.10.24, v5.10.23, v5.10.22, v5.10.21, v5.10.20, v5.10.19, v5.4.101, v5.10.18
# 0687c66b | 17-Feb-2021 | Zqiang <qiang.zhang@windriver.com>
workqueue: Move the position of debug_work_activate() in __queue_work()
debug_work_activate() is called on the premise that the work can be inserted; if the wq is in WQ_DRAINING state, the insertion may fail.
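A sketch of the move inside __queue_work(); the debug state is activated only once insertion is certain:

    -	debug_work_activate(work);	/* was at function entry */
    	if (unlikely(wq->flags & __WQ_DRAINING) &&
    	    WARN_ON_ONCE(!is_chained_work(wq)))
    		return;
    	/* ... */
    +	debug_work_activate(work);	/* work will definitely be inserted */
    	insert_work(pwq, work, worklist, work_flags);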
Fixes: e41e704bc4f4 ("workqueue: improve destroy_workqueue() debuggability")
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Revision tags: v5.10.17, v5.11, v5.10.16, v5.10.15, v5.10.14
# e9ad2eb3 | 23-Jan-2021 | Stephen Zhang <stephenzhangzsd@gmail.com>
workqueue: Use %s instead of function name
It is better to replace the hard-coded function name in the format string with %s, in case the function name changes.
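The pattern, as a generic sketch (the message is hypothetical, not the exact call site touched by this patch):

    -	pr_warn("some_function: something went wrong\n");
    +	pr_warn("%s: something went wrong\n", __func__);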
Signed-off-by: Stephen Zhang <stephenzhangzsd@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
# 640f17c8 | 15-Jan-2021 | Peter Zijlstra <peterz@infradead.org>
workqueue: Restrict affinity change to rescuer
create_worker() will already set the right affinity using kthread_bind_mask(); this means only the rescuer needs to change its affinity.
However, during cpu-hot-unplug a regular task is not allowed to run on an online && !active CPU, as it would be pushed away quite aggressively. We need KTHREAD_IS_PER_CPU to survive in that environment.
Therefore set the affinity after getting that magic flag.
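A sketch of the resulting ordering in worker_attach_to_pool(), following the description above (details are assumptions):

    if (pool->flags & POOL_DISASSOCIATED)
            worker->flags |= WORKER_UNBOUND;
    else
            kthread_set_per_cpu(worker->task, pool->cpu);   /* the magic flag first */

    /* only the rescuer still needs its affinity adjusted here */
    if (worker->rescue_wq)
            set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);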
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.826629830@infradead.org