History log of /openbmc/linux/lib/sbitmap.c (Results 76 – 100 of 109)
Revision tags: v5.4.50, v5.7.7, v5.4.49, v5.7.6, v5.7.5, v5.4.48, v5.7.4, v5.7.3, v5.4.47, v5.4.46, v5.7.2, v5.4.45, v5.7.1, v5.4.44, v5.7, v5.4.43, v5.4.42, v5.4.41, v5.4.40, v5.4.39, v5.4.38, v5.4.37, v5.4.36, v5.4.35, v5.4.34, v5.4.33, v5.4.32, v5.4.31, v5.4.30, v5.4.29, v5.6, v5.4.28, v5.4.27, v5.4.26, v5.4.25, v5.4.24, v5.4.23, v5.4.22, v5.4.21, v5.4.20, v5.4.19, v5.4.18, v5.4.17, v5.4.16, v5.5, v5.4.15, v5.4.14, v5.4.13, v5.4.12, v5.4.11, v5.4.10, v5.4.9, v5.4.8, v5.4.7, v5.4.6, v5.4.5, v5.4.4
# df034c93 17-Dec-2019 David Jeffery <djeffery@redhat.com>

sbitmap: only queue kyber's wait callback if not already active

Under heavy loads where the kyber I/O scheduler hits the token limits for
its scheduling domains, kyber can become stuck. When active requests
complete, kyber may not be woken up leaving the I/O requests in kyber
stuck.

This stuck state is due to a race condition with kyber and the sbitmap
functions it uses to run a callback when enough requests have completed.
The running of a sbt_wait callback can race with the attempt to insert the
sbt_wait. Since sbitmap_del_wait_queue removes the sbt_wait from the list
first then sets the sbq field to NULL, kyber can see the item as not on a
list but the call to sbitmap_add_wait_queue will see sbq as non-NULL. This
results in the sbt_wait being inserted onto the wait list but ws_active
doesn't get incremented. So the sbitmap queue does not know there is a
waiter on a wait list.

Since sbitmap doesn't think there is a waiter, kyber may never be
informed that there are domain tokens available and the I/O never advances.
With the sbt_wait on a wait list, kyber believes it has an active waiter
so cannot insert a new waiter when reaching the domain's full state.

This race can be fixed by only adding the sbt_wait to the queue if the
sbq field is NULL. If sbq is not NULL, there is already an action active
which will trigger the re-running of kyber. Let it run and add the
sbt_wait to the wait list if still needing to wait.
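The guard described above can be illustrated with a minimal user-space sketch. The `model_` names and structures are hypothetical simplifications, not the kernel's definitions, and `on_list` merely stands in for wait-list membership. Only the check on the `sbq` field mirrors what the message describes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct model_sbq {
	atomic_int ws_active;      /* number of waiters the queue knows about */
};

struct model_wait {
	struct model_sbq *sbq;     /* non-NULL while a wakeup action is active */
	int on_list;               /* stands in for wait-list membership */
};

/* Queue the waiter only if no action is already active for it. */
static void model_add_wait_queue(struct model_sbq *sbq, struct model_wait *w)
{
	assert(sbq != NULL && w != NULL);
	if (w->sbq == NULL) {
		w->sbq = sbq;
		atomic_fetch_add(&sbq->ws_active, 1);
		w->on_list = 1;
	}
}

/* Remove from the list first, then drop the accounting and clear sbq. */
static void model_del_wait_queue(struct model_wait *w)
{
	assert(w != NULL);
	w->on_list = 0;
	if (w->sbq != NULL) {
		atomic_fetch_sub(&w->sbq->ws_active, 1);
		w->sbq = NULL;
	}
}
```

In this model, a second add while `sbq` is still set is a no-op, so the waiter can never end up on the list without being counted in `ws_active`.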

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reported-by: John Pittman <jpittman@redhat.com>
Tested-by: John Pittman <jpittman@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


Revision tags: v5.4.3, v5.3.15, v5.4.2, v5.4.1, v5.3.14, v5.4, v5.3.13, v5.3.12
# 708edafa 13-Nov-2019 John Garry <john.garry@huawei.com>

sbitmap: Delete sbitmap_any_bit_clear()

Since the only caller of this function has been deleted, delete this one
also.

Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


Revision tags: v5.3.11, v5.3.10, v5.3.9, v5.3.8, v5.3.7, v5.3.6, v5.3.5, v5.3.4, v5.3.3, v5.3.2, v5.3.1, v5.3, v5.2.14, v5.3-rc8, v5.2.13, v5.2.12, v5.2.11, v5.2.10, v5.2.9, v5.2.8, v5.2.7, v5.2.6, v5.2.5, v5.2.4, v5.2.3, v5.2.2, v5.2.1, v5.2, v5.1.16, v5.1.15, v5.1.14, v5.1.13, v5.1.12, v5.1.11, v5.1.10, v5.1.9, v5.1.8, v5.1.7, v5.1.6, v5.1.5
# 41723288 23-May-2019 Pavel Begunkov <asml.silence@gmail.com>

sbitmap: Replace cmpxchg with xchg

cmpxchg() with an immediate value can be replaced with the less expensive
xchg(). The same is true if the new value doesn't _depend_ on the old one.

In the second block, atomic_cmpxchg() return value isn't checked, so
after atomic_cmpxchg() -> atomic_xchg() conversion it could be replaced
with atomic_set(). Comparison with atomic_read() in the second chunk was
left as an optimisation (if that was the initial intention).
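A rough user-space analogy using C11 atomics, not the kernel primitives: when the new value is a constant that does not depend on the old one, a compare-and-swap retry loop and an unconditional exchange leave the same result. The function names here are hypothetical.

```c
#include <stdatomic.h>

/* Compare-and-swap loop: retries until the swap to 0 succeeds. */
static unsigned long take_all_cmpxchg(atomic_ulong *mask)
{
	unsigned long old = atomic_load(mask);

	/* On failure, 'old' is refreshed with the current value. */
	while (!atomic_compare_exchange_weak(mask, &old, 0UL))
		;
	return old;
}

/* Unconditional exchange: same result, no retry loop needed. */
static unsigned long take_all_xchg(atomic_ulong *mask)
{
	return atomic_exchange(mask, 0UL);
}
```

Both helpers return the previous contents and leave the mask at zero. The exchange variant simply avoids the comparison and the retry.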

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 0fc479b1 29-May-2019 Thomas Gleixner <tglx@linutronix.de>

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 328

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license v2 as published
by the free software foundation this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details you
should have received a copy of the gnu general public license along
with this program if not see https www gnu org licenses

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 2 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000435.923873561@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>


Revision tags: v5.1.4
# a0934fd2 20-May-2019 Andrea Parri <andrea.parri@amarulasolutions.com>

sbitmap: fix improper use of smp_mb__before_atomic()

This barrier only applies to the read-modify-write operations; in
particular, it does not apply to the atomic_set() primitive.

Replace the barrier with an smp_mb().
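As a loose user-space analogy (C11 atomics, hypothetical `batch_model` names, not the kernel code): when the operations that follow a barrier are plain atomic stores rather than read-modify-write operations, a full fence is what orders them after the preceding store.

```c
#include <stdatomic.h>

#define BATCH_MODEL_QUEUES 8

struct batch_model {
	atomic_uint wake_batch;
	atomic_int wait_cnt[BATCH_MODEL_QUEUES];
};

static void batch_model_update(struct batch_model *m, unsigned int batch)
{
	int i;

	atomic_store_explicit(&m->wake_batch, batch, memory_order_relaxed);
	/*
	 * Full fence: the stores below are plain stores, not
	 * read-modify-write operations, so a barrier that only applies
	 * to RMW operations would not order them after the update above.
	 */
	atomic_thread_fence(memory_order_seq_cst);
	for (i = 0; i < BATCH_MODEL_QUEUES; i++)
		atomic_store_explicit(&m->wait_cnt[i], 1, memory_order_relaxed);
}
```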

Fixes: 6c0ca7ae292ad ("sbitmap: fix wakeup hang after sbq resize")
Cc: stable@vger.kernel.org
Reported-by: "Paul E. McKenney" <paulmck@linux.ibm.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: linux-block@vger.kernel.org
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


Revision tags: v5.1.3, v5.1.2, v5.1.1, v5.0.14, v5.1, v5.0.13, v5.0.12, v5.0.11, v5.0.10, v5.0.9, v5.0.8, v5.0.7, v5.0.6, v5.0.5, v5.0.4
# e6d1fa58 21-Mar-2019 Ming Lei <ming.lei@redhat.com>

sbitmap: order READ/WRITE freed instance and setting clear bit

Inside sbitmap_queue_clear(), once the clear bit is set, it is
visible to the allocation path immediately. Meanwhile, READ/WRITE on the
old associated instance (such as a request in the case of blk-mq) may be
reordered against setting the clear bit, so a race with re-allocation
may be triggered.

Add one memory barrier to order READ/WRITE of the freed associated
instance against setting the clear bit, avoiding the race with re-allocation.

The following kernel oops triggered by block/006 on aarch64 may be fixed:

[ 142.330954] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000330
[ 142.338794] Mem abort info:
[ 142.341554] ESR = 0x96000005
[ 142.344632] Exception class = DABT (current EL), IL = 32 bits
[ 142.350500] SET = 0, FnV = 0
[ 142.353544] EA = 0, S1PTW = 0
[ 142.356678] Data abort info:
[ 142.359528] ISV = 0, ISS = 0x00000005
[ 142.363343] CM = 0, WnR = 0
[ 142.366305] user pgtable: 64k pages, 48-bit VAs, pgdp = 000000002a3c51c0
[ 142.372983] [0000000000000330] pgd=0000000000000000, pud=0000000000000000
[ 142.379777] Internal error: Oops: 96000005 [#1] SMP
[ 142.384613] Modules linked in: null_blk ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp vfat fat rpcrdma sunrpc rdma_ucm ib_iser rdma_cm iw_cm libiscsi ib_umad scsi_transport_iscsi ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core sbsa_gwdt crct10dif_ce ghash_ce ipmi_ssif sha2_ce ipmi_devintf sha256_arm64 sg sha1_ce ipmi_msghandler ip_tables xfs libcrc32c mlx5_core sdhci_acpi mlxfw ahci_platform at803x sdhci libahci_platform qcom_emac mmc_core hdma hdma_mgmt i2c_dev [last unloaded: null_blk]
[ 142.429753] CPU: 7 PID: 1983 Comm: fio Not tainted 5.0.0.cki #2
[ 142.449458] pstate: 00400005 (nzcv daif +PAN -UAO)
[ 142.454239] pc : __blk_mq_free_request+0x4c/0xa8
[ 142.458830] lr : blk_mq_free_request+0xec/0x118
[ 142.463344] sp : ffff00003360f6a0
[ 142.466646] x29: ffff00003360f6a0 x28: ffff000010e70000
[ 142.471941] x27: ffff801729a50048 x26: 0000000000010000
[ 142.477232] x25: ffff00003360f954 x24: ffff7bdfff021440
[ 142.482529] x23: 0000000000000000 x22: 00000000ffffffff
[ 142.487830] x21: ffff801729810000 x20: 0000000000000000
[ 142.493123] x19: ffff801729a50000 x18: 0000000000000000
[ 142.498413] x17: 0000000000000000 x16: 0000000000000001
[ 142.503709] x15: 00000000000000ff x14: ffff7fe000000000
[ 142.509003] x13: ffff8017dcde09a0 x12: 0000000000000000
[ 142.514308] x11: 0000000000000001 x10: 0000000000000008
[ 142.519597] x9 : ffff8017dcde09a0 x8 : 0000000000002000
[ 142.524889] x7 : ffff8017dcde0a00 x6 : 000000015388f9be
[ 142.530187] x5 : 0000000000000001 x4 : 0000000000000000
[ 142.535478] x3 : 0000000000000000 x2 : 0000000000000000
[ 142.540777] x1 : 0000000000000001 x0 : ffff00001041b194
[ 142.546071] Process fio (pid: 1983, stack limit = 0x000000006460a0ea)
[ 142.552500] Call trace:
[ 142.554926] __blk_mq_free_request+0x4c/0xa8
[ 142.559181] blk_mq_free_request+0xec/0x118
[ 142.563352] blk_mq_end_request+0xfc/0x120
[ 142.567444] end_cmd+0x3c/0xa8 [null_blk]
[ 142.571434] null_complete_rq+0x20/0x30 [null_blk]
[ 142.576194] blk_mq_complete_request+0x108/0x148
[ 142.580797] null_handle_cmd+0x1d4/0x718 [null_blk]
[ 142.585662] null_queue_rq+0x60/0xa8 [null_blk]
[ 142.590171] blk_mq_try_issue_directly+0x148/0x280
[ 142.594949] blk_mq_try_issue_list_directly+0x9c/0x108
[ 142.600064] blk_mq_sched_insert_requests+0xb0/0xd0
[ 142.604926] blk_mq_flush_plug_list+0x16c/0x2a0
[ 142.609441] blk_flush_plug_list+0xec/0x118
[ 142.613608] blk_finish_plug+0x3c/0x4c
[ 142.617348] blkdev_direct_IO+0x3b4/0x428
[ 142.621336] generic_file_read_iter+0x84/0x180
[ 142.625761] blkdev_read_iter+0x50/0x78
[ 142.629579] aio_read.isra.6+0xf8/0x190
[ 142.633409] __io_submit_one.isra.8+0x148/0x738
[ 142.637912] io_submit_one.isra.9+0x88/0xb8
[ 142.642078] __arm64_sys_io_submit+0xe0/0x238
[ 142.646428] el0_svc_handler+0xa0/0x128
[ 142.650238] el0_svc+0x8/0xc
[ 142.653104] Code: b9402a63 f9000a7f 3100047f 540000a0 (f9419a81)
[ 142.659202] ---[ end trace 467586bc175eb09d ]---

Fixes: ea86ea2cdced20057da ("sbitmap: ammortize cost of clearing bits")
Reported-and-bisected_and_tested-by: Yi Zhang <yi.zhang@redhat.com>
Cc: Yi Zhang <yi.zhang@redhat.com>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


Revision tags: v5.0.3, v4.19.29, v5.0.2, v4.19.28, v5.0.1, v4.19.27, v5.0, v4.19.26, v4.19.25, v4.19.24, v4.19.23, v4.19.22, v4.19.21, v4.19.20, v4.19.19, v4.19.18, v4.19.17, v4.19.16
# fe76fc6a 14-Jan-2019 Ming Lei <ming.lei@redhat.com>

sbitmap: Protect swap_lock from hardirq

Because we may call blk_mq_get_driver_tag() directly from
blk_mq_dispatch_rq_list() without holding any lock, a HARDIRQ may
arrive and trigger the above DEADLOCK.

Commit ab53dcfb3e7b ("sbitmap: Protect swap_lock from hardirq") tries to
fix this issue by using 'spin_lock_bh', which isn't enough because we
complete requests from hardirq context directly in the case of multiqueue.

Cc: Clark Williams <williams@redhat.com>
Fixes: ab53dcfb3e7b ("sbitmap: Protect swap_lock from hardirq")
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


# 37198768 14-Jan-2019 Steven Rostedt (VMware) <rostedt@goodmis.org>

sbitmap: Protect swap_lock from softirqs

The swap_lock used by sbitmap has a chain with locks taken from softirq,
but the swap_lock is not protected from being preempted by softirqs.

A chain exists of:

sbq->ws[i].wait -> dispatch_wait_lock -> swap_lock

Where the sbq->ws[i].wait lock can be taken from softirq context, which
means all locks below it in the chain must also be protected from
softirqs.

Reported-by: Clark Williams <williams@redhat.com>
Fixes: 58ab5e32e6fd ("sbitmap: silence bogus lockdep IRQ warning")
Fixes: ea86ea2cdced ("sbitmap: amortize cost of clearing bits")
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>


Revision tags: v4.19.15, v4.19.14, v4.19.13, v4.19.12
# 9f6b7ef6 20-Dec-2018 Jens Axboe <axboe@kernel.dk>

sbitmap: add helpers for add/del wait queue handling

After commit 5d2ee7122c73, users of sbitmap that need wait queue
handling must use the provided helpers. But we only added
prepare_to_wait()/finish_wait() style helpers; add the equivalent
add_wait_queue/list_del wrappers as well.

This is needed to ensure kyber plays by the sbitmap waitqueue
rules.

Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


Revision tags: v4.19.11, v4.19.10, v4.19.9
# b2dbff1b 11-Dec-2018 Jens Axboe <axboe@kernel.dk>

sbitmap: flush deferred clears for resize and shallow gets

We're missing a deferred clear off the shallow get, which can cause
a hang. Additionally, when we resize the sbitmap, we should also
flush deferred clears for good measure.

Ensure we have full coverage on batch clears, even for paths where
we would not be doing deferred clear. This makes it less error
prone for future additions.

Reported-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 58ab5e32 09-Dec-2018 Jens Axboe <axboe@kernel.dk>

sbitmap: silence bogus lockdep IRQ warning

Ming reports that lockdep spews the following trace. What this
essentially says is that the sbitmap swap_lock was used inconsistently
in IRQ enabled and disabled context, and that is usually indicative of a
bug that will cause a deadlock.

For this case, it's a false positive. The swap_lock is used from process
context only, when we swap the bits in the word and cleared mask. We
also end up doing that when we are getting a driver tag, from the
blk_mq_mark_tag_wait(), and from there we hold the waitqueue lock with
IRQs disabled. However, this isn't from an actual IRQ, it's still
process context.

In lieu of a better way to fix this, simply always disable interrupts
when grabbing the swap_lock if lockdep is enabled.

[ 100.967642] ================start test sanity/001================
[ 101.238280] null: module loaded
[ 106.093735]
[ 106.094012] =====================================================
[ 106.094854] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[ 106.095759] 4.20.0-rc3_5d2ee7122c73_for-next+ #1 Not tainted
[ 106.096551] -----------------------------------------------------
[ 106.097386] fio/1043 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[ 106.098231] 000000004c43fa71
(&(&sb->map[i].swap_lock)->rlock){+.+.}, at: sbitmap_get+0xd5/0x22c
[ 106.099431]
[ 106.099431] and this task is already holding:
[ 106.100229] 000000007eec8b2f
(&(&hctx->dispatch_wait_lock)->rlock){....}, at:
blk_mq_dispatch_rq_list+0x4c1/0xd7c
[ 106.101630] which would create a new lock dependency:
[ 106.102326] (&(&hctx->dispatch_wait_lock)->rlock){....} ->
(&(&sb->map[i].swap_lock)->rlock){+.+.}
[ 106.103553]
[ 106.103553] but this new dependency connects a SOFTIRQ-irq-safe lock:
[ 106.104580] (&sbq->ws[i].wait){..-.}
[ 106.104582]
[ 106.104582] ... which became SOFTIRQ-irq-safe at:
[ 106.105751] _raw_spin_lock_irqsave+0x4b/0x82
[ 106.106284] __wake_up_common_lock+0x119/0x1b9
[ 106.106825] sbitmap_queue_wake_up+0x33f/0x383
[ 106.107456] sbitmap_queue_clear+0x4c/0x9a
[ 106.108046] __blk_mq_free_request+0x188/0x1d3
[ 106.108581] blk_mq_free_request+0x23b/0x26b
[ 106.109102] scsi_end_request+0x345/0x5d7
[ 106.109587] scsi_io_completion+0x4b5/0x8f0
[ 106.110099] scsi_finish_command+0x412/0x456
[ 106.110615] scsi_softirq_done+0x23f/0x29b
[ 106.111115] blk_done_softirq+0x2a7/0x2e6
[ 106.111608] __do_softirq+0x360/0x6ad
[ 106.112062] run_ksoftirqd+0x2f/0x5b
[ 106.112499] smpboot_thread_fn+0x3a5/0x3db
[ 106.113000] kthread+0x1d4/0x1e4
[ 106.113457] ret_from_fork+0x3a/0x50
[ 106.113969]
[ 106.113969] to a SOFTIRQ-irq-unsafe lock:
[ 106.114672] (&(&sb->map[i].swap_lock)->rlock){+.+.}
[ 106.114674]
[ 106.114674] ... which became SOFTIRQ-irq-unsafe at:
[ 106.116000] ...
[ 106.116003] _raw_spin_lock+0x33/0x64
[ 106.116676] sbitmap_get+0xd5/0x22c
[ 106.117134] __sbitmap_queue_get+0xe8/0x177
[ 106.117731] __blk_mq_get_tag+0x1e6/0x22d
[ 106.118286] blk_mq_get_tag+0x1db/0x6e4
[ 106.118756] blk_mq_get_driver_tag+0x161/0x258
[ 106.119383] blk_mq_dispatch_rq_list+0x28e/0xd7c
[ 106.120043] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.120607] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.121234] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.121781] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.122366] blk_mq_run_hw_queue+0x151/0x187
[ 106.122887] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.123492] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.124042] blk_flush_plug_list+0x392/0x3d7
[ 106.124557] blk_finish_plug+0x37/0x4f
[ 106.125019] read_pages+0x3ef/0x430
[ 106.125446] __do_page_cache_readahead+0x18e/0x2fc
[ 106.126027] force_page_cache_readahead+0x121/0x133
[ 106.126621] page_cache_sync_readahead+0x35f/0x3bb
[ 106.127229] generic_file_buffered_read+0x410/0x1860
[ 106.127932] __vfs_read+0x319/0x38f
[ 106.128415] vfs_read+0xd2/0x19a
[ 106.128817] ksys_read+0xb9/0x135
[ 106.129225] do_syscall_64+0x140/0x385
[ 106.129684] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.130292]
[ 106.130292] other info that might help us debug this:
[ 106.130292]
[ 106.131226] Chain exists of:
[ 106.131226] &sbq->ws[i].wait -->
&(&hctx->dispatch_wait_lock)->rlock -->
&(&sb->map[i].swap_lock)->rlock
[ 106.131226]
[ 106.132865] Possible interrupt unsafe locking scenario:
[ 106.132865]
[ 106.133659] CPU0 CPU1
[ 106.134194] ---- ----
[ 106.134733] lock(&(&sb->map[i].swap_lock)->rlock);
[ 106.135318] local_irq_disable();
[ 106.136014] lock(&sbq->ws[i].wait);
[ 106.136747]
lock(&(&hctx->dispatch_wait_lock)->rlock);
[ 106.137742] <Interrupt>
[ 106.138110] lock(&sbq->ws[i].wait);
[ 106.138625]
[ 106.138625] *** DEADLOCK ***
[ 106.138625]
[ 106.139430] 3 locks held by fio/1043:
[ 106.139947] #0: 0000000076ff0fd9 (rcu_read_lock){....}, at:
hctx_lock+0x29/0xe8
[ 106.140813] #1: 000000002feb1016 (&sbq->ws[i].wait){..-.}, at:
blk_mq_dispatch_rq_list+0x4ad/0xd7c
[ 106.141877] #2: 000000007eec8b2f
(&(&hctx->dispatch_wait_lock)->rlock){....}, at:
blk_mq_dispatch_rq_list+0x4c1/0xd7c
[ 106.143267]
[ 106.143267] the dependencies between SOFTIRQ-irq-safe lock and the
holding lock:
[ 106.144351] -> (&sbq->ws[i].wait){..-.} ops: 82 {
[ 106.144926] IN-SOFTIRQ-W at:
[ 106.145314] _raw_spin_lock_irqsave+0x4b/0x82
[ 106.146042] __wake_up_common_lock+0x119/0x1b9
[ 106.146785] sbitmap_queue_wake_up+0x33f/0x383
[ 106.147567] sbitmap_queue_clear+0x4c/0x9a
[ 106.148379] __blk_mq_free_request+0x188/0x1d3
[ 106.149148] blk_mq_free_request+0x23b/0x26b
[ 106.149864] scsi_end_request+0x345/0x5d7
[ 106.150546] scsi_io_completion+0x4b5/0x8f0
[ 106.151367] scsi_finish_command+0x412/0x456
[ 106.152157] scsi_softirq_done+0x23f/0x29b
[ 106.152855] blk_done_softirq+0x2a7/0x2e6
[ 106.153537] __do_softirq+0x360/0x6ad
[ 106.154280] run_ksoftirqd+0x2f/0x5b
[ 106.155020] smpboot_thread_fn+0x3a5/0x3db
[ 106.155828] kthread+0x1d4/0x1e4
[ 106.156526] ret_from_fork+0x3a/0x50
[ 106.157267] INITIAL USE at:
[ 106.157713] _raw_spin_lock_irqsave+0x4b/0x82
[ 106.158542] prepare_to_wait_exclusive+0xa8/0x215
[ 106.159421] blk_mq_get_tag+0x34f/0x6e4
[ 106.160186] blk_mq_get_request+0x48e/0xaef
[ 106.160997] blk_mq_make_request+0x27e/0xbd2
[ 106.161828] generic_make_request+0x4d1/0x873
[ 106.162661] submit_bio+0x20c/0x253
[ 106.163379] mpage_bio_submit+0x44/0x4b
[ 106.164142] mpage_readpages+0x3c2/0x407
[ 106.164919] read_pages+0x13a/0x430
[ 106.165633] __do_page_cache_readahead+0x18e/0x2fc
[ 106.166530] force_page_cache_readahead+0x121/0x133
[ 106.167439] page_cache_sync_readahead+0x35f/0x3bb
[ 106.168337] generic_file_buffered_read+0x410/0x1860
[ 106.169255] __vfs_read+0x319/0x38f
[ 106.169977] vfs_read+0xd2/0x19a
[ 106.170662] ksys_read+0xb9/0x135
[ 106.171356] do_syscall_64+0x140/0x385
[ 106.172120] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.173051] }
[ 106.173308] ... key at: [<ffffffff85094600>] __key.26481+0x0/0x40
[ 106.174219] ... acquired at:
[ 106.174646] _raw_spin_lock+0x33/0x64
[ 106.175183] blk_mq_dispatch_rq_list+0x4c1/0xd7c
[ 106.175843] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.176518] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.177262] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.177900] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.178591] blk_mq_run_hw_queue+0x151/0x187
[ 106.179207] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.179926] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.180571] blk_flush_plug_list+0x392/0x3d7
[ 106.181187] blk_finish_plug+0x37/0x4f
[ 106.181737] __se_sys_io_submit+0x171/0x304
[ 106.182346] do_syscall_64+0x140/0x385
[ 106.182895] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.183607]
[ 106.183830] -> (&(&hctx->dispatch_wait_lock)->rlock){....} ops: 1 {
[ 106.184691] INITIAL USE at:
[ 106.185119] _raw_spin_lock+0x33/0x64
[ 106.185838] blk_mq_dispatch_rq_list+0x4c1/0xd7c
[ 106.186697] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.187551] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.188481] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.189307] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.190189] blk_mq_run_hw_queue+0x151/0x187
[ 106.190989] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.191902] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.192739] blk_flush_plug_list+0x392/0x3d7
[ 106.193535] blk_finish_plug+0x37/0x4f
[ 106.194269] __se_sys_io_submit+0x171/0x304
[ 106.195059] do_syscall_64+0x140/0x385
[ 106.195794] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.196705] }
[ 106.196950] ... key at: [<ffffffff84880620>] __key.51231+0x0/0x40
[ 106.197853] ... acquired at:
[ 106.198270] lock_acquire+0x280/0x2f3
[ 106.198806] _raw_spin_lock+0x33/0x64
[ 106.199337] sbitmap_get+0xd5/0x22c
[ 106.199850] __sbitmap_queue_get+0xe8/0x177
[ 106.200450] __blk_mq_get_tag+0x1e6/0x22d
[ 106.201035] blk_mq_get_tag+0x1db/0x6e4
[ 106.201589] blk_mq_get_driver_tag+0x161/0x258
[ 106.202237] blk_mq_dispatch_rq_list+0x5b9/0xd7c
[ 106.202902] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.203572] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.204316] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.204956] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.205649] blk_mq_run_hw_queue+0x151/0x187
[ 106.206269] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.206997] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.207644] blk_flush_plug_list+0x392/0x3d7
[ 106.208264] blk_finish_plug+0x37/0x4f
[ 106.208814] __se_sys_io_submit+0x171/0x304
[ 106.209415] do_syscall_64+0x140/0x385
[ 106.209965] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.210684]
[ 106.210904]
[ 106.210904] the dependencies between the lock to be acquired
[ 106.210905] and SOFTIRQ-irq-unsafe lock:
[ 106.212541] -> (&(&sb->map[i].swap_lock)->rlock){+.+.} ops: 1969 {
[ 106.213393] HARDIRQ-ON-W at:
[ 106.213840] _raw_spin_lock+0x33/0x64
[ 106.214570] sbitmap_get+0xd5/0x22c
[ 106.215282] __sbitmap_queue_get+0xe8/0x177
[ 106.216086] __blk_mq_get_tag+0x1e6/0x22d
[ 106.216876] blk_mq_get_tag+0x1db/0x6e4
[ 106.217627] blk_mq_get_driver_tag+0x161/0x258
[ 106.218465] blk_mq_dispatch_rq_list+0x28e/0xd7c
[ 106.219326] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.220198] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.221138] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.221975] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.222874] blk_mq_run_hw_queue+0x151/0x187
[ 106.223686] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.224597] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.225444] blk_flush_plug_list+0x392/0x3d7
[ 106.226255] blk_finish_plug+0x37/0x4f
[ 106.227006] read_pages+0x3ef/0x430
[ 106.227717] __do_page_cache_readahead+0x18e/0x2fc
[ 106.228595] force_page_cache_readahead+0x121/0x133
[ 106.229491] page_cache_sync_readahead+0x35f/0x3bb
[ 106.230373] generic_file_buffered_read+0x410/0x1860
[ 106.231277] __vfs_read+0x319/0x38f
[ 106.231986] vfs_read+0xd2/0x19a
[ 106.232666] ksys_read+0xb9/0x135
[ 106.233350] do_syscall_64+0x140/0x385
[ 106.234097] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.235012] SOFTIRQ-ON-W at:
[ 106.235460] _raw_spin_lock+0x33/0x64
[ 106.236195] sbitmap_get+0xd5/0x22c
[ 106.236913] __sbitmap_queue_get+0xe8/0x177
[ 106.237715] __blk_mq_get_tag+0x1e6/0x22d
[ 106.238488] blk_mq_get_tag+0x1db/0x6e4
[ 106.239244] blk_mq_get_driver_tag+0x161/0x258
[ 106.240079] blk_mq_dispatch_rq_list+0x28e/0xd7c
[ 106.240937] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.241806] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.242751] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.243579] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.244469] blk_mq_run_hw_queue+0x151/0x187
[ 106.245277] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.246191] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.247044] blk_flush_plug_list+0x392/0x3d7
[ 106.247859] blk_finish_plug+0x37/0x4f
[ 106.248749] read_pages+0x3ef/0x430
[ 106.249463] __do_page_cache_readahead+0x18e/0x2fc
[ 106.250357] force_page_cache_readahead+0x121/0x133
[ 106.251263] page_cache_sync_readahead+0x35f/0x3bb
[ 106.252157] generic_file_buffered_read+0x410/0x1860
[ 106.253084] __vfs_read+0x319/0x38f
[ 106.253808] vfs_read+0xd2/0x19a
[ 106.254488] ksys_read+0xb9/0x135
[ 106.255186] do_syscall_64+0x140/0x385
[ 106.255943] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.256867] INITIAL USE at:
[ 106.257300] _raw_spin_lock+0x33/0x64
[ 106.258033] sbitmap_get+0xd5/0x22c
[ 106.258747] __sbitmap_queue_get+0xe8/0x177
[ 106.259542] __blk_mq_get_tag+0x1e6/0x22d
[ 106.260320] blk_mq_get_tag+0x1db/0x6e4
[ 106.261072] blk_mq_get_driver_tag+0x161/0x258
[ 106.261902] blk_mq_dispatch_rq_list+0x28e/0xd7c
[ 106.262762] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.263626] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.264571] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.265409] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.266302] blk_mq_run_hw_queue+0x151/0x187
[ 106.267111] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.268028] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.268878] blk_flush_plug_list+0x392/0x3d7
[ 106.269694] blk_finish_plug+0x37/0x4f
[ 106.270432] read_pages+0x3ef/0x430
[ 106.271139] __do_page_cache_readahead+0x18e/0x2fc
[ 106.272040] force_page_cache_readahead+0x121/0x133
[ 106.272932] page_cache_sync_readahead+0x35f/0x3bb
[ 106.273811] generic_file_buffered_read+0x410/0x1860
[ 106.274709] __vfs_read+0x319/0x38f
[ 106.275407] vfs_read+0xd2/0x19a
[ 106.276074] ksys_read+0xb9/0x135
[ 106.276764] do_syscall_64+0x140/0x385
[ 106.277500] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 106.278417] }
[ 106.278676] ... key at: [<ffffffff85094640>] __key.26212+0x0/0x40
[ 106.279586] ... acquired at:
[ 106.280026] lock_acquire+0x280/0x2f3
[ 106.280559] _raw_spin_lock+0x33/0x64
[ 106.281101] sbitmap_get+0xd5/0x22c
[ 106.281610] __sbitmap_queue_get+0xe8/0x177
[ 106.282221] __blk_mq_get_tag+0x1e6/0x22d
[ 106.282809] blk_mq_get_tag+0x1db/0x6e4
[ 106.283368] blk_mq_get_driver_tag+0x161/0x258
[ 106.284018] blk_mq_dispatch_rq_list+0x5b9/0xd7c
[ 106.284685] blk_mq_do_dispatch_sched+0x23a/0x287
[ 106.285371] blk_mq_sched_dispatch_requests+0x379/0x3fc
[ 106.286135] __blk_mq_run_hw_queue+0x137/0x17e
[ 106.286806] __blk_mq_delay_run_hw_queue+0x80/0x25f
[ 106.287515] blk_mq_run_hw_queue+0x151/0x187
[ 106.288149] blk_mq_sched_insert_requests+0x13f/0x175
[ 106.289041] blk_mq_flush_plug_list+0x7d6/0x81b
[ 106.289912] blk_flush_plug_list+0x392/0x3d7
[ 106.290590] blk_finish_plug+0x37/0x4f
[ 106.291238] __se_sys_io_submit+0x171/0x304
[ 106.291864] do_syscall_64+0x140/0x385
[ 106.292534] entry_SYSCALL_64_after_hwframe+0x49/0xbe

Reported-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


Revision tags: v4.19.8, v4.19.7, v4.19.6
# 5d2ee712 29-Nov-2018 Jens Axboe <axboe@kernel.dk>

sbitmap: optimize wakeup check

Even if we have no waiters on any of the sbitmap_queue wait states, we
still have to loop every entry to check. We do this for every IO, so
the cost adds up.

Shift a bit of the cost to the slow path, when we actually have waiters.
Wrap prepare_to_wait_exclusive() and finish_wait(), so we can maintain
an internal count of how many are currently active. Then we can simply
check this count in sbq_wake_ptr() and not have to loop if we don't
have any sleepers.
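The idea can be sketched in user space as follows. This is a simplified model with hypothetical `wake_model` names, not the kernel implementation: the wrappers keep a count of active waiters, and the wake-pointer lookup returns early when that count is zero.

```c
#include <stdatomic.h>
#include <stddef.h>

#define WAKE_MODEL_QUEUES 8

struct wake_model_state {
	int sleepers;                       /* waiters parked on this state */
};

struct wake_model {
	atomic_int ws_active;               /* total active waiters */
	struct wake_model_state ws[WAKE_MODEL_QUEUES];
};

/* Wrapper a waiter calls before sleeping (slow path). */
static void wake_model_prepare(struct wake_model *q, int idx)
{
	atomic_fetch_add(&q->ws_active, 1);
	q->ws[idx].sleepers++;
}

/* Wrapper a waiter calls after waking up. */
static void wake_model_finish(struct wake_model *q, int idx)
{
	q->ws[idx].sleepers--;
	atomic_fetch_sub(&q->ws_active, 1);
}

/* Fast path: skip the scan entirely when nobody is waiting. */
static struct wake_model_state *wake_model_ptr(struct wake_model *q)
{
	int i;

	if (atomic_load(&q->ws_active) == 0)
		return NULL;
	for (i = 0; i < WAKE_MODEL_QUEUES; i++)
		if (q->ws[i].sleepers > 0)
			return &q->ws[i];
	return NULL;
}
```

The per-IO cost with no sleepers is a single atomic load. The bookkeeping is paid only by callers that actually wait.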

Convert the two users of sbitmap with waiting, blk-mq-tag and iSCSI.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# ea86ea2c 30-Nov-2018 Jens Axboe <axboe@kernel.dk>

sbitmap: ammortize cost of clearing bits

sbitmap maintains a set of words that we use to set and clear bits, with
each bit representing a tag for blk-mq. Even though we spread the bits
out and maintain a hint cache, one particular bit allocated will end up
being cleared in the exact same spot.

This introduces batched clearing of bits. Instead of clearing a given
bit, the same bit is set in a cleared/free mask instead. If we fail
allocating a bit from a given word, then we check the free mask, and
batch move those cleared bits at that time. This trades 64 atomic bitops
for 2 cmpxchg().
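
A minimal userspace sketch of the batched clearing described above, assuming a single 64-bit word. The names `toy_word`, `toy_deferred_clear_bit()`, `toy_flush_cleared()` and `toy_get_bit()` are hypothetical. The sketch folds the mask back with an atomic exchange rather than the cmpxchg() pair the commit mentions, and it relies on the GCC/Clang builtin `__builtin_ctzll()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_word {
    _Atomic uint64_t word;     /* 1 = allocated (or freed but not yet folded back) */
    _Atomic uint64_t cleared;  /* 1 = freed, waiting to be folded back into word */
};

/* Free a bit cheaply: only record it in the cleared mask. */
static void toy_deferred_clear_bit(struct toy_word *w, unsigned int bit)
{
    assert(bit < 64);
    atomic_fetch_or(&w->cleared, UINT64_C(1) << bit);
}

/* Move all lazily freed bits back into word. Returns false if there were none. */
static bool toy_flush_cleared(struct toy_word *w)
{
    uint64_t mask = atomic_exchange(&w->cleared, 0);

    if (!mask)
        return false;
    atomic_fetch_and(&w->word, ~mask);
    return true;
}

/* Allocate the lowest free bit, or return -1 if the word is full. */
static int toy_get_bit(struct toy_word *w)
{
    for (;;) {
        uint64_t old = atomic_load(&w->word);

        while (old != UINT64_MAX) {
            unsigned int bit = (unsigned int)__builtin_ctzll(~old);

            /* On failure, old is reloaded and the search restarts. */
            if (atomic_compare_exchange_weak(&w->word, &old,
                                             old | (UINT64_C(1) << bit)))
                return (int)bit;
        }
        /* Word looks full: only now pay for the batched clear. */
        if (!toy_flush_cleared(w))
            return -1;
    }
}
```

Freeing touches only the `cleared` mask, so the allocation word is written by the free path once per batch instead of once per bit.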

In a threaded poll test case, half the overhead of getting and clearing
tags is removed with this change. On another poll test case with a
single thread, performance is unchanged.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 27fae429 29-Nov-2018 Jens Axboe <axboe@kernel.dk>

sbitmap: don't loop for find_next_zero_bit() for !round_robin

If we aren't forced to do round robin tag allocation, just use the
allocation hint to find the index for the tag word, don't use it for the
offset inside the word. This avoids a potential extra round trip in the
bit looping, and since we're fetching this cacheline, we may as well
check the whole word from the start.
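
As a rough illustration of the hint handling described above, the following sketch splits a hint into a word index and a starting bit offset. `toy_start` and `toy_start_from_hint()` are hypothetical names, and `shift` is assumed to be log2 of the number of bits per word.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_start {
    unsigned int index;   /* which word to start searching in */
    unsigned int offset;  /* which bit inside that word to start at */
};

static struct toy_start toy_start_from_hint(unsigned int hint,
                                            unsigned int shift,
                                            bool round_robin)
{
    struct toy_start s;

    assert(shift < 32);
    s.index = hint >> shift;
    /* Without round robin, search the chosen word from bit 0. */
    s.offset = round_robin ? (hint & ((1u << shift) - 1)) : 0;
    return s;
}
```

For example, with 64-bit words (`shift == 6`) a hint of 130 selects word 2 in both modes, but only the round-robin mode starts at bit 2 inside that word.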

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


Revision tags: v4.19.5, v4.19.4, v4.18.20, v4.19.3, v4.18.19, v4.19.2, v4.18.18, v4.18.17, v4.19.1, v4.19, v4.18.16, v4.18.15, v4.18.14, v4.18.13, v4.18.12, v4.18.11, v4.18.10, v4.18.9, v4.18.7, v4.18.6, v4.18.5, v4.17.18, v4.18.4, v4.18.3, v4.17.17, v4.18.2, v4.17.16, v4.17.15, v4.18.1, v4.18, v4.17.14, v4.17.13, v4.17.12, v4.17.11, v4.17.10, v4.17.9, v4.17.8, v4.17.7, v4.17.6, v4.17.5, v4.17.4, v4.17.3, v4.17.2
# 590b5b7d 12-Jun-2018 Kees Cook <keescook@chromium.org>

treewide: kzalloc_node() -> kcalloc_node()

The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This
patch replaces cases of:

kzalloc_node(a * b, gfp, node)

with:
kcalloc_node(a, b, gfp, node)

as well as handling cases of:

kzalloc_node(a * b * c, gfp, node)

with:

kzalloc_node(array3_size(a, b, c), gfp, node)

as it's slightly less ugly than:

kcalloc_node(array_size(a, b), c, gfp, node)

This does, however, attempt to ignore constant size factors like:

kzalloc_node(4 * 1024, gfp, node)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
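
The point of the two-factor form is that the multiplication can be checked for overflow before allocating. The following is a userspace sketch of that idea only, not the kernel implementation; `toy_mul_overflows()` and `toy_calloc_array()` are hypothetical helpers built on `malloc()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 (leaving *out untouched) if a * b does not fit in size_t. */
static int toy_mul_overflows(size_t a, size_t b, size_t *out)
{
    assert(out != NULL);
    if (a != 0 && b > SIZE_MAX / a)
        return 1;
    *out = a * b;
    return 0;
}

/* Zeroed array allocation that fails instead of silently wrapping around. */
static void *toy_calloc_array(size_t n, size_t size)
{
    size_t bytes;
    void *p;

    if (toy_mul_overflows(n, size, &bytes))
        return NULL;
    p = malloc(bytes);
    if (p)
        memset(p, 0, bytes);
    return p;
}
```

A caller that writes `malloc(n * size)` gets a short buffer when the product wraps; passing the factors separately lets the helper refuse the request instead.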

The Coccinelle script used for this was:

// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@

(
kzalloc_node(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kzalloc_node(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)

// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@

(
kzalloc_node(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kzalloc_node(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)

// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@

(
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)

// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@

- kzalloc_node
+ kcalloc_node
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)

// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@

(
kzalloc_node(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kzalloc_node(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)

// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@

(
kzalloc_node(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kzalloc_node(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)

// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@

(
kzalloc_node(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kzalloc_node(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)

// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@

(
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kzalloc_node(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)

// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@

(
kzalloc_node(sizeof(THING) * C2, ...)
|
kzalloc_node(sizeof(TYPE) * C2, ...)
|
kzalloc_node(C1 * C2 * C3, ...)
|
kzalloc_node(C1 * C2, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * E2
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kzalloc_node
+ kcalloc_node
(
- E1 * E2
+ E1, E2
, ...)
)

Signed-off-by: Kees Cook <keescook@chromium.org>


Revision tags: v4.17.1, v4.17
# e6fc4649 24-May-2018 Ming Lei <ming.lei@redhat.com>

blk-mq: avoid starving tag allocation after allocating process migrates

When the allocating process is scheduled back in and its mapped hw queue
has changed, fake one extra wakeup on the previous queue to compensate
for the missed wakeup, so other allocations on the previous queue won't
be starved.

This patch fixes a request allocation hang that can be triggered easily
when nr_requests is very low.

The race is as follows:

1) There are 2 hw queues, nr_requests is 2, and wake_batch is 1.

2) There are 3 waiters on hw queue 0.

3) The two in-flight requests on hw queue 0 complete, and only two of
the 3 waiters are woken up because of wake_batch. Both of these
waiters can be scheduled on another CPU and end up switching to hw
queue 1.

4) The 3rd waiter then waits forever, since hw queue 0 no longer has
any in-flight requests.

5) This patch fixes it with a fake wakeup when a waiter is scheduled
onto another hw queue.

Cc: <stable@vger.kernel.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>

Modified commit message to make it clearer, and make it apply on
top of the 4.18 branch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


# c854ab57 14-May-2018 Jens Axboe <axboe@kernel.dk>

sbitmap: fix race in wait batch accounting

If we have multiple callers of sbq_wake_up(), we can end up in a
situation where the wait_cnt will continually go more and more
negative. Consider the case where our wake batch is 1, hence
wait_cnt will start out as 1.

wait_cnt == 1

CPU0                            CPU1
atomic_dec_return(), cnt == 0
                                atomic_dec_return(), cnt == -1
                                cmpxchg(-1, 0) (succeeds)
                                [wait_cnt now 0]
cmpxchg(0, 1) (fails)

This leaves wait_cnt at 0, so we'll wake up immediately next time.
Going through the same sequence again, we'll end up with wait_cnt
at -1.

For the case where we have a larger wake batch, the only
difference is that the starting point will be higher. We'll
still end up with continually smaller batch wakeups, which
defeats the purpose of the rolling wakeups.

Always reset the wait_cnt to the batch value. Then it doesn't
matter who wins the race. But ensure that whoever does win
the race is the one that increments the ws index and wakes up
our batch count; the loser calls __sbq_wake_up() again to
account its wakeups towards the next active wait state index.
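
The accounting rule above can be sketched with C11 atomics. This is an illustration under assumptions, not the kernel code: `toy_ws`, `toy_wake_up_once()` and `toy_wake_up()` are hypothetical, there is a single wait state (so no ws index to advance), and the `wakeups` counter stands in for actually waking a batch of sleepers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_ws {
    atomic_int wait_cnt;  /* completions left before the next batch wakeup */
    atomic_int wakeups;   /* batch wakeups issued so far (illustration only) */
};

/* Returns true if the caller lost the reset race and should try again. */
static bool toy_wake_up_once(struct toy_ws *ws, int wake_batch)
{
    int cnt;

    assert(wake_batch > 0);
    cnt = atomic_fetch_sub(&ws->wait_cnt, 1) - 1;
    if (cnt > 0)
        return false;

    /*
     * Always reset to the full batch value rather than adding the
     * batch to whatever (possibly negative) count we observed.
     */
    if (atomic_compare_exchange_strong(&ws->wait_cnt, &cnt, wake_batch)) {
        /* Winner of the race: issue the batch wakeup. */
        atomic_fetch_add(&ws->wakeups, 1);
        return false;
    }
    /* Loser: account this completion against the refreshed count. */
    return true;
}

static void toy_wake_up(struct toy_ws *ws, int wake_batch)
{
    while (toy_wake_up_once(ws, wake_batch))
        ;
}
```

Because the reset value does not depend on the count observed, the counter returns to the full batch after every wakeup regardless of which caller wins the compare-and-exchange.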

Fixes: 6c0ca7ae292a ("sbitmap: fix wakeup hang after sbq resize")
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# 61445b56 09-May-2018 Omar Sandoval <osandov@fb.com>

sbitmap: warn if using smaller shallow depth than was setup

Make sure the user passed the right value to
sbitmap_queue_min_shallow_depth().

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


# a3275539 09-May-2018 Omar Sandoval <osandov@fb.com>

sbitmap: fix missed wakeups caused by sbitmap_queue_get_shallow()

The sbitmap queue wake batch is calculated such that once allocations
start blocking, all of the bits which are already allocated must be
enough to fulfill the batch counters of all of the waitqueues. However,
the shallow allocation depth can break this invariant, since we block
before our full depth is being utilized. Add
sbitmap_queue_min_shallow_depth(), which saves the minimum shallow depth
the sbq will use, and update sbq_calc_wake_batch() to take it into
account.
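
A sketch of how a wake batch could take the minimum shallow depth into account. The function name `toy_calc_wake_batch()` and the constants `TOY_WAIT_QUEUES` and `TOY_WAKE_BATCH` are assumptions made for illustration; the source does not give the actual values or code.

```c
#include <assert.h>

#define TOY_WAIT_QUEUES 8u  /* assumed number of wait queues */
#define TOY_WAKE_BATCH  8u  /* assumed upper bound on the batch size */

static unsigned int toy_min(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/*
 * depth: total number of bits; shift: log2(bits per word);
 * min_shallow_depth: smallest per-word limit any allocator will use
 * (pass ~0u when there is no shallow limit).
 */
static unsigned int toy_calc_wake_batch(unsigned int depth, unsigned int shift,
                                        unsigned int min_shallow_depth)
{
    unsigned int bits_per_word, shallow, usable, batch;

    assert(shift < 32);
    bits_per_word = 1u << shift;
    shallow = toy_min(bits_per_word, min_shallow_depth);

    /* Bits that can be in use when every word is capped at `shallow`. */
    usable = (depth >> shift) * shallow +
             toy_min(depth & (bits_per_word - 1), shallow);

    /* Spread the usable bits over the wait queues, then clamp. */
    batch = usable / TOY_WAIT_QUEUES;
    if (batch < 1)
        batch = 1;
    if (batch > TOY_WAKE_BATCH)
        batch = TOY_WAKE_BATCH;
    return batch;
}
```

With 256 bits in 64-bit words and no shallow limit, this gives the maximum batch of 8. Capping each word at 4 bits leaves only 16 usable bits, so the batch drops to 2 and blocked allocators can still be woken.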

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


Revision tags: v4.16
# 4ace53f1 27-Feb-2018 Omar Sandoval <osandov@fb.com>

sbitmap: use test_and_set_bit_lock()/clear_bit_unlock()

sbitmap_queue_get()/sbitmap_queue_clear() are used for
allocating/freeing a resource, so they should provide acquire/release
barrier semantics, respectively. sbitmap_get() currently contains a full
barrier, which is unnecessary, so use test_and_set_bit_lock() instead of
test_and_set_bit() (these are equivalent on x86_64). sbitmap_clear_bit()
does not imply any barriers, which is incorrect, as accesses of the
resource (e.g., request) could potentially get reordered to after the
clear_bit(). Introduce sbitmap_clear_bit_unlock() and use it for
sbitmap_queue_clear() (this only adds a compiler barrier on x86_64). The
other existing user of sbitmap_clear_bit() (the blk-mq software queue
pending map) is serialized through a spinlock and does not need this.
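
The acquire/release pairing described above can be illustrated with C11 atomics in userspace. `toy_try_lock_bit()` and `toy_unlock_bit()` are hypothetical names, not the kernel helpers. Note that `toy_try_lock_bit()` returns true on success, which is the opposite of the usual test-and-set convention of returning the old bit value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Try to take `bit`. Acquire ordering: accesses to the resource that
 * follow a successful call cannot be reordered before it.
 */
static bool toy_try_lock_bit(_Atomic uint64_t *word, unsigned int bit)
{
    uint64_t mask, old;

    assert(bit < 64);
    mask = UINT64_C(1) << bit;
    old = atomic_fetch_or_explicit(word, mask, memory_order_acquire);
    return !(old & mask);
}

/*
 * Give `bit` back. Release ordering: earlier accesses to the resource
 * cannot be reordered after the bit becomes visible as free.
 */
static void toy_unlock_bit(_Atomic uint64_t *word, unsigned int bit)
{
    assert(bit < 64);
    atomic_fetch_and_explicit(word, ~(UINT64_C(1) << bit),
                              memory_order_release);
}
```

The release on the free side is what prevents a late write to the resource from landing after another thread has already reallocated the same bit.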

Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>


Revision tags: v4.15, v4.13.16
# 4e5dff41 14-Nov-2017 Jens Axboe <axboe@kernel.dk>

blk-mq: improve heavily contended tag case

Even with a number of waitqueues, we can get into a situation where we
are heavily contended on the waitqueue lock. I got a report on spc1
where we're spending seconds doing this. Arguably the use case is nasty:
I reproduce it with one device and 1000 threads banging on the device.
But that doesn't mean we shouldn't be handling it better.

What ends up happening is that a thread will fail to get a tag, add
itself to the waitqueue, and subsequently get woken up when a tag is
freed - only to find itself going back to sleep on the waitqueue.

Instead of waking all threads, use an exclusive wait and wake up our
sbitmap batch count instead. This seems to work well for me (massive
improvement for this use case), and it survives basic testing. But I
haven't fully verified it yet.

An additional improvement is running the queue and checking for a new
tag BEFORE needing to add ourselves to the waitqueue.

Signed-off-by: Jens Axboe <axboe@kernel.dk>


Revision tags: v4.14, v4.13.5, v4.13, v4.12, v4.10.17, v4.10.16, v4.10.15, v4.10.14, v4.10.13, v4.10.12, v4.10.11
# c05e6673 14-Apr-2017 Omar Sandoval <osandov@fb.com>

sbitmap: add sbitmap_get_shallow() operation

This operation supports the use case of limiting the number of bits that
can be allocated for a given operation. Rather than setting aside some
bits at the end of the bitmap, we can set aside bits in each word of the
bitmap. This means we can keep the allocation hints spread out and
support sbitmap_resize() nicely at the cost of lower granularity for the
allowed depth.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>


Revision tags: v4.10.10, v4.10.9, v4.10.8, v4.10.7, v4.10.6, v4.10.5, v4.10.4, v4.10.3, v4.10.2, v4.10.1, v4.10
# af8601ad 03-Feb-2017 Ingo Molnar <mingo@kernel.org>

kasan, sched/headers: Uninline kasan_enable/disable_current()

<linux/kasan.h> is a low level header that is included early
in affected kernel headers. But it includes <linux/sched.h>
which complicates the cleanup of sched.h dependencies.

But kasan.h has almost no need for sched.h: its only use of
scheduler functionality is in two inline functions which are
not used very frequently - so uninline kasan_enable_current()
and kasan_disable_current().

Also add a <linux/sched.h> dependency to a .c file that depended
on kasan.h including it.

This paves the way to remove the <linux/sched.h> include from kasan.h.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>


# 24af1ccf 25-Jan-2017 Omar Sandoval <osandov@fb.com>

sbitmap: add helpers for dumping to a seq_file

This is useful debugging information that will be used in the blk-mq
debugfs directory.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>

Changed 'weight' to 'busy'.

Signed-off-by: Jens Axboe <axboe@fb.com>


# 6c0ca7ae 18-Jan-2017 Omar Sandoval <osandov@fb.com>

sbitmap: fix wakeup hang after sbq resize

When we resize a struct sbitmap_queue, we update the wakeup batch size,
but we don't update the wait count in the struct sbq_wait_states. If we
resized down from a size which could use a bigger batch size, these
counts could be too large and cause us to miss necessary wakeups. To fix
this, update the wait counts when we resize (ensuring some careful
memory ordering so that it's safe w.r.t. concurrent clears).

This also fixes a theoretical issue where two threads could end up
bumping the wait count up by the batch size, which could also
potentially lead to hangs.

Reported-by: Martin Raiber <martin@urbackup.org>
Fixes: e3a2b3f931f5 ("blk-mq: allow changing of queue depth through sysfs")
Fixes: 2971c35f3588 ("blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

