Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6 |
|
#
8884a56d |
| 27-Oct-2023 |
Keith Busch <kbusch@kernel.org> |
nvme: ensure reset state check ordering
[ Upstream commit e6e7f7ac03e40795346f1b2994a05f507ad8d345 ]
A different CPU may be setting the ctrl->state value, so ensure proper barriers to prevent optim
nvme: ensure reset state check ordering
[ Upstream commit e6e7f7ac03e40795346f1b2994a05f507ad8d345 ]
A different CPU may be setting the ctrl->state value, so ensure proper barriers to prevent optimizing to a stale state. Normally it isn't a problem to observe the wrong state as it is merely advisory to take a quicker path during initialization and error recovery, but seeing an old state can report unexpected ENETRESET errors when a reset request was in fact successful.
Reported-by: Minh Hoang <mh2022@meta.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39 |
|
#
99dc2640 |
| 11-Jul-2023 |
Ming Lei <ming.lei@redhat.com> |
nvme-tcp: fix potential unbalanced freeze & unfreeze
Move start_freeze into nvme_tcp_configure_io_queues(), and there is at least two benefits:
1) fix unbalanced freeze and unfreeze, since re-conne
nvme-tcp: fix potential unbalanced freeze & unfreeze
Move start_freeze into nvme_tcp_configure_io_queues(), and there is at least two benefits:
1) fix unbalanced freeze and unfreeze, since re-connection work may fail or be broken by removal
2) IO during error recovery can be failfast quickly because nvme fabrics unquiesces queues after teardown.
One side-effect is that !mpath request may timeout during connecting because of queue topo change, but that looks not one big deal:
1) same problem exists with current code base
2) compared with !mpath, mpath use case is dominant
Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic") Cc: stable@vger.kernel.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
show more ...
|
Revision tags: v6.1.38, v6.1.37 |
|
#
c97d3fb9 |
| 29-Jun-2023 |
David Howells <dhowells@redhat.com> |
nvme-tcp: Fix comma-related oops
Fix a comma that should be a semicolon. The comma is at the end of an if-body and thus makes the statement after (a bvec_set_page()) conditional too, resulting in a
nvme-tcp: Fix comma-related oops
Fix a comma that should be a semicolon. The comma is at the end of an if-body and thus makes the statement after (a bvec_set_page()) conditional too, resulting in an oops because we didn't fill out the bio_vec[]:
BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page ... Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp] RIP: 0010:skb_splice_from_iter+0xf1/0x370 ... Call Trace: tcp_sendmsg_locked+0x3a6/0xdd0 tcp_sendmsg+0x31/0x50 inet_sendmsg+0x47/0x80 sock_sendmsg+0x99/0xb0 nvme_tcp_try_send_data+0x149/0x490 [nvme_tcp] nvme_tcp_try_send+0x1b7/0x300 [nvme_tcp] nvme_tcp_io_work+0x40/0xc0 [nvme_tcp] process_one_work+0x21c/0x430 worker_thread+0x54/0x3e0 kthread+0xf8/0x130
Fixes: 7769887817c3 ("nvme-tcp: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage") Reported-by: Aurelien Aptel <aaptel@nvidia.com> Link: https://lore.kernel.org/r/253mt0il43o.fsf@mtr-vdi-124.i-did-not-set--mail-host-address--so-tickle-me/ Signed-off-by: David Howells <dhowells@redhat.com> cc: Sagi Grimberg <sagi@grimberg.me> cc: Willem de Bruijn <willemb@google.com> cc: Keith Busch <kbusch@kernel.org> cc: Jens Axboe <axboe@fb.com> cc: Christoph Hellwig <hch@lst.de> cc: Chaitanya Kulkarni <kch@nvidia.com> cc: "David S. Miller" <davem@davemloft.net> cc: Eric Dumazet <edumazet@google.com> cc: Jakub Kicinski <kuba@kernel.org> cc: Paolo Abeni <pabeni@redhat.com> cc: Jens Axboe <axboe@kernel.dk> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: linux-nvme@lists.infradead.org cc: netdev@vger.kernel.org Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
Revision tags: v6.1.36, v6.4 |
|
#
77698878 |
| 23-Jun-2023 |
David Howells <dhowells@redhat.com> |
nvme-tcp: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
When transmitting data, call down into TCP using a sendmsg with MSG_SPLICE_PAGES instead of sendpage.
Signed-off-by: David Howells <dhow
nvme-tcp: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage
When transmitting data, call down into TCP using a sendmsg with MSG_SPLICE_PAGES instead of sendpage.
Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Willem de Bruijn <willemb@google.com> cc: Keith Busch <kbusch@kernel.org> cc: Jens Axboe <axboe@fb.com> cc: Christoph Hellwig <hch@lst.de> cc: Chaitanya Kulkarni <kch@nvidia.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: linux-nvme@lists.infradead.org Link: https://lore.kernel.org/r/20230623225513.2732256-8-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27 |
|
#
a249d306 |
| 26-Apr-2023 |
Keith Busch <kbusch@kernel.org> |
nvme-fabrics: add queue setup helpers
tcp and rdma transports have lots of duplicate code setting up the different queue mappings. Add common helpers.
Cc: Chaitanya Kulkarni <kch@nvidia.com> Review
nvme-fabrics: add queue setup helpers
tcp and rdma transports have lots of duplicate code setting up the different queue mappings. Add common helpers.
Cc: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
show more ...
|
Revision tags: v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21 |
|
#
aeacfcef |
| 21-Mar-2023 |
Chris Leech <cleech@redhat.com> |
nvme-tcp: fence TCP socket on receive error
Ensure that no further socket reads occur after a receive processing error, either from io_work being re-scheduled or nvme_tcp_poll.
Failing to do so can
nvme-tcp: fence TCP socket on receive error
Ensure that no further socket reads occur after a receive processing error, either from io_work being re-scheduled or nvme_tcp_poll.
Failing to do so can result in unrecognised PDU payloads or TCP stream garbage being processed as a C2H data PDU, and potentially start copying the payload to an invalid destination after looking up a request using a bogus command id.
Signed-off-by: Chris Leech <cleech@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: John Meneghini <jmeneghi@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
88eaba80 |
| 20-Mar-2023 |
Sagi Grimberg <sagi@grimberg.me> |
nvme-tcp: fix a possible UAF when failing to allocate an io queue
When we allocate a nvme-tcp queue, we set the data_ready callback before we actually need to use it. This creates the potential that
nvme-tcp: fix a possible UAF when failing to allocate an io queue
When we allocate a nvme-tcp queue, we set the data_ready callback before we actually need to use it. This creates the potential that if a stray controller sends us data on the socket before we connect, we can trigger the io_work and start consuming the socket.
In this case reported: we failed to allocate one of the io queues, and as we start releasing the queues that we already allocated, we get a UAF [1] from the io_work which is running before it should really.
Fix this by setting the socket ops callbacks only before we start the queue, so that we can't accidentally schedule the io_work in the initialization phase before the queue started. While we are at it, rename nvme_tcp_restore_sock_calls to pair with nvme_tcp_setup_sock_ops.
[1]: [16802.107284] nvme nvme4: starting error recovery [16802.109166] nvme nvme4: Reconnecting in 10 seconds... [16812.173535] nvme nvme4: failed to connect socket: -111 [16812.173745] nvme nvme4: Failed reconnect attempt 1 [16812.173747] nvme nvme4: Reconnecting in 10 seconds... [16822.413555] nvme nvme4: failed to connect socket: -111 [16822.413762] nvme nvme4: Failed reconnect attempt 2 [16822.413765] nvme nvme4: Reconnecting in 10 seconds... [16832.661274] nvme nvme4: creating 32 I/O queues. [16833.919887] BUG: kernel NULL pointer dereference, address: 0000000000000088 [16833.920068] nvme nvme4: Failed reconnect attempt 3 [16833.920094] #PF: supervisor write access in kernel mode [16833.920261] nvme nvme4: Reconnecting in 10 seconds... [16833.920368] #PF: error_code(0x0002) - not-present page [16833.921086] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp] [16833.921191] RIP: 0010:_raw_spin_lock_bh+0x17/0x30 ... [16833.923138] Call Trace: [16833.923271] <TASK> [16833.923402] lock_sock_nested+0x1e/0x50 [16833.923545] nvme_tcp_try_recv+0x40/0xa0 [nvme_tcp] [16833.923685] nvme_tcp_io_work+0x68/0xa0 [nvme_tcp] [16833.923824] process_one_work+0x1e8/0x390 [16833.923969] worker_thread+0x53/0x3d0 [16833.924104] ? process_one_work+0x390/0x390 [16833.924240] kthread+0x124/0x150 [16833.924376] ? set_kthread_struct+0x50/0x50 [16833.924518] ret_from_fork+0x1f/0x30 [16833.924655] </TASK>
Reported-by: Yanjun Zhang <zhangyanjun@cestc.cn> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Yanjun Zhang <zhangyanjun@cestc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
show more ...
|
Revision tags: v6.1.20, v6.1.19 |
|
#
7e87965d |
| 13-Mar-2023 |
Sagi Grimberg <sagi@grimberg.me> |
nvme-tcp: add nvme-tcp pdu size build protection
Make sure that we don't somehow mess up the wire structures in the spec.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulk
nvme-tcp: add nvme-tcp pdu size build protection
Make sure that we don't somehow mess up the wire structures in the spec.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kkch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
a3406352 |
| 13-Mar-2023 |
Sagi Grimberg <sagi@grimberg.me> |
nvme-tcp: fix opcode reporting in the timeout handler
For non in-capsule writes we reuse the request pdu space for a h2cdata pdu in order to avoid over allocating space (either preallocate or dynami
nvme-tcp: fix opcode reporting in the timeout handler
For non in-capsule writes we reuse the request pdu space for a h2cdata pdu in order to avoid over allocating space (either preallocate or dynamically upon receving an r2t pdu). However if the request times out the core expects to find the opcode in the start of the request, which we override.
In order to prevent that, without sacrificing additional 24 bytes per request, we just use the tail of the command pdu space instead (last 24 bytes from the 72 bytes command pdu). That should make the command opcode always available, and we get away from allocating more space.
If in the future we would need the last 24 bytes of the nvme command available we would need to allocate a dedicated space for it in the request, but until then we can avoid doing so.
Reported-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kkch@nvidia.com> Tested-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
show more ...
|
Revision tags: v6.1.18, v6.1.17, v6.1.16, v6.1.15 |
|
#
76d54bf2 |
| 26-Feb-2023 |
Akinobu Mita <akinobu.mita@gmail.com> |
nvme-tcp: don't access released socket during error recovery
While the error recovery work is temporarily failing reconnect attempts, running the 'nvme list' command causes a kernel NULL pointer der
nvme-tcp: don't access released socket during error recovery
While the error recovery work is temporarily failing reconnect attempts, running the 'nvme list' command causes a kernel NULL pointer dereference by calling getsockname() with a released socket.
During error recovery work, the nvme tcp socket is released and a new one created, so it is not safe to access the socket without proper check.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Fixes: 02c57a82c008 ("nvme-tcp: print actual source IP address through sysfs "address" attr") Reviewed-by: Martin Belanger <martin.belanger@dell.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
show more ...
|
Revision tags: v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9, v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13 |
|
#
99607843 |
| 12-Dec-2022 |
Amit Engel <Amit.Engel@dell.com> |
nvme-tcp: add additional info for nvme_tcp_timeout log
This provides additional details about the rq/cmd that is timed out
example log if CONFIG_NVME_VERBOSE_ERRORS is configured: "nvme nvme0: queu
nvme-tcp: add additional info for nvme_tcp_timeout log
This provides additional details about the rq/cmd that is timed out
example log if CONFIG_NVME_VERBOSE_ERRORS is configured: "nvme nvme0: queue 2 timeout cid 0xd058 type 4 opc Write (0x1)"
example log if CONFIG_NVME_VERBOSE_ERRORS is not configured: "nvme nvme0: queue 2 timeout cid 0xd058 type 4 opc I/O Cmd (0x1)"
Signed-off-by: Amit Engel <Amit.Engel@dell.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
40e0b090 |
| 19-Jan-2023 |
Peilin Ye <peilin.ye@bytedance.com> |
net/sock: Introduce trace_sk_data_ready()
As suggested by Cong, introduce a tracepoint for all ->sk_data_ready() callback implementations. For example:
<...> iperf-609 [002] ..... 70.660425: s
net/sock: Introduce trace_sk_data_ready()
As suggested by Cong, introduce a tracepoint for all ->sk_data_ready() callback implementations. For example:
<...> iperf-609 [002] ..... 70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable iperf-609 [002] ..... 70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable <...>
Suggested-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
show more ...
|
#
98123866 |
| 16-Dec-2022 |
Benjamin Coddington <bcodding@redhat.com> |
Treewide: Stop corrupting socket's task_frag
Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the GFP_NOIO flag on sk_allocation which the networking system uses to decide when
Treewide: Stop corrupting socket's task_frag
Since moving to memalloc_nofs_save/restore, SUNRPC has stopped setting the GFP_NOIO flag on sk_allocation which the networking system uses to decide when it is safe to use current->task_frag. The results of this are unexpected corruption in task_frag when SUNRPC is involved in memory reclaim.
The corruption can be seen in crashes, but the root cause is often difficult to ascertain as a crashing machine's stack trace will have no evidence of being near NFS or SUNRPC code. I believe this problem to be much more pervasive than reports to the community may indicate.
Fix this by having kernel users of sockets that may corrupt task_frag due to reclaim set sk_use_task_frag = false. Preemptively correcting this situation for users that still set sk_allocation allows them to convert to memalloc_nofs_save/restore without the same unexpected corruptions that are sure to follow, unlikely to show up in testing, and difficult to bisect.
CC: Philipp Reisner <philipp.reisner@linbit.com> CC: Lars Ellenberg <lars.ellenberg@linbit.com> CC: "Christoph Böhmwalder" <christoph.boehmwalder@linbit.com> CC: Jens Axboe <axboe@kernel.dk> CC: Josef Bacik <josef@toxicpanda.com> CC: Keith Busch <kbusch@kernel.org> CC: Christoph Hellwig <hch@lst.de> CC: Sagi Grimberg <sagi@grimberg.me> CC: Lee Duncan <lduncan@suse.com> CC: Chris Leech <cleech@redhat.com> CC: Mike Christie <michael.christie@oracle.com> CC: "James E.J. Bottomley" <jejb@linux.ibm.com> CC: "Martin K. Petersen" <martin.petersen@oracle.com> CC: Valentina Manea <valentina.manea.m@gmail.com> CC: Shuah Khan <shuah@kernel.org> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: David Howells <dhowells@redhat.com> CC: Marc Dionne <marc.dionne@auristor.com> CC: Steve French <sfrench@samba.org> CC: Christine Caulfield <ccaulfie@redhat.com> CC: David Teigland <teigland@redhat.com> CC: Mark Fasheh <mark@fasheh.com> CC: Joel Becker <jlbec@evilplan.org> CC: Joseph Qi <joseph.qi@linux.alibaba.com> CC: Eric Van Hensbergen <ericvh@gmail.com> CC: Latchesar Ionkov <lucho@ionkov.net> CC: Dominique Martinet <asmadeus@codewreck.org> CC: Ilya Dryomov <idryomov@gmail.com> CC: Xiubo Li <xiubli@redhat.com> CC: Chuck Lever <chuck.lever@oracle.com> CC: Jeff Layton <jlayton@kernel.org> CC: Trond Myklebust <trond.myklebust@hammerspace.com> CC: Anna Schumaker <anna@kernel.org> CC: Steffen Klassert <steffen.klassert@secunet.com> CC: Herbert Xu <herbert@gondor.apana.org.au>
Suggested-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
show more ...
|
Revision tags: v6.1, v6.0.12, v6.0.11 |
|
#
db45e1a5 |
| 30-Nov-2022 |
Christoph Hellwig <hch@lst.de> |
nvme: consolidate setting the tagset flags
All nvme transports should be using the same flags for their tagsets, with the exception for the blocking flag that should only be set for transports that
nvme: consolidate setting the tagset flags
All nvme transports should be using the same flags for their tagsets, with the exception for the blocking flag that should only be set for transports that can block in ->queue_rq.
Add a NVME_F_BLOCKING flag to nvme_ctrl_ops to control the blocking behavior and lift setting the flags into nvme_alloc_{admin,io}_tag_set.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
show more ...
|
#
dcef7727 |
| 30-Nov-2022 |
Christoph Hellwig <hch@lst.de> |
nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
Don't look at ctrl->ops as only RDMA and TCP actually support multiple maps.
Fixes: 6dfba1c09c10 ("nvme-fc: use the tagset alloc/free helpers"
nvme: pass nr_maps explicitly to nvme_alloc_io_tag_set
Don't look at ctrl->ops as only RDMA and TCP actually support multiple maps.
Fixes: 6dfba1c09c10 ("nvme-fc: use the tagset alloc/free helpers") Fixes: ceee1953f923 ("nvme-loop: use the tagset alloc/free helpers") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
show more ...
|
Revision tags: v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78 |
|
#
285b6e9b |
| 08-Nov-2022 |
Christoph Hellwig <hch@lst.de> |
nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
Many of the callers decide which one to use based on a bool argument and there is at least some code to be shared, so merge these two. Also mov
nvme: merge nvme_shutdown_ctrl into nvme_disable_ctrl
Many of the callers decide which one to use based on a bool argument and there is at least some code to be shared, so merge these two. Also move a comment specific to a single callsite to that callsite.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hector Martin <marcan@marcan.st>
show more ...
|
Revision tags: v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72 |
|
#
6887fc64 |
| 03-Oct-2022 |
Sagi Grimberg <sagi@grimberg.me> |
nvme: introduce nvme_start_request
In preparation for nvme-multipath IO stats accounting, we want the accounting to happen in a centralized place. The request completion is already centralized, but
nvme: introduce nvme_start_request
In preparation for nvme-multipath IO stats accounting, we want the accounting to happen in a centralized place. The request completion is already centralized, but we need a common helper to request I/O start.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de>
show more ...
|
Revision tags: v6.0, v5.15.71, v5.15.70, v5.15.69 |
|
#
de4eda9d |
| 15-Sep-2022 |
Al Viro <viro@zeniv.linux.org.uk> |
use less confusing names for iov_iter direction initializers
READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with wri
use less confusing names for iov_iter direction initializers
READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way.
Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
show more ...
|
#
9f27bd70 |
| 15-Nov-2022 |
Christoph Hellwig <hch@lst.de> |
nvme: rename the queue quiescing helpers
Naming the nvme helpers that wrap the block quiesce functionality _start/_stop is rather confusing. Switch to using the quiesce naming used by the block lay
nvme: rename the queue quiescing helpers
Naming the nvme helpers that wrap the block quiesce functionality _start/_stop is rather confusing. Switch to using the quiesce naming used by the block layer instead.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
show more ...
|
#
1f1a4f89 |
| 13-Nov-2022 |
Sagi Grimberg <sagi@grimberg.me> |
nvme-tcp: stop auth work after tearing down queues in error recovery
when starting error recovery there might be a authentication work running, and it involves I/O commands. Given the controller is
nvme-tcp: stop auth work after tearing down queues in error recovery
when starting error recovery there might be a authentication work running, and it involves I/O commands. Given the controller is tearing down there is no chance for the I/O to complete other than timing out which may unnecessarily take a full io timeout.
So first tear down the queues, fail/cancel all inflight I/O (including potentially authentication) and only then stop authentication. This ensures that failover is not stalled due to blocked authentication I/O.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
94cc781f |
| 08-Nov-2022 |
Christoph Hellwig <hch@lst.de> |
nvme: move OPAL setup from PCIe to core
Nothing about the TCG Opal support is PCIe transport specific, so move it to the core code. For this nvme_init_ctrl_finish grows a new was_suspended argument
nvme: move OPAL setup from PCIe to core
Nothing about the TCG Opal support is PCIe transport specific, so move it to the core code. For this nvme_init_ctrl_finish grows a new was_suspended argument that allows the transport driver to tell the OPAL code if the controller came out of a suspend cycle.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: James Smart <jsmart2021@gmail.com> Tested-by Gerd Bayer <gbayer@linxu.ibm.com>
show more ...
|
#
83e1226b |
| 23-Oct-2022 |
Sagi Grimberg <sagi@grimberg.me> |
nvme-tcp: fix possible circular locking when deleting a controller under memory pressure
When destroying a queue, when calling sock_release, the network stack might need to allocate an skb to send a
nvme-tcp: fix possible circular locking when deleting a controller under memory pressure
When destroying a queue, when calling sock_release, the network stack might need to allocate an skb to send a FIN/RST. When that happens during memory pressure, there is a need to reclaim memory, which in turn may ask the nvme-tcp device to write out dirty pages, however this is not possible due to a ctrl teardown that is going on.
Set PF_MEMALLOC to the task that releases the socket to grant access to PF_MEMALLOC reserves. In addition, do the same for the nvme-tcp thread as this may also originate from the swap itself and should be more resilient to memory pressure situations.
This fixes the following lockdep complaint: -- ====================================================== WARNING: possible circular locking dependency detected 6.0.0-rc2+ #25 Tainted: G W ------------------------------------------------------ kswapd0/92 is trying to acquire lock: ffff888114003240 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_sendpage+0x23/0xa0
but task is already holding lock: ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0x11e/0x160 kmem_cache_alloc_node+0x44/0x530 __alloc_skb+0x158/0x230 tcp_send_active_reset+0x7e/0x730 tcp_disconnect+0x1272/0x1ae0 __tcp_close+0x707/0xd90 tcp_close+0x26/0x80 inet_release+0xfa/0x220 sock_release+0x85/0x1a0 nvme_tcp_free_queue+0x1fd/0x470 [nvme_tcp] nvme_do_delete_ctrl+0x130/0x13d [nvme_core] nvme_sysfs_delete.cold+0x8/0xd [nvme_core] kernfs_fop_write_iter+0x356/0x530 vfs_write+0x4e8/0xce0 ksys_write+0xfd/0x1d0 do_syscall_64+0x58/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (sk_lock-AF_INET-NVME){+.+.}-{0:0}: __lock_acquire+0x2a0c/0x5690 lock_acquire+0x18e/0x4f0 lock_sock_nested+0x37/0xc0 tcp_sendpage+0x23/0xa0 inet_sendpage+0xad/0x120 kernel_sendpage+0x156/0x440 nvme_tcp_try_send+0x48a/0x2630 [nvme_tcp] nvme_tcp_queue_rq+0xefb/0x17e0 [nvme_tcp] __blk_mq_try_issue_directly+0x452/0x660 blk_mq_plug_issue_direct.constprop.0+0x207/0x700 blk_mq_flush_plug_list+0x6f5/0xc70 __blk_flush_plug+0x264/0x410 blk_finish_plug+0x4b/0xa0 shrink_lruvec+0x1263/0x1ea0 shrink_node+0x736/0x1a80 balance_pgdat+0x740/0x10d0 kswapd+0x5f2/0xaf0 kthread+0x256/0x2f0 ret_from_fork+0x1f/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(sk_lock-AF_INET-NVME); lock(fs_reclaim); lock(sk_lock-AF_INET-NVME);
*** DEADLOCK ***
3 locks held by kswapd0/92: #0: ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0 #1: ffff88811f21b0b0 (q->srcu){....}-{0:0}, at: blk_mq_flush_plug_list+0x6b3/0xc70 #2: ffff888170b11470 (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0xeb9/0x17e0 [nvme_tcp]
Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Reported-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
5fa9add6 |
| 22-Oct-2022 |
Nam Cao <namcaov@gmail.com> |
nvme-tcp: replace sg_init_marker() with sg_init_table()
In nvme_tcp_ddgst_update(), sg_init_marker() is called with an uninitialized scatterlist. This is probably fine, but gcc complains:
CC [M]
nvme-tcp: replace sg_init_marker() with sg_init_table()
In nvme_tcp_ddgst_update(), sg_init_marker() is called with an uninitialized scatterlist. This is probably fine, but gcc complains:
CC [M] drivers/nvme/host/tcp.o In file included from ./include/linux/dma-mapping.h:10, from ./include/linux/skbuff.h:31, from ./include/net/net_namespace.h:43, from ./include/linux/netdevice.h:38, from ./include/net/sock.h:46, from drivers/nvme/host/tcp.c:12: In function ‘sg_mark_end’, inlined from ‘sg_init_marker’ at ./include/linux/scatterlist.h:356:2, inlined from ‘nvme_tcp_ddgst_update’ at drivers/nvme/host/tcp.c:390:2: ./include/linux/scatterlist.h:234:11: error: ‘sg.page_link’ is used uninitialized [-Werror=uninitialized] 234 | sg->page_link |= SG_END; | ~~^~~~~~~~~~~ drivers/nvme/host/tcp.c: In function ‘nvme_tcp_ddgst_update’: drivers/nvme/host/tcp.c:388:28: note: ‘sg’ declared here 388 | struct scatterlist sg; | ^~ cc1: all warnings being treated as errors
Use sg_init_table() instead, which basically memset the scatterlist to zero first before calling sg_init_marker().
Signed-off-by: Nam Cao <namcaov@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
c4abd875 |
| 28-Sep-2022 |
Sagi Grimberg <sagi@grimberg.me> |
nvme-tcp: fix possible hang caused during ctrl deletion
When we delete a controller, we execute the following: 1. nvme_stop_ctrl() - stop some work elements that may be inflight or scheduled (speci
nvme-tcp: fix possible hang caused during ctrl deletion
When we delete a controller, we execute the following: 1. nvme_stop_ctrl() - stop some work elements that may be inflight or scheduled (specifically also .stop_ctrl which cancels ctrl error recovery work) 2. nvme_remove_namespaces() - which first flushes scan_work to avoid competing ns addition/removal 3. continue to teardown the controller
However, if err_work was scheduled to run in (1), it is designed to cancel any inflight I/O, particularly I/O that is originating from ns scan_work in (2), but because it is cancelled in .stop_ctrl(), we can prevent forward progress of (2) as ns scanning is blocking on I/O (that will never be cancelled).
The race is: 1. transport layer error observed -> err_work is scheduled 2. scan_work executes, discovers ns, generate I/O to it 3. nvme_ctop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work) - err_work never executed 4. nvme_remove_namespaces() -> flush_work(scan_work) --> deadlock, because scan_work is blocked on I/O that was supposed to be cancelled by err_work, but was cancelled before executing (see stack trace [1]).
Fix this by flushing err_work instead of cancelling it, to force it to execute and cancel all inflight I/O.
[1]: -- Call Trace: <TASK> __schedule+0x390/0x910 ? scan_shadow_nodes+0x40/0x40 schedule+0x55/0xe0 io_schedule+0x16/0x40 do_read_cache_page+0x55d/0x850 ? __page_cache_alloc+0x90/0x90 read_cache_page+0x12/0x20 read_part_sector+0x3f/0x110 amiga_partition+0x3d/0x3e0 ? osf_partition+0x33/0x220 ? put_partition+0x90/0x90 bdev_disk_changed+0x1fe/0x4d0 blkdev_get_whole+0x7b/0x90 blkdev_get_by_dev+0xda/0x2d0 device_add_disk+0x356/0x3b0 nvme_mpath_set_live+0x13c/0x1a0 [nvme_core] ? nvme_parse_ana_log+0xae/0x1a0 [nvme_core] nvme_update_ns_ana_state+0x3a/0x40 [nvme_core] nvme_mpath_add_disk+0x120/0x160 [nvme_core] nvme_alloc_ns+0x594/0xa00 [nvme_core] nvme_validate_or_alloc_ns+0xb9/0x1a0 [nvme_core] ? __nvme_submit_sync_cmd+0x1d2/0x210 [nvme_core] nvme_scan_work+0x281/0x410 [nvme_core] process_one_work+0x1be/0x380 worker_thread+0x37/0x3b0 ? process_one_work+0x380/0x380 kthread+0x12d/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x1f/0x30 </TASK> INFO: task nvme:6725 blocked for more than 491 seconds. Not tainted 5.15.65-f0.el7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:nvme state:D stack: 0 pid: 6725 ppid: 1761 flags:0x00004000 Call Trace: <TASK> __schedule+0x390/0x910 ? sched_clock+0x9/0x10 schedule+0x55/0xe0 schedule_timeout+0x24b/0x2e0 ? try_to_wake_up+0x358/0x510 ? finish_task_switch+0x88/0x2c0 wait_for_completion+0xa5/0x110 __flush_work+0x144/0x210 ? worker_attach_to_pool+0xc0/0xc0 flush_work+0x10/0x20 nvme_remove_namespaces+0x41/0xf0 [nvme_core] nvme_do_delete_ctrl+0x47/0x66 [nvme_core] nvme_sysfs_delete.cold.96+0x8/0xd [nvme_core] dev_attr_store+0x14/0x30 sysfs_kf_write+0x38/0x50 kernfs_fop_write_iter+0x146/0x1d0 new_sync_write+0x114/0x1b0 ? intel_pmu_handle_irq+0xe0/0x420 vfs_write+0x18d/0x270 ksys_write+0x61/0xe0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x37/0x90 entry_SYSCALL_64_after_hwframe+0x61/0xcb --
Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Reported-by: Jonathan Nicklin <jnicklin@blockbridge.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Jonathan Nicklin <jnicklin@blockbridge.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
show more ...
|
#
de777825 |
| 20-Sep-2022 |
Christoph Hellwig <hch@lst.de> |
nvme-tcp: use the tagset alloc/free helpers
Use the common helpers to allocate and free the tagsets. To make this work the generic nvme_ctrl now needs to be stored in the hctx private data instead
nvme-tcp: use the tagset alloc/free helpers
Use the common helpers to allocate and free the tagsets. To make this work the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_tcp_ctrl.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
show more ...
|