d3acd2a4 | 13-Feb-2025 | Caleb Sander Mateos <csander@purestorage.com>
nvme/ioctl: add missing space in err message
[ Upstream commit 487a3ea7b1b8ba2ca7d2c2bb3c3594dc360d6261 ]
nvme_validate_passthru_nsid() logs an err message whose format string is split over 2 lines. There is a missing space between the two pieces, resulting in log lines like "... does not match nsid (1)of namespace". Add the missing space between ")" and "of". Also combine the format string pieces onto a single line to make the err message easier to grep.
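A minimal sketch of the corrected call, with the format string on a single line (the variable names are paraphrased from context, not quoted from the patch):

    /* one-line format string: greppable, and no seam to lose a space in */
    dev_err(ctrl->device,
            "%s: nsid (%u) in cmd does not match nsid (%u) of namespace\n",
            current->comm, nsid, ns->head->ns_id);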
Fixes: e7d4b5493a2d ("nvme: factor out a nvme_validate_passthru_nsid helper")
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
42385f9c | 16-Dec-2024 | Georg Gottleuber <ggo@tuxedocomputers.com>
nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk
commit 11cb3529d18514f7d28ad2190533192aedefd761 upstream.
On the TUXEDO InfinityBook Pro Gen9 Intel, a Samsung 990 Evo NVMe leads to a high power consumption in s2idle sleep (4 watts).
This patch applies 'Force No Simple Suspend' quirk to achieve a sleep with a lower power consumption, typically around 1.2 watts.
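Such quirks are usually keyed off DMI board/product strings at PCI probe time. A hedged sketch of the pattern; the match strings here are placeholders, not the ones from the patch:

    /* force the non-"simple suspend" path so the device fully powers down */
    if (dmi_match(DMI_SYS_VENDOR, "TUXEDO") &&
        dmi_match(DMI_PRODUCT_NAME, "InfinityBook Pro Gen9"))  /* placeholder */
        quirks |= NVME_QUIRK_FORCE_NO_SIMPLE_SUSPEND;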
Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com>
Cc: stable@vger.kernel.org
Signed-off-by: Werner Sembach <wse@tuxedocomputers.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9db27ba3 | 16-Dec-2024 | Georg Gottleuber <ggo@tuxedocomputers.com>
nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk
commit dbf2bb1a1319b7c7d8828905378a6696cca6b0f2 upstream.
On the TUXEDO InfinityFlex, a Samsung 990 Evo NVMe leads to a high power consumption in s2idle sleep (4 watts).
This patch applies 'Force No Simple Suspend' quirk to achieve a sleep with a lower power consumption, typically around 1.4 watts.
Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com>
Cc: stable@vger.kernel.org
Signed-off-by: Werner Sembach <wse@tuxedocomputers.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
40a29e84 | 28-Jan-2025 | Daniel Wagner <wagi@kernel.org>
nvme-fc: use ctrl state getter
[ Upstream commit c8ed6cb5d37bc09c7e25e49a670e9fd1a3bd1dfa ]
Do not access the state variable directly; instead use proper synchronization so that no stale data is read.
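The getter takes one coherent snapshot of the state (it wraps the read in READ_ONCE()); a sketch of the call-site pattern, with the surrounding check purely illustrative:

    /* take one snapshot instead of re-reading ctrl->state repeatedly */
    enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);

    if (state != NVME_CTRL_CONNECTING)
        return;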
Fixes: e6e7f7ac03e4 ("nvme: ensure reset state check ordering")
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
79578630 | 09-Jan-2025 | Daniel Wagner <wagi@kernel.org>
nvme: handle connectivity loss in nvme_set_queue_count
[ Upstream commit 294b2b7516fd06a8dd82e4a6118f318ec521e706 ]
When the set features attempt fails with any NVMe status code in nvme_set_queue_count, the function still reports success, though the number of queues is set to 0. This is done to support controllers in a degraded state (the admin queue is still up and running, but there are no I/O queues).
There is an exception, though: when nvme_set_features reports a host path error, nvme_set_queue_count should propagate this error, as connectivity is lost, which means the admin queue is not working anymore either.
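The distinction is between a negative host-path error (propagate) and a positive NVMe status (treat as a degraded controller); a sketch of the resulting logic, not the literal patch:

    status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, NULL, 0,
                               &result);
    if (status < 0)
        return status;              /* host path error: connectivity lost */
    else if (status > 0) {
        dev_err(ctrl->device, "Could not set queue count (%d)\n", status);
        *count = 0;                 /* degraded: keep going with 0 I/O queues */
        return 0;
    }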
Fixes: 9a0be7abb62f ("nvme: refactor set_queue_count")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
db996ed1 | 13-Jan-2025 | Jens Axboe <axboe@kernel.dk>
nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
[ Upstream commit 170e086ad3997f816d1f551f178a03a626a130b7 ]
nvme_init_effects_log() returns failure when kzalloc() is successful, which is obviously wrong and causes failures to boot. Correct the check.
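The bug class is an inverted allocation check; the corrected shape, sketched (the ERR_PTR convention follows the surrounding patch, but this is a paraphrase):

    effects = kzalloc(sizeof(*effects), GFP_KERNEL);
    if (!effects)                    /* the broken check was "if (effects)" */
        return ERR_PTR(-ENOMEM);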
Fixes: d4a95adeabc6 ("nvme: Add error path for xa_store in nvme_init_effects")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
525dc0f6 | 16-Dec-2024 | Keisuke Nishimura <keisuke.nishimura@inria.fr>
nvme: Add error path for xa_store in nvme_init_effects
[ Upstream commit d4a95adeabc6b5a39405e49c6d5ed14dd83682c4 ]
The xa_store() may fail due to memory allocation failure because there is no guarantee that the index NVME_CSI_NVM is already used. This fix introduces a new function to handle the error path.
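xa_store() may allocate internal nodes, and it reports failure through an error-encoded return value; a sketch of the check, assuming the cels xarray used for these logs:

    old = xa_store(&ctrl->cels, NVME_CSI_NVM, effects, GFP_KERNEL);
    if (xa_is_err(old)) {
        kfree(effects);
        return ERR_PTR(xa_err(old));  /* e.g. -ENOMEM from node allocation */
    }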
Fixes: cc115cbe12d9 ("nvme: always initialize known command effects")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
48ef61d2 | 20-Dec-2024 | Keisuke Nishimura <keisuke.nishimura@inria.fr>
nvme: Add error check for xa_store in nvme_get_effects_log
[ Upstream commit ac32057acc7f3d7a238dafaa9b2aa2bc9750080e ]
The xa_store() may fail due to memory allocation failure because there is no guarantee that the index csi is already used. This fix adds an error check of the return value of xa_store() in nvme_get_effects_log().
Fixes: 1cf7a12e09aa ("nvme: use an xarray to lookup the Commands Supported and Effects log")
Signed-off-by: Keisuke Nishimura <keisuke.nishimura@inria.fr>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
34765438 | 12-Nov-2024 | Robert Beckett <bob.beckett@collabora.com>
nvme-pci: 512 byte aligned dma pool segment quirk
[ Upstream commit ebefac5647968679f6ef5803e5d35a71997d20fa ]
We initially introduced a quick fix limiting the queue depth to 1 as experimentation showed that it fixed data corruption on 64GB steamdecks.
Further experimentation revealed corruption only happens when the last PRP data element aligns to the end of the page boundary. The device appears to treat this as a PRP chain to a new list instead of the data element that it actually is. This implementation is in violation of the spec. Encountering this errata with the Linux driver requires the host to request a 128k transfer and coincidentally be handed the last small-pool DMA buffer within a page.
The QD1 quirk effectively works around this because the last data PRP was always at a 248-byte offset from the page start, so it never appeared at the end of the page, but it comes at the expense of throttling IO and wasting the remainder of the PRP page beyond 256 bytes. Also of note, the MDTS on these devices is small enough that the "large" PRP pool can hold enough PRP elements to never reach the end, so that pool is not a problem either.
Introduce a new quirk to ensure the small pool is always aligned such that the last PRP element can't appear at the end of the page. This comes at the expense of wasting 256 bytes per small-pool page allocated.
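dma_pool_create() takes an alignment argument, so the workaround amounts to over-aligning the small pool; a sketch under the assumption the quirk flag is named as below:

    size_t small_align = 256;

    if (dev->ctrl.quirks & NVME_QUIRK_DMAPOOL_ALIGN_512)
        small_align = 512;  /* a 256-byte PRP chunk can no longer end a page */

    dev->prp_small_pool = dma_pool_create("prp list 256", dev->dev,
                                          256, small_align, 0);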
Link: https://lore.kernel.org/linux-nvme/20241113043151.GA20077@lst.de/T/#u
Fixes: 83bdfcbdbe5d ("nvme-pci: qdepth 1 quirk")
Cc: Paweł Anikiel <panikiel@google.com>
Signed-off-by: Robert Beckett <bob.beckett@collabora.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
ddcc7d71 | 05-Nov-2024 | Nilay Shroff <nilay@linux.ibm.com>
Revert "nvme: make keep-alive synchronous operation"
[ Upstream commit 84488282166de6b6760ada8030e87aaa08bce3aa ]
This reverts commit d06923670b5a5f609603d4a9fee4dec02d38de9c.
It was realized that the fix implemented to contain the race condition between the keep-alive task and the fabric shutdown code path in commit d06923670b5a ("nvme: make keep-alive synchronous operation") is not optimal: keep-alive runs under a workqueue, and making it synchronous wastes a workqueue context. Furthermore, we later found that the above race condition is a regression caused by the changes implemented in commit a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()"). So we decided to revert commit d06923670b5a ("nvme: make keep-alive synchronous operation") and then fix the regression.
Link: https://lore.kernel.org/all/196f4013-3bbf-43ff-98b4-9cb2a96c20c2@grimberg.me/
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
801acf74 | 15-Oct-2024 | Nilay Shroff <nilay@linux.ibm.com>
nvme: use helper nvme_ctrl_state in nvme_keep_alive_finish function
[ Upstream commit 599d9f3a10eec69ef28a90161763e4bd7c9c02bf ]
We no longer need to acquire ctrl->lock before accessing the NVMe controller state; instead we can now use the helper nvme_ctrl_state. So replace the use of ctrl->lock in the nvme_keep_alive_finish function with a nvme_ctrl_state call.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Stable-dep-of: 84488282166d ("Revert "nvme: make keep-alive synchronous operation"")
Signed-off-by: Sasha Levin <sashal@kernel.org>
1e20e4ff | 05-Nov-2024 | Breno Leitao <leitao@debian.org>
nvme/multipath: Fix RCU list traversal to use SRCU primitive
[ Upstream commit 5dd18f09ce7399df6fffe80d1598add46c395ae9 ]
The code currently uses list_for_each_entry_rcu() while holding an SRCU lock, triggering false positive warnings with CONFIG_PROVE_RCU=y enabled:
drivers/nvme/host/multipath.c:168 RCU-list traversed in non-reader section!!
drivers/nvme/host/multipath.c:227 RCU-list traversed in non-reader section!!
drivers/nvme/host/multipath.c:260 RCU-list traversed in non-reader section!!
While the list is properly protected by SRCU lock, the code uses the wrong list traversal primitive. Replace list_for_each_entry_rcu() with list_for_each_entry_srcu() to correctly indicate SRCU-based protection and eliminate the false warning.
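The SRCU-aware iterator takes an extra lockdep condition that documents what protects the walk; roughly, for a namespace-head path walk (a sketch, not the patch itself):

    int srcu_idx;
    struct nvme_ns *ns;

    srcu_idx = srcu_read_lock(&head->srcu);
    list_for_each_entry_srcu(ns, &head->list, siblings,
                             srcu_read_lock_held(&head->srcu)) {
        /* ... inspect candidate paths ... */
    }
    srcu_read_unlock(&head->srcu, srcu_idx);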
Signed-off-by: Breno Leitao <leitao@debian.org>
Fixes: be647e2c76b2 ("nvme: use srcu for iterating namespace list")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
f0679539 | 14-Sep-2024 | Hannes Reinecke <hare@kernel.org>
nvme-multipath: avoid hang on inaccessible namespaces
[ Upstream commit 3b97f5a05cfc55e7729ff3769f63eef64e2178bb ]
During repetitive namespace remapping operations on the target, the namespace might have changed between the time the initial scan was performed and the time the partition scan was invoked by device_add_disk() in nvme_mpath_set_live(). We then end up with a stuck scanning process:
[<0>] folio_wait_bit_common+0x12a/0x310
[<0>] filemap_read_folio+0x97/0xd0
[<0>] do_read_cache_folio+0x108/0x390
[<0>] read_part_sector+0x31/0xa0
[<0>] read_lba+0xc5/0x160
[<0>] efi_partition+0xd9/0x8f0
[<0>] bdev_disk_changed+0x23d/0x6d0
[<0>] blkdev_get_whole+0x78/0xc0
[<0>] bdev_open+0x2c6/0x3b0
[<0>] bdev_file_open_by_dev+0xcb/0x120
[<0>] disk_scan_partitions+0x5d/0x100
[<0>] device_add_disk+0x402/0x420
[<0>] nvme_mpath_set_live+0x4f/0x1f0 [nvme_core]
[<0>] nvme_mpath_add_disk+0x107/0x120 [nvme_core]
[<0>] nvme_alloc_ns+0xac6/0xe60 [nvme_core]
[<0>] nvme_scan_ns+0x2dd/0x3e0 [nvme_core]
[<0>] nvme_scan_work+0x1a3/0x490 [nvme_core]
This happens when we have several paths, some of which are inaccessible, and the active paths are removed first. Then nvme_find_path() will requeue I/O in the ns_head (as paths are present), but the requeue list is never triggered as all remaining paths are inactive.
This patch checks for NVME_NSHEAD_DISK_LIVE in nvme_available_path(), and requeue I/O after NVME_NSHEAD_DISK_LIVE has been cleared once the last path has been removed to properly terminate pending I/O.
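A condensed sketch of the availability check described above (simplified; the loop body and per-controller state checks are elided):

    static bool nvme_available_path(struct nvme_ns_head *head)
    {
        struct nvme_ns *ns;

        /* disk is gone: fail pending I/O instead of requeueing forever */
        if (!test_bit(NVME_NSHEAD_DISK_LIVE, &head->flags))
            return false;

        list_for_each_entry_srcu(ns, &head->list, siblings,
                                 srcu_read_lock_held(&head->srcu)) {
            /* ... any controller not in a dead state keeps the head alive ... */
        }
        return false;
    }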
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Stable-dep-of: 5dd18f09ce73 ("nvme/multipath: Fix RCU list traversal to use SRCU primitive")
Signed-off-by: Sasha Levin <sashal@kernel.org>
85b9f3e6 | 25-Jun-2024 | Thomas Song <tsong@purestorage.com>
nvme-multipath: implement "queue-depth" iopolicy
[ Upstream commit f227345f0a70f011647ae7ae12778bf258ff71f2 ]
The round-robin path selector is inefficient in cases where there is a difference in latency between paths. In the presence of one or more high latency paths the round-robin selector continues to use the high latency path equally. This results in a bias towards the highest latency path and can cause a significant decrease in overall performance as IOs pile on the highest latency path. This problem is acute with NVMe-oF controllers.
The queue-depth path selector sends I/O down the path with the lowest number of requests in its request queue. Paths with lower latency will clear requests more quickly and have less requests queued compared to higher latency paths. The goal of this path selector is to make more use of lower latency paths which will bring down overall IO latency and increase throughput and performance.
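Conceptually, each controller keeps an atomic count of in-flight requests and the selector picks the path with the smallest count; a condensed sketch (ANA optimized/non-optimized grouping and other details omitted):

    struct nvme_ns *ns, *best = NULL;
    unsigned int min_depth = UINT_MAX;

    list_for_each_entry_srcu(ns, &head->list, siblings,
                             srcu_read_lock_held(&head->srcu)) {
        unsigned int depth = atomic_read(&ns->ctrl->nr_active);

        if (depth < min_depth) {
            min_depth = depth;
            best = ns;          /* fewest outstanding requests wins */
        }
    }
    /* best is the chosen path, or NULL if no path was usable */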
Signed-off-by: Thomas Song <tsong@purestorage.com>
[emilne: commandeered patch developed by Thomas Song @ Pure Storage]
Co-developed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Ewan D. Milne <emilne@redhat.com>
Co-developed-by: John Meneghini <jmeneghi@redhat.com>
Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Link: https://lore.kernel.org/linux-nvme/20240509202929.831680-1-jmeneghi@redhat.com/
Tested-by: Marco Patalano <mpatalan@redhat.com>
Tested-by: Jyoti Rani <jrani@purestorage.com>
Tested-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Stable-dep-of: 5dd18f09ce73 ("nvme/multipath: Fix RCU list traversal to use SRCU primitive")
Signed-off-by: Sasha Levin <sashal@kernel.org>
a7071e2b | 25-Jun-2024 | John Meneghini <jmeneghi@redhat.com>
nvme-multipath: prepare for "queue-depth" iopolicy
[ Upstream commit 3d7c2fd2ea704812867f9586270a2516377482a3 ]
This patch prepares for the introduction of a new iopolicy by breaking up the nvme_find_path() code path into sub-routines.
Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Stable-dep-of: 5dd18f09ce73 ("nvme/multipath: Fix RCU list traversal to use SRCU primitive")
Signed-off-by: Sasha Levin <sashal@kernel.org>
6b42ded8 | 29-Aug-2024 | Puranjay Mohan <pjy@amazon.com>
nvme: fix metadata handling in nvme-passthrough
commit 7c2fd76048e95dd267055b5f5e0a48e6e7c81fd9 upstream.
On an NVMe namespace that does not support metadata, it is possible to send an IO command with metadata through io-passthru. This allows issues like [1] to trigger in the completion code path. nvme_map_user_request() doesn't check if the namespace supports metadata before sending it forward. It also allows admin commands with metadata to be processed as it ignores metadata when bdev == NULL and may report success.
Reject an IO command with metadata when the NVMe namespace doesn't support it and reject an admin command if it has metadata.
[1] https://lore.kernel.org/all/mb61pcylvnym8.fsf@amazon.com/
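A sketch of the rejection logic in nvme_map_user_request() (the two booleans paraphrase the patch's conditions):

    bool supports_metadata = bdev && blk_get_integrity(bdev->bd_disk);
    bool has_metadata = meta_buffer && meta_len;

    /* I/O on a namespace without metadata support, and admin commands
     * (bdev == NULL) carrying metadata, are both rejected */
    if (!supports_metadata && has_metadata)
        return -EINVAL;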
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Puranjay Mohan <pjy@amazon.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
[ Minor changes to make it work on 6.6 ]
Signed-off-by: Hagar Hemdan <hagarhem@amazon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
51989929 | 13-Nov-2024 | Christoph Hellwig <hch@lst.de>
nvme-pci: reverse request order in nvme_queue_rqs
[ Upstream commit beadf0088501d9dcf2454b05d90d5d31ea3ba55f ]
blk_mq_flush_plug_list submits requests in the reverse order that they were submitted, which leads to a rather suboptimal I/O pattern especially in rotational devices. Fix this by rewriting nvme_queue_rqs so that it always pops the requests from the passed in request list, and then adds them to the head of a local submit list. This actually simplifies the code a bit as it removes the complicated list splicing, at the cost of extra updates of the rq_next pointer. As that should be cache hot anyway it should be an easy price to pay.
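The pop-and-push pattern, sketched with the block layer's rq_list helpers (rq_list_add() inserts at the head, which undoes the reversal):

    struct request *req, *submit_list = NULL;

    while ((req = rq_list_pop(rqlist))) {
        /* ... set up the NVMe command for req ... */
        rq_list_add(&submit_list, req);   /* head insert restores order */
    }
    /* submit_list now holds the requests in their original order */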
Fixes: d62cbcf62f2f ("nvme: add support for mq_ops->queue_rqs()")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241113152050.157179-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
cee3bff5 | 31-Oct-2024 | Christoph Hellwig <hch@lst.de>
nvme-pci: fix freeing of the HMB descriptor table
[ Upstream commit 3c2fb1ca8086eb139b2a551358137525ae8e0d7a ]
The HMB descriptor table is sized to the maximum number of descriptors that could be used for a given device, but __nvme_alloc_host_mem could break out of the loop earlier on memory allocation failure and end up using fewer descriptors than planned for, which leads to an incorrect size passed to dma_free_coherent.
In practice this was not showing up because the number of descriptors tends to be low and the dma coherent allocator always allocates and frees at least a page.
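The fix amounts to recording the size actually allocated and freeing with that same size; in outline (a paraphrase, with the size field assumed as below):

    /* allocation: remember the size handed to dma_alloc_coherent() */
    dev->host_mem_descs_size = max_entries * sizeof(*descs);
    descs = dma_alloc_coherent(dev->dev, dev->host_mem_descs_size,
                               &descs_dma, GFP_KERNEL);

    /* free: reuse the recorded size instead of recomputing it */
    dma_free_coherent(dev->dev, dev->host_mem_descs_size,
                      dev->host_mem_descs, dev->host_mem_descs_dma);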
Fixes: 87ad72a59a38 ("nvme-pci: implement host memory buffer support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
f7d9a185 | 26-Nov-2024 | Keith Busch <kbusch@kernel.org>
nvme: apple: fix device reference counting
[ Upstream commit b9ecbfa45516182cd062fecd286db7907ba84210 ]
Drivers must call nvme_uninit_ctrl after a successful nvme_init_ctrl. Split the allocation side out to make the error handling boundary easier to navigate. The apple driver had been doing this wrong, leaking the controller device memory on a tagset failure.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
[ Resolve minor conflicts ]
Signed-off-by: Bin Lan <bin.lan.cn@windriver.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
5a526388 | 04-Nov-2024 | Breno Leitao <leitao@debian.org>
nvme/host: Fix RCU list traversal to use SRCU primitive
[ Upstream commit 6d1c69945ce63a9fba22a4abf646cf960d878782 ]
The code currently uses list_for_each_entry_rcu() while holding an SRCU lock, triggering false positive warnings with CONFIG_PROVE_RCU=y enabled:
drivers/nvme/host/core.c:3770 RCU-list traversed in non-reader section!!
While the list is properly protected by SRCU lock, the code uses the wrong list traversal primitive. Replace list_for_each_entry_rcu() with list_for_each_entry_srcu() to correctly indicate SRCU-based protection and eliminate the false warning.
Fixes: be647e2c76b2 ("nvme: use srcu for iterating namespace list")
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
1a1bcca5 | 15-Oct-2024 | Nilay Shroff <nilay@linux.ibm.com>
nvme: make keep-alive synchronous operation
[ Upstream commit d06923670b5a5f609603d4a9fee4dec02d38de9c ]
The nvme keep-alive operation, which executes at a periodic interval, could potentially sneak in while shutting down a fabric controller. This may lead to a race between the fabric controller admin queue destroy code path (invoked while shutting down controller) and hw/hctx queue dispatcher called from the nvme keep-alive async request queuing operation. This race could lead to the kernel crash shown below:
Call Trace:
 autoremove_wake_function+0x0/0xbc (unreliable)
 __blk_mq_sched_dispatch_requests+0x114/0x24c
 blk_mq_sched_dispatch_requests+0x44/0x84
 blk_mq_run_hw_queue+0x140/0x220
 nvme_keep_alive_work+0xc8/0x19c [nvme_core]
 process_one_work+0x200/0x4e0
 worker_thread+0x340/0x504
 kthread+0x138/0x140
 start_kernel_thread+0x14/0x18
While shutting down the fabric controller, if an nvme keep-alive request sneaks in, it would be flushed off. The nvme_keep_alive_end_io function is then invoked to handle the end of the keep-alive operation, which decrements admin->q_usage_counter; assuming this is the last/only request in the admin queue, admin->q_usage_counter becomes zero. If that happens, the blk-mq destroy queue operation (blk_mq_destroy_queue()), which could potentially be running simultaneously on another cpu (as this is the controller shutdown code path), would make forward progress and delete the admin queue. So, from this point onward, we are not supposed to access the admin queue resources. However, the issue here is that the nvme keep-alive thread running the hw/hctx queue dispatch operation hasn't yet finished its work, so it could still potentially access the admin queue resource while the admin queue has already been deleted, and that causes the above crash.
This fix helps avoid the observed crash by implementing keep-alive as a synchronous operation so that we decrement admin->q_usage_counter only after keep-alive command finished its execution and returns the command status back up to its caller (blk_execute_rq()). This would ensure that fabric shutdown code path doesn't destroy the fabric admin queue until keep-alive request finished execution and also keep-alive thread is not running hw/hctx queue dispatch operation.
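Concretely, the async dispatch with an end_io callback becomes a blocking execute; a sketch (function names follow the surrounding patches in this series):

    /* block until the keep-alive completes, so q_usage_counter only
     * drops after the command is fully done */
    status = blk_execute_rq(rq, false);
    nvme_keep_alive_finish(rq, status, ctrl);
    blk_mq_free_request(rq);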
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
4a57f42e | 15-Oct-2024 | Keith Busch <kbusch@kernel.org>
nvme-multipath: defer partition scanning
[ Upstream commit 1f021341eef41e77a633186e9be5223de2ce5d48 ]
We need to suppress the partition scan from occurring within the controller's scan_work context. If a path error occurs here, the IO will wait until a path becomes available or all paths are torn down, but that action also occurs within scan_work, so it would deadlock. Defer the partition scan to a different context that does not block scan_work.
Reported-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
e04e6480 | 07-Oct-2024 | Greg Joyce <gjoyce@linux.ibm.com>
nvme: disable CC.CRIME (NVME_CC_CRIME)
[ Upstream commit 0ce96a6708f34280a536263ee5c67e20c433dcce ]
Disable NVME_CC_CRIME so that CSTS.RDY indicates that the media is ready and able to handle commands without returning NVME_SC_ADMIN_COMMAND_MEDIA_NOT_READY.
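Mechanically this is just no longer opting into CRIME when enabling the controller; a sketch of the dropped branch (paraphrased from the enable path, names assumed):

    /* removed: with CC.CRIME set, CSTS.RDY no longer implies media-ready */
    if (ctrl->cap & NVME_CAP_CRMS_CRIMS)
        ctrl->ctrl_config |= NVME_CC_CRIME;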
Signed-off-by: Greg Joyce <gjoyce@linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Tested-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
975cb1d2 | 01-Oct-2024 | Hannes Reinecke <hare@suse.de>
nvme: tcp: avoid race between queue_lock lock and destroy
[ Upstream commit 782373ba27660ba7d330208cf5509ece6feb4545 ]
Commit 76d54bf20cdc ("nvme-tcp: don't access released socket during error recovery") added a mutex_lock() call for the queue->queue_lock in nvme_tcp_get_address(). However, the mutex_lock() races with mutex_destroy() in nvme_tcp_free_queue(), and causes the WARN below.
DEBUG_LOCKS_WARN_ON(lock->magic != lock)
WARNING: CPU: 3 PID: 34077 at kernel/locking/mutex.c:587 __mutex_lock+0xcf0/0x1220
Modules linked in: nvmet_tcp nvmet nvme_tcp nvme_fabrics iw_cm ib_cm ib_core pktcdvd nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables qrtr sunrpc ppdev 9pnet_virtio 9pnet pcspkr netfs parport_pc parport e1000 i2c_piix4 i2c_smbus loop fuse nfnetlink zram bochs drm_vram_helper drm_ttm_helper ttm drm_kms_helper xfs drm sym53c8xx floppy nvme scsi_transport_spi nvme_core nvme_auth serio_raw ata_generic pata_acpi dm_multipath qemu_fw_cfg [last unloaded: ib_uverbs]
CPU: 3 UID: 0 PID: 34077 Comm: udisksd Not tainted 6.11.0-rc7 #319
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
RIP: 0010:__mutex_lock+0xcf0/0x1220
Code: 08 84 d2 0f 85 c8 04 00 00 8b 15 ef b6 c8 01 85 d2 0f 85 78 f4 ff ff 48 c7 c6 20 93 ee af 48 c7 c7 60 91 ee af e8 f0 a7 6d fd <0f> 0b e9 5e f4 ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1
RSP: 0018:ffff88811305f760 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88812c652058 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001
RBP: ffff88811305f8b0 R08: 0000000000000001 R09: ffffed1075c36341
R10: ffff8883ae1b1a0b R11: 0000000000010498 R12: 0000000000000000
R13: 0000000000000000 R14: dffffc0000000000 R15: ffff88812c652058
FS: 00007f9713ae4980(0000) GS:ffff8883ae180000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcd78483c7c CR3: 0000000122c38000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? __warn.cold+0x5b/0x1af
 ? __mutex_lock+0xcf0/0x1220
 ? report_bug+0x1ec/0x390
 ? handle_bug+0x3c/0x80
 ? exc_invalid_op+0x13/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? __mutex_lock+0xcf0/0x1220
 ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
 ? __pfx___mutex_lock+0x10/0x10
 ? __lock_acquire+0xd6a/0x59e0
 ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
 nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
 ? __pfx_nvme_tcp_get_address+0x10/0x10 [nvme_tcp]
 nvme_sysfs_show_address+0x81/0xc0 [nvme_core]
 dev_attr_show+0x42/0x80
 ? __asan_memset+0x1f/0x40
 sysfs_kf_seq_show+0x1f0/0x370
 seq_read_iter+0x2cb/0x1130
 ? rw_verify_area+0x3b1/0x590
 ? __mutex_lock+0x433/0x1220
 vfs_read+0x6a6/0xa20
 ? lockdep_hardirqs_on+0x78/0x100
 ? __pfx_vfs_read+0x10/0x10
 ksys_read+0xf7/0x1d0
 ? __pfx_ksys_read+0x10/0x10
 ? __x64_sys_openat+0x105/0x1d0
 do_syscall_64+0x93/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? __pfx_ksys_read+0x10/0x10
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? do_syscall_64+0x9f/0x180
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f9713f55cfa
Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8 74 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 75 f8 ff 48 8b
RSP: 002b:00007ffd7f512e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 000055c38f316859 RCX: 00007f9713f55cfa
RDX: 0000000000000fff RSI: 00007ffd7f512eb0 RDI: 0000000000000011
RBP: 00007ffd7f512e90 R08: 0000000000000000 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000055c38f317148
R13: 0000000000000000 R14: 00007f96f4004f30 R15: 000055c3b6b623c0
 </TASK>
The WARN is observed when the blktests test case nvme/014 is repeated with the tcp transport. It is rare; in some test environments, around 200 repetitions are required to recreate it.
To avoid the WARN, check the NVME_TCP_Q_LIVE flag before locking queue->queue_lock. The flag is cleared long time before the lock gets destroyed.
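A sketch of the guarded access in nvme_tcp_get_address() (simplified):

    struct nvme_tcp_queue *queue = &ctrl->queues[0];

    /* queue_lock is destroyed in nvme_tcp_free_queue(); only take it while
     * the queue is live (the flag is cleared long before the destroy) */
    if (test_bit(NVME_TCP_Q_LIVE, &queue->flags)) {
        mutex_lock(&queue->queue_lock);
        /* ... append the source address for sysfs output ... */
        mutex_unlock(&queue->queue_lock);
    }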
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
4ed32cc0 | 15-Oct-2024 | Maurizio Lombardi <mlombard@redhat.com>
nvme-pci: fix race condition between reset and nvme_dev_disable()
[ Upstream commit 26bc0a81f64ce00fc4342c38eeb2eddaad084dd2 ]
nvme_dev_disable() modifies the dev->online_queues field, therefore nvme_pci_update_nr_queues() should avoid racing against it, otherwise we could end up passing invalid values to blk_mq_update_nr_hw_queues().
WARNING: CPU: 39 PID: 61303 at drivers/pci/msi/api.c:347 pci_irq_get_affinity+0x187/0x210
Workqueue: nvme-reset-wq nvme_reset_work [nvme]
RIP: 0010:pci_irq_get_affinity+0x187/0x210
Call Trace:
 <TASK>
 ? blk_mq_pci_map_queues+0x87/0x3c0
 ? pci_irq_get_affinity+0x187/0x210
 blk_mq_pci_map_queues+0x87/0x3c0
 nvme_pci_map_queues+0x189/0x460 [nvme]
 blk_mq_update_nr_hw_queues+0x2a/0x40
 nvme_reset_work+0x1be/0x2a0 [nvme]
Fix the bug by locking the shutdown_lock mutex before using dev->online_queues. Give up if nvme_dev_disable() is running or if it has been executed already.
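A sketch of the guarded update (simplified; the real patch also handles the no-tagset case):

    static bool nvme_pci_update_nr_queues(struct nvme_dev *dev)
    {
        /* don't race nvme_dev_disable(): it holds shutdown_lock while
         * changing dev->online_queues */
        if (!mutex_trylock(&dev->shutdown_lock))
            return false;

        /* give up if disable already tore the I/O queues down */
        if (dev->online_queues < 2) {
            mutex_unlock(&dev->shutdown_lock);
            return false;
        }

        blk_mq_update_nr_hw_queues(&dev->tagset, dev->online_queues - 1);
        mutex_unlock(&dev->shutdown_lock);
        return true;
    }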
Fixes: 949928c1c731 ("NVMe: Fix possible queue use after freed")
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>