- 15 11月, 2022 4 次提交
-
-
由 Christoph Hellwig 提交于
While the specification allows devices to either deallocate data or to actually write zeroes on any Write Zeroes command, many SSDs only do the sensible thing and deallocate data when the DEAC bit is specific. Set it when it is supported and the caller doesn't explicitly opt out of deallocation. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
-
由 Kanchan Joshi 提交于
Allow all identify-namespace variants (CNS 00h, 05h and 08h) without requiring CAP_SYS_ADMIN. The information (retrieved using id-ns) is needed to form IO commands for passthrough interface. Signed-off-by: NKanchan Joshi <joshi.k@samsung.com> Reviewed-by: NJens Axboe <axboe@kernel.dk> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Kanchan Joshi 提交于
Currently both io and admin commands are kept under a coarse-granular CAP_SYS_ADMIN check, disregarding file mode completely. $ ls -l /dev/ng* crw-rw-rw- 1 root root 242, 0 Sep 9 19:20 /dev/ng0n1 crw------- 1 root root 242, 1 Sep 9 19:20 /dev/ng0n2 In the example above, ng0n1 appears as if it may allow unprivileged read/write operation but it does not and behaves same as ng0n2. This patch implements a shift from CAP_SYS_ADMIN to more fine-granular control for io-commands. If CAP_SYS_ADMIN is present, nothing else is checked as before. Otherwise, following rules are in place - any admin-cmd is not allowed - vendor-specific and fabric commmand are not allowed - io-commands that can write are allowed if matching FMODE_WRITE permission is present - io-commands that read are allowed Add a helper nvme_cmd_allowed that implements above policy. Change all the callers of CAP_SYS_ADMIN to go through nvme_cmd_allowed for any decision making. Since file open mode is counted for any approval/denial, change at various places to keep file-mode information handy. Signed-off-by: NKanchan Joshi <joshi.k@samsung.com> Reviewed-by: NJens Axboe <axboe@kernel.dk> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Christophe JAILLET 提交于
sizeof( struct nvmefc_ls_rcv_op ) = 64 sizeof( union nvmefc_ls_requests ) = 1024 sizeof( union nvmefc_ls_responses ) = 128 So, in nvme_fc_rcv_ls_req(), 1216 bytes of memory are requested when kzalloc() is called. Because of the way memory allocations are performed, 2048 bytes are allocated. So about 800 bytes are wasted for each request. Switch to 3 distinct memory allocations, in order to: - save these 800 bytes - avoid zeroing this extra memory - make sure that memory is properly aligned in case of DMA access ("fc_dma_map_single(lsop->rspbuf)" just a few lines below) Signed-off-by: NChristophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: NJames Smart <jsmart2021@gmail.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 02 11月, 2022 11 次提交
-
-
由 Chao Leng 提交于
All controller namespaces share the same tagset, so we can use this interface which does the optimal operation for parallel quiesce based on the tagset type(e.g. blocking tagsets and non-blocking tagsets). nvme connect_q should not be quiesced when quiesce tagset, so set the QUEUE_FLAG_SKIP_TAGSET_QUIESCE to skip it when init connect_q. Currently we use NVME_NS_STOPPED to ensure pairing quiescing and unquiescing. If use blk_mq_[un]quiesce_tagset, NVME_NS_STOPPED will be invalided, so introduce NVME_CTRL_STOPPED to replace NVME_NS_STOPPED. In addition, we never really quiesce a single namespace. It is a better choice to move the flag from ns to ctrl. Signed-off-by: NChao Leng <lengchao@huawei.com> [hch: rebased on top of prep patches] Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NChao Leng <lengchao@huawei.com> Reviewed-by: NHannes Reinecke <hare@suse.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-15-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Nothing in blk_mq_wait_quiesce_done needs the request_queue now, so just pass the tagset, and move the non-mq check into the only caller that needs it. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChao Leng <lengchao@huawei.com> Reviewed-by: NHannes Reinecke <hare@suse.de> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-13-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
apple_nvme_reset_work schedules apple_nvme_remove, to be called, which will call apple_nvme_disable and unquiesce the I/O queues. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-10-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
nvme_remove_dead_ctrl schedules nvme_remove to be called, which will call nvme_dev_disable and unquiesce the I/O queues. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-9-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
nvme_kill_queues does two things: 1) mark the gendisk of all namespaces dead 2) unquiesce all I/O queues These used to be be intertwined due to block layer issues, but aren't any more. So move the unquiscing of the I/O queues into the callers, and rename the rest of the function to the now more descriptive nvme_mark_namespaces_dead. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-8-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
None of the callers of nvme_kill_queues needs it to unquiesce the admin queues, as all of them already do it themselves: 1) nvme_reset_work explicit call nvme_start_admin_queue toward the beginning of the function. The extra call to nvme_start_admin_queue in nvme_reset_work this won't do anything as NVME_CTRL_ADMIN_Q_STOPPED will already be cleared. 2) nvme_remove calls nvme_dev_disable with shutdown flag set to true at the very beginning of the function if the PCIe device was not present, which is the precondition for the call to nvme_kill_queues. nvme_dev_disable already calls nvme_start_admin_queue toward the end of the function when the shutdown flag is set to true, so the admin queue is already enabled at this point. 3) nvme_remove_dead_ctrl schedules a workqueue to unbind the driver, which will end up in nvme_remove, which calls nvme_dev_disable with the shutdown flag. This case will call nvme_start_admin_queue a bit later than before. 4) apple_nvme_remove uses the same sequence as nvme_remove_dead_ctrl above. 5) nvme_remove_namespaces only calls nvme_kill_queues when the controller is in the DEAD state. That can only happen in the PCIe driver, and only from nvme_remove. See item 2) above for the conditions there. So it is safe to just remove the call to nvme_start_admin_queue in nvme_kill_queues without replacement. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-7-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
At the point where namespaces are marked dead, the controller is in a non-live state and we won't get pass the identify commands. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-6-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
The NVME_NS_DEAD check only made sense when we revalidated namespaces in nvme_passthrough_end for commands that affected the namespace inventory. These days NVME_NS_DEAD is only set during reset or when tearing down namespaces, and we always remove all namespaces right after that. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-5-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
The call to nvme_remove_invalid_namespaces made sense when nvme_passthru_end revalidated all namespaces and had to remove those that didn't exist any more. Since we don't revalidate from nvme_passthru_end now, this call is entirely spurious. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-4-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
The code to create, update or delete a tagset and namespaces in nvme_reset_work is a bit convoluted. Refactor it with a two high-level conditionals for first probe vs reset and I/O queues vs no I/O queues to make the code flow more clear. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-3-hch@lst.de [axboe: fix whitespace issue] Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
nvme and xen-blkfront are already doing this to stop buffered writes from creating dirty pages that can't be written out later. Move it to the common code. This also removes the comment about the ordering from nvme, as bd_mutex not only is gone entirely, but also hasn't been used for locking updates to the disk size long before that, and thus the ordering requirement documented there doesn't apply any more. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NChao Leng <lengchao@huawei.com> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20221101150050.3510-2-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 25 10月, 2022 3 次提交
-
-
由 Christoph Hellwig 提交于
Now that blk_mq_destroy_queue does not release the queue reference, there is no need for a second admin queue reference to be held by the apple_nvme structure. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NSven Peter <sven@svenpeter.dev> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20221018135720.670094-5-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Now that blk_mq_destroy_queue does not release the queue reference, there is no need for a second admin queue reference to be held by the nvme_dev. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20221018135720.670094-4-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
The fact that blk_mq_destroy_queue also drops a queue reference leads to various places having to grab an extra reference. Move the call to blk_put_queue into the callers to allow removing the extra references. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20221018135720.670094-2-hch@lst.de [axboe: fix fabrics_q vs admin_q conflict in nvme core.c] Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 19 10月, 2022 5 次提交
-
-
由 Serge Semin 提交于
Recent commit 52fde2c0 ("nvme: set dma alignment to dword") has caused a regression on our platform. It turned out that the nvme_get_log() method invocation caused the nvme_hwmon_data structure instance corruption. In particular the nvme_hwmon_data.ctrl pointer was overwritten either with zeros or with garbage. After some research we discovered that the problem happened even before the actual NVME DMA execution, but during the buffer mapping. Since our platform is DMA-noncoherent, the mapping implied the cache-line invalidations or write-backs depending on the DMA-direction parameter. In case of the NVME SMART log getting the DMA was performed from-device-to-memory, thus the cache-invalidation was activated during the buffer mapping. Since the log-buffer isn't cache-line aligned, the cache-invalidation caused the neighbour data to be discarded. The neighbouring data turned to be the data surrounding the buffer in the framework of the nvme_hwmon_data structure. In order to fix that we need to make sure that the whole log-buffer is defined within the cache-line-aligned memory region so the cache-invalidation procedure wouldn't involve the adjacent data. One of the option to guarantee that is to kmalloc the DMA-buffer [1]. Seeing the rest of the NVME core driver prefer that method it has been chosen to fix this problem too. Note after a deeper researches we found out that the denoted commit wasn't a root cause of the problem. It just revealed the invalidity by activating the DMA-based NVME SMART log getting performed in the framework of the NVME hwmon driver. The problem was here since the initial commit of the driver. [1] Documentation/core-api/dma-api-howto.rst Fixes: 400b6a7b ("nvme: Add hardware monitoring support") Signed-off-by: NSerge Semin <Sergey.Semin@baikalelectronics.ru> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Christoph Hellwig 提交于
An NVMe controller works perfectly fine even when the hwmon initialization fails. Stop returning errors that do not come from a controller reset from nvme_hwmon_init to handle this case consistently. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NGuenter Roeck <linux@roeck-us.net> Reviewed-by: NSerge Semin <fancer.lancer@gmail.com>
-
由 Russell King (Oracle) 提交于
NVMe uses PRPs for data transfers and has no specific limit for a single DMA segement. Limiting the size will cause problems because the block layer assumes PRP-ish devices using a virt boundary mask don't have a segment limit. And while this is true, we also really need to tell the DMA mapping layer about it, otherwise dma-debug will trip over it. Fixes: 5bd2927a ("nvme-apple: Add initial Apple SoC NVMe driver") Suggested-by: NSven Peter <sven@svenpeter.dev> Signed-off-by: NRussell King (Oracle) <rmk+kernel@armlinux.org.uk> [hch: rewrote the commit message based on the PCIe commit] Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NEric Curtin <ecurtin@redhat.com> Reviewed-by: NSven Peter <sven@svenpeter.dev>
-
由 Xander Li 提交于
Kingston SSDs do support NVMe Write_Zeroes cmd but take long time to process. The firmware version is locked by these SSDs, we can not expect firmware improvement, so disable Write_Zeroes cmd. Signed-off-by: NXander Li <xander_li@kingston.com.tw> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Dan Carpenter 提交于
There is typo here so it releases the wrong variable. "ctrl->admin_q" was intended instead of "ctrl->fabrics_q". Fixes: fe60e8c5 ("nvme: add common helpers to allocate and free tagsets") Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 12 10月, 2022 5 次提交
-
-
由 Sagi Grimberg 提交于
When we revalidate paths as part of ns size change (as of commit e7d65803), it is possible that during the path revalidation, the only paths that is IO capable (i.e. optimized/non-optimized) are the ones that ns resize was not yet informed to the host, which will cause inflight requests to be requeued (as we have available paths but none are IO capable). These requests on the requeue list are waiting for someone to resubmit them at some point. The IO capable paths will eventually notify the ns resize change to the host, but there is nothing that will kick the requeue list to resubmit the queued requests. Fix this by always kicking the requeue list, and if no IO capable path exists, these requests will be queued again. A typical log that indicates that IOs are requeued: -- nvme nvme1: creating 4 I/O queues. nvme nvme1: new ctrl: "testnqn1" nvme nvme2: creating 4 I/O queues. nvme nvme2: mapped 4/0/0 default/read/poll queues. nvme nvme2: new ctrl: NQN "testnqn1", addr 127.0.0.1:8009 nvme nvme1: rescanning namespaces. nvme1n1: detected capacity change from 2097152 to 4194304 block nvme1n1: no usable path - requeuing I/O block nvme1n1: no usable path - requeuing I/O block nvme1n1: no usable path - requeuing I/O block nvme1n1: no usable path - requeuing I/O block nvme1n1: no usable path - requeuing I/O block nvme1n1: no usable path - requeuing I/O block nvme1n1: no usable path - requeuing I/O block nvme1n1: no usable path - requeuing I/O block nvme1n1: no usable path - requeuing I/O block nvme1n1: no usable path - requeuing I/O nvme nvme2: rescanning namespaces. -- Reported-by: NYogev Cohen <yogev@lightbitslabs.com> Fixes: e7d65803 ("nvme-multipath: revalidate paths during rescan") Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Cc: <stable@vger.kernel.org> # v5.15+ Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Xi Ruoyao 提交于
ZHITAI TiPro5000 SSDs has the same APST sleep problem as its cousin, TiPro7000. The quirk for TiPro7000 has been added in commit 6b961bce ("nvme-pci: avoid the deepest sleep state on ZHITAI TiPro7000 SSDs"), use the same quirk for TiPro5000. The ASPT data from "nvme id-ctrl /dev/nvme1": vid : 0x1e49 ssvid : 0x1e49 sn : ZTA21T0KA2227304LM mn : ZHITAI TiPlus5000 1TB fr : ZTA09139 [...] ps 0 : mp:6.50W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:- ps 1 : mp:5.80W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:- ps 2 : mp:3.60W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:- ps 3 : mp:0.0500W non-operational enlat:5000 exlat:10000 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:- ps 4 : mp:0.0025W non-operational enlat:8000 exlat:45000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:- Reported-and-tested-by: NChang Feng <flukehn@gmail.com> Signed-off-by: NXi Ruoyao <xry111@xry111.site> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Abhijit 提交于
Add a quirk to fix Lexar NM760 SSD drives reporting duplicate nsids. Signed-off-by: NAbhijit <abhijit@abhijittomar.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Sagi Grimberg 提交于
When we delete a controller, we execute the following: 1. nvme_stop_ctrl() - stop some work elements that may be inflight or scheduled (specifically also .stop_ctrl which cancels ctrl error recovery work) 2. nvme_remove_namespaces() - which first flushes scan_work to avoid competing ns addition/removal 3. continue to teardown the controller However, if err_work was scheduled to run in (1), it is designed to cancel any inflight I/O, particularly I/O that is originating from ns scan_work in (2), but because it is cancelled in .stop_ctrl(), we can prevent forward progress of (2) as ns scanning is blocking on I/O (that will never be cancelled). The race is: 1. transport layer error observed -> err_work is scheduled 2. scan_work executes, discovers ns, generate I/O to it 3. nvme_ctop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work) - err_work never executed 4. nvme_remove_namespaces() -> flush_work(scan_work) --> deadlock, because scan_work is blocked on I/O that was supposed to be cancelled by err_work, but was cancelled before executing (see stack trace [1]). Fix this by flushing err_work instead of cancelling it, to force it to execute and cancel all inflight I/O. [1]: -- Call Trace: <TASK> __schedule+0x390/0x910 ? scan_shadow_nodes+0x40/0x40 schedule+0x55/0xe0 io_schedule+0x16/0x40 do_read_cache_page+0x55d/0x850 ? __page_cache_alloc+0x90/0x90 read_cache_page+0x12/0x20 read_part_sector+0x3f/0x110 amiga_partition+0x3d/0x3e0 ? osf_partition+0x33/0x220 ? put_partition+0x90/0x90 bdev_disk_changed+0x1fe/0x4d0 blkdev_get_whole+0x7b/0x90 blkdev_get_by_dev+0xda/0x2d0 device_add_disk+0x356/0x3b0 nvme_mpath_set_live+0x13c/0x1a0 [nvme_core] ? nvme_parse_ana_log+0xae/0x1a0 [nvme_core] nvme_update_ns_ana_state+0x3a/0x40 [nvme_core] nvme_mpath_add_disk+0x120/0x160 [nvme_core] nvme_alloc_ns+0x594/0xa00 [nvme_core] nvme_validate_or_alloc_ns+0xb9/0x1a0 [nvme_core] ? __nvme_submit_sync_cmd+0x1d2/0x210 [nvme_core] nvme_scan_work+0x281/0x410 [nvme_core] process_one_work+0x1be/0x380 worker_thread+0x37/0x3b0 ? process_one_work+0x380/0x380 kthread+0x12d/0x150 ? set_kthread_struct+0x50/0x50 ret_from_fork+0x1f/0x30 </TASK> INFO: task nvme:6725 blocked for more than 491 seconds. Not tainted 5.15.65-f0.el7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:nvme state:D stack: 0 pid: 6725 ppid: 1761 flags:0x00004000 Call Trace: <TASK> __schedule+0x390/0x910 ? sched_clock+0x9/0x10 schedule+0x55/0xe0 schedule_timeout+0x24b/0x2e0 ? try_to_wake_up+0x358/0x510 ? finish_task_switch+0x88/0x2c0 wait_for_completion+0xa5/0x110 __flush_work+0x144/0x210 ? worker_attach_to_pool+0xc0/0xc0 flush_work+0x10/0x20 nvme_remove_namespaces+0x41/0xf0 [nvme_core] nvme_do_delete_ctrl+0x47/0x66 [nvme_core] nvme_sysfs_delete.cold.96+0x8/0xd [nvme_core] dev_attr_store+0x14/0x30 sysfs_kf_write+0x38/0x50 kernfs_fop_write_iter+0x146/0x1d0 new_sync_write+0x114/0x1b0 ? intel_pmu_handle_irq+0xe0/0x420 vfs_write+0x18d/0x270 ksys_write+0x61/0xe0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x37/0x90 entry_SYSCALL_64_after_hwframe+0x61/0xcb -- Fixes: 3f2304f8 ("nvme-tcp: add NVMe over TCP host driver") Reported-by: NJonathan Nicklin <jnicklin@blockbridge.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Tested-by: NJonathan Nicklin <jnicklin@blockbridge.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Sagi Grimberg 提交于
When we delete a controller, we execute the following: 1. nvme_stop_ctrl() - stop some work elements that may be inflight or scheduled (specifically also .stop_ctrl which cancels ctrl error recovery work) 2. nvme_remove_namespaces() - which first flushes scan_work to avoid competing ns addition/removal 3. continue to teardown the controller However, if err_work was scheduled to run in (1), it is designed to cancel any inflight I/O, particularly I/O that is originating from ns scan_work in (2), but because it is cancelled in .stop_ctrl(), we can prevent forward progress of (2) as ns scanning is blocking on I/O (that will never be cancelled). The race is: 1. transport layer error observed -> err_work is scheduled 2. scan_work executes, discovers ns, generate I/O to it 3. nvme_ctop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work) - err_work never executed 4. nvme_remove_namespaces() -> flush_work(scan_work) --> deadlock, because scan_work is blocked on I/O that was supposed to be cancelled by err_work, but was cancelled before executing. Fix this by flushing err_work instead of cancelling it, to force it to execute and cancel all inflight I/O. Fixes: b435ecea ("nvme: Add .stop_ctrl to nvme ctrl ops") Fixes: f6c8e432 ("nvme: flush namespace scanning work just before removing namespaces") Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 30 9月, 2022 8 次提交
-
-
由 Kanchan Joshi 提交于
if io_uring sends passthrough command with IORING_URING_CMD_FIXED flag, use the pre-registered buffer for IO (non-vectored variant). Pass the buffer/length to io_uring and get the bvec iterator for the range. Next, pass this bvec to block-layer and obtain a bio/request for subsequent processing. Signed-off-by: NKanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220930062749.152261-13-anuj20.g@samsung.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Kanchan Joshi 提交于
This is a prep patch. Modify nvme_submit_user_cmd and nvme_map_user_request to take ubuffer as plain integer argument, and do away with nvme_to_user_ptr conversion in callers. Signed-off-by: NAnuj Gupta <anuj20.g@samsung.com> Signed-off-by: NKanchan Joshi <joshi.k@samsung.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220930062749.152261-12-anuj20.g@samsung.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Kanchan Joshi 提交于
nvme_alloc_request expects a large number of parameters. Split this out into two functions to reduce number of parameters. First one retains the name nvme_alloc_request, while second one is named nvme_map_user_request. Signed-off-by: NKanchan Joshi <joshi.k@samsung.com> Signed-off-by: NAnuj Gupta <anuj20.g@samsung.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220930062749.152261-8-anuj20.g@samsung.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Kanchan Joshi 提交于
Pass struct request rather than bio. It helps to kill a parameter, and some processing clean-up too. Signed-off-by: NKanchan Joshi <joshi.k@samsung.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220930062749.152261-7-anuj20.g@samsung.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Anuj Gupta 提交于
User blk_rq_map_user_io instead of duplicating the same code at different places Signed-off-by: NAnuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/20220930062749.152261-6-anuj20.g@samsung.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Now that the normal passthrough end_io path doesn't need the request anymore, we can kill the explicit blk_mq_free_request() and just pass back RQ_END_IO_FREE instead. This enables the batched completion from freeing batches of requests at the time. This brings passthrough IO performance at least on par with bdev based O_DIRECT with io_uring. With this and batche allocations, peak performance goes from 110M IOPS to 122M IOPS. For IRQ based, passthrough is now also about 10% faster than previously, going from ~61M to ~67M IOPS. Reviewed-by: NAnuj Gupta <anuj20.g@samsung.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NKeith Busch <kbusch@kernel.org> Co-developed-by: NStefan Roesch <shr@fb.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
By splitting up the metadata and non-metadata end_io handling, we can remove any request dependencies on the normal non-metadata IO path. This is in preparation for enabling the normal IO passthrough path to pass the ownership of the request back to the block layer. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NAnuj Gupta <anuj20.g@samsung.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NKeith Busch <kbusch@kernel.org> Co-developed-by: NStefan Roesch <shr@fb.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Everything is just converted to returning RQ_END_IO_NONE, and there should be no functional changes with this patch. In preparation for allowing the end_io handler to pass ownership back to the block layer, rather than retain ownership of the request. Reviewed-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 27 9月, 2022 4 次提交
-
-
由 Christoph Hellwig 提交于
Unused now. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
-
由 Christoph Hellwig 提交于
Use the common helpers to allocate and free the tagsets. To make this work the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_fc_ctrl. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NJames Smart <jsmart2021@gmail.com>
-
由 Christoph Hellwig 提交于
Point the private data to the generic controller structure in preparation of using the common tagset init/exit code and use the chance the cleanup the init_hctx methods a bit. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NJames Smart <jsmart2021@gmail.com>
-
由 Christoph Hellwig 提交于
Also update the sqsize field when capping the queue size, and remove the check a queue size that is larger than sqsize given that sqsize is only initialized from opts->queue_size. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NJames Smart <jsmart2021@gmail.com>
-