- 22 8月, 2020 12 次提交
-
-
由 Keith Busch 提交于
Zoned block devices reuse the chunk_sectors queue limit to define zone boundaries. If a such a device happens to also report an optimal boundary, do not use that to define the chunk_sectors as that may intermittently interfere with io splitting and zone size queries. Signed-off-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
All operations are based on the controller, not the host page size. Switch the dma pool to use the controller page size as well to avoid massive overallocations on large page size systems. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 John Garry 提交于
Recently nvme_dev.q_depth was changed from an int to u16 type. This falls over for the queue depth calculation in nvme_pci_enable(), where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow as a u16, as NVME_CAP_MQES() is a 16b number also. That happens for me, and this is the result: root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010 Mem abort info: ESR = 0x96000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=00000a27bf3c9000 [0000000000000010] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 96000004 [#1] PREEMPT SMP Modules linked in: nvme nvme_core CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted 5.8.0-next-20200812 #27 Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 - V1.16.01 03/15/2019 Workqueue: nvme-reset-wq nvme_reset_work [nvme] pstate: 80c00009 (Nzcv daif +PAN +UAO BTYPE=--) pc : __sg_alloc_table_from_pages+0xec/0x238 lr : __sg_alloc_table_from_pages+0xc8/0x238 sp : ffff800013ccbad0 x29: ffff800013ccbad0 x28: ffff0a27b3d380a8 x27: 0000000000000000 x26: 0000000000002dc2 x25: 0000000000000dc0 x24: 0000000000000000 x23: 0000000000000000 x22: ffff800013ccbbe8 x21: 0000000000000010 x20: 0000000000000000 x19: 00000000fffff000 x18: ffffffffffffffff x17: 00000000000000c0 x16: fffffe289eaf6380 x15: ffff800011b59948 x14: ffff002bc8fe98f8 x13: ff00000000000000 x12: ffff8000114ca000 x11: 0000000000000000 x10: ffffffffffffffff x9 : ffffffffffffffc0 x8 : ffff0a27b5f9b6a0 x7 : 0000000000000000 x6 : 0000000000000001 x5 : ffff0a27b5f9b680 x4 : 0000000000000000 x3 : ffff0a27b5f9b680 x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000000000 Call trace: __sg_alloc_table_from_pages+0xec/0x238 sg_alloc_table_from_pages+0x18/0x28 iommu_dma_alloc+0x474/0x678 dma_alloc_attrs+0xd8/0xf0 nvme_alloc_queue+0x114/0x160 [nvme] nvme_reset_work+0xb34/0x14b4 [nvme] process_one_work+0x1e8/0x360 worker_thread+0x44/0x478 kthread+0x150/0x158 ret_from_fork+0x10/0x34 Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5) ---[ end trace 89bb2b72d59bf925 ]--- Fix by making onto a u32. Also use u32 for nvme_dev.q_depth, as we assign this value from nvme_dev.q_depth, and nvme_dev.q_depth will possibly hold 65536 - this avoids the same crash as above. Fixes: 61f3b896 ("nvme-pci: use unsigned for io queue depth") Signed-off-by: NJohn Garry <john.garry@huawei.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Logan Gunthorpe 提交于
When locking the ctrl->lock spinlock IRQs need to be disabled to avoid a dead lock. The new spin_lock() calls recently added produce the following lockdep warning when running the blktest nvme/003: ================================ WARNING: inconsistent lock state -------------------------------- inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. ksoftirqd/2/22 [HC0[0]:SC1[1]:HE0:SE0] takes: ffff888276a8c4c0 (&ctrl->lock){+.?.}-{2:2}, at: nvme_keep_alive_end_io+0x50/0xc0 {SOFTIRQ-ON-W} state was registered at: lock_acquire+0x164/0x500 _raw_spin_lock+0x28/0x40 nvme_get_effects_log+0x37/0x1c0 nvme_init_identify+0x9e4/0x14f0 nvme_reset_work+0xadd/0x2360 process_one_work+0x66b/0xb70 worker_thread+0x6e/0x6c0 kthread+0x1e7/0x210 ret_from_fork+0x22/0x30 irq event stamp: 1449221 hardirqs last enabled at (1449220): [<ffffffff81c58e69>] ktime_get+0xf9/0x140 hardirqs last disabled at (1449221): [<ffffffff83129665>] _raw_spin_lock_irqsave+0x25/0x60 softirqs last enabled at (1449210): [<ffffffff83400447>] __do_softirq+0x447/0x595 softirqs last disabled at (1449215): [<ffffffff81b489b5>] run_ksoftirqd+0x35/0x50 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&ctrl->lock); <Interrupt> lock(&ctrl->lock); *** DEADLOCK *** no locks held by ksoftirqd/2/22. stack backtrace: CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 5.8.0-rc4-eid-vmlocalyes-dbg-00157-g7236657c #1450 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014 Call Trace: dump_stack+0xc8/0x11a print_usage_bug.cold.63+0x235/0x23e mark_lock+0xa9c/0xcf0 __lock_acquire+0xd9a/0x2b50 lock_acquire+0x164/0x500 _raw_spin_lock_irqsave+0x40/0x60 nvme_keep_alive_end_io+0x50/0xc0 blk_mq_end_request+0x158/0x210 nvme_complete_rq+0x146/0x500 nvme_loop_complete_rq+0x26/0x30 [nvme_loop] blk_done_softirq+0x187/0x1e0 __do_softirq+0x118/0x595 run_ksoftirqd+0x35/0x50 smpboot_thread_fn+0x1d3/0x310 kthread+0x1e7/0x210 ret_from_fork+0x22/0x30 Fixes: be93e87e ("nvme: support for multiple Command Sets Supported and Effects log pages") Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Tested-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Chaitanya Kulkarni 提交于
Instead of calling blk_put_request() which calls blk_mq_free_request(), call blk_mq_free_request() directly for NVMeOF passthru. This is to mainly avoid an extra function call in the completion path nvmet_passthru_req_done(). Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NLogan Gunthorpe <logang@deltatee.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Chaitanya Kulkarni 提交于
In the existing NVMeOF Passthru core command handling on failure of nvme_alloc_request() it errors out with rq value set to NULL. In the error handling path it calls blk_put_request() without checking if rq is set to NULL or not which produces following Oops:- [ 1457.346861] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 1457.347838] #PF: supervisor read access in kernel mode [ 1457.348464] #PF: error_code(0x0000) - not-present page [ 1457.349085] PGD 0 P4D 0 [ 1457.349402] Oops: 0000 [#1] SMP NOPTI [ 1457.349851] CPU: 18 PID: 10782 Comm: kworker/18:2 Tainted: G OE 5.8.0-rc4nvme-5.9+ #35 [ 1457.350951] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214 [ 1457.352347] Workqueue: events nvme_loop_execute_work [nvme_loop] [ 1457.353062] RIP: 0010:blk_mq_free_request+0xe/0x110 [ 1457.353651] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8 [ 1457.355975] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282 [ 1457.356636] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002 [ 1457.357526] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000 [ 1457.358416] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000 [ 1457.359317] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000 [ 1457.360424] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8 [ 1457.361322] FS: 0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000 [ 1457.362337] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1457.363058] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0 [ 1457.363973] Call Trace: [ 1457.364296] nvmet_passthru_execute_cmd+0x150/0x2c0 [nvmet] [ 1457.364990] process_one_work+0x24e/0x5a0 [ 1457.365493] ? __schedule+0x353/0x840 [ 1457.365957] worker_thread+0x3c/0x380 [ 1457.366426] ? process_one_work+0x5a0/0x5a0 [ 1457.366948] kthread+0x135/0x150 [ 1457.367362] ? kthread_create_on_node+0x60/0x60 [ 1457.367934] ret_from_fork+0x22/0x30 [ 1457.368388] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corer [ 1457.368414] ata_piix crc32c_intel virtio_pci libata virtio_ring serio_raw t10_pi virtio floppy dm_] [ 1457.380849] CR2: 0000000000000000 [ 1457.381288] ---[ end trace c6cab61bfd1f68fd ]--- [ 1457.381861] RIP: 0010:blk_mq_free_request+0xe/0x110 [ 1457.382469] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8 [ 1457.384749] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282 [ 1457.385393] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002 [ 1457.386264] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000 [ 1457.387142] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000 [ 1457.388029] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000 [ 1457.388914] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8 [ 1457.389798] FS: 0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000 [ 1457.390796] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1457.391508] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0 [ 1457.392525] Kernel panic - not syncing: Fatal exception [ 1457.394138] Kernel Offset: disabled [ 1457.394677] ---[ end Kernel panic - not syncing: Fatal exception ]--- We fix this Oops by adding a new goto label out_put_req and reordering the blk_put_request call to avoid calling blk_put_request() with rq value is set to NULL. Here we also update the rest of the code accordingly. Fixes: 06b7164dfdc0 ("nvmet: add passthru code to process commands") Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: NLogan Gunthorpe <logang@deltatee.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Chaitanya Kulkarni 提交于
In the current implementation before submitting the passthru cmd we may come across error e.g. getting ns from passthru controller, allocating a request from passthru controller, etc. For all the failure cases it only uses single goto label fail_out. In the target code, we follow the pattern to have a separate label for each error out the case when setting up multiple things before the actual action. This patch follows the same pattern and renames generic fail_out label to out_put_ns and updates the error out cases in the nvmet_passthru_execute_cmd() where it is needed. Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: NLogan Gunthorpe <logang@deltatee.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Martin Wilck 提交于
If we find an optimized path, we quit the loop immediately. Thus we can use just one variable for the next path, slighly simplifying the code. Signed-off-by: NMartin Wilck <mwilck@suse.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Martin Wilck 提交于
If there's only one usable, non-optimized path, nvme_round_robin_path() returns NULL, which is wrong. Fix it by falling back to "old", like in the single optimized path case. Also, if the active path isn't changed, there's no need to re-assign the pointer. Fixes: 3f6e3246 ("nvme-multipath: fix logic for non-optimized paths") Signed-off-by: NMartin Wilck <mwilck@suse.com> Signed-off-by: NMartin George <marting@netapp.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Tianjia Zhang 提交于
On an error exit path, a negative error code should be returned instead of a positive return value. Fixes: e399441d ("nvme-fabrics: Add host support for FC transport") Cc: James Smart <jsmart2021@gmail.com> Signed-off-by: NTianjia Zhang <tianjia.zhang@linux.alibaba.com> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Logan Gunthorpe 提交于
Any command with a non-SGL flag set (like fuse flags) should be rejected. Fixes: c1fef73f ("nvmet: add passthru code to process commands") Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Sagi Grimberg 提交于
We forgot to free new_model_number Fixes: 013b7ebe ("nvmet: make ctrl model configurable") Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 29 7月, 2020 28 次提交
-
-
由 Christoph Hellwig 提交于
Add a quirk for a device that does not support the Identify Namespace Identification Descriptor list despite claiming 1.3 compliance. Fixes: ea43d970 ("nvme: fix identify error status silent ignore") Reported-by: NIngo Brunberg <ingo_brunberg@web.de> Signed-off-by: NChristoph Hellwig <hch@lst.de> Tested-by: NIngo Brunberg <ingo_brunberg@web.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Chaitanya Kulkarni 提交于
We can call the nvme_change_ctrl_state() directly and have WARN_ON_ONCE(1) call instead of having to use an extra variable which matches the name of the function. Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Chaitanya Kulkarni 提交于
When creating a loop controller (ctrl) in nvme_loop_create_ctrl() -> nvme_init_ctrl() we set the ctrl state to NVME_CTRL_NEW. Prior to [1] NVME_CTRL_NEW state was allowed in nvmf_check_ready() for fabrics command type connect. Now, this fails in the following code path for fabrics connect command when creating admin queue :- nvme_loop_create_ctrl() nvme_loo_configure_admin_queue() nvmf_connect_admin_queue() __nvme_submit_sync_cmd() blk_execute_rq() nvme_loop_queue_rq() nvmf_check_ready() # echo "transport=loop,nqn=fs" > /dev/nvme-fabrics [ 6047.741327] nvmet: adding nsid 1 to subsystem fs [ 6048.756430] nvme nvme1: Connect command failed, error wo/DNR bit: 880 We need to set the ctrl state to NVME_CTRL_CONNECTING after :- nvme_loop_create_ctrl() nvme_init_ctrl() so that the above mentioned check for nvmf_check_ready() will return true. This patch sets the ctrl state to connecting after we init the ctrl in nvme_loop_create_ctrl() nvme_init_ctrl() . [1] commit aa63fa6776a7 ("nvme-fabrics: allow to queue requests for live queues") Fixes: aa63fa6776a7 ("nvme-fabrics: allow to queue requests for live queues") Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Tested-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Hannes Reinecke 提交于
When nvme_round_robin_path() finds a valid namespace we should be using it; falling back to __nvme_find_path() for non-optimized paths will cause the result from nvme_round_robin_path() to be ignored for non-optimized paths. Fixes: 75c10e73 ("nvme-multipath: round-robin I/O policy") Signed-off-by: NMartin Wilck <mwilck@suse.com> Signed-off-by: NHannes Reinecke <hare@suse.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Martin Wilck 提交于
Handle the special case where we have exactly one optimized path, which we should keep using in this case. Fixes: 75c10e73 ("nvme-multipath: round-robin I/O policy") Signed off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: NHannes Reinecke <hare@suse.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Sagi Grimberg 提交于
commit fe35ec58 ("block: update hctx map when use multiple maps") exposed an issue where we may hang trying to wait for queue freeze during I/O. We call blk_mq_update_nr_hw_queues which in case of multiple queue maps (which we have now for default/read/poll) is attempting to freeze the queue. However we never started queue freeze when starting the reset, which means that we have inflight pending requests that entered the queue that we will not complete once the queue is quiesced. So start a freeze before we quiesce the queue, and unfreeze the queue after we successfully connected the I/O queues (and make sure to call blk_mq_update_nr_hw_queues only after we are sure that the queue was already frozen). This follows to how the pci driver handles resets. Fixes: fe35ec58 ("block: update hctx map when use multiple maps") Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Sagi Grimberg 提交于
commit fe35ec58 ("block: update hctx map when use multiple maps") exposed an issue where we may hang trying to wait for queue freeze during I/O. We call blk_mq_update_nr_hw_queues which in case of multiple queue maps (which we have now for default/read/poll) is attempting to freeze the queue. However we never started queue freeze when starting the reset, which means that we have inflight pending requests that entered the queue that we will not complete once the queue is quiesced. So start a freeze before we quiesce the queue, and unfreeze the queue after we successfully connected the I/O queues (and make sure to call blk_mq_update_nr_hw_queues only after we are sure that the queue was already frozen). This follows to how the pci driver handles resets. Fixes: fe35ec58 ("block: update hctx map when use multiple maps") Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Chaitanya Kulkarni 提交于
This patch updates KConfig file for the NVMeOF target where we add new option so that user can selectively enable/disable passthru code. Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> [logang@deltatee.com: fixed some of the wording in the help message] Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Logan Gunthorpe 提交于
When CONFIG_NVME_TARGET_PASSTHRU as 'passthru' directory will be added to each subsystem. The directory is similar to a namespace and has two attributes: device_path and enable. The user must set the path to the nvme controller's char device and write '1' to enable the subsystem to use passthru. Any given subsystem is prevented from enabling both a regular namespace and the passthru device. If one is enabled, enabling the other will produce an error. Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Logan Gunthorpe 提交于
This patch adds helper functions which are used in the NVMeOF configfs when the user is configuring the passthru subsystem. Here we ensure that only one subsys is assigned to each nvme_ctrl by using an xarray on the cntlid. The subsystem's version number is overridden by the passed through controller's version. However, if that version is less than 1.2.1, then we bump the advertised version to that and print a warning in dmesg. Based-on-a-patch-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Logan Gunthorpe 提交于
Add passthru command handling capability for the NVMeOF target and export passthru APIs which are used to integrate passthru code with nvmet-core. The new file passthru.c handles passthru cmd parsing and execution. In the passthru mode, we create a block layer request from the nvmet request and map the data on to the block layer request. Admin commands and features are on an allow list as there are a number of each that don't make too much sense with passthrough. We use an allow list such that new commands can be considered before being blindly passed through. In both cases, vendor specific commands are always allowed. We also reject reservation IO commands as the underlying device cannot differentiate between multiple hosts behind a fabric. Based-on-a-patch-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Logan Gunthorpe 提交于
nvme_find_get_ns() and nvme_put_ns() are required by the target passthru code and are exported under the NVME_TARGET_PASSTHRU namespace. Based-on-a-patch-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Logan Gunthorpe 提交于
nvme_ctrl_get_by_path() is analogous to blkdev_get_by_path() except it gets a struct nvme_ctrl from the path to its char dev (/dev/nvme0). It makes use of filp_open() to open the file and uses the private data to obtain a pointer to the struct nvme_ctrl. If the fops of the file do not match, -EINVAL is returned. The purpose of this function is to support NVMe-OF target passthru and is exported under the NVME_TARGET_PASSTHRU namespace. Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Logan Gunthorpe 提交于
Introduce a new nvme_execute_passthru_rq() helper which calls nvme_passthru_[start|end]() around blk_execute_rq(). This ensures all passthru calls (including nvme_submit_io()) will be wrapped appropriately. nvme_execute_passthru_rq() will also be useful for the nvmet passthru code and is exported in the NVME_TARGET_PASSTHRU namespace. Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Logan Gunthorpe 提交于
Separate the code to obtain command effects from the code to start a passthru request and move the nvme_passthru_start() and nvme_passthru_end() functions up above nvme_submit_user_cmd() in order that they may be used in a new helper a subsequent patch. The new helper function will be necessary for nvmet passthru code to determine if we need to change out of interrupt context to handle the effects. It is exported in the NVME_TARGET_PASSTHRU namespace. Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Logan Gunthorpe 提交于
The host driver should decide whether to use SGLs or PRPs and they currently assume the flags are cleared after the call to nvme_setup_cmd(). However, passed-through commands may erroneously set these bits; so clear them for all cases. Signed-off-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 James Smart 提交于
The transport has a del_work_active flag to avoid duplicate scheduling of the del_work item. This is redundant with the checks that schedule_work() makes. Remove the del_work_active flag. Signed-off-by: NJames Smart <jsmart2021@gmail.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 James Smart 提交于
When searching for an association based on an association id, when there is a match, the code takes a reference. However, it is not validating that the reference taking was successful. Check the status of the reference. If unsuccessful, the device is being deleted and should be ignored. Signed-off-by: NJames Smart <jsmart2021@gmail.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 James Smart 提交于
Currently the FC transport is set max_hw_sectors based on the lldds max sgl segment count. However, the block queue max segments is set based on the controller's max_segments count, which the transport does not set. As such, the lldd is receiving sgl lists that are exceeding its max segment count. Set the controller max segment count and derive max_hw_sectors from the max segment count. Signed-off-by: NJames Smart <jsmart2021@gmail.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NHimanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: NEwan D. Milne <emilne@redhat.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Sagi Grimberg 提交于
Stay consistent with the rest of the driver Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Sagi Grimberg 提交于
A deadlock happens in the following scenario with multipath: 1) scan_work(nvme0) detects a new nsid while nvme0 is an optimized path to it, path nvme1 happens to be inaccessible. 2) Before scan_work is complete nvme0 disconnect is initiated nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING 3) scan_work(1) attempts to submit IO, but nvme_path_is_optimized() observes nvme0 is not LIVE. Since nvme1 is a possible path IO is requeued and scan_work hangs. -- Workqueue: nvme-wq nvme_scan_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: io_schedule+0x16/0x40 kernel: do_read_cache_page+0x438/0x830 kernel: read_cache_page+0x12/0x20 kernel: read_dev_sector+0x27/0xc0 kernel: read_lba+0xc1/0x220 kernel: efi_partition+0x1e6/0x708 kernel: check_partition+0x154/0x244 kernel: rescan_partitions+0xae/0x280 kernel: __blkdev_get+0x40f/0x560 kernel: blkdev_get+0x3d/0x140 kernel: __device_add_disk+0x388/0x480 kernel: device_add_disk+0x13/0x20 kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core] kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core] kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core] kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core] kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core] kernel: nvme_validate_ns+0x396/0x940 [nvme_core] kernel: nvme_scan_work+0x24f/0x380 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x249/0x400 kernel: kthread+0x104/0x140 -- 4) Delete also hangs in flush_work(ctrl->scan_work) from nvme_remove_namespaces(). Similiarly a deadlock with ana_work may happen: if ana_work has started and calls nvme_mpath_set_live and device_add_disk, it will trigger I/O. When we trigger disconnect I/O will block because our accessible (optimized) path is disconnecting, but the alternate path is inaccessible, so I/O blocks. Then disconnect tries to flush the ana_work and hangs. [ 605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core] [ 605.552087] Call Trace: [ 605.552683] __schedule+0x2b9/0x6c0 [ 605.553507] schedule+0x42/0xb0 [ 605.554201] io_schedule+0x16/0x40 [ 605.555012] do_read_cache_page+0x438/0x830 [ 605.556925] read_cache_page+0x12/0x20 [ 605.557757] read_dev_sector+0x27/0xc0 [ 605.558587] amiga_partition+0x4d/0x4c5 [ 605.561278] check_partition+0x154/0x244 [ 605.562138] rescan_partitions+0xae/0x280 [ 605.563076] __blkdev_get+0x40f/0x560 [ 605.563830] blkdev_get+0x3d/0x140 [ 605.564500] __device_add_disk+0x388/0x480 [ 605.565316] device_add_disk+0x13/0x20 [ 605.566070] nvme_mpath_set_live+0x5e/0x130 [nvme_core] [ 605.567114] nvme_update_ns_ana_state+0x2c/0x30 [nvme_core] [ 605.568197] nvme_update_ana_state+0xca/0xe0 [nvme_core] [ 605.569360] nvme_parse_ana_log+0xa1/0x180 [nvme_core] [ 605.571385] nvme_read_ana_log+0x76/0x100 [nvme_core] [ 605.572376] nvme_ana_work+0x15/0x20 [nvme_core] [ 605.573330] process_one_work+0x1db/0x380 [ 605.574144] worker_thread+0x4d/0x400 [ 605.574896] kthread+0x104/0x140 [ 605.577205] ret_from_fork+0x35/0x40 [ 605.577955] INFO: task nvme:14044 blocked for more than 120 seconds. [ 605.579239] Tainted: G OE 5.3.5-050305-generic #201910071830 [ 605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 605.582320] nvme D 0 14044 14043 0x00000000 [ 605.583424] Call Trace: [ 605.583935] __schedule+0x2b9/0x6c0 [ 605.584625] schedule+0x42/0xb0 [ 605.585290] schedule_timeout+0x203/0x2f0 [ 605.588493] wait_for_completion+0xb1/0x120 [ 605.590066] __flush_work+0x123/0x1d0 [ 605.591758] __cancel_work_timer+0x10e/0x190 [ 605.593542] cancel_work_sync+0x10/0x20 [ 605.594347] nvme_mpath_stop+0x2f/0x40 [nvme_core] [ 605.595328] nvme_stop_ctrl+0x12/0x50 [nvme_core] [ 605.596262] nvme_do_delete_ctrl+0x3f/0x90 [nvme_core] [ 605.597333] nvme_sysfs_delete+0x5c/0x70 [nvme_core] [ 605.598320] dev_attr_store+0x17/0x30 Fix this by introducing a new state: NVME_CTRL_DELETE_NOIO, which will indicate the phase of controller deletion where I/O cannot be allowed to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to be issued to the bottom device, and only after we flush the ana_work and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces) we change the state to NVME_CTRL_DELETING_NOIO. Also we prevent ana_work from re-firing by aborting early if we are not LIVE, so we should be safe here. In addition, change the transport drivers to follow the updated state machine. Fixes: 0d0b660f ("nvme: add ANA support") Reported-by: NAnton Eidelman <anton@lightbitslabs.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Sagi Grimberg 提交于
We are starting to see some non-trivial states so lets start documenting them. Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Chaitanya Kulkarni 提交于
This patch replaces the ctrl->namespaces tracking from linked list to xarray and improves the performance when accessing one namespce :- XArray vs Default:- IOPS and BW (more the better) increase BW (~1.8%):- --------------------------------------------------- XArray :- read: IOPS=160k, BW=626MiB/s (656MB/s)(18.3GiB/30001msec) read: IOPS=160k, BW=626MiB/s (656MB/s)(18.3GiB/30001msec) read: IOPS=162k, BW=631MiB/s (662MB/s)(18.5GiB/30001msec) Default:- read: IOPS=156k, BW=609MiB/s (639MB/s)(17.8GiB/30001msec) read: IOPS=157k, BW=613MiB/s (643MB/s)(17.0GiB/30001msec) read: IOPS=160k, BW=626MiB/s (656MB/s)(18.3GiB/30001msec) Submission latency (less the better) decrease (~8.3%):- ------------------------------------------------------- XArray:- slat (usec): min=7, max=8386, avg=11.19, stdev=5.96 slat (usec): min=7, max=441, avg=11.09, stdev=4.48 slat (usec): min=7, max=1088, avg=11.21, stdev=4.54 Default :- slat (usec): min=8, max=2826.5k, avg=23.96, stdev=3911.50 slat (usec): min=8, max=503, avg=12.52, stdev=5.07 slat (usec): min=8, max=2384, avg=12.50, stdev=5.28 CPU Usage (less the better) decrease (~5.2%):- ---------------------------------------------- XArray:- cpu : usr=1.84%, sys=18.61%, ctx=949471, majf=0, minf=250 cpu : usr=1.83%, sys=18.41%, ctx=950262, majf=0, minf=237 cpu : usr=1.82%, sys=18.82%, ctx=957224, majf=0, minf=234 Default:- cpu : usr=1.70%, sys=19.21%, ctx=858196, majf=0, minf=251 cpu : usr=1.82%, sys=19.98%, ctx=929720, majf=0, minf=227 cpu : usr=1.83%, sys=20.33%, ctx=947208, majf=0, minf=235. Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Yamin Friedman 提交于
Has the driver use shared CQs providing ~10%-20% improvement when multiple disks are used. Instead of opening a CQ for each QP per controller, a CQ for each core will be provided by the RDMA core driver that will be shared between the QPs on that core reducing interrupt overhead. Signed-off-by: NYamin Friedman <yaminf@mellanox.com> Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NOr Gerlitz <ogerlitz@mellanox.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Yamin Friedman 提交于
Has the driver use shared CQs providing ~10%-20% improvement as seen in the patch introducing shared CQs. Instead of opening a CQ for each QP per controller connected, a CQ for each QP will be provided by the RDMA core driver that will be shared between the QPs on that core reducing interrupt overhead. Signed-off-by: NYamin Friedman <yaminf@mellanox.com> Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NOr Gerlitz <ogerlitz@mellanox.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 David E. Box 提交于
This patch implements a solution for a BIOS hack used on some currently shipping Intel systems to change driver power management policy for PCIe NVMe drives. Some newer Intel platforms, like some Comet Lake systems, require that PCIe devices use D3 when doing suspend-to-idle in order to allow the platform to realize maximum power savings. This is particularly needed to support ATX power supply shutdown on desktop systems. In order to ensure this happens for root ports with storage devices, Microsoft apparently created this ACPI _DSD property as a way to influence their driver policy. To my knowledge this property has not been discussed with the NVME specification body. Though the solution is not ideal, it addresses a problem that also affects Linux since the NVMe driver's default policy of using NVMe APST during suspend-to-idle prevents the PCI root port from going to D3 and leads to higher power consumption for these platforms. The power consumption difference may be negligible on laptop systems, but many watts on desktop systems when the ATX power supply is blocked from powering down. The patch creates a new nvme_acpi_storage_d3 function to check for the StorageD3Enable property during probe and enables D3 as a quirk if set. It also provides a 'noacpi' module parameter to allow skipping the quirk if needed. Tested with: - PM961 NVMe SED Samsung 512GB - INTEL SSDPEKKF512G8 Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-introSigned-off-by: NDavid E. Box <david.e.box@linux.intel.com> Reviewed-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Chaitanya Kulkarni 提交于
>From the initial implementation of NVMe SGL kernel support commit a7a7cbe3 ("nvme-pci: add SGL support") with addition of the commit 943e942e ("nvme-pci: limit max IO size and segments to avoid high order allocations") now there is only caller left for nvme_pci_iod_alloc_size() which statically passes true for last parameter that calculates allocation size based on SGL since we need size of biggest command supported for mempool allocation. This patch modifies the helper functions nvme_pci_iod_alloc_size() such that it is now uses maximum of PRP and SGL size for iod allocation size calculation. Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Chaitanya Kulkarni 提交于
Saving the nvme controller's page size was from a time when the driver tried to use different sized pages, but this value is always set to a constant, and has been this way for some time. Remove the 'page_size' field and replace its usage with the constant value. This also lets the compiler make some micro-optimizations in the io path, and that's always a good thing. Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-