- 22 10月, 2020 2 次提交
-
-
由 Chao Leng 提交于
A crash happened due to injecting error test. When a CQE has incorrect command id due do an error injection, the host may find a request which is already freed. Dereferencing req->mr->rkey causes a crash in nvme_rdma_process_nvme_rsp because the mr is already freed. Add a check for the mr to fix it. Signed-off-by: NChao Leng <lengchao@huawei.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Chao Leng 提交于
A crash can happened when a connect is rejected. The host establishes the connection after received ConnectReply, and then continues to send the fabrics Connect command. If the controller does not receive the ReadyToUse capsule, host may receive a ConnectReject reply. Call nvme_rdma_destroy_queue_ib after the host received the RDMA_CM_EVENT_REJECTED event. Then when the fabrics Connect command times out, nvme_rdma_timeout calls nvme_rdma_complete_rq to fail the request. A crash happenes due to use after free in nvme_rdma_complete_rq. nvme_rdma_destroy_queue_ib is redundant when handling the RDMA_CM_EVENT_REJECTED event as nvme_rdma_destroy_queue_ib is already called in connection failure handler. Signed-off-by: NChao Leng <lengchao@huawei.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 09 9月, 2020 1 次提交
-
-
由 David Milburn 提交于
Cancel async event work in case async event has been queued up, and nvme_rdma_submit_async_event() runs after event has been freed. Signed-off-by: NDavid Milburn <dmilburn@redhat.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 29 8月, 2020 3 次提交
-
-
由 Sagi Grimberg 提交于
If the controller becomes unresponsive in the middle of a reset, we will hang because we are waiting for the freeze to complete, but that cannot happen since we have commands that are inflight holding the q_usage_counter, and we can't blindly fail requests that times out. So give a timeout and if we cannot wait for queue freeze before unfreezing, fail and have the error handling take care how to proceed (either schedule a reconnect of remove the controller). Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
When a request times out in a LIVE state, we simply trigger error recovery and let the error recovery handle the request cancellation, however when a request times out in a non LIVE state, we make sure to complete it immediately as it might block controller setup or teardown and prevent forward progress. However tearing down the entire set of I/O and admin queues causes freeze/unfreeze imbalance (q->mq_freeze_depth) because and is really an overkill to what we actually need, which is to just fence controller teardown that may be running, stop the queue, and cancel the request if it is not already completed. Now that we have the controller teardown_lock, we can safely serialize request cancellation. This addresses a hang caused by calling extra queue freeze on controller namespaces, causing unfreeze to not complete correctly. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJames Smart <james.smart@broadcom.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
In the timeout handler we may need to complete a request because the request that timed out may be an I/O that is a part of a serial sequence of controller teardown or initialization. In order to complete the request, we need to fence any other context that may compete with us and complete the request that is timing out. In this case, we could have a potential double completion in case a hard-irq or a different competing context triggered error recovery and is running inflight request cancellation concurrently with the timeout handler. Protect using a ctrl teardown_lock to serialize contexts that may complete a cancelled request due to error recovery or a reset. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJames Smart <james.smart@broadcom.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
- 24 8月, 2020 1 次提交
-
-
由 Gustavo A. R. Silva 提交于
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-throughSigned-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>
-
- 22 8月, 2020 1 次提交
-
-
由 Christoph Hellwig 提交于
nvme_end_request is a bit misnamed, as it wraps around the blk_mq_complete_* API. It's semantics also are non-trivial, so give it a more descriptive name and add a comment explaining the semantics. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NMike Snitzer <snitzer@redhat.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 29 7月, 2020 3 次提交
-
-
由 Sagi Grimberg 提交于
commit fe35ec58 ("block: update hctx map when use multiple maps") exposed an issue where we may hang trying to wait for queue freeze during I/O. We call blk_mq_update_nr_hw_queues which in case of multiple queue maps (which we have now for default/read/poll) is attempting to freeze the queue. However we never started queue freeze when starting the reset, which means that we have inflight pending requests that entered the queue that we will not complete once the queue is quiesced. So start a freeze before we quiesce the queue, and unfreeze the queue after we successfully connected the I/O queues (and make sure to call blk_mq_update_nr_hw_queues only after we are sure that the queue was already frozen). This follows to how the pci driver handles resets. Fixes: fe35ec58 ("block: update hctx map when use multiple maps") Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Sagi Grimberg 提交于
A deadlock happens in the following scenario with multipath: 1) scan_work(nvme0) detects a new nsid while nvme0 is an optimized path to it, path nvme1 happens to be inaccessible. 2) Before scan_work is complete nvme0 disconnect is initiated nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING 3) scan_work(1) attempts to submit IO, but nvme_path_is_optimized() observes nvme0 is not LIVE. Since nvme1 is a possible path IO is requeued and scan_work hangs. -- Workqueue: nvme-wq nvme_scan_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: io_schedule+0x16/0x40 kernel: do_read_cache_page+0x438/0x830 kernel: read_cache_page+0x12/0x20 kernel: read_dev_sector+0x27/0xc0 kernel: read_lba+0xc1/0x220 kernel: efi_partition+0x1e6/0x708 kernel: check_partition+0x154/0x244 kernel: rescan_partitions+0xae/0x280 kernel: __blkdev_get+0x40f/0x560 kernel: blkdev_get+0x3d/0x140 kernel: __device_add_disk+0x388/0x480 kernel: device_add_disk+0x13/0x20 kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core] kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core] kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core] kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core] kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core] kernel: nvme_validate_ns+0x396/0x940 [nvme_core] kernel: nvme_scan_work+0x24f/0x380 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x249/0x400 kernel: kthread+0x104/0x140 -- 4) Delete also hangs in flush_work(ctrl->scan_work) from nvme_remove_namespaces(). Similiarly a deadlock with ana_work may happen: if ana_work has started and calls nvme_mpath_set_live and device_add_disk, it will trigger I/O. When we trigger disconnect I/O will block because our accessible (optimized) path is disconnecting, but the alternate path is inaccessible, so I/O blocks. Then disconnect tries to flush the ana_work and hangs. [ 605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core] [ 605.552087] Call Trace: [ 605.552683] __schedule+0x2b9/0x6c0 [ 605.553507] schedule+0x42/0xb0 [ 605.554201] io_schedule+0x16/0x40 [ 605.555012] do_read_cache_page+0x438/0x830 [ 605.556925] read_cache_page+0x12/0x20 [ 605.557757] read_dev_sector+0x27/0xc0 [ 605.558587] amiga_partition+0x4d/0x4c5 [ 605.561278] check_partition+0x154/0x244 [ 605.562138] rescan_partitions+0xae/0x280 [ 605.563076] __blkdev_get+0x40f/0x560 [ 605.563830] blkdev_get+0x3d/0x140 [ 605.564500] __device_add_disk+0x388/0x480 [ 605.565316] device_add_disk+0x13/0x20 [ 605.566070] nvme_mpath_set_live+0x5e/0x130 [nvme_core] [ 605.567114] nvme_update_ns_ana_state+0x2c/0x30 [nvme_core] [ 605.568197] nvme_update_ana_state+0xca/0xe0 [nvme_core] [ 605.569360] nvme_parse_ana_log+0xa1/0x180 [nvme_core] [ 605.571385] nvme_read_ana_log+0x76/0x100 [nvme_core] [ 605.572376] nvme_ana_work+0x15/0x20 [nvme_core] [ 605.573330] process_one_work+0x1db/0x380 [ 605.574144] worker_thread+0x4d/0x400 [ 605.574896] kthread+0x104/0x140 [ 605.577205] ret_from_fork+0x35/0x40 [ 605.577955] INFO: task nvme:14044 blocked for more than 120 seconds. [ 605.579239] Tainted: G OE 5.3.5-050305-generic #201910071830 [ 605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 605.582320] nvme D 0 14044 14043 0x00000000 [ 605.583424] Call Trace: [ 605.583935] __schedule+0x2b9/0x6c0 [ 605.584625] schedule+0x42/0xb0 [ 605.585290] schedule_timeout+0x203/0x2f0 [ 605.588493] wait_for_completion+0xb1/0x120 [ 605.590066] __flush_work+0x123/0x1d0 [ 605.591758] __cancel_work_timer+0x10e/0x190 [ 605.593542] cancel_work_sync+0x10/0x20 [ 605.594347] nvme_mpath_stop+0x2f/0x40 [nvme_core] [ 605.595328] nvme_stop_ctrl+0x12/0x50 [nvme_core] [ 605.596262] nvme_do_delete_ctrl+0x3f/0x90 [nvme_core] [ 605.597333] nvme_sysfs_delete+0x5c/0x70 [nvme_core] [ 605.598320] dev_attr_store+0x17/0x30 Fix this by introducing a new state: NVME_CTRL_DELETE_NOIO, which will indicate the phase of controller deletion where I/O cannot be allowed to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to be issued to the bottom device, and only after we flush the ana_work and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces) we change the state to NVME_CTRL_DELETING_NOIO. Also we prevent ana_work from re-firing by aborting early if we are not LIVE, so we should be safe here. In addition, change the transport drivers to follow the updated state machine. Fixes: 0d0b660f ("nvme: add ANA support") Reported-by: NAnton Eidelman <anton@lightbitslabs.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Yamin Friedman 提交于
Has the driver use shared CQs providing ~10%-20% improvement as seen in the patch introducing shared CQs. Instead of opening a CQ for each QP per controller connected, a CQ for each QP will be provided by the RDMA core driver that will be shared between the QPs on that core reducing interrupt overhead. Signed-off-by: NYamin Friedman <yaminf@mellanox.com> Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NOr Gerlitz <ogerlitz@mellanox.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 25 6月, 2020 1 次提交
-
-
由 Max Gurtovoy 提交于
The completion vector index that is given during CQ creation can't exceed the number of support vectors by the underlying RDMA device. This violation currently can accure, for example, in case one will try to connect with N regular read/write queues and M poll queues and the sum of N + M > num_supported_vectors. This will lead to failure in establish a connection to remote target. Instead, in that case, share a completion vector between queues. Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 24 6月, 2020 3 次提交
-
-
由 Christoph Hellwig 提交于
Revert and incorret transformation that caused requests using remote invalidation to never complete. Fixes: 421147be863b ("nvme-rdma: factor out a nvme_rdma_end_request helper") Reported-by: NBart Van Assche <bvanassche@acm.org> Signed-off-by: NChristoph Hellwig <hch@lst.de> Tested-by: NBart Van Assche <bvanassche@acm.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Use the new blk_mq_complete_request_remote helper to avoid an indirect function call in the completion fast path. Reviewed-by: NDaniel Wagner <dwagner@suse.de> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Factor a small sniplet of duplicated code into a new helper in preparation for making this sniplet a little bit less trivial. Reviewed-by: NDaniel Wagner <dwagner@suse.de> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 27 5月, 2020 2 次提交
-
-
由 Max Gurtovoy 提交于
For capable HCAs (e.g. ConnectX-5/ConnectX-6) this will allow end-to-end protection information passthrough and validation for NVMe over RDMA transport. Metadata offload support was implemented over the new RDMA signature verbs API and it is enabled for capable controllers. Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Israel Rukshin 提交于
Remove first_sgl pointer from struct nvme_rdma_request and use pointer arithmetic instead. The inline scatterlist, if exists, will be located right after the nvme_rdma_request. This patch is needed as a preparation for adding PI support. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 31 3月, 2020 1 次提交
-
-
由 Israel Rukshin 提交于
Use a semicolon at the end of an assignment expression. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 26 3月, 2020 3 次提交
-
-
由 Israel Rukshin 提交于
The transition to LIVE state should not fail in case of a new controller. Moving to DELETING state before nvme_tcp_create_ctrl() allocates all the resources may leads to NULL dereference at teardown flow (e.g., IO tagset, admin_q, connect_q). Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
由 Israel Rukshin 提交于
Put the ctrl reference count at nvme_uninit_ctrl as opposed to nvme_init_ctrl which takes it. This decrease the reference count at the core layer instead of decreasing it on each transport separately. Also move the call of nvme_uninit_ctrl at PCI driver after calling to nvme_release_prp_pools and nvme_dev_unmap, in order to put the reference count after using the dev. This is safe because those functions use nvme_dev which is freed only later at nvme_pci_free_ctrl. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
由 Israel Rukshin 提交于
In case nvme_sysfs_delete() is called by the user before taking the ctrl reference count, the ctrl may be freed during the creation and cause the bug. Take the reference as soon as the controller is externally visible, which is done by cdev_device_add() in nvme_init_ctrl(). Also take the reference count at the core layer instead of taking it on each transport separately. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
- 17 3月, 2020 1 次提交
-
-
由 Bart Van Assche 提交于
Move the get_unaligned_be24(), get_unaligned_le24() and put_unaligned_le24() definitions from various drivers into include/linux/unaligned/generic.h. Add a put_unaligned_be24() implementation. Link: https://lore.kernel.org/r/20200313203102.16613-4-bvanassche@acm.org Cc: Keith Busch <kbusch@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Jens Axboe <axboe@fb.com> Cc: Harvey Harrison <harvey.harrison@gmail.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> # For drivers/usb Reviewed-by: Felipe Balbi <balbi@kernel.org> # For drivers/usb/gadget Signed-off-by: NBart Van Assche <bvanassche@acm.org> Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
-
- 11 3月, 2020 1 次提交
-
-
由 Prabhath Sajeepa 提交于
The timeout of identify cmd, which is invoked as part of admin queue creation, can result in freeing of async event data both in nvme_rdma_timeout handler and error handling path of nvme_rdma_configure_admin queue thus causing NULL pointer reference. Call Trace: ? nvme_rdma_setup_ctrl+0x223/0x800 [nvme_rdma] nvme_rdma_create_ctrl+0x2ba/0x3f7 [nvme_rdma] nvmf_dev_write+0xa54/0xcc6 [nvme_fabrics] __vfs_write+0x1b/0x40 vfs_write+0xb2/0x1b0 ksys_write+0x61/0xd0 __x64_sys_write+0x1a/0x20 do_syscall_64+0x60/0x1e0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reviewed-by: NRoland Dreier <roland@purestorage.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NPrabhath Sajeepa <psajeepa@purestorage.com> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
- 15 2月, 2020 1 次提交
-
-
由 Nigel Kirkland 提交于
Delayed keep alive work is queued on system workqueue and may be cancelled via nvme_stop_keep_alive from nvme_reset_wq, nvme_fc_wq or nvme_wq. Check_flush_dependency detects mismatched attributes between the work-queue context used to cancel the keep alive work and system-wq. Specifically system-wq does not have the WQ_MEM_RECLAIM flag, whereas the contexts used to cancel keep alive work have WQ_MEM_RECLAIM flag. Example warning: workqueue: WQ_MEM_RECLAIM nvme-reset-wq:nvme_fc_reset_ctrl_work [nvme_fc] is flushing !WQ_MEM_RECLAIM events:nvme_keep_alive_work [nvme_core] To avoid the flags mismatch, delayed keep alive work is queued on nvme_wq. However this creates a secondary concern where work and a request to cancel that work may be in the same work queue - namely err_work in the rdma and tcp transports, which will want to flush/cancel the keep alive work which will now be on nvme_wq. After reviewing the transports, it looks like err_work can be moved to nvme_reset_wq. In fact that aligns them better with transition into RESETTING and performing related reset work in nvme_reset_wq. Change nvme-rdma and nvme-tcp to perform err_work in nvme_reset_wq. Signed-off-by: NNigel Kirkland <nigel.kirkland@broadcom.com> Signed-off-by: NJames Smart <jsmart2021@gmail.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 27 11月, 2019 1 次提交
-
-
由 Israel Rukshin 提交于
nvme_rdma_alloc_tagset() preallocates a big buffer for the IO SGL based on SG_CHUNK_SIZE. Modern DMA engines are often capable of dealing with very big segments so the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB SGL allocation per command. If a controller has lots of deep queues, preallocation for the sg list can consume substantial amounts of memory. For nvme-rdma, nr_hw_queues can be 128 and each queue's depth 128. This means the resulting preallocation for the data SGL is 128*128*4K = 64MB per controller. Switch to runtime allocation for SGL for lists longer than 2 entries. This is the approach used by NVMe PCI so it should be reasonable for NVMeOF as well. Runtime SGL allocation has always been the case for the legacy I/O path so this is nothing new. The preallocated small SGL depends on SG_CHAIN so if the ARCH doesn't support SG_CHAIN, use only runtime allocation for the SGL. We didn't notice of a performance degradation, since for small IOs we'll use the inline SG and for the bigger IOs the allocation of a bigger SGL from slab is fast enough. Suggested-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
- 05 11月, 2019 3 次提交
-
-
由 Max Gurtovoy 提交于
In case there are controllers that are not associated with any RDMA device (e.g. during unsuccessful reconnection) and the user will unload the module, these controllers will not be freed and will access already freed memory. The same logic appears in other fabric drivers as well. Fixes: 87fd1253 ("nvme-rdma: remove redundant reference between ib_device and tagset") Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
由 Max Gurtovoy 提交于
nvme_cleanup_cmd should be called for each call to nvme_setup_cmd (symmetrical functions). Move the call for nvme_cleanup_cmd to the common core layer and call it during nvme_complete_rq for the good flow. For error flow, each transport will call nvme_cleanup_cmd independently. Also take care of a special case of path failure, where we call nvme_complete_rq without doing nvme_setup_cmd. Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Israel Rukshin 提交于
This function improves code readability and reduces code duplication. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 14 10月, 2019 1 次提交
-
-
由 Keith Busch 提交于
A controller in the resetting state has not yet completed its recovery actions. The pci and fc transports were already handling this, so update the remaining transports to not attempt additional recovery in this state. Instead, just restart the request timer. Tested-by: NEdmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: NJames Smart <james.smart@broadcom.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
- 28 9月, 2019 1 次提交
-
-
由 Sagi Grimberg 提交于
If the connect times out, we may have already destroyed the queue in the timeout handler, so test if the queue is still allocated in the connect error handler. Reported-by: NYi Zhang <yi.zhang@redhat.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
- 26 9月, 2019 1 次提交
-
-
由 Max Gurtovoy 提交于
By default, the NVMe/RDMA driver should support max io_size of 1MiB (or upto the maximum supported size by the HCA). Currently, one will see that /sys/class/block/<bdev>/queue/max_hw_sectors_kb is 1020 instead of 1024. A non power of 2 value can cause performance degradation due to unnecessary splitting of IO requests and unoptimized allocation units. The number of pages per MR has been fixed here, so there is no longer any need to reduce max_sectors by 1. Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
- 30 8月, 2019 5 次提交
-
-
由 Israel Rukshin 提交于
Remove code duplication. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
We have a fundamental issue that fabric commands use the admin_q. The reason is, that admin-connect, register reads and writes and admin commands cannot be guaranteed ordering while we are running controller resets. For example, when we reset a controller we perform: 1. disable the controller 2. teardown the admin queue 3. re-establish the admin queue 4. enable the controller In order to perform (3), we need to unquiesce the admin queue, however we may have some admin commands that are already pending on the quiesced admin_q and will immediate execute when we unquiesce it before we execute (4). The host must not send admin commands to the controller before enabling the controller. To fix this, we have the fabric commands (admin connect and property get/set, but not I/O queue connect) use a separate fabrics_q and make sure to quiesce the admin_q before we disable the controller, and unquiesce it only after we enable the controller. This fixes the error prints from nvmet in a controller reset storm test: kernel: nvmet: got cmd 6 while CC.EN == 0 on qid = 0 Which indicate that the host is sending an admin command when the controller is not enabled. Reviewed-by: NJames Smart <james.smart@broadcom.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Israel Rukshin 提交于
For RDMA transports, TOS is an extension of IB QoS to provide clients the ability to segregate traffic flows for different type of data. RDMA CM abstract it for ULPs using rdma_set_service_type(). Internally, each traffic flow is represented by a connection with all of its independent resources like that of a normal connection, and is differentiated by service type. In other words, there can be multiple qp connections between an IP pair and each supports a unique service type. One of the TOS usage is bandwidth management which allows setting bandwidth limits for QoS classes, e.g. 80% bandwidth to controllers at QoS class A and 20% to controllers at QoS class B. Note: In addition to the TOS configuration, QOS must be configured on the relevant HCA on the target (send RDMA commands) and initiator to effect the traffic. usage examples: nvme connect --tos=0 --transport=rdma --traddr=10.0.1.1 --nqn=test-nvme Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
All seem to call it with ctrl->cap so no need to pass it at all. Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
nvme_enable_ctrl reads the cap register right after, so no need to do that locally in the transport driver. Have sqsize setting in nvme_init_identify. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
- 05 8月, 2019 1 次提交
-
-
由 Ming Lei 提交于
When aborting in-flight request for recovering controller, we have to make sure that queue's complete function is called on completed request before moving on. Otherwise, for example, the warning of WARN_ON_ONCE(qp->mrs_used > 0) in ib_destroy_qp_user() may be triggered on nvme-rdma. Fix this issue by using blk_mq_tagset_wait_completed_request. Cc: Max Gurtovoy <maxg@mellanox.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Keith Busch <keith.busch@intel.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 01 8月, 2019 1 次提交
-
-
由 Sagi Grimberg 提交于
When start_queue fails, we need to make sure to drain the queue cq before freeing the rdma resources because we might still race with the completion path. Have start_queue() error path safely stop the queue. -- [30371.808111] nvme nvme1: Failed reconnect attempt 11 [30371.808113] nvme nvme1: Reconnecting in 10 seconds... [...] [30382.069315] nvme nvme1: creating 4 I/O queues. [30382.257058] nvme nvme1: Connect Invalid SQE Parameter, qid 4 [30382.257061] nvme nvme1: failed to connect queue: 4 ret=386 [30382.305001] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 [30382.305022] IP: qedr_poll_cq+0x8a3/0x1170 [qedr] [30382.305028] PGD 0 P4D 0 [30382.305037] Oops: 0000 [#1] SMP PTI [...] [30382.305153] Call Trace: [30382.305166] ? __switch_to_asm+0x34/0x70 [30382.305187] __ib_process_cq+0x56/0xd0 [ib_core] [30382.305201] ib_poll_handler+0x26/0x70 [ib_core] [30382.305213] irq_poll_softirq+0x88/0x110 [30382.305223] ? sort_range+0x20/0x20 [30382.305232] __do_softirq+0xde/0x2c6 [30382.305241] ? sort_range+0x20/0x20 [30382.305249] run_ksoftirqd+0x1c/0x60 [30382.305258] smpboot_thread_fn+0xef/0x160 [30382.305265] kthread+0x113/0x130 [30382.305273] ? kthread_create_worker_on_cpu+0x50/0x50 [30382.305281] ret_from_fork+0x35/0x40 -- Reported-by: NNicolas Morey-Chaisemartin <NMoreyChaisemartin@suse.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
- 24 6月, 2019 1 次提交
-
-
由 Israel Rukshin 提交于
This is a preparation for adding new signature API to the rw-API. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
-
- 21 6月, 2019 1 次提交
-
-
由 Ming Lei 提交于
sg_alloc_table_chained() currently allows the caller to provide one preallocated SGL and returns if the requested number isn't bigger than size of that SGL. This is used to inline an SGL for an IO request. However, scattergather code only allows that size of the 1st preallocated SGL to be SG_CHUNK_SIZE(128). This means a substantial amount of memory (4KB) is claimed for the SGL for each IO request. If the I/O is small, it would be prudent to allocate a smaller SGL. Introduce an extra parameter to sg_alloc_table_chained() and sg_free_table_chained() for specifying size of the preallocated SGL. Both __sg_free_table() and __sg_alloc_table() assume that each SGL has the same size except for the last one. Change the code to allow both functions to accept a variable size for the 1st preallocated SGL. [mkp: attempted to clarify commit desc] Cc: Christoph Hellwig <hch@lst.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Ewan D. Milne <emilne@redhat.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: netdev@vger.kernel.org Cc: linux-nvme@lists.infradead.org Suggested-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
-