1. 10 Feb 2021, 1 commit
  2. 02 Feb 2021, 2 commits
  3. 25 Jan 2021, 1 commit
  4. 19 Jan 2021, 1 commit
    • nvme-rdma: avoid request double completion for concurrent nvme_rdma_timeout · 7674073b
      Authored by Chao Leng
      A crash happens when fault injection delays request completion for a
      long time (nearly 30 s). Each namespace has its own request queue, so
      with such a delay multiple request queues can have timed-out requests
      at the same time and nvme_rdma_timeout executes concurrently.
      Requests from different request queues may be mapped onto the same
      RDMA queue, so several nvme_rdma_timeout instances may call
      nvme_rdma_stop_queue at the same time. The first nvme_rdma_timeout
      clears NVME_RDMA_Q_LIVE and goes on to stop the RDMA queue (drain the
      QP), but the others see that NVME_RDMA_Q_LIVE is already cleared and
      complete their requests directly; completing a request before the QP
      is fully drained can lead to a use-after-free.
      
      Add a mutex to serialize nvme_rdma_stop_queue (a hedged sketch of the
      pattern follows this entry).
      Signed-off-by: Chao Leng <lengchao@huawei.com>
      Tested-by: Israel Rukshin <israelr@nvidia.com>
      Reviewed-by: Israel Rukshin <israelr@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      7674073b
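      A minimal sketch of the serialization, assuming a per-queue mutex; the
      field name queue_lock and the __nvme_rdma_stop_queue helper are
      illustrative:

        static void nvme_rdma_stop_queue(struct nvme_rdma_queue *queue)
        {
                mutex_lock(&queue->queue_lock);         /* assumed mutex name */
                /* only the caller that clears NVME_RDMA_Q_LIVE drains the QP;
                 * later callers block here until the drain has finished */
                if (test_and_clear_bit(NVME_RDMA_Q_LIVE, &queue->flags))
                        __nvme_rdma_stop_queue(queue);  /* ib_drain_qp() etc. */
                mutex_unlock(&queue->queue_lock);
        }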
  5. 02 Dec 2020, 1 commit
  6. 13 Nov 2020, 1 commit
  7. 03 Nov 2020, 2 commits
  8. 28 Oct 2020, 1 commit
  9. 27 Oct 2020, 1 commit
  10. 22 Oct 2020, 2 commits
    • nvme-rdma: fix crash due to incorrect cqe · a87da50f
      Authored by Chao Leng
      A crash happened during error-injection testing. When a CQE carries
      an incorrect command id because of the injected error, the host may
      look up a request that has already been freed. Dereferencing
      req->mr->rkey then crashes in nvme_rdma_process_nvme_rsp because the
      MR has already been freed.
      
      Add a check for the MR to fix it (see the sketch after this entry).
      Signed-off-by: Chao Leng <lengchao@huawei.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      a87da50f
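      A hedged sketch of the added check in the response path; the
      surrounding variable names are illustrative, the point is never to
      dereference req->mr when no MR is attached to the request:

        if (wc->wc_flags & IB_WC_WITH_INVALIDATE) {
                if (unlikely(!req->mr ||
                             wc->ex.invalidate_rkey != req->mr->rkey)) {
                        /* bogus remote invalidation: don't touch a freed MR,
                         * just kick error recovery */
                        nvme_rdma_error_recovery(queue->ctrl);
                }
        }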
    • nvme-rdma: fix crash when connect rejected · 43efdb8e
      Authored by Chao Leng
      A crash can happen when a connect is rejected. The host establishes
      the connection after receiving a ConnectReply and then continues by
      sending the fabrics Connect command. If the controller does not
      receive the ReadyToUse capsule, the host may receive a ConnectReject
      reply.
      
      Previously the host called nvme_rdma_destroy_queue_ib after receiving
      the RDMA_CM_EVENT_REJECTED event. When the fabrics Connect command
      later timed out, nvme_rdma_timeout called nvme_rdma_complete_rq to
      fail the request, and the resulting use after free in
      nvme_rdma_complete_rq crashed.
      
      Calling nvme_rdma_destroy_queue_ib while handling the
      RDMA_CM_EVENT_REJECTED event is redundant, as nvme_rdma_destroy_queue_ib
      is already called in the connection-failure handler; drop the call (a
      sketch of the handler follows this entry).
      Signed-off-by: Chao Leng <lengchao@huawei.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      43efdb8e
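      A sketch of the relevant branches of the CM event handler after the
      fix (illustrative): the queue's IB resources are no longer destroyed
      here, only in the single connection-failure path.

        case RDMA_CM_EVENT_REJECTED:
                cm_error = nvme_rdma_conn_rejected(queue, ev);
                break;
        case RDMA_CM_EVENT_ROUTE_ERROR:
        case RDMA_CM_EVENT_CONNECT_ERROR:
        case RDMA_CM_EVENT_UNREACHABLE:
                /* no nvme_rdma_destroy_queue_ib() here any more */
                cm_error = -ECONNRESET;
                break;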
  11. 09 Sep 2020, 1 commit
  12. 29 Aug 2020, 3 commits
    • nvme-rdma: fix reset hang if controller died in the middle of a reset · 2362acb6
      Authored by Sagi Grimberg
      If the controller becomes unresponsive in the middle of a reset, we
      hang waiting for the freeze to complete, but it never can: inflight
      commands are holding the q_usage_counter, and we cannot blindly fail
      requests that time out.
      
      So bound the wait with a timeout: if the queue freeze cannot complete
      before unfreezing, fail and let the error handling decide how to
      proceed (either schedule a reconnect or remove the controller). A
      hedged sketch follows this entry.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      2362acb6
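      A hedged sketch of the bounded wait during a non-initial reset; the
      error label and exact call sites are illustrative:

        if (!new) {
                nvme_start_queues(&ctrl->ctrl);
                if (!nvme_wait_freeze_timeout(&ctrl->ctrl, NVME_IO_TIMEOUT)) {
                        /* controller died while freezing: bail out and let
                         * error handling reconnect or remove the controller */
                        ret = -ENODEV;
                        goto out_wait_freeze_timed_out;
                }
                blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
                                           ctrl->ctrl.queue_count - 1);
                nvme_unfreeze(&ctrl->ctrl);
        }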
    • nvme-rdma: fix timeout handler · 0475a8dc
      Authored by Sagi Grimberg
      When a request times out in the LIVE state, we simply trigger error
      recovery and let it handle the request cancellation. However, when a
      request times out in a non-LIVE state, we must complete it
      immediately because it might block controller setup or teardown and
      prevent forward progress.
      
      Tearing down the entire set of I/O and admin queues for that is
      overkill, though, and causes a freeze/unfreeze imbalance
      (q->mq_freeze_depth). What we actually need is to fence any
      controller teardown that may be running, stop the queue, and cancel
      the request if it has not already completed.
      
      Now that we have the controller teardown_lock, we can safely
      serialize request cancellation (sketched below). This addresses a
      hang caused by an extra queue freeze on controller namespaces, which
      kept unfreeze from completing correctly.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      0475a8dc
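      A hedged sketch of the fenced completion used by the timeout handler
      for requests in a non-LIVE state (structure follows the description;
      details are illustrative):

        static void nvme_rdma_complete_timed_out(struct request *rq)
        {
                struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
                struct nvme_rdma_queue *queue = req->queue;
                struct nvme_rdma_ctrl *ctrl = queue->ctrl;

                /* fence a controller teardown that may be running */
                mutex_lock(&ctrl->teardown_lock);
                nvme_rdma_stop_queue(queue);
                if (!blk_mq_request_completed(rq)) {
                        nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
                        blk_mq_complete_request(rq);
                }
                mutex_unlock(&ctrl->teardown_lock);
        }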
    • nvme-rdma: serialize controller teardown sequences · 5110f402
      Authored by Sagi Grimberg
      In the timeout handler we may need to complete a request because the
      request that timed out may be an I/O that is part of a serial
      sequence of controller teardown or initialization. In order to
      complete the request, we need to fence any other context that may
      compete with us and complete the request that is timing out.
      
      Otherwise we could see a double completion if a hard-irq or a
      different competing context triggered error recovery and is running
      inflight request cancellation concurrently with the timeout handler.
      
      Protect against this with a per-controller teardown_lock that
      serializes the contexts which may complete a cancelled request due to
      error recovery or a reset (see the sketch below).
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      5110f402
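      A minimal sketch of the serialization, assuming struct nvme_rdma_ctrl
      gains a teardown_lock mutex (initialized at controller allocation)
      that wraps the admin/IO teardown sequences and the timeout-handler
      completion path shown in the entry above:

        static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
                                                 bool remove)
        {
                mutex_lock(&ctrl->teardown_lock);
                /* the real teardown also quiesces the namespace queues and
                 * cancels inflight requests; abbreviated here to the part
                 * that races with the timeout handler */
                nvme_rdma_stop_io_queues(ctrl);
                mutex_unlock(&ctrl->teardown_lock);
        }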
  13. 24 Aug 2020, 1 commit
  14. 22 Aug 2020, 1 commit
  15. 29 Jul 2020, 3 commits
    • nvme-rdma: fix controller reset hang during traffic · 9f98772b
      Authored by Sagi Grimberg
      commit fe35ec58 ("block: update hctx map when use multiple maps")
      exposed an issue where we may hang waiting for a queue freeze during
      I/O. We call blk_mq_update_nr_hw_queues, which with multiple queue
      maps (which we now have for default/read/poll) attempts to freeze the
      queue. However, we never started a queue freeze when starting the
      reset, so there are inflight requests that entered the queue and that
      we will not complete once the queue is quiesced.
      
      So start a freeze before we quiesce the queue, and unfreeze the queue
      after we have successfully connected the I/O queues (and make sure to
      call blk_mq_update_nr_hw_queues only once the queue is known to be
      frozen). A sketch of the ordering follows this entry.
      
      This follows how the PCI driver handles resets.
      
      Fixes: fe35ec58 ("block: update hctx map when use multiple maps")
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      9f98772b
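      A hedged sketch of the resulting ordering (function names exist in
      the nvme core; the exact call sites are illustrative):

        /* teardown: begin the freeze before quiescing */
        nvme_start_freeze(&ctrl->ctrl);
        nvme_stop_queues(&ctrl->ctrl);

        /* ... RDMA queues are torn down and reconnected in between ... */

        /* after the I/O queues are connected again (non-initial setup) */
        if (!new) {
                nvme_wait_freeze(&ctrl->ctrl);  /* queue is known frozen now */
                blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
                                           ctrl->ctrl.queue_count - 1);
                nvme_unfreeze(&ctrl->ctrl);
        }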
    • nvme: fix deadlock in disconnect during scan_work and/or ana_work · ecca390e
      Authored by Sagi Grimberg
      A deadlock happens in the following scenario with multipath:
      1) scan_work(nvme0) detects a new nsid while nvme0
          is an optimized path to it; path nvme1 happens to be
          inaccessible.
      
      2) Before scan_work completes, an nvme0 disconnect is initiated;
          nvme_delete_ctrl_sync() sets the nvme0 state to NVME_CTRL_DELETING.
      
      3) scan_work(nvme0) attempts to submit I/O,
          but nvme_path_is_optimized() observes that nvme0 is not LIVE.
          Since nvme1 is a possible path, the I/O is requeued and scan_work hangs.
      
      --
      Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      --
      
      4) Delete also hangs in flush_work(ctrl->scan_work)
          from nvme_remove_namespaces().
      
      Similarly, a deadlock with ana_work may happen: if ana_work has
      started and calls nvme_mpath_set_live and device_add_disk, it will
      trigger I/O. When we then trigger a disconnect, that I/O blocks
      because our accessible (optimized) path is disconnecting while the
      alternate path is inaccessible. Disconnect then tries to flush the
      ana_work and hangs.
      
      [  605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
      [  605.552087] Call Trace:
      [  605.552683]  __schedule+0x2b9/0x6c0
      [  605.553507]  schedule+0x42/0xb0
      [  605.554201]  io_schedule+0x16/0x40
      [  605.555012]  do_read_cache_page+0x438/0x830
      [  605.556925]  read_cache_page+0x12/0x20
      [  605.557757]  read_dev_sector+0x27/0xc0
      [  605.558587]  amiga_partition+0x4d/0x4c5
      [  605.561278]  check_partition+0x154/0x244
      [  605.562138]  rescan_partitions+0xae/0x280
      [  605.563076]  __blkdev_get+0x40f/0x560
      [  605.563830]  blkdev_get+0x3d/0x140
      [  605.564500]  __device_add_disk+0x388/0x480
      [  605.565316]  device_add_disk+0x13/0x20
      [  605.566070]  nvme_mpath_set_live+0x5e/0x130 [nvme_core]
      [  605.567114]  nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
      [  605.568197]  nvme_update_ana_state+0xca/0xe0 [nvme_core]
      [  605.569360]  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      [  605.571385]  nvme_read_ana_log+0x76/0x100 [nvme_core]
      [  605.572376]  nvme_ana_work+0x15/0x20 [nvme_core]
      [  605.573330]  process_one_work+0x1db/0x380
      [  605.574144]  worker_thread+0x4d/0x400
      [  605.574896]  kthread+0x104/0x140
      [  605.577205]  ret_from_fork+0x35/0x40
      [  605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
      [  605.579239]       Tainted: G           OE     5.3.5-050305-generic #201910071830
      [  605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  605.582320] nvme            D    0 14044  14043 0x00000000
      [  605.583424] Call Trace:
      [  605.583935]  __schedule+0x2b9/0x6c0
      [  605.584625]  schedule+0x42/0xb0
      [  605.585290]  schedule_timeout+0x203/0x2f0
      [  605.588493]  wait_for_completion+0xb1/0x120
      [  605.590066]  __flush_work+0x123/0x1d0
      [  605.591758]  __cancel_work_timer+0x10e/0x190
      [  605.593542]  cancel_work_sync+0x10/0x20
      [  605.594347]  nvme_mpath_stop+0x2f/0x40 [nvme_core]
      [  605.595328]  nvme_stop_ctrl+0x12/0x50 [nvme_core]
      [  605.596262]  nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
      [  605.597333]  nvme_sysfs_delete+0x5c/0x70 [nvme_core]
      [  605.598320]  dev_attr_store+0x17/0x30
      
      Fix this by introducing a new state, NVME_CTRL_DELETING_NOIO, which
      indicates the phase of controller deletion where I/O cannot be
      allowed to access the namespace. NVME_CTRL_DELETING still allows
      mpath I/O to be issued to the bottom device, and only after we flush
      the ana_work and scan_work (after nvme_stop_ctrl and
      nvme_prep_remove_namespaces) do we change the state to
      NVME_CTRL_DELETING_NOIO. We also prevent ana_work from re-firing by
      aborting early if we are not LIVE, so we should be safe here (see the
      sketch after this entry).
      
      In addition, change the transport drivers to follow the updated state
      machine.
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Reported-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      ecca390e
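      A minimal sketch of the ana_work early abort mentioned above; the
      state check is the relevant part, the rest mirrors the existing work
      handler:

        static void nvme_ana_work(struct work_struct *work)
        {
                struct nvme_ctrl *ctrl = container_of(work, struct nvme_ctrl,
                                                      ana_work);

                /* a controller in DELETING/DELETING_NOIO must not start new
                 * ANA log reads that could generate multipath I/O */
                if (ctrl->state != NVME_CTRL_LIVE)
                        return;

                nvme_read_ana_log(ctrl);
        }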
    • nvme-rdma: use new shared CQ mechanism · 287f329e
      Authored by Yamin Friedman
      Have the driver use shared CQs, providing a ~10%-20% improvement as
      seen in the patch introducing shared CQs. Instead of opening a CQ for
      each QP per connected controller, the RDMA core provides each QP with
      a CQ that is shared between the QPs on that core, reducing interrupt
      overhead (a sketch of the API usage follows this entry).
      Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      287f329e
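      A hedged sketch of the shared-CQ API usage; queue->ib_cq, cq_size and
      comp_vector are illustrative fields/locals:

        /* obtain a CQ from the RDMA core's shared pool instead of
         * allocating a dedicated CQ per queue pair */
        queue->ib_cq = ib_cq_pool_get(ibdev, queue->cq_size, comp_vector,
                                      IB_POLL_SOFTIRQ);
        if (IS_ERR(queue->ib_cq))
                return PTR_ERR(queue->ib_cq);

        /* ... and return it to the pool on queue teardown */
        ib_cq_pool_put(queue->ib_cq, queue->cq_size);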
  16. 25 Jun 2020, 1 commit
    • nvme-rdma: assign completion vector correctly · 032a9966
      Authored by Max Gurtovoy
      The completion vector index given at CQ creation must not exceed the
      number of vectors supported by the underlying RDMA device. This
      violation can currently occur, for example, when connecting with N
      regular read/write queues and M poll queues where N + M exceeds the
      number of supported vectors; establishing a connection to the remote
      target then fails. Instead, in that case, share a completion vector
      between queues (sketched below).
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      032a9966
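      A minimal sketch of the wrap-around assignment; the helper name is
      illustrative, num_comp_vectors is the ib_device field:

        static int nvme_rdma_queue_comp_vector(struct nvme_rdma_queue *queue)
        {
                /* the admin queue (index 0) stays on vector 0; I/O queues
                 * wrap around the vectors the device actually exposes */
                return nvme_rdma_queue_idx(queue) %
                        queue->device->dev->num_comp_vectors;
        }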
  17. 24 Jun 2020, 3 commits
  18. 27 May 2020, 2 commits
  19. 31 Mar 2020, 1 commit
  20. 26 Mar 2020, 3 commits
  21. 17 Mar 2020, 1 commit
  22. 11 Mar 2020, 1 commit
  23. 15 Feb 2020, 1 commit
    • nvme: prevent warning triggered by nvme_stop_keep_alive · 97b2512a
      Authored by Nigel Kirkland
      Delayed keep-alive work is queued on the system workqueue and may be
      cancelled via nvme_stop_keep_alive from nvme_reset_wq, nvme_fc_wq or
      nvme_wq.
      
      check_flush_dependency detects mismatched attributes between the
      workqueue context used to cancel the keep-alive work and system-wq:
      system-wq does not have the WQ_MEM_RECLAIM flag, whereas the contexts
      used to cancel the keep-alive work do.
      
      Example warning:
      
        workqueue: WQ_MEM_RECLAIM nvme-reset-wq:nvme_fc_reset_ctrl_work [nvme_fc]
      	is flushing !WQ_MEM_RECLAIM events:nvme_keep_alive_work [nvme_core]
      
      To avoid the flags mismatch, queue the delayed keep-alive work on
      nvme_wq.
      
      However, this creates a secondary concern: work and a request to
      cancel that work may now live on the same workqueue - namely err_work
      in the rdma and tcp transports, which will want to flush/cancel the
      keep-alive work that is now on nvme_wq.
      
      After reviewing the transports, it looks like err_work can be moved
      to nvme_reset_wq. In fact that aligns them better with the transition
      into RESETTING and performing the related reset work in nvme_reset_wq.
      
      Change nvme-rdma and nvme-tcp to perform err_work on nvme_reset_wq (a
      sketch of both changes follows this entry).
      Signed-off-by: Nigel Kirkland <nigel.kirkland@broadcom.com>
      Signed-off-by: James Smart <jsmart2021@gmail.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      97b2512a
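      A hedged sketch of the two workqueue changes; both workqueues already
      exist in the nvme core, the surrounding code is abbreviated:

        /* keep-alive: queue on nvme_wq instead of the system workqueue so
         * the WQ_MEM_RECLAIM attributes match the cancelling contexts */
        queue_delayed_work(nvme_wq, &ctrl->ka_work, ctrl->kato * HZ);

        /* transports: run err_work on nvme_reset_wq rather than nvme_wq */
        queue_work(nvme_reset_wq, &ctrl->err_work);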
  24. 27 Nov 2019, 1 commit
    • nvme-rdma: Avoid preallocating big SGL for data · 38e18002
      Authored by Israel Rukshin
      nvme_rdma_alloc_tagset() preallocates a big buffer for the IO SGL based
      on SG_CHUNK_SIZE.
      
      Modern DMA engines are often capable of dealing with very big segments so
      the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
      SGL allocation per command.
      
      If a controller has lots of deep queues, preallocation for the sg list can
      consume substantial amounts of memory. For nvme-rdma, nr_hw_queues can be
      128 and each queue's depth 128. This means the resulting preallocation
      for the data SGL is 128*128*4K = 64MB per controller.
      
      Switch to runtime allocation of the SGL for lists longer than 2
      entries. This is the approach used by NVMe PCI, so it should be
      reasonable for NVMeOF as well. Runtime SGL allocation has always been
      the case for the legacy I/O path, so this is nothing new.
      
      The preallocated small SGL depends on SG_CHAIN, so if the
      architecture doesn't support SG_CHAIN, use only runtime allocation
      for the SGL.
      
      We did not notice a performance degradation, since for small I/Os we
      use the inline SG list and for bigger I/Os the allocation of a larger
      SGL from the slab is fast enough. A sketch of the allocation pattern
      follows this entry.
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Israel Rukshin <israelr@mellanox.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      38e18002
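      A hedged sketch of the runtime SGL allocation pattern; the constant
      name and the first_sgl field are assumptions, the chained-SGL helpers
      are the lib/scatterlist ones:

        #define NVME_INLINE_SG_CNT 2    /* assumed name for the inline count */

        /* map: chain a runtime-allocated table onto the small SGL that is
         * embedded in the per-request PDU */
        req->sg_table.sgl = req->first_sgl;
        ret = sg_alloc_table_chained(&req->sg_table,
                                     blk_rq_nr_phys_segments(rq),
                                     req->sg_table.sgl, NVME_INLINE_SG_CNT);
        if (ret)
                return -ENOMEM;

        /* unmap: frees only the chained part, not the inline entries */
        sg_free_table_chained(&req->sg_table, NVME_INLINE_SG_CNT);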
  25. 05 Nov 2019, 3 commits
  26. 14 Oct 2019, 1 commit