Commit 0525af71, authored by Israel Rukshin, committed by Jens Axboe

nvme-rdma: remove timeout for getting RDMA-CM established event

When many controllers start error recovery at the same time (i.e.,
when a port goes down and comes back up), they may never succeed in
reconnecting. This is because the target can't handle all the connect
requests within three seconds (the arbitrary value set today). Even if
some of the connections are established, when a single queue fails to
connect, all of the controller's queues are destroyed as well. So, on
subsequent reconnection attempts the number of connect requests may
remain the same. To fix this, remove the timeout and wait for an RDMA-CM
event to abort or complete the connect request. RDMA-CM sends an
unreachable event once its own timeout of ~90 seconds expires. Other
RDMA-CM users, such as SRP and iSER, use the same approach in blocking
mode. The commit also renames NVME_RDMA_CONNECT_TIMEOUT_MS to
NVME_RDMA_CM_TIMEOUT_MS.
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Parent 7012eef5
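
The reasoning in the commit message can be illustrated with a toy model. This is a sketch only, not kernel code: `struct storm`, its fields, and the assumption that the target answers connect requests at a constant rate are all hypothetical and chosen purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model of a reconnect storm; not taken from the driver. */
struct storm {
    uint32_t controllers;     /* controllers reconnecting at the same time */
    uint32_t queues_per_ctrl; /* connect requests issued per controller */
    uint32_t target_rate;     /* connects the target answers per second (must be > 0) */
};

/* Milliseconds until the target has answered every pending connect request. */
static uint64_t storm_drain_ms(const struct storm *s)
{
    uint64_t total = (uint64_t)s->controllers * s->queues_per_ctrl;

    /* Integer division, rounded up. */
    return (total * 1000 + s->target_rate - 1) / s->target_rate;
}

/*
 * Worst case: the last answer arrives at storm_drain_ms(). A queue that
 * misses the deadline fails, its controller destroys all of its queues,
 * and the next attempt starts with the same number of requests, so
 * retrying with the same deadline gives the same result.
 */
static bool storm_attempt_completes(const struct storm *s, uint64_t deadline_ms)
{
    return storm_drain_ms(s) <= deadline_ms;
}
```

With 100 controllers of 8 queues each and an assumed target rate of 100 connects per second, the model needs 8000 ms to drain, so an attempt bounded by 3000 ms cannot complete, while one bounded only by the ~90 second RDMA-CM timeout can.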
@@ -29,7 +29,7 @@
 #include "fabrics.h"
-#define NVME_RDMA_CONNECT_TIMEOUT_MS 3000 /* 3 second */
+#define NVME_RDMA_CM_TIMEOUT_MS 3000 /* 3 second */
 #define NVME_RDMA_MAX_SEGMENTS 256
@@ -248,12 +248,9 @@ static int nvme_rdma_wait_for_cm(struct nvme_rdma_queue *queue)
 {
     int ret;
-    ret = wait_for_completion_interruptible_timeout(&queue->cm_done,
-            msecs_to_jiffies(NVME_RDMA_CONNECT_TIMEOUT_MS) + 1);
-    if (ret < 0)
+    ret = wait_for_completion_interruptible(&queue->cm_done);
+    if (ret)
         return ret;
-    if (ret == 0)
-        return -ETIMEDOUT;
     WARN_ON_ONCE(queue->cm_error > 0);
     return queue->cm_error;
 }
@@ -612,7 +609,7 @@ static int nvme_rdma_alloc_queue(struct nvme_rdma_ctrl *ctrl,
     queue->cm_error = -ETIMEDOUT;
     ret = rdma_resolve_addr(queue->cm_id, src_addr,
             (struct sockaddr *)&ctrl->addr,
-            NVME_RDMA_CONNECT_TIMEOUT_MS);
+            NVME_RDMA_CM_TIMEOUT_MS);
     if (ret) {
         dev_info(ctrl->ctrl.device,
             "rdma_resolve_addr failed (%d).\n", ret);
@@ -1895,7 +1892,7 @@ static int nvme_rdma_addr_resolved(struct nvme_rdma_queue *queue)
     if (ctrl->opts->tos >= 0)
         rdma_set_service_type(queue->cm_id, ctrl->opts->tos);
-    ret = rdma_resolve_route(queue->cm_id, NVME_RDMA_CONNECT_TIMEOUT_MS);
+    ret = rdma_resolve_route(queue->cm_id, NVME_RDMA_CM_TIMEOUT_MS);
     if (ret) {
         dev_err(ctrl->device, "rdma_resolve_route failed (%d).\n",
             queue->cm_error);
......
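
The waiting pattern after this change can be sketched in userspace with POSIX threads. This is an analogy under stated assumptions, not the driver's code: `struct completion` here is a hypothetical stand-in built from a mutex and a condition variable, `cm_queue_init`, `cm_event` and `wait_for_cm` are illustrative names, and the signal-interruption path of `wait_for_completion_interruptible` is left out. The point it shows is that the waiter has no deadline of its own and simply returns whatever error the event side recorded, with `-ETIMEDOUT` as the preset default.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical userspace stand-in for the kernel's struct completion. */
struct completion {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             done;
};

struct cm_queue {
    struct completion cm_done;
    int               cm_error;
};

static void cm_queue_init(struct cm_queue *q)
{
    pthread_mutex_init(&q->cm_done.lock, NULL);
    pthread_cond_init(&q->cm_done.cond, NULL);
    q->cm_done.done = 0;
    q->cm_error = -ETIMEDOUT; /* default until an event reports otherwise */
}

/* Event side: record the outcome, then wake the waiter. */
static void cm_event(struct cm_queue *q, int error)
{
    pthread_mutex_lock(&q->cm_done.lock);
    q->cm_error = error;
    q->cm_done.done = 1;
    pthread_cond_signal(&q->cm_done.cond);
    pthread_mutex_unlock(&q->cm_done.lock);
}

/* Waiter side: no local deadline; block until an event has been recorded. */
static int wait_for_cm(struct cm_queue *q)
{
    int err;

    pthread_mutex_lock(&q->cm_done.lock);
    while (!q->cm_done.done)
        pthread_cond_wait(&q->cm_done.cond, &q->cm_done.lock);
    err = q->cm_error;
    pthread_mutex_unlock(&q->cm_done.lock);
    return err;
}
```

In this sketch a connect attempt ends only when some thread calls `cm_event()`, either with 0 on success or with a negative errno on failure, which mirrors the commit's decision to let the connection manager's own event decide when to give up.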