1. 16 Aug 2021: 4 commits
  2. 13 Jul 2021: 1 commit
  3. 01 Jul 2021: 1 commit
  4. 17 Jun 2021: 1 commit
  5. 16 Jun 2021: 1 commit
  6. 03 Jun 2021: 1 commit
    • nvme-tcp: allow selecting the network interface for connections · 3ede8f72
      Committed by Martin Belanger
      In our application, we need a way to force TCP connections to go out a
      specific IP interface instead of letting Linux select the interface
      based on the routing tables.
      
      Add the 'host-iface' option to allow specifying the interface to use.
      When the option host-iface is specified, the driver uses the specified
      interface to set the option SO_BINDTODEVICE on the TCP socket before
      connecting.
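
      For illustration, here is a minimal userspace sketch of the same
      construct (the driver applies the equivalent to its kernel socket);
      the interface name and destination address are placeholders:

      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <arpa/inet.h>
      #include <sys/socket.h>

      int main(void)
      {
              const char *iface = "enp0s3";   /* placeholder name */
              int fd = socket(AF_INET, SOCK_STREAM, 0);

              /* Force all egress traffic for this socket out of 'iface',
               * regardless of the routing tables (needs CAP_NET_RAW). */
              if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
                             iface, strlen(iface)) < 0) {
                      perror("SO_BINDTODEVICE");
                      return 1;
              }

              struct sockaddr_in dst = {
                      .sin_family = AF_INET,
                      .sin_port = htons(4420), /* NVMe/TCP default port */
              };
              inet_pton(AF_INET, "192.168.56.1", &dst.sin_addr);
              if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
                      perror("connect");
              close(fd);
              return 0;
      }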
      
      This new option is needed in addition to the existing host-traddr for
      the following reasons:
      
      Specifying an IP interface by its associated IP address is less
      intuitive than specifying the actual interface name and, in some cases,
      simply doesn't work. That's because the association between interfaces
      and IP addresses is not predictable. IP addresses can be changed or can
      change by themselves over time (e.g. DHCP). Interface names are
      predictable [1] and will persist over time. Consider the following
      configuration.
      
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state ...
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 100.0.0.100/24 scope global lo
             valid_lft forever preferred_lft forever
      2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
          link/ether 08:00:27:21:65:ec brd ff:ff:ff:ff:ff:ff
          inet 100.0.0.100/24 scope global enp0s3
             valid_lft forever preferred_lft forever
      3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
          link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
          inet 100.0.0.100/24 scope global enp0s8
             valid_lft forever preferred_lft forever
      
      The above is a VM that I configured with the same IP address
      (100.0.0.100) on all interfaces. Doing a reverse lookup to identify the
      unique interface associated with 100.0.0.100 does not work here. And
      this is why the option host_iface is required. I understand that the
      above config does not represent a standard host system, but I'm using
      this to prove a point: "We can never know how users will configure
      their systems". By te way, The above configuration is perfectly fine
      by Linux.
      
      The current TCP implementation for host_traddr performs a
      bind()-before-connect(). This is a common construct to set the source
      IP address on a TCP socket before connecting. This has no effect on how
      Linux selects the interface for the connection. That's because Linux
      uses the Weak End System model as described in RFC1122 [2]. On the other
      hand, setting the Source IP Address has benefits and should be supported
      by linux-nvme. In fact, setting the Source IP Address is a mandatory
      FedGov requirement (e.g. connection to a RADIUS/TACACS+ server).
      Consider the following configuration.
      
      $ ip addr list dev enp0s8
      3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
          link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
          inet 192.168.56.101/24 brd 192.168.56.255 scope global enp0s8
             valid_lft 426sec preferred_lft 426sec
          inet 192.168.56.102/24 scope global secondary enp0s8
             valid_lft forever preferred_lft forever
          inet 192.168.56.103/24 scope global secondary enp0s8
             valid_lft forever preferred_lft forever
          inet 192.168.56.104/24 scope global secondary enp0s8
             valid_lft forever preferred_lft forever
      
      Here we can see that several addresses are associated with interface
      enp0s8. By default, Linux always selects the default IP address,
      192.168.56.101, as the source address when connecting over interface
      enp0s8. Some users, however, want the ability to specify a different
      source address (e.g., 192.168.56.102, 192.168.56.103, ...). The option
      host_traddr can be used as-is to perform this function.
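
      For illustration, a minimal userspace sketch of
      bind()-before-connect() using one of the secondary addresses above
      (the destination address is a placeholder); note that this pins the
      source address only, not the egress interface:

      #include <stdio.h>
      #include <unistd.h>
      #include <arpa/inet.h>
      #include <sys/socket.h>

      int main(void)
      {
              int fd = socket(AF_INET, SOCK_STREAM, 0);

              /* Pin the source address; leaving sin_port at 0 lets the
               * kernel choose an ephemeral source port. */
              struct sockaddr_in src = { .sin_family = AF_INET };
              inet_pton(AF_INET, "192.168.56.102", &src.sin_addr);
              if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0) {
                      perror("bind");
                      return 1;
              }

              struct sockaddr_in dst = {
                      .sin_family = AF_INET,
                      .sin_port = htons(4420),
              };
              inet_pton(AF_INET, "192.168.56.1", &dst.sin_addr);
              if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
                      perror("connect");
              close(fd);
              return 0;
      }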
      
      In conclusion, I believe that we need 2 options for TCP connections.
      One that can be used to specify an interface (host-iface). And one that
      can be used to set the source address (host-traddr). Users should be
      allowed to use one or the other, or both, or none. Of course, the
      documentation for host_traddr will need some clarification. It
      should state that, when used for TCP connections, this option only
      sets the source address. And the documentation for host_iface
      should say that this option is only available for TCP connections.
      
      References:
      [1] https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/
      [2] https://tools.ietf.org/html/rfc1122
      
      Tested both IPv4 and IPv6 connections.
      Signed-off-by: Martin Belanger <martin.belanger@dell.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  7. 19 May 2021: 2 commits
    • nvme-tcp: rerun io_work if req_list is not empty · a0fdd141
      Committed by Keith Busch
      A possible race condition exists where a request to send data that
      is enqueued from nvme_tcp_handle_r2t() will not be observed by
      nvme_tcp_send_all() if the latter happens to be running already.
      The driver relies on io_work to send the enqueued request when it
      runs again, but the concurrently running nvme_tcp_send_all() may
      not have released the send_mutex at that time. If no future
      commands are enqueued to re-kick io_work, the request will time
      out in the SEND_H2C state, resulting in a timeout error like:
      
        nvme nvme0: queue 1: timeout request 0x3 type 6
      
      Ensure the io_work continues to run as long as the req_list is not empty.
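
      A hedged sketch of the shape of the fix (simplified; the actual
      patch may differ in detail):

      /* Sketch only: at the tail of nvme_tcp_io_work().  Even when the
       * send_mutex could not be taken because another context holds it,
       * re-arm io_work whenever the lock-less req_list still has
       * entries, so a request enqueued by nvme_tcp_handle_r2t() is
       * never stranded. */
      if (pending || !llist_empty(&queue->req_list))
              queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);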
      
      Fixes: db5ad6b7 ("nvme-tcp: try to send request in queue_rq context")
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-tcp: fix possible use-after-completion · 825619b0
      Committed by Sagi Grimberg
      Commit db5ad6b7 ("nvme-tcp: try to send request in queue_rq
      context") added a second context that may perform a network send.
      This means that now RX and TX are not serialized in nvme_tcp_io_work
      and can run concurrently.
      
      While there is correct mutual exclusion in the TX path (where the
      send_mutex protects the queue socket send activity), RX activity,
      and more specifically request completion, may run concurrently.

      This means we must guarantee that any mutation of request state
      related to its lifetime (such as bytes sent) is not performed once
      a completion may have arrived back and been processed.
      
      The race may trigger when a request completion arrives, is
      processed _and_ the request is reused as a fresh new request,
      exactly in the (relatively short) window between the last data
      payload send and the advance of the request iov_iter.
      
      Consider the following race:
      1. 16K write request is queued
      2. The nvme command and the data is sent to the controller (in-capsule
         or solicited by r2t)
      3. After the last payload is sent but before the req.iter is advanced,
         the controller sends back a completion.
      4. The completion is processed, the request is completed, and reused
         to transfer a new request (write or read)
      5. The new request is queued, and the driver resets the request
         parameters (nvme_tcp_setup_cmd_pdu).
      6. Now the context in (2) resumes execution and advances the req.iter
      
      ==> use-after-completion as this is already a new request.
      
      Fix this by making sure the request is not advanced after the last
      data payload send, knowing that a completion may have arrived already.
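
      A hedged sketch of the idea (simplified; the sent/length counters
      are read into locals before the send, since the request must not
      be touched once its last byte is out):

      /* Sketch only: TX path, right after sending 'ret' bytes. */
      if (data_sent + ret < data_len)
              nvme_tcp_advance_req(req, ret); /* more payload: safe */
      /* else: the last payload was just sent and a completion may
       * already have recycled this request - do not touch it again. */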
      
      An alternative solution would have been to delay the request
      completion or state change by reference counting the TX path, but
      besides adding atomic operations to the hot path, it may present
      challenges in multi-stage R2T scenarios where an r2t handler needs
      to be deferred to async execution.
      Reported-by: Narayan Ayalasomayajula <narayan.ayalasomayajula@wdc.com>
      Tested-by: Anil Mishra <anil.mishra@wdc.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Cc: stable@vger.kernel.org # v5.8+
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  8. 04 May 2021: 1 commit
  9. 03 Apr 2021: 4 commits
  10. 18 Mar 2021: 4 commits
  11. 11 Feb 2021: 1 commit
    • nvme-tcp: fix crash triggered with a dataless request submission · e11e5116
      Committed by Sagi Grimberg
      write-zeroes has a bio, but does not have any data buffers
      associated with it, hence we should not initialize the request
      iter for it (which attempts to reference the bi_io_vec and
      crashes).
      --
       run blktests nvme/012 at 2021-02-05 21:53:34
       BUG: kernel NULL pointer dereference, address: 0000000000000008
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] SMP NOPTI
       CPU: 15 PID: 12069 Comm: kworker/15:2H Tainted: G S        I       5.11.0-rc6+ #1
       Hardware name: Dell Inc. PowerEdge R640/06NR82, BIOS 2.10.0 11/12/2020
       Workqueue: kblockd blk_mq_run_work_fn
       RIP: 0010:nvme_tcp_init_iter+0x7d/0xd0 [nvme_tcp]
       RSP: 0018:ffffbd084447bd18 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffffa0bba9f3ce80 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000002000000
       RBP: ffffa0ba8ac6fec0 R08: 0000000002000000 R09: 0000000000000000
       R10: 0000000002800809 R11: 0000000000000000 R12: 0000000000000000
       R13: ffffa0bba9f3cf90 R14: 0000000000000000 R15: 0000000000000000
       FS:  0000000000000000(0000) GS:ffffa0c9ff9c0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000008 CR3: 00000001c9c6c005 CR4: 00000000007706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       PKRU: 55555554
       Call Trace:
        nvme_tcp_queue_rq+0xef/0x330 [nvme_tcp]
        blk_mq_dispatch_rq_list+0x11c/0x7c0
        ? blk_mq_flush_busy_ctxs+0xf6/0x110
        __blk_mq_sched_dispatch_requests+0x12b/0x170
        blk_mq_sched_dispatch_requests+0x30/0x60
        __blk_mq_run_hw_queue+0x2b/0x60
        process_one_work+0x1cb/0x360
        ? process_one_work+0x360/0x360
        worker_thread+0x30/0x370
        ? process_one_work+0x360/0x360
        kthread+0x116/0x130
        ? kthread_park+0x80/0x80
        ret_from_fork+0x1f/0x30
      --
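
      A hedged sketch of the guard (the exact upstream condition may
      differ):

      /* Sketch only: when setting up the command.  A write-zeroes
       * request has a bio but no data buffers, so only build the
       * iov_iter when the request carries actual payload. */
      if (rq->bio && blk_rq_nr_phys_segments(rq))
              nvme_tcp_init_iter(req, rq_data_dir(rq));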
      
      Fixes: cb9b870f ("nvme-tcp: fix wrong setting of request iov_iter")
      Reported-by: Yi Zhang <yi.zhang@redhat.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Tested-by: Yi Zhang <yi.zhang@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  12. 02 Feb 2021: 5 commits
  13. 19 Jan 2021: 1 commit
    • nvme-tcp: avoid request double completion for concurrent nvme_tcp_timeout · 9ebbfe49
      Committed by Chao Leng
      Each namespace has a request queue. If requests take a long time
      to complete, multiple request queues may have timed-out requests
      at the same time, and nvme_tcp_timeout will execute concurrently.
      Requests from different request queues may be queued on the same
      tcp queue, so multiple calls to nvme_tcp_timeout may invoke
      nvme_tcp_stop_queue at the same time. The first nvme_tcp_stop_queue
      clears NVME_TCP_Q_LIVE and continues stopping the tcp queue
      (cancelling io_work), but the others see that NVME_TCP_Q_LIVE is
      already cleared and directly complete the requests. Completing a
      request before io_work is fully cancelled may lead to a
      use-after-free.

      Add a mutex to serialize nvme_tcp_stop_queue.
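
      A hedged sketch of the serialization (names follow the upstream
      code, but treat the details as illustrative):

      /* Sketch only: a per-queue mutex makes stopping idempotent; the
       * first caller tears the queue down, and later callers only see
       * NVME_TCP_Q_LIVE cleared once io_work is fully cancelled. */
      static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
      {
              struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
              struct nvme_tcp_queue *queue = &ctrl->queues[qid];

              mutex_lock(&queue->queue_lock);
              if (test_and_clear_bit(NVME_TCP_Q_LIVE, &queue->flags))
                      __nvme_tcp_stop_queue(queue);
              mutex_unlock(&queue->queue_lock);
      }
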
      Signed-off-by: Chao Leng <lengchao@huawei.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  14. 15 Jan 2021: 2 commits
  15. 06 Jan 2021: 1 commit
    • nvme-tcp: Fix possible race of io_work and direct send · 5c11f7d9
      Committed by Sagi Grimberg
      We may send a request (with or without its data) from two paths:
      
        1. From our I/O context nvme_tcp_io_work which is triggered from:
          - queue_rq
          - r2t reception
          - socket data_ready and write_space callbacks
        2. Directly from queue_rq if the send_list is empty (because we want to
           save the context switch associated with scheduling our io_work).
      
      However, given that we now have the send_mutex, we may run into a
      race condition where none of these contexts will send the pending
      payload to the controller. Both the io_work send path and the
      queue_rq send path opportunistically attempt to acquire the
      send_mutex; however, queue_rq only attempts to send a single
      request, and if the io_work context fails to acquire the
      send_mutex it will complete without rescheduling itself.
      
      The race can trigger with the following sequence:
      
        1. queue_rq sends a request (no in-capsule data) and blocks
        2. RX path receives r2t - prepares data PDU to send, adds h2cdata PDU
           to the send_list and schedules io_work
        3. io_work triggers and cannot acquire the send_mutex - because of (1),
           it ends without rescheduling itself
        4. queue_rq finishes its send and returns, leaving the h2cdata PDU
           on the send_list
      
      ==> no context will send the h2cdata - timeout.
      
      Fix this by having queue_rq send as much as it can from the
      send_list, such that if anything is left over, it's because the
      socket buffer is full and the socket write_space callback will
      trigger, thus guaranteeing that a context will be scheduled to
      send the h2cdata PDU.
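
      A hedged sketch of the queue_rq side of the fix (nvme_tcp_send_all()
      exists upstream; its body here is simplified):

      /* Sketch only: called from queue_rq with send_mutex held.  Keep
       * sending until the send list is drained or the socket buffer
       * fills up; in the latter case the write_space callback will
       * re-arm io_work once there is room again. */
      static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
      {
              int ret;

              do {
                      ret = nvme_tcp_try_send(queue);
              } while (ret > 0);
      }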
      
      Fixes: db5ad6b7 ("nvme-tcp: try to send request in queue_rq context")
      Reported-by: Potnuri Bharat Teja <bharat@chelsio.com>
      Reported-by: Samuel Jones <sjones@kalrayinc.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Tested-by: Potnuri Bharat Teja <bharat@chelsio.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  16. 02 Dec 2020: 1 commit
  17. 03 Nov 2020: 2 commits
  18. 03 Oct 2020: 1 commit
    • nvme-tcp: check page by sendpage_ok() before calling kernel_sendpage() · 7d4194ab
      Committed by Coly Li
      Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage()
      to send slab pages. But pages allocated by __get_free_pages()
      without __GFP_COMP, which also have a refcount of 0, are still
      sent to the remote end by kernel_sendpage(), which is problematic.
      
      The newly introduced helper sendpage_ok() checks both the PageSlab
      flag and the page_count counter, and returns true if the page is
      OK to be sent by kernel_sendpage().
      
      This patch fixes the page check in nvme_tcp_try_send_data() with
      sendpage_ok(): if sendpage_ok() returns true, the page is sent by
      kernel_sendpage(); otherwise sock_no_sendpage() handles it.
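
      For reference, the helper and the call-site pattern look roughly
      like this (sketch; see include/linux/net.h for the authoritative
      definition):

      /* A page is safe for kernel_sendpage() only if it is not a slab
       * page and its refcount is live. */
      static inline bool sendpage_ok(struct page *page)
      {
              return !PageSlab(page) && page_count(page) >= 1;
      }

      /* Call-site pattern in nvme_tcp_try_send_data() (sketch): */
      if (sendpage_ok(page))
              ret = kernel_sendpage(queue->sock, page, offset, len, flags);
      else
              ret = sock_no_sendpage(queue->sock, page, offset, len, flags);
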
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Vlastimil Babka <vbabka@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 09 Sep 2020: 1 commit
  20. 29 Aug 2020: 3 commits
    • nvme-tcp: fix reset hang if controller died in the middle of a reset · e5c01f4f
      Committed by Sagi Grimberg
      If the controller becomes unresponsive in the middle of a reset,
      we will hang because we are waiting for the freeze to complete,
      but that cannot happen since there are inflight commands holding
      the q_usage_counter, and we can't blindly fail requests that time
      out.

      So wait for the queue freeze with a timeout, and if it cannot
      complete before unfreezing, fail and have the error handling take
      care of how to proceed (either schedule a reconnect or remove the
      controller).
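
      A hedged sketch of the change in the reset path
      (nvme_wait_freeze_timeout() is the bounded variant of
      nvme_wait_freeze(); the error label is illustrative):

      /* Sketch only: bound the freeze wait so a dead controller cannot
       * wedge the reset forever. */
      if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
              /* Inflight commands are stuck holding q_usage_counter:
               * fail the reset and let error handling reconnect or
               * remove the controller. */
              ret = -ENODEV;
              goto out_wait_freeze_timed_out; /* illustrative label */
      }
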
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme-tcp: fix timeout handler · 236187c4
      Committed by Sagi Grimberg
      When a request times out in a LIVE state, we simply trigger error
      recovery and let the error recovery handle the request
      cancellation. However, when a request times out in a non-LIVE
      state, we make sure to complete it immediately, as it might block
      controller setup or teardown and prevent forward progress.
      
      However, tearing down the entire set of I/O and admin queues
      causes a freeze/unfreeze imbalance (q->mq_freeze_depth) and is
      really overkill for what we actually need, which is just to fence
      a controller teardown that may be running, stop the queue, and
      cancel the request if it is not already completed.
      
      Now that we have the controller teardown_lock, we can safely
      serialize request cancellation. This addresses a hang caused by
      calling an extra queue freeze on controller namespaces, which
      caused unfreeze to not complete correctly.
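
      A hedged sketch of the narrower handling (names follow the
      upstream code; treat the details as illustrative):

      /* Sketch only: for a request timing out in a non-LIVE state,
       * fence a concurrent teardown, stop just this queue, and cancel
       * the request if nothing else completed it - no queue freeze. */
      static void nvme_tcp_complete_timed_out(struct request *rq)
      {
              struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
              struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;

              mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
              nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
              if (!blk_mq_request_completed(rq)) {
                      nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
                      blk_mq_complete_request(rq);
              }
              mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
      }
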
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme-tcp: serialize controller teardown sequences · d4d61470
      Committed by Sagi Grimberg
      In the timeout handler we may need to complete a request because the
      request that timed out may be an I/O that is a part of a serial sequence
      of controller teardown or initialization. In order to complete the
      request, we need to fence any other context that may compete with us
      and complete the request that is timing out.
      
      In this case, we could have a potential double completion in case
      a hard-irq or a different competing context triggered error recovery
      and is running inflight request cancellation concurrently with the
      timeout handler.
      
      Protect using a ctrl teardown_lock to serialize contexts that may
      complete a cancelled request due to error recovery or a reset.
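
      A hedged sketch of the serialization (simplified; the lock simply
      brackets every teardown sequence):

      /* Sketch only: every teardown path takes the same controller-wide
       * mutex, so any context that may complete a cancelled request is
       * serialized against the timeout handler. */
      mutex_lock(&ctrl->teardown_lock);
      /* ... stop queues and cancel inflight requests ... */
      mutex_unlock(&ctrl->teardown_lock);
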
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  21. 24 Aug 2020: 1 commit
  22. 22 Aug 2020: 1 commit