1. 14 Mar 2022, 2 commits
  2. 28 Feb 2022, 1 commit
  3. 24 Nov 2021, 2 commits
  4. 27 Oct 2021, 2 commits
  5. 26 Oct 2021, 1 commit
    • nvme-tcp: fix H2CData PDU send accounting (again) · 25e1f67e
      Committed by Sagi Grimberg
      We should not access request members after the last send, even to
      determine whether it was indeed the last data payload send. The reason
      is that a completion could already have arrived and triggered a new
      execution of the request, overriding these members. This was fixed by
      commit 825619b0 ("nvme-tcp: fix possible use-after-completion").
      
      Commit e371af03 broke that assumption again in order to address cases
      where multiple r2t pdus are sent per request. To fix it, record the
      request's data_sent and data_len before the payload network send, and
      afterwards consult those recorded counters to determine whether the
      request iterator should be advanced.
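
      As a rough sketch of that idea (simplified, self-contained C, not the
      driver source; do_network_send() is a hypothetical stand-in for the
      real socket send):

      #include <stddef.h>

      struct h2c_req {
              size_t data_sent;    /* bytes pushed to the network so far */
              size_t data_len;     /* total payload length of the request */
      };

      /* Hypothetical stand-in for the real socket send; returns bytes sent. */
      static size_t do_network_send(struct h2c_req *req, size_t len)
      {
              (void)req;
              return len;
      }

      static void send_payload(struct h2c_req *req, size_t len)
      {
              /*
               * Snapshot the counters BEFORE sending: once the last byte is
               * on the wire, a completion may recycle *req for a new request.
               */
              size_t sent_before = req->data_sent;
              size_t total_len   = req->data_len;
              size_t ret         = do_network_send(req, len);

              /* Decide from the snapshots, never from *req, whether this was
               * the last payload send. */
              if (sent_before + ret < total_len)
                      req->data_sent += ret;  /* safe: request cannot complete yet */
              /* else: the last send is done, do not touch *req again */
      }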
      
      Fixes: e371af03 ("nvme-tcp: fix incorrect h2cdata pdu offset accounting")
      Reported-by: Keith Busch <kbusch@kernel.org>
      Cc: stable@vger.kernel.org # 5.10+
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  6. 21 Oct 2021, 1 commit
  7. 20 Oct 2021, 1 commit
  8. 19 Oct 2021, 1 commit
  9. 21 Sep 2021, 1 commit
  10. 14 Sep 2021, 1 commit
  11. 06 Sep 2021, 1 commit
    • nvme-tcp: Do not reset transport on data digest errors · 1ba2e507
      Committed by Daniel Wagner
      The spec says
      
        7.4.6.1 Digest Error handling
      
        When a host detects a data digest error in a C2HData PDU, that host
        shall continue processing C2HData PDUs associated with the command and
        when the command processing has completed, if a successful status was
        returned by the controller, the host shall fail the command with a
        non-fatal transport error.
      
      Currently the transport is reset when a data digest error is
      detected. Instead, when a digest error is detected, mark the final
      status as NVME_SC_DATA_XFER_ERROR and let the upper layer handle
      the error.
      
      In order to keep track of the final result, maintain a status field in
      the nvme_tcp_request object and use it to overwrite the completion
      queue status (which might be successful even though a digest error has
      been detected) when completing the request.
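
      The scheme can be sketched like this (a stand-alone simplification,
      not the driver code; the status value follows the NVMe base spec's
      Data Transfer Error code):

      #include <stdint.h>
      #include <stdio.h>

      #define NVME_SC_SUCCESS         0x0
      #define NVME_SC_DATA_XFER_ERROR 0x4  /* "Data Transfer Error" */

      struct tcp_req {
              uint16_t status;             /* latched per-request result */
      };

      /* On a C2HData data digest failure: remember it, keep processing,
       * and do not reset the association. */
      static void note_data_digest_error(struct tcp_req *req)
      {
              req->status = NVME_SC_DATA_XFER_ERROR;
      }

      /* At completion time the latched status overrides a "successful" CQE. */
      static uint16_t final_status(const struct tcp_req *req, uint16_t cqe_status)
      {
              return req->status ? req->status : cqe_status;
      }

      int main(void)
      {
              struct tcp_req req = { .status = NVME_SC_SUCCESS };

              note_data_digest_error(&req);
              printf("completing with status 0x%x\n",
                     final_status(&req, NVME_SC_SUCCESS));
              return 0;
      }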
      Signed-off-by: Daniel Wagner <dwagner@suse.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  12. 16 Aug 2021, 4 commits
  13. 13 Jul 2021, 1 commit
  14. 01 Jul 2021, 1 commit
  15. 17 Jun 2021, 1 commit
  16. 16 Jun 2021, 1 commit
  17. 03 Jun 2021, 1 commit
    • nvme-tcp: allow selecting the network interface for connections · 3ede8f72
      Committed by Martin Belanger
      In our application, we need a way to force TCP connections to go out a
      specific IP interface instead of letting Linux select the interface
      based on the routing tables.
      
      Add the 'host-iface' option to allow specifying the interface to use.
      When the option host-iface is specified, the driver uses the specified
      interface to set the option SO_BINDTODEVICE on the TCP socket before
      connecting.
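
      For reference, the equivalent userspace construct looks roughly like
      the following (the driver performs the in-kernel counterpart before
      connecting; the interface name, address, and port here are
      placeholders, and SO_BINDTODEVICE normally requires CAP_NET_RAW):

      #include <arpa/inet.h>
      #include <netinet/in.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <unistd.h>

      int main(void)
      {
              int fd = socket(AF_INET, SOCK_STREAM, 0);
              const char ifname[] = "enp0s8";
              struct sockaddr_in dst = {
                      .sin_family = AF_INET,
                      .sin_port   = htons(4420),  /* NVMe/TCP well-known port */
              };

              /* Force this connection out of enp0s8 regardless of routing. */
              if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
                             ifname, strlen(ifname)) < 0)
                      perror("SO_BINDTODEVICE");

              inet_pton(AF_INET, "192.168.56.200", &dst.sin_addr);
              if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
                      perror("connect");

              close(fd);
              return 0;
      }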
      
      This new option is needed in addition to the existing host-traddr for
      the following reasons:
      
      Specifying an IP interface by its associated IP address is less
      intuitive than specifying the actual interface name and, in some cases,
      simply doesn't work. That's because the association between interfaces
      and IP addresses is not predictable. IP addresses can be changed or can
      change by themselves over time (e.g. DHCP). Interface names are
      predictable [1] and will persist over time. Consider the following
      configuration.
      
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state ...
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 100.0.0.100/24 scope global lo
             valid_lft forever preferred_lft forever
      2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
          link/ether 08:00:27:21:65:ec brd ff:ff:ff:ff:ff:ff
          inet 100.0.0.100/24 scope global enp0s3
             valid_lft forever preferred_lft forever
      3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
          link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
          inet 100.0.0.100/24 scope global enp0s8
             valid_lft forever preferred_lft forever
      
      The above is a VM that I configured with the same IP address
      (100.0.0.100) on all interfaces. Doing a reverse lookup to identify the
      unique interface associated with 100.0.0.100 does not work here. And
      this is why the option host_iface is required. I understand that the
      above config does not represent a standard host system, but I'm using
      this to prove a point: "We can never know how users will configure
      their systems". By the way, the above configuration is perfectly fine
      as far as Linux is concerned.
      
      The current TCP implementation for host_traddr performs a
      bind()-before-connect(). This is a common construct to set the source
      IP address on a TCP socket before connecting. This has no effect on how
      Linux selects the interface for the connection. That's because Linux
      uses the Weak End System model as described in RFC1122 [2]. On the other
      hand, setting the Source IP Address has benefits and should be supported
      by linux-nvme. In fact, setting the Source IP Address is a mandatory
      FedGov requirement (e.g. connection to a RADIUS/TACACS+ server).
      Consider the following configuration.
      
      $ ip addr list dev enp0s8
      3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
          link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
          inet 192.168.56.101/24 brd 192.168.56.255 scope global enp0s8
             valid_lft 426sec preferred_lft 426sec
          inet 192.168.56.102/24 scope global secondary enp0s8
             valid_lft forever preferred_lft forever
          inet 192.168.56.103/24 scope global secondary enp0s8
             valid_lft forever preferred_lft forever
          inet 192.168.56.104/24 scope global secondary enp0s8
             valid_lft forever preferred_lft forever
      
      Here we can see that several addresses are associated with interface
      enp0s8. By default, Linux always selects the default IP address,
      192.168.56.101, as the source address when connecting over interface
      enp0s8. Some users, however, want the ability to specify a different
      source address (e.g., 192.168.56.102, 192.168.56.103, ...). The option
      host_traddr can be used as-is to perform this function.
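
      A userspace sketch of that bind()-before-connect() construct
      (illustrative only; the source address used in the usage note is the
      secondary address from the listing above):

      #include <arpa/inet.h>
      #include <netinet/in.h>
      #include <sys/socket.h>
      #include <unistd.h>

      /* Pin the source address, then connect. Interface selection still
       * follows the routing tables (Weak End System model). */
      int connect_from(const char *src_ip, const struct sockaddr_in *dst)
      {
              int fd = socket(AF_INET, SOCK_STREAM, 0);
              struct sockaddr_in src = { .sin_family = AF_INET, .sin_port = 0 };

              if (fd < 0)
                      return -1;
              inet_pton(AF_INET, src_ip, &src.sin_addr);
              if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0 ||
                  connect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0) {
                      close(fd);
                      return -1;
              }
              return fd;
      }

      Calling, say, connect_from("192.168.56.102", &dst) makes the kernel use
      the secondary address as the source instead of the default
      192.168.56.101.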
      
      In conclusion, I believe that we need two options for TCP connections:
      one to specify an interface (host-iface) and one to set the source
      address (host-traddr). Users should be allowed to use one or the other,
      or both, or neither. Of course, the documentation for host_traddr will
      need some clarification. It should state that when used for TCP
      connections, this option only sets the source address. And the
      documentation for host_iface should say that this option is only
      available for TCP connections.
      
      References:
      [1] https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/
      [2] https://tools.ietf.org/html/rfc1122
      
      Tested both IPv4 and IPv6 connections.
      Signed-off-by: Martin Belanger <martin.belanger@dell.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  18. 19 May 2021, 2 commits
    • nvme-tcp: rerun io_work if req_list is not empty · a0fdd141
      Committed by Keith Busch
      A possible race condition exists where a request to send data that is
      enqueued from nvme_tcp_handle_r2t() will not be observed by
      nvme_tcp_send_all() if the latter happens to be running already. The
      driver relies on io_work to send the enqueued request when it runs
      again, but the concurrently running nvme_tcp_send_all() may not have
      released the send_mutex at that time. If no future commands are
      enqueued to re-kick the io_work, the request will time out in the
      SEND_H2C state, resulting in a timeout error like:
      
        nvme nvme0: queue 1: timeout request 0x3 type 6
      
      Ensure the io_work continues to run as long as the req_list is not empty.
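
      The core of that added condition can be sketched as follows (a
      reduced, stand-alone illustration, not the driver function; the two
      booleans stand in for a failed trylock on send_mutex and for an empty
      req_list):

      #include <stdbool.h>

      /* Returns true when io_work must reschedule itself instead of going
       * idle at the end of a pass. */
      static bool must_requeue_io_work(bool got_send_mutex, bool req_list_empty)
      {
              /*
               * If send_mutex could not be taken, the queue_rq/R2T context
               * holding it may already have missed a request added to
               * req_list, so a non-empty list counts as pending work.
               */
              return !got_send_mutex && !req_list_empty;
      }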
      
      Fixes: db5ad6b7 ("nvme-tcp: try to send request in queue_rq context")
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-tcp: fix possible use-after-completion · 825619b0
      Committed by Sagi Grimberg
      Commit db5ad6b7 ("nvme-tcp: try to send request in queue_rq
      context") added a second context that may perform a network send.
      This means that now RX and TX are not serialized in nvme_tcp_io_work
      and can run concurrently.
      
      While there is correct mutual exclusion in the TX path (where the
      send_mutex protects the queue's socket send activity), RX activity,
      and more specifically request completion, may run concurrently with
      it.

      This means we must guarantee that request state tied to its lifetime,
      such as the bytes-sent accounting, is not accessed once a completion
      may have arrived back (and been processed).

      The race may trigger when a request completion arrives, is processed,
      _and_ the request is reused as a fresh new request, exactly in the
      (relatively short) window between the last data payload send and the
      advancing of the request iov_iter.
      
      Consider the following race:
      1. 16K write request is queued
      2. The nvme command and the data are sent to the controller (in-capsule
         or solicited by r2t)
      3. After the last payload is sent but before the req.iter is advanced,
         the controller sends back a completion.
      4. The completion is processed, the request is completed, and it is
         reused to transfer a new request (write or read)
      5. The new request is queued, and the driver resets the request
         parameters (nvme_tcp_setup_cmd_pdu).
      6. Now the context in (2) resumes execution and advances the req.iter
      
      ==> use-after-completion as this is already a new request.
      
      Fix this by making sure the request is not advanced after the last
      data payload send, knowing that a completion may have arrived already.
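
      In outline, the send path decides up front whether this is the final
      payload send and skips any request-state update afterwards (a
      simplified stand-alone sketch, not the driver diff; sock_send_stub()
      is a hypothetical placeholder, and the sketch under the 26 Oct 2021
      entry above shows the later refinement of the same rule for
      multi-R2T requests):

      #include <stdbool.h>
      #include <stddef.h>

      struct send_ctx {
              size_t remaining;    /* payload bytes still to be sent */
      };

      /* Hypothetical stand-in for the socket send; returns bytes sent. */
      static size_t sock_send_stub(struct send_ctx *ctx, size_t len)
      {
              (void)ctx;
              return len;
      }

      static void try_send_data(struct send_ctx *ctx, size_t len)
      {
              /* Decide BEFORE sending whether this is the last payload send. */
              bool last = (len == ctx->remaining);
              size_t ret = sock_send_stub(ctx, len);

              if (!last || ret < len)
                      ctx->remaining -= ret;  /* request cannot have completed */
              /*
               * After a fully successful last send, a completion may already
               * have recycled the request; do not advance or touch it here.
               */
      }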
      
      An alternative solution would have been to delay the request completion
      or state change by waiting for reference counting on the TX path, but
      besides adding atomic operations to the hot path, it may present
      challenges in multi-stage R2T scenarios where an r2t handler needs to
      be deferred to an asynchronous execution.
      Reported-by: Narayan Ayalasomayajula <narayan.ayalasomayajula@wdc.com>
      Tested-by: Anil Mishra <anil.mishra@wdc.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Cc: stable@vger.kernel.org # v5.8+
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  19. 04 May 2021, 1 commit
  20. 03 Apr 2021, 4 commits
  21. 18 Mar 2021, 4 commits
  22. 11 Feb 2021, 1 commit
    • nvme-tcp: fix crash triggered with a dataless request submission · e11e5116
      Committed by Sagi Grimberg
      Write-zeroes has a bio, but does not have any data buffers associated
      with it. Hence we should not initialize the request iter for it, as
      that attempts to reference the bi_io_vec and crashes (a sketch of the
      resulting guard follows the oops trace below).
      --
       run blktests nvme/012 at 2021-02-05 21:53:34
       BUG: kernel NULL pointer dereference, address: 0000000000000008
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] SMP NOPTI
       CPU: 15 PID: 12069 Comm: kworker/15:2H Tainted: G S        I       5.11.0-rc6+ #1
       Hardware name: Dell Inc. PowerEdge R640/06NR82, BIOS 2.10.0 11/12/2020
       Workqueue: kblockd blk_mq_run_work_fn
       RIP: 0010:nvme_tcp_init_iter+0x7d/0xd0 [nvme_tcp]
       RSP: 0018:ffffbd084447bd18 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffffa0bba9f3ce80 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000002000000
       RBP: ffffa0ba8ac6fec0 R08: 0000000002000000 R09: 0000000000000000
       R10: 0000000002800809 R11: 0000000000000000 R12: 0000000000000000
       R13: ffffa0bba9f3cf90 R14: 0000000000000000 R15: 0000000000000000
       FS:  0000000000000000(0000) GS:ffffa0c9ff9c0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000008 CR3: 00000001c9c6c005 CR4: 00000000007706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       PKRU: 55555554
       Call Trace:
        nvme_tcp_queue_rq+0xef/0x330 [nvme_tcp]
        blk_mq_dispatch_rq_list+0x11c/0x7c0
        ? blk_mq_flush_busy_ctxs+0xf6/0x110
        __blk_mq_sched_dispatch_requests+0x12b/0x170
        blk_mq_sched_dispatch_requests+0x30/0x60
        __blk_mq_run_hw_queue+0x2b/0x60
        process_one_work+0x1cb/0x360
        ? process_one_work+0x360/0x360
        worker_thread+0x30/0x370
        ? process_one_work+0x360/0x360
        kthread+0x116/0x130
        ? kthread_park+0x80/0x80
        ret_from_fork+0x1f/0x30
      --
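
      A sketch of the resulting guard (simplified and stand-alone; the field
      names only mirror the block layer's, this is not the actual diff):

      #include <stdbool.h>

      struct rq_view {
              void *bio;                     /* non-NULL even for write-zeroes */
              unsigned int nr_phys_segments; /* 0 when no data is transferred */
      };

      /* Only requests that actually carry payload get a data iterator. */
      static bool rq_has_payload(const struct rq_view *rq)
      {
              /* A bio alone is not enough: write-zeroes has one without any
               * bi_io_vec data to point an iov_iter at. */
              return rq->bio && rq->nr_phys_segments != 0;
      }

      static void setup_request_data(struct rq_view *rq)
      {
              if (rq_has_payload(rq)) {
                      /* ... initialize the iov_iter over the bio here ... */
              }
              /* Dataless submissions (e.g. write-zeroes) skip iterator setup. */
      }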
      
      Fixes: cb9b870f ("nvme-tcp: fix wrong setting of request iov_iter")
      Reported-by: Yi Zhang <yi.zhang@redhat.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Tested-by: Yi Zhang <yi.zhang@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  23. 02 Feb 2021, 5 commits