1. 16 Aug 2021: 4 commits
  2. 13 Jul 2021: 1 commit
  3. 01 Jul 2021: 1 commit
  4. 17 Jun 2021: 1 commit
  5. 16 Jun 2021: 1 commit
  6. 03 Jun 2021: 1 commit
    • nvme-tcp: allow selecting the network interface for connections · 3ede8f72
      Committed by Martin Belanger
      In our application, we need a way to force TCP connections to go out a
      specific IP interface instead of letting Linux select the interface
      based on the routing tables.
      
      Add the 'host-iface' option to allow specifying the interface to use.
      When the option host-iface is specified, the driver uses the specified
      interface to set the option SO_BINDTODEVICE on the TCP socket before
      connecting.
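
      For illustration, here is a minimal userspace sketch of the same
      construct (the driver applies the equivalent to its kernel socket);
      the interface name and destination address are placeholders:

      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <arpa/inet.h>
      #include <sys/socket.h>

      int main(void)
      {
              const char *iface = "enp0s3";   /* placeholder name */
              int fd = socket(AF_INET, SOCK_STREAM, 0);

              /* Force all egress traffic for this socket out of 'iface',
               * regardless of the routing tables (needs CAP_NET_RAW). */
              if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
                             iface, strlen(iface)) < 0) {
                      perror("SO_BINDTODEVICE");
                      return 1;
              }

              struct sockaddr_in dst = {
                      .sin_family = AF_INET,
                      .sin_port = htons(4420), /* NVMe/TCP default port */
              };
              inet_pton(AF_INET, "192.168.56.1", &dst.sin_addr);
              if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
                      perror("connect");
              close(fd);
              return 0;
      }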
      
      This new option is needed in addition to the existing host-traddr for
      the following reasons:
      
      Specifying an IP interface by its associated IP address is less
      intuitive than specifying the actual interface name and, in some cases,
      simply doesn't work. That's because the association between interfaces
      and IP addresses is not predictable. IP addresses can be changed or can
      change by themselves over time (e.g. DHCP). Interface names are
      predictable [1] and will persist over time. Consider the following
      configuration.
      
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state ...
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 100.0.0.100/24 scope global lo
             valid_lft forever preferred_lft forever
      2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
          link/ether 08:00:27:21:65:ec brd ff:ff:ff:ff:ff:ff
          inet 100.0.0.100/24 scope global enp0s3
             valid_lft forever preferred_lft forever
      3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
          link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
          inet 100.0.0.100/24 scope global enp0s8
             valid_lft forever preferred_lft forever
      
      The above is a VM that I configured with the same IP address
      (100.0.0.100) on all interfaces. Doing a reverse lookup to identify the
      unique interface associated with 100.0.0.100 does not work here. And
      this is why the option host_iface is required. I understand that the
      above config does not represent a standard host system, but I'm using
      this to prove a point: "We can never know how users will configure
      their systems". By te way, The above configuration is perfectly fine
      by Linux.
      
      The current TCP implementation for host_traddr performs a
      bind()-before-connect(). This is a common construct to set the source
      IP address on a TCP socket before connecting. This has no effect on how
      Linux selects the interface for the connection. That's because Linux
      uses the Weak End System model as described in RFC1122 [2]. On the other
      hand, setting the Source IP Address has benefits and should be supported
      by linux-nvme. In fact, setting the Source IP Address is a mandatory
      FedGov requirement (e.g. connection to a RADIUS/TACACS+ server).
      Consider the following configuration.
      
      $ ip addr list dev enp0s8
      3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
          link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
          inet 192.168.56.101/24 brd 192.168.56.255 scope global enp0s8
             valid_lft 426sec preferred_lft 426sec
          inet 192.168.56.102/24 scope global secondary enp0s8
             valid_lft forever preferred_lft forever
          inet 192.168.56.103/24 scope global secondary enp0s8
             valid_lft forever preferred_lft forever
          inet 192.168.56.104/24 scope global secondary enp0s8
             valid_lft forever preferred_lft forever
      
      Here we can see that several addresses are associated with interface
      enp0s8. By default, Linux always selects the default IP address,
      192.168.56.101, as the source address when connecting over interface
      enp0s8. Some users, however, want the ability to specify a different
      source address (e.g., 192.168.56.102, 192.168.56.103, ...). The option
      host_traddr can be used as-is to perform this function.
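
      For illustration, a minimal userspace sketch of
      bind()-before-connect() using one of the secondary addresses above
      (the destination address is a placeholder); note that this pins the
      source address only, not the egress interface:

      #include <stdio.h>
      #include <unistd.h>
      #include <arpa/inet.h>
      #include <sys/socket.h>

      int main(void)
      {
              int fd = socket(AF_INET, SOCK_STREAM, 0);

              /* Pin the source address; leaving sin_port at 0 lets the
               * kernel choose an ephemeral source port. */
              struct sockaddr_in src = { .sin_family = AF_INET };
              inet_pton(AF_INET, "192.168.56.102", &src.sin_addr);
              if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0) {
                      perror("bind");
                      return 1;
              }

              struct sockaddr_in dst = {
                      .sin_family = AF_INET,
                      .sin_port = htons(4420),
              };
              inet_pton(AF_INET, "192.168.56.1", &dst.sin_addr);
              if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
                      perror("connect");
              close(fd);
              return 0;
      }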
      
      In conclusion, I believe that we need 2 options for TCP connections.
      One that can be used to specify an interface (host-iface). And one that
      can be used to set the source address (host-traddr). Users should be
      allowed to use one or the other, or both, or none. Of course, the
      documentation for host_traddr will need some clarification. It
      should state that, when used for TCP connections, this option only
      sets the source address. And the documentation for host_iface
      should say that this option is only available for TCP connections.
      
      References:
      [1] https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/
      [2] https://tools.ietf.org/html/rfc1122
      
      Tested both IPv4 and IPv6 connections.
      Signed-off-by: Martin Belanger <martin.belanger@dell.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  7. 19 May 2021: 2 commits
    • nvme-tcp: rerun io_work if req_list is not empty · a0fdd141
      Committed by Keith Busch
      A possible race condition exists where a request to send data that
      is enqueued from nvme_tcp_handle_r2t() will not be observed by
      nvme_tcp_send_all() if the latter happens to be running already.
      The driver relies on io_work to send the enqueued request when it
      runs again, but the concurrently running nvme_tcp_send_all() may
      not have released the send_mutex at that time. If no future
      commands are enqueued to re-kick io_work, the request will time
      out in the SEND_H2C state, resulting in a timeout error like:
      
        nvme nvme0: queue 1: timeout request 0x3 type 6
      
      Ensure the io_work continues to run as long as the req_list is not empty.
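
      A hedged sketch of the shape of the fix (simplified; the actual
      patch may differ in detail):

      /* Sketch only: at the tail of nvme_tcp_io_work().  Even when the
       * send_mutex could not be taken because another context holds it,
       * re-arm io_work whenever the lock-less req_list still has
       * entries, so a request enqueued by nvme_tcp_handle_r2t() is
       * never stranded. */
      if (pending || !llist_empty(&queue->req_list))
              queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);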
      
      Fixes: db5ad6b7 ("nvme-tcp: try to send request in queue_rq context")
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-tcp: fix possible use-after-completion · 825619b0
      Committed by Sagi Grimberg
      Commit db5ad6b7 ("nvme-tcp: try to send request in queue_rq
      context") added a second context that may perform a network send.
      This means that now RX and TX are not serialized in nvme_tcp_io_work
      and can run concurrently.
      
      While there is correct mutual exclusion in the TX path (where the
      send_mutex protects the queue socket send activity), RX activity,
      and more specifically request completion, may run concurrently.

      This means we must guarantee that any mutation of request state
      related to its lifetime (such as bytes sent) is not performed once
      a completion may have arrived back and been processed.
      
      The race may trigger when a request completion arrives, is
      processed _and_ the request is reused as a fresh new request,
      exactly in the (relatively short) window between the last data
      payload send and the advance of the request iov_iter.
      
      Consider the following race:
      1. 16K write request is queued
      2. The nvme command and the data is sent to the controller (in-capsule
         or solicited by r2t)
      3. After the last payload is sent but before the req.iter is advanced,
         the controller sends back a completion.
      4. The completion is processed, the request is completed, and reused
         to transfer a new request (write or read)
      5. The new request is queued, and the driver resets the request
         parameters (nvme_tcp_setup_cmd_pdu).
      6. Now the context in (2) resumes execution and advances the req.iter
      
      ==> use-after-completion as this is already a new request.
      
      Fix this by making sure the request is not advanced after the last
      data payload send, knowing that a completion may have arrived already.
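
      A hedged sketch of the idea (simplified; the sent/length counters
      are read into locals before the send, since the request must not
      be touched once its last byte is out):

      /* Sketch only: TX path, right after sending 'ret' bytes. */
      if (data_sent + ret < data_len)
              nvme_tcp_advance_req(req, ret); /* more payload: safe */
      /* else: the last payload was just sent and a completion may
       * already have recycled this request - do not touch it again. */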
      
      An alternative solution would have been to delay the request
      completion or state change by reference counting the TX path, but
      besides adding atomic operations to the hot path, it may present
      challenges in multi-stage R2T scenarios where an r2t handler needs
      to be deferred to async execution.
      Reported-by: Narayan Ayalasomayajula <narayan.ayalasomayajula@wdc.com>
      Tested-by: Anil Mishra <anil.mishra@wdc.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Cc: stable@vger.kernel.org # v5.8+
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  8. 04 May 2021: 1 commit
  9. 03 Apr 2021: 4 commits
  10. 18 Mar 2021: 4 commits
  11. 11 Feb 2021: 1 commit
    • nvme-tcp: fix crash triggered with a dataless request submission · e11e5116
      Committed by Sagi Grimberg
      write-zeroes has a bio, but does not have any data buffers
      associated with it, hence we should not initialize the request
      iter for it (which attempts to reference the bi_io_vec and
      crashes).
      --
       run blktests nvme/012 at 2021-02-05 21:53:34
       BUG: kernel NULL pointer dereference, address: 0000000000000008
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] SMP NOPTI
       CPU: 15 PID: 12069 Comm: kworker/15:2H Tainted: G S        I       5.11.0-rc6+ #1
       Hardware name: Dell Inc. PowerEdge R640/06NR82, BIOS 2.10.0 11/12/2020
       Workqueue: kblockd blk_mq_run_work_fn
       RIP: 0010:nvme_tcp_init_iter+0x7d/0xd0 [nvme_tcp]
       RSP: 0018:ffffbd084447bd18 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffffa0bba9f3ce80 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000002000000
       RBP: ffffa0ba8ac6fec0 R08: 0000000002000000 R09: 0000000000000000
       R10: 0000000002800809 R11: 0000000000000000 R12: 0000000000000000
       R13: ffffa0bba9f3cf90 R14: 0000000000000000 R15: 0000000000000000
       FS:  0000000000000000(0000) GS:ffffa0c9ff9c0000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000008 CR3: 00000001c9c6c005 CR4: 00000000007706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       PKRU: 55555554
       Call Trace:
        nvme_tcp_queue_rq+0xef/0x330 [nvme_tcp]
        blk_mq_dispatch_rq_list+0x11c/0x7c0
        ? blk_mq_flush_busy_ctxs+0xf6/0x110
        __blk_mq_sched_dispatch_requests+0x12b/0x170
        blk_mq_sched_dispatch_requests+0x30/0x60
        __blk_mq_run_hw_queue+0x2b/0x60
        process_one_work+0x1cb/0x360
        ? process_one_work+0x360/0x360
        worker_thread+0x30/0x370
        ? process_one_work+0x360/0x360
        kthread+0x116/0x130
        ? kthread_park+0x80/0x80
        ret_from_fork+0x1f/0x30
      --
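
      A hedged sketch of the guard (the exact upstream condition may
      differ):

      /* Sketch only: when setting up the command.  A write-zeroes
       * request has a bio but no data buffers, so only build the
       * iov_iter when the request carries actual payload. */
      if (rq->bio && blk_rq_nr_phys_segments(rq))
              nvme_tcp_init_iter(req, rq_data_dir(rq));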
      
      Fixes: cb9b870f ("nvme-tcp: fix wrong setting of request iov_iter")
      Reported-by: Yi Zhang <yi.zhang@redhat.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Tested-by: Yi Zhang <yi.zhang@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  12. 02 Feb 2021: 5 commits
  13. 19 Jan 2021: 1 commit
    • nvme-tcp: avoid request double completion for concurrent nvme_tcp_timeout · 9ebbfe49
      Committed by Chao Leng
      Each namespace has a request queue. If requests take a long time
      to complete, multiple request queues may have timed-out requests
      at the same time, and nvme_tcp_timeout will execute concurrently.
      Requests from different request queues may be queued on the same
      tcp queue, so multiple calls to nvme_tcp_timeout may invoke
      nvme_tcp_stop_queue at the same time. The first nvme_tcp_stop_queue
      clears NVME_TCP_Q_LIVE and continues stopping the tcp queue
      (cancelling io_work), but the others see that NVME_TCP_Q_LIVE is
      already cleared and directly complete the requests. Completing a
      request before io_work is fully cancelled may lead to a
      use-after-free.

      Add a mutex to serialize nvme_tcp_stop_queue.
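
      A hedged sketch of the serialization (names follow the upstream
      code, but treat the details as illustrative):

      /* Sketch only: a per-queue mutex makes stopping idempotent; the
       * first caller tears the queue down, and later callers only see
       * NVME_TCP_Q_LIVE cleared once io_work is fully cancelled. */
      static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
      {
              struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
              struct nvme_tcp_queue *queue = &ctrl->queues[qid];

              mutex_lock(&queue->queue_lock);
              if (test_and_clear_bit(NVME_TCP_Q_LIVE, &queue->flags))
                      __nvme_tcp_stop_queue(queue);
              mutex_unlock(&queue->queue_lock);
      }
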
      Signed-off-by: Chao Leng <lengchao@huawei.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  14. 15 Jan 2021: 2 commits
  15. 06 Jan 2021: 1 commit
    • nvme-tcp: Fix possible race of io_work and direct send · 5c11f7d9
      Committed by Sagi Grimberg
      We may send a request (with or without its data) from two paths:
      
        1. From our I/O context nvme_tcp_io_work which is triggered from:
          - queue_rq
          - r2t reception
          - socket data_ready and write_space callbacks
        2. Directly from queue_rq if the send_list is empty (because we want to
           save the context switch associated with scheduling our io_work).
      
      However, given that we now have the send_mutex, we may run into a
      race condition where none of these contexts will send the pending
      payload to the controller. Both the io_work send path and the
      queue_rq send path opportunistically attempt to acquire the
      send_mutex; however, queue_rq only attempts to send a single
      request, and if the io_work context fails to acquire the
      send_mutex it will complete without rescheduling itself.
      
      The race can trigger with the following sequence:
      
        1. queue_rq sends a request (no in-capsule data) and blocks
        2. RX path receives r2t - prepares data PDU to send, adds h2cdata PDU
           to the send_list and schedules io_work
        3. io_work triggers and cannot acquire the send_mutex - because of (1),
           it ends without rescheduling itself
        4. queue_rq finishes its send and returns, leaving the h2cdata PDU
           on the send_list
      
      ==> no context will send the h2cdata - timeout.
      
      Fix this by having queue_rq send as much as it can from the
      send_list, such that if anything is left over, it's because the
      socket buffer is full and the socket write_space callback will
      trigger, thus guaranteeing that a context will be scheduled to
      send the h2cdata PDU.
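
      A hedged sketch of the queue_rq side of the fix (nvme_tcp_send_all()
      exists upstream; its body here is simplified):

      /* Sketch only: called from queue_rq with send_mutex held.  Keep
       * sending until the send list is drained or the socket buffer
       * fills up; in the latter case the write_space callback will
       * re-arm io_work once there is room again. */
      static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
      {
              int ret;

              do {
                      ret = nvme_tcp_try_send(queue);
              } while (ret > 0);
      }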
      
      Fixes: db5ad6b7 ("nvme-tcp: try to send request in queue_rq context")
      Reported-by: Potnuri Bharat Teja <bharat@chelsio.com>
      Reported-by: Samuel Jones <sjones@kalrayinc.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Tested-by: Potnuri Bharat Teja <bharat@chelsio.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  16. 02 Dec 2020: 1 commit
  17. 03 Nov 2020: 2 commits
  18. 03 Oct 2020: 1 commit
    • nvme-tcp: check page by sendpage_ok() before calling kernel_sendpage() · 7d4194ab
      Committed by Coly Li
      Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage()
      to send slab pages. But pages allocated by __get_free_pages()
      without __GFP_COMP, which also have a refcount of 0, are still
      sent to the remote end by kernel_sendpage(), which is problematic.
      
      The newly introduced helper sendpage_ok() checks both the PageSlab
      flag and the page_count counter, and returns true if the page is
      OK to be sent by kernel_sendpage().
      
      This patch fixes the page check in nvme_tcp_try_send_data() with
      sendpage_ok(): if sendpage_ok() returns true, the page is sent by
      kernel_sendpage(); otherwise sock_no_sendpage() handles it.
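
      For reference, the helper and the call-site pattern look roughly
      like this (sketch; see include/linux/net.h for the authoritative
      definition):

      /* A page is safe for kernel_sendpage() only if it is not a slab
       * page and its refcount is live. */
      static inline bool sendpage_ok(struct page *page)
      {
              return !PageSlab(page) && page_count(page) >= 1;
      }

      /* Call-site pattern in nvme_tcp_try_send_data() (sketch): */
      if (sendpage_ok(page))
              ret = kernel_sendpage(queue->sock, page, offset, len, flags);
      else
              ret = sock_no_sendpage(queue->sock, page, offset, len, flags);
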
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Vlastimil Babka <vbabka@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 09 Sep 2020: 1 commit
  20. 29 Aug 2020: 3 commits
    • nvme-tcp: fix reset hang if controller died in the middle of a reset · e5c01f4f
      Committed by Sagi Grimberg
      If the controller becomes unresponsive in the middle of a reset,
      we will hang because we are waiting for the freeze to complete,
      but that cannot happen since there are inflight commands holding
      the q_usage_counter, and we can't blindly fail requests that time
      out.

      So wait for the queue freeze with a timeout, and if it cannot
      complete before unfreezing, fail and have the error handling take
      care of how to proceed (either schedule a reconnect or remove the
      controller).
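
      A hedged sketch of the change in the reset path
      (nvme_wait_freeze_timeout() is the bounded variant of
      nvme_wait_freeze(); the error label is illustrative):

      /* Sketch only: bound the freeze wait so a dead controller cannot
       * wedge the reset forever. */
      if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
              /* Inflight commands are stuck holding q_usage_counter:
               * fail the reset and let error handling reconnect or
               * remove the controller. */
              ret = -ENODEV;
              goto out_wait_freeze_timed_out; /* illustrative label */
      }
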
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme-tcp: fix timeout handler · 236187c4
      Committed by Sagi Grimberg
      When a request times out in a LIVE state, we simply trigger error
      recovery and let the error recovery handle the request
      cancellation. However, when a request times out in a non-LIVE
      state, we make sure to complete it immediately, as it might block
      controller setup or teardown and prevent forward progress.
      
      However, tearing down the entire set of I/O and admin queues
      causes a freeze/unfreeze imbalance (q->mq_freeze_depth) and is
      really overkill for what we actually need, which is just to fence
      a controller teardown that may be running, stop the queue, and
      cancel the request if it is not already completed.
      
      Now that we have the controller teardown_lock, we can safely
      serialize request cancellation. This addresses a hang caused by
      calling an extra queue freeze on controller namespaces, which
      caused unfreeze to not complete correctly.
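
      A hedged sketch of the narrower handling (names follow the
      upstream code; treat the details as illustrative):

      /* Sketch only: for a request timing out in a non-LIVE state,
       * fence a concurrent teardown, stop just this queue, and cancel
       * the request if nothing else completed it - no queue freeze. */
      static void nvme_tcp_complete_timed_out(struct request *rq)
      {
              struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
              struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;

              mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
              nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
              if (!blk_mq_request_completed(rq)) {
                      nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
                      blk_mq_complete_request(rq);
              }
              mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
      }
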
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme-tcp: serialize controller teardown sequences · d4d61470
      Committed by Sagi Grimberg
      In the timeout handler we may need to complete a request because the
      request that timed out may be an I/O that is a part of a serial sequence
      of controller teardown or initialization. In order to complete the
      request, we need to fence any other context that may compete with us
      and complete the request that is timing out.
      
      In this case, we could have a potential double completion in case
      a hard-irq or a different competing context triggered error recovery
      and is running inflight request cancellation concurrently with the
      timeout handler.
      
      Protect using a ctrl teardown_lock to serialize contexts that may
      complete a cancelled request due to error recovery or a reset.
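
      A hedged sketch of the serialization (simplified; the lock simply
      brackets every teardown sequence):

      /* Sketch only: every teardown path takes the same controller-wide
       * mutex, so any context that may complete a cancelled request is
       * serialized against the timeout handler. */
      mutex_lock(&ctrl->teardown_lock);
      /* ... stop queues and cancel inflight requests ... */
      mutex_unlock(&ctrl->teardown_lock);
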
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  21. 24 Aug 2020: 1 commit
  22. 22 Aug 2020: 1 commit