1. 02 Feb, 2021 (2 commits)
  2. 19 Jan, 2021 (1 commit)
    • nvme-tcp: avoid request double completion for concurrent nvme_tcp_timeout · 9ebbfe49
      Committed by Chao Leng
      Each namespace has its own request queue. If requests take a long time to
      complete, multiple request queues may have timed-out requests at the same
      time, so nvme_tcp_timeout can run concurrently. Requests from different
      request queues may be queued on the same TCP queue, so several
      nvme_tcp_timeout instances may call nvme_tcp_stop_queue at the same time.
      The first nvme_tcp_stop_queue clears NVME_TCP_Q_LIVE and proceeds to stop
      the TCP queue (cancelling io_work), but the others see NVME_TCP_Q_LIVE
      already cleared and directly complete the requests. Completing a request
      before io_work is fully cancelled can lead to a use-after-free.
      Add a mutex to serialize nvme_tcp_stop_queue.
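      A minimal sketch of that serialization, assuming a queue_lock mutex added
      to struct nvme_tcp_queue (names follow the upstream driver, details
      abbreviated):

          static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid)
          {
              struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
              struct nvme_tcp_queue *queue = &ctrl->queues[qid];

              /* only the first caller stops the queue; later callers wait
               * here until io_work cancellation has fully finished
               */
              mutex_lock(&queue->queue_lock);
              if (test_and_clear_bit(NVME_TCP_Q_LIVE, &queue->flags))
                  __nvme_tcp_stop_queue(queue);
              mutex_unlock(&queue->queue_lock);
          }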
      Signed-off-by: Chao Leng <lengchao@huawei.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  3. 15 Jan, 2021 (2 commits)
  4. 06 Jan, 2021 (1 commit)
    • nvme-tcp: Fix possible race of io_work and direct send · 5c11f7d9
      Committed by Sagi Grimberg
      We may send a request (with or without its data) from two paths:
      
        1. From our I/O context nvme_tcp_io_work which is triggered from:
          - queue_rq
          - r2t reception
          - socket data_ready and write_space callbacks
        2. Directly from queue_rq if the send_list is empty (because we want to
           save the context switch associated with scheduling our io_work).
      
      However, given that we now have the send_mutex, we may run into a race
      condition where none of these contexts sends the pending payload to the
      controller. Both the io_work send path and the queue_rq send path
      opportunistically try to acquire the send_mutex; however, queue_rq only
      attempts to send a single request, and if the io_work context fails to
      acquire the send_mutex it completes without rescheduling itself.
      
      The race can trigger with the following sequence:
      
        1. queue_rq sends the request (no in-capsule data) and blocks
        2. RX path receives r2t - prepares data PDU to send, adds h2cdata PDU
           to the send_list and schedules io_work
        3. io_work triggers and cannot acquire the send_mutex - because of (1),
           it ends without rescheduling itself
        4. queue_rq finishes the send and returns
      
      ==> no context will send the h2cdata - timeout.
      
      Fix this by having queue_rq send as much as it can from the send_list,
      so that if anything is left over, it is because the socket buffer is full
      and the socket write_space callback will trigger, guaranteeing that some
      context will be scheduled to send the h2cdata PDU.
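      A simplified sketch of the fix, modeled on the driver's direct-send path
      (helper names such as nvme_tcp_send_all follow the upstream code, but
      details like the more_requests accounting are omitted):

          /* drain the send_list as far as the socket allows */
          static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
          {
              int ret;

              do {
                  ret = nvme_tcp_try_send(queue);
              } while (ret > 0);
          }

          static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
                  bool sync, bool last)
          {
              struct nvme_tcp_queue *queue = req->queue;
              bool empty;

              empty = llist_add(&req->lentry, &queue->req_list) &&
                  list_empty(&queue->send_list) && !queue->request;

              /*
               * Take the direct-send path only if we are first in line and can
               * grab the send_mutex; anything we cannot push out now is picked
               * up later by io_work or the write_space callback.
               */
              if (queue->io_cpu == smp_processor_id() &&
                  sync && empty && mutex_trylock(&queue->send_mutex)) {
                  nvme_tcp_send_all(queue);
                  mutex_unlock(&queue->send_mutex);
              } else if (last) {
                  queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
              }
          }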
      
      Fixes: db5ad6b7 ("nvme-tcp: try to send request in queue_rq context")
      Reported-by: Potnuri Bharat Teja <bharat@chelsio.com>
      Reported-by: Samuel Jones <sjones@kalrayinc.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Tested-by: Potnuri Bharat Teja <bharat@chelsio.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  5. 02 Dec, 2020 (1 commit)
  6. 03 Nov, 2020 (2 commits)
  7. 03 Oct, 2020 (1 commit)
    • nvme-tcp: check page by sendpage_ok() before calling kernel_sendpage() · 7d4194ab
      Committed by Coly Li
      Currently nvme_tcp_try_send_data() avoids kernel_sendpage() for slab
      pages. But pages allocated by __get_free_pages() without __GFP_COMP,
      which can also have a refcount of 0, are still sent to the remote end
      with kernel_sendpage(), and this is problematic.

      The newly introduced helper sendpage_ok() checks both the PageSlab flag
      and the page_count counter, and returns true only if the page may be
      sent by kernel_sendpage().

      This patch fixes the page check in nvme_tcp_try_send_data() using
      sendpage_ok(): if sendpage_ok() returns true, the page is sent with
      kernel_sendpage(); otherwise sock_no_sendpage() handles it.
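      For reference, sendpage_ok() reduces to a slab-flag plus refcount check,
      and the send path then picks the transport accordingly (a sketch; the
      surrounding nvme_tcp_try_send_data() logic is omitted):

          /* include/linux/net.h */
          static inline bool sendpage_ok(struct page *page)
          {
              return !PageSlab(page) && page_count(page) >= 1;
          }

          /* simplified excerpt of the send decision in nvme_tcp_try_send_data() */
          if (sendpage_ok(page))
              ret = kernel_sendpage(queue->sock, page, offset, len, flags);
          else
              ret = sock_no_sendpage(queue->sock, page, offset, len, flags);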
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Vlastimil Babka <vbabka@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 09 Sep, 2020 (1 commit)
  9. 29 Aug, 2020 (3 commits)
    • nvme-tcp: fix reset hang if controller died in the middle of a reset · e5c01f4f
      Committed by Sagi Grimberg
      If the controller becomes unresponsive in the middle of a reset, we will
      hang waiting for the freeze to complete, but the freeze can never
      complete: inflight commands are still holding the q_usage_counter, and
      we cannot blindly fail requests that time out.

      So wait for the queue freeze with a timeout, and if it does not finish
      in time, fail the reset and let the error handling decide how to proceed
      (either schedule a reconnect or remove the controller).
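      A sketch of the idea in the I/O queue (re)configuration path;
      nvme_wait_freeze_timeout() is the nvme core helper that returns a
      non-positive value when the freeze did not finish in time (the
      error-label name below is illustrative):

          if (!new) {
              nvme_start_queues(ctrl);
              if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
                  /*
                   * If we timed out waiting for freeze we are likely to be
                   * stuck, so fail the reset and let the error handler
                   * decide whether to reconnect or remove the controller.
                   */
                  ret = -ENODEV;
                  goto out_wait_freeze_timed_out;
              }
              blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
              nvme_unfreeze(ctrl);
          }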
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme-tcp: fix timeout handler · 236187c4
      Committed by Sagi Grimberg
      When a request times out in the LIVE state, we simply trigger error
      recovery and let it handle the request cancellation. However, when a
      request times out in a non-LIVE state, we make sure to complete it
      immediately, as it might otherwise block controller setup or teardown
      and prevent forward progress.

      Tearing down the entire set of I/O and admin queues for that, though,
      causes a freeze/unfreeze imbalance (q->mq_freeze_depth) and is really
      overkill for what we actually need, which is to fence any controller
      teardown that may be running, stop the queue, and cancel the request if
      it has not already completed.

      Now that we have the controller teardown_lock, we can safely serialize
      the request cancellation. This addresses a hang caused by an extra queue
      freeze on the controller namespaces, which prevented unfreeze from
      completing correctly.
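      A simplified sketch of the resulting timeout handling; names follow the
      upstream driver and it assumes the per-controller teardown_lock
      introduced by the "serialize controller teardown sequences" patch below:

          static void nvme_tcp_complete_timed_out(struct request *rq)
          {
              struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
              struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;

              /* fence a teardown that may be running concurrently */
              mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
              nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
              if (!blk_mq_request_completed(rq)) {
                  nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
                  blk_mq_complete_request(rq);
              }
              mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
          }

          static enum blk_eh_timer_return nvme_tcp_timeout(struct request *rq,
                  bool reserved)
          {
              struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
              struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;

              if (ctrl->state != NVME_CTRL_LIVE) {
                  /* may block setup/teardown, so complete it right here */
                  nvme_tcp_complete_timed_out(rq);
                  return BLK_EH_DONE;
              }

              /* LIVE: let normal error recovery cancel the request */
              nvme_tcp_error_recovery(ctrl);
              return BLK_EH_RESET_TIMER;
          }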
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme-tcp: serialize controller teardown sequences · d4d61470
      Committed by Sagi Grimberg
      In the timeout handler we may need to complete a request because the
      request that timed out may be an I/O that is a part of a serial sequence
      of controller teardown or initialization. In order to complete the
      request, we need to fence any other context that may compete with us
      and complete the request that is timing out.
      
      This opens a window for a double completion: a hard IRQ or a different
      competing context may have triggered error recovery and may be running
      inflight request cancellation concurrently with the timeout handler.

      Protect against this with a per-controller teardown_lock that serializes
      the contexts which may complete a cancelled request due to error
      recovery or a reset.
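      A minimal sketch of that serialization, assuming a teardown_lock mutex
      in the controller structure that is taken around the teardown paths as
      well as in the timeout handler (abbreviated):

          struct nvme_tcp_ctrl {
              /* ... existing fields ... */
              struct mutex    teardown_lock;  /* timeout vs. error recovery/reset */
          };

          static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl, bool remove)
          {
              mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
              if (ctrl->queue_count <= 1)
                  goto out;
              nvme_stop_queues(ctrl);
              nvme_tcp_stop_io_queues(ctrl);
              blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl);
              if (remove)
                  nvme_start_queues(ctrl);
              nvme_tcp_destroy_io_queues(ctrl, remove);
          out:
              mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
          }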
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  10. 24 Aug, 2020 (1 commit)
  11. 22 Aug, 2020 (1 commit)
  12. 29 Jul, 2020 (2 commits)
    • nvme-tcp: fix controller reset hang during traffic · 2875b0ae
      Committed by Sagi Grimberg
      commit fe35ec58 ("block: update hctx map when use multiple maps") exposed
      an issue where we may hang trying to wait for queue freeze during I/O.
      We call blk_mq_update_nr_hw_queues, which with multiple queue maps
      (which we now have for default/read/poll) attempts to freeze the queue.
      However, we never started a queue freeze when starting the reset, which
      means there are inflight requests that entered the queue and that we
      will not complete once the queue is quiesced.

      So start a freeze before we quiesce the queue, and unfreeze the queue
      after we have successfully connected the I/O queues (and make sure to
      call blk_mq_update_nr_hw_queues only after we are sure that the queue
      was already frozen).

      This follows how the PCI driver handles resets.
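      A sketch of the resulting ordering, abbreviated from the teardown and
      reconnect paths:

          /* teardown: freeze first so requests that already entered the
           * queue are accounted for by the freeze counter
           */
          nvme_start_freeze(ctrl);
          nvme_stop_queues(ctrl);
          /* ... stop the TCP queues and cancel inflight requests ... */

          /* reconnect of an existing controller: only resize the hw queues
           * once the queues are known to be frozen
           */
          if (!new) {
              nvme_wait_freeze(ctrl);
              blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
              nvme_unfreeze(ctrl);
          }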
      
      Fixes: fe35ec58 ("block: update hctx map when use multiple maps")
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: fix deadlock in disconnect during scan_work and/or ana_work · ecca390e
      Committed by Sagi Grimberg
      A deadlock happens in the following scenario with multipath:
      1) scan_work(nvme0) detects a new nsid while nvme0
          is an optimized path to it, path nvme1 happens to be
          inaccessible.
      
      2) Before scan_work completes, an nvme0 disconnect is initiated;
          nvme_delete_ctrl_sync() sets the nvme0 state to NVME_CTRL_DELETING.
      
      3) scan_work(1) attempts to submit I/O,
          but nvme_path_is_optimized() observes that nvme0 is not LIVE.
          Since nvme1 is a possible path, the I/O is requeued and scan_work hangs.
      
      --
      Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      --
      
      4) Delete also hangs in flush_work(ctrl->scan_work)
          from nvme_remove_namespaces().
      
      Similarly, a deadlock with ana_work may happen: if ana_work has started
      and calls nvme_mpath_set_live and device_add_disk, it will trigger I/O.
      When we then trigger a disconnect, that I/O blocks because our
      accessible (optimized) path is disconnecting while the alternate path is
      inaccessible. The disconnect then tries to flush the ana_work and hangs.
      
      [  605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
      [  605.552087] Call Trace:
      [  605.552683]  __schedule+0x2b9/0x6c0
      [  605.553507]  schedule+0x42/0xb0
      [  605.554201]  io_schedule+0x16/0x40
      [  605.555012]  do_read_cache_page+0x438/0x830
      [  605.556925]  read_cache_page+0x12/0x20
      [  605.557757]  read_dev_sector+0x27/0xc0
      [  605.558587]  amiga_partition+0x4d/0x4c5
      [  605.561278]  check_partition+0x154/0x244
      [  605.562138]  rescan_partitions+0xae/0x280
      [  605.563076]  __blkdev_get+0x40f/0x560
      [  605.563830]  blkdev_get+0x3d/0x140
      [  605.564500]  __device_add_disk+0x388/0x480
      [  605.565316]  device_add_disk+0x13/0x20
      [  605.566070]  nvme_mpath_set_live+0x5e/0x130 [nvme_core]
      [  605.567114]  nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
      [  605.568197]  nvme_update_ana_state+0xca/0xe0 [nvme_core]
      [  605.569360]  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      [  605.571385]  nvme_read_ana_log+0x76/0x100 [nvme_core]
      [  605.572376]  nvme_ana_work+0x15/0x20 [nvme_core]
      [  605.573330]  process_one_work+0x1db/0x380
      [  605.574144]  worker_thread+0x4d/0x400
      [  605.574896]  kthread+0x104/0x140
      [  605.577205]  ret_from_fork+0x35/0x40
      [  605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
      [  605.579239]       Tainted: G           OE     5.3.5-050305-generic #201910071830
      [  605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  605.582320] nvme            D    0 14044  14043 0x00000000
      [  605.583424] Call Trace:
      [  605.583935]  __schedule+0x2b9/0x6c0
      [  605.584625]  schedule+0x42/0xb0
      [  605.585290]  schedule_timeout+0x203/0x2f0
      [  605.588493]  wait_for_completion+0xb1/0x120
      [  605.590066]  __flush_work+0x123/0x1d0
      [  605.591758]  __cancel_work_timer+0x10e/0x190
      [  605.593542]  cancel_work_sync+0x10/0x20
      [  605.594347]  nvme_mpath_stop+0x2f/0x40 [nvme_core]
      [  605.595328]  nvme_stop_ctrl+0x12/0x50 [nvme_core]
      [  605.596262]  nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
      [  605.597333]  nvme_sysfs_delete+0x5c/0x70 [nvme_core]
      [  605.598320]  dev_attr_store+0x17/0x30
      
      Fix this by introducing a new state, NVME_CTRL_DELETING_NOIO, which
      indicates the phase of controller deletion where I/O is no longer
      allowed to access the namespace. NVME_CTRL_DELETING still allows mpath
      I/O to be issued to the bottom device, and only after we flush the
      ana_work and scan_work (after nvme_stop_ctrl and
      nvme_prep_remove_namespaces) do we change the state to
      NVME_CTRL_DELETING_NOIO. We also prevent ana_work from re-firing by
      aborting early if we are not LIVE, so we should be safe here.
      
      In addition, change the transport drivers to follow the updated state
      machine.
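      A sketch of the new state and where the transition happens (the enum
      values match the upstream patch; the surrounding deletion code is
      abbreviated):

          enum nvme_ctrl_state {
              NVME_CTRL_NEW,
              NVME_CTRL_LIVE,
              NVME_CTRL_RESETTING,
              NVME_CTRL_CONNECTING,
              NVME_CTRL_DELETING,       /* mpath may still issue I/O to this ctrl */
              NVME_CTRL_DELETING_NOIO,  /* scan/ana work flushed, no more I/O allowed */
              NVME_CTRL_DEAD,
          };

          /* in nvme_remove_namespaces(), once scan_work has been flushed */
          nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING_NOIO);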
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Reported-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  13. 26 Jul, 2020 (1 commit)
  14. 08 Jul, 2020 (3 commits)
  15. 25 Jun, 2020 (1 commit)
  16. 24 Jun, 2020 (1 commit)
  17. 11 Jun, 2020 (1 commit)
  18. 29 May, 2020 (5 commits)
  19. 27 May, 2020 (1 commit)
  20. 10 May, 2020 (3 commits)
  21. 01 Apr, 2020 (1 commit)
    • nvme-tcp: fix possible crash in recv error flow · 39d06079
      Committed by Sagi Grimberg
      If the target misbehaves and sends us an unexpected payload, we need to
      fail the controller and stop processing the input stream. We clear the
      rd_enabled flag and stop io_work, but we may still requeue it if we have
      pending sends, and then in the next invocation we will process the input
      stream again, as the check is only in the .data_ready upcall.

      To fix this, make sure not to self-requeue io_work upon a recv flow
      error.
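      A simplified sketch of the io_work loop with the fix applied; the key
      point is the early return (no self-requeue) when nvme_tcp_try_recv()
      reports an error (the send side is abbreviated):

          static void nvme_tcp_io_work(struct work_struct *w)
          {
              struct nvme_tcp_queue *queue =
                  container_of(w, struct nvme_tcp_queue, io_work);
              unsigned long deadline = jiffies + msecs_to_jiffies(1);

              do {
                  bool pending = false;
                  int result;

                  result = nvme_tcp_try_send(queue);
                  if (result > 0)
                      pending = true;

                  result = nvme_tcp_try_recv(queue);
                  if (result > 0)
                      pending = true;
                  else if (unlikely(result < 0))
                      return;  /* recv error: controller is being failed, don't requeue */

                  if (!pending)
                      return;
              } while (!time_after(jiffies, deadline));

              /* still making progress: requeue ourselves */
              queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
          }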
      
      This fixes the crash:
       nvme nvme2: receive failed:  -22
       BUG: unable to handle page fault for address: ffffbeb5816c3b48
       nvme_ns_head_make_request: 29 callbacks suppressed
       block nvme0n5: no usable path - requeuing I/O
       block nvme0n5: no usable path - requeuing I/O
       block nvme0n7: no usable path - requeuing I/O
       block nvme0n7: no usable path - requeuing I/O
       block nvme0n3: no usable path - requeuing I/O
       block nvme0n3: no usable path - requeuing I/O
       block nvme0n3: no usable path - requeuing I/O
       block nvme0n7: no usable path - requeuing I/O
       block nvme0n3: no usable path - requeuing I/O
       block nvme0n3: no usable path - requeuing I/O
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 1039157067 P4D 1039157067 PUD 103915a067 PMD 102719f067 PTE 0
       Oops: 0000 [#1] SMP PTI
       CPU: 8 PID: 411 Comm: kworker/8:1H Not tainted 5.3.0-40-generic #32~18.04.1-Ubuntu
       Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0 12/17/2015
       Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
       RIP: 0010:nvme_tcp_recv_skb+0x2ae/0xb50 [nvme_tcp]
       RSP: 0018:ffffbeb5806cfd10 EFLAGS: 00010246
       RAX: ffffbeb5816c3b48 RBX: 00000000000003d0 RCX: 0000000000000008
       RDX: 00000000000003d0 RSI: 0000000000000001 RDI: ffff9a3040684b40
       RBP: ffffbeb5806cfd90 R08: 0000000000000000 R09: ffffffff946e6900
       R10: ffffbeb5806cfce0 R11: 0000000000000001 R12: 0000000000000000
       R13: ffff9a2ff86501c0 R14: 00000000000003d0 R15: ffff9a30b85f2798
       FS:  0000000000000000(0000) GS:ffff9a30bf800000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffbeb5816c3b48 CR3: 000000088400a006 CR4: 00000000003626e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        tcp_read_sock+0x8c/0x290
        ? __release_sock+0x9d/0xe0
        ? nvme_tcp_write_space+0xb0/0xb0 [nvme_tcp]
        nvme_tcp_io_work+0x4b4/0x830 [nvme_tcp]
        ? finish_task_switch+0x163/0x270
        process_one_work+0x1fd/0x3f0
        worker_thread+0x34/0x410
        kthread+0x121/0x140
        ? process_one_work+0x3f0/0x3f0
        ? kthread_park+0xb0/0xb0
        ret_from_fork+0x35/0x40
      Reported-by: Roy Shterman <roys@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  22. 31 Mar, 2020 (2 commits)
  23. 26 Mar, 2020 (3 commits)