1. 03 Nov 2020, 1 commit
  2. 03 Oct 2020, 1 commit
    • nvme-tcp: check page by sendpage_ok() before calling kernel_sendpage() · 7d4194ab
      Committed by Coly Li
      Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage() to
      send slab pages. But pages allocated by __get_free_pages() without
      __GFP_COMP, which also have a refcount of 0, are still sent to the
      remote end by kernel_sendpage(), which is problematic.
      
      The newly introduced helper sendpage_ok() checks both the PageSlab
      flag and the page_count counter, and returns true only if the page
      being checked is OK to be sent by kernel_sendpage().
      
      This patch fixes the page checking in nvme_tcp_try_send_data() using
      sendpage_ok(): if sendpage_ok() returns true, the page is sent by
      kernel_sendpage(); otherwise sock_no_sendpage() is used to handle it.
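
      The check itself is tiny; below is a minimal sketch of the helper and
      the resulting call-site logic, based on the description above (the
      exact upstream code may differ in detail):

        static inline bool sendpage_ok(struct page *page)
        {
                /* refuse slab pages and pages with a zero refcount */
                return !PageSlab(page) && page_count(page) >= 1;
        }

        /* in nvme_tcp_try_send_data(): choose the safe send path per page */
        if (sendpage_ok(page))
                ret = kernel_sendpage(queue->sock, page, offset, len, flags);
        else
                ret = sock_no_sendpage(queue->sock, page, offset, len, flags);
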
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hannes Reinecke <hare@suse.de>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Vlastimil Babka <vbabka@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 09 Sep 2020, 1 commit
  4. 29 Aug 2020, 3 commits
    • nvme-tcp: fix reset hang if controller died in the middle of a reset · e5c01f4f
      Committed by Sagi Grimberg
      If the controller becomes unresponsive in the middle of a reset, we
      will hang because we are waiting for the freeze to complete, but that
      cannot happen since there are inflight commands holding the
      q_usage_counter, and we can't blindly fail requests that time out.

      So wait for the queue freeze with a timeout; if the freeze does not
      complete in time, fail and let the error handling decide how to
      proceed (either schedule a reconnect or remove the controller).
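
      A minimal sketch of the idea, assuming the nvme core helper
      nvme_wait_freeze_timeout() (which returns 0 when the freeze did not
      complete in time); the label and surrounding code are illustrative:

        if (!new) {
                nvme_start_queues(ctrl);
                if (!nvme_wait_freeze_timeout(ctrl, NVME_IO_TIMEOUT)) {
                        /*
                         * If we timed out waiting for freeze we are
                         * likely to be stuck: fail instead of hanging,
                         * and let error handling reconnect or remove
                         * the controller.
                         */
                        ret = -ENODEV;
                        goto out_wait_freeze_timed_out;
                }
                blk_mq_update_nr_hw_queues(ctrl->tagset,
                                ctrl->queue_count - 1);
                nvme_unfreeze(ctrl);
        }
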
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme-tcp: fix timeout handler · 236187c4
      Committed by Sagi Grimberg
      When a request times out in a LIVE state, we simply trigger error
      recovery and let the error recovery handle the request cancellation.
      However, when a request times out in a non-LIVE state, we make sure
      to complete it immediately, as it might block controller setup or
      teardown and prevent forward progress.

      But tearing down the entire set of I/O and admin queues causes a
      freeze/unfreeze imbalance (q->mq_freeze_depth) and is really overkill
      for what we actually need, which is just to fence a controller
      teardown that may be running, stop the queue, and cancel the request
      if it is not already completed.

      Now that we have the controller teardown_lock, we can safely
      serialize request cancellation. This addresses a hang caused by
      calling an extra queue freeze on controller namespaces, which caused
      unfreeze to not complete correctly.
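
      A sketch of the reworked timeout handler (names follow the nvme-tcp
      driver, but this is a simplified illustration, not the verbatim
      patch):

        static enum blk_eh_timer_return
        nvme_tcp_timeout(struct request *rq, bool reserved)
        {
                struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
                struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;

                if (ctrl->state != NVME_CTRL_LIVE) {
                        /*
                         * Setup/teardown may be blocked on this request:
                         * fence a running teardown (teardown_lock), stop
                         * the queue and complete just this request.
                         */
                        nvme_tcp_complete_timed_out(rq);
                        return BLK_EH_DONE;
                }

                /* LIVE: let error recovery cancel the request */
                nvme_tcp_error_recovery(ctrl);
                return BLK_EH_RESET_TIMER;
        }
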
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme-tcp: serialize controller teardown sequences · d4d61470
      Committed by Sagi Grimberg
      In the timeout handler we may need to complete a request, because the
      request that timed out may be an I/O that is part of a serial
      sequence of controller teardown or initialization. In order to
      complete the request, we need to fence any other context that may
      compete with us and complete the request that is timing out.

      In this case, we could have a potential double completion if a
      hard-irq or a different competing context triggered error recovery
      and is running inflight request cancellation concurrently with the
      timeout handler.

      Protect against this with a per-controller teardown_lock that
      serializes the contexts that may complete a cancelled request due to
      error recovery or a reset.
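
      Schematically, every teardown path and the timeout handler now take
      the same per-controller mutex (a sketch; field placement and the
      elided body are illustrative):

        /* in struct nvme_tcp_ctrl */
        struct mutex teardown_lock; /* serializes teardown vs. cancel */

        static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
                        bool remove)
        {
                mutex_lock(&to_tcp_ctrl(ctrl)->teardown_lock);
                /* ... stop queues and cancel inflight requests ... */
                mutex_unlock(&to_tcp_ctrl(ctrl)->teardown_lock);
        }

      The timeout handler takes the same lock before completing a timed-out
      request, so only one context can complete any given request.
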
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  5. 24 Aug 2020, 1 commit
  6. 22 Aug 2020, 1 commit
  7. 29 Jul 2020, 2 commits
    • nvme-tcp: fix controller reset hang during traffic · 2875b0ae
      Committed by Sagi Grimberg
      commit fe35ec58 ("block: update hctx map when use multiple maps")
      exposed an issue where we may hang trying to wait for a queue freeze
      during I/O. We call blk_mq_update_nr_hw_queues, which in the case of
      multiple queue maps (which we now have for default/read/poll)
      attempts to freeze the queue. However, we never started a queue
      freeze when starting the reset, which means we have inflight pending
      requests that entered the queue and that we will not complete once
      the queue is quiesced.

      So start a freeze before we quiesce the queue, and unfreeze the queue
      after we have successfully connected the I/O queues (and make sure to
      call blk_mq_update_nr_hw_queues only after we are sure that the queue
      was already frozen).

      This follows how the PCI driver handles resets.
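
      In outline, the ordering becomes (sketch; surrounding code elided):

        /* teardown side: freeze before quiescing */
        nvme_start_freeze(ctrl);
        nvme_stop_queues(ctrl);
        /* ... tear down and re-establish the I/O queues ... */

        /* reconnect side: only resize hw queues once frozen */
        nvme_start_queues(ctrl);
        nvme_wait_freeze(ctrl);
        blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
        nvme_unfreeze(ctrl);
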
      
      Fixes: fe35ec58 ("block: update hctx map when use multiple maps")
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: fix deadlock in disconnect during scan_work and/or ana_work · ecca390e
      Committed by Sagi Grimberg
      A deadlock happens in the following scenario with multipath:
      1) scan_work(nvme0) detects a new nsid while nvme0
          is an optimized path to it; path nvme1 happens to be
          inaccessible.

      2) Before scan_work is complete, an nvme0 disconnect is initiated;
          nvme_delete_ctrl_sync() sets the nvme0 state to NVME_CTRL_DELETING.

      3) scan_work (from step 1) attempts to submit I/O,
          but nvme_path_is_optimized() observes that nvme0 is not LIVE.
          Since nvme1 is a possible path, the I/O is requeued and scan_work
          hangs.
      
      --
      Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      --
      
      4) Delete also hangs in flush_work(ctrl->scan_work)
          from nvme_remove_namespaces().
      
      Similarly, a deadlock with ana_work may happen: if ana_work has
      started and calls nvme_mpath_set_live and device_add_disk, it will
      trigger I/O. When we then trigger disconnect, that I/O blocks because
      our accessible (optimized) path is disconnecting while the alternate
      path is inaccessible. Disconnect then tries to flush the ana_work and
      hangs.
      
      [  605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
      [  605.552087] Call Trace:
      [  605.552683]  __schedule+0x2b9/0x6c0
      [  605.553507]  schedule+0x42/0xb0
      [  605.554201]  io_schedule+0x16/0x40
      [  605.555012]  do_read_cache_page+0x438/0x830
      [  605.556925]  read_cache_page+0x12/0x20
      [  605.557757]  read_dev_sector+0x27/0xc0
      [  605.558587]  amiga_partition+0x4d/0x4c5
      [  605.561278]  check_partition+0x154/0x244
      [  605.562138]  rescan_partitions+0xae/0x280
      [  605.563076]  __blkdev_get+0x40f/0x560
      [  605.563830]  blkdev_get+0x3d/0x140
      [  605.564500]  __device_add_disk+0x388/0x480
      [  605.565316]  device_add_disk+0x13/0x20
      [  605.566070]  nvme_mpath_set_live+0x5e/0x130 [nvme_core]
      [  605.567114]  nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
      [  605.568197]  nvme_update_ana_state+0xca/0xe0 [nvme_core]
      [  605.569360]  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      [  605.571385]  nvme_read_ana_log+0x76/0x100 [nvme_core]
      [  605.572376]  nvme_ana_work+0x15/0x20 [nvme_core]
      [  605.573330]  process_one_work+0x1db/0x380
      [  605.574144]  worker_thread+0x4d/0x400
      [  605.574896]  kthread+0x104/0x140
      [  605.577205]  ret_from_fork+0x35/0x40
      [  605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
      [  605.579239]       Tainted: G           OE     5.3.5-050305-generic #201910071830
      [  605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  605.582320] nvme            D    0 14044  14043 0x00000000
      [  605.583424] Call Trace:
      [  605.583935]  __schedule+0x2b9/0x6c0
      [  605.584625]  schedule+0x42/0xb0
      [  605.585290]  schedule_timeout+0x203/0x2f0
      [  605.588493]  wait_for_completion+0xb1/0x120
      [  605.590066]  __flush_work+0x123/0x1d0
      [  605.591758]  __cancel_work_timer+0x10e/0x190
      [  605.593542]  cancel_work_sync+0x10/0x20
      [  605.594347]  nvme_mpath_stop+0x2f/0x40 [nvme_core]
      [  605.595328]  nvme_stop_ctrl+0x12/0x50 [nvme_core]
      [  605.596262]  nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
      [  605.597333]  nvme_sysfs_delete+0x5c/0x70 [nvme_core]
      [  605.598320]  dev_attr_store+0x17/0x30
      
      Fix this by introducing a new state, NVME_CTRL_DELETING_NOIO, which
      indicates the phase of controller deletion where I/O cannot be
      allowed to access the namespace. NVME_CTRL_DELETING still allows
      mpath I/O to be issued to the bottom device, and only after we flush
      the ana_work and scan_work (after nvme_stop_ctrl and
      nvme_prep_remove_namespaces) do we change the state to
      NVME_CTRL_DELETING_NOIO. We also prevent ana_work from re-firing by
      aborting early if we are not LIVE, so we should be safe here.
      
      In addition, change the transport drivers to follow the updated state
      machine.
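
      Schematically, the deletion flow then looks as follows (a simplified
      sketch showing only the flush points and state transitions):

        static void nvme_do_delete_ctrl(struct nvme_ctrl *ctrl)
        {
                /*
                 * State is NVME_CTRL_DELETING here: mpath I/O may still
                 * reach this path, so scan_work/ana_work can make forward
                 * progress and be flushed safely.
                 */
                nvme_stop_ctrl(ctrl);         /* flushes ana_work */
                nvme_remove_namespaces(ctrl); /* flushes scan_work, then
                                               * moves the controller to
                                               * NVME_CTRL_DELETING_NOIO */
                /* from here on, no I/O may be issued to this path */
                /* ... shut down and uninitialize the controller ... */
        }
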
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Reported-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  8. 26 Jul 2020, 1 commit
  9. 08 Jul 2020, 3 commits
  10. 25 Jun 2020, 1 commit
  11. 24 Jun 2020, 1 commit
  12. 11 Jun 2020, 1 commit
  13. 29 May 2020, 5 commits
  14. 27 May 2020, 1 commit
  15. 10 May 2020, 3 commits
  16. 01 Apr 2020, 1 commit
    • nvme-tcp: fix possible crash in recv error flow · 39d06079
      Committed by Sagi Grimberg
      If the target misbehaves and sends us an unexpected payload, we need
      to make sure to fail the controller and stop processing the input
      stream. We clear the rd_enabled flag and stop io_work, but we may
      still requeue it if we still have pending sends, and then the next
      invocation will process the input stream again, since the check is
      only in the .data_ready upcall.

      To fix this we need to make sure not to self-requeue io_work upon a
      recv flow error.
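
      In the io_work loop, a recv error now bails out instead of counting
      as pending work (a sketch close to the shape of the driver's loop,
      simplified):

        static void nvme_tcp_io_work(struct work_struct *w)
        {
                struct nvme_tcp_queue *queue =
                        container_of(w, struct nvme_tcp_queue, io_work);
                unsigned long deadline = jiffies + msecs_to_jiffies(1);

                do {
                        bool pending = false;
                        int result;

                        result = nvme_tcp_try_send(queue);
                        if (result > 0)
                                pending = true;

                        result = nvme_tcp_try_recv(queue);
                        if (result > 0)
                                pending = true;
                        else if (unlikely(result < 0))
                                return; /* recv error: never self-requeue */

                        if (!pending)
                                return;
                } while (!time_after(jiffies, deadline));

                queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
        }
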
      
      This fixes the crash:
       nvme nvme2: receive failed:  -22
       BUG: unable to handle page fault for address: ffffbeb5816c3b48
       nvme_ns_head_make_request: 29 callbacks suppressed
       block nvme0n5: no usable path - requeuing I/O
       block nvme0n5: no usable path - requeuing I/O
       block nvme0n7: no usable path - requeuing I/O
       block nvme0n7: no usable path - requeuing I/O
       block nvme0n3: no usable path - requeuing I/O
       block nvme0n3: no usable path - requeuing I/O
       block nvme0n3: no usable path - requeuing I/O
       block nvme0n7: no usable path - requeuing I/O
       block nvme0n3: no usable path - requeuing I/O
       block nvme0n3: no usable path - requeuing I/O
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 1039157067 P4D 1039157067 PUD 103915a067 PMD 102719f067 PTE 0
       Oops: 0000 [#1] SMP PTI
       CPU: 8 PID: 411 Comm: kworker/8:1H Not tainted 5.3.0-40-generic #32~18.04.1-Ubuntu
       Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0 12/17/2015
       Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
       RIP: 0010:nvme_tcp_recv_skb+0x2ae/0xb50 [nvme_tcp]
       RSP: 0018:ffffbeb5806cfd10 EFLAGS: 00010246
       RAX: ffffbeb5816c3b48 RBX: 00000000000003d0 RCX: 0000000000000008
       RDX: 00000000000003d0 RSI: 0000000000000001 RDI: ffff9a3040684b40
       RBP: ffffbeb5806cfd90 R08: 0000000000000000 R09: ffffffff946e6900
       R10: ffffbeb5806cfce0 R11: 0000000000000001 R12: 0000000000000000
       R13: ffff9a2ff86501c0 R14: 00000000000003d0 R15: ffff9a30b85f2798
       FS:  0000000000000000(0000) GS:ffff9a30bf800000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffbeb5816c3b48 CR3: 000000088400a006 CR4: 00000000003626e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        tcp_read_sock+0x8c/0x290
        ? __release_sock+0x9d/0xe0
        ? nvme_tcp_write_space+0xb0/0xb0 [nvme_tcp]
        nvme_tcp_io_work+0x4b4/0x830 [nvme_tcp]
        ? finish_task_switch+0x163/0x270
        process_one_work+0x1fd/0x3f0
        worker_thread+0x34/0x410
        kthread+0x121/0x140
        ? process_one_work+0x3f0/0x3f0
        ? kthread_park+0xb0/0xb0
        ret_from_fork+0x35/0x40
      Reported-by: Roy Shterman <roys@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  17. 31 Mar 2020, 2 commits
  18. 26 Mar 2020, 6 commits
  19. 05 Mar 2020, 1 commit
  20. 15 Feb 2020, 2 commits
    • nvme: prevent warning triggered by nvme_stop_keep_alive · 97b2512a
      Committed by Nigel Kirkland
      Delayed keep-alive work is queued on the system workqueue and may be
      cancelled via nvme_stop_keep_alive from nvme_reset_wq, nvme_fc_wq or
      nvme_wq.

      check_flush_dependency() detects mismatched attributes between the
      workqueue context used to cancel the keep-alive work and the system
      workqueue. Specifically, the system workqueue does not have the
      WQ_MEM_RECLAIM flag, whereas the contexts used to cancel the
      keep-alive work do.

      Example warning:

        workqueue: WQ_MEM_RECLAIM nvme-reset-wq:nvme_fc_reset_ctrl_work [nvme_fc]
      	is flushing !WQ_MEM_RECLAIM events:nvme_keep_alive_work [nvme_core]

      To avoid the flags mismatch, delayed keep-alive work is now queued on
      nvme_wq.

      However, this creates a secondary concern where work and a request to
      cancel that work may be in the same workqueue, namely err_work in the
      rdma and tcp transports, which will want to flush/cancel the
      keep-alive work that will now be on nvme_wq.

      After reviewing the transports, it looks like err_work can be moved
      to nvme_reset_wq. In fact, that aligns them better with the
      transition into RESETTING and with performing related reset work in
      nvme_reset_wq.

      Change nvme-rdma and nvme-tcp to perform err_work on nvme_reset_wq.
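
      The change boils down to two queue selections (sketch; simplified
      from the description above):

        /* nvme core: keep-alive work now runs on nvme_wq, not system_wq */
        static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
        {
                queue_delayed_work(nvme_wq, &ctrl->ka_work, ctrl->kato * HZ);
        }

        /* nvme-tcp (nvme-rdma is analogous): err_work moves to
         * nvme_reset_wq, which may safely flush/cancel keep-alive work
         * that is queued on nvme_wq */
        static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
        {
                if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
                        return;
                queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
        }
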
      Signed-off-by: Nigel Kirkland <nigel.kirkland@broadcom.com>
      Signed-off-by: James Smart <jsmart2021@gmail.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme/tcp: fix bug on double requeue when send fails · 2d570a7c
      Committed by Anton Eidelman
      When nvme_tcp_io_work() fails to send to the socket due to a
      connection close/reset, error_recovery work is triggered from the
      nvme_tcp_state_change() socket callback. This cancels all the active
      requests in the tagset, which requeues them.

      The failed request, however, was also ended and thus requeued
      individually, unless send returned -EPIPE. Another return code that
      should be treated the same way is -ECONNRESET.

      The double requeue caused a BUG_ON(blk_queued_rq(rq)) in
      blk_mq_requeue_request(), hit from either the individual requeue of
      the failed request or the bulk requeue from
      blk_mq_tagset_busy_iter(..., nvme_cancel_request, ...).
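
      The fix widens the exclusion in the send-error path of
      nvme_tcp_try_send() (sketch of the relevant fragment; surrounding
      code elided):

        if (ret == -EAGAIN) {
                ret = 0;
        } else if (ret < 0) {
                dev_err(queue->ctrl->ctrl.device,
                        "failed to send request %d\n", ret);
                /*
                 * -EPIPE and -ECONNRESET mean error recovery will cancel
                 * (and requeue) this request; failing it here as well
                 * would requeue it twice.
                 */
                if (ret != -EPIPE && ret != -ECONNRESET)
                        nvme_tcp_fail_request(queue->request);
                queue->request = NULL;
        }
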
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  21. 05 Nov 2019, 1 commit
  22. 29 Oct 2019, 1 commit