1. 24 Aug 2020, 1 commit
  2. 29 Jul 2020, 3 commits
    • nvme-rdma: fix controller reset hang during traffic · 9f98772b
      Committed by Sagi Grimberg
      Commit fe35ec58 ("block: update hctx map when use multiple maps")
      exposed an issue where we may hang waiting for a queue freeze during
      I/O. We call blk_mq_update_nr_hw_queues, which in the case of multiple
      queue maps (which we now have for default/read/poll) attempts to
      freeze the queue. However, we never started a queue freeze when
      starting the reset, so there are inflight requests already in the
      queue that will never complete once the queue is quiesced.
      
      So start a freeze before we quiesce the queue, and unfreeze the queue
      after we have successfully connected the I/O queues (making sure to
      call blk_mq_update_nr_hw_queues only once we know the queue is
      frozen).
      
      This follows how the PCI driver handles resets.
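      
      A condensed sketch of the resulting ordering (helper names as in the
      nvme core of this era; the function names here are illustrative, not
      the literal diff):
      
      static void teardown_io_queues(struct nvme_rdma_ctrl *ctrl)
      {
      	/* Freeze *before* quiescing so inflight requests are
      	 * accounted for and eventually drained. */
      	nvme_start_freeze(&ctrl->ctrl);
      	nvme_stop_queues(&ctrl->ctrl);	/* quiesce */
      	/* ...stop and destroy the RDMA queues... */
      }
      
      static void resetup_io_queues(struct nvme_rdma_ctrl *ctrl)
      {
      	/* ...allocate and connect the I/O queues... */
      	nvme_start_queues(&ctrl->ctrl);	/* unquiesce */
      	nvme_wait_freeze(&ctrl->ctrl);	/* wait until frozen */
      	blk_mq_update_nr_hw_queues(ctrl->ctrl.tagset,
      				   ctrl->ctrl.queue_count - 1);
      	nvme_unfreeze(&ctrl->ctrl);
      }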
      
      Fixes: fe35ec58 ("block: update hctx map when use multiple maps")
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: fix deadlock in disconnect during scan_work and/or ana_work · ecca390e
      Committed by Sagi Grimberg
      A deadlock happens in the following scenario with multipath:
      1) scan_work(nvme0) detects a new nsid while nvme0
          is an optimized path to it; path nvme1 happens to be
          inaccessible.
      
      2) Before scan_work is complete, an nvme0 disconnect is initiated;
          nvme_delete_ctrl_sync() sets the nvme0 state to NVME_CTRL_DELETING.
      
      3) scan_work from (1) attempts to submit IO,
          but nvme_path_is_optimized() observes nvme0 is not LIVE.
          Since nvme1 is a possible path, IO is requeued and scan_work hangs.
      
      --
      Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      --
      
      4) Delete also hangs in flush_work(ctrl->scan_work)
          from nvme_remove_namespaces().
      
      Similarly, a deadlock with ana_work may happen: if ana_work has started
      and calls nvme_mpath_set_live and device_add_disk, it will
      trigger I/O. When we then trigger a disconnect, the I/O blocks because
      our accessible (optimized) path is disconnecting while the alternate
      path is inaccessible. Disconnect then tries to flush
      ana_work and hangs.
      
      [  605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
      [  605.552087] Call Trace:
      [  605.552683]  __schedule+0x2b9/0x6c0
      [  605.553507]  schedule+0x42/0xb0
      [  605.554201]  io_schedule+0x16/0x40
      [  605.555012]  do_read_cache_page+0x438/0x830
      [  605.556925]  read_cache_page+0x12/0x20
      [  605.557757]  read_dev_sector+0x27/0xc0
      [  605.558587]  amiga_partition+0x4d/0x4c5
      [  605.561278]  check_partition+0x154/0x244
      [  605.562138]  rescan_partitions+0xae/0x280
      [  605.563076]  __blkdev_get+0x40f/0x560
      [  605.563830]  blkdev_get+0x3d/0x140
      [  605.564500]  __device_add_disk+0x388/0x480
      [  605.565316]  device_add_disk+0x13/0x20
      [  605.566070]  nvme_mpath_set_live+0x5e/0x130 [nvme_core]
      [  605.567114]  nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
      [  605.568197]  nvme_update_ana_state+0xca/0xe0 [nvme_core]
      [  605.569360]  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      [  605.571385]  nvme_read_ana_log+0x76/0x100 [nvme_core]
      [  605.572376]  nvme_ana_work+0x15/0x20 [nvme_core]
      [  605.573330]  process_one_work+0x1db/0x380
      [  605.574144]  worker_thread+0x4d/0x400
      [  605.574896]  kthread+0x104/0x140
      [  605.577205]  ret_from_fork+0x35/0x40
      [  605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
      [  605.579239]       Tainted: G           OE     5.3.5-050305-generic #201910071830
      [  605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  605.582320] nvme            D    0 14044  14043 0x00000000
      [  605.583424] Call Trace:
      [  605.583935]  __schedule+0x2b9/0x6c0
      [  605.584625]  schedule+0x42/0xb0
      [  605.585290]  schedule_timeout+0x203/0x2f0
      [  605.588493]  wait_for_completion+0xb1/0x120
      [  605.590066]  __flush_work+0x123/0x1d0
      [  605.591758]  __cancel_work_timer+0x10e/0x190
      [  605.593542]  cancel_work_sync+0x10/0x20
      [  605.594347]  nvme_mpath_stop+0x2f/0x40 [nvme_core]
      [  605.595328]  nvme_stop_ctrl+0x12/0x50 [nvme_core]
      [  605.596262]  nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
      [  605.597333]  nvme_sysfs_delete+0x5c/0x70 [nvme_core]
      [  605.598320]  dev_attr_store+0x17/0x30
      
      Fix this by introducing a new state, NVME_CTRL_DELETING_NOIO, which
      indicates the phase of controller deletion where I/O cannot be allowed
      to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to
      be issued to the bottom device, and only after we flush the ana_work
      and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces)
      do we change the state to NVME_CTRL_DELETING_NOIO. We also prevent
      ana_work from re-firing by aborting early if we are not LIVE, so we
      should be safe here.
      
      In addition, change the transport drivers to follow the updated state
      machine.
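      
      A condensed sketch of the two-phase deletion flow (state and helper
      names per the patch; the exact call sites are simplified):
      
      static void nvme_do_delete_ctrl(struct nvme_ctrl *ctrl)
      {
      	/* Phase 1: NVME_CTRL_DELETING. mpath may still issue I/O to
      	 * the bottom device, so flushing scan_work/ana_work cannot
      	 * deadlock on requeued I/O. */
      	nvme_stop_ctrl(ctrl);		/* flushes ana_work */
      	nvme_remove_namespaces(ctrl);	/* flushes scan_work */
      
      	/* Phase 2: the flushes are done; fence off all further I/O. */
      	nvme_change_ctrl_state(ctrl, NVME_CTRL_DELETING_NOIO);
      
      	/* ...transport shutdown and final teardown follow... */
      }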
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Reported-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-rdma: use new shared CQ mechanism · 287f329e
      Committed by Yamin Friedman
      Have the driver use shared CQs, providing the ~10-20% improvement seen
      in the patch introducing shared CQs. Instead of opening a CQ for each
      QP per connected controller, the RDMA core provides each QP with a CQ
      that is shared between the QPs on that CPU core, reducing
      interrupt overhead.
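      
      A minimal sketch of the switch, assuming the ib_cq_pool_get()/
      ib_cq_pool_put() API added by the shared CQ pool work:
      
      	/* Before: a dedicated CQ per queue pair. */
      	queue->ib_cq = ib_alloc_cq(ibdev, queue, cq_size,
      				   comp_vector, IB_POLL_SOFTIRQ);
      
      	/* After: a CQ from the RDMA core's per-device pool, shared
      	 * among QPs polling in the same context. */
      	queue->ib_cq = ib_cq_pool_get(ibdev, cq_size,
      				      comp_vector, IB_POLL_SOFTIRQ);
      
      	/* On teardown, return it to the pool instead of freeing: */
      	ib_cq_pool_put(queue->ib_cq, cq_size);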
      Signed-off-by: Yamin Friedman <yaminf@mellanox.com>
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  3. 25 Jun 2020, 1 commit
    • nvme-rdma: assign completion vector correctly · 032a9966
      Committed by Max Gurtovoy
      The completion vector index given at CQ creation can't exceed the
      number of vectors supported by the underlying RDMA device. This
      violation can currently occur when, for example, one tries to connect
      with N regular read/write queues and M poll queues where
      N + M > num_supported_vectors. This leads to a failure to establish a
      connection to the remote target. Instead, in that case, share a
      completion vector between queues.
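      
      A sketch of the idea (num_comp_vectors is a struct ib_device field;
      the helper itself is illustrative, not the literal patch):
      
      static int nvme_rdma_comp_vector(int queue_idx, struct ib_device *ibdev)
      {
      	/* queue 0 is the admin queue; I/O queues start at 1 */
      	int base = queue_idx ? queue_idx - 1 : 0;
      
      	/* Wrap modulo the vectors the device exposes, so excess
      	 * queues share vectors instead of failing CQ creation. */
      	return base % ibdev->num_comp_vectors;
      }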
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  4. 24 Jun 2020, 3 commits
  5. 27 May 2020, 2 commits
  6. 31 Mar 2020, 1 commit
  7. 26 Mar 2020, 3 commits
  8. 17 Mar 2020, 1 commit
  9. 11 Mar 2020, 1 commit
  10. 15 Feb 2020, 1 commit
    • nvme: prevent warning triggered by nvme_stop_keep_alive · 97b2512a
      Committed by Nigel Kirkland
      Delayed keep-alive work is queued on the system workqueue and may be
      cancelled via nvme_stop_keep_alive from nvme_reset_wq, nvme_fc_wq or
      nvme_wq.
      
      check_flush_dependency() detects mismatched attributes between the
      workqueue context used to cancel the keep-alive work and system-wq.
      Specifically, system-wq does not have the WQ_MEM_RECLAIM flag, whereas
      the contexts used to cancel the keep-alive work do have it.
      
      Example warning:
      
        workqueue: WQ_MEM_RECLAIM nvme-reset-wq:nvme_fc_reset_ctrl_work [nvme_fc]
      	is flushing !WQ_MEM_RECLAIM events:nvme_keep_alive_work [nvme_core]
      
      To avoid the flags mismatch, delayed keep alive work is queued on nvme_wq.
      
      However, this creates a secondary concern: work and a request to
      cancel that work may now live on the same workqueue, namely err_work
      in the rdma and tcp transports, which will want to flush/cancel the
      keep-alive work that is now on nvme_wq.
      
      After reviewing the transports, it looks like err_work can be moved to
      nvme_reset_wq. In fact, that aligns them better with the transition
      into RESETTING and performing the related reset work in nvme_reset_wq.
      
      Change nvme-rdma and nvme-tcp to perform err_work in nvme_reset_wq.
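      
      A condensed sketch of the two changes (field and workqueue names as
      in the nvme core of this era):
      
      static void nvme_queue_keep_alive_work(struct nvme_ctrl *ctrl)
      {
      	/* nvme_wq is WQ_MEM_RECLAIM, unlike the system workqueue,
      	 * so cancelling it from a WQ_MEM_RECLAIM context is legal. */
      	queue_delayed_work(nvme_wq, &ctrl->ka_work, ctrl->kato * HZ);
      }
      
      	/* In the rdma/tcp error handlers, err_work moves off nvme_wq
      	 * so it can safely flush/cancel ka_work queued there: */
      	queue_work(nvme_reset_wq, &ctrl->err_work);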
      Signed-off-by: Nigel Kirkland <nigel.kirkland@broadcom.com>
      Signed-off-by: James Smart <jsmart2021@gmail.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 27 Nov 2019, 1 commit
    • nvme-rdma: Avoid preallocating big SGL for data · 38e18002
      Committed by Israel Rukshin
      nvme_rdma_alloc_tagset() preallocates a big buffer for the IO SGL based
      on SG_CHUNK_SIZE.
      
      Modern DMA engines are often capable of dealing with very big
      segments, so SG_CHUNK_SIZE is often too big; it results in a static
      4KB SGL allocation per command.
      
      If a controller has lots of deep queues, preallocation for the sg list can
      consume substantial amounts of memory. For nvme-rdma, nr_hw_queues can be
      128 and each queue's depth 128. This means the resulting preallocation
      for the data SGL is 128*128*4K = 64MB per controller.
      
      Switch to runtime allocation for SGL for lists longer than 2 entries. This
      is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
      well. Runtime SGL allocation has always been the case for the legacy I/O
      path so this is nothing new.
      
      The preallocated small SGL depends on SG_CHAIN so if the ARCH doesn't
      support SG_CHAIN, use only runtime allocation for the SGL.
      
      We didn't notice any performance degradation: small I/Os use the
      inline SG list, and for bigger I/Os allocating a larger SGL from the
      slab is fast enough.
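      
      A condensed sketch of the shape of the change (names per the patch):
      
      /* The tagset now reserves only a tiny inline SGL per request: */
      #ifdef CONFIG_ARCH_NO_SG_CHAIN
      #define NVME_INLINE_SG_CNT	0	/* no chaining: always allocate */
      #else
      #define NVME_INLINE_SG_CNT	2
      #endif
      
      	/* Longer scatterlists chain to a runtime slab allocation: */
      	req->sg_table.sgl = req->first_sgl;
      	ret = sg_alloc_table_chained(&req->sg_table,
      			blk_rq_nr_phys_segments(rq),
      			req->sg_table.sgl, NVME_INLINE_SG_CNT);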
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Israel Rukshin <israelr@mellanox.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
  12. 05 Nov 2019, 3 commits
  13. 14 Oct 2019, 1 commit
  14. 28 Sep 2019, 1 commit
  15. 26 Sep 2019, 1 commit
  16. 30 Aug 2019, 5 commits
  17. 05 Aug 2019, 1 commit
  18. 01 Aug 2019, 1 commit
    • nvme-rdma: fix possible use-after-free in connect error flow · d94211b8
      Committed by Sagi Grimberg
      When start_queue fails, we need to make sure to drain the queue's CQ
      before freeing the RDMA resources, because we might still race with
      the completion path. Have the start_queue() error path stop the queue
      safely.
      
      --
      [30371.808111] nvme nvme1: Failed reconnect attempt 11
      [30371.808113] nvme nvme1: Reconnecting in 10 seconds...
      [...]
      [30382.069315] nvme nvme1: creating 4 I/O queues.
      [30382.257058] nvme nvme1: Connect Invalid SQE Parameter, qid 4
      [30382.257061] nvme nvme1: failed to connect queue: 4 ret=386
      [30382.305001] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      [30382.305022] IP: qedr_poll_cq+0x8a3/0x1170 [qedr]
      [30382.305028] PGD 0 P4D 0
      [30382.305037] Oops: 0000 [#1] SMP PTI
      [...]
      [30382.305153] Call Trace:
      [30382.305166]  ? __switch_to_asm+0x34/0x70
      [30382.305187]  __ib_process_cq+0x56/0xd0 [ib_core]
      [30382.305201]  ib_poll_handler+0x26/0x70 [ib_core]
      [30382.305213]  irq_poll_softirq+0x88/0x110
      [30382.305223]  ? sort_range+0x20/0x20
      [30382.305232]  __do_softirq+0xde/0x2c6
      [30382.305241]  ? sort_range+0x20/0x20
      [30382.305249]  run_ksoftirqd+0x1c/0x60
      [30382.305258]  smpboot_thread_fn+0xef/0x160
      [30382.305265]  kthread+0x113/0x130
      [30382.305273]  ? kthread_create_worker_on_cpu+0x50/0x50
      [30382.305281]  ret_from_fork+0x35/0x40
      --
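      
      A condensed sketch of the fix (close to the patch; queue-flag
      bookkeeping elided). __nvme_rdma_stop_queue() performs the
      rdma_disconnect() + ib_drain_qp() pair, so no completion can race
      with the subsequent teardown:
      
      static int nvme_rdma_start_queue(struct nvme_rdma_ctrl *ctrl, int idx)
      {
      	struct nvme_rdma_queue *queue = &ctrl->queues[idx];
      	int ret;
      
      	if (idx)
      		ret = nvmf_connect_io_queue(&ctrl->ctrl, idx, false);
      	else
      		ret = nvmf_connect_admin_queue(&ctrl->ctrl);
      
      	if (ret) {
      		/* drain the QP before the caller frees RDMA resources */
      		__nvme_rdma_stop_queue(queue);
      		dev_info(ctrl->ctrl.device,
      			 "failed to connect queue: %d ret=%d\n", idx, ret);
      	}
      	return ret;
      }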
      Reported-by: Nicolas Morey-Chaisemartin <NMoreyChaisemartin@suse.com>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  19. 24 Jun 2019, 1 commit
  20. 21 Jun 2019, 1 commit
    • scsi: lib/sg_pool.c: improve APIs for allocating sg pool · 4635873c
      Committed by Ming Lei
      sg_alloc_table_chained() currently allows the caller to provide one
      preallocated SGL and returns early if the requested number of entries
      isn't bigger than the size of that SGL. This is used to inline an SGL
      for an IO request.
      
      However, the scatter-gather code only allows the size of the first
      preallocated SGL to be SG_CHUNK_SIZE (128). This means a substantial
      amount of memory (4KB) is claimed for the SGL of each IO request. If
      the I/O is small, it would be prudent to allocate a smaller SGL.
      
      Introduce an extra parameter to sg_alloc_table_chained() and
      sg_free_table_chained() for specifying size of the preallocated SGL.
      
      Both __sg_free_table() and __sg_alloc_table() assume that each SGL has
      the same size except for the last one. Change the code to allow both
      functions to accept a variable size for the first preallocated SGL.
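      
      A sketch of the updated signatures per this patch (callers now pass
      the size of their preallocated first chunk):
      
      int sg_alloc_table_chained(struct sg_table *table, int nents,
      			   struct scatterlist *first_chunk,
      			   unsigned nents_first_chunk);
      void sg_free_table_chained(struct sg_table *table,
      			   unsigned nents_first_chunk);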
      
      [mkp: attempted to clarify commit desc]
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Ewan D. Milne <emilne@redhat.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: netdev@vger.kernel.org
      Cc: linux-nvme@lists.infradead.org
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
  21. 07 Jun 2019, 1 commit
    • nvme-rdma: use dynamic dma mapping per command · 62f99b62
      Committed by Max Gurtovoy
      Commit 87fd1253 ("nvme-rdma: remove redundant reference between
      ib_device and tagset") caused a kernel panic when disconnecting from an
      inaccessible controller (disconnect during re-connection).
      
      --
      nvme nvme0: Removing ctrl: NQN "testnqn1"
      nvme_rdma: nvme_rdma_exit_request: hctx 0 queue_idx 1
      BUG: unable to handle kernel paging request at 0000000080000228
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP PTI
      ...
      Call Trace:
       blk_mq_exit_hctx+0x5c/0xf0
       blk_mq_exit_queue+0xd4/0x100
       blk_cleanup_queue+0x9a/0xc0
       nvme_rdma_destroy_io_queues+0x52/0x60 [nvme_rdma]
       nvme_rdma_shutdown_ctrl+0x3e/0x80 [nvme_rdma]
       nvme_do_delete_ctrl+0x53/0x80 [nvme_core]
       nvme_sysfs_delete+0x45/0x60 [nvme_core]
       kernfs_fop_write+0x105/0x180
       vfs_write+0xad/0x1a0
       ksys_write+0x5a/0xd0
       do_syscall_64+0x55/0x110
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x7fa215417154
      --
      
      The reason for this crash is accessing an already freed ib_device
      when performing dma_unmap in exit_request. The root cause is that
      during re-connection all the queues are destroyed and re-created (and
      the ib_device, which is reference counted by the queues, is freed as
      well), but the tagset stays alive and all the DMA mappings (performed
      in init_request) are kept in the request context. The original commit
      fixed a different bug, hit during bonding (aka NIC teaming) tests, in
      which some scenarios change the underlying ib_device and caused a
      memory leak and a possible segmentation fault. This commit is
      complementary: it fixes the stale DMA mappings saved in the request
      context by making the request SQE DMA mappings dynamic, scoped to the
      command lifetime (i.e. mapped in .queue_rq and unmapped in .complete).
      It thereby also fixes the above crash of accessing a freed ib_device
      during destruction of the tagset.
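      
      A condensed sketch of the mapping lifecycle (struct and helper names
      as in nvme-rdma; send-path setup and error handling trimmed):
      
      static blk_status_t nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
      				       const struct blk_mq_queue_data *bd)
      {
      	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(bd->rq);
      	struct ib_device *ibdev = req->queue->device->dev;
      
      	/* map the sqe per command instead of once in init_request */
      	req->sqe.dma = ib_dma_map_single(ibdev, req->sqe.data,
      					 sizeof(struct nvme_command),
      					 DMA_TO_DEVICE);
      	if (unlikely(ib_dma_mapping_error(ibdev, req->sqe.dma)))
      		return BLK_STS_RESOURCE;
      	/* ...set up and post the send work request... */
      	return BLK_STS_OK;
      }
      
      static void nvme_rdma_complete_rq(struct request *rq)
      {
      	struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
      	struct ib_device *ibdev = req->queue->device->dev;
      
      	/* unmap when the command completes, within its lifetime */
      	ib_dma_unmap_single(ibdev, req->sqe.dma,
      			    sizeof(struct nvme_command), DMA_TO_DEVICE);
      	nvme_complete_rq(rq);
      }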
      
      Fixes: 87fd1253 ("nvme-rdma: remove redundant reference between ib_device and tagset")
      Reported-by: Jim Harris <james.r.harris@intel.com>
      Suggested-by: Sagi Grimberg <sagi@grimberg.me>
      Tested-by: Jim Harris <james.r.harris@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  22. 31 May 2019, 1 commit
    • nvme-rdma: fix queue mapping when queue count is limited · 5651cd3c
      Committed by Sagi Grimberg
      When the controller supports fewer queues than requested, we should
      make sure that queue mapping does the right thing and not assume that
      all the requested queues are available. This fixes a crash in that
      situation.
      
      The rules are:
      1. if no write/poll queues are requested, we assign the available queues
         to the default queue map. The default and read queue maps share the
         existing queues.
      2. if write queues are requested:
        - first make sure that read queue map gets the requested
          nr_io_queues count
        - then grant the default queue map the minimum between the requested
          nr_write_queues and the remaining queues. If there are no available
          queues to dedicate to the default queue map, fallback to (1) and
          share all the queues in the existing queue map.
      3. if poll queues are requested:
        - map the remaining queues to the poll queue map.
      
      Also, provide a log indication on how we constructed the different
      queue maps.
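      
      A condensed sketch of the resulting map_queues logic (close to the
      patch; io_queues[] holds how many queues each map actually received):
      
      static int nvme_rdma_map_queues(struct blk_mq_tag_set *set)
      {
      	struct nvme_rdma_ctrl *ctrl = set->driver_data;
      
      	if (ctrl->io_queues[HCTX_TYPE_READ]) {
      		/* separate read queues: default map first, then read */
      		set->map[HCTX_TYPE_DEFAULT].nr_queues =
      			ctrl->io_queues[HCTX_TYPE_DEFAULT];
      		set->map[HCTX_TYPE_DEFAULT].queue_offset = 0;
      		set->map[HCTX_TYPE_READ].nr_queues =
      			ctrl->io_queues[HCTX_TYPE_READ];
      		set->map[HCTX_TYPE_READ].queue_offset =
      			ctrl->io_queues[HCTX_TYPE_DEFAULT];
      	} else {
      		/* shared read/write queues */
      		set->map[HCTX_TYPE_DEFAULT].nr_queues =
      			ctrl->io_queues[HCTX_TYPE_DEFAULT];
      		set->map[HCTX_TYPE_DEFAULT].queue_offset = 0;
      		set->map[HCTX_TYPE_READ].nr_queues =
      			ctrl->io_queues[HCTX_TYPE_DEFAULT];
      		set->map[HCTX_TYPE_READ].queue_offset = 0;
      	}
      	blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
      	blk_mq_map_queues(&set->map[HCTX_TYPE_READ]);
      
      	if (ctrl->io_queues[HCTX_TYPE_POLL]) {
      		/* poll queues take whatever remains at the end */
      		set->map[HCTX_TYPE_POLL].nr_queues =
      			ctrl->io_queues[HCTX_TYPE_POLL];
      		set->map[HCTX_TYPE_POLL].queue_offset =
      			ctrl->io_queues[HCTX_TYPE_DEFAULT] +
      			ctrl->io_queues[HCTX_TYPE_READ];
      		blk_mq_map_queues(&set->map[HCTX_TYPE_POLL]);
      	}
      	return 0;
      }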
      Reported-by: Harris, James R <james.r.harris@intel.com>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Tested-by: Jim Harris <james.r.harris@intel.com>
      Cc: <stable@vger.kernel.org> # v5.0+
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  23. 13 May 2019, 1 commit
    • nvme-rdma: remove redundant reference between ib_device and tagset · 87fd1253
      Committed by Max Gurtovoy
      In the past, before commit f41725bb ("nvme-rdma: Use mr pool"), we
      needed a reference on the ib_device for as long as the tagset was
      alive, since the MRs in the request structures needed a valid
      ib_device. Now we allocate/deallocate an MR pool per QP and consume
      MRs on demand.
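      
      A sketch of the per-command MR pool usage (API from rdma/mr_pool.h;
      the wrapper functions here are illustrative, the real code does this
      inside the data-mapping path):
      
      static int map_cmd_mr(struct nvme_rdma_queue *queue,
      		      struct nvme_rdma_request *req)
      {
      	/* Take an MR from the per-QP pool; no per-request MR pins
      	 * the ib_device for the lifetime of the tagset anymore. */
      	req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
      	if (WARN_ON_ONCE(!req->mr))
      		return -EAGAIN;
      	/* ...set up the fast-registration work request... */
      	return 0;
      }
      
      static void unmap_cmd_mr(struct nvme_rdma_queue *queue,
      			 struct nvme_rdma_request *req)
      {
      	ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
      	req->mr = NULL;
      }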
      
      Also remove the nvme_rdma_free_tagset function and use
      blk_mq_free_tag_set instead, as the wrapper is no longer needed.
      
      This commit also fixes a memory leak and a possible segmentation
      fault. When configuring the system with NIC teaming (aka bonding), we
      use one network interface to create an HA connection to the target
      side. If one connection breaks down, the nvme-rdma driver gets a
      notification from the rdma-cm layer that the underlying address has
      changed, and starts the error recovery process. During this process
      we reconnect to the target via the second interface in the bond
      without destroying the tagset. This leaked the initial rdma device
      (ndev) and miscounted the reference count of the newly created rdma
      device (new ndev). In the final destruction (or in another error
      flow), we would get a warning dump from ib_dealloc_pd saying we still
      have inflight MRs related to that pd. This happened because the
      miscounted reference count let the rdma device be torn down while
      some of its queues were not destroyed yet, causing access violations
      on its elements.
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Israel Rukshin <israelr@mellanox.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  24. 25 Apr 2019, 1 commit
  25. 21 Feb 2019, 1 commit
  26. 20 Feb 2019, 1 commit
  27. 04 Feb 2019, 1 commit