1. 25 June 2020 (8 commits)
    • nvme: don't protect ns mutation with ns->head->lock · e164471d
      Authored by Sagi Grimberg
      Right now ns->head->lock is protecting namespace mutation,
      which is wrong and unneeded. Move it to only protect
      against head mutations. While we're at it, remove the unnecessary
      ns->head dereference, as we already have a head pointer.
      
      The problem with this is that head->lock spans
      mpath disk node I/O that may block under some conditions (if,
      for example, the controller is disconnecting or the path
      becomes inaccessible). The locking scheme then does not allow any
      other path to enable itself, preventing the blocked I/O from
      completing and blocking forward progress from there.
      
      This is a preparation patch for the fix in a subsequent patch
      where the disk I/O will also be done outside the head->lock.
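      As an illustration, the rough shape of the change is sketched below.
      This is a hedged sketch modeled on the nvme core, not the exact diff:
      the per-namespace ANA update runs without head->lock, and
      nvme_mpath_set_live() takes the lock internally around the head
      mutation only.

        /*
         * Sketch only: head->lock no longer wraps the namespace fields.
         */
        static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
                        struct nvme_ns *ns)
        {
                /* was: mutex_lock(&ns->head->lock) around all of this */
                ns->ana_grpid = le32_to_cpu(desc->grpid);
                ns->ana_state = desc->state;
                clear_bit(NVME_NS_ANA_PENDING, &ns->flags);

                if (nvme_state_is_live(ns->ana_state))
                        nvme_mpath_set_live(ns); /* locks head->lock itself */
        }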
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-multipath: fix deadlock between ana_work and scan_work · 489dd102
      Authored by Anton Eidelman
      When scan_work calls nvme_mpath_add_disk(), it holds ana_lock
      and invokes nvme_parse_ana_log(), which may issue I/O
      in device_add_disk() and hang waiting for an accessible path.
      While nvme_mpath_set_live() is only called when nvme_state_is_live(),
      a transition may cause NVME_SC_ANA_TRANSITION and requeue the I/O.
      
      In order to recover and complete the I/O, ana_work on the same ctrl
      must be able to update the path state and clear NVME_NS_ANA_PENDING.
      
      The deadlock occurs because scan_work keeps holding ana_lock,
      so ana_work hangs [1].
      
      Fix:
      Now nvme_mpath_add_disk() uses nvme_parse_ana_log() to obtain a copy
      of the ANA group desc, and then calls nvme_update_ns_ana_state() without
      holding ana_lock.
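      A hedged sketch of that flow, simplified from the nvme core (the
      callback helper name and the exact placement are illustrative):

        /* Copy out the group descriptor that matches dst->grpid. */
        static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
                        struct nvme_ana_group_desc *desc, void *data)
        {
                struct nvme_ana_group_desc *dst = data;

                if (desc->grpid != dst->grpid)
                        return 0;

                *dst = *desc;
                return -ENXIO; /* just break out of the loop */
        }

        void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id)
        {
                if (nvme_ctrl_use_ana(ns->ctrl)) {
                        struct nvme_ana_group_desc desc = {
                                .grpid = id->anagrpid,
                                .state = 0,
                        };

                        /* hold ana_lock only while copying the descriptor */
                        mutex_lock(&ns->ctrl->ana_lock);
                        ns->ana_grpid = le32_to_cpu(id->anagrpid);
                        nvme_parse_ana_log(ns->ctrl, &desc,
                                           nvme_lookup_ana_group_desc);
                        mutex_unlock(&ns->ctrl->ana_lock);

                        /* may issue I/O; must not hold ana_lock here */
                        if (desc.state)
                                nvme_update_ns_ana_state(&desc, ns);
                }
        }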
      
      [1]:
      kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      
      kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  schedule_preempt_disabled+0xe/0x10
      kernel:  __mutex_lock.isra.0+0x182/0x4f0
      kernel:  ? __switch_to_asm+0x34/0x70
      kernel:  ? select_task_rq_fair+0x1aa/0x5c0
      kernel:  ? kvm_sched_clock_read+0x11/0x20
      kernel:  ? sched_clock+0x9/0x10
      kernel:  __mutex_lock_slowpath+0x13/0x20
      kernel:  mutex_lock+0x2e/0x40
      kernel:  nvme_read_ana_log+0x3a/0x100 [nvme_core]
      kernel:  nvme_ana_work+0x15/0x20 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x4d/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ? process_one_work+0x380/0x380
      kernel:  ? kthread_park+0x80/0x80
      kernel:  ret_from_fork+0x35/0x40
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: fix possible deadlock when I/O is blocked · 3b4b1972
      Authored by Sagi Grimberg
      Revert fab7772b ("nvme-multipath: revalidate nvme_ns_head gendisk
      in nvme_validate_ns")
      
      When adding a new namespace to the head disk (via nvme_mpath_set_live)
      a partition scan is triggered, which issues I/O on the mpath device
      node. This process is usually triggered from scan_work, which holds
      the scan_lock. If that I/O blocks (for example, if an ANA change
      leaves only paths that are not accessible), this can deadlock on the
      head disk bd_mutex, as both the partition scan I/O takes it and head
      disk revalidation takes it to check for a resize (also triggered from
      scan_work, on a different path). See trace [1].
      
      The mpath disk revalidation was originally added to detect online disk
      size changes, but this is no longer needed since commit cb224c3a
      ("nvme: Convert to use set_capacity_revalidate_and_notify"), which
      already updates the resize info without unnecessarily revalidating the
      disk (the mpath disk doesn't even implement the .revalidate_disk fop).
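      For reference, a minimal sketch of how a size change propagates since
      cb224c3a; the helper name here is hypothetical, only the
      set_capacity_revalidate_and_notify() call is from the block layer of
      that era:

        /* Push a new capacity to the block layer: updates the size and
         * emits a resize uevent, no revalidate_disk() pass required. */
        static void nvme_mpath_update_capacity(struct gendisk *disk,
                        sector_t capacity)
        {
                set_capacity_revalidate_and_notify(disk, capacity, true);
        }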
      
      [1]:
      --
      kernel: INFO: task kworker/u65:9:494 blocked for more than 241 seconds.
      kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kernel: kworker/u65:9   D    0   494      2 0x80004000
      kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  schedule_preempt_disabled+0xe/0x10
      kernel:  __mutex_lock.isra.0+0x182/0x4f0
      kernel:  __mutex_lock_slowpath+0x13/0x20
      kernel:  mutex_lock+0x2e/0x40
      kernel:  revalidate_disk+0x63/0xa0
      kernel:  __nvme_revalidate_disk+0xfe/0x110 [nvme_core]
      kernel:  nvme_revalidate_disk+0xa4/0x160 [nvme_core]
      kernel:  ? evict+0x14c/0x1b0
      kernel:  revalidate_disk+0x2b/0xa0
      kernel:  nvme_validate_ns+0x49/0x940 [nvme_core]
      kernel:  ? blk_mq_free_request+0xd2/0x100
      kernel:  ? __nvme_submit_sync_cmd+0xbe/0x1e0 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ? process_one_work+0x380/0x380
      kernel:  ? kthread_park+0x80/0x80
      kernel:  ret_from_fork+0x1f/0x40
      ...
      kernel: INFO: task kworker/u65:1:2630 blocked for more than 241 seconds.
      kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kernel: kworker/u65:1   D    0  2630      2 0x80004000
      kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  ? __switch_to_asm+0x34/0x70
      kernel:  ? file_fdatawait_range+0x30/0x30
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  ? kmem_cache_alloc_trace+0x19c/0x230
      kernel:  efi_partition+0x1e6/0x708
      kernel:  ? vsnprintf+0x39e/0x4e0
      kernel:  ? snprintf+0x49/0x60
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  ? nvme_update_ns_ana_state+0x60/0x60 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  ? blk_mq_free_request+0xd2/0x100
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ? process_one_work+0x380/0x380
      kernel:  ? kthread_park+0x80/0x80
      kernel:  ret_from_fork+0x1f/0x40
      --
      
      Fixes: fab7772b ("nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns")
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-rdma: assign completion vector correctly · 032a9966
      Authored by Max Gurtovoy
      The completion vector index given at CQ creation time can't exceed
      the number of vectors supported by the underlying RDMA device. This
      violation can currently occur, for example, when one tries to connect
      with N regular read/write queues and M poll queues and the sum
      N + M > num_supported_vectors. This leads to a failure to establish
      a connection to the remote target. Instead, in that case, share a
      completion vector between queues, as sketched below.
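      A minimal sketch of the assignment, assuming queue index 0 is the
      admin queue (the helper name is illustrative):

        /* Wrap the queue index with the device's completion vector count
         * so CQ creation never asks for a vector the device doesn't have;
         * excess queues simply share vectors. */
        static int nvme_rdma_comp_vector(int queue_idx, int num_comp_vectors)
        {
                return (queue_idx == 0 ? 0 : queue_idx - 1) % num_comp_vectors;
        }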
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-tcp: initialize tagset numa value to the value of the ctrl · 610c8235
      Authored by Max Gurtovoy
      Both the admin and the I/O tagsets should be set according to the
      NUMA node of the controller.
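      The change itself is essentially a one-liner at tagset setup time;
      a minimal sketch with field names from blk-mq and the nvme core
      (the helper wrapper is illustrative):

        static void nvme_init_tagset_node(struct blk_mq_tag_set *set,
                        struct nvme_ctrl *ctrl)
        {
                /* was NUMA_NO_NODE; now follows the controller */
                set->numa_node = ctrl->numa_node;
        }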
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-pci: initialize tagset numa value to the value of the ctrl · d4ec47f1
      Authored by Max Gurtovoy
      Both the admin and the I/O tagsets should be set according to the
      NUMA node of the controller (the same one-line change as in the
      nvme-tcp commit above).
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-pci: override the value of the controller's numa node · 635333e4
      Authored by Max Gurtovoy
      Set the controller's node value according to the NUMA node of the
      PCI device.
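      A minimal sketch, assuming pdev and dev as in nvme_probe():

        /* the controller's node follows the underlying PCI device */
        dev->ctrl.numa_node = dev_to_node(&pdev->dev);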
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: set initial value for controller's numa node · 4fea243e
      Authored by Max Gurtovoy
      Initialize the node to NUMA_NO_NODE. Transports that are aware of
      NUMA node affinity can override it (e.g. the RDMA transport sets the
      affinity according to the RDMA HCA).
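      A minimal sketch of the default, assuming placement in
      nvme_init_ctrl():

        /* default; node-aware transports override this later */
        ctrl->numa_node = NUMA_NO_NODE;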
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  2. 14 June 2020 (1 commit)
    • treewide: replace '---help---' in Kconfig files with 'help' · a7f7f624
      Authored by Masahiro Yamada
      Since commit 84af7a61 ("checkpatch: kconfig: prefer 'help' over
      '---help---'"), the number of '---help---' instances has been
      gradually decreasing, but there are still more than 2400 of them.
      
      This commit finishes the conversion. While I was touching the lines,
      I also fixed the indentation.
      
      A variety of indentation styles were found:
      
        a) 4 spaces + '---help---'
        b) 7 spaces + '---help---'
        c) 8 spaces + '---help---'
        d) 1 space + 1 tab + '---help---'
        e) 1 tab + '---help---'    (correct indentation)
        f) 1 tab + 1 space + '---help---'
        g) 1 tab + 2 spaces + '---help---'
      
      In order to convert all of them to 1 tab + 'help', I ran the
      following command:
      
        $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
      Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
  3. 11 June 2020 (4 commits)
  4. 05 June 2020 (1 commit)
  5. 30 May 2020 (1 commit)
  6. 29 May 2020 (5 commits)
  7. 28 May 2020 (1 commit)
    • nvme-pci: avoid race between nvme_reap_pending_cqes() and nvme_poll() · 9210c075
      Authored by Dongli Zhang
      There may be a race between nvme_reap_pending_cqes() and nvme_poll(),
      e.g., when doing a live reset while polling the nvme device.
      
            CPU X                        CPU Y
                                     nvme_poll()
      nvme_dev_disable()
      -> nvme_stop_queues()
      -> nvme_suspend_io_queues()
      -> nvme_suspend_queue()
                                     -> spin_lock(&nvmeq->cq_poll_lock);
      -> nvme_reap_pending_cqes()
         -> nvme_process_cq()        -> nvme_process_cq()
      
      In the above scenario, the nvme_process_cq() for the same queue may be
      running on both CPU X and CPU Y concurrently.
      
      It is much easier to reproduce the issue when CONFIG_PREEMPT is
      enabled in the kernel. When CONFIG_PREEMPT is disabled, it takes
      longer for nvme_stop_queues()->blk_mq_quiesce_queue() to wait for a
      grace period.
      
      This patch protects nvme_process_cq() with nvmeq->cq_poll_lock in
      nvme_reap_pending_cqes(), as sketched below.
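      A hedged sketch of the fixed function, simplified from nvme-pci:

        static void nvme_reap_pending_cqes(struct nvme_dev *dev)
        {
                int i;

                for (i = dev->ctrl.queue_count - 1; i > 0; i--) {
                        /* serialize with nvme_poll(), which takes the
                         * same per-queue lock around nvme_process_cq() */
                        spin_lock(&dev->queues[i].cq_poll_lock);
                        nvme_process_cq(&dev->queues[i]);
                        spin_unlock(&dev->queues[i].cq_poll_lock);
                }
        }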
      
      Fixes: fa46c6fb ("nvme/pci: move cqe check after device shutdown")
      Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  8. 27 May 2020 (16 commits)
  9. 13 May 2020 (1 commit)
  10. 10 May 2020 (2 commits)