1. 29 8月, 2020 1 次提交
  2. 22 8月, 2020 4 次提交
    • C
      nvme: redirect commands on dying queue · 5eac5f33
      Chao Leng 提交于
      If a command send through nvme-multipath failed on a dying queue, resend it
      on another path.
      Signed-off-by: NChao Leng <lengchao@huawei.com>
      [hch: rebased on top of the completion refactoring]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5eac5f33
    • C
      nvme: refactor command completion · 5ddaabe8
      Christoph Hellwig 提交于
      Lift all the code to decide the dispostition of a completed command
      from nvme_complete_rq and nvme_failover_req into a new helper, which
      returns an emum of the potential actions.  nvme_complete_rq then
      just switches on those and calls the proper helper for the action.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5ddaabe8
    • K
      nvme: skip noiob for zoned devices · c41ad98b
      Keith Busch 提交于
      Zoned block devices reuse the chunk_sectors queue limit to define zone
      boundaries. If a such a device happens to also report an optimal
      boundary, do not use that to define the chunk_sectors as that may
      intermittently interfere with io splitting and zone size queries.
      Signed-off-by: NKeith Busch <kbusch@kernel.org>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c41ad98b
    • L
      nvme: Use spin_lock_irq() when taking the ctrl->lock · ecbcdf0c
      Logan Gunthorpe 提交于
      When locking the ctrl->lock spinlock IRQs need to be disabled to avoid a
      dead lock. The new spin_lock() calls recently added produce the
      following lockdep warning when running the blktest nvme/003:
      
          ================================
          WARNING: inconsistent lock state
          --------------------------------
          inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
          ksoftirqd/2/22 [HC0[0]:SC1[1]:HE0:SE0] takes:
          ffff888276a8c4c0 (&ctrl->lock){+.?.}-{2:2}, at: nvme_keep_alive_end_io+0x50/0xc0
          {SOFTIRQ-ON-W} state was registered at:
            lock_acquire+0x164/0x500
            _raw_spin_lock+0x28/0x40
            nvme_get_effects_log+0x37/0x1c0
            nvme_init_identify+0x9e4/0x14f0
            nvme_reset_work+0xadd/0x2360
            process_one_work+0x66b/0xb70
            worker_thread+0x6e/0x6c0
            kthread+0x1e7/0x210
            ret_from_fork+0x22/0x30
          irq event stamp: 1449221
          hardirqs last  enabled at (1449220): [<ffffffff81c58e69>] ktime_get+0xf9/0x140
          hardirqs last disabled at (1449221): [<ffffffff83129665>] _raw_spin_lock_irqsave+0x25/0x60
          softirqs last  enabled at (1449210): [<ffffffff83400447>] __do_softirq+0x447/0x595
          softirqs last disabled at (1449215): [<ffffffff81b489b5>] run_ksoftirqd+0x35/0x50
      
          other info that might help us debug this:
           Possible unsafe locking scenario:
      
                 CPU0
                 ----
            lock(&ctrl->lock);
            <Interrupt>
              lock(&ctrl->lock);
      
           *** DEADLOCK ***
      
          no locks held by ksoftirqd/2/22.
      
          stack backtrace:
          CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 5.8.0-rc4-eid-vmlocalyes-dbg-00157-g7236657c #1450
          Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
          Call Trace:
           dump_stack+0xc8/0x11a
           print_usage_bug.cold.63+0x235/0x23e
           mark_lock+0xa9c/0xcf0
           __lock_acquire+0xd9a/0x2b50
           lock_acquire+0x164/0x500
           _raw_spin_lock_irqsave+0x40/0x60
           nvme_keep_alive_end_io+0x50/0xc0
           blk_mq_end_request+0x158/0x210
           nvme_complete_rq+0x146/0x500
           nvme_loop_complete_rq+0x26/0x30 [nvme_loop]
           blk_done_softirq+0x187/0x1e0
           __do_softirq+0x118/0x595
           run_ksoftirqd+0x35/0x50
           smpboot_thread_fn+0x1d3/0x310
           kthread+0x1e7/0x210
           ret_from_fork+0x22/0x30
      
      Fixes: be93e87e ("nvme: support for multiple Command Sets Supported and Effects log pages")
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: NKeith Busch <kbusch@kernel.org>
      Tested-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ecbcdf0c
  3. 29 7月, 2020 10 次提交
    • C
      nvme: add a Identify Namespace Identification Descriptor list quirk · 5bedd3af
      Christoph Hellwig 提交于
      Add a quirk for a device that does not support the Identify Namespace
      Identification Descriptor list despite claiming 1.3 compliance.
      
      Fixes: ea43d970 ("nvme: fix identify error status silent ignore")
      Reported-by: NIngo Brunberg <ingo_brunberg@web.de>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Tested-by: NIngo Brunberg <ingo_brunberg@web.de>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      5bedd3af
    • L
      nvme: export nvme_find_get_ns() and nvme_put_ns() · 24493b8b
      Logan Gunthorpe 提交于
      nvme_find_get_ns() and nvme_put_ns() are required by the target passthru
      code and are exported under the NVME_TARGET_PASSTHRU namespace.
      Based-on-a-patch-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: NKeith Busch <kbusch@kernel.org>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      24493b8b
    • L
      nvme: introduce nvme_ctrl_get_by_path() · f783f444
      Logan Gunthorpe 提交于
      nvme_ctrl_get_by_path() is analogous to blkdev_get_by_path() except it
      gets a struct nvme_ctrl from the path to its char dev (/dev/nvme0).
      It makes use of filp_open() to open the file and uses the private
      data to obtain a pointer to the struct nvme_ctrl. If the fops of the
      file do not match, -EINVAL is returned.
      
      The purpose of this function is to support NVMe-OF target passthru
      and is exported under the NVME_TARGET_PASSTHRU namespace.
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: NKeith Busch <kbusch@kernel.org>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      f783f444
    • L
      nvme: introduce nvme_execute_passthru_rq to call nvme_passthru_[start|end]() · 17365ae6
      Logan Gunthorpe 提交于
      Introduce a new nvme_execute_passthru_rq() helper which calls
      nvme_passthru_[start|end]() around blk_execute_rq(). This ensures
      all passthru calls (including nvme_submit_io()) will be wrapped
      appropriately.
      
      nvme_execute_passthru_rq() will also be useful for the nvmet passthru
      code and is exported in the NVME_TARGET_PASSTHRU namespace.
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: NKeith Busch <kbusch@kernel.org>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      17365ae6
    • L
      nvme: create helper function to obtain command effects · df21b6b1
      Logan Gunthorpe 提交于
      Separate the code to obtain command effects from the code
      to start a passthru request and move the nvme_passthru_start() and
      nvme_passthru_end() functions up above nvme_submit_user_cmd() in order
      that they may be used in a new helper a subsequent patch.
      
      The new helper function will be necessary for nvmet passthru
      code to determine if we need to change out of interrupt context
      to handle the effects. It is exported in the NVME_TARGET_PASSTHRU
      namespace.
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: NKeith Busch <kbusch@kernel.org>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      df21b6b1
    • L
      nvme: clear any SGL flags in passthru commands · 2bf5d3bb
      Logan Gunthorpe 提交于
      The host driver should decide whether to use SGLs or PRPs and they
      currently assume the flags are cleared after the call to
      nvme_setup_cmd(). However, passed-through commands may erroneously
      set these bits; so clear them for all cases.
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: NKeith Busch <kbusch@kernel.org>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      2bf5d3bb
    • S
      nvme: fix deadlock in disconnect during scan_work and/or ana_work · ecca390e
      Sagi Grimberg 提交于
      A deadlock happens in the following scenario with multipath:
      1) scan_work(nvme0) detects a new nsid while nvme0
          is an optimized path to it, path nvme1 happens to be
          inaccessible.
      
      2) Before scan_work is complete nvme0 disconnect is initiated
          nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING
      
      3) scan_work(1) attempts to submit IO,
          but nvme_path_is_optimized() observes nvme0 is not LIVE.
          Since nvme1 is a possible path IO is requeued and scan_work hangs.
      
      --
      Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      --
      
      4) Delete also hangs in flush_work(ctrl->scan_work)
          from nvme_remove_namespaces().
      
      Similiarly a deadlock with ana_work may happen: if ana_work has started
      and calls nvme_mpath_set_live and device_add_disk, it will
      trigger I/O. When we trigger disconnect I/O will block because
      our accessible (optimized) path is disconnecting, but the alternate
      path is inaccessible, so I/O blocks. Then disconnect tries to flush
      the ana_work and hangs.
      
      [  605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
      [  605.552087] Call Trace:
      [  605.552683]  __schedule+0x2b9/0x6c0
      [  605.553507]  schedule+0x42/0xb0
      [  605.554201]  io_schedule+0x16/0x40
      [  605.555012]  do_read_cache_page+0x438/0x830
      [  605.556925]  read_cache_page+0x12/0x20
      [  605.557757]  read_dev_sector+0x27/0xc0
      [  605.558587]  amiga_partition+0x4d/0x4c5
      [  605.561278]  check_partition+0x154/0x244
      [  605.562138]  rescan_partitions+0xae/0x280
      [  605.563076]  __blkdev_get+0x40f/0x560
      [  605.563830]  blkdev_get+0x3d/0x140
      [  605.564500]  __device_add_disk+0x388/0x480
      [  605.565316]  device_add_disk+0x13/0x20
      [  605.566070]  nvme_mpath_set_live+0x5e/0x130 [nvme_core]
      [  605.567114]  nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
      [  605.568197]  nvme_update_ana_state+0xca/0xe0 [nvme_core]
      [  605.569360]  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      [  605.571385]  nvme_read_ana_log+0x76/0x100 [nvme_core]
      [  605.572376]  nvme_ana_work+0x15/0x20 [nvme_core]
      [  605.573330]  process_one_work+0x1db/0x380
      [  605.574144]  worker_thread+0x4d/0x400
      [  605.574896]  kthread+0x104/0x140
      [  605.577205]  ret_from_fork+0x35/0x40
      [  605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
      [  605.579239]       Tainted: G           OE     5.3.5-050305-generic #201910071830
      [  605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  605.582320] nvme            D    0 14044  14043 0x00000000
      [  605.583424] Call Trace:
      [  605.583935]  __schedule+0x2b9/0x6c0
      [  605.584625]  schedule+0x42/0xb0
      [  605.585290]  schedule_timeout+0x203/0x2f0
      [  605.588493]  wait_for_completion+0xb1/0x120
      [  605.590066]  __flush_work+0x123/0x1d0
      [  605.591758]  __cancel_work_timer+0x10e/0x190
      [  605.593542]  cancel_work_sync+0x10/0x20
      [  605.594347]  nvme_mpath_stop+0x2f/0x40 [nvme_core]
      [  605.595328]  nvme_stop_ctrl+0x12/0x50 [nvme_core]
      [  605.596262]  nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
      [  605.597333]  nvme_sysfs_delete+0x5c/0x70 [nvme_core]
      [  605.598320]  dev_attr_store+0x17/0x30
      
      Fix this by introducing a new state: NVME_CTRL_DELETE_NOIO, which will
      indicate the phase of controller deletion where I/O cannot be allowed
      to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to
      be issued to the bottom device, and only after we flush the ana_work
      and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces)
      we change the state to NVME_CTRL_DELETING_NOIO. Also we prevent ana_work
      from re-firing by aborting early if we are not LIVE, so we should be safe
      here.
      
      In addition, change the transport drivers to follow the updated state
      machine.
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Reported-by: NAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      ecca390e
    • C
      nvme-core: replace ctrl page size with a macro · 6c3c05b0
      Chaitanya Kulkarni 提交于
      Saving the nvme controller's page size was from a time when the driver
      tried to use different sized pages, but this value is always set to
      a constant, and has been this way for some time. Remove the 'page_size'
      field and replace its usage with the constant value.
      
      This also lets the compiler make some micro-optimizations in the io
      path, and that's always a good thing.
      Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      6c3c05b0
    • B
      nvme: remove redundant validation in nvme_start_ctrl() · 5887450b
      Baolin Wang 提交于
      We've already validated the 'kato' in nvme_start_keep_alive(), thus no
      need to validate it again in nvme_start_ctrl(). Remove it.
      Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      5887450b
    • D
      nvme: remove an unnecessary condition · eca9e827
      Dan Carpenter 提交于
      "v" is an unsigned int so it can't be more than UINT_MAX.  Removing this
      check makes it easier to preserve the error code as well.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      eca9e827
  4. 21 7月, 2020 1 次提交
  5. 16 7月, 2020 1 次提交
  6. 09 7月, 2020 1 次提交
  7. 08 7月, 2020 7 次提交
  8. 02 7月, 2020 1 次提交
  9. 01 7月, 2020 1 次提交
  10. 25 6月, 2020 2 次提交
    • S
      nvme: fix possible deadlock when I/O is blocked · 3b4b1972
      Sagi Grimberg 提交于
      Revert fab7772b ("nvme-multipath: revalidate nvme_ns_head gendisk
      in nvme_validate_ns")
      
      When adding a new namespace to the head disk (via nvme_mpath_set_live)
      we will see partition scan which triggers I/O on the mpath device node.
      This process will usually be triggered from the scan_work which holds
      the scan_lock. If I/O blocks (if we got ana change currently have only
      available paths but none are accessible) this can deadlock on the head
      disk bd_mutex as both partition scan I/O takes it, and head disk revalidation
      takes it to check for resize (also triggered from scan_work on a different
      path). See trace [1].
      
      The mpath disk revalidation was originally added to detect online disk
      size change, but this is no longer needed since commit cb224c3a
      ("nvme: Convert to use set_capacity_revalidate_and_notify") which already
      updates resize info without unnecessarily revalidating the disk (the
      mpath disk doesn't even implement .revalidate_disk fop).
      
      [1]:
      --
      kernel: INFO: task kworker/u65:9:494 blocked for more than 241 seconds.
      kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kernel: kworker/u65:9   D    0   494      2 0x80004000
      kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  schedule_preempt_disabled+0xe/0x10
      kernel:  __mutex_lock.isra.0+0x182/0x4f0
      kernel:  __mutex_lock_slowpath+0x13/0x20
      kernel:  mutex_lock+0x2e/0x40
      kernel:  revalidate_disk+0x63/0xa0
      kernel:  __nvme_revalidate_disk+0xfe/0x110 [nvme_core]
      kernel:  nvme_revalidate_disk+0xa4/0x160 [nvme_core]
      kernel:  ? evict+0x14c/0x1b0
      kernel:  revalidate_disk+0x2b/0xa0
      kernel:  nvme_validate_ns+0x49/0x940 [nvme_core]
      kernel:  ? blk_mq_free_request+0xd2/0x100
      kernel:  ? __nvme_submit_sync_cmd+0xbe/0x1e0 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ? process_one_work+0x380/0x380
      kernel:  ? kthread_park+0x80/0x80
      kernel:  ret_from_fork+0x1f/0x40
      ...
      kernel: INFO: task kworker/u65:1:2630 blocked for more than 241 seconds.
      kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kernel: kworker/u65:1   D    0  2630      2 0x80004000
      kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  ? __switch_to_asm+0x34/0x70
      kernel:  ? file_fdatawait_range+0x30/0x30
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  ? kmem_cache_alloc_trace+0x19c/0x230
      kernel:  efi_partition+0x1e6/0x708
      kernel:  ? vsnprintf+0x39e/0x4e0
      kernel:  ? snprintf+0x49/0x60
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  ? nvme_update_ns_ana_state+0x60/0x60 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  ? blk_mq_free_request+0xd2/0x100
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ? process_one_work+0x380/0x380
      kernel:  ? kthread_park+0x80/0x80
      kernel:  ret_from_fork+0x1f/0x40
      --
      
      Fixes: fab7772b ("nvme-multipath: revalidate nvme_ns_head gendisk
      in nvme_validate_ns")
      Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      3b4b1972
    • M
      nvme: set initial value for controller's numa node · 4fea243e
      Max Gurtovoy 提交于
      Initialize the node to NUMA_NO_NODE value. Transports that are aware of
      numa node affinity can override it (e.g. RDMA transport set the affinity
      according to the RDMA HCA).
      Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      4fea243e
  11. 24 6月, 2020 1 次提交
  12. 11 6月, 2020 1 次提交
  13. 30 5月, 2020 1 次提交
  14. 27 5月, 2020 8 次提交