1. 02 Jul 2020, 1 commit
  2. 01 Jul 2020, 4 commits
  3. 25 Jun 2020, 4 commits
    • nvme-multipath: fix bogus request queue reference put · c3124466
      Sagi Grimberg authored
      The mpath disk node takes a reference on the mpath request queue
      when adding a live path to the mpath gendisk. However, if we
      connected to an inaccessible path, device_add_disk is not called,
      so if we disconnect and remove the mpath gendisk we end up putting
      a reference on the request queue that was never taken [1].
      
      Fix that to check if we ever added a live path (using
      NVME_NS_HEAD_HAS_DISK flag) and if not, clear the disk->queue
      reference.
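      A minimal sketch of the idea in nvme_mpath_remove_disk(); this is
      illustrative only, and the flag name follows the commit text (the
      merged code may use a different identifier):
      
      void nvme_mpath_remove_disk(struct nvme_ns_head *head)
      {
              if (!head->disk)
                      return;
              if (head->disk->flags & GENHD_FL_UP)
                      del_gendisk(head->disk);
              blk_set_queue_dying(head->disk->queue);
              /* make sure all pending bios are cleaned up */
              kblockd_schedule_work(&head->requeue_work);
              flush_work(&head->requeue_work);
              blk_cleanup_queue(head->disk->queue);
              if (!test_bit(NVME_NS_HEAD_HAS_DISK, &head->flags)) {
                      /*
                       * device_add_disk() was never called, so disk_release()
                       * must not drop a queue reference that was never taken.
                       */
                      head->disk->queue = NULL;
              }
              put_disk(head->disk);
      }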
      
      [1]:
      ------------[ cut here ]------------
      refcount_t: underflow; use-after-free.
      WARNING: CPU: 1 PID: 1372 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
      CPU: 1 PID: 1372 Comm: nvme Tainted: G           O      5.7.0-rc2+ #3
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
      RIP: 0010:refcount_warn_saturate+0xa6/0xf0
      RSP: 0018:ffffb29e8053bdc0 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffff8b7a2f4fc060 RCX: 0000000000000007
      RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8b7a3ec99980
      RBP: ffff8b7a2f4fc000 R08: 00000000000002e1 R09: 0000000000000004
      R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
      R13: fffffffffffffff2 R14: ffffb29e8053bf08 R15: ffff8b7a320e2da0
      FS:  00007f135d4ca800(0000) GS:ffff8b7a3ec80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00005651178c0c30 CR3: 000000003b650005 CR4: 0000000000360ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       disk_release+0xa2/0xc0
       device_release+0x28/0x80
       kobject_put+0xa5/0x1b0
       nvme_put_ns_head+0x26/0x70 [nvme_core]
       nvme_put_ns+0x30/0x60 [nvme_core]
       nvme_remove_namespaces+0x9b/0xe0 [nvme_core]
       nvme_do_delete_ctrl+0x43/0x5c [nvme_core]
       nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
       kernfs_fop_write+0xc1/0x1a0
       vfs_write+0xb6/0x1a0
       ksys_write+0x5f/0xe0
       do_syscall_64+0x52/0x1a0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Reported-by: Anton Eidelman <anton@lightbitslabs.com>
      Tested-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      c3124466
    • nvme-multipath: fix deadlock due to head->lock · d8a22f85
      Anton Eidelman authored
      In the following scenario scan_work and ana_work will deadlock:
      
      When scan_work calls nvme_mpath_add_disk() it holds ana_lock
      and invokes nvme_parse_ana_log(), which may issue IO
      in device_add_disk() and hang waiting for an accessible path.
      
      While nvme_mpath_set_live() is only called when nvme_state_is_live(),
      a transition may cause NVME_SC_ANA_TRANSITION and requeue the IO.
      
      Since nvme_mpath_set_live() holds ns->head->lock, an ana_work on
      ANY ctrl will not be able to complete nvme_mpath_set_live()
      on the same ns->head, which is required in order to update
      the new accessible path and remove NVME_NS_ANA_PENDING.
      Therefore the IO never completes: deadlock [1].
      
      Fix:
      Move device_add_disk out of the head->lock and protect it with an
      atomic test_and_set for a new NVME_NS_HEAD_HAS_DISK bit.
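      A rough sketch of the reworked nvme_mpath_set_live(); illustrative
      only, with the flag name taken from this commit message:
      
      static void nvme_mpath_set_live(struct nvme_ns *ns)
      {
              struct nvme_ns_head *head = ns->head;
      
              if (!head->disk)
                      return;
      
              /* may block on IO, so do it outside head->lock */
              if (!test_and_set_bit(NVME_NS_HEAD_HAS_DISK, &head->flags))
                      device_add_disk(&head->subsys->dev, head->disk,
                                      nvme_ns_id_attr_groups);
      
              mutex_lock(&head->lock);
              if (nvme_path_is_optimized(ns)) {
                      int node, srcu_idx;
      
                      srcu_idx = srcu_read_lock(&head->srcu);
                      for_each_node(node)
                              __nvme_find_path(head, node);
                      srcu_read_unlock(&head->srcu, srcu_idx);
              }
              mutex_unlock(&head->lock);
      
              synchronize_srcu(&head->srcu);
              kblockd_schedule_work(&head->requeue_work);
      }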
      
      [1]:
      kernel: INFO: task kworker/u8:2:160 blocked for more than 120 seconds.
      kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kernel: kworker/u8:2    D    0   160      2 0x80004000
      kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  schedule_preempt_disabled+0xe/0x10
      kernel:  __mutex_lock.isra.0+0x182/0x4f0
      kernel:  __mutex_lock_slowpath+0x13/0x20
      kernel:  mutex_lock+0x2e/0x40
      kernel:  nvme_update_ns_ana_state+0x22/0x60 [nvme_core]
      kernel:  nvme_update_ana_state+0xca/0xe0 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_read_ana_log+0x76/0x100 [nvme_core]
      kernel:  nvme_ana_work+0x15/0x20 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x4d/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ret_from_fork+0x35/0x40
      kernel: INFO: task kworker/u8:4:439 blocked for more than 120 seconds.
      kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
      kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      kernel: kworker/u8:4    D    0   439      2 0x80004000
      kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_mpath_add_disk+0xbe/0x100 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x256/0x390 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x4d/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ret_from_fork+0x35/0x40
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      d8a22f85
    • nvme: don't protect ns mutation with ns->head->lock · e164471d
      Sagi Grimberg authored
      Right now ns->head->lock is protecting namespace mutation,
      which is wrong and unneeded. Move it to only protect
      against head mutations. While we're at it, remove the unnecessary
      ns->head reference as we already have the head pointer.
      
      The problem with this is that the head->lock spans
      mpath disk node I/O that may block under some conditions (if,
      for example, the controller is disconnecting or the path
      became inaccessible). The locking scheme does not allow any
      other path to enable itself, preventing the blocked I/O from
      completing and making forward progress from there.
      
      This is a preparation patch for the fix in a subsequent patch
      where the disk I/O will also be done outside the head->lock.
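      Roughly, the result is that the ns fields are updated without the
      lock, and head->lock is only taken where the head itself (the
      current path) is changed. A simplified sketch, not the exact diff:
      
      static void nvme_update_ns_ana_state(struct nvme_ana_group_desc *desc,
                      struct nvme_ns *ns)
      {
              ns->ana_grpid = le32_to_cpu(desc->grpid);
              ns->ana_state = desc->state;
              clear_bit(NVME_NS_ANA_PENDING, &ns->flags);
      
              if (nvme_state_is_live(ns->ana_state))
                      nvme_mpath_set_live(ns);  /* takes head->lock internally */
      }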
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      e164471d
    • nvme-multipath: fix deadlock between ana_work and scan_work · 489dd102
      Anton Eidelman authored
      When scan_work calls nvme_mpath_add_disk() it holds ana_lock
      and invokes nvme_parse_ana_log(), which may issue IO
      in device_add_disk() and hang waiting for an accessible path.
      While nvme_mpath_set_live() is only called when nvme_state_is_live(),
      a transition may cause NVME_SC_ANA_TRANSITION and requeue the IO.
      
      In order to recover and complete the IO, ana_work on the same ctrl
      should be able to update the path state and remove NVME_NS_ANA_PENDING.
      
      The deadlock occurs because scan_work keeps holding ana_lock,
      so ana_work hangs [1].
      
      Fix:
      Now nvme_mpath_add_disk() uses nvme_parse_ana_log() to obtain a copy
      of the ANA group desc, and then calls nvme_update_ns_ana_state() without
      holding ana_lock.
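      A condensed sketch of the new flow, based on the description above
      (not the verbatim patch):
      
      static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
                      struct nvme_ana_group_desc *desc, void *data)
      {
              struct nvme_ana_group_desc *dst = data;
      
              if (desc->grpid != dst->grpid)
                      return 0;
              *dst = *desc;           /* copy out the desc we care about */
              return -ENXIO;          /* only used to stop the walk early */
      }
      
      void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id)
      {
              if (nvme_ctrl_use_ana(ns->ctrl)) {
                      struct nvme_ana_group_desc desc = {
                              .grpid = id->anagrpid,
                              .state = 0,
                      };
      
                      mutex_lock(&ns->ctrl->ana_lock);
                      ns->ana_grpid = le32_to_cpu(id->anagrpid);
                      nvme_parse_ana_log(ns->ctrl, &desc,
                                         nvme_lookup_ana_group_desc);
                      mutex_unlock(&ns->ctrl->ana_lock);
      
                      /* update from the private copy, without ana_lock */
                      if (desc.state)
                              nvme_update_ns_ana_state(&desc, ns);
              }
      }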
      
      [1]:
      kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      
      kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  schedule_preempt_disabled+0xe/0x10
      kernel:  __mutex_lock.isra.0+0x182/0x4f0
      kernel:  ? __switch_to_asm+0x34/0x70
      kernel:  ? select_task_rq_fair+0x1aa/0x5c0
      kernel:  ? kvm_sched_clock_read+0x11/0x20
      kernel:  ? sched_clock+0x9/0x10
      kernel:  __mutex_lock_slowpath+0x13/0x20
      kernel:  mutex_lock+0x2e/0x40
      kernel:  nvme_read_ana_log+0x3a/0x100 [nvme_core]
      kernel:  nvme_ana_work+0x15/0x20 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x4d/0x400
      kernel:  kthread+0x104/0x140
      kernel:  ? process_one_work+0x380/0x380
      kernel:  ? kthread_park+0x80/0x80
      kernel:  ret_from_fork+0x35/0x40
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      489dd102
  4. 10 May 2020, 3 commits
  5. 04 Apr 2020, 1 commit
    • nvme: fix deadlock caused by ANA update wrong locking · 657f1975
      Sagi Grimberg authored
      The deadlock combines 4 flows in parallel:
      - ns scanning (triggered from reconnect)
      - request timeout
      - ANA update (triggered from reconnect)
      - I/O coming into the mpath device
      
      (1) ns scanning triggers disk revalidation -> update disk info ->
          freeze queue -> but blocked, due to (2)
      
      (2) timeout handler references the q_usage_counter -> but blocks in
          the transport .timeout() handler, due to (3)
      
      (3) the transport timeout handler (indirectly) calls nvme_stop_queue() ->
          which takes the (down_read) namespaces_rwsem -> but blocks, due to (4)
      
      (4) ANA update takes the (down_write) namespaces_rwsem -> calls
          nvme_mpath_set_live() -> which synchronizes the ns_head srcu
          (see commit 504db087) -> but blocks, due to (5)
      
      (5) I/O comes into nvme_mpath_make_request -> takes srcu_read_lock ->
          direct_make_request -> blk_queue_enter -> but blocked, due to (1)
      
      ==> the request queue is under freeze -> deadlock.
      
      The fix is to make the ANA update take a read lock, as the namespaces
      list is not manipulated; only the ns and ns->head are being
      updated (which is protected by the ns->head lock).
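      In code terms, the walk in nvme_update_ana_state() now takes
      namespaces_rwsem for reading only (sketch, not the exact patch):
      
      static int nvme_update_ana_state(struct nvme_ctrl *ctrl,
                      struct nvme_ana_group_desc *desc, void *data)
      {
              u32 nr_nsids = le32_to_cpu(desc->nnsids), n = 0;
              unsigned int *nr_change_groups = data;
              struct nvme_ns *ns;
      
              if (desc->state == NVME_ANA_CHANGE)
                      (*nr_change_groups)++;
      
              if (!nr_nsids)
                      return 0;
      
              down_read(&ctrl->namespaces_rwsem);     /* was down_write */
              list_for_each_entry(ns, &ctrl->namespaces, list) {
                      unsigned int nsid = le32_to_cpu(desc->nsids[n]);
      
                      if (ns->head->ns_id < nsid)
                              continue;
                      if (ns->head->ns_id == nsid)
                              nvme_update_ns_ana_state(desc, ns);
                      if (++n == nr_nsids)
                              break;
              }
              up_read(&ctrl->namespaces_rwsem);
              return 0;
      }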
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      657f1975
  6. 28 Mar 2020, 1 commit
    • block: simplify queue allocation · 3d745ea5
      Christoph Hellwig authored
      Current make_request based drivers use either blk_alloc_queue_node or
      blk_alloc_queue to allocate a queue, and then set up the make_request_fn
      function pointer and a few parameters using the blk_queue_make_request
      helper.  Simplify this by passing the make_request pointer to
      blk_alloc_queue, and while at it merge the _node variant into the main
      helper by always passing a node_id, and remove the superfluous gfp_mask
      parameter.  A lower-level __blk_alloc_queue is kept for the blk-mq case.
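      For a make_request based driver, the conversion looks roughly like
      this (sketch, using the nvme-multipath queue as an example):
      
      /* before */
      q = blk_alloc_queue_node(GFP_KERNEL, ctrl->numa_node);
      if (!q)
              goto out;
      blk_queue_make_request(q, nvme_ns_head_make_request);
      
      /* after */
      q = blk_alloc_queue(nvme_ns_head_make_request, ctrl->numa_node);
      if (!q)
              goto out;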
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3d745ea5
  7. 26 Mar 2020, 1 commit
  8. 21 Feb 2020, 1 commit
    • nvme-multipath: Fix memory leak with ana_log_buf · 3b783090
      Logan Gunthorpe authored
      kmemleak reports a memory leak with the ana_log_buf allocated by
      nvme_mpath_init():
      
      unreferenced object 0xffff888120e94000 (size 8208):
        comm "nvme", pid 6884, jiffies 4295020435 (age 78786.312s)
          hex dump (first 32 bytes):
            00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
            01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
          backtrace:
            [<00000000e2360188>] kmalloc_order+0x97/0xc0
            [<0000000079b18dd4>] kmalloc_order_trace+0x24/0x100
            [<00000000f50c0406>] __kmalloc+0x24c/0x2d0
            [<00000000f31a10b9>] nvme_mpath_init+0x23c/0x2b0
            [<000000005802589e>] nvme_init_identify+0x75f/0x1600
            [<0000000058ef911b>] nvme_loop_configure_admin_queue+0x26d/0x280
            [<00000000673774b9>] nvme_loop_create_ctrl+0x2a7/0x710
            [<00000000f1c7a233>] nvmf_dev_write+0xc66/0x10b9
            [<000000004199f8d0>] __vfs_write+0x50/0xa0
            [<0000000065466fef>] vfs_write+0xf3/0x280
            [<00000000b0db9a8b>] ksys_write+0xc6/0x160
            [<0000000082156b91>] __x64_sys_write+0x43/0x50
            [<00000000c34fbb6d>] do_syscall_64+0x77/0x2f0
            [<00000000bbc574c9>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      nvme_mpath_init() is called by nvme_init_identify() which is called in
      multiple places (nvme_reset_work(), nvme_passthru_end(), etc). This
      means nvme_mpath_init() may be called multiple times before
      nvme_mpath_uninit() (which is only called on nvme_free_ctrl()).
      
      When nvme_mpath_init() is called multiple times, it overwrites the
      ana_log_buf pointer with a new allocation, thus leaking the previous
      allocation.
      
      To fix this, free ana_log_buf before allocating a new one.
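      The fix in nvme_mpath_init() is essentially (sketch):
      
      /* free the buffer from any previous nvme_mpath_init() call */
      kfree(ctrl->ana_log_buf);
      
      ctrl->ana_log_buf = kmalloc(ctrl->ana_log_size, GFP_KERNEL);
      if (!ctrl->ana_log_buf) {
              error = -ENOMEM;
              goto out;
      }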
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      3b783090
  9. 05 Nov 2019, 3 commits
    • nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths · 763303a8
      Anton Eidelman authored
      nvme_mpath_clear_ctrl_paths() iterates through
      the ctrl->namespaces list while holding ctrl->scan_lock.
      This does not seem to be the correct way of protecting
      from concurrent list modification.
      
      Specifically, nvme_scan_work() sorts ctrl->namespaces
      AFTER unlocking scan_lock.
      
      This may result in the following (rare) crash in ctrl disconnect
      during scan_work:
      
          BUG: kernel NULL pointer dereference, address: 0000000000000050
          Oops: 0000 [#1] SMP PTI
          CPU: 0 PID: 3995 Comm: nvme 5.3.5-050305-generic
          RIP: 0010:nvme_mpath_clear_current_path+0xe/0x90 [nvme_core]
          ...
          Call Trace:
           nvme_mpath_clear_ctrl_paths+0x3c/0x70 [nvme_core]
           nvme_remove_namespaces+0x35/0xe0 [nvme_core]
           nvme_do_delete_ctrl+0x47/0x90 [nvme_core]
           nvme_sysfs_delete+0x49/0x60 [nvme_core]
           dev_attr_store+0x17/0x30
           sysfs_kf_write+0x3e/0x50
           kernfs_fop_write+0x11e/0x1a0
           __vfs_write+0x1b/0x40
           vfs_write+0xb9/0x1a0
           ksys_write+0x67/0xe0
           __x64_sys_write+0x1a/0x20
           do_syscall_64+0x5a/0x130
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
          RIP: 0033:0x7f8d02bfb154
      
      Fix:
      After taking scan_lock in nvme_mpath_clear_ctrl_paths(), also take
      down_read(&ctrl->namespaces_rwsem) to make the list traversal safe.
      This will not cause deadlocks because taking scan_lock never happens
      while holding the namespaces_rwsem.
      Moreover, scan work downs namespaces_rwsem in the same order.
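      A sketch of the resulting function (illustrative):
      
      void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
      {
              struct nvme_ns *ns;
      
              mutex_lock(&ctrl->scan_lock);
              down_read(&ctrl->namespaces_rwsem);
              list_for_each_entry(ns, &ctrl->namespaces, list)
                      if (nvme_mpath_clear_current_path(ns))
                              kblockd_schedule_work(&ns->head->requeue_work);
              up_read(&ctrl->namespaces_rwsem);
              mutex_unlock(&ctrl->scan_lock);
      }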
      
      Alternative: sort ctrl->namespaces in nvme_scan_work()
      while still holding the scan_lock.
      This would leave nvme_mpath_clear_ctrl_paths() without correct protection
      against ctrl->namespaces modification by anyone other than scan_work.
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      763303a8
    • nvme: Fix parsing of ANA log page · 64fab729
      Prabhath Sajeepa authored
      Check the validity of the offset into the ANA log buffer before
      accessing a nvme_ana_group_desc. This check ensures that the size of
      the ANA log buffer >= offset + sizeof(nvme_ana_group_desc).
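      A sketch of the bounds check in the nvme_parse_ana_log() loop
      (excerpt; base points at ctrl->ana_log_buf and offset/i are the
      loop state, details elided):
      
      for (i = 0; i < le16_to_cpu(ctrl->ana_log_buf->ngrps); i++) {
              struct nvme_ana_group_desc *desc = base + offset;
              u32 nr_nsids;
              size_t nsid_buf_size;
      
              /* never read a desc that would extend past the log buffer */
              if (WARN_ON_ONCE(offset > ctrl->ana_log_size - sizeof(*desc)))
                      return -EINVAL;
      
              nr_nsids = le32_to_cpu(desc->nnsids);
              nsid_buf_size = nr_nsids * sizeof(__le32);
              /* ... validate nsid_buf_size against the remaining space ... */
      }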
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Prabhath Sajeepa <psajeepa@purestorage.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      64fab729
    • nvme: introduce "Command Aborted By host" status code · 2dc3947b
      Max Gurtovoy authored
      Fix the status code of canceled requests initiated by the host according
      to TP4028 (Status Code 0x371):
      "Command Aborted By host: The command was aborted as a result of host
      action (e.g., the host disconnected the Fabric connection)."
      
      Also in a multipath environment, unless otherwise specified, errors of
      this type (path related) should be retried using a different path, if
      one is available.
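      A sketch of how the new status code is used when the host cancels
      outstanding requests (illustrative, not the full patch):
      
      static bool nvme_cancel_request(struct request *req, void *data,
                      bool reserved)
      {
              struct nvme_ctrl *ctrl = data;
      
              dev_dbg_ratelimited(ctrl->device, "Cancelling I/O %d", req->tag);
      
              /* don't abort an already completed request */
              if (blk_mq_request_completed(req))
                      return true;
      
              /* TP4028: Command Aborted By host (0x371), a path related error
               * that multipath may retry on another path if one is available */
              nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
              blk_mq_complete_request(req);
              return true;
      }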
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2dc3947b
  10. 29 Oct 2019, 2 commits
    • nvme-multipath: remove unused groups_only mode in ana log · 86cccfbf
      Anton Eidelman authored
      groups_only mode in nvme_read_ana_log() is no longer used: remove it.
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      86cccfbf
    • nvme-multipath: fix possible io hang after ctrl reconnect · af8fd042
      Anton Eidelman authored
      The following scenario results in an IO hang:
      1) ctrl completes a request with NVME_SC_ANA_TRANSITION.
         NVME_NS_ANA_PENDING bit in ns->flags is set and ana_work is triggered.
      2) ana_work: nvme_read_ana_log() tries to get the ANA log page from the ctrl.
         This fails because ctrl disconnects.
         Therefore nvme_update_ns_ana_state() is not called
         and NVME_NS_ANA_PENDING bit in ns->flags is not cleared.
      3) ctrl reconnects: nvme_mpath_init(ctrl,...) calls
         nvme_read_ana_log(ctrl, groups_only=true).
         However, nvme_update_ana_state() does not update namespaces
         because nr_nsids = 0 (due to groups_only mode).
      4) scan_work calls nvme_validate_ns() finds the ns and re-validates OK.
      
      Result:
      The ctrl is now live but NVME_NS_ANA_PENDING bit in ns->flags is still set.
      Consequently ctrl will never be considered a viable path by __nvme_find_path().
      IO will hang if ctrl is the only or the last path to the namespace.
      
      More generally, while ctrl is reconnecting, its ANA state may change.
      And because nvme_mpath_init() requests ANA log in groups_only mode,
      these changes are not propagated to the existing ctrl namespaces.
      This may result in a malfunction or an IO hang.
      
      Solution:
      nvme_mpath_init() will call nvme_read_ana_log() with groups_only set to false.
      This will not harm the new ctrl case (no namespaces present),
      and will make sure the ANA state of namespaces gets updated after reconnect.
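      The change itself is a one-liner in nvme_mpath_init() (sketch):
      
      /* read the full log, not groups_only, so existing namespaces get updated */
      error = nvme_read_ana_log(ctrl, false);
      if (error)
              goto out_free_ana_log_buf;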
      
      Note: Another option would be for nvme_mpath_init() to invoke
      nvme_parse_ana_log(..., nvme_set_ns_ana_state) for each existing namespace.
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      af8fd042
  11. 30 Aug 2019, 1 commit
    • nvme-multipath: fix ana log nsid lookup when nsid is not found · e01f91df
      Anton Eidelman authored
      ANA log parsing invokes nvme_update_ana_state() per ANA group desc.
      This updates the state of namespaces with nsids in desc->nsids[].
      
      Both ctrl->namespaces list and desc->nsids[] array are sorted by nsid.
      Hence nvme_update_ana_state() performs a single walk over ctrl->namespaces:
      - if current namespace matches the current desc->nsids[n],
        this namespace is updated, and n is incremented.
      - the process stops when it reaches the end of either
        ctrl->namespaces or desc->nsids[]
      
      In case desc->nsids[n] does not match any of ctrl->namespaces,
      the remaining nsids following desc->nsids[n] will not be updated.
      Such situation was considered abnormal and generated WARN_ON_ONCE.
      
      However ANA log MAY contain nsids not (yet) found in ctrl->namespaces.
      For example, let's consider the following scenario:
      - nvme0 exposes namespaces with nsids = [2, 3] to the host
      - a new namespace nsid = 1 is added dynamically
      - also, a ANA topology change is triggered
      - NS_CHANGED aen is generated and triggers scan_work
      - before scan_work discovers nsid=1 and creates a namespace, a NOTICE_ANA
        AEN is issued and ana_work receives an ANA log with nsids=[1, 2, 3]
      
      Result: ana_work fails to update ANA state on existing namespaces [2, 3]
      
      Solution:
      Change the way nvme_update_ana_state() namespace list walk
      checks the current namespace against desc->nsids[n] as follows:
      a) ns->head->ns_id < desc->nsids[n]: keep walking ctrl->namespaces.
      b) ns->head->ns_id == desc->nsids[n]: match, update the namespace
      c) ns->head->ns_id >= desc->nsids[n]: skip to desc->nsids[n+1]
      
      This enables correct operation in the scenario described above.
      This also allows ANA log to contain nsids currently invisible
      to the host, i.e. inactive nsids.
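      In code, the walk in nvme_update_ana_state() becomes roughly the
      following (excerpt; n, nr_nsids and desc are the walk state):
      
      list_for_each_entry(ns, &ctrl->namespaces, list) {
              unsigned int nsid = le32_to_cpu(desc->nsids[n]);
      
              if (ns->head->ns_id < nsid)             /* (a) keep walking */
                      continue;
              if (ns->head->ns_id == nsid)            /* (b) match: update */
                      nvme_update_ns_ana_state(desc, ns);
              if (++n == nr_nsids)                    /* (c) next nsid */
                      break;
      }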
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Reviewed-by: James Smart <james.smart@broadcom.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      e01f91df
  12. 21 Aug 2019, 1 commit
    • nvme-multipath: fix possible I/O hang when paths are updated · 504db087
      Anton Eidelman authored
      nvme_state_set_live() making a path available triggers requeue_work
      in order to resubmit requests that ended up on requeue_list when no
      paths were available.
      
      This requeue_work may race with concurrent nvme_ns_head_make_request()
      calls that do not observe the live path yet.
      Such concurrent requests may be made by either:
      - New IO submission.
      - Requeue_work triggered by nvme_failover_req() or another ana_work.
      
      A race may cause requeue_work to capture the state of the requeue_list
      before more requests get onto the list. These requests will stay on the
      list forever unless requeue_work is triggered again.
      
      In order to prevent such a race, nvme_state_set_live() should call
      synchronize_srcu(&head->srcu) before triggering the requeue_work, which
      prevents nvme_ns_head_make_request from referencing an old snapshot of
      the path list.
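      A sketch of the change in the path-activation code (illustrative):
      
      /*
       * Make sure any nvme_ns_head_make_request() that sampled the old
       * (pathless) state has finished before the requeue_work runs, so
       * nothing can land on the requeue_list after it has been drained.
       */
      synchronize_srcu(&ns->head->srcu);
      kblockd_schedule_work(&ns->head->requeue_work);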
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      504db087
  13. 01 Aug 2019, 2 commits
    • nvme: fix controller removal race with scan work · 0157ec8d
      Sagi Grimberg authored
      With multipath enabled, nvme_scan_work() can read from the device
      (through nvme_mpath_add_disk()) and hang [1]. However, with fabrics,
      once ctrl->state is set to NVME_CTRL_DELETING, the reads will hang
      (see nvmf_check_ready()) and the mpath stack device make_request
      will block if head->list is not empty. However, when the head->list
      consists of only DELETING/DEAD controllers, we should actually not
      block, but rather fail immediately.
      
      In addition, before we go ahead and remove the namespaces, make sure
      to clear the current path and kick the requeue list so that the
      request will fast fail upon requeuing.
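      A sketch of the path-availability check used by the multipath
      make_request code (illustrative):
      
      static bool nvme_available_path(struct nvme_ns_head *head)
      {
              struct nvme_ns *ns;
      
              list_for_each_entry_rcu(ns, &head->list, siblings) {
                      switch (ns->ctrl->state) {
                      case NVME_CTRL_LIVE:
                      case NVME_CTRL_RESETTING:
                      case NVME_CTRL_CONNECTING:
                              /* a usable path may (re)appear, keep requeuing */
                              return true;
                      default:
                              break;
                      }
              }
              /* only DELETING/DEAD controllers left: fail the I/O instead */
              return false;
      }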
      
      [1]:
      --
        INFO: task kworker/u4:3:166 blocked for more than 120 seconds.
              Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        kworker/u4:3    D    0   166      2 0x80004000
        Workqueue: nvme-wq nvme_scan_work
        Call Trace:
         __schedule+0x851/0x1400
         schedule+0x99/0x210
         io_schedule+0x21/0x70
         do_read_cache_page+0xa57/0x1330
         read_cache_page+0x4a/0x70
         read_dev_sector+0xbf/0x380
         amiga_partition+0xc4/0x1230
         check_partition+0x30f/0x630
         rescan_partitions+0x19a/0x980
         __blkdev_get+0x85a/0x12f0
         blkdev_get+0x2a5/0x790
         __device_add_disk+0xe25/0x1250
         device_add_disk+0x13/0x20
         nvme_mpath_set_live+0x172/0x2b0
         nvme_update_ns_ana_state+0x130/0x180
         nvme_set_ns_ana_state+0x9a/0xb0
         nvme_parse_ana_log+0x1c3/0x4a0
         nvme_mpath_add_disk+0x157/0x290
         nvme_validate_ns+0x1017/0x1bd0
         nvme_scan_work+0x44d/0x6a0
         process_one_work+0x7d7/0x1240
         worker_thread+0x8e/0xff0
         kthread+0x2c3/0x3b0
         ret_from_fork+0x35/0x40
      
         INFO: task kworker/u4:1:1034 blocked for more than 120 seconds.
              Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        kworker/u4:1    D    0  1034      2 0x80004000
        Workqueue: nvme-delete-wq nvme_delete_ctrl_work
        Call Trace:
         __schedule+0x851/0x1400
         schedule+0x99/0x210
         schedule_timeout+0x390/0x830
         wait_for_completion+0x1a7/0x310
         __flush_work+0x241/0x5d0
         flush_work+0x10/0x20
         nvme_remove_namespaces+0x85/0x3d0
         nvme_do_delete_ctrl+0xb4/0x1e0
         nvme_delete_ctrl_work+0x15/0x20
         process_one_work+0x7d7/0x1240
         worker_thread+0x8e/0xff0
         kthread+0x2c3/0x3b0
         ret_from_fork+0x35/0x40
      --
      Reported-by: Logan Gunthorpe <logang@deltatee.com>
      Tested-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      0157ec8d
    • nvme: fix a possible deadlock when passthru commands sent to a multipath device · b9156dae
      Sagi Grimberg authored
      When the user issues a command with side effects, we will end up freezing
      the namespace request queue when updating disk info (and the same for
      the corresponding mpath disk node).
      
      However, we are not freezing the mpath node request queue,
      which means that mpath I/O can still come in and block on blk_queue_enter
      (called from nvme_ns_head_make_request -> direct_make_request).
      
      This is a deadlock, because blk_queue_enter will block until the inner
      namespace request queue is unfrozen, but that process is blocked because
      the namespace revalidation is trying to update the mpath disk info
      and freeze its request queue (which will never complete because
      of the I/O that is blocked on blk_queue_enter).
      
      Fix this by freezing all the subsystem nsheads request queues before
      executing the passthru command. Given that these commands are infrequent
      we should not worry about this temporary I/O freeze to keep things sane.
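      A sketch of the subsystem-wide freeze helpers used around the
      passthru command (illustrative):
      
      void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
      {
              struct nvme_ns_head *h;
      
              lockdep_assert_held(&subsys->lock);
              list_for_each_entry(h, &subsys->nsheads, entry)
                      if (h->disk)
                              blk_freeze_queue_start(h->disk->queue);
      }
      
      void nvme_mpath_unfreeze(struct nvme_subsystem *subsys)
      {
              struct nvme_ns_head *h;
      
              lockdep_assert_held(&subsys->lock);
              list_for_each_entry(h, &subsys->nsheads, entry)
                      if (h->disk)
                              blk_mq_unfreeze_queue(h->disk->queue);
      }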
      
      Here are the matching hang traces:
      --
      [ 374.465002] INFO: task systemd-udevd:17994 blocked for more than 122 seconds.
      [ 374.472975] Not tainted 5.2.0-rc3-mpdebug+ #42
      [ 374.478522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 374.487274] systemd-udevd D 0 17994 1 0x00000000
      [ 374.493407] Call Trace:
      [ 374.496145] __schedule+0x2ef/0x620
      [ 374.500047] schedule+0x38/0xa0
      [ 374.503569] blk_queue_enter+0x139/0x220
      [ 374.507959] ? remove_wait_queue+0x60/0x60
      [ 374.512540] direct_make_request+0x60/0x130
      [ 374.517219] nvme_ns_head_make_request+0x11d/0x420 [nvme_core]
      [ 374.523740] ? generic_make_request_checks+0x307/0x6f0
      [ 374.529484] generic_make_request+0x10d/0x2e0
      [ 374.534356] submit_bio+0x75/0x140
      [ 374.538163] ? guard_bio_eod+0x32/0xe0
      [ 374.542361] submit_bh_wbc+0x171/0x1b0
      [ 374.546553] block_read_full_page+0x1ed/0x330
      [ 374.551426] ? check_disk_change+0x70/0x70
      [ 374.556008] ? scan_shadow_nodes+0x30/0x30
      [ 374.560588] blkdev_readpage+0x18/0x20
      [ 374.564783] do_read_cache_page+0x301/0x860
      [ 374.569463] ? blkdev_writepages+0x10/0x10
      [ 374.574037] ? prep_new_page+0x88/0x130
      [ 374.578329] ? get_page_from_freelist+0xa2f/0x1280
      [ 374.583688] ? __alloc_pages_nodemask+0x179/0x320
      [ 374.588947] read_cache_page+0x12/0x20
      [ 374.593142] read_dev_sector+0x2d/0xd0
      [ 374.597337] read_lba+0x104/0x1f0
      [ 374.601046] find_valid_gpt+0xfa/0x720
      [ 374.605243] ? string_nocheck+0x58/0x70
      [ 374.609534] ? find_valid_gpt+0x720/0x720
      [ 374.614016] efi_partition+0x89/0x430
      [ 374.618113] ? string+0x48/0x60
      [ 374.621632] ? snprintf+0x49/0x70
      [ 374.625339] ? find_valid_gpt+0x720/0x720
      [ 374.629828] check_partition+0x116/0x210
      [ 374.634214] rescan_partitions+0xb6/0x360
      [ 374.638699] __blkdev_reread_part+0x64/0x70
      [ 374.643377] blkdev_reread_part+0x23/0x40
      [ 374.647860] blkdev_ioctl+0x48c/0x990
      [ 374.651956] block_ioctl+0x41/0x50
      [ 374.655766] do_vfs_ioctl+0xa7/0x600
      [ 374.659766] ? locks_lock_inode_wait+0xb1/0x150
      [ 374.664832] ksys_ioctl+0x67/0x90
      [ 374.668539] __x64_sys_ioctl+0x1a/0x20
      [ 374.672732] do_syscall_64+0x5a/0x1c0
      [ 374.676828] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      [ 374.738474] INFO: task nvmeadm:49141 blocked for more than 123 seconds.
      [ 374.745871] Not tainted 5.2.0-rc3-mpdebug+ #42
      [ 374.751419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 374.760170] nvmeadm D 0 49141 36333 0x00004080
      [ 374.766301] Call Trace:
      [ 374.769038] __schedule+0x2ef/0x620
      [ 374.772939] schedule+0x38/0xa0
      [ 374.776452] blk_mq_freeze_queue_wait+0x59/0x100
      [ 374.781614] ? remove_wait_queue+0x60/0x60
      [ 374.786192] blk_mq_freeze_queue+0x1a/0x20
      [ 374.790773] nvme_update_disk_info.isra.57+0x5f/0x350 [nvme_core]
      [ 374.797582] ? nvme_identify_ns.isra.50+0x71/0xc0 [nvme_core]
      [ 374.804006] __nvme_revalidate_disk+0xe5/0x110 [nvme_core]
      [ 374.810139] nvme_revalidate_disk+0xa6/0x120 [nvme_core]
      [ 374.816078] ? nvme_submit_user_cmd+0x11e/0x320 [nvme_core]
      [ 374.822299] nvme_user_cmd+0x264/0x370 [nvme_core]
      [ 374.827661] nvme_dev_ioctl+0x112/0x1d0 [nvme_core]
      [ 374.833114] do_vfs_ioctl+0xa7/0x600
      [ 374.837117] ? __audit_syscall_entry+0xdd/0x130
      [ 374.842184] ksys_ioctl+0x67/0x90
      [ 374.845891] __x64_sys_ioctl+0x1a/0x20
      [ 374.850082] do_syscall_64+0x5a/0x1c0
      [ 374.854178] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      --
      Reported-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
      Tested-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      b9156dae
  14. 23 Jul 2019, 1 commit
    • nvme: fix multipath crash when ANA is deactivated · 66b20ac0
      Marta Rybczynska authored
      Fix a crash with multipath activated. It happens when the ANA log
      page is larger than MDTS and because of that ANA is disabled.
      The driver then tries to access an unallocated buffer when connecting
      to an NVMe target. The signature is as follows:
      
      [  300.433586] nvme nvme0: ANA log page size (8208) larger than MDTS (8192).
      [  300.435387] nvme nvme0: disabling ANA support.
      [  300.437835] nvme nvme0: creating 4 I/O queues.
      [  300.459132] nvme nvme0: new ctrl: NQN "nqn.0.0.0", addr 10.91.0.1:8009
      [  300.464609] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [  300.466342] #PF error: [normal kernel read fault]
      [  300.467385] PGD 0 P4D 0
      [  300.467987] Oops: 0000 [#1] SMP PTI
      [  300.468787] CPU: 3 PID: 50 Comm: kworker/u8:1 Not tainted 5.0.20kalray+ #4
      [  300.470264] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  300.471532] Workqueue: nvme-wq nvme_scan_work [nvme_core]
      [  300.472724] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
      [  300.474038] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
      [  300.477374] RSP: 0018:ffffa50e80fd7cb8 EFLAGS: 00010296
      [  300.478334] RAX: 0000000000000001 RBX: ffff9130f1872258 RCX: 0000000000000000
      [  300.479784] RDX: ffffffffc06c4c30 RSI: ffff9130edad4280 RDI: ffff9130f1872258
      [  300.481488] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044
      [  300.483203] R10: 0000000000000220 R11: 0000000000000040 R12: ffff9130f18722c0
      [  300.484928] R13: ffff9130f18722d0 R14: ffff9130edad4280 R15: ffff9130f18722c0
      [  300.486626] FS:  0000000000000000(0000) GS:ffff9130f7b80000(0000) knlGS:0000000000000000
      [  300.488538] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  300.489907] CR2: 0000000000000008 CR3: 00000002365e6000 CR4: 00000000000006e0
      [  300.491612] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  300.493303] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  300.494991] Call Trace:
      [  300.495645]  nvme_mpath_add_disk+0x5c/0xb0 [nvme_core]
      [  300.496880]  nvme_validate_ns+0x2ef/0x550 [nvme_core]
      [  300.498105]  ? nvme_identify_ctrl.isra.45+0x6a/0xb0 [nvme_core]
      [  300.499539]  nvme_scan_work+0x2b4/0x370 [nvme_core]
      [  300.500717]  ? __switch_to_asm+0x35/0x70
      [  300.501663]  process_one_work+0x171/0x380
      [  300.502340]  worker_thread+0x49/0x3f0
      [  300.503079]  kthread+0xf8/0x130
      [  300.503795]  ? max_active_store+0x80/0x80
      [  300.504690]  ? kthread_bind+0x10/0x10
      [  300.505502]  ret_from_fork+0x35/0x40
      [  300.506280] Modules linked in: nvme_tcp nvme_rdma rdma_cm iw_cm ib_cm ib_core nvme_fabrics nvme_core xt_physdev ip6table_raw ip6table_mangle ip6table_filter ip6_tables xt_comment iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle iptable_filter veth ebtable_filter ebtable_nat ebtables iptable_raw vxlan ip6_udp_tunnel udp_tunnel sunrpc joydev pcspkr virtio_balloon br_netfilter bridge stp llc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net virtio_console net_failover virtio_blk failover ata_piix serio_raw libata virtio_pci virtio_ring virtio
      [  300.514984] CR2: 0000000000000008
      [  300.515569] ---[ end trace faa2eefad7e7f218 ]---
      [  300.516354] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
      [  300.517330] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
      [  300.520353] RSP: 0018:ffffa50e80fd7cb8 EFLAGS: 00010296
      [  300.521229] RAX: 0000000000000001 RBX: ffff9130f1872258 RCX: 0000000000000000
      [  300.522399] RDX: ffffffffc06c4c30 RSI: ffff9130edad4280 RDI: ffff9130f1872258
      [  300.523560] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044
      [  300.524734] R10: 0000000000000220 R11: 0000000000000040 R12: ffff9130f18722c0
      [  300.525915] R13: ffff9130f18722d0 R14: ffff9130edad4280 R15: ffff9130f18722c0
      [  300.527084] FS:  0000000000000000(0000) GS:ffff9130f7b80000(0000) knlGS:0000000000000000
      [  300.528396] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  300.529440] CR2: 0000000000000008 CR3: 00000002365e6000 CR4: 00000000000006e0
      [  300.530739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  300.531989] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  300.533264] Kernel panic - not syncing: Fatal exception
      [  300.534338] Kernel Offset: 0x17c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [  300.536227] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Condition check refactoring from Christoph Hellwig.
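      The idea, roughly, is to make the multipath code key off whether ANA
      was actually set up rather than off the controller capability bit
      alone (a sketch, not the verbatim patch):
      
      static inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
      {
              /* only true if nvme_mpath_init() kept ANA enabled */
              return ctrl->ana_log_buf != NULL;
      }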
      Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu>
      Tested-by: Jean-Baptiste Riaux <jbriaux@kalray.eu>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      66b20ac0
  15. 10 Jul 2019, 3 commits
  16. 13 May 2019, 1 commit
  17. 01 May 2019, 2 commits
  18. 29 Mar 2019, 1 commit
  19. 20 Feb 2019, 2 commits
  20. 24 Jan 2019, 1 commit
  21. 10 Jan 2019, 1 commit
  22. 08 Dec 2018, 1 commit
  23. 05 Dec 2018, 1 commit
  24. 26 Nov 2018, 1 commit
    • block: make blk_poll() take a parameter on whether to spin or not · 0a1b8b87
      Jens Axboe authored
      blk_poll() has always kept spinning until it found an IO. This is
      fine for SYNC polling, since we need to find one request we have
      pending, but in preparation for ASYNC polling it can be beneficial
      to just check if we have any entries available or not.
      
      Existing callers are converted to pass in 'spin == true', to retain
      the old behavior.
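      For example, an existing synchronous caller is converted like this
      (sketch):
      
      /* before */
      ret = blk_poll(q, cookie);
      
      /* after: spin == true keeps the old "poll until found" behavior */
      ret = blk_poll(q, cookie, true);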
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0a1b8b87