1. 23 December 2021, 1 commit
  2. 27 October 2021, 1 commit
  3. 21 October 2021, 3 commits
    • nvme: drop scan_lock and always kick requeue list when removing namespaces · 2b81a5f0
      Committed by Hannes Reinecke
      When reading the partition table during the initial scan hits an I/O
      error, the I/O will hang with the scan_mutex held:
      
      [<0>] do_read_cache_page+0x49b/0x790
      [<0>] read_part_sector+0x39/0xe0
      [<0>] read_lba+0xf9/0x1d0
      [<0>] efi_partition+0xf1/0x7f0
      [<0>] bdev_disk_changed+0x1ee/0x550
      [<0>] blkdev_get_whole+0x81/0x90
      [<0>] blkdev_get_by_dev+0x128/0x2e0
      [<0>] device_add_disk+0x377/0x3c0
      [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core]
      [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core]
      [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core]
      [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core]
      [<0>] nvme_scan_work+0x168/0x310 [nvme_core]
      [<0>] process_one_work+0x231/0x420
      
      and trying to delete the controller will deadlock as it tries to grab
      the scan mutex:
      
      [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core]
      [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core]
      [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core]
      
      As we're now properly ordering the namespace list, there is no need to
      hold the scan_mutex in nvme_mpath_clear_ctrl_paths() anymore.
      And we always need to kick the requeue list, as the path will be marked
      as unusable and I/O will be requeued _without_ a current path.
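
      For reference, a minimal sketch (an illustration of the post-patch shape,
      not the verbatim hunk) of nvme_mpath_clear_ctrl_paths() once the
      scan_mutex is gone and the requeue list is kicked unconditionally:

      void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl)
      {
      	struct nvme_ns *ns;

      	/* only the namespace list lock is taken, no scan_mutex */
      	down_read(&ctrl->namespaces_rwsem);
      	list_for_each_entry(ns, &ctrl->namespaces, list) {
      		nvme_mpath_clear_current_path(ns);
      		/* always kick: I/O may sit requeued without a current path */
      		kblockd_schedule_work(&ns->head->requeue_work);
      	}
      	up_read(&ctrl->namespaces_rwsem);
      }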
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-multipath: add error handling support for add_disk() · 11384580
      Committed by Luis Chamberlain
      We never checked for errors on add_disk() as this function
      returned void. Now that this is fixed, use the shiny new
      error handling.
      
      Since we can now tell for sure when a disk was added, set the
      NVME_NSHEAD_DISK_LIVE bit only once the disk has actually been
      added successfully.
      
      Nothing to do here as the cleanup is done elsewhere. We take
      care and use test_and_set_bit() because it protects against
      two nvme paths simultaneously calling device_add_disk() on the
      same namespace head.
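
      A hedged sketch of the resulting flow (assuming device_add_disk() now
      returns an int; not the verbatim hunk):

      static void nvme_mpath_set_live(struct nvme_ns *ns)
      {
      	struct nvme_ns_head *head = ns->head;
      	int rc;

      	if (!head->disk)
      		return;

      	/* test_and_set_bit() protects against two paths racing to add */
      	if (!test_and_set_bit(NVME_NSHEAD_DISK_LIVE, &head->flags)) {
      		rc = device_add_disk(&head->subsys->dev, head->disk,
      				     nvme_ns_id_attr_groups);
      		if (rc) {
      			/* adding failed: the head disk is not live */
      			clear_bit(NVME_NSHEAD_DISK_LIVE, &head->flags);
      			return;
      		}
      	}

      	/* a live path exists, wake up any requeued I/O */
      	kblockd_schedule_work(&head->requeue_work);
      }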
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: generate uevent once a multipath namespace is operational again · f6f09c15
      Committed by Hannes Reinecke
      When fast_io_fail_tmo is set, I/O will be aborted while recovery is
      still ongoing. This causes MD to set the namespace to failed, and
      no further I/O will be submitted to that namespace.
      
      However, once the recovery succeeds and the namespace becomes
      operational again, the NVMe subsystem doesn't send a notification,
      so MD cannot automatically reinstate operation and requires
      manual interaction.
      
      This patch will send a KOBJ_CHANGE uevent per multipathed namespace
      once the underlying controller transitions to LIVE, allowing an automatic
      MD reassembly with these udev rules:
      
      /etc/udev/rules.d/65-md-auto-re-add.rules:
      SUBSYSTEM!="block", GOTO="md_end"
      
      ACTION!="change", GOTO="md_end"
      ENV{ID_FS_TYPE}!="linux_raid_member", GOTO="md_end"
      PROGRAM="/sbin/md_raid_auto_readd.sh $devnode"
      LABEL="md_end"
      
      /sbin/md_raid_auto_readd.sh:
      
      #!/bin/bash
      # bash is required for the ${!MD_VARNAME} indirect expansion below
      MDADM=/sbin/mdadm
      DEVNAME=$1
      
      # import MD_UUID and friends from the RAID member device
      export $(${MDADM} --examine --export ${DEVNAME})
      
      if [ -z "${MD_UUID}" ]; then
          exit 1
      fi
      
      # resolve the assembled array this member belongs to
      UUID_LINK=$(readlink /dev/disk/by-id/md-uuid-${MD_UUID})
      MD_DEVNAME=${UUID_LINK##*/}
      export $(${MDADM} --detail --export /dev/${MD_DEVNAME})
      if [ -z "${MD_METADATA}" ] ; then
          exit 1
      fi
      if [ $(cat /sys/block/${MD_DEVNAME}/md/degraded) != 1 ]; then
          echo "${MD_DEVNAME}: array not degraded, nothing to do"
          exit 0
      fi
      MD_STATE=$(cat /sys/block/${MD_DEVNAME}/md/array_state)
      if [ ${MD_STATE} != "clean" ] ; then
          echo "${MD_DEVNAME}: array state ${MD_STATE}, cannot re-add"
          exit 1
      fi
      # only re-add when this device is currently a spare in the array
      MD_VARNAME="MD_DEVICE_dev_${DEVNAME##*/}_ROLE"
      if [ ${!MD_VARNAME} = "spare" ] ; then
          ${MDADM} --manage /dev/${MD_DEVNAME} --re-add ${DEVNAME}
      fi
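
      The kernel side only has to emit the event; a hedged sketch of the
      notification itself (the exact call site in the controller LIVE
      transition may differ from this illustration):

      	/*
      	 * The multipath namespace just became operational again, let
      	 * userspace (e.g. the udev rule above) know about it.
      	 */
      	disk_uevent(ns->head->disk, KOBJ_CHANGE);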
      
      Changes to v2:
      - Add udev rules example to description
      Changes to v1:
      - use disk_uevent() as suggested by hch
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  4. 18 October 2021, 2 commits
  5. 14 October 2021, 1 commit
  6. 14 September 2021, 1 commit
  7. 06 September 2021, 2 commits
    • nvme-multipath: revalidate paths during rescan · e7d65803
      Committed by Hannes Reinecke
      When triggering a rescan due to a namespace resize we will be
      receiving AENs on every controller, triggering a rescan of all
      attached namespaces. If multipath is active, only the current path and
      the ns_head disk will be updated; the other paths will still refer to
      the old size until AENs for the remaining controllers are received.
      
      If I/O comes in before that it might be routed to one of the old
      paths, triggering an I/O failure with 'access beyond end of device'.
      With this patch the old paths are excluded from multipath path
      selection until the controller serving these paths has been rescanned.
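
      A hedged sketch of the mechanism (modelled on the NVME_NS_READY flag
      mentioned in the notes below, not the verbatim patch): paths whose
      capacity no longer matches the ns_head disk are marked not ready and
      the cached current_path is dropped:

      void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
      {
      	struct nvme_ns_head *head = ns->head;
      	sector_t capacity = get_capacity(head->disk);
      	int node;

      	list_for_each_entry_rcu(ns, &head->list, siblings) {
      		/* a path still showing the old size is not ready for I/O */
      		if (capacity != get_capacity(ns->disk))
      			clear_bit(NVME_NS_READY, &ns->flags);
      	}

      	for_each_node(node)
      		rcu_assign_pointer(head->current_path[node], NULL);
      }

      Path selection then treats a path without NVME_NS_READY as disabled
      until the controller serving it has been rescanned.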
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      [dwagner: - introduce NVME_NS_READY flag instead of NVME_NS_INVALIDATE
                - use 'revalidate' instead of 'invalidate', which follows
                  the zoned device code path
                - clear NVME_NS_READY before clearing current_path]
      Signed-off-by: Daniel Wagner <dwagner@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-multipath: set QUEUE_FLAG_NOWAIT · d32d3d0b
      Committed by Christoph Hellwig
      The nvme multipathing code just dispatches bios to one of the blk-mq
      based paths and never blocks on its own, so set QUEUE_FLAG_NOWAIT
      to support REQ_NOWAIT bios.
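
      A minimal sketch of what this amounts to when the ns_head queue is set
      up (assumed placement, not the verbatim hunk):

      	/*
      	 * The bio-based multipath device never blocks on its own, so
      	 * REQ_NOWAIT submitters can be let through.
      	 */
      	blk_queue_flag_set(QUEUE_FLAG_NOWAIT, head->disk->queue);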
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
  8. 13 August 2021, 1 commit
  9. 21 July 2021, 1 commit
  10. 16 June 2021, 2 commits
  11. 03 June 2021, 1 commit
  12. 01 June 2021, 2 commits
  13. 13 May 2021, 1 commit
    • nvmet: use new ana_log_size instead the old one · e181811b
      Committed by Hou Pu
      The new ana_log_size should be used instead of the old one,
      otherwise a kernel NULL pointer dereference will happen as below:
      
      [   38.957849][   T69] BUG: kernel NULL pointer dereference, address: 000000000000003c
      [   38.975550][   T69] #PF: supervisor write access in kernel mode
      [   38.975955][   T69] #PF: error_code(0x0002) - not-present page
      [   38.976905][   T69] PGD 0 P4D 0
      [   38.979388][   T69] Oops: 0002 [#1] SMP NOPTI
      [   38.980488][   T69] CPU: 0 PID: 69 Comm: kworker/0:2 Not tainted 5.12.0+ #54
      [   38.981254][   T69] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [   38.982502][   T69] Workqueue: events nvme_loop_execute_work
      [   38.985219][   T69] RIP: 0010:memcpy_orig+0x68/0x10f
      [   38.986203][   T69] Code: 83 c2 20 eb 44 48 01 d6 48 01 d7 48 83 ea 20 0f 1f 00 48 83 ea 20 4c 8b 46 f8 4c 8b 4e f0 4c 8b 56 e8 4c 8b 5e e0 48 8d 76 e0 <4c> 89 47 f8 4c 89 4f f0 4c 89 57 e8 4c 89 5f e0 48 8d 7f e0 73 d2
      [   38.987677][   T69] RSP: 0018:ffffc900001b7d48 EFLAGS: 00000287
      [   38.987996][   T69] RAX: 0000000000000020 RBX: 0000000000000024 RCX: 0000000000000010
      [   38.988327][   T69] RDX: ffffffffffffffe4 RSI: ffff8881084bc004 RDI: 0000000000000044
      [   38.988620][   T69] RBP: 0000000000000024 R08: 0000000100000000 R09: 0000000000000000
      [   38.988991][   T69] R10: 0000000100000000 R11: 0000000000000001 R12: 0000000000000024
      [   38.989289][   T69] R13: ffff8881084bc000 R14: 0000000000000000 R15: 0000000000000024
      [   38.989845][   T69] FS:  0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
      [   38.990234][   T69] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   38.990490][   T69] CR2: 000000000000003c CR3: 00000001085b2000 CR4: 00000000000006f0
      [   38.991105][   T69] Call Trace:
      [   38.994157][   T69]  sg_copy_buffer+0xb8/0xf0
      [   38.995357][   T69]  nvmet_copy_to_sgl+0x48/0x6d
      [   38.995565][   T69]  nvmet_execute_get_log_page_ana+0xd4/0x1cb
      [   38.995792][   T69]  nvmet_execute_get_log_page+0xc9/0x146
      [   38.995992][   T69]  nvme_loop_execute_work+0x3e/0x44
      [   38.996181][   T69]  process_one_work+0x1c3/0x3c0
      [   38.996393][   T69]  worker_thread+0x44/0x3d0
      [   38.996600][   T69]  ? cancel_delayed_work+0x90/0x90
      [   38.996804][   T69]  kthread+0xf7/0x130
      [   38.996961][   T69]  ? kthread_create_worker_on_cpu+0x70/0x70
      [   38.997171][   T69]  ret_from_fork+0x22/0x30
      [   38.997705][   T69] Modules linked in:
      [   38.998741][   T69] CR2: 000000000000003c
      [   39.000104][   T69] ---[ end trace e719927b609d0fa0 ]---
      
      Fixes: 5e1f6899 ("nvme-multipath: fix double initialization of ANA state")
      Signed-off-by: Hou Pu <houpu.main@gmail.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  14. 12 May 2021, 1 commit
  15. 04 May 2021, 1 commit
  16. 22 April 2021, 1 commit
  17. 15 April 2021, 3 commits
  18. 06 April 2021, 1 commit
  19. 03 April 2021, 1 commit
  20. 10 February 2021, 1 commit
  21. 29 January 2021, 1 commit
  22. 26 January 2021, 1 commit
  23. 25 January 2021, 1 commit
  24. 05 December 2020, 1 commit
  25. 02 December 2020, 1 commit
    • nvme-fabrics: reject I/O to offline device · 8c4dfea9
      Committed by Victor Gladkov
      Commands get stuck while a host NVMe-oF controller is in the reconnect
      state.  The controller enters the reconnect state when it loses the
      connection with the target.  It tries to reconnect every 10 seconds
      (default) until a successful reconnect or until the reconnect timeout
      is reached.  The default reconnect timeout is 10 minutes.
      
      Applications expect commands to complete with success or error within
      a certain timeout (30 seconds by default).  The NVMe host enforces that
      timeout while it is connected, but during reconnect the timeout is not
      enforced and commands may get stuck for a long period or even forever.
      
      To fix this long delay due to the default timeout, introduce a new
      "fast_io_fail_tmo" session parameter.  The timeout is measured in
      seconds from the start of the controller reconnect, and any command
      beyond that timeout is rejected.  The new parameter value may be passed
      during 'connect'.  The default value of -1 means no timeout (similar to
      the current behavior).
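
      A hedged sketch of one way the timeout can be wired up (identifiers
      such as failfast_work and NVME_CTRL_FAILFAST_EXPIRED are illustrative
      assumptions, not quoted from the patch): a delayed work is armed when
      the controller starts reconnecting and, once fast_io_fail_tmo seconds
      have passed, queued I/O is released to fail instead of waiting for the
      full reconnect timeout:

      static void nvme_failfast_work(struct work_struct *work)
      {
      	struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
      			struct nvme_ctrl, failfast_work);

      	if (ctrl->state != NVME_CTRL_CONNECTING)
      		return;

      	/* fast_io_fail_tmo expired: stop holding back I/O */
      	set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
      	dev_info(ctrl->device, "failfast expired\n");
      	/* requeued multipath I/O can now fail over or complete in error */
      	nvme_kick_requeue_lists(ctrl);
      }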
      Signed-off-by: Victor Gladkov <victor.gladkov@kioxia.com>
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Chao Leng <lengchao@huawei.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  26. 25 September 2020, 1 commit
  27. 22 August 2020, 3 commits
  28. 29 July 2020, 3 commits
    • nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths · fbd6a42d
      Committed by Hannes Reinecke
      When nvme_round_robin_path() finds a valid namespace we should use it;
      falling back to __nvme_find_path() for non-optimized paths causes the
      result from nvme_round_robin_path() to be ignored.
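
      A hedged sketch of the corrected selection order (not the verbatim
      patch): once the round-robin policy is in effect its result is
      returned directly instead of being second-guessed for non-optimized
      paths:

      inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
      {
      	int node = numa_node_id();
      	struct nvme_ns *ns;

      	ns = srcu_dereference(head->current_path[node], &head->srcu);
      	if (unlikely(!ns))
      		return __nvme_find_path(head, node);

      	/* use the round-robin result as-is, optimized or not */
      	if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR)
      		return nvme_round_robin_path(head, node, ns);

      	if (unlikely(!nvme_path_is_optimized(ns)))
      		return __nvme_find_path(head, node);
      	return ns;
      }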
      
      Fixes: 75c10e73 ("nvme-multipath: round-robin I/O policy")
      Signed-off-by: Martin Wilck <mwilck@suse.com>
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-multipath: fix logic for non-optimized paths · 3f6e3246
      Committed by Martin Wilck
      Handle the special case where we have exactly one optimized path,
      which we should then keep using.
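
      A hedged sketch of the fixed fallback at the end of
      nvme_round_robin_path() (not the verbatim hunk): the current path
      'old' is kept when it is the only optimized path, or the only usable
      path at all:

      	/*
      	 * The sibling loop above skipped 'old'.  Keep using it if it is
      	 * the only optimized path, or if no other usable path was found.
      	 */
      	if (!nvme_path_is_disabled(old) &&
      	    (old->ana_state == NVME_ANA_OPTIMIZED ||
      	     (!found && old->ana_state == NVME_ANA_NONOPTIMIZED)))
      		return old;

      	if (!found)
      		return NULL;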
      
      Fixes: 75c10e73 ("nvme-multipath: round-robin I/O policy")
      Signed-off-by: Martin Wilck <mwilck@suse.com>
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: fix deadlock in disconnect during scan_work and/or ana_work · ecca390e
      Committed by Sagi Grimberg
      A deadlock happens in the following scenario with multipath:
      1) scan_work(nvme0) detects a new nsid while nvme0
          is an optimized path to it; path nvme1 happens to be
          inaccessible.
      
      2) Before scan_work is complete, an nvme0 disconnect is initiated.
          nvme_delete_ctrl_sync() sets the nvme0 state to NVME_CTRL_DELETING.
      
      3) scan_work (from step 1) attempts to submit I/O,
          but nvme_path_is_optimized() observes nvme0 is not LIVE.
          Since nvme1 is a possible path, I/O is requeued and scan_work hangs.
      
      --
      Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      --
      
      4) Delete also hangs in flush_work(ctrl->scan_work)
          from nvme_remove_namespaces().
      
      Similarly, a deadlock with ana_work may happen: if ana_work has started
      and calls nvme_mpath_set_live and device_add_disk, it will
      trigger I/O. When we trigger disconnect, I/O will block because
      our accessible (optimized) path is disconnecting, but the alternate
      path is inaccessible, so I/O blocks. Then disconnect tries to flush
      the ana_work and hangs.
      
      [  605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
      [  605.552087] Call Trace:
      [  605.552683]  __schedule+0x2b9/0x6c0
      [  605.553507]  schedule+0x42/0xb0
      [  605.554201]  io_schedule+0x16/0x40
      [  605.555012]  do_read_cache_page+0x438/0x830
      [  605.556925]  read_cache_page+0x12/0x20
      [  605.557757]  read_dev_sector+0x27/0xc0
      [  605.558587]  amiga_partition+0x4d/0x4c5
      [  605.561278]  check_partition+0x154/0x244
      [  605.562138]  rescan_partitions+0xae/0x280
      [  605.563076]  __blkdev_get+0x40f/0x560
      [  605.563830]  blkdev_get+0x3d/0x140
      [  605.564500]  __device_add_disk+0x388/0x480
      [  605.565316]  device_add_disk+0x13/0x20
      [  605.566070]  nvme_mpath_set_live+0x5e/0x130 [nvme_core]
      [  605.567114]  nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
      [  605.568197]  nvme_update_ana_state+0xca/0xe0 [nvme_core]
      [  605.569360]  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      [  605.571385]  nvme_read_ana_log+0x76/0x100 [nvme_core]
      [  605.572376]  nvme_ana_work+0x15/0x20 [nvme_core]
      [  605.573330]  process_one_work+0x1db/0x380
      [  605.574144]  worker_thread+0x4d/0x400
      [  605.574896]  kthread+0x104/0x140
      [  605.577205]  ret_from_fork+0x35/0x40
      [  605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
      [  605.579239]       Tainted: G           OE     5.3.5-050305-generic #201910071830
      [  605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  605.582320] nvme            D    0 14044  14043 0x00000000
      [  605.583424] Call Trace:
      [  605.583935]  __schedule+0x2b9/0x6c0
      [  605.584625]  schedule+0x42/0xb0
      [  605.585290]  schedule_timeout+0x203/0x2f0
      [  605.588493]  wait_for_completion+0xb1/0x120
      [  605.590066]  __flush_work+0x123/0x1d0
      [  605.591758]  __cancel_work_timer+0x10e/0x190
      [  605.593542]  cancel_work_sync+0x10/0x20
      [  605.594347]  nvme_mpath_stop+0x2f/0x40 [nvme_core]
      [  605.595328]  nvme_stop_ctrl+0x12/0x50 [nvme_core]
      [  605.596262]  nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
      [  605.597333]  nvme_sysfs_delete+0x5c/0x70 [nvme_core]
      [  605.598320]  dev_attr_store+0x17/0x30
      
      Fix this by introducing a new state: NVME_CTRL_DELETING_NOIO, which
      indicates the phase of controller deletion where I/O cannot be allowed
      to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to
      be issued to the bottom device, and only after we flush the ana_work
      and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces)
      do we change the state to NVME_CTRL_DELETING_NOIO. Also we prevent
      ana_work from re-firing by aborting early if we are not LIVE, so we
      should be safe here.
      
      In addition, change the transport drivers to follow the updated state
      machine.
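
      A hedged sketch of the multipath-side consequence (not the verbatim
      patch): a path whose controller is merely NVME_CTRL_DELETING is still
      usable, and only the non-LIVE states including NVME_CTRL_DELETING_NOIO
      disable it:

      static bool nvme_path_is_disabled(struct nvme_ns *ns)
      {
      	/*
      	 * NVME_CTRL_DELETING is not treated as a disabled path: the
      	 * controller is still connected, so I/O can complete instead of
      	 * bouncing back to the requeue list and stalling deletion.
      	 */
      	if (ns->ctrl->state != NVME_CTRL_LIVE &&
      	    ns->ctrl->state != NVME_CTRL_DELETING)
      		return true;
      	if (test_bit(NVME_NS_ANA_PENDING, &ns->flags) ||
      	    test_bit(NVME_NS_REMOVING, &ns->flags))
      		return true;
      	return false;
      }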
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Reported-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>