1. 01 Aug 2019, 2 commits
    • nvme: fix controller removal race with scan work · 0157ec8d
      Committed by Sagi Grimberg
      With multipath enabled, nvme_scan_work() can read from the device
      (through nvme_mpath_add_disk()) and hang [1]. With fabrics, once
      ctrl->state is set to NVME_CTRL_DELETING, the reads hang (see
      nvmf_check_ready()) and the mpath stack device make_request blocks
      as long as head->list is not empty. However, when head->list
      consists of only DELETING/DEAD controllers, we should not block,
      but rather fail immediately.
      
      In addition, before going ahead and removing the namespaces, clear
      the current path and kick the requeue list so that pending requests
      fail fast when they are requeued.
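      
      A minimal sketch of the availability check this implies (the helper
      name nvme_available_path() is an assumption for illustration, not
      necessarily what the final patch uses):
      --
      static bool nvme_available_path(struct nvme_ns_head *head)
      {
              struct nvme_ns *ns;
      
              /* Any path whose controller is not DELETING/DEAD can still
               * serve I/O, so the mpath device should keep requeueing. */
              list_for_each_entry_rcu(ns, &head->list, siblings) {
                      if (ns->ctrl->state != NVME_CTRL_DELETING &&
                          ns->ctrl->state != NVME_CTRL_DEAD)
                              return true;
              }
              return false;
      }
      --
      The mpath make_request path would then fail the bio immediately,
      rather than requeue it, whenever this check returns false.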
      
      [1]:
      --
        INFO: task kworker/u4:3:166 blocked for more than 120 seconds.
              Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        kworker/u4:3    D    0   166      2 0x80004000
        Workqueue: nvme-wq nvme_scan_work
        Call Trace:
         __schedule+0x851/0x1400
         schedule+0x99/0x210
         io_schedule+0x21/0x70
         do_read_cache_page+0xa57/0x1330
         read_cache_page+0x4a/0x70
         read_dev_sector+0xbf/0x380
         amiga_partition+0xc4/0x1230
         check_partition+0x30f/0x630
         rescan_partitions+0x19a/0x980
         __blkdev_get+0x85a/0x12f0
         blkdev_get+0x2a5/0x790
         __device_add_disk+0xe25/0x1250
         device_add_disk+0x13/0x20
         nvme_mpath_set_live+0x172/0x2b0
         nvme_update_ns_ana_state+0x130/0x180
         nvme_set_ns_ana_state+0x9a/0xb0
         nvme_parse_ana_log+0x1c3/0x4a0
         nvme_mpath_add_disk+0x157/0x290
         nvme_validate_ns+0x1017/0x1bd0
         nvme_scan_work+0x44d/0x6a0
         process_one_work+0x7d7/0x1240
         worker_thread+0x8e/0xff0
         kthread+0x2c3/0x3b0
         ret_from_fork+0x35/0x40
      
         INFO: task kworker/u4:1:1034 blocked for more than 120 seconds.
              Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        kworker/u4:1    D    0  1034      2 0x80004000
        Workqueue: nvme-delete-wq nvme_delete_ctrl_work
        Call Trace:
         __schedule+0x851/0x1400
         schedule+0x99/0x210
         schedule_timeout+0x390/0x830
         wait_for_completion+0x1a7/0x310
         __flush_work+0x241/0x5d0
         flush_work+0x10/0x20
         nvme_remove_namespaces+0x85/0x3d0
         nvme_do_delete_ctrl+0xb4/0x1e0
         nvme_delete_ctrl_work+0x15/0x20
         process_one_work+0x7d7/0x1240
         worker_thread+0x8e/0xff0
         kthread+0x2c3/0x3b0
         ret_from_fork+0x35/0x40
      --
      Reported-by: Logan Gunthorpe <logang@deltatee.com>
      Tested-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme: fix a possible deadlock when passthru commands sent to a multipath device · b9156dae
      Committed by Sagi Grimberg
      When the user issues a command with side effects, we will end up freezing
      the namespace request queue when updating disk info (and the same for
      the corresponding mpath disk node).
      
      However, we are not freezing the mpath node request queue,
      which means that mpath I/O can still come in and block on blk_queue_enter
      (called from nvme_ns_head_make_request -> direct_make_request).
      
      This is a deadlock: blk_queue_enter blocks until the inner namespace
      request queue is unfrozen, but that will never happen because the
      namespace revalidation is trying to update the mpath disk info and
      freeze its request queue, which in turn cannot complete because of
      the I/O that is blocked on blk_queue_enter.
      
      Fix this by freezing all of the subsystem's nshead request queues
      before executing the passthru command. Given that these commands are
      infrequent, the temporary I/O freeze is an acceptable price for
      keeping things sane.
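      
      A minimal sketch of the subsystem-wide freeze this describes,
      assuming helpers along these lines (the names are illustrative):
      --
      static void nvme_mpath_start_freeze(struct nvme_subsystem *subsys)
      {
              struct nvme_ns_head *h;
      
              lockdep_assert_held(&subsys->lock);
              list_for_each_entry(h, &subsys->nsheads, entry)
                      if (h->disk)
                              blk_freeze_queue_start(h->disk->queue);
      }
      --
      Matching wait/unfreeze helpers would call blk_mq_freeze_queue_wait()
      and blk_mq_unfreeze_queue() on the same queues, and the passthru
      path would bracket command execution with them.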
      
      Here are the matching hang traces:
      --
      [ 374.465002] INFO: task systemd-udevd:17994 blocked for more than 122 seconds.
      [ 374.472975] Not tainted 5.2.0-rc3-mpdebug+ #42
      [ 374.478522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 374.487274] systemd-udevd D 0 17994 1 0x00000000
      [ 374.493407] Call Trace:
      [ 374.496145] __schedule+0x2ef/0x620
      [ 374.500047] schedule+0x38/0xa0
      [ 374.503569] blk_queue_enter+0x139/0x220
      [ 374.507959] ? remove_wait_queue+0x60/0x60
      [ 374.512540] direct_make_request+0x60/0x130
      [ 374.517219] nvme_ns_head_make_request+0x11d/0x420 [nvme_core]
      [ 374.523740] ? generic_make_request_checks+0x307/0x6f0
      [ 374.529484] generic_make_request+0x10d/0x2e0
      [ 374.534356] submit_bio+0x75/0x140
      [ 374.538163] ? guard_bio_eod+0x32/0xe0
      [ 374.542361] submit_bh_wbc+0x171/0x1b0
      [ 374.546553] block_read_full_page+0x1ed/0x330
      [ 374.551426] ? check_disk_change+0x70/0x70
      [ 374.556008] ? scan_shadow_nodes+0x30/0x30
      [ 374.560588] blkdev_readpage+0x18/0x20
      [ 374.564783] do_read_cache_page+0x301/0x860
      [ 374.569463] ? blkdev_writepages+0x10/0x10
      [ 374.574037] ? prep_new_page+0x88/0x130
      [ 374.578329] ? get_page_from_freelist+0xa2f/0x1280
      [ 374.583688] ? __alloc_pages_nodemask+0x179/0x320
      [ 374.588947] read_cache_page+0x12/0x20
      [ 374.593142] read_dev_sector+0x2d/0xd0
      [ 374.597337] read_lba+0x104/0x1f0
      [ 374.601046] find_valid_gpt+0xfa/0x720
      [ 374.605243] ? string_nocheck+0x58/0x70
      [ 374.609534] ? find_valid_gpt+0x720/0x720
      [ 374.614016] efi_partition+0x89/0x430
      [ 374.618113] ? string+0x48/0x60
      [ 374.621632] ? snprintf+0x49/0x70
      [ 374.625339] ? find_valid_gpt+0x720/0x720
      [ 374.629828] check_partition+0x116/0x210
      [ 374.634214] rescan_partitions+0xb6/0x360
      [ 374.638699] __blkdev_reread_part+0x64/0x70
      [ 374.643377] blkdev_reread_part+0x23/0x40
      [ 374.647860] blkdev_ioctl+0x48c/0x990
      [ 374.651956] block_ioctl+0x41/0x50
      [ 374.655766] do_vfs_ioctl+0xa7/0x600
      [ 374.659766] ? locks_lock_inode_wait+0xb1/0x150
      [ 374.664832] ksys_ioctl+0x67/0x90
      [ 374.668539] __x64_sys_ioctl+0x1a/0x20
      [ 374.672732] do_syscall_64+0x5a/0x1c0
      [ 374.676828] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      [ 374.738474] INFO: task nvmeadm:49141 blocked for more than 123 seconds.
      [ 374.745871] Not tainted 5.2.0-rc3-mpdebug+ #42
      [ 374.751419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 374.760170] nvmeadm D 0 49141 36333 0x00004080
      [ 374.766301] Call Trace:
      [ 374.769038] __schedule+0x2ef/0x620
      [ 374.772939] schedule+0x38/0xa0
      [ 374.776452] blk_mq_freeze_queue_wait+0x59/0x100
      [ 374.781614] ? remove_wait_queue+0x60/0x60
      [ 374.786192] blk_mq_freeze_queue+0x1a/0x20
      [ 374.790773] nvme_update_disk_info.isra.57+0x5f/0x350 [nvme_core]
      [ 374.797582] ? nvme_identify_ns.isra.50+0x71/0xc0 [nvme_core]
      [ 374.804006] __nvme_revalidate_disk+0xe5/0x110 [nvme_core]
      [ 374.810139] nvme_revalidate_disk+0xa6/0x120 [nvme_core]
      [ 374.816078] ? nvme_submit_user_cmd+0x11e/0x320 [nvme_core]
      [ 374.822299] nvme_user_cmd+0x264/0x370 [nvme_core]
      [ 374.827661] nvme_dev_ioctl+0x112/0x1d0 [nvme_core]
      [ 374.833114] do_vfs_ioctl+0xa7/0x600
      [ 374.837117] ? __audit_syscall_entry+0xdd/0x130
      [ 374.842184] ksys_ioctl+0x67/0x90
      [ 374.845891] __x64_sys_ioctl+0x1a/0x20
      [ 374.850082] do_syscall_64+0x5a/0x1c0
      [ 374.854178] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      --
      Reported-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
      Tested-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  2. 23 Jul 2019, 1 commit
    • nvme: fix multipath crash when ANA is deactivated · 66b20ac0
      Committed by Marta Rybczynska
      Fix a crash with multipath activated. It happens when the ANA log
      page is larger than the MDTS and, because of that, ANA is disabled.
      The driver then tries to access an unallocated buffer when connecting
      to an NVMe target. The signature is as follows:
      
      [  300.433586] nvme nvme0: ANA log page size (8208) larger than MDTS (8192).
      [  300.435387] nvme nvme0: disabling ANA support.
      [  300.437835] nvme nvme0: creating 4 I/O queues.
      [  300.459132] nvme nvme0: new ctrl: NQN "nqn.0.0.0", addr 10.91.0.1:8009
      [  300.464609] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [  300.466342] #PF error: [normal kernel read fault]
      [  300.467385] PGD 0 P4D 0
      [  300.467987] Oops: 0000 [#1] SMP PTI
      [  300.468787] CPU: 3 PID: 50 Comm: kworker/u8:1 Not tainted 5.0.20kalray+ #4
      [  300.470264] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [  300.471532] Workqueue: nvme-wq nvme_scan_work [nvme_core]
      [  300.472724] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
      [  300.474038] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
      [  300.477374] RSP: 0018:ffffa50e80fd7cb8 EFLAGS: 00010296
      [  300.478334] RAX: 0000000000000001 RBX: ffff9130f1872258 RCX: 0000000000000000
      [  300.479784] RDX: ffffffffc06c4c30 RSI: ffff9130edad4280 RDI: ffff9130f1872258
      [  300.481488] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044
      [  300.483203] R10: 0000000000000220 R11: 0000000000000040 R12: ffff9130f18722c0
      [  300.484928] R13: ffff9130f18722d0 R14: ffff9130edad4280 R15: ffff9130f18722c0
      [  300.486626] FS:  0000000000000000(0000) GS:ffff9130f7b80000(0000) knlGS:0000000000000000
      [  300.488538] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  300.489907] CR2: 0000000000000008 CR3: 00000002365e6000 CR4: 00000000000006e0
      [  300.491612] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  300.493303] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  300.494991] Call Trace:
      [  300.495645]  nvme_mpath_add_disk+0x5c/0xb0 [nvme_core]
      [  300.496880]  nvme_validate_ns+0x2ef/0x550 [nvme_core]
      [  300.498105]  ? nvme_identify_ctrl.isra.45+0x6a/0xb0 [nvme_core]
      [  300.499539]  nvme_scan_work+0x2b4/0x370 [nvme_core]
      [  300.500717]  ? __switch_to_asm+0x35/0x70
      [  300.501663]  process_one_work+0x171/0x380
      [  300.502340]  worker_thread+0x49/0x3f0
      [  300.503079]  kthread+0xf8/0x130
      [  300.503795]  ? max_active_store+0x80/0x80
      [  300.504690]  ? kthread_bind+0x10/0x10
      [  300.505502]  ret_from_fork+0x35/0x40
      [  300.506280] Modules linked in: nvme_tcp nvme_rdma rdma_cm iw_cm ib_cm ib_core nvme_fabrics nvme_core xt_physdev ip6table_raw ip6table_mangle ip6table_filter ip6_tables xt_comment iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle iptable_filter veth ebtable_filter ebtable_nat ebtables iptable_raw vxlan ip6_udp_tunnel udp_tunnel sunrpc joydev pcspkr virtio_balloon br_netfilter bridge stp llc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net virtio_console net_failover virtio_blk failover ata_piix serio_raw libata virtio_pci virtio_ring virtio
      [  300.514984] CR2: 0000000000000008
      [  300.515569] ---[ end trace faa2eefad7e7f218 ]---
      [  300.516354] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core]
      [  300.517330] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48
      [  300.520353] RSP: 0018:ffffa50e80fd7cb8 EFLAGS: 00010296
      [  300.521229] RAX: 0000000000000001 RBX: ffff9130f1872258 RCX: 0000000000000000
      [  300.522399] RDX: ffffffffc06c4c30 RSI: ffff9130edad4280 RDI: ffff9130f1872258
      [  300.523560] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044
      [  300.524734] R10: 0000000000000220 R11: 0000000000000040 R12: ffff9130f18722c0
      [  300.525915] R13: ffff9130f18722d0 R14: ffff9130edad4280 R15: ffff9130f18722c0
      [  300.527084] FS:  0000000000000000(0000) GS:ffff9130f7b80000(0000) knlGS:0000000000000000
      [  300.528396] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  300.529440] CR2: 0000000000000008 CR3: 00000002365e6000 CR4: 00000000000006e0
      [  300.530739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  300.531989] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  300.533264] Kernel panic - not syncing: Fatal exception
      [  300.534338] Kernel Offset: 0x17c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [  300.536227] ---[ end Kernel panic - not syncing: Fatal exception ]---
      
      Condition check refactoring from Christoph Hellwig.
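      
      A minimal sketch of the kind of guard this implies in
      nvme_mpath_add_disk() (the exact condition is an assumption based on
      the description above, not necessarily the final patch):
      --
              /* ANA may have been disabled (e.g. the log page is larger
               * than the MDTS), in which case ana_log_buf was never
               * allocated and must not be parsed. */
              if (!nvme_ctrl_use_ana(ns->ctrl) || !ns->ctrl->ana_log_buf) {
                      ns->ana_state = NVME_ANA_OPTIMIZED;
                      nvme_mpath_set_live(ns);
                      return;
              }
      
              mutex_lock(&ns->ctrl->ana_lock);
              nvme_parse_ana_log(ns->ctrl, ns, nvme_set_ns_ana_state);
              mutex_unlock(&ns->ctrl->ana_lock);
      --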
      Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu>
      Tested-by: Jean-Baptiste Riaux <jbriaux@kalray.eu>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  3. 10 Jul 2019, 3 commits
  4. 13 May 2019, 1 commit
  5. 01 May 2019, 2 commits
  6. 29 Mar 2019, 1 commit
  7. 20 Feb 2019, 2 commits
  8. 24 Jan 2019, 1 commit
  9. 10 Jan 2019, 1 commit
  10. 08 Dec 2018, 1 commit
  11. 05 Dec 2018, 1 commit
  12. 26 Nov 2018, 1 commit
    • block: make blk_poll() take a parameter on whether to spin or not · 0a1b8b87
      Committed by Jens Axboe
      blk_poll() has always kept spinning until it found an IO. This is
      fine for SYNC polling, since we need to find one request we have
      pending, but in preparation for ASYNC polling it can be beneficial
      to just check if we have any entries available or not.
      
      Existing callers are converted to pass in 'spin == true', to retain
      the old behavior.
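      
      A hedged sketch of what converted call sites look like, assuming the
      new trailing bool is the "spin" flag described above:
      --
              /* Sync callers keep the historical behaviour and spin until
               * an entry is found. */
              ret = blk_poll(q, cookie, true);
      
              /* An async poller could instead peek once and return even if
               * nothing has completed yet. */
              ret = blk_poll(q, cookie, false);
      --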
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. 19 Nov 2018, 1 commit
    • block: have ->poll_fn() return number of entries polled · 85f4d4b6
      Committed by Jens Axboe
      We currently only really support sync poll, i.e. poll with one I/O
      in flight. This prepares us for supporting async poll.
      
      Note that the returned value isn't necessarily 100% accurate. If poll
      races with IRQ completion, we assume that the fact that the task is now
      runnable means we found at least one entry. In reality it could be more
      than 1, or not even 1. This is fine, the caller will just need to take
      this into account.
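      
      A sketch of how a caller might interpret the new return value
      (illustrative only; the poll_fn hook and its placement are assumed
      from the description above):
      --
              int ret = q->poll_fn(q, cookie);
      
              if (ret > 0) {
                      /* At least one entry completed; the count may be
                       * approximate if polling raced with an IRQ. */
                      return ret;
              }
              /* ret == 0: nothing found; the caller decides whether to
               * keep spinning. */
      --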
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. 16 Nov 2018, 1 commit
  15. 09 Nov 2018, 1 commit
  16. 17 Oct 2018, 1 commit
  17. 02 Oct 2018, 2 commits
    • nvme: take node locality into account when selecting a path · f3334447
      Committed by Christoph Hellwig
      Make current_path an array with an entry for every possible node, and
      cache the best path on a per-node basis.  Take the node distance into
      account when selecting it.  This is primarily useful for dual-ported PCIe
      devices which are connected to PCIe root ports on different sockets.
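      
      A sketch of the per-node lookup this describes, assuming a per-node
      current_path[] array and using the controller's device node for the
      distance calculation (names and details are illustrative):
      --
      static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head,
                      int node)
      {
              int found_distance = INT_MAX;
              struct nvme_ns *found = NULL, *ns;
      
              list_for_each_entry_rcu(ns, &head->list, siblings) {
                      int distance;
      
                      if (ns->ctrl->state != NVME_CTRL_LIVE)
                              continue;
                      /* Prefer the path whose controller is NUMA-closest
                       * to the node issuing the I/O. */
                      distance = node_distance(node,
                                      dev_to_node(ns->ctrl->dev));
                      if (distance < found_distance) {
                              found_distance = distance;
                              found = ns;
                      }
              }
      
              if (found)
                      rcu_assign_pointer(head->current_path[node], found);
              return found;
      }
      --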
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
    • nvme: call nvme_complete_rq when nvmf_check_ready fails for mpath I/O · 783f4a44
      Committed by James Smart
      When an I/O is rejected by nvmf_check_ready() due to validation of the
      controller state, nvmf_fail_nonready_command() will normally return
      BLK_STS_RESOURCE to requeue and retry.  However, if the controller is
      dying or the I/O is marked for NVMe multipath, the I/O is failed so that
      the controller can terminate or so that the I/O can be issued on a
      different path.  Unfortunately, as this reject point is before the
      transport has accepted the command, blk-mq ends up completing the I/O
      and never calls nvme_complete_rq(), which is where multipath may preserve
      or re-route the I/O. The end result is that the device user sees an
      EIO error.
      
      Example: single path connectivity, controller is under load, and a reset
      is induced.  An I/O is received:
      
        a) while the reset state has been set but the queues have yet to be
           stopped; or
        b) after queues are started (at end of reset) but before the reconnect
           has completed.
      
      The I/O finishes with an EIO status.
      
      This patch makes the following changes:
      
        - Adds the HOST_PATH_ERROR pathing status from TP4028.
        - Modifies the reject point such that the command appears to queue
          successfully, but is actually completed with the new pathing
          status and nvme_complete_rq() is called (sketched below).
        - nvme_complete_rq() recognizes the new status, avoids resetting the
          controller (that was likely already done in order to get this new
          status), and calls the multipather to clear the current path that
          errored.  This allows the next command (retry or new command) to
          select a new path if there is one.
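      
      A hedged sketch of the modified reject point (the status name comes
      from TP4028 as described; the exact conditions are assumptions):
      --
      blk_status_t nvmf_fail_nonready_command(struct nvme_ctrl *ctrl,
                      struct request *rq)
      {
              if (ctrl->state != NVME_CTRL_DELETING &&
                  ctrl->state != NVME_CTRL_DEAD &&
                  !(rq->cmd_flags & REQ_NVME_MPATH))
                      return BLK_STS_RESOURCE;        /* requeue and retry */
      
              /* Pretend the command was accepted, then complete it with a
               * host path error so nvme_complete_rq() can clear the failed
               * path and let multipath retry on another one. */
              nvme_req(rq)->status = NVME_SC_HOST_PATH_ERROR;
              blk_mq_start_request(rq);
              nvme_complete_rq(rq);
              return BLK_STS_OK;
      }
      --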
      Signed-off-by: James Smart <jsmart2021@gmail.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  18. 28 Sep 2018, 2 commits
  19. 26 Sep 2018, 1 commit
  20. 07 Aug 2018, 1 commit
  21. 28 Jul 2018, 2 commits
  22. 11 Jun 2018, 1 commit
  23. 03 May 2018, 2 commits
  24. 26 Mar 2018, 1 commit
  25. 09 Mar 2018, 1 commit
  26. 07 Mar 2018, 1 commit
  27. 01 Mar 2018, 1 commit
  28. 28 Feb 2018, 1 commit
    • nvme-multipath: fix sysfs dangerously created links · 9bd82b1a
      Committed by Baegjae Sung
      If multipathing is enabled, each NVMe subsystem creates a head
      namespace (e.g., nvme0n1) and multiple private namespaces
      (e.g., nvme0c0n1 and nvme0c1n1) in sysfs. When creating links for
      the private namespaces, the head namespace's links are used, so the
      namespace creation order must be followed (e.g., nvme0n1 before
      nvme0c1n1). If the order is not followed, the sysfs links will be
      incomplete or a kernel panic will occur.
      
      The kernel panic was:
        kernel BUG at fs/sysfs/symlink.c:27!
        Call Trace:
          nvme_mpath_add_disk_links+0x5d/0x80 [nvme_core]
          nvme_validate_ns+0x5c2/0x850 [nvme_core]
          nvme_scan_work+0x1af/0x2d0 [nvme_core]
      
      Correct order
      Context A     Context B
      nvme0n1
      nvme0c0n1     nvme0c1n1
      
      Incorrect order
      Context A     Context B
                    nvme0c1n1
      nvme0n1
      nvme0c0n1
      
      nvme_mpath_add_disk (which creates the head namespace) is called
      just before nvme_mpath_add_disk_links (which creates the links for
      the private namespaces). In nvme_mpath_add_disk, the first context
      acquires the subsystem lock and creates the head namespace; the
      other contexts do nothing, because after waiting to acquire the lock
      they see GENHD_FL_UP already set on the head namespace. We verified
      the code with and without multipathing using dual-port NVMe SSDs
      from three vendors.
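      
      A sketch of the serialization described above (treat the details as
      illustrative rather than the exact patch):
      --
      void nvme_mpath_add_disk(struct nvme_ns_head *head)
      {
              if (!head->disk)
                      return;
      
              mutex_lock(&head->subsys->lock);
              /* Only the first context to get here registers the head
               * disk; later contexts see GENHD_FL_UP already set and skip
               * it, so the links created afterwards always have a
               * registered head namespace to point at. */
              if (!(head->disk->flags & GENHD_FL_UP))
                      device_add_disk(&head->subsys->dev, head->disk);
              mutex_unlock(&head->subsys->lock);
      }
      --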
      Signed-off-by: Baegjae Sung <baegjae@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
  29. 11 Jan 2018, 2 commits
  30. 20 Nov 2017, 1 commit