1. 01 Mar, 2018 · 2 commits
  2. 28 Feb, 2018 · 1 commit
    • nvme-multipath: fix sysfs dangerously created links · 9bd82b1a
      Committed by Baegjae Sung
      If multipathing is enabled, each NVMe subsystem creates a head
      namespace (e.g., nvme0n1) and multiple private namespaces
      (e.g., nvme0c0n1 and nvme0c1n1) in sysfs. When creating links for
      private namespaces, the links of the head namespace are used, so the
      namespace creation order must be followed (e.g., nvme0n1 ->
      nvme0c1n1). If the order is not followed, the sysfs links will be
      incomplete or a kernel panic will occur.
      
      The kernel panic was:
        kernel BUG at fs/sysfs/symlink.c:27!
        Call Trace:
          nvme_mpath_add_disk_links+0x5d/0x80 [nvme_core]
          nvme_validate_ns+0x5c2/0x850 [nvme_core]
          nvme_scan_work+0x1af/0x2d0 [nvme_core]
      
      Correct order
      Context A     Context B
      nvme0n1
      nvme0c0n1     nvme0c1n1
      
      Incorrect order
      Context A     Context B
                    nvme0c1n1
      nvme0n1
      nvme0c0n1
      
      nvme_mpath_add_disk (which creates the head namespace) is called
      just before nvme_mpath_add_disk_links (which creates the private
      namespaces). In nvme_mpath_add_disk, the first context acquires
      the subsystem lock and creates the head namespace; other contexts
      wait for the lock, then see GENHD_FL_UP set on the head namespace
      and do nothing. We verified the code with and without multipathing
      using dual-port NVMe SSDs from three vendors.
      Signed-off-by: Baegjae Sung <baegjae@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
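The ordering guarantee described in the commit above can be modelled in userspace. The following is a minimal Python sketch, not the kernel code: `Subsystem`, `add_disk`, `add_disk_links`, `scan`, and `run_demo` are hypothetical names standing in for the subsystem lock, nvme_mpath_add_disk, and nvme_mpath_add_disk_links, and the `head_up` flag plays the role of GENHD_FL_UP.

```python
import threading


class Subsystem:
    """Toy model of an NVMe subsystem with one head namespace (hypothetical)."""

    def __init__(self, head_name):
        self.lock = threading.Lock()
        self.head_name = head_name
        self.head_up = False      # stands in for GENHD_FL_UP on the head
        self.created = []         # creation order, kept for inspection

    def add_disk(self):
        """Create the head namespace exactly once, under the subsystem lock."""
        with self.lock:
            if self.head_up:      # another context already created the head
                return
            self.created.append(self.head_name)
            self.head_up = True

    def add_disk_links(self, private_name):
        """Create a private namespace; the head must already exist."""
        with self.lock:           # lock only protects the toy list here
            if not self.head_up:
                raise RuntimeError("head namespace missing")
            self.created.append(private_name)


def scan(subsys, private_name):
    # Every context calls add_disk() right before add_disk_links(), so the
    # head exists no matter which context happens to run first.
    subsys.add_disk()
    subsys.add_disk_links(private_name)


def run_demo():
    """Scan two controllers concurrently and return the creation order."""
    subsys = Subsystem("nvme0n1")
    threads = [threading.Thread(target=scan, args=(subsys, name))
               for name in ("nvme0c0n1", "nvme0c1n1")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return subsys.created
```

Whichever thread wins the lock, the head namespace is the first entry in the returned list and appears only once; the two private namespaces may follow in either order.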
  3. 26 Feb, 2018 · 1 commit
  4. 22 Feb, 2018 · 3 commits
  5. 14 Feb, 2018 · 5 commits
  6. 13 Feb, 2018 · 1 commit
    • nvme: Don't use a stack buffer for keep-alive command · 0a34e466
      Committed by Roland Dreier
      In nvme_keep_alive() we pass a request with a pointer to an NVMe command on
      the stack into blk_execute_rq_nowait().  However, the block layer doesn't
      guarantee that the request is fully queued before blk_execute_rq_nowait()
      returns.  If not, and the request is queued after nvme_keep_alive() returns,
      then we'll end up using stack memory that might have been overwritten to
      form the NVMe command we pass to hardware.
      
      Fix this by keeping a special command struct in the nvme_ctrl struct right
      next to the delayed work struct used for keep-alives.
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  7. 11 Feb, 2018 · 2 commits
    • nvme_fc: cleanup io completion · c3aedd22
      Committed by James Smart
      There was some old code that dealt with complete_rq being called
      prior to the lldd returning the io completion. This is garbage code.
      The complete_rq routine was being called after eh_timeouts were
      called and it was due to eh_timeouts not being handled properly.
      The timeouts were fixed in prior patches so that in general, a
      timeout will initiate an abort and restart the reset timer, as
      the abort operation will take care of completing things. With the
      reset timer restarted, the erroneous complete_rq calls were eliminated.
      
      So remove the work that was synchronizing complete_rq with io
      completion.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme_fc: correct abort race condition on resets · 3efd6e8e
      Committed by James Smart
      During reset handling, there is live io completing while the reset
      is taking place. The reset path attempts to abort all outstanding io,
      counting the number of ios that were reset. It then waits for those
      ios to be reclaimed from the lldd before continuing.
      
      The transport's logic on io state and flag setting was poor, allowing
      ios to complete simultaneously with the abort request. The completed ios
      were counted, but as the completion had already occurred, the
      completion never reduced the count. As the count never reaches zero, the
      reset/delete never completes.
      
      Tighten it up by unconditionally changing the op state to completed
      when the io done handler is called.  The reset/abort path now changes
      the op state to aborted, but the abort only continues if the op
      state was previously live. If complete, the abort is backed out.
      Thus proper counting of io aborts and their completions is working
      again.
      
      Also removed the TERMIO state on the op as it's redundant with the
      op's aborted state.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
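The state handling described in the commit above can be illustrated with a small userspace model. This is a hedged Python sketch under stated assumptions, not the transport's code: `FcOp`, `xchg_state`, `io_done`, `try_abort`, and `reset` are hypothetical names, and a lock stands in for an atomic exchange on the op state.

```python
import threading

ACTIVE, ABORTED, COMPLETE = "active", "aborted", "complete"


class FcOp:
    """Toy I/O operation whose state is swapped atomically (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.state = ACTIVE

    def xchg_state(self, new_state):
        """Set a new state and return the previous one, like an atomic xchg."""
        with self._lock:
            old, self.state = self.state, new_state
            return old


def io_done(op):
    """Completion handler: unconditionally mark the op completed.

    Returns True if this completion belongs to an abort that was counted,
    so the caller knows to decrement the outstanding-abort count.
    """
    return op.xchg_state(COMPLETE) == ABORTED


def try_abort(op):
    """Abort path: continue only if the op was still live.

    Returns True if the abort should be counted.
    """
    old = op.xchg_state(ABORTED)
    if old != ACTIVE:
        op.xchg_state(old)   # back out: the op had already completed
        return False
    return True


def reset(ops):
    """Try to abort every op and return how many aborts are outstanding."""
    return sum(1 for op in ops if try_abort(op))
```

In this model an op that completed before the reset is never counted, and every counted abort is matched by exactly one later completion that reports `True`, so the outstanding count can always drain to zero.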
  8. 09 Feb, 2018 · 4 commits
  9. 31 Jan, 2018 · 1 commit
    • blk-mq: introduce BLK_STS_DEV_RESOURCE · 86ff7c2a
      Committed by Ming Lei
      This status is returned from driver to block layer if device related
      resource is unavailable, but driver can guarantee that IO dispatch
      will be triggered in future when the resource is available.
      
      Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if driver
      returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
      a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.  BLK_MQ_DELAY_QUEUE is
      3 ms because both scsi-mq and nvmefc are using that magic value.
      
      If a driver can make sure there is in-flight IO, it is safe to return
      BLK_STS_DEV_RESOURCE because:
      
      1) If all in-flight IOs complete before examining SCHED_RESTART in
      blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
      is run immediately in this case by blk_mq_dispatch_rq_list();
      
      2) if there is any in-flight IO after/when examining SCHED_RESTART
      in blk_mq_dispatch_rq_list():
      - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
      - otherwise, this request will be dispatched after any in-flight IO is
        completed via blk_mq_sched_restart()
      
      3) if SCHED_RESTART is set concurrently in context because of
      BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
      cases and make sure IO hang can be avoided.
      
      One invariant is that queue will be rerun if SCHED_RESTART is set.
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Tested-by: Laurence Oberman <loberman@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
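The rerun rules spelled out in the commit above reduce to a small decision table. The sketch below is a simplified Python model of that reasoning, not the block layer's implementation: `Status`, `rerun_policy`, and the action strings are hypothetical, and the 3 ms constant is taken from the commit text.

```python
from enum import Enum

BLK_MQ_DELAY_QUEUE_MS = 3   # the "magic value" named in the commit message


class Status(Enum):
    OK = 0
    RESOURCE = 1        # models BLK_STS_RESOURCE
    DEV_RESOURCE = 2    # models BLK_STS_DEV_RESOURCE


def rerun_policy(status, sched_restart_set):
    """Return (action, delay_ms) for a hardware queue after a dispatch attempt.

    action is one of "none", "run_now", "run_delayed",
    or "wait_for_completion".
    """
    if status is Status.OK:
        return ("none", 0)
    if not sched_restart_set:
        # Nothing else will restart the queue, so run it immediately.
        return ("run_now", 0)
    if status is Status.RESOURCE:
        # No promise of a future completion: rerun after a short delay
        # so the request cannot stall forever.
        return ("run_delayed", BLK_MQ_DELAY_QUEUE_MS)
    # DEV_RESOURCE with SCHED_RESTART set: the driver guarantees in-flight
    # IO, and its completion restarts the queue.
    return ("wait_for_completion", 0)
```

For example, `rerun_policy(Status.RESOURCE, True)` yields a delayed rerun, while the same call with `Status.DEV_RESOURCE` relies on the completion of in-flight IO instead.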
  10. 26 Jan, 2018 · 5 commits
  11. 25 Jan, 2018 · 1 commit
  12. 24 Jan, 2018 · 1 commit
  13. 18 Jan, 2018 · 6 commits
  14. 16 Jan, 2018 · 4 commits
    • nvmet: release a ns reference in nvmet_req_uninit if needed · 423b4487
      Committed by Sagi Grimberg
      nvmet_req_init looked up a namespace and took a reference on it (unless it
      failed prior to that). If the request is uninitialized (in error cases) we
      need to drop that reference if it was taken; otherwise we leak a
      namespace reference when calling nvmet_req_uninit.
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
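The reference-balancing rule in the commit above can be shown with a toy reference counter. This is an illustrative Python sketch only: `Namespace`, `Request`, `req_init`, `req_uninit`, and the `fail_before_lookup` switch are hypothetical stand-ins and do not mirror the nvmet data structures.

```python
class Namespace:
    """Toy namespace with a reference count (hypothetical model)."""

    def __init__(self):
        self.refs = 0

    def get(self):
        self.refs += 1

    def put(self):
        if self.refs <= 0:
            raise RuntimeError("unbalanced put")
        self.refs -= 1


class Request:
    """Toy request; ns stays None until init takes a reference."""

    def __init__(self):
        self.ns = None


def req_init(req, ns, fail_before_lookup=False):
    """Look up the namespace and take a reference, unless we fail earlier."""
    if fail_before_lookup:
        return False          # failed before any reference was taken
    ns.get()
    req.ns = ns
    return True


def req_uninit(req):
    """Drop the namespace reference only if req_init actually took one."""
    if req.ns is not None:
        req.ns.put()
        req.ns = None
```

Calling `req_uninit` is safe on both paths: it releases the reference after a successful init and is a no-op when init failed before the lookup.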
    • nvme-fabrics: fix memory leak when parsing host ID option · df351ef7
      Committed by Roland Dreier
      We use match_strdup() to get a copy of the option string for host ID string, but
      we just pass it to uuid_parse() and don't store the string pointer, so we need to
      kfree() the string after parsing it.
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: fix comment typos in nvme_create_io_queues · 8adb8c14
      Committed by Minwoo Im
      fix comment typos in nvme_create_io_queues() like below.
        _aount_ to _amount_
        _an_    to _can_
      Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: host delete_work and reset_work on separate workqueues · b227c59b
      Committed by Roy Shterman
      We need to ensure that delete_work will be hosted on a different
      workqueue than all the works we flush or cancel from it.
      Otherwise we may hit a circular dependency warning [1].
      
      Also, given that delete_work flushes reset_work, host reset_work
      on nvme_reset_wq and delete_work on nvme_delete_wq. In addition,
      fix the flushing in the individual drivers to flush nvme_delete_wq
      when draining queued deletes.
      
      [1]:
      [  178.491942] =============================================
      [  178.492718] [ INFO: possible recursive locking detected ]
      [  178.493495] 4.9.0-rc4-c844263313a8-lb #3 Tainted: G           OE
      [  178.494382] ---------------------------------------------
      [  178.495160] kworker/5:1/135 is trying to acquire lock:
      [  178.495894]  ("nvme-wq"){++++.+}, at: [<ffffffffa70ac206>] flush_work+0x1a6/0x2d0
      [  178.497670]
                     but task is already holding lock:
      [  178.498499]  ("nvme-wq"){++++.+}, at: [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
      [  178.500343]
                     other info that might help us debug this:
      [  178.501269]  Possible unsafe locking scenario:
      
      [  178.502113]        CPU0
      [  178.502472]        ----
      [  178.502829]   lock("nvme-wq");
      [  178.503716]   lock("nvme-wq");
      [  178.504601]
                      *** DEADLOCK ***
      
      [  178.505441]  May be due to missing lock nesting notation
      
      [  178.506453] 2 locks held by kworker/5:1/135:
      [  178.507068]  #0:  ("nvme-wq"){++++.+}, at: [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
      [  178.509004]  #1:  ((&ctrl->delete_work)){+.+.+.}, at: [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
      [  178.511070]
                     stack backtrace:
      [  178.511693] CPU: 5 PID: 135 Comm: kworker/5:1 Tainted: G           OE   4.9.0-rc4-c844263313a8-lb #3
      [  178.512974] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
      [  178.514247] Workqueue: nvme-wq nvme_del_ctrl_work [nvme_tcp]
      [  178.515071]  ffffc2668175bae0 ffffffffa7450823 ffffffffa88abd80 ffffffffa88abd80
      [  178.516195]  ffffc2668175bb98 ffffffffa70eb012 ffffffffa8d8d90d ffff9c472e9ea700
      [  178.517318]  ffff9c472e9ea700 ffff9c4700000000 ffff9c4700007200 ab83be61bec0d50e
      [  178.518443] Call Trace:
      [  178.518807]  [<ffffffffa7450823>] dump_stack+0x85/0xc2
      [  178.519542]  [<ffffffffa70eb012>] __lock_acquire+0x17d2/0x18f0
      [  178.520377]  [<ffffffffa75839a7>] ? serial8250_console_putchar+0x27/0x30
      [  178.521330]  [<ffffffffa7583980>] ? wait_for_xmitr+0xa0/0xa0
      [  178.522174]  [<ffffffffa70ac1eb>] ? flush_work+0x18b/0x2d0
      [  178.522975]  [<ffffffffa70eb7cb>] lock_acquire+0x11b/0x220
      [  178.523753]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
      [  178.524535]  [<ffffffffa70ac229>] flush_work+0x1c9/0x2d0
      [  178.525291]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
      [  178.526077]  [<ffffffffa70a9cf0>] ? flush_workqueue_prep_pwqs+0x220/0x220
      [  178.527040]  [<ffffffffa70ae7cf>] __cancel_work_timer+0x10f/0x1d0
      [  178.527907]  [<ffffffffa70fecb9>] ? vprintk_default+0x29/0x40
      [  178.528726]  [<ffffffffa71cb507>] ? printk+0x48/0x50
      [  178.529434]  [<ffffffffa70ae8c3>] cancel_delayed_work_sync+0x13/0x20
      [  178.530381]  [<ffffffffc042100b>] nvme_stop_ctrl+0x5b/0x70 [nvme_core]
      [  178.531314]  [<ffffffffc0403dcc>] nvme_del_ctrl_work+0x2c/0x50 [nvme_tcp]
      [  178.532271]  [<ffffffffa70ad741>] process_one_work+0x1e1/0x6a0
      [  178.533101]  [<ffffffffa70ad6c2>] ? process_one_work+0x162/0x6a0
      [  178.533954]  [<ffffffffa70adc4e>] worker_thread+0x4e/0x490
      [  178.534735]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
      [  178.535588]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
      [  178.536441]  [<ffffffffa70b48cf>] kthread+0xff/0x120
      [  178.537149]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
      [  178.538094]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
      [  178.538900]  [<ffffffffa78e332a>] ret_from_fork+0x2a/0x40
      Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  15. 15 Jan, 2018 · 2 commits
    • nvme-pci: allocate device queues storage space at probe · 147b27e4
      Committed by Sagi Grimberg
      Setting 'nvmeq' in nvme_init_request() may race, because
      .init_request is called while switching the io scheduler, which
      may happen while the NVMe device is being reset and its nvme queues
      are being freed and re-created. There is no synchronization between
      the two paths.
      
      This patch changes the nvmeq allocation to occur at probe time, so
      init_request can never dereference a queue that is not allocated.
      
      [   93.268391] kernel BUG at drivers/nvme/host/pci.c:408!
      [   93.274146] invalid opcode: 0000 [#1] SMP
      [   93.278618] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss
      nfsv4 dns_resolver nfs lockd grace fscache sunrpc ipmi_ssif vfat fat
      intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
      kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
      intel_cstate ipmi_si iTCO_vendor_support intel_uncore mxm_wmi mei_me
      ipmi_devintf intel_rapl_perf pcspkr sg ipmi_msghandler lpc_ich dcdbas mei
      shpchp acpi_power_meter wmi dm_multipath ip_tables xfs libcrc32c sd_mod
      mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
      fb_sys_fops ttm drm ahci libahci nvme libata crc32c_intel nvme_core tg3
      megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
      [   93.349071] CPU: 5 PID: 1842 Comm: sh Not tainted 4.15.0-rc2.ming+ #4
      [   93.356256] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
      [   93.364801] task: 00000000fb8abf2a task.stack: 0000000028bd82d1
      [   93.371408] RIP: 0010:nvme_init_request+0x36/0x40 [nvme]
      [   93.377333] RSP: 0018:ffffc90002537ca8 EFLAGS: 00010246
      [   93.383161] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000008
      [   93.391122] RDX: 0000000000000000 RSI: ffff880276ae0000 RDI: ffff88047bae9008
      [   93.399084] RBP: ffff88047bae9008 R08: ffff88047bae9008 R09: 0000000009dabc00
      [   93.407045] R10: 0000000000000004 R11: 000000000000299c R12: ffff880186bc1f00
      [   93.415007] R13: ffff880276ae0000 R14: 0000000000000000 R15: 0000000000000071
      [   93.422969] FS:  00007f33cf288740(0000) GS:ffff88047ba80000(0000) knlGS:0000000000000000
      [   93.431996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   93.438407] CR2: 00007f33cf28e000 CR3: 000000047e5bb006 CR4: 00000000001606e0
      [   93.446368] Call Trace:
      [   93.449103]  blk_mq_alloc_rqs+0x231/0x2a0
      [   93.453579]  blk_mq_sched_alloc_tags.isra.8+0x42/0x80
      [   93.459214]  blk_mq_init_sched+0x7e/0x140
      [   93.463687]  elevator_switch+0x5a/0x1f0
      [   93.467966]  ? elevator_get.isra.17+0x52/0xc0
      [   93.472826]  elv_iosched_store+0xde/0x150
      [   93.477299]  queue_attr_store+0x4e/0x90
      [   93.481580]  kernfs_fop_write+0xfa/0x180
      [   93.485958]  __vfs_write+0x33/0x170
      [   93.489851]  ? __inode_security_revalidate+0x4c/0x60
      [   93.495390]  ? selinux_file_permission+0xda/0x130
      [   93.500641]  ? _cond_resched+0x15/0x30
      [   93.504815]  vfs_write+0xad/0x1a0
      [   93.508512]  SyS_write+0x52/0xc0
      [   93.512113]  do_syscall_64+0x61/0x1a0
      [   93.516199]  entry_SYSCALL64_slow_path+0x25/0x25
      [   93.521351] RIP: 0033:0x7f33ce96aab0
      [   93.525337] RSP: 002b:00007ffe57570238 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [   93.533785] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f33ce96aab0
      [   93.541746] RDX: 0000000000000006 RSI: 00007f33cf28e000 RDI: 0000000000000001
      [   93.549707] RBP: 00007f33cf28e000 R08: 000000000000000a R09: 00007f33cf288740
      [   93.557669] R10: 00007f33cf288740 R11: 0000000000000246 R12: 00007f33cec42400
      [   93.565630] R13: 0000000000000006 R14: 0000000000000001 R15: 0000000000000000
      [   93.573592] Code: 4c 8d 40 08 4c 39 c7 74 16 48 8b 00 48 8b 04 08 48 85 c0
      74 16 48 89 86 78 01 00 00 31 c0 c3 8d 4a 01 48 63 c9 48 c1 e1 03 eb de <0f>
      0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 85 f6 53 48 89
      [   93.594676] RIP: nvme_init_request+0x36/0x40 [nvme] RSP: ffffc90002537ca8
      [   93.602273] ---[ end trace 810dde3993e5f14e ]---
      Reported-by: Yi Zhang <yi.zhang@redhat.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
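The idea behind allocating queue storage at probe time can be sketched in a few lines. The following Python model is an assumption-laden illustration rather than the driver's code: `Queue`, `Device`, `reset`, `init_request`, and the `live` flag are hypothetical, and the index mapping is deliberately simplified.

```python
class Queue:
    """Toy queue; 'live' flips during reset but the storage persists."""

    def __init__(self, qid):
        self.qid = qid
        self.live = False


class Device:
    """Toy device that owns storage for every queue it may ever use."""

    def __init__(self, max_queues):
        # Probe time: allocate storage for all possible queues up front.
        self.queues = [Queue(qid) for qid in range(max_queues)]

    def reset(self, nr_active):
        """Tear down and bring up queues without freeing their storage."""
        for q in self.queues:
            q.live = False
        for q in self.queues[:nr_active]:
            q.live = True

    def init_request(self, queue_idx):
        """Bind a request to its queue; the slot is always allocated."""
        return self.queues[queue_idx]
```

Because the list is built once in the constructor and never shrunk, `init_request` returns the same object before and after a reset, which is the property the commit relies on when the scheduler is switched concurrently with a reset.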
    • nvme-pci: serialize pci resets · 79c48ccf
      Committed by Sagi Grimberg
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  16. 11 Jan, 2018 · 1 commit