1. 26 Mar 2018, 2 commits
    • nvme: Add fault injection feature · b9e03857
      Committed by Thomas Tai
      Linux's fault injection framework provides a systematic way to support
      error injection via debugfs in the /sys/kernel/debug directory. This
      patch uses the framework to add error injection to the NVMe driver. The
      fault injection source code is stored in a separate file and is only
      linked if the CONFIG_FAULT_INJECTION_DEBUG_FS kernel config is selected.
      
      Once error injection is enabled, NVME_SC_INVALID_OPCODE with no
      retry will be injected into nvme_end_request. Users can change the
      default status code and the no-retry flag via debugfs (a short sketch
      follows the example output below). The following example shows how to
      enable and inject an error. For more examples, refer to
      Documentation/fault-injection/nvme-fault-injection.txt
      
      How to enable nvme fault injection:
      
      First, enable the CONFIG_FAULT_INJECTION_DEBUG_FS kernel config and
      recompile the kernel. After booting the new kernel, do the
      following.
      
      How to inject an error:
      
      mount /dev/nvme0n1 /mnt
      echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/times
      echo 100 > /sys/kernel/debug/nvme0n1/fault_inject/probability
      cp a.file /mnt
      
      Expected Result:
      
      cp: cannot stat ‘/mnt/a.file’: Input/output error
      
      Message from dmesg:
      
      FAULT_INJECTION: forcing a failure.
      name fault_inject, interval 1, probability 100, space 0, times 1
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc8+ #2
      Hardware name: innotek GmbH VirtualBox/VirtualBox,
      BIOS VirtualBox 12/01/2006
      Call Trace:
        <IRQ>
        dump_stack+0x5c/0x7d
        should_fail+0x148/0x170
        nvme_should_fail+0x2f/0x50 [nvme_core]
        nvme_process_cq+0xe7/0x1d0 [nvme]
        nvme_irq+0x1e/0x40 [nvme]
        __handle_irq_event_percpu+0x3a/0x190
        handle_irq_event_percpu+0x30/0x70
        handle_irq_event+0x36/0x60
        handle_fasteoi_irq+0x78/0x120
        handle_irq+0xa7/0x130
        ? tick_irq_enter+0xa8/0xc0
        do_IRQ+0x43/0xc0
        common_interrupt+0xa2/0xa2
        </IRQ>
      RIP: 0010:native_safe_halt+0x2/0x10
      RSP: 0018:ffffffff82003e90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd
      RAX: ffffffff817a10c0 RBX: ffffffff82012480 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: 0000000000000000 R08: 000000008e38ce64 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82012480
      R13: ffffffff82012480 R14: 0000000000000000 R15: 0000000000000000
        ? __sched_text_end+0x4/0x4
        default_idle+0x18/0xf0
        do_idle+0x150/0x1d0
        cpu_startup_entry+0x6f/0x80
        start_kernel+0x4c4/0x4e4
        ? set_init_arg+0x55/0x55
        secondary_startup_64+0xa5/0xb0
        print_req_error: I/O error, dev nvme0n1, sector 9240
      EXT4-fs error (device nvme0n1): ext4_find_entry:1436:
      inode #2: comm cp: reading directory lblock 0
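      
      The default status code and no-retry flag mentioned above can also be
      overridden before triggering the fault. A minimal sketch, assuming the
      fault_inject directory exposes 'status' and 'dont_retry' attributes as
      described in the referenced documentation:
      
      echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/times
      echo 100 > /sys/kernel/debug/nvme0n1/fault_inject/probability
      echo 0 > /sys/kernel/debug/nvme0n1/fault_inject/dont_retry
      echo 0x2 > /sys/kernel/debug/nvme0n1/fault_inject/status
      
      (Here 0x2 corresponds to NVME_SC_INVALID_FIELD and dont_retry=0 allows
      retries; the attribute names are taken from the documentation referenced
      above.)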
      Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
      Reviewed-by: Eric Saint-Etienne <eric.saint.etienne@oracle.com>
      Signed-off-by: Karl Volz <karl.volz@oracle.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme: use define instead of magic value for identify size · 42595eb7
      Committed by Minwoo Im
      NVME_IDENTIFY_DATA_SIZE was added to linux/nvme.h by the following commit:
        commit 0add5e8e ("nvmet: use NVME_IDENTIFY_DATA_SIZE")
      
      Use the NVME_IDENTIFY_DATA_SIZE define instead of the magic value
      0x1000 for the identify data size.
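      
      A minimal sketch of the substitution this implies (illustrative only;
      the exact call sites in the host driver may differ in detail):
      
      	/* before: magic number for the Identify data buffer */
      	id = kmalloc(0x1000, GFP_KERNEL);
      
      	/* after: NVME_IDENTIFY_DATA_SIZE (4096) from linux/nvme.h */
      	id = kmalloc(NVME_IDENTIFY_DATA_SIZE, GFP_KERNEL);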
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 09 Mar 2018, 1 commit
  3. 28 Feb 2018, 1 commit
    • nvme-multipath: fix sysfs dangerously created links · 9bd82b1a
      Committed by Baegjae Sung
      If multipathing is enabled, each NVMe subsystem creates a head
      namespace (e.g., nvme0n1) and multiple private namespaces
      (e.g., nvme0c0n1 and nvme0c1n1) in sysfs. When creating links for
      private namespaces, the links of the head namespace are used, so the
      namespace creation order must be followed (e.g., nvme0n1 before
      nvme0c1n1). If the order is not followed, the sysfs links will be
      incomplete or a kernel panic will occur.
      
      The kernel panic was:
        kernel BUG at fs/sysfs/symlink.c:27!
        Call Trace:
          nvme_mpath_add_disk_links+0x5d/0x80 [nvme_core]
          nvme_validate_ns+0x5c2/0x850 [nvme_core]
          nvme_scan_work+0x1af/0x2d0 [nvme_core]
      
      Correct order
      Context A     Context B
      nvme0n1
      nvme0c0n1     nvme0c1n1
      
      Incorrect order
      Context A     Context B
                    nvme0c1n1
      nvme0n1
      nvme0c0n1
      
      nvme_mpath_add_disk (which creates the head namespace) is called
      just before nvme_mpath_add_disk_links (which creates the links for
      the private namespaces). In nvme_mpath_add_disk, the first context
      acquires the subsystem lock and creates the head namespace; the other
      contexts, after waiting for the lock, do nothing because GENHD_FL_UP
      is already set on the head namespace. We verified the code with and
      without multipathing using dual-port NVMe SSDs from three vendors.
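      
      A rough sketch of that check-under-lock pattern (simplified pseudocode,
      not the exact upstream code; helper and field names are assumptions):
      
      void nvme_mpath_add_disk_sketch(struct nvme_ns_head *head)
      {
              mutex_lock(&head->subsys->lock);
              if (!(head->disk->flags & GENHD_FL_UP)) {
                      /* first context only: register the head namespace
                       * (nvme0n1); add_disk() sets GENHD_FL_UP, so later
                       * contexts fall through without doing anything */
                      device_add_disk(&head->subsys->dev, head->disk);
              }
              mutex_unlock(&head->subsys->lock);
              /* only after this point may the nvme0cXnY -> nvme0n1 sysfs
               * links be created safely */
      }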
      Signed-off-by: Baegjae Sung <baegjae@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
  4. 14 Feb 2018, 2 commits
  5. 13 Feb 2018, 1 commit
    • nvme: Don't use a stack buffer for keep-alive command · 0a34e466
      Committed by Roland Dreier
      In nvme_keep_alive() we pass a request with a pointer to an NVMe command on
      the stack into blk_execute_rq_nowait().  However, the block layer doesn't
      guarantee that the request is fully queued before blk_execute_rq_nowait()
      returns.  If not, and the request is queued after nvme_keep_alive() returns,
      then we'll end up using stack memory that might have been overwritten to
      form the NVMe command we pass to hardware.
      
      Fix this by keeping a special command struct in the nvme_ctrl struct right
      next to the delayed work struct used for keep-alives.
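      
      A minimal sketch of the idea (field names here are illustrative, not
      necessarily the ones used by the patch):
      
      struct nvme_ctrl {
              /* ... */
              struct delayed_work ka_work;    /* keep-alive worker */
              struct nvme_command ka_cmd;     /* long-lived buffer: the queued
                                               * request can keep pointing at
                                               * it after nvme_keep_alive()
                                               * returns */
      };
      
      /* build the command in the controller-lifetime buffer, not on the stack */
      memset(&ctrl->ka_cmd, 0, sizeof(ctrl->ka_cmd));
      ctrl->ka_cmd.common.opcode = nvme_admin_keep_alive;
      /* ...then allocate the request against &ctrl->ka_cmd and hand it to
       * blk_execute_rq_nowait(); the buffer outlives this call */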
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  6. 09 Feb 2018, 4 commits
  7. 26 Jan 2018, 3 commits
  8. 16 Jan 2018, 1 commit
    • nvme: host delete_work and reset_work on separate workqueues · b227c59b
      Committed by Roy Shterman
      We need to ensure that delete_work will be hosted on a different
      workqueue than all the works we flush or cancel from it.
      Otherwise we may hit a circular dependency warning [1].
      
      Also, given that delete_work flushes reset_work, host reset_work
      on nvme_reset_wq and delete_work on nvme_delete_wq. In addition,
      fix the flushing in the individual drivers to flush nvme_delete_wq
      when draining queued deletes.
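      
      A sketch of the split (workqueue flags and names are plausible but not
      verified against the final patch):
      
      struct workqueue_struct *nvme_reset_wq;
      struct workqueue_struct *nvme_delete_wq;
      
      static int __init nvme_core_init_sketch(void)
      {
              nvme_reset_wq = alloc_workqueue("nvme-reset-wq",
                              WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);
              nvme_delete_wq = alloc_workqueue("nvme-delete-wq",
                              WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);
              if (!nvme_reset_wq || !nvme_delete_wq)
                      return -ENOMEM; /* cleanup elided in this sketch */
              return 0;
      }
      
      /* reset_work is queued on nvme_reset_wq and delete_work on
       * nvme_delete_wq, so flushing one from within the other never
       * recurses onto the workqueue currently executing the work. */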
      
      [1]:
      [  178.491942] =============================================
      [  178.492718] [ INFO: possible recursive locking detected ]
      [  178.493495] 4.9.0-rc4-c844263313a8-lb #3 Tainted: G           OE
      [  178.494382] ---------------------------------------------
      [  178.495160] kworker/5:1/135 is trying to acquire lock:
      [  178.495894]  (
      [  178.496120] "nvme-wq"
      [  178.496471] ){++++.+}
      [  178.496599] , at:
      [  178.496921] [<ffffffffa70ac206>] flush_work+0x1a6/0x2d0
      [  178.497670]
                     but task is already holding lock:
      [  178.498499]  (
      [  178.498724] "nvme-wq"
      [  178.499074] ){++++.+}
      [  178.499202] , at:
      [  178.499520] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
      [  178.500343]
                     other info that might help us debug this:
      [  178.501269]  Possible unsafe locking scenario:
      
      [  178.502113]        CPU0
      [  178.502472]        ----
      [  178.502829]   lock(
      [  178.503115] "nvme-wq"
      [  178.503467] );
      [  178.503716]   lock(
      [  178.504001] "nvme-wq"
      [  178.504353] );
      [  178.504601]
                      *** DEADLOCK ***
      
      [  178.505441]  May be due to missing lock nesting notation
      
      [  178.506453] 2 locks held by kworker/5:1/135:
      [  178.507068]  #0:
      [  178.507330]  (
      [  178.507598] "nvme-wq"
      [  178.507726] ){++++.+}
      [  178.508079] , at:
      [  178.508173] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
      [  178.509004]  #1:
      [  178.509265]  (
      [  178.509532] (&ctrl->delete_work)
      [  178.509795] ){+.+.+.}
      [  178.510145] , at:
      [  178.510239] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
      [  178.511070]
                     stack backtrace:
      :
      [  178.511693] CPU: 5 PID: 135 Comm: kworker/5:1 Tainted: G           OE   4.9.0-rc4-c844263313a8-lb #3
      [  178.512974] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
      [  178.514247] Workqueue: nvme-wq nvme_del_ctrl_work [nvme_tcp]
      [  178.515071]  ffffc2668175bae0 ffffffffa7450823 ffffffffa88abd80 ffffffffa88abd80
      [  178.516195]  ffffc2668175bb98 ffffffffa70eb012 ffffffffa8d8d90d ffff9c472e9ea700
      [  178.517318]  ffff9c472e9ea700 ffff9c4700000000 ffff9c4700007200 ab83be61bec0d50e
      [  178.518443] Call Trace:
      [  178.518807]  [<ffffffffa7450823>] dump_stack+0x85/0xc2
      [  178.519542]  [<ffffffffa70eb012>] __lock_acquire+0x17d2/0x18f0
      [  178.520377]  [<ffffffffa75839a7>] ? serial8250_console_putchar+0x27/0x30
      [  178.521330]  [<ffffffffa7583980>] ? wait_for_xmitr+0xa0/0xa0
      [  178.522174]  [<ffffffffa70ac1eb>] ? flush_work+0x18b/0x2d0
      [  178.522975]  [<ffffffffa70eb7cb>] lock_acquire+0x11b/0x220
      [  178.523753]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
      [  178.524535]  [<ffffffffa70ac229>] flush_work+0x1c9/0x2d0
      [  178.525291]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
      [  178.526077]  [<ffffffffa70a9cf0>] ? flush_workqueue_prep_pwqs+0x220/0x220
      [  178.527040]  [<ffffffffa70ae7cf>] __cancel_work_timer+0x10f/0x1d0
      [  178.527907]  [<ffffffffa70fecb9>] ? vprintk_default+0x29/0x40
      [  178.528726]  [<ffffffffa71cb507>] ? printk+0x48/0x50
      [  178.529434]  [<ffffffffa70ae8c3>] cancel_delayed_work_sync+0x13/0x20
      [  178.530381]  [<ffffffffc042100b>] nvme_stop_ctrl+0x5b/0x70 [nvme_core]
      [  178.531314]  [<ffffffffc0403dcc>] nvme_del_ctrl_work+0x2c/0x50 [nvme_tcp]
      [  178.532271]  [<ffffffffa70ad741>] process_one_work+0x1e1/0x6a0
      [  178.533101]  [<ffffffffa70ad6c2>] ? process_one_work+0x162/0x6a0
      [  178.533954]  [<ffffffffa70adc4e>] worker_thread+0x4e/0x490
      [  178.534735]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
      [  178.535588]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
      [  178.536441]  [<ffffffffa70b48cf>] kthread+0xff/0x120
      [  178.537149]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
      [  178.538094]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
      [  178.538900]  [<ffffffffa78e332a>] ret_from_fork+0x2a/0x40
      Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  9. 15 Jan 2018, 1 commit
  10. 11 Jan 2018, 2 commits
  11. 09 Jan 2018, 1 commit
  12. 08 Jan 2018, 4 commits
  13. 29 Dec 2017, 2 commits
  14. 15 Dec 2017, 4 commits
    • nvme: setup streams after initializing namespace head · 654b4a4a
      Committed by Keith Busch
      Fixes a NULL pointer dereference.
      Reported-by: Arnav Dawn <a.dawn@samsung.com>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: check hw sectors before setting chunk sectors · 249159c5
      Committed by Keith Busch
      Some devices with IDs matching the "stripe" quirk don't actually have
      this quirk, and don't have an MDTS value. When MDTS is not set, the
      driver sets the max sectors to UINT_MAX, which is not a power of 2,
      hitting a BUG_ON from blk_queue_chunk_sectors. This patch skips setting
      chunk sectors for such devices.
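      
      A sketch of the guard this adds (simplified; the quirk and field names
      are the ones used elsewhere in the driver):
      
      if ((ctrl->quirks & NVME_QUIRK_STRIPE_SIZE) &&
          is_power_of_2(ctrl->max_hw_sectors))
              blk_queue_chunk_sectors(q, ctrl->max_hw_sectors);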
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: call blk_integrity_unregister after queue is cleaned up · bd9f5d65
      Committed by Ming Lei
      In the I/O completion path, bio_integrity_advance() is often called,
      and it in turn calls blk_get_integrity(). But
      blk_integrity_unregister() clears the buffer pointed to by
      queue->integrity, so blk_integrity->profile becomes NULL,
      blk_get_integrity() returns NULL, and we eventually get a kernel oops [1].
      
      This patch fixes the issue by calling blk_integrity_unregister() after
      blk_cleanup_queue().
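      
      A sketch of the corrected teardown order in namespace removal
      (simplified):
      
      del_gendisk(ns->disk);
      blk_cleanup_queue(ns->queue);           /* completions can no longer run */
      blk_integrity_unregister(ns->disk);     /* now safe to drop the profile */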
      
      [1] kernel oops log
      [  122.068007] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
      [  122.076760] IP: bio_integrity_advance+0x3d/0xf0
      [  122.081815] PGD 0 P4D 0
      [  122.084641] Oops: 0000 [#1] SMP
      [  122.088142] Modules linked in: sunrpc ipmi_ssif intel_rapl vfat fat x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass mei_me ipmi_si crct10dif_pclmul crc32_pclmul sg mei ghash_clmulni_intel mxm_wmi ipmi_devintf iTCO_wdt intel_cstate intel_uncore pcspkr intel_rapl_perf iTCO_vendor_support dcdbas ipmi_msghandler lpc_ich acpi_power_meter shpchp wmi dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel ahci nvme tg3 libahci nvme_core i2c_core libata ptp megaraid_sas pps_core dm_mirror dm_region_hash dm_log dm_mod
      [  122.149577] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.0-11.el7a.x86_64 #1
      [  122.157635] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
      [  122.166179] task: ffff8802ff1e8000 task.stack: ffffc90000130000
      [  122.172785] RIP: 0010:bio_integrity_advance+0x3d/0xf0
      [  122.178419] RSP: 0018:ffff88047fc03d70 EFLAGS: 00010006
      [  122.184248] RAX: ffff880473b08000 RBX: ffff880458c71a80 RCX: ffff880473b08248
      [  122.192209] RDX: 0000000000000000 RSI: 000000000000003c RDI: ffffc900038d7ba0
      [  122.200171] RBP: ffff88047fc03d78 R08: 0000000000000001 R09: ffffffffa01a78b5
      [  122.208132] R10: ffff88047fc1eda0 R11: ffff880458c71ad0 R12: 0000000000007800
      [  122.216094] R13: 0000000000000000 R14: 0000000000007800 R15: ffff880473a39b40
      [  122.224056] FS:  0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
      [  122.233083] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  122.239494] CR2: 000000000000000a CR3: 0000000001c09002 CR4: 00000000001606e0
      [  122.247455] Call Trace:
      [  122.250183]  <IRQ>
      [  122.252429]  bio_advance+0x28/0xf0
      [  122.256217]  blk_update_request+0xa1/0x310
      [  122.260778]  blk_mq_end_request+0x1e/0x70
      [  122.265256]  nvme_complete_rq+0x1c/0xd0 [nvme_core]
      [  122.270699]  nvme_pci_complete_rq+0x85/0x130 [nvme]
      [  122.276140]  __blk_mq_complete_request+0x8d/0x140
      [  122.281387]  blk_mq_complete_request+0x16/0x20
      [  122.286345]  nvme_process_cq+0xdd/0x1c0 [nvme]
      [  122.291301]  nvme_irq+0x23/0x50 [nvme]
      [  122.295485]  __handle_irq_event_percpu+0x3c/0x190
      [  122.300725]  handle_irq_event_percpu+0x32/0x80
      [  122.305683]  handle_irq_event+0x3b/0x60
      [  122.309964]  handle_edge_irq+0x8f/0x190
      [  122.314247]  handle_irq+0xab/0x120
      [  122.318043]  do_IRQ+0x48/0xd0
      [  122.321355]  common_interrupt+0x9d/0x9d
      [  122.325625]  </IRQ>
      [  122.327967] RIP: 0010:cpuidle_enter_state+0xe9/0x280
      [  122.333504] RSP: 0018:ffffc90000133e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff35
      [  122.341952] RAX: ffff88047fc1b900 RBX: ffff88047fc24400 RCX: 000000000000001f
      [  122.349913] RDX: 0000000000000000 RSI: fffffcf2e6007295 RDI: 0000000000000000
      [  122.357874] RBP: ffffc90000133ea0 R08: 000000000000062e R09: 0000000000000253
      [  122.365836] R10: 0000000000000225 R11: 0000000000000018 R12: 0000000000000002
      [  122.373797] R13: 0000000000000001 R14: ffff88047fc24400 R15: 0000001c6bd1d263
      [  122.381762]  ? cpuidle_enter_state+0xc5/0x280
      [  122.386623]  cpuidle_enter+0x17/0x20
      [  122.390611]  call_cpuidle+0x23/0x40
      [  122.394501]  do_idle+0x17e/0x1f0
      [  122.398101]  cpu_startup_entry+0x73/0x80
      [  122.402478]  start_secondary+0x178/0x1c0
      [  122.406854]  secondary_startup_64+0xa5/0xa5
      [  122.411520] Code: 48 8b 5f 68 48 8b 47 08 31 d2 4c 8b 5b 48 48 8b 80 d0 03 00 00 48 83 b8 48 02 00 00 00 48 8d 88 48 02 00 00 48 0f 45 d1 c1 ee 09 <0f> b6 4a 0a 0f b6 52 09 89 f0 48 01 73 08 83 e9 09 d3 e8 0f af
      [  122.432604] RIP: bio_integrity_advance+0x3d/0xf0 RSP: ffff88047fc03d70
      [  122.439888] CR2: 000000000000000a
      Reported-by: Zhang Yi <yizhan@redhat.com>
      Tested-by: Zhang Yi <yizhan@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: set discard_alignment to zero · b224f613
      Committed by David Disseldorp
      Similar to 7c084289 ("rbd: set discard_alignment to zero"), NVMe
      devices are currently incorrectly initialised with the block queue
      discard_alignment set to the NVMe stream alignment.
      
      As per Documentation/ABI/testing/sysfs-block:
        The discard_alignment parameter indicates how many bytes the beginning
        of the device is offset from the internal allocation unit's natural
        alignment.
      
      Correcting the discard_alignment parameter to zero has no effect on how
      discard requests are propagated through the block layer - @alignment in
      __blkdev_issue_discard() remains zero. However, it does fix other
      consumers, such as LIO's Block Limits VPD response.
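      
      A sketch of the resulting queue-limits setup (simplified; 'size' stands
      for the stream write granularity computed by the driver):
      
      queue->limits.discard_granularity = size;   /* unchanged */
      queue->limits.discard_alignment = 0;        /* no offset from the
                                                   * allocation unit */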
      Signed-off-by: David Disseldorp <ddiss@suse.de>
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  15. 20 Nov 2017, 2 commits
  16. 12 Nov 2017, 1 commit
  17. 11 Nov 2017, 8 commits