1. 26 Mar 2018, 3 commits
    • nvme: Add fault injection feature · b9e03857
      Committed by Thomas Tai
      Linux's fault injection framework provides a systematic way to support
      error injection via debugfs in the /sys/kernel/debug directory. This
      patch uses the framework to add error injection to NVMe driver. The
      fault injection source code is stored in a separate file and only linked
      if CONFIG_FAULT_INJECTION_DEBUG_FS kernel config is selected.
      
      Once error injection is enabled, NVME_SC_INVALID_OPCODE with the
      no-retry flag will be injected into nvme_end_request. Users can change
      the default status code and no-retry flag via debugfs. The following
      example shows how to enable and inject an error. For more examples,
      refer to Documentation/fault-injection/nvme-fault-injection.txt
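      A minimal sketch of the completion-path hook this enables, assuming a
      hypothetical per-namespace nvme_fault_inject struct built on the generic
      should_fail() helper (the field and function names below are
      illustrative, not the literal patch):

      #include <linux/fault-inject.h>
      #include <linux/nvme.h>
      #include <linux/types.h>

      struct nvme_fault_inject {
              struct fault_attr attr;  /* times/probability via debugfs */
              u16 status;              /* defaults to NVME_SC_INVALID_OPCODE */
              bool dont_retry;         /* defaults to true */
      };

      /* Called on the completion path; overrides the status if the
       * fault-injection framework says this request should fail. */
      static void nvme_maybe_inject_error(struct nvme_fault_inject *fi,
                                          u16 *status)
      {
              if (!should_fail(&fi->attr, 1))
                      return;

              *status = fi->status;
              if (fi->dont_retry)
                      *status |= NVME_SC_DNR;  /* tell the core not to retry */
      }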
      
      How to enable nvme fault injection:
      
      First, enable the CONFIG_FAULT_INJECTION_DEBUG_FS kernel config and
      recompile the kernel. After booting the new kernel, do the following.
      
      How to inject an error:
      
      mount /dev/nvme0n1 /mnt
      echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/times
      echo 100 > /sys/kernel/debug/nvme0n1/fault_inject/probability
      cp a.file /mnt
      
      Expected Result:
      
      cp: cannot stat ‘/mnt/a.file’: Input/output error
      
      Message from dmesg:
      
      FAULT_INJECTION: forcing a failure.
      name fault_inject, interval 1, probability 100, space 0, times 1
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc8+ #2
      Hardware name: innotek GmbH VirtualBox/VirtualBox,
      BIOS VirtualBox 12/01/2006
      Call Trace:
        <IRQ>
        dump_stack+0x5c/0x7d
        should_fail+0x148/0x170
        nvme_should_fail+0x2f/0x50 [nvme_core]
        nvme_process_cq+0xe7/0x1d0 [nvme]
        nvme_irq+0x1e/0x40 [nvme]
        __handle_irq_event_percpu+0x3a/0x190
        handle_irq_event_percpu+0x30/0x70
        handle_irq_event+0x36/0x60
        handle_fasteoi_irq+0x78/0x120
        handle_irq+0xa7/0x130
        ? tick_irq_enter+0xa8/0xc0
        do_IRQ+0x43/0xc0
        common_interrupt+0xa2/0xa2
        </IRQ>
      RIP: 0010:native_safe_halt+0x2/0x10
      RSP: 0018:ffffffff82003e90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd
      RAX: ffffffff817a10c0 RBX: ffffffff82012480 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: 0000000000000000 R08: 000000008e38ce64 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82012480
      R13: ffffffff82012480 R14: 0000000000000000 R15: 0000000000000000
        ? __sched_text_end+0x4/0x4
        default_idle+0x18/0xf0
        do_idle+0x150/0x1d0
        cpu_startup_entry+0x6f/0x80
        start_kernel+0x4c4/0x4e4
        ? set_init_arg+0x55/0x55
        secondary_startup_64+0xa5/0xb0
        print_req_error: I/O error, dev nvme0n1, sector 9240
      EXT4-fs error (device nvme0n1): ext4_find_entry:1436:
      inode #2: comm cp: reading directory lblock 0
      Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
      Reviewed-by: Eric Saint-Etienne <eric.saint.etienne@oracle.com>
      Signed-off-by: Karl Volz <karl.volz@oracle.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme: use define instead of magic value for identify size · 42595eb7
      Committed by Minwoo Im
      NVME_IDENTIFY_DATA_SIZE was added to linux/nvme.h by the following commit:
        commit 0add5e8e ("nvmet: use NVME_IDENTIFY_DATA_SIZE")

      Use the NVME_IDENTIFY_DATA_SIZE define instead of the magic value
      0x1000 for the identify data size.
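      For illustration only, the shape of such a change (not the literal diff
      from this commit) is:

      #include <linux/nvme.h>
      #include <linux/slab.h>

      static void *alloc_identify_buffer(void)
      {
              /* Was: kzalloc(0x1000, GFP_KERNEL) -- same 4096 bytes, but
               * the define documents what the size actually is. */
              return kzalloc(NVME_IDENTIFY_DATA_SIZE, GFP_KERNEL);
      }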
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvmet: don't return "any" ip address in discovery log page · 4c652685
      Committed by Sagi Grimberg
      It's perfectly valid to assign an nvmet port to listen on the "any"
      IP address (traddr 0.0.0.0 for the ipv4 address family) for IP based
      transport ports. However, we must not return this address in
      discovery log entries. Instead we need to return the address the
      request was accepted on (the req->port address).

      Since this is nvme transport specific, introduce an optional
      .disc_traddr interface that checks whether the port in question is
      bound to the "any" IP address and, if so, sets the traddr from the
      port the request came in on.
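      A rough sketch of the idea (the callback signature and names here are
      illustrative, not the transport ops member added by the patch):

      #include <linux/in.h>
      #include <linux/kernel.h>

      /* If the listener was bound to the ipv4 "any" address, report the
       * address the request was actually accepted on instead of 0.0.0.0. */
      static void example_disc_traddr(const struct sockaddr_in *listen_addr,
                                      const struct sockaddr_in *accept_addr,
                                      char *traddr, size_t len)
      {
              const struct sockaddr_in *a = listen_addr;

              if (listen_addr->sin_addr.s_addr == htonl(INADDR_ANY))
                      a = accept_addr;

              snprintf(traddr, len, "%pI4", &a->sin_addr.s_addr);
      }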
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 09 Mar 2018, 1 commit
  3. 01 Mar 2018, 2 commits
  4. 28 Feb 2018, 1 commit
    • nvme-multipath: fix sysfs dangerously created links · 9bd82b1a
      Committed by Baegjae Sung
      If multipathing is enabled, each NVMe subsystem creates a head
      namespace (e.g., nvme0n1) and multiple private namespaces
      (e.g., nvme0c0n1 and nvme0c1n1) in sysfs. When creating links for
      the private namespaces, the head namespace's links are used, so the
      namespace creation order must be followed (e.g., nvme0n1 ->
      nvme0c1n1). If the order is not followed, the sysfs links will be
      incomplete or a kernel panic will occur.
      
      The kernel panic was:
        kernel BUG at fs/sysfs/symlink.c:27!
        Call Trace:
          nvme_mpath_add_disk_links+0x5d/0x80 [nvme_core]
          nvme_validate_ns+0x5c2/0x850 [nvme_core]
          nvme_scan_work+0x1af/0x2d0 [nvme_core]
      
      Correct order
      Context A     Context B
      nvme0n1
      nvme0c0n1     nvme0c1n1
      
      Incorrect order
      Context A     Context B
                    nvme0c1n1
      nvme0n1
      nvme0c0n1
      
      nvme_mpath_add_disk (which creates the head namespace) is called
      just before nvme_mpath_add_disk_links (which creates the private
      namespaces' links). In nvme_mpath_add_disk, the first context acquires
      the subsystem lock and creates the head namespace; the other contexts,
      after waiting for the lock, see GENHD_FL_UP set on the head namespace
      and do nothing. We verified the code with and without multipathing
      using dual-port NVMe SSDs from three vendors.
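      Loosely, the pattern looks like this (struct and argument names are
      simplified stand-ins for the real nvme_mpath_add_disk code):

      #include <linux/genhd.h>
      #include <linux/mutex.h>

      /* Only the first context to take the subsystem lock adds the head
       * disk; later contexts see GENHD_FL_UP and skip device_add_disk(),
       * so private-namespace links always find a live head namespace. */
      static void example_mpath_add_disk(struct mutex *subsys_lock,
                                         struct device *parent,
                                         struct gendisk *head_disk)
      {
              mutex_lock(subsys_lock);
              if (!(head_disk->flags & GENHD_FL_UP))
                      device_add_disk(parent, head_disk);
              mutex_unlock(subsys_lock);
      }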
      Signed-off-by: Baegjae Sung <baegjae@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
  5. 26 Feb 2018, 1 commit
  6. 22 Feb 2018, 3 commits
  7. 14 Feb 2018, 5 commits
  8. 13 Feb 2018, 1 commit
    • nvme: Don't use a stack buffer for keep-alive command · 0a34e466
      Committed by Roland Dreier
      In nvme_keep_alive() we pass a request with a pointer to an NVMe command on
      the stack into blk_execute_rq_nowait().  However, the block layer doesn't
      guarantee that the request is fully queued before blk_execute_rq_nowait()
      returns.  If not, and the request is queued after nvme_keep_alive() returns,
      then we'll end up using stack memory that might have been overwritten to
      form the NVMe command we pass to hardware.
      
      Fix this by keeping a special command struct in the nvme_ctrl struct right
      next to the delayed work struct used for keep-alives.
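      A loose sketch of the fix's shape (field names are assumptions for
      illustration; the real patch adds the command to struct nvme_ctrl):

      #include <linux/nvme.h>
      #include <linux/string.h>
      #include <linux/workqueue.h>

      struct example_ctrl {
              struct delayed_work ka_work;  /* periodic keep-alive work */
              struct nvme_command ka_cmd;   /* lives as long as the controller */
      };

      static void example_prep_keep_alive(struct example_ctrl *ctrl)
      {
              memset(&ctrl->ka_cmd, 0, sizeof(ctrl->ka_cmd));
              ctrl->ka_cmd.common.opcode = nvme_admin_keep_alive;
              /* hand &ctrl->ka_cmd to blk_execute_rq_nowait(), never a
               * stack copy that may be gone before the request is queued */
      }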
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  9. 11 Feb 2018, 2 commits
    • nvme_fc: cleanup io completion · c3aedd22
      Committed by James Smart
      There was some old code that dealt with complete_rq being called
      prior to the lldd returning the io completion. This is garbage code.
      The complete_rq routine was being called after eh_timeouts were
      triggered, and only because eh_timeouts were not being handled
      properly. The timeouts were fixed in prior patches so that, in
      general, a timeout will initiate an abort and restart the reset
      timer, and the abort operation will take care of completing things.
      With the reset timer restarted, the erroneous complete_rq calls
      were eliminated.
      
      So remove the work that was synchronizing complete_rq with io
      completion.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme_fc: correct abort race condition on resets · 3efd6e8e
      Committed by James Smart
      During reset handling, there is live io completing while the reset
      is taking place. The reset path attempts to abort all outstanding io,
      counting the number of ios that were reset. It then waits for those
      ios to be reclaimed from the lldd before continuing.
      
      The transport's logic on io state and flag setting was poor, allowing
      ios to complete simultaneously with the abort request. The completed
      ios were counted, but as the completion had already occurred, the
      completion never reduced the count. As the count never reaches zero,
      the reset/delete never completes.

      Tighten it up by unconditionally changing the op state to completed
      when the io done handler is called.  The reset/abort path now changes
      the op state to aborted, but the abort only continues if the op
      state was previously live. If the op had already completed, the abort
      is backed out. Thus proper counting of io aborts and their completions
      is working again.
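      A minimal model of that handshake (simplified, with made-up state
      names; not the nvme_fc code itself):

      #include <linux/atomic.h>
      #include <linux/types.h>

      enum { OP_LIVE, OP_COMPLETED, OP_ABORTED };

      /* Done handler: unconditionally mark the op completed; the return
       * value tells the caller what the prior state was. */
      static int example_op_done(atomic_t *state)
      {
              return atomic_xchg(state, OP_COMPLETED);
      }

      /* Abort path: LIVE -> ABORTED only if the op has not completed yet.
       * If the cmpxchg fails the abort is backed out, so it is never
       * counted as an outstanding abort that will never be reclaimed. */
      static bool example_try_abort(atomic_t *state)
      {
              return atomic_cmpxchg(state, OP_LIVE, OP_ABORTED) == OP_LIVE;
      }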
      
      Also removed the TERMIO state on the op as it's redundant with the
      op's aborted state.
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
  10. 09 Feb 2018, 4 commits
  11. 31 Jan 2018, 1 commit
    • blk-mq: introduce BLK_STS_DEV_RESOURCE · 86ff7c2a
      Committed by Ming Lei
      This status is returned from the driver to the block layer if a
      device-related resource is unavailable, but the driver can guarantee
      that IO dispatch will be triggered in the future when the resource
      becomes available.

      Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if a
      driver returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun the
      queue after a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.
      BLK_MQ_DELAY_QUEUE is 3 ms because both scsi-mq and nvmefc use that
      magic value.
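      For illustration, a driver's ->queue_rq() might pick between the two
      statuses roughly like this (the foo_* helpers are placeholders, not a
      real driver):

      #include <linux/blk-mq.h>

      /* Placeholder helpers standing in for real driver logic. */
      bool foo_get_device_resource(struct blk_mq_hw_ctx *hctx);
      bool foo_has_inflight_io(struct blk_mq_hw_ctx *hctx);
      void foo_submit(struct blk_mq_hw_ctx *hctx, struct request *rq);

      static blk_status_t foo_queue_rq(struct blk_mq_hw_ctx *hctx,
                                       const struct blk_mq_queue_data *bd)
      {
              if (!foo_get_device_resource(hctx)) {
                      if (foo_has_inflight_io(hctx))
                              /* completion of in-flight IO reruns the queue */
                              return BLK_STS_DEV_RESOURCE;
                      /* nothing in flight: let blk-mq rerun after a delay */
                      return BLK_STS_RESOURCE;
              }

              foo_submit(hctx, bd->rq);
              return BLK_STS_OK;
      }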
      
      If a driver can make sure there is in-flight IO, it is safe to return
      BLK_STS_DEV_RESOURCE because:
      
      1) If all in-flight IOs complete before examining SCHED_RESTART in
      blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
      is run immediately in this case by blk_mq_dispatch_rq_list();
      
      2) if there is any in-flight IO after/when examining SCHED_RESTART
      in blk_mq_dispatch_rq_list():
      - if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
      - otherwise, this request will be dispatched after any in-flight IO is
        completed via blk_mq_sched_restart()
      
      3) if SCHED_RESTART is set concurrently in context because of
      BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
      cases and make sure an IO hang can be avoided.
      
      One invariant is that queue will be rerun if SCHED_RESTART is set.
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Tested-by: Laurence Oberman <loberman@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 26 Jan 2018, 5 commits
  13. 25 Jan 2018, 1 commit
  14. 24 Jan 2018, 1 commit
  15. 18 Jan 2018, 6 commits
  16. 16 Jan 2018, 3 commits