1. 22 Jun, 2018 1 commit
    • nvme-pci: limit max IO size and segments to avoid high order allocations · 943e942e
      Committed by Jens Axboe
      nvme requires an sg table allocation for each request. If the request
      is large, then the allocation can become quite large. For instance,
      with our default software settings of 1280KB IO size, we'll need
      10248 bytes of sg table. That turns into a 2nd order allocation,
      which we can't always guarantee. If we fail the allocation, blk-mq
      will retry it later. But there's no guarantee that we'll EVER be
      able to allocate that much contiguous memory.
      
      Limit the IO size such that we never need more than a single page
      of memory. That's a lot faster and more reliable. Then back that
      allocation with a mempool, so that we know we'll always be able
      to succeed the allocation at some point.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Acked-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
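The arithmetic behind the 10248-byte figure in the commit above can be sketched in a few lines of C. This is only a model under stated assumptions: a 4096-byte page, a 32-byte scatterlist entry, and a per-request layout of one pointer per PRP-list page plus one scatterlist entry per page-sized segment. None of these constants or function names come from the commit message itself.

```c
#include <assert.h>

/* Assumed values for illustration; they are not stated in the commit. */
#define PAGE_BYTES      4096u  /* assumed CPU and device page size */
#define SG_ENTRY_BYTES  32u    /* assumed size of one scatterlist entry */
#define PRP_ENTRY_BYTES 8u     /* one 64-bit PRP entry */
#define PTR_BYTES       8u     /* one 64-bit pointer per PRP-list page */

static unsigned div_round_up(unsigned n, unsigned d)
{
    assert(d != 0);
    return (n + d - 1) / d;
}

/* Bookkeeping bytes for one request, worst case one segment per page. */
static unsigned sg_alloc_bytes(unsigned io_bytes)
{
    unsigned nseg = div_round_up(io_bytes, PAGE_BYTES);
    unsigned nprps = div_round_up(io_bytes + PAGE_BYTES, PAGE_BYTES);
    unsigned prp_pages = div_round_up(PRP_ENTRY_BYTES * nprps,
                                      PAGE_BYTES - PRP_ENTRY_BYTES);

    return PTR_BYTES * prp_pages + SG_ENTRY_BYTES * nseg;
}

/* Smallest order such that (1 << order) pages hold the given byte count. */
static unsigned alloc_order(unsigned bytes)
{
    unsigned pages = div_round_up(bytes, PAGE_BYTES);
    unsigned order = 0;

    while ((1u << order) < pages)
        order++;
    return order;
}
```

Under these assumptions, `sg_alloc_bytes(1280 * 1024)` yields 10248 and `alloc_order(10248)` yields 2, which matches the commit's description of a 2nd order allocation. A 128KB request needs 1032 bytes and fits in a single page (order 0).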
  2. 14 Jun, 2018 1 commit
  3. 09 Jun, 2018 1 commit
  4. 01 Jun, 2018 3 commits
  5. 23 May, 2018 1 commit
    • nvme: fix lockdep warning in nvme_mpath_clear_current_path · 978628ec
      Committed by Johannes Thumshirn
      When running blktest's nvme/005 with a lockdep enabled kernel the test
      case fails due to the following lockdep splat in dmesg:
      
       =============================
       WARNING: suspicious RCU usage
       4.17.0-rc5 #881 Not tainted
       -----------------------------
       drivers/nvme/host/nvme.h:457 suspicious rcu_dereference_check() usage!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 2, debug_locks = 1
       3 locks held by kworker/u32:5/1102:
        #0:         (ptrval) ((wq_completion)"nvme-wq"){+.+.}, at: process_one_work+0x152/0x5c0
        #1:         (ptrval) ((work_completion)(&ctrl->scan_work)){+.+.}, at: process_one_work+0x152/0x5c0
        #2:         (ptrval) (&subsys->lock#2){+.+.}, at: nvme_ns_remove+0x43/0x1c0 [nvme_core]
      
      The only caller of nvme_mpath_clear_current_path() is nvme_ns_remove()
      which holds the subsys lock so it's likely a false positive, but when
      using rcu_access_pointer(), we're telling rcu and lockdep that we're
      only after the pointer value.
      
      Fixes: 32acab31 ("nvme: implement multipath access to nvme subsystems")
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
  6. 19 May, 2018 1 commit
  7. 12 May, 2018 1 commit
    • nvme: add quirk to force medium priority for SQ creation · 9abd68ef
      Committed by Jens Axboe
      Some P3100 drives have a bug where they think WRRU (weighted round robin)
      is always enabled, even though the host doesn't set it. Since they think
      it's enabled, they also look at the submission queue creation priority. We
      used to set that to MEDIUM by default, but that was removed in commit
      81c1cd98. This causes various issues on that drive. Add a quirk to
      still set MEDIUM priority for that controller.
      
      Fixes: 81c1cd98 ("nvme/pci: Don't set reserved SQ create flags")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
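A minimal sketch of how a per-controller quirk bit could gate the submission queue priority flag described in the commit above. The `SKETCH_` names and their bit values are illustrative assumptions and should not be read as the driver's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values (assumptions, not the driver's definitions). */
#define SKETCH_QUEUE_PHYS_CONTIG    (1u << 0)   /* queue memory is contiguous */
#define SKETCH_SQ_PRIO_MEDIUM       (2u << 1)   /* priority field, value "medium" */
#define SKETCH_QUIRK_MEDIUM_PRIO_SQ (1ul << 7)  /* hypothetical quirk bit */

/* Build the flags for a create-SQ command for a controller with the given quirks. */
static uint16_t sketch_sq_create_flags(unsigned long quirks)
{
    uint16_t flags = SKETCH_QUEUE_PHYS_CONTIG;

    /* The priority field must not overlap the contiguity bit. */
    assert((SKETCH_SQ_PRIO_MEDIUM & SKETCH_QUEUE_PHYS_CONTIG) == 0);

    /* Only controllers that carry the quirk get the priority set. */
    if (quirks & SKETCH_QUIRK_MEDIUM_PRIO_SQ)
        flags |= SKETCH_SQ_PRIO_MEDIUM;

    return flags;
}
```

Keeping the priority behind a quirk leaves the field untouched for every other controller and restores the old behavior only for the affected drives.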
  8. 03 May, 2018 1 commit
  9. 12 Apr, 2018 3 commits
    • nvme: expand nvmf_check_if_ready checks · bb06ec31
      Committed by James Smart
      The nvmf_check_if_ready() checks that were added are very simplistic.
      As a result, the routine fails ios in many cases during windows of
      reset or re-connection. When no multipath options are present, the
      error goes back to the caller: the filesystem or application. Not good.
      
      The common routine was rewritten and calling syntax slightly expanded
      so that per-transport is_ready routines don't need to be present.
      The transports now call the routine directly. The routine is now a
      fabrics routine rather than an inline function.
      
      The routine now looks at controller state to decide the action to
      take. Some states mandate io failure. Others define the condition where
      a command can be accepted.  When the decision is unclear, a generic
      queue-or-reject check is made to look for failfast or multipath ios and
      only fails the io if it is so marked. Otherwise, the io will be queued
      and wait for the controller state to resolve.
      
      Admin commands issued via ioctl share a live admin queue with commands
      from the transport for controller init. The ioctls could be intermixed
      with the initialization commands. It's possible for the ioctl cmd to
      be issued prior to the controller being enabled. To block this, the
      ioctl admin commands need to be distinguished from admin commands used
      for controller init. Added a USERCMD nvme_req(req)->rq_flags bit to
      reflect this division and set it on ioctls requests.  As the
      nvmf_check_if_ready() routine is called prior to nvme_setup_cmd(),
      ensure that commands allocated by the ioctl path (actually anything
      in core.c) preps the nvme_req(req) before starting the io. This will
      preserve the USERCMD flag during execution and/or retry.
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.e>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
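The decision flow described in the commit above (some states accept, some states fail, everything else falls through to a queue-or-reject check) can be modeled roughly as follows. The state names, the mapping of states to outcomes, and the `init_cmd` field are assumptions made for illustration; the commit message does not spell them out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical controller states and outcomes. */
enum sketch_state { SK_NEW, SK_LIVE, SK_RESETTING, SK_CONNECTING, SK_DELETING };
enum sketch_verdict { SK_ACCEPT, SK_REQUEUE, SK_FAIL };

struct sketch_req {
    bool failfast;   /* request is marked failfast */
    bool multipath;  /* request came through a multipath device */
    bool user_cmd;   /* issued via ioctl (the USERCMD flag) */
    bool init_cmd;   /* hypothetical: part of controller initialization */
};

static enum sketch_verdict sketch_check_ready(enum sketch_state state,
                                              const struct sketch_req *rq)
{
    assert(rq != NULL);

    switch (state) {
    case SK_LIVE:
        return SK_ACCEPT;          /* normal operation */
    case SK_DELETING:
        return SK_FAIL;            /* assumed: this state mandates failure */
    case SK_NEW:
    case SK_CONNECTING:
        /* Let init commands through, but hold back ioctl commands. */
        if (rq->init_cmd && !rq->user_cmd)
            return SK_ACCEPT;
        break;
    default:
        break;
    }

    /* Unclear cases: fail only failfast or multipath ios, queue the rest. */
    if (rq->failfast || rq->multipath)
        return SK_FAIL;
    return SK_REQUEUE;
}
```

For example, an ioctl admin command that arrives while the controller is still connecting is requeued rather than accepted, while a multipath io in the same window is failed so another path can be tried.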
    • nvme: unexport nvme_start_keep_alive · 00b683db
      Committed by Johannes Thumshirn
      nvme_start_keep_alive() isn't used outside core.c so unexport it and
      make it static.
      Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme: enforce 64bit offset for nvme_get_log_ext fn · 7ec6074f
      Committed by Matias Bjørling
      Compiling on 32-bit systems produces a warning about the shift width
      when shifting a 32-bit integer with a 64-bit integer.
      
      Make sure that the offset is always 64-bit, and use macros for retrieving
      the lower and upper bits of the offset.
      Signed-off-by: Matias Bjørling <mb@lightnvm.io>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
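As a rough userspace illustration of the approach in the commit above, the offset can be held in a fixed 64-bit type and split with helper macros instead of an open-coded shift. The `sketch_` macros and the struct are stand-ins written for this example; the field names `lpol` and `lpou` are assumed labels for the lower and upper halves.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for lower/upper 32-bit helper macros. The two-step
 * shift stays well-defined even if the argument is only 32 bits wide. */
#define sketch_lower_32_bits(n) ((uint32_t)((n) & 0xffffffffu))
#define sketch_upper_32_bits(n) ((uint32_t)(((n) >> 16) >> 16))

/* Lower and upper halves of a log page offset (field names assumed). */
struct sketch_log_offset {
    uint32_t lpol;
    uint32_t lpou;
};

/* Taking the offset as uint64_t makes the width independent of the platform. */
static struct sketch_log_offset sketch_split_offset(uint64_t offset)
{
    struct sketch_log_offset o;

    o.lpol = sketch_lower_32_bits(offset);
    o.lpou = sketch_upper_32_bits(offset);

    /* The two halves must reassemble into the original offset. */
    assert((((uint64_t)o.lpou << 32) | o.lpol) == offset);
    return o;
}
```

With a pointer-sized offset type, a plain `offset >> 32` would shift by the full width of the type on a 32-bit build, which is what the compiler warns about; the fixed-width parameter and the macros avoid that.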
  10. 30 Mar, 2018 1 commit
  11. 26 Mar, 2018 4 commits
    • nvme: make nvme_get_log_ext non-static · d558fb51
      Committed by Matias Bjørling
      Enable the lightnvm integration to use the nvme_get_log_ext()
      function.
      Signed-off-by: Matias Bjørling <mb@lightnvm.io>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme: Add .stop_ctrl to nvme ctrl ops · b435ecea
      Committed by Nitzan Carmi
      For consistency reasons, any fabric-specific works
      (e.g. error recovery/reconnect) should be canceled in
      nvme_stop_ctrl, as for all other NVMe pending works
      (e.g. scan, keep alive).
      
      The patch aims to simplify the logic of the code, which
      currently relies only on a vague requirement that each
      fabric flush its private workqueues at the beginning of
      its .delete_ctrl op.
      Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
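The idea of an optional per-transport stop hook called from a common stop routine can be sketched as below. All type and function names here are hypothetical, and the integer fields merely stand in for cancelling real work items.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_ctrl;

/* Per-transport operations; stop_ctrl is optional and may be NULL. */
struct sketch_ctrl_ops {
    void (*stop_ctrl)(struct sketch_ctrl *ctrl);
};

struct sketch_ctrl {
    const struct sketch_ctrl_ops *ops;
    int core_work_cancelled;    /* stands in for scan and keep-alive work */
    int fabric_work_cancelled;  /* stands in for error recovery/reconnect work */
};

/* Common stop path: cancel core work, then let the transport cancel its own. */
static void sketch_stop_ctrl(struct sketch_ctrl *ctrl)
{
    assert(ctrl != NULL && ctrl->ops != NULL);

    ctrl->core_work_cancelled = 1;
    if (ctrl->ops->stop_ctrl)
        ctrl->ops->stop_ctrl(ctrl);
}

/* Example hook that a fabric transport might install. */
static void sketch_fabric_stop(struct sketch_ctrl *ctrl)
{
    ctrl->fabric_work_cancelled = 1;
}
```

A transport without private work simply leaves the hook as NULL, so the common routine is the single place where all pending work is stopped.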
    • nvme: change namespaces_mutext to namespaces_rwsem · 765cc031
      Committed by Jianchao Wang
      namespaces_mutext is used to synchronize the operations on ctrl
      namespaces list. Most of the time, it is a read operation.
      
      On the other hand, there are many interfaces in nvme core that
      need this lock, such as nvme_wait_freeze, and even more interfaces
      will be added. If we use mutex here, circular dependency could be
      introduced easily. For example:
      context A                  context B
      nvme_xxx                   nvme_xxx
      hold namespaces_mutext     require namespaces_mutext
      sync context B
      
      So it is better to change it from mutex to rwsem.
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme: Add fault injection feature · b9e03857
      Committed by Thomas Tai
      Linux's fault injection framework provides a systematic way to support
      error injection via debugfs in the /sys/kernel/debug directory. This
      patch uses the framework to add error injection to NVMe driver. The
      fault injection source code is stored in a separate file and only linked
      if CONFIG_FAULT_INJECTION_DEBUG_FS kernel config is selected.
      
      Once the error injection is enabled, NVME_SC_INVALID_OPCODE with no
      retry will be injected into nvme_end_request. Users can change
      the default status code and no-retry flag via debugfs. The following
      example shows how to enable and inject an error. For more examples,
      refer to Documentation/fault-injection/nvme-fault-injection.txt
      
      How to enable nvme fault injection:
      
      First, enable CONFIG_FAULT_INJECTION_DEBUG_FS kernel config,
      recompile the kernel. After booting up the kernel, do the
      following.
      
      How to inject an error:
      
      mount /dev/nvme0n1 /mnt
      echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/times
      echo 100 > /sys/kernel/debug/nvme0n1/fault_inject/probability
      cp a.file /mnt
      
      Expected Result:
      
      cp: cannot stat ‘/mnt/a.file’: Input/output error
      
      Message from dmesg:
      
      FAULT_INJECTION: forcing a failure.
      name fault_inject, interval 1, probability 100, space 0, times 1
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc8+ #2
      Hardware name: innotek GmbH VirtualBox/VirtualBox,
      BIOS VirtualBox 12/01/2006
      Call Trace:
        <IRQ>
        dump_stack+0x5c/0x7d
        should_fail+0x148/0x170
        nvme_should_fail+0x2f/0x50 [nvme_core]
        nvme_process_cq+0xe7/0x1d0 [nvme]
        nvme_irq+0x1e/0x40 [nvme]
        __handle_irq_event_percpu+0x3a/0x190
        handle_irq_event_percpu+0x30/0x70
        handle_irq_event+0x36/0x60
        handle_fasteoi_irq+0x78/0x120
        handle_irq+0xa7/0x130
        ? tick_irq_enter+0xa8/0xc0
        do_IRQ+0x43/0xc0
        common_interrupt+0xa2/0xa2
        </IRQ>
      RIP: 0010:native_safe_halt+0x2/0x10
      RSP: 0018:ffffffff82003e90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd
      RAX: ffffffff817a10c0 RBX: ffffffff82012480 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: 0000000000000000 R08: 000000008e38ce64 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82012480
      R13: ffffffff82012480 R14: 0000000000000000 R15: 0000000000000000
        ? __sched_text_end+0x4/0x4
        default_idle+0x18/0xf0
        do_idle+0x150/0x1d0
        cpu_startup_entry+0x6f/0x80
        start_kernel+0x4c4/0x4e4
        ? set_init_arg+0x55/0x55
        secondary_startup_64+0xa5/0xb0
        print_req_error: I/O error, dev nvme0n1, sector 9240
      EXT4-fs error (device nvme0n1): ext4_find_entry:1436:
      inode #2: comm cp: reading directory lblock 0
      Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
      Reviewed-by: Eric Saint-Etienne <eric.saint.etienne@oracle.com>
      Signed-off-by: Karl Volz <karl.volz@oracle.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
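How the `times` and `probability` knobs from the example above interact can be modeled with a simplified decision function. This is a reduced sketch, not the kernel's fault injection code: it ignores the `interval` and `space` attributes, and the random draw is passed in as `roll` so the behavior is deterministic. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified fault attributes: a percentage and a remaining-failures budget. */
struct sketch_fault_attr {
    unsigned probability;  /* percent, 0..100 */
    int times;             /* failures left; -1 means unlimited */
};

/* roll is a random draw in [0, 100) supplied by the caller. */
static bool sketch_should_fail(struct sketch_fault_attr *attr, unsigned roll)
{
    assert(attr != NULL && roll < 100);

    if (attr->times == 0)
        return false;              /* budget exhausted */
    if (attr->probability <= roll)
        return false;              /* the draw did not hit */
    if (attr->times > 0)
        attr->times--;             /* consume one failure from the budget */
    return true;
}
```

With `probability` set to 100 and `times` set to 1, as in the commands above, the first completion is failed and every later one passes, which is consistent with a single I/O error being reported.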
  12. 07 Mar, 2018 1 commit
  13. 13 Feb, 2018 1 commit
    • nvme: Don't use a stack buffer for keep-alive command · 0a34e466
      Committed by Roland Dreier
      In nvme_keep_alive() we pass a request with a pointer to an NVMe command on
      the stack into blk_execute_rq_nowait().  However, the block layer doesn't
      guarantee that the request is fully queued before blk_execute_rq_nowait()
      returns.  If not, and the request is queued after nvme_keep_alive() returns,
      then we'll end up using stack memory that might have been overwritten to
      form the NVMe command we pass to hardware.
      
      Fix this by keeping a special command struct in the nvme_ctrl struct right
      next to the delayed work struct used for keep-alives.
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
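The lifetime fix described in the commit above can be illustrated with a small model in which an asynchronous submit records only a pointer to the command. The types and functions are hypothetical; the `pending` field stands in for a request that has been handed to the block layer but not yet dispatched. The opcode value 0x18 is the NVMe admin Keep Alive opcode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_KEEP_ALIVE_OPCODE 0x18

struct sketch_cmd {
    uint8_t opcode;
};

struct sketch_ka_ctrl {
    struct sketch_cmd ka_cmd;          /* lives as long as the controller */
    const struct sketch_cmd *pending;  /* queued but not yet dispatched */
};

/* Asynchronous submit: returns before the command is consumed, so the
 * command must not live on this function's stack. */
static void sketch_queue_keep_alive(struct sketch_ka_ctrl *ctrl)
{
    assert(ctrl != NULL);

    ctrl->ka_cmd.opcode = SKETCH_KEEP_ALIVE_OPCODE;
    ctrl->pending = &ctrl->ka_cmd;
}

/* Later dispatch: reads the command through the stored pointer. */
static uint8_t sketch_dispatch(struct sketch_ka_ctrl *ctrl)
{
    const struct sketch_cmd *cmd = ctrl->pending;

    ctrl->pending = NULL;
    return cmd ? cmd->opcode : 0;
}
```

Because the command is embedded in the controller structure, the pointer read at dispatch time is still valid after the submitting function has returned, which would not hold for a local variable.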
  14. 09 Feb, 2018 1 commit
  15. 16 Jan, 2018 1 commit
    • nvme: host delete_work and reset_work on separate workqueues · b227c59b
      Committed by Roy Shterman
      We need to ensure that delete_work will be hosted on a different
      workqueue than all the works we flush or cancel from it.
      Otherwise we may hit a circular dependency warning [1].
      
      Also, given that delete_work flushes reset_work, host reset_work
      on nvme_reset_wq and delete_work on nvme_delete_wq. In addition,
      fix the flushing in the individual drivers to flush nvme_delete_wq
      when draining queued deletes.
      
      [1]:
      [  178.491942] =============================================
      [  178.492718] [ INFO: possible recursive locking detected ]
      [  178.493495] 4.9.0-rc4-c844263313a8-lb #3 Tainted: G           OE
      [  178.494382] ---------------------------------------------
      [  178.495160] kworker/5:1/135 is trying to acquire lock:
      [  178.495894]  (
      [  178.496120] "nvme-wq"
      [  178.496471] ){++++.+}
      [  178.496599] , at:
      [  178.496921] [<ffffffffa70ac206>] flush_work+0x1a6/0x2d0
      [  178.497670]
                     but task is already holding lock:
      [  178.498499]  (
      [  178.498724] "nvme-wq"
      [  178.499074] ){++++.+}
      [  178.499202] , at:
      [  178.499520] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
      [  178.500343]
                     other info that might help us debug this:
      [  178.501269]  Possible unsafe locking scenario:
      
      [  178.502113]        CPU0
      [  178.502472]        ----
      [  178.502829]   lock(
      [  178.503115] "nvme-wq"
      [  178.503467] );
      [  178.503716]   lock(
      [  178.504001] "nvme-wq"
      [  178.504353] );
      [  178.504601]
                      *** DEADLOCK ***
      
      [  178.505441]  May be due to missing lock nesting notation
      
      [  178.506453] 2 locks held by kworker/5:1/135:
      [  178.507068]  #0:
      [  178.507330]  (
      [  178.507598] "nvme-wq"
      [  178.507726] ){++++.+}
      [  178.508079] , at:
      [  178.508173] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
      [  178.509004]  #1:
      [  178.509265]  (
      [  178.509532] (&ctrl->delete_work)
      [  178.509795] ){+.+.+.}
      [  178.510145] , at:
      [  178.510239] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
      [  178.511070]
                     stack backtrace:
      :
      [  178.511693] CPU: 5 PID: 135 Comm: kworker/5:1 Tainted: G           OE   4.9.0-rc4-c844263313a8-lb #3
      [  178.512974] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
      [  178.514247] Workqueue: nvme-wq nvme_del_ctrl_work [nvme_tcp]
      [  178.515071]  ffffc2668175bae0 ffffffffa7450823 ffffffffa88abd80 ffffffffa88abd80
      [  178.516195]  ffffc2668175bb98 ffffffffa70eb012 ffffffffa8d8d90d ffff9c472e9ea700
      [  178.517318]  ffff9c472e9ea700 ffff9c4700000000 ffff9c4700007200 ab83be61bec0d50e
      [  178.518443] Call Trace:
      [  178.518807]  [<ffffffffa7450823>] dump_stack+0x85/0xc2
      [  178.519542]  [<ffffffffa70eb012>] __lock_acquire+0x17d2/0x18f0
      [  178.520377]  [<ffffffffa75839a7>] ? serial8250_console_putchar+0x27/0x30
      [  178.521330]  [<ffffffffa7583980>] ? wait_for_xmitr+0xa0/0xa0
      [  178.522174]  [<ffffffffa70ac1eb>] ? flush_work+0x18b/0x2d0
      [  178.522975]  [<ffffffffa70eb7cb>] lock_acquire+0x11b/0x220
      [  178.523753]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
      [  178.524535]  [<ffffffffa70ac229>] flush_work+0x1c9/0x2d0
      [  178.525291]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
      [  178.526077]  [<ffffffffa70a9cf0>] ? flush_workqueue_prep_pwqs+0x220/0x220
      [  178.527040]  [<ffffffffa70ae7cf>] __cancel_work_timer+0x10f/0x1d0
      [  178.527907]  [<ffffffffa70fecb9>] ? vprintk_default+0x29/0x40
      [  178.528726]  [<ffffffffa71cb507>] ? printk+0x48/0x50
      [  178.529434]  [<ffffffffa70ae8c3>] cancel_delayed_work_sync+0x13/0x20
      [  178.530381]  [<ffffffffc042100b>] nvme_stop_ctrl+0x5b/0x70 [nvme_core]
      [  178.531314]  [<ffffffffc0403dcc>] nvme_del_ctrl_work+0x2c/0x50 [nvme_tcp]
      [  178.532271]  [<ffffffffa70ad741>] process_one_work+0x1e1/0x6a0
      [  178.533101]  [<ffffffffa70ad6c2>] ? process_one_work+0x162/0x6a0
      [  178.533954]  [<ffffffffa70adc4e>] worker_thread+0x4e/0x490
      [  178.534735]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
      [  178.535588]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
      [  178.536441]  [<ffffffffa70b48cf>] kthread+0xff/0x120
      [  178.537149]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
      [  178.538094]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
      [  178.538900]  [<ffffffffa78e332a>] ret_from_fork+0x2a/0x40
      Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  16. 15 Jan, 2018 1 commit
  17. 11 Jan, 2018 1 commit
  18. 08 Jan, 2018 1 commit
  19. 29 Dec, 2017 1 commit
  20. 23 Nov, 2017 1 commit
  21. 11 Nov, 2017 12 commits
  22. 01 Nov, 2017 1 commit