提交 · 9a915a5be7dc320743034a17394e08eb438baf33 · openanolis / cloud-kernel

26 3月, 2018 3 次提交

nvme-pci: quiesce IO queues prior to disabling device HMB accesses · 9a915a5b

由 Jianchao Wang 提交于 2月 12, 2018

Quiesce IO queues prior to disabling device HMB accesses. A controller
using HMB may relay on it to efficiently complete IO commands.
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9a915a5b

nvme: Add fault injection feature · b9e03857

由 Thomas Tai 提交于 2月 08, 2018

Linux's fault injection framework provides a systematic way to support
error injection via debugfs in the /sys/kernel/debug directory. This
patch uses the framework to add error injection to NVMe driver. The
fault injection source code is stored in a separate file and only linked
if CONFIG_FAULT_INJECTION_DEBUG_FS kernel config is selected.

Once the error injection is enabled, NVME_SC_INVALID_OPCODE with no
retry will be injected into the nvme_end_request. Users can change
the default status code and no retry flag via debufs. Following example
shows how to enable and inject an error. For more examples, refer to
Documentation/fault-injection/nvme-fault-injection.txt

How to enable nvme fault injection:

First, enable CONFIG_FAULT_INJECTION_DEBUG_FS kernel config,
recompile the kernel. After booting up the kernel, do the
following.

How to inject an error:

mount /dev/nvme0n1 /mnt
echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/times
echo 100 > /sys/kernel/debug/nvme0n1/fault_inject/probability
cp a.file /mnt

Expected Result:

cp: cannot stat ‘/mnt/a.file’: Input/output error

Message from dmesg:

FAULT_INJECTION: forcing a failure.
name fault_inject, interval 1, probability 100, space 0, times 1
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc8+ #2
Hardware name: innotek GmbH VirtualBox/VirtualBox,
BIOS VirtualBox 12/01/2006
Call Trace:
  <IRQ>
  dump_stack+0x5c/0x7d
  should_fail+0x148/0x170
  nvme_should_fail+0x2f/0x50 [nvme_core]
  nvme_process_cq+0xe7/0x1d0 [nvme]
  nvme_irq+0x1e/0x40 [nvme]
  __handle_irq_event_percpu+0x3a/0x190
  handle_irq_event_percpu+0x30/0x70
  handle_irq_event+0x36/0x60
  handle_fasteoi_irq+0x78/0x120
  handle_irq+0xa7/0x130
  ? tick_irq_enter+0xa8/0xc0
  do_IRQ+0x43/0xc0
  common_interrupt+0xa2/0xa2
  </IRQ>
RIP: 0010:native_safe_halt+0x2/0x10
RSP: 0018:ffffffff82003e90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd
RAX: ffffffff817a10c0 RBX: ffffffff82012480 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 000000008e38ce64 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82012480
R13: ffffffff82012480 R14: 0000000000000000 R15: 0000000000000000
  ? __sched_text_end+0x4/0x4
  default_idle+0x18/0xf0
  do_idle+0x150/0x1d0
  cpu_startup_entry+0x6f/0x80
  start_kernel+0x4c4/0x4e4
  ? set_init_arg+0x55/0x55
  secondary_startup_64+0xa5/0xb0
  print_req_error: I/O error, dev nvme0n1, sector 9240
EXT4-fs error (device nvme0n1): ext4_find_entry:1436:
inode #2: comm cp: reading directory lblock 0
Signed-off-by: NThomas Tai <thomas.tai@oracle.com>
Reviewed-by: NEric Saint-Etienne <eric.saint.etienne@oracle.com>
Signed-off-by: NKarl Volz <karl.volz@oracle.com>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b9e03857

nvme: use define instead of magic value for identify size · 42595eb7

由 Minwoo Im 提交于 2月 08, 2018

NVME_IDENTIFY_DATA_SIZE was added to linux/nvme.h by following commit.
  commit 0add5e8e ("nvmet: use NVME_IDENTIFY_DATA_SIZE")

Make it use NVME_IDENTIFY_DATA_SIZE define instead of magic value
0x1000 in case of identify data size.
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NMinwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

42595eb7

09 3月, 2018 1 次提交

block: Use blk_queue_flag_*() in drivers instead of queue_flag_*() · 8b904b5b

由 Bart Van Assche 提交于 3月 07, 2018

This patch has been generated as follows:

for verb in set_unlocked clear_unlocked set clear; do
  replace-in-files queue_flag_${verb} blk_queue_flag_${verb%_unlocked} \
    $(git grep -lw queue_flag_${verb} drivers block/bsg*)
done

Except for protecting all queue flag changes with the queue lock
this patch does not change any functionality.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Acked-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8b904b5b

01 3月, 2018 1 次提交

block: Add 'lock' as third argument to blk_alloc_queue_node() · 5ee0524b

由 Bart Van Assche 提交于 2月 28, 2018

This patch does not change any functionality.
Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5ee0524b

28 2月, 2018 1 次提交

nvme-multipath: fix sysfs dangerously created links · 9bd82b1a

由 Baegjae Sung 提交于 2月 28, 2018

If multipathing is enabled, each NVMe subsystem creates a head
namespace (e.g., nvme0n1) and multiple private namespaces
(e.g., nvme0c0n1 and nvme0c1n1) in sysfs. When creating links for
private namespaces, links of head namespace are used, so the
namespace creation order must be followed (e.g., nvme0n1 ->
nvme0c1n1). If the order is not followed, links of sysfs will be
incomplete or kernel panic will occur.

The kernel panic was:
  kernel BUG at fs/sysfs/symlink.c:27!
  Call Trace:
    nvme_mpath_add_disk_links+0x5d/0x80 [nvme_core]
    nvme_validate_ns+0x5c2/0x850 [nvme_core]
    nvme_scan_work+0x1af/0x2d0 [nvme_core]

Correct order
Context A     Context B
nvme0n1
nvme0c0n1     nvme0c1n1

Incorrect order
Context A     Context B
              nvme0c1n1
nvme0n1
nvme0c0n1

The nvme_mpath_add_disk (for creating head namespace) is called
just before the nvme_mpath_add_disk_links (for creating private
namespaces). In nvme_mpath_add_disk, the first context acquires
the lock of subsystem and creates a head namespace, and other
contexts do nothing by checking GENHD_FL_UP of a head namespace
after waiting to acquire the lock. We verified the code with or
without multipathing using three vendors of dual-port NVMe SSDs.
Signed-off-by: NBaegjae Sung <baegjae@gmail.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NKeith Busch <keith.busch@intel.com>

9bd82b1a

26 2月, 2018 1 次提交

nvme-pci: Fix nvme queue cleanup if IRQ setup fails · f25a2dfc

由 Jianchao Wang 提交于 2月 15, 2018

This patch fixes nvme queue cleanup if requesting an IRQ handler for
the queue's vector fails. It does this by resetting the cq_vector to
the uninitialized value of -1 so it is ignored for a controller reset.
Signed-off-by: NJianchao Wang <jianchao.w.wang@oracle.com>
[changelog updates, removed misc whitespace changes]
Signed-off-by: NKeith Busch <keith.busch@intel.com>

f25a2dfc

22 2月, 2018 2 次提交

nvme-rdma: use blk_rq_payload_bytes instead of blk_rq_bytes · 0d309923

由 Christoph Hellwig 提交于 2月 22, 2018

blk_rq_bytes does the wrong thing for special payloads like discards and
might cause the driver to not set up a SGL.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NKeith Busch <keith.busch@intel.com>

0d309923

nvme-fabrics: don't check for non-NULL module in nvmf_register_transport · 5a1e5953

由 Christoph Hellwig 提交于 2月 22, 2018

THIS_MODULE evaluates to NULL when used from code built into the kernel,
thus breaking built-in transport modules.  Remove the bogus check.

Fixes: 0de5cd36 ("nvme-fabrics: protect against module unload during create_ctrl")
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NKeith Busch <keith.busch@intel.com>

5a1e5953

14 2月, 2018 4 次提交

nvme-rdma: fix sysfs invoked reset_ctrl error flow · 8000d1fd

由 Nitzan Carmi 提交于 1月 17, 2018

When reset_controller that is invoked by sysfs fails,
it enters an error flow which practically removes the
nvme ctrl entirely (similar to delete_ctrl flow). It
causes the system to hang, since a sysfs attribute cannot
be unregistered by one of its own methods.

This can be fixed by calling delete_ctrl as a work rather
than sequential code. In addition, it should give the ctrl
a chance to recover using reconnection mechanism (consistant
with FC reset_ctrl error flow). Also, while we're here, return
suitable errno in case the reset ended with non live ctrl.
Signed-off-by: NNitzan Carmi <nitzanc@mellanox.com>
Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

8000d1fd

nvme-pci: Fix timeouts in connecting state · 4244140d

由 Keith Busch 提交于 2月 08, 2018

We need to halt the controller immediately if we haven't completed
initialization as indicated by the new "connecting" state.

Fixes: ad70062c ("nvme-pci: introduce RECONNECTING state to mark initializing procedure")
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

4244140d

nvme-pci: Remap CMB SQ entries on every controller reset · 815c6704

由 Keith Busch 提交于 2月 13, 2018

The controller memory buffer is remapped into a kernel address on each
reset, but the driver was setting the submission queue base address
only on the very first queue creation. The remapped address is likely to
change after a reset, so accessing the old address will hit a kernel bug.

This patch fixes that by setting the queue's CMB base address each time
the queue is created.

Fixes: f63572df ("nvme: unmap CMB and remove sysfs file in reset path")
Reported-by: NChristian Black <christian.d.black@intel.com>
Cc: Jon Derrick <jonathan.derrick@intel.com>
Cc: <stable@vger.kernel.org> # 4.9+
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

815c6704

nvme: fix the deadlock in nvme_update_formats · 3fd176b7

由 Jianchao Wang 提交于 2月 12, 2018

nvme_update_formats will invoke nvme_ns_remove under namespaces_mutext.
The will cause deadlock because nvme_ns_remove will also require
the namespaces_mutext. Fix it by getting the ns entries which should
be removed under namespaces_mutext and invoke nvme_ns_remove out of
namespaces_mutext.
Signed-off-by: NJianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

3fd176b7

13 2月, 2018 1 次提交

nvme: Don't use a stack buffer for keep-alive command · 0a34e466

由 Roland Dreier 提交于 1月 11, 2018

In nvme_keep_alive() we pass a request with a pointer to an NVMe command on
the stack into blk_execute_rq_nowait(). However, the block layer doesn't
guarantee that the request is fully queued before blk_execute_rq_nowait()
returns. If not, and the request is queued after nvme_keep_alive() returns,
then we'll end up using stack memory that might have been overwritten to
form the NVMe command we pass to hardware.

Fix this by keeping a special command struct in the nvme_ctrl struct right
next to the delayed work struct used for keep-alives.
Signed-off-by: NRoland Dreier <roland@purestorage.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

0a34e466

11 2月, 2018 2 次提交

nvme_fc: cleanup io completion · c3aedd22

由 James Smart 提交于 2月 06, 2018

There was some old cold that dealt with complete_rq being called
prior to the lldd returning the io completion. This is garbage code.
The complete_rq routine was being called after eh_timeouts were
called and it was due to eh_timeouts not being handled properly.
The timeouts were fixed in prior patches so that in general, a
timeout will initiate an abort and the reset timer restarted as
the abort operation will take care of completing things. Given the
reset timer restarted, the erroneous complete_rq calls were eliminated.

So remove the work that was synchronizing complete_rq with io
completion.
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NJames Smart <james.smart@broadcom.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

c3aedd22

nvme_fc: correct abort race condition on resets · 3efd6e8e

由 James Smart 提交于 2月 06, 2018

During reset handling, there is live io completing while the reset
is taking place. The reset path attempts to abort all outstanding io,
counting the number of ios that were reset. It then waits for those
ios to be reclaimed from the lldd before continuing.

The transport's logic on io state and flag setting was poor, allowing
ios to complete simultaneous to the abort request. The completed ios
were counted, but as the completion had already occurred, the
completion never reduced the count. As the count never zeros, the
reset/delete never completes.

Tighten it up by unconditionally changing the op state to completed
when the io done handler is called.  The reset/abort path now changes
the op state to aborted, but the abort only continues if the op
state was live priviously. If complete, the abort is backed out.
Thus proper counting of io aborts and their completions is working
again.

Also removed the TERMIO state on the op as it's redundant with the
op's aborted state.
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NJames Smart <james.smart@broadcom.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

3efd6e8e

09 2月, 2018 4 次提交

nvme: Fix discard buffer overrun · 8cb6af7b

由 Keith Busch 提交于 1月 31, 2018

This patch checks the discard range array bounds before setting it in
case the driver gets a badly formed request.
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

8cb6af7b

nvme: delete NVME_CTRL_LIVE --> NVME_CTRL_CONNECTING transition · 3096a739

由 Max Gurtovoy 提交于 1月 31, 2018

There is no logical reason to move from live state to connecting
state. In case of initial connection establishment, the transition
should be NVME_CTRL_NEW --> NVME_CTRL_CONNECTING --> NVME_CTRL_LIVE.
In case of error recovery or reset, the transition should be
NVME_CTRL_LIVE --> NVME_CTRL_RESETTING --> NVME_CTRL_CONNECTING -->
NVME_CTRL_LIVE.
Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

3096a739

nvme-rdma: use NVME_CTRL_CONNECTING state to mark init process · b754a32c

由 Max Gurtovoy 提交于 1月 31, 2018

In order to avoid concurrent error recovery during initialization
process (allowed by the NVME_CTRL_NEW --> NVME_CTRL_RESETTING transition)
we must mark the ctrl as CONNECTING before initial connection
establisment.
Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

b754a32c

nvme: rename NVME_CTRL_RECONNECTING state to NVME_CTRL_CONNECTING · ad6a0a52

由 Max Gurtovoy 提交于 1月 31, 2018

In pci transport, this state is used to mark the initialization
process. This should be also used in other transports as well.
Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

ad6a0a52

31 1月, 2018 1 次提交

blk-mq: introduce BLK_STS_DEV_RESOURCE · 86ff7c2a

由 Ming Lei 提交于 1月 30, 2018

This status is returned from driver to block layer if device related
resource is unavailable, but driver can guarantee that IO dispatch
will be triggered in future when the resource is available.

Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if driver
returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.  BLK_MQ_DELAY_QUEUE is
3 ms because both scsi-mq and nvmefc are using that magic value.

If a driver can make sure there is in-flight IO, it is safe to return
BLK_STS_DEV_RESOURCE because:

1) If all in-flight IOs complete before examining SCHED_RESTART in
blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
is run immediately in this case by blk_mq_dispatch_rq_list();

2) if there is any in-flight IO after/when examining SCHED_RESTART
in blk_mq_dispatch_rq_list():
- if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
- otherwise, this request will be dispatched after any in-flight IO is
  completed via blk_mq_sched_restart()

3) if SCHED_RESTART is set concurently in context because of
BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
cases and make sure IO hang can be avoided.

One invariant is that queue will be rerun if SCHED_RESTART is set.
Suggested-by: NJens Axboe <axboe@kernel.dk>
Tested-by: NLaurence Oberman <loberman@redhat.com>
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

86ff7c2a

26 1月, 2018 5 次提交

nvme: add tracepoint for nvme_complete_rq · ca5554a6

由 Johannes Thumshirn 提交于 1月 26, 2018

Add a tracepoint in nvme_complete_rq() for completions of NVMe commands. An
expmale output of the trace-point is as follows:

<idle>-0     [001] d.h.     3.505266: nvme_complete_rq: cmdid=989, qid=1, res=0, retries=0, flags=0x0, status=0
Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

ca5554a6

nvme: add tracepoint for nvme_setup_cmd · 3d030e41

由 Johannes Thumshirn 提交于 1月 26, 2018

Add tracepoints for nvme_setup_cmd() for tracing admin and/or nvm commands.

Examples of the two tracepoints are as follows for trace_nvme_setup_admin_cmd():
kworker/u8:0-5 [003] .... 2.998792: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_create_cq cqid=1, qsize=1023, cq_flags=0x3, irq_vector=0)

and trace_nvme_setup_nvm_cmd():
dd-205 [001] .... 3.503929: nvme_setup_nvm_cmd: qid=1, nsid=1, cmdid=989, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=4096, len=2047, ctrl=0x0, dsmgmt=0, reftag=0)
Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

3d030e41

nvme-pci: introduce RECONNECTING state to mark initializing procedure · ad70062c

由 Jianchao Wang 提交于 1月 22, 2018

After Sagi's commit (nvme-rdma: fix concurrent reset and reconnect),
both nvme-fc/rdma have following pattern:
RESETTING    - quiesce blk-mq queues, teardown and delete queues/
               connections, clear out outstanding IO requests...
RECONNECTING - establish new queues/connections and some other
               initializing things.
Introduce RECONNECTING to nvme-pci transport to do the same mark.
Then we get a coherent state definition among nvme pci/rdma/fc
transports.
Suggested-by: NJames Smart <james.smart@broadcom.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Reviewed-by: NReviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: NJianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

ad70062c

nvme-rdma: remove redundant boolean for inline_data · 1dad3a67

由 Max Gurtovoy 提交于 12月 06, 2017

Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

1dad3a67

nvme: don't free uuid pointer before printing it · 6e494120

由 Johannes Thumshirn 提交于 1月 25, 2018

Commit df351ef7 ("nvme-fabrics: fix memory leak when parsing host ID
option") fixed the leak of 'p' but in case uuid_parse() fails the memory
is freed before the error print that is using it.

Free it after printing eventual errors.
Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
Fixes: df351ef7 ("nvme-fabrics: fix memory leak when parsing host ID option")
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

6e494120

25 1月, 2018 1 次提交

nvme-pci: Suspend queues after deleting them · ee9aebb2

由 Keith Busch 提交于 1月 24, 2018

The driver had been abusing the cq_vector state to know if new submissions
were safe, but that was before we could quiesce blk-mq. If the controller
happens to get an interrupt through while we're suspending those queues,
'no irq handler' warnings may occur.

This patch will disable the interrupts only after the queues are deleted.
Reported-by: NJianchao Wang <jianchao.w.wang@oracle.com>
Tested-by: NJianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

ee9aebb2

24 1月, 2018 1 次提交

nvme-pci: Fix queue double allocations · 62314e40

由 Keith Busch 提交于 1月 23, 2018

The queue count says the highest queue that's been allocated, so don't
reallocate a queue lower than that.

Fixes: 147b27e4 ("nvme-pci: allocate device queues storage space at probe")
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

62314e40

18 1月, 2018 6 次提交

nvme-pci: take sglist coalescing in dma_map_sg into account · b0f2853b

由 Christoph Hellwig 提交于 1月 17, 2018

Some iommu implementations can merge physically and/or virtually
contiguous segments inside sg_map_dma.  The NVMe SGL support does not take
this into account and will warn because of falling off a loop.  Pass the
number of mapped segments to nvme_pci_setup_sgls so that the SGL setup
can take the number of mapped segments into account.
Reported-by: NFangjian (Turing) <f.fangjian@huawei.com>
Fixes: a7a7cbe3 ("nvme-pci: add SGL support")
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NSagi Grimberg <sagi@rimberg.me>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b0f2853b

nvme-pci: check segement valid for SGL use · 20469a37

由 Keith Busch 提交于 1月 17, 2018

The driver needs to verify there is a payload with a command before
seeing if it should use SGLs to map it.

Fixes: 955b1b5a ("nvme-pci: move use_sgl initialization to nvme_init_iod()")
Reported-by: NPaul Menzel <pmenzel+linux-nvme@molgen.mpg.de>
Reviewed-by: NPaul Menzel <pmenzel+linux-nvme@molgen.mpg.de>
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

20469a37

nvme-pci: clean up SMBSZ bit definitions · 88de4598

由 Christoph Hellwig 提交于 12月 20, 2017

Define the bit positions instead of macros using the magic values,
and move the expanded helpers to calculate the size and size unit into
the implementation C file.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NLogan Gunthorpe <logang@deltatee.com>

88de4598

nvme-pci: clean up CMB initialization · f65efd6d

由 Christoph Hellwig 提交于 12月 20, 2017

Refactor the call to nvme_map_cmb, and change the conditions for probing
for the CMB.  First remove the version check as NVMe TPs always apply
to earlier versions of the spec as well.  Second check for the whole CMBSZ
register for support of the CMB feature instead of just the size field
inside of it to simplify the code a bit.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NLogan Gunthorpe <logang@deltatee.com>

f65efd6d

nvme-fc: correct hang in nvme_ns_remove() · 0fd997d3

由 James Smart 提交于 1月 11, 2018

When connectivity is lost to a device, the association is terminated
and the blk-mq queues are quiesced/stopped. When connectivity is
re-established, they are resumed.

If connectivity is lost for a sufficient amount of time that the
controller is then deleted, the delete path starts tearing down queues,
and eventually calling nvme_ns_remove(). It appears that pending
commands may cause blk_cleanup_queue() to never complete and the
teardown stalls.

Correct by starting the ns queues after transitioning to a DELETING
state, allowing pending commands to be flushed with io failures. Thus
the delete path is clear when reached.
Signed-off-by: NJames Smart <james.smart@broadcom.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

0fd997d3

nvme-fc: fix rogue admin cmds stalling teardown · d625d05e

由 James Smart 提交于 1月 11, 2018

When connectivity is lost to a device, the association is terminated
and the blk-mq queues are quiesced/stopped. When connectivity is
re-established, they are resumed.

If an admin command is received while connectivity is list, the ioctl
queues the command on the admin_q and the command stalls (the thread
issuing the ioctl hangs/waits). if the connectivity is lost long
enough such that the controller is then deleted, the delete code
makes its calls to initiate the delete, which then expects the core
layer to call the transport when all references are removed and the
controller can be freed. Unfortunately, nothing in this path dequeued
the admin command, so a reference sits outstanding and things stop,
hanging the delete indefinitely.

Correct by unquiescing the admin queue in the delete association. This
means any admin command (which should only be from an ioctl) issued
after connectivity is lost will detect the controller is in a
reconnecting state and will (fast) fail the command. Thus, a pending
reference can no longer be created. Once connectivity is re-established,
a new ioctl/admin command would see proper device state and function again.
Signed-off-by: NJames Smart <james.smart@broadcom.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

d625d05e

16 1月, 2018 3 次提交

nvme-fabrics: fix memory leak when parsing host ID option · df351ef7

由 Roland Dreier 提交于 1月 11, 2018

We use match_strdup() to get a copy of the option string for host ID string, but
we just pass it to uuid_parse() and don't store the string pointer, so we need to
kfree() the string after parsing it.
Signed-off-by: NRoland Dreier <roland@purestorage.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

df351ef7

nvme: fix comment typos in nvme_create_io_queues · 8adb8c14

由 Minwoo Im 提交于 1月 14, 2018

fix comment typos in nvme_create_io_queues() like below.
  _aount_ to _amount_
  _an_    to _can_
Signed-off-by: NMinwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

8adb8c14

nvme: host delete_work and reset_work on separate workqueues · b227c59b

由 Roy Shterman 提交于 1月 14, 2018

We need to ensure that delete_work will be hosted on a different
workqueue than all the works we flush or cancel from it.
Otherwise we may hit a circular dependency warning [1].

Also, given that delete_work flushes reset_work, host reset_work
on nvme_reset_wq and delete_work on nvme_delete_wq. In addition,
fix the flushing in the individual drivers to flush nvme_delete_wq
when draining queued deletes.

[1]:
[  178.491942] =============================================
[  178.492718] [ INFO: possible recursive locking detected ]
[  178.493495] 4.9.0-rc4-c844263313a8-lb #3 Tainted: G           OE
[  178.494382] ---------------------------------------------
[  178.495160] kworker/5:1/135 is trying to acquire lock:
[  178.495894]  (
[  178.496120] "nvme-wq"
[  178.496471] ){++++.+}
[  178.496599] , at:
[  178.496921] [<ffffffffa70ac206>] flush_work+0x1a6/0x2d0
[  178.497670]
               but task is already holding lock:
[  178.498499]  (
[  178.498724] "nvme-wq"
[  178.499074] ){++++.+}
[  178.499202] , at:
[  178.499520] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[  178.500343]
               other info that might help us debug this:
[  178.501269]  Possible unsafe locking scenario:

[  178.502113]        CPU0
[  178.502472]        ----
[  178.502829]   lock(
[  178.503115] "nvme-wq"
[  178.503467] );
[  178.503716]   lock(
[  178.504001] "nvme-wq"
[  178.504353] );
[  178.504601]
                *** DEADLOCK ***

[  178.505441]  May be due to missing lock nesting notation

[  178.506453] 2 locks held by kworker/5:1/135:
[  178.507068]  #0:
[  178.507330]  (
[  178.507598] "nvme-wq"
[  178.507726] ){++++.+}
[  178.508079] , at:
[  178.508173] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[  178.509004]  #1:
[  178.509265]  (
[  178.509532] (&ctrl->delete_work)
[  178.509795] ){+.+.+.}
[  178.510145] , at:
[  178.510239] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[  178.511070]
               stack backtrace:
:
[  178.511693] CPU: 5 PID: 135 Comm: kworker/5:1 Tainted: G           OE   4.9.0-rc4-c844263313a8-lb #3
[  178.512974] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
[  178.514247] Workqueue: nvme-wq nvme_del_ctrl_work [nvme_tcp]
[  178.515071]  ffffc2668175bae0 ffffffffa7450823 ffffffffa88abd80 ffffffffa88abd80
[  178.516195]  ffffc2668175bb98 ffffffffa70eb012 ffffffffa8d8d90d ffff9c472e9ea700
[  178.517318]  ffff9c472e9ea700 ffff9c4700000000 ffff9c4700007200 ab83be61bec0d50e
[  178.518443] Call Trace:
[  178.518807]  [<ffffffffa7450823>] dump_stack+0x85/0xc2
[  178.519542]  [<ffffffffa70eb012>] __lock_acquire+0x17d2/0x18f0
[  178.520377]  [<ffffffffa75839a7>] ? serial8250_console_putchar+0x27/0x30
[  178.521330]  [<ffffffffa7583980>] ? wait_for_xmitr+0xa0/0xa0
[  178.522174]  [<ffffffffa70ac1eb>] ? flush_work+0x18b/0x2d0
[  178.522975]  [<ffffffffa70eb7cb>] lock_acquire+0x11b/0x220
[  178.523753]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
[  178.524535]  [<ffffffffa70ac229>] flush_work+0x1c9/0x2d0
[  178.525291]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
[  178.526077]  [<ffffffffa70a9cf0>] ? flush_workqueue_prep_pwqs+0x220/0x220
[  178.527040]  [<ffffffffa70ae7cf>] __cancel_work_timer+0x10f/0x1d0
[  178.527907]  [<ffffffffa70fecb9>] ? vprintk_default+0x29/0x40
[  178.528726]  [<ffffffffa71cb507>] ? printk+0x48/0x50
[  178.529434]  [<ffffffffa70ae8c3>] cancel_delayed_work_sync+0x13/0x20
[  178.530381]  [<ffffffffc042100b>] nvme_stop_ctrl+0x5b/0x70 [nvme_core]
[  178.531314]  [<ffffffffc0403dcc>] nvme_del_ctrl_work+0x2c/0x50 [nvme_tcp]
[  178.532271]  [<ffffffffa70ad741>] process_one_work+0x1e1/0x6a0
[  178.533101]  [<ffffffffa70ad6c2>] ? process_one_work+0x162/0x6a0
[  178.533954]  [<ffffffffa70adc4e>] worker_thread+0x4e/0x490
[  178.534735]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
[  178.535588]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
[  178.536441]  [<ffffffffa70b48cf>] kthread+0xff/0x120
[  178.537149]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
[  178.538094]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
[  178.538900]  [<ffffffffa78e332a>] ret_from_fork+0x2a/0x40
Signed-off-by: NRoy Shterman <roys@lightbitslabs.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

b227c59b

15 1月, 2018 2 次提交

nvme-pci: allocate device queues storage space at probe · 147b27e4

由 Sagi Grimberg 提交于 1月 14, 2018

It may cause race by setting 'nvmeq' in nvme_init_request()
because .init_request is called inside switching io scheduler, which
may happen when the NVMe device is being resetted and its nvme queues
are being freed and created. We don't have any sync between the two
pathes.

This patch changes the nvmeq allocation to occur at probe time so
there is no way we can dereference it at init_request.

[   93.268391] kernel BUG at drivers/nvme/host/pci.c:408!
[   93.274146] invalid opcode: 0000 [#1] SMP
[   93.278618] Modules linked in: nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss
nfsv4 dns_resolver nfs lockd grace fscache sunrpc ipmi_ssif vfat fat
intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel
kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
intel_cstate ipmi_si iTCO_vendor_support intel_uncore mxm_wmi mei_me
ipmi_devintf intel_rapl_perf pcspkr sg ipmi_msghandler lpc_ich dcdbas mei
shpchp acpi_power_meter wmi dm_multipath ip_tables xfs libcrc32c sd_mod
mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops ttm drm ahci libahci nvme libata crc32c_intel nvme_core tg3
megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[   93.349071] CPU: 5 PID: 1842 Comm: sh Not tainted 4.15.0-rc2.ming+ #4
[   93.356256] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
[   93.364801] task: 00000000fb8abf2a task.stack: 0000000028bd82d1
[   93.371408] RIP: 0010:nvme_init_request+0x36/0x40 [nvme]
[   93.377333] RSP: 0018:ffffc90002537ca8 EFLAGS: 00010246
[   93.383161] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000008
[   93.391122] RDX: 0000000000000000 RSI: ffff880276ae0000 RDI: ffff88047bae9008
[   93.399084] RBP: ffff88047bae9008 R08: ffff88047bae9008 R09: 0000000009dabc00
[   93.407045] R10: 0000000000000004 R11: 000000000000299c R12: ffff880186bc1f00
[   93.415007] R13: ffff880276ae0000 R14: 0000000000000000 R15: 0000000000000071
[   93.422969] FS:  00007f33cf288740(0000) GS:ffff88047ba80000(0000) knlGS:0000000000000000
[   93.431996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   93.438407] CR2: 00007f33cf28e000 CR3: 000000047e5bb006 CR4: 00000000001606e0
[   93.446368] Call Trace:
[   93.449103]  blk_mq_alloc_rqs+0x231/0x2a0
[   93.453579]  blk_mq_sched_alloc_tags.isra.8+0x42/0x80
[   93.459214]  blk_mq_init_sched+0x7e/0x140
[   93.463687]  elevator_switch+0x5a/0x1f0
[   93.467966]  ? elevator_get.isra.17+0x52/0xc0
[   93.472826]  elv_iosched_store+0xde/0x150
[   93.477299]  queue_attr_store+0x4e/0x90
[   93.481580]  kernfs_fop_write+0xfa/0x180
[   93.485958]  __vfs_write+0x33/0x170
[   93.489851]  ? __inode_security_revalidate+0x4c/0x60
[   93.495390]  ? selinux_file_permission+0xda/0x130
[   93.500641]  ? _cond_resched+0x15/0x30
[   93.504815]  vfs_write+0xad/0x1a0
[   93.508512]  SyS_write+0x52/0xc0
[   93.512113]  do_syscall_64+0x61/0x1a0
[   93.516199]  entry_SYSCALL64_slow_path+0x25/0x25
[   93.521351] RIP: 0033:0x7f33ce96aab0
[   93.525337] RSP: 002b:00007ffe57570238 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   93.533785] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f33ce96aab0
[   93.541746] RDX: 0000000000000006 RSI: 00007f33cf28e000 RDI: 0000000000000001
[   93.549707] RBP: 00007f33cf28e000 R08: 000000000000000a R09: 00007f33cf288740
[   93.557669] R10: 00007f33cf288740 R11: 0000000000000246 R12: 00007f33cec42400
[   93.565630] R13: 0000000000000006 R14: 0000000000000001 R15: 0000000000000000
[   93.573592] Code: 4c 8d 40 08 4c 39 c7 74 16 48 8b 00 48 8b 04 08 48 85 c0
74 16 48 89 86 78 01 00 00 31 c0 c3 8d 4a 01 48 63 c9 48 c1 e1 03 eb de <0f>
0b 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 85 f6 53 48 89
[   93.594676] RIP: nvme_init_request+0x36/0x40 [nvme] RSP: ffffc90002537ca8
[   93.602273] ---[ end trace 810dde3993e5f14e ]---
Reported-by: NYi Zhang <yi.zhang@redhat.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

147b27e4

nvme-pci: serialize pci resets · 79c48ccf

由 Sagi Grimberg 提交于 1月 14, 2018

Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

79c48ccf

11 1月, 2018 1 次提交

nvme/multipath: Use blk_path_error · e1f425e7

由 Keith Busch 提交于 1月 09, 2018

Uses common code for determining if an error should be retried on
alternate path.
Acked-by: NMike Snitzer <snitzer@redhat.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e1f425e7

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功