1. 06 October 2017, 1 commit
    • Merge branch 'nvme-4.14' of git://git.infradead.org/nvme into for-linus · d7b544de
      Authored by Jens Axboe
      Pull NVMe fixes from Christoph:
      
      "A trivial one-liner from Martin to fix the visible of the uuid attr,
      and another one (originally from Abhishek Shah, rewritten by me) to fix
      the CMB addresses passed back to the controller in case of a system that
      remaps BAR addresses between host and device."
      d7b544de
  2. 04 October 2017, 6 commits
    • bsg-lib: fix use-after-free under memory-pressure · eab40cf3
      Authored by Benjamin Block
      Under memory pressure it is possible that the mempool which backs the
      'struct request_queue' will make use of up to BLKDEV_MIN_RQ preallocated
      emergency buffers, in case it can't get a regular allocation. Once these
      are also in use, they are re-supplied with old, finished requests from
      the same request_queue (see mempool_free()).
      
      The bug: when re-supplying the emergency pool, the old requests are not
      run through the callback mempool_t->alloc() again, and thus also not
      through bsg_init_rq(). So initialization is skipped and, while the
      sense-buffer should still be good, scsi_request->cmd may have become an
      invalid pointer in the meantime. When the request is initialized in
      bsg.c and the user's CDB is larger than BLK_MAX_CDB, bsg replaces it
      with a custom allocated buffer, which is freed when the user's command
      finishes, so the pointer dangles afterwards. When the user next sends a
      command whose CDB is no larger than BLK_MAX_CDB, bsg assumes that
      scsi_request->cmd is backed by scsi_request->__cmd, makes no custom
      allocation, and writes into undefined memory.
      
      Fix this by splitting bsg_init_rq() into two functions (a condensed
      sketch follows this entry):
       - bsg_init_rq() is changed to only do the allocation of the
         sense-buffer, which is used to back the bsg job's reply buffer. This
         pointer should never change during the lifetime of a scsi_request, so
         it doesn't need re-initialization.
       - bsg_initialize_rq() is a new function that makes use of
         'struct request_queue's initialize_rq_fn callback (which was
         introduced in v4.12). This is always called before the request is
         given out via blk_get_request(). This function does the remaining
         initialization that was previously done in bsg_init_rq(), and will
         also do it when the request is taken from the emergency-pool of the
         backing mempool.
      
      Fixes: 50b4d485 ("bsg-lib: fix kernel panic resulting from missing allocation of reply-buffer")
      Cc: <stable@vger.kernel.org> # 4.11+
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Benjamin Block <bblock@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eab40cf3
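      A condensed sketch of the split described above, assuming the
      v4.14-era bsg-lib names (not the verbatim patch):

      	static int bsg_init_rq(struct request_queue *q, struct request *req, gfp_t gfp)
      	{
      		struct bsg_job *job = blk_mq_rq_to_pdu(req);

      		/* Runs via mempool_t->alloc() only when the request memory is
      		 * first allocated, so only the long-lived sense-buffer is set
      		 * up here. */
      		job->sreq.sense = kzalloc(SCSI_SENSE_BUFFERSIZE, gfp);
      		return job->sreq.sense ? 0 : -ENOMEM;
      	}

      	static void bsg_initialize_rq(struct request *req)
      	{
      		struct bsg_job *job = blk_mq_rq_to_pdu(req);
      		void *sense = job->sreq.sense;

      		/* Runs on every blk_get_request(), including requests recycled
      		 * through the mempool's emergency pool, so per-use state is
      		 * reset here while the sense-buffer allocation is preserved. */
      		memset(job, 0, sizeof(*job));
      		scsi_req_init(&job->sreq);
      		job->sreq.sense = sense;
      		job->sreq.sense_len = SCSI_SENSE_BUFFERSIZE;
      	}

      	/* wired up during queue setup: */
      	q->init_rq_fn       = bsg_init_rq;
      	q->initialize_rq_fn = bsg_initialize_rq;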
    • nvme-pci: Use PCI bus address for data/queues in CMB · 8969f1f8
      Authored by Christoph Hellwig
      Currently the NVMe PCI host driver programs the CMB dma address as the
      I/O SQ addresses. This results in failures on systems where 1:1
      outbound mapping is not used (for example, Broadcom iProc SoCs),
      because the CMB BAR is programmed with the PCI bus address while the
      NVMe PCI EP tries to access the CMB using the dma address.
      
      To have the CMB working on systems without 1:1 outbound mapping,
      program the PCI bus address for the I/O SQs instead of the dma address.
      This approach works on systems with and without 1:1 outbound mapping.
      
      Based on a report and previous patch from Abhishek Shah.
      
      Fixes: 8ffaadf7 ("NVMe: Use CMB for the IO SQes if available")
      Cc: stable@vger.kernel.org
      Reported-by: Abhishek Shah <abhishek.shah@broadcom.com>
      Tested-by: Abhishek Shah <abhishek.shah@broadcom.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      8969f1f8
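      A minimal sketch of the idea, assuming the v4.14-era nvme-pci field
      names (not the verbatim patch): record the device-visible address of
      the CMB BAR via pci_bus_address(), and hand that, rather than the
      host-side resource address, to the controller.

      	/* in nvme_map_cmb(): remember both views of the CMB BAR */
      	int bar = NVME_CMB_BIR(dev->cmbloc);

      	dev->cmb = ioremap_wc(pci_resource_start(pdev, bar) + offset, size);
      	dev->cmb_bus_addr = pci_bus_address(pdev, bar) + offset;

      	/* in nvme_alloc_sq_cmds(): the controller gets the bus address */
      	nvmeq->sq_cmds_io  = dev->cmb + sq_offset;           /* host mapping */
      	nvmeq->sq_dma_addr = dev->cmb_bus_addr + sq_offset;  /* device view  */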
    • blk-mq-debugfs: fix device sched directory for default scheduler · 70e62f4b
      Authored by Omar Sandoval
      In blk_mq_debugfs_register(), I remembered to set up the per-hctx sched
      directories if a default scheduler was already configured by
      blk_mq_sched_init() from blk_mq_init_allocated_queue(), but I didn't do
      the same for the device-wide sched directory. Fix it.
      
      Fixes: d332ce09 ("blk-mq-debugfs: allow schedulers to register debugfs attributes")
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      70e62f4b
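      The missing piece, sketched against blk_mq_debugfs_register() as
      described above (hedged; the helper names assume the v4.14 debugfs
      code):

      	/* in blk_mq_debugfs_register(): the per-hctx sched dirs were
      	 * already handled; also create the device-wide sched dir when a
      	 * default scheduler is in place but its dir was never registered */
      	if (q->elevator && !q->sched_debugfs_dir)
      		blk_mq_debugfs_register_sched(q);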
    • null_blk: change configfs dependency to select · 6cd1a6fe
      Authored by Jens Axboe
      A recent commit made null_blk depend on configfs, which is kind of
      annoying since you now have to find that dependency and enable it as
      well. I discovered this when I no longer had null_blk available on a
      box I needed to debug: the option was dropped when the config was
      updated after the configfs change was merged.
      
      Fixes: 3bf2bd20 ("nullb: add configfs interface")
      Reviewed-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6cd1a6fe
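      The shape of the fix in Kconfig terms, sketched as a diff (not
      necessarily the verbatim patch): 'select' pulls CONFIGFS_FS in
      automatically instead of hiding null_blk until the user hunts down
      the dependency.

      	 config BLK_DEV_NULL_BLK
      	 	tristate "Null test block driver"
      	-	depends on CONFIGFS_FS
      	+	select CONFIGFS_FS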
    • blk-throttle: fix possible io stall when upgrade to max · 4f02fb76
      Authored by Joseph Qi
      There is a case that leads to io stall, described as follows.
      /test1
        |-subtest1
      /test2
        |-subtest2
      subtest1 and subtest2 each already have 32 queued bios.
      
      Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
      bios as follows:
      1) tg=subtest1, do nothing;
      2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
      left, no need to schedule next dispatch;
      3) tg=subtest2, do nothing;
      4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
      left, no need to schedule next dispatch;
      5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
      test2 to /, 8 queued bios from test1 to /, and 8 queued bios from test2
      to /; note that test1 and test2 each still has 16 queued bios left;
      6) tg=/, try to schedule the next dispatch, but since disptime is now
      (updated in tg_update_disptime with wait=0), the pending timer is in
      fact not scheduled;
      7) in total, throtl_upgrade_state dispatches 32 queued bios, with 32
      left over; test1 and test2 each have 16 queued bios;
      8) throtl_pending_timer_fn sees the leftover bios but can do nothing,
      because throtl_select_dispatch returns 0 and test1/test2 have no
      pending tg.
      
      The blktrace shows the following:
      8,32   0        0     2.539007641     0  m   N throtl upgrade to max
      8,32   0        0     2.539072267     0  m   N throtl /test2 dispatch nr_queued=16 read=0 write=16
      8,32   7        0     2.539077142     0  m   N throtl /test1 dispatch nr_queued=16 read=0 write=16
      
      So force schedule dispatch if there are pending children (see the
      sketch after this entry).
      Reviewed-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Joseph Qi <qijiang.qj@alibaba-inc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4f02fb76
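      A sketch of what "force schedule dispatch" means in
      throtl_upgrade_state(), assuming the v4.14-era helpers (not the
      verbatim patch): pass force=true so the pending timer is armed even
      though disptime is "now".

      	blkg_for_each_descendant_post(blkg, pos_css, td->queue->root_blkg) {
      		struct throtl_grp *tg = blkg_to_tg(blkg);
      		struct throtl_service_queue *sq = &tg->service_queue;

      		tg->disptime = jiffies - 1;
      		throtl_select_dispatch(sq);
      		/* force=true: schedule even when tg_update_disptime()
      		 * computed disptime == now, so leftover children still
      		 * get drained by a later dispatch */
      		throtl_schedule_next_dispatch(sq, true);
      	}
      	throtl_select_dispatch(&td->service_queue);
      	throtl_schedule_next_dispatch(&td->service_queue, true);
      	queue_work(kthrotld_workqueue, &td->dispatch_work);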
    • MAINTAINERS: update list for NBD · 38b249bc
      Authored by Wouter Verhelst
      nbd-general@sourceforge.net becomes nbd@other.debian.org, because
      sourceforge is just a spamtrap these days.
      Signed-off-by: Wouter Verhelst <w@uter.be>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      38b249bc
  3. 03 October 2017, 1 commit
    • nbd: fix -ERESTARTSYS handling · 6e60a3bb
      Authored by Josef Bacik
      After Christoph switched ->queue_rq to blk_status_t, whenever we got
      -ERESTARTSYS from sending our packets we ended up returning BLK_STS_OK
      instead of BLK_STS_RESOURCE, which means we'd never requeue and would
      just hang.  We really need to return the right value to the upper
      layer.
      
      Fixes: fc17b653 ("blk-mq: switch ->queue_rq return value to blk_status_t")
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6e60a3bb
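      A minimal sketch of the idea (the real function also handles command
      completion and locking; names assume the v4.14 nbd driver): let the
      blk_status_t from the send path reach blk-mq unchanged, so an
      interrupted send maps to BLK_STS_RESOURCE and the request is requeued.

      	static blk_status_t nbd_queue_rq(struct blk_mq_hw_ctx *hctx,
      					 const struct blk_mq_queue_data *bd)
      	{
      		struct nbd_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);

      		/* nbd_handle_cmd() returns BLK_STS_RESOURCE when sending
      		 * the packets got -ERESTARTSYS; passing that through makes
      		 * blk-mq requeue the request instead of hanging. */
      		return nbd_handle_cmd(cmd, hctx->queue_num);
      	}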
  4. 01 October 2017, 1 commit
  5. 28 September 2017, 1 commit
    • bcache: use llist_for_each_entry_safe() in __closure_wake_up() · a5f3d8a5
      Authored by Coly Li
      Commit 09b3efec ("bcache: Don't reinvent the wheel but use existing llist
      API") replaced the following while loop with llist_for_each_entry():
      
      -	while (reverse) {
      -		cl = container_of(reverse, struct closure, list);
      -		reverse = llist_next(reverse);
      -
      +	llist_for_each_entry(cl, reverse, list) {
       		closure_set_waiting(cl, 0);
       		closure_sub(cl, CLOSURE_WAITING + 1);
       	}
      
      This modification introduces a potential race by iterating a corrupted
      list. Here is how it happens.
      
      In the above modification, closure_sub() may wake up a process that is
      waiting on the reverse list. If that process decides to wait again by
      calling closure_wait(), its cl->list is added to another wait list. When
      llist_for_each_entry() then advances to the next node, it walks the new
      wait list built up in closure_wait(), not the original reverse list in
      __closure_wake_up(). This is more likely to happen on a UP machine,
      because the woken process may preempt the process that woke it.
      
      Using llist_for_each_entry_safe() fixes the issue: the safe version
      fetches the next node before waking up a process, so the saved copy of
      the next node keeps the list iteration on the original reverse list.
      
      Fixes: 09b3efec ("bcache: Don't reinvent the wheel but use existing llist API")
      Signed-off-by: Coly Li <colyli@suse.de>
      Reported-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Byungchul Park <byungchul.park@lge.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a5f3d8a5
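      For reference, a sketch of the fixed loop in __closure_wake_up()
      (hedged against the upstream closure code): the _safe variant caches
      the next node before closure_sub() can wake the waiter and let it
      re-link cl->list onto a different wait list.

      	void __closure_wake_up(struct closure_waitlist *wait_list)
      	{
      		struct llist_node *list, *reverse;
      		struct closure *cl, *t;

      		list = llist_del_all(&wait_list->list);
      		/* reverse to preserve FIFO ordering and fairness */
      		reverse = llist_reverse_order(list);

      		/* _safe: 't' holds the next node before 'cl' may requeue */
      		llist_for_each_entry_safe(cl, t, reverse, list) {
      			closure_set_waiting(cl, 0);
      			closure_sub(cl, CLOSURE_WAITING + 1);
      		}
      	}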
  6. 27 September 2017, 2 commits
  7. 26 September 2017, 15 commits
  8. 25 September 2017, 13 commits