1. 23 Mar 2015, 1 commit
  2. 24 Feb 2015, 1 commit
  3. 20 Feb 2015, 6 commits
    • NVMe: Fix potential corruption on sync commands · 0c0f9b95
      By Keith Busch
      This makes all sync commands uninterruptible and schedules without timeout
      so the controller either has to post a completion or the timeout recovery
      fails the command. This fixes potential memory or data corruption from
      a command timing out too early or being woken by a signal. Previously any DMA
      buffers mapped for that command would have been released even though we
      don't know what the controller is planning to do with those addresses.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
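The fix above boils down to waiting for the completion with no timeout and no interruptibility, so the buffers tied to the command are released only after the controller (or timeout recovery) has actually responded. A minimal userspace sketch of that wait pattern, with pthreads standing in for the kernel's wait primitives and all names invented:

```c
#include <pthread.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* Illustrative stand-in for a sync command: the submitter sleeps on a
 * condition variable with no timeout, so the buffer below can only be
 * reused after a completion has really been posted. */
struct sync_cmd {
    pthread_mutex_t lock;
    pthread_cond_t  done_cv;
    bool            done;
    int             status;
    char            buf[512];   /* stands in for the mapped DMA buffer */
};

static void *controller_side(void *arg)
{
    struct sync_cmd *cmd = arg;

    usleep(1000);               /* simulated command latency */
    pthread_mutex_lock(&cmd->lock);
    memset(cmd->buf, 0xab, sizeof(cmd->buf));  /* "device" writes data */
    cmd->status = 0;
    cmd->done = true;
    pthread_cond_signal(&cmd->done_cv);
    pthread_mutex_unlock(&cmd->lock);
    return NULL;
}

int submit_sync_cmd(void)
{
    struct sync_cmd cmd = {
        .lock    = PTHREAD_MUTEX_INITIALIZER,
        .done_cv = PTHREAD_COND_INITIALIZER,
        .done    = false,
        .status  = -1,
    };
    pthread_t t;

    pthread_create(&t, NULL, controller_side, &cmd);

    pthread_mutex_lock(&cmd.lock);
    while (!cmd.done)           /* uninterruptible wait, no timeout */
        pthread_cond_wait(&cmd.done_cv, &cmd.lock);
    pthread_mutex_unlock(&cmd.lock);

    pthread_join(t, NULL);
    return cmd.status;          /* cmd.buf may be released only now */
}
```

The point is the shape of the loop: there is no timed or interruptible exit path, so the on-stack buffer cannot be recycled while the other side may still write to it.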
    • NVMe: Remove unused variables · 48328518
      By Keith Busch
      We don't track queues in a llist, subscribe to hot-cpu notifications,
      or internally retry commands. Delete the unused artifacts.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
    • NVMe: Fix potential corruption during shutdown · 07836e65
      By Keith Busch
      The driver has to end unreturned commands at some point even if the
      controller has not provided a completion. The driver tried to be safe by
      deleting IO queues prior to ending all unreturned commands. That should
      cause the controller to internally abort in-flight commands, but the IO
      queue deletion request is not guaranteed to succeed, so all bets are off. We
      still have to make progress, so to be extra safe, this patch doesn't
      clear a queue (which releases the DMA mappings for its commands) until
      after the PCI device has been disabled.
      
      This patch also removes the special handling during device initialization
      so controller recovery can be done at any time. This is possible since
      initialization is no longer inlined with PCI probe.
      Reported-by: Nilish Choudhury <nilesh.choudhury@oracle.com>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
    • NVMe: Asynchronous controller probe · 2e1d8448
      By Keith Busch
      This performs the longest parts of nvme device probe in scheduled work.
      This speeds up probe significantly when multiple devices are in use.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
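Moving the slow half of probe into scheduled work means probe() itself returns almost immediately, so several controllers can initialize concurrently. A hedged userspace sketch of the idea, with worker threads standing in for the kernel workqueue and all names invented:

```c
#include <pthread.h>

struct dev_state {
    pthread_t worker;
    int       initialized;
};

/* The long part of probe, pushed off the probe path itself. */
static void *probe_work(void *arg)
{
    struct dev_state *dev = arg;
    /* ...reset controller, set up queues, scan namespaces... */
    dev->initialized = 1;
    return NULL;
}

/* probe() only schedules the work and returns right away. */
static void probe(struct dev_state *dev)
{
    dev->initialized = 0;
    pthread_create(&dev->worker, NULL, probe_work, dev);
}

int probe_all(struct dev_state devs[], int n)
{
    int i, ready = 0;

    for (i = 0; i < n; i++)     /* all probes kicked off back to back */
        probe(&devs[i]);
    for (i = 0; i < n; i++) {   /* initialization proceeds in parallel */
        pthread_join(devs[i].worker, NULL);
        ready += devs[i].initialized;
    }
    return ready;
}
```

With N devices, total probe time approaches the slowest single device instead of the sum of all of them.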
    • NVMe: Register management handle under nvme class · b3fffdef
      By Keith Busch
      This creates a new class type for nvme devices to register their
      management character devices with. This is so we do not rely on miscdev
      to provide enough minors for as many nvme devices as some people plan to
      use. The previous limit was approximately 60 NVMe controllers, depending
      on the platform and kernel. Now the limit is 1M, which ought to be enough
      for anybody.
      
      Since we have a new device class, it makes sense to attach the block
      devices under it as well, so part of this patch moves the management
      handle initialization prior to namespace discovery.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
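With a dedicated class the driver hands out instance numbers itself rather than competing for miscdev minors. A sketch of smallest-free-slot allocation, the behavior an IDA provides in the kernel (names and the bitmap representation are invented for illustration):

```c
#define NVME_MINORS (1 << 20)   /* "1M ought to be enough for anybody" */

static unsigned char minor_used[NVME_MINORS];

/* Allocate the smallest free instance number, IDA-style. */
int nvme_alloc_instance(void)
{
    int i;

    for (i = 0; i < NVME_MINORS; i++) {
        if (!minor_used[i]) {
            minor_used[i] = 1;
            return i;
        }
    }
    return -1;                  /* all 1M instances in use */
}

void nvme_release_instance(int i)
{
    if (i >= 0 && i < NVME_MINORS)
        minor_used[i] = 0;
}
```

Released numbers are reused lowest-first, so device names stay dense even as controllers come and go.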
    • NVMe: Metadata format support · e1e5e564
      By Keith Busch
      Adds support for NVMe metadata formats and exposes block devices for
      all namespaces regardless of their format. Namespace formats that are
      unusable will have disk capacity set to 0, but a handle to the block
      device is created to simplify device management. A namespace is not
      usable when the format requires the host to interleave block data and
      metadata in a single buffer, has no provisioned storage, or has
      metadata but failed to register with blk integrity.
      
      The namespace has to be scanned in two phases to support separate
      metadata formats. The first establishes the sector size and capacity
      prior to invoking add_disk. If metadata is required, the capacity will
      be temporarily set to 0 until it can be revalidated and registered with
      the integrity extensions after add_disk completes.
      
      The driver relies on the integrity extensions to provide the metadata
      buffer. NVMe requires this be a single physically contiguous region,
      so only one integrity segment is allowed per command. If the metadata
      is used for T10 PI, the driver provides mappings to save and restore
      the reftag physical block translation. The driver provides no-op
      functions for generate and verify if metadata is not used for protection
      information. This way the setup is always provided by the block layer.
      
      If a request does not supply a required metadata buffer, the command
      is failed with bad address. This could only happen if a user manually
      disables verify/generate on such a disk. The only exception to where
      this is okay is if the controller is capable of stripping/generating
      the metadata, which is possible on some types of formats.
      
      The metadata scatter gather list now occupies the spot in the nvme_iod
      that used to be used to link retryable IOD's, but we don't do that
      anymore, so the field was unused.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
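The two-phase scan described above can be sketched as a pair of steps around add_disk: phase 1 pins the sector size but holds capacity at 0 when metadata is involved, and phase 2 reveals the real capacity only once integrity registration succeeds. All structure and function names here are invented for illustration:

```c
#include <stdbool.h>

struct ns {
    unsigned            sector_size;
    unsigned long long  capacity;   /* in sectors; 0 = not yet usable */
    bool                needs_metadata;
    bool                integrity_registered;
};

/* Phase 1: runs before add_disk(). Establish the sector size, but if
 * the format carries metadata, keep capacity at 0 for now. */
void ns_scan_phase1(struct ns *ns, unsigned sector_size,
                    unsigned long long sectors, bool needs_metadata)
{
    ns->sector_size = sector_size;
    ns->needs_metadata = needs_metadata;
    ns->integrity_registered = false;
    ns->capacity = needs_metadata ? 0 : sectors;
}

/* Phase 2: runs after add_disk(). Register with the integrity
 * extensions and only then expose the real capacity. */
void ns_scan_phase2(struct ns *ns, unsigned long long sectors,
                    bool integrity_ok)
{
    if (!ns->needs_metadata)
        return;
    ns->integrity_registered = integrity_ok;
    ns->capacity = integrity_ok ? sectors : 0;
}
```

An unusable format simply keeps capacity 0 forever, which matches the commit's "block device exists but has zero capacity" behavior.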
  4. 30 Jan 2015, 1 commit
    • NVMe: avoid kmalloc/kfree for smaller IO · ac3dd5bd
      By Jens Axboe
      Currently we allocate an nvme_iod for each IO, which holds the
      sg list, prps, and other IO related info. Set a threshold of
      2 pages and/or 8KB of data, below which we can just embed this
      in the per-command pdu in blk-mq. For any IO at or below
      NVME_INT_PAGES and NVME_INT_BYTES, we save a kmalloc and kfree.
      
      For higher IOPS, this saves up to 1% of CPU time.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
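The decision is a simple threshold test: if the IO needs at most NVME_INT_PAGES segments and NVME_INT_BYTES of data, its iod can live in the per-command pdu blk-mq already allocated. The constants come from the message above; the predicate's exact shape is an assumption for illustration:

```c
#include <stdbool.h>

#define NVME_INT_PAGES 2            /* from the commit message */
#define NVME_INT_BYTES (8 * 1024)   /* 8KB, assuming 4K pages */

/* True if the iod (sg list, prps, and related state) fits in the
 * per-command pdu, so no kmalloc on submit and no kfree on complete. */
bool iod_fits_inline(unsigned int nseg, unsigned int nbytes)
{
    return nseg <= NVME_INT_PAGES && nbytes <= NVME_INT_BYTES;
}
```

Small IOs dominate high-IOPS workloads, which is why skipping one allocator round trip per IO is worth about 1% of CPU.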
  5. 22 Jan 2015, 1 commit
  6. 16 Jan 2015, 1 commit
  7. 09 Jan 2015, 6 commits
  8. 03 Jan 2015, 1 commit
  9. 23 Dec 2014, 1 commit
  10. 12 Dec 2014, 2 commits
    • NVMe: fix race condition in nvme_submit_sync_cmd() · 849c6e77
      By Jens Axboe
      If we have a race between the schedule timing out and the command
      completing, we could have the task issuing the command exit
      nvme_submit_sync_cmd() while the irq is running sync_completion().
      If that happens, we could be corrupting memory, since the stack
      that held 'cmdinfo' is no longer valid.
      
      Fix this by always calling nvme_abort_cmd_info(). Once that call
      completes, we know that we have either run sync_completion() if
      the completion came in, or that we will never run it since we now
      have special_completion() as the command callback handler.
      Acked-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
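The essence of the fix is that exactly one side, the completing IRQ or the returning submitter, may touch the on-stack cmdinfo, and an atomic swap of the callback decides the winner. A sketch with C11 atomics (the claim protocol and all names are invented to mirror the sync_completion/special_completion handoff described above):

```c
#include <stdatomic.h>
#include <stdbool.h>

enum cb { CB_SYNC_COMPLETION, CB_SPECIAL_COMPLETION };

struct cmd_info {
    _Atomic int cb;      /* which completion handler will run */
    int         result;  /* lives on the submitter's stack */
};

/* IRQ path: only touch cmdinfo if sync_completion is still armed. */
bool irq_complete(struct cmd_info *info, int status)
{
    int expected = CB_SYNC_COMPLETION;

    if (atomic_compare_exchange_strong(&info->cb, &expected,
                                       CB_SPECIAL_COMPLETION)) {
        info->result = status;   /* safe: submitter is still waiting */
        return true;
    }
    return false;                /* lost the race; don't touch the stack */
}

/* Timeout path: before returning (and invalidating the stack), swap in
 * special_completion so a late IRQ can no longer write through info. */
bool abort_cmd_info(struct cmd_info *info)
{
    int expected = CB_SYNC_COMPLETION;

    return atomic_compare_exchange_strong(&info->cb, &expected,
                                          CB_SPECIAL_COMPLETION);
}
```

Whichever path loses the compare-exchange knows the other side owns (or owned) the cmdinfo and backs off, which is exactly why memory behind the vanished stack frame is never written.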
    • NVMe: fix retry/error logic in nvme_queue_rq() · fe54303e
      By Jens Axboe
      The logic around retrying and erroring IO in nvme_queue_rq() is broken
      in a few ways:
      
      - If we fail allocating dma memory for a discard, we return retry. We
        have the 'iod' stored in ->special, but we free the 'iod'.
      
      - For a normal request, if we fail dma mapping or setting up the prps, we
        have the same iod situation. Additionally, we haven't set the callback
        for the request yet, so we also potentially leak IOMMU resources.
      
      Get rid of the ->special 'iod' store. The retry is uncommon enough that
      it's not worth optimizing for or holding on to resources to attempt to
      speed it up. Additionally, it's usually best practice to free any
      request related resources when doing retries.
      Acked-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  11. 11 Dec 2014, 3 commits
  12. 04 Dec 2014, 1 commit
  13. 22 Nov 2014, 1 commit
  14. 21 Nov 2014, 2 commits
  15. 20 Nov 2014, 1 commit
  16. 18 Nov 2014, 3 commits
  17. 11 Nov 2014, 1 commit
  18. 06 Nov 2014, 1 commit
  19. 05 Nov 2014, 6 commits
    • NVMe: Convert to blk-mq · a4aea562
      By Matias Bjørling
      This converts the NVMe driver to a blk-mq request-based driver.
      
      The NVMe driver is currently bio-based and implements queue logic within
      itself.  By using blk-mq, a lot of these responsibilities can be moved
      and simplified.
      
      The patch is divided into the following blocks:
      
       * Per-command data and cmdid have been moved into the struct request
         field. The cmdid_data can be retrieved using blk_mq_rq_to_pdu() and id
         maintenance is now handled by blk-mq through the rq->tag field.
      
       * The logic for splitting bios has been moved into the blk-mq layer.
         The driver instead notifies the block layer about limited gap support in
         SG lists.
      
       * Timeout handling is moved to blk-mq and reimplemented within
         nvme_timeout(). This includes both abort handling and command
         cancellation.
      
       * Assignment of nvme queues to CPUs is replaced with the blk-mq
         version. The current blk-mq strategy is to assign the number of
         mapped queues and CPUs to provide synergy, while the nvme driver
         assigns as many nvme hw queues as possible. This can be implemented in
         blk-mq if needed.
      
       * NVMe queues are merged with the tags structure of blk-mq.
      
       * blk-mq takes care of setup/teardown of nvme queues and guards invalid
         accesses. Therefore, RCU-usage for nvme queues can be removed.
      
       * IO tracing and accounting are handled by blk-mq and therefore removed.
      
       * Queue suspension logic is replaced with the logic from the block
         layer.
      
      Contributions in this patch from:
      
        Sam Bradshaw <sbradshaw@micron.com>
        Jens Axboe <axboe@fb.com>
        Keith Busch <keith.busch@intel.com>
        Robert Nelson <rlnelson@google.com>
      Acked-by: Keith Busch <keith.busch@intel.com>
      Acked-by: Jens Axboe <axboe@fb.com>
      
      Updated for new ->queue_rq() prototype.
      Signed-off-by: Jens Axboe <axboe@fb.com>
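The per-command pdu mentioned in the first bullet is the key layout trick: blk-mq allocates the driver's per-command data directly behind each struct request, so blk_mq_rq_to_pdu() is nothing more than pointer arithmetic past the request. A userspace sketch with dummy structs (the real struct request and pdu contents are of course much larger):

```c
#include <stdlib.h>

/* Dummy stand-in for struct request; blk-mq ids live in rq->tag. */
struct request {
    int tag;
};

/* Invented per-command driver data for this sketch. */
struct nvme_cmd_pdu {
    int  opcode;
    char payload[64];
};

/* The pdu sits immediately after the request in one allocation,
 * which is all blk_mq_rq_to_pdu() relies on. */
static inline void *rq_to_pdu(struct request *rq)
{
    return rq + 1;
}

struct request *alloc_request(int tag, size_t pdu_size)
{
    struct request *rq = calloc(1, sizeof(*rq) + pdu_size);

    if (rq)
        rq->tag = tag;
    return rq;
}
```

One allocation per tag, made once at queue setup, replaces the driver's own cmdid bookkeeping and any per-IO allocation for command state.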
    • NVMe: Do not over allocate for discard requests · 9dbbfab7
      By Keith Busch
      Discard requests are often for very large ranges. The discard size is not
      representative of the data transfer size, so we don't need to allocate
      such a large prp list. This patch allocates only the memory needed for
      the data transfer, saving a little over 8k of memory per max discard
      request.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reported-by: Paul Grabinar <paul.grabinar@ranbarg.com>
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
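The arithmetic behind the saving: a discard may cover gigabytes of LBAs, but the command only transfers a small range descriptor, so the prp allocation should follow the bytes moved, not the range covered. A simplified sketch of the sizing (real NVMe carries the first PRP in the command itself, which this illustration ignores):

```c
#define PAGE_SIZE 4096u

/* Number of PRP entries needed for a transfer of the given length;
 * size the allocation by the data actually moved across the bus. */
unsigned prp_entries(unsigned xfer_bytes)
{
    return (xfer_bytes + PAGE_SIZE - 1) / PAGE_SIZE;
}
```

Sizing by the 16-byte descriptor instead of the worst-case data transfer is what recovers the roughly 8k per maximum-sized discard.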
    • NVMe: Do not open disks that are being deleted · 9e60352c
      By Keith Busch
      It is possible for the block layer to request opening a block device after
      the driver has deleted it. Subsequent releases would cause a double free,
      or leave the disk's private_data pointing to freed memory. This patch
      protects the driver's freed disks from being opened and accessed: the
      nvme namespaces are freed only when the device's refcount reaches 0, so at
      that moment there are no active openers and no more should be allowed,
      and it is safe to clear the disk's private_data that is about to be freed.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reported-by: Henry Chow <henry.chow@oracle.com>
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
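The guard can be pictured as an open path that checks private_data before taking a reference: once deletion has cleared it, every later open is refused instead of handing out a pointer into freed memory. A sketch with invented names:

```c
#include <errno.h>
#include <stddef.h>

struct nvme_ns {
    int   refcount;
    void *private_data;   /* cleared when the disk is being freed */
};

/* Refuse opens once private_data has been cleared: the namespace is
 * already on its way to being freed. */
int ns_open(struct nvme_ns *ns)
{
    if (ns == NULL || ns->private_data == NULL)
        return -ENXIO;
    ns->refcount++;
    return 0;
}

void ns_delete(struct nvme_ns *ns)
{
    ns->private_data = NULL;   /* no new openers from here on */
}
```

Because deletion only runs when the refcount is already 0, clearing the pointer and rejecting new opens closes the window without racing an existing opener.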
    • NVMe: Clear QUEUE_FLAG_STACKABLE · 5940c857
      By Keith Busch
      The nvme namespace request_queue's flags are initialized to
      QUEUE_FLAG_DEFAULT, which currently sets QUEUE_FLAG_STACKABLE. To the
      device-mapper, this flag means the block driver is request based, but
      this driver is bio-based and problems will occur if an nvme namespace is
      used with a request-based dm device. This patch clears the stackable
      flag.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • NVMe: Fix device probe waiting on kthread · 387caa5a
      By Keith Busch
      If we ever do parallel device probing, we need to wake up all processes
      waiting for the nvme kthread to start, not just one. This is currently
      serialized so the bug is not reachable today, but fix it anyway in the
      hope of implementing parallel or asynchronous probe in the future.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
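The distinction is wake-one versus wake-all: the kernel's wake_up() rouses a single waiter, while parallel probing needs every waiting process released, i.e. wake_up_all(). A pthread sketch of the same pattern, where pthread_cond_broadcast plays the role of wake_up_all (names invented):

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  started_cv = PTHREAD_COND_INITIALIZER;
static bool kthread_started;
static int  woken;

/* Each "probe" blocks until the shared kthread is up. */
static void *prober(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!kthread_started)
        pthread_cond_wait(&started_cv, &lock);
    woken++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int wake_all_probers(int n)
{
    pthread_t t[16];
    int i;

    kthread_started = false;
    woken = 0;
    for (i = 0; i < n && i < 16; i++)
        pthread_create(&t[i], NULL, prober, NULL);

    pthread_mutex_lock(&lock);
    kthread_started = true;
    pthread_cond_broadcast(&started_cv);  /* wake_up_all, not wake_up */
    pthread_mutex_unlock(&lock);

    for (i = 0; i < n && i < 16; i++)
        pthread_join(t[i], NULL);
    return woken;
}
```

Had this used pthread_cond_signal (the wake_up analogue), only one prober would be guaranteed to run; the rest could sleep forever since the condition is signaled exactly once.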
    • NVMe: Passthrough IOCTL for IO commands · 7963e521
      By Keith Busch
      The NVME_IOCTL_SUBMIT_IO only works for IO commands with block data
      transfers and isn't usable for other NVMe commands like flush,
      data set management, or any sort of vendor unique command. The
      NVME_IOCTL_ADMIN_CMD, however, can easily be modified to accept arbitrary
      IO commands in addition to arbitrary admin commands without breaking
      backward compatibility. This patch just adds a new IOCTL to distinguish
      if the driver should submit the command on an IO or Admin queue.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>