1. 05 Nov 2014, 3 commits
    • NVMe: Convert to blk-mq · a4aea562
      Committed by Matias Bjørling
      This converts the NVMe driver to a blk-mq request-based driver.
      
      The NVMe driver is currently bio-based and implements queue logic within
      itself.  By using blk-mq, a lot of these responsibilities can be moved
      and simplified.
      
      The patch is divided into the following blocks:
      
       * Per-command data and cmdid have been moved into the struct
         request's per-driver data. The cmdid_data can be retrieved using
         blk_mq_rq_to_pdu(), and id maintenance is now handled by blk-mq
         through the rq->tag field.
      
       * The logic for splitting bios has been moved into the blk-mq layer.
         The driver instead notifies the block layer about its limited gap
         support in SG lists.
      
       * Timeout handling has been moved to blk-mq and reimplemented within
         nvme_timeout(). This includes both abort handling and command
         cancellation.
      
       * Assignment of nvme queues to CPUs is replaced with the blk-mq
         version. The current blk-mq strategy is to assign the number of
         mapped queues and CPUs to provide synergy, while the nvme driver
         assigns as many nvme hw queues as possible. This can be implemented
         in blk-mq if needed.
      
       * NVMe queues are merged with the tags structure of blk-mq.
      
       * blk-mq takes care of setup/teardown of nvme queues and guards invalid
         accesses. Therefore, RCU-usage for nvme queues can be removed.
      
       * IO tracing and accounting are handled by blk-mq and therefore removed.
      
       * Queue suspension logic is replaced with the logic from the block
         layer.
      
      Contributions in this patch from:
      
        Sam Bradshaw <sbradshaw@micron.com>
        Jens Axboe <axboe@fb.com>
        Keith Busch <keith.busch@intel.com>
        Robert Nelson <rlnelson@google.com>
      Acked-by: Keith Busch <keith.busch@intel.com>
      Acked-by: Jens Axboe <axboe@fb.com>
      
      Updated for new ->queue_rq() prototype.
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • NVMe: Mismatched host/device page size support · 1d090624
      Committed by Keith Busch
      Adds support for devices whose maximum page size is smaller than the
      host's. When we encounter such a host/device combination, the driver
      will split a host page into as many PRP entries as necessary for the
      device's page size capabilities. If the device's reported minimum
      page size is greater than the host's, the driver will not attempt to
      enable the device and will return an error instead.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • NVMe: Async event request · 6fccf938
      Committed by Keith Busch
      Submits NVMe asynchronous event requests, as many as the controller
      maximum or the number of possible distinct event types (8), whichever
      is smaller. Events successfully returned by the controller are
      logged.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  2. 13 Jun 2014, 1 commit
    • NVMe: Fix hot cpu notification dead lock · f3db22fe
      Committed by Keith Busch
      There is a potential dead lock if a cpu event occurs during nvme
      probe, since probe registered with hot cpu notification. This fixes
      the race by having the module register for notification once, outside
      of probe, rather than having each device register.
      
      The actual work is done in a scheduled work queue instead of in the
      notifier since assigning IO queues has the potential to block if the
      driver creates additional queues.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
  3. 04 Jun 2014, 1 commit
  4. 05 May 2014, 3 commits
  5. 11 Apr 2014, 4 commits
  6. 24 Mar 2014, 2 commits
    • NVMe: IOCTL path RCU protect queue access · 4f5099af
      Committed by Keith Busch
      This adds rcu protected access to a queue in the nvme IOCTL path to
      fix potential races between a surprise removal and queue usage in
      nvme_submit_sync_cmd. The fix holds rcu_read_lock() here to prevent
      the nvme_queue from being freed while this path is executing. Since
      the path cannot sleep under the lock, it will no longer wait for an
      available command id should they all be in use at the time a
      passthrough IOCTL request is received.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
    • NVMe: RCU protected access to io queues · 5a92e700
      Committed by Keith Busch
      This adds rcu protected access to nvme_queue to fix a race between a
      surprise removal freeing the queue and a thread holding an open
      reference on an NVMe block device using that queue.
      
      The queues do not need to be rcu protected during the initialization
      or shutdown parts, so I've added a helper function for raw
      dereferencing to get around the sparse errors.
      
      There is still a hole in the IOCTL path for the same problem, which is
      fixed in a subsequent patch.
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
  7. 07 Mar 2014, 1 commit
    • nvme: don't use PREPARE_WORK · 9ca97374
      Committed by Tejun Heo
      PREPARE_[DELAYED_]WORK() are being phased out.  They have few users
      and a nasty surprise in terms of reentrancy guarantees, as the
      workqueue considers work items to be different if they don't have
      the same work function.
      
      nvme_dev->reset_work is multiplexed with multiple work functions.
      Introduce nvme_reset_workfn(), which invokes nvme_dev->reset_workfn,
      always use it as the work function, and update the users to set the
      ->reset_workfn field instead of overriding the work function using
      PREPARE_WORK().
      
      It would probably be best to route this with other related updates
      through the workqueue tree.
      
      Compile tested.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: linux-nvme@lists.infradead.org
  8. 28 Jan 2014, 2 commits
  9. 17 Dec 2013, 2 commits
  10. 19 Nov 2013, 1 commit
  11. 04 Sep 2013, 3 commits
  12. 21 Jun 2013, 1 commit
  13. 08 May 2013, 1 commit
  14. 03 May 2013, 2 commits
  15. 17 Apr 2013, 3 commits
  16. 29 Mar 2013, 1 commit
  17. 27 Mar 2013, 3 commits
  18. 13 Nov 2012, 1 commit
  19. 28 Jul 2012, 1 commit
  20. 27 Jul 2012, 1 commit
  21. 05 Nov 2011, 3 commits