Commits · 9e60352cf83faaba57f99f6960b545687b8bbb20 · openeuler / Kernel

05 Nov, 2014 22 commits

NVMe: Do not open disks that are being deleted · 9e60352c

Authored by Keith Busch on Oct 03, 2014

It is possible the block layer will request to open a block device after
the driver deleted it. Subsequent releases will cause a double free,
or the disk's private_data is pointing to freed memory. This patch
protects the driver's freed disks from being opened and accessed: the
nvme namespaces are freed only when the device's refcount is 0, so at
that moment there were no active openers and no more should be allowed,
and it is safe to clear the disk's private_data that is about to be freed.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reported-by: Henry Chow <henry.chow@oracle.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: Clear QUEUE_FLAG_STACKABLE · 5940c857

Authored by Keith Busch on Nov 04, 2014

The nvme namespace request_queue's flags are initialized to
QUEUE_FLAG_DEFAULT, which currently sets QUEUE_FLAG_STACKABLE. The
device-mapper indicates this flag means the block driver is request
based, though this driver is bio-based and problems will occur if an nvme
namespace is used with a request based dm device. This patch clears the
stackable flag.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: Fix device probe waiting on kthread · 387caa5a

Authored by Keith Busch on Sep 22, 2014

If we ever do parallel device probing, we need to wake up all processes
waiting for nvme kthread to start, not just one. This is currently
serialized so the bug is not reachable today, but fixing this anyway in
the hopes we implement parallel or asynchronous probe in the future.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: Passthrough IOCTL for IO commands · 7963e521

Authored by Keith Busch on Sep 12, 2014

The NVME_IOCTL_SUBMIT_IO only works for IO commands with block data
transfers and isn't usable for other NVMe commands like flush,
data set management, or any sort of vendor unique command. The
NVME_IOCTL_ADMIN_CMD, however, can easily be modified to accept arbitrary
IO commands in addition to arbitrary admin commands without breaking
backward compatibility. This patch just adds a new IOCTL to distinguish
if the driver should submit the command on an IO or Admin queue.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: Add revalidate_disk callback · 1b9dbf7f

Authored by Keith Busch on Sep 10, 2014

This adds a callback to revalidate the disk and change its block size
and capacity if needed. Before, a user would have to remove + rescan
an entire device if they changed the logical block size using an NVMe
Format or other vendor specific command; now they can just run something
that issues the BLKRRPART IOCTL, like

 # hdparm -z /dev/nvmeXnY

This can also be used in response to the 1.2 Spec's Namespace Attribute
Change asynchronous event.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
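The BLKRRPART request that `hdparm -z` issues can also be sent from a small C program. What follows is a minimal userspace sketch under stated assumptions, not code from the patch: the function name `sketch_reread_partitions` is hypothetical, and the caller is assumed to have permission to open the block device.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* BLKRRPART */
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to re-read the partition table of
 * the block device at `path`, which also triggers disk revalidation.
 * Returns 0 on success or a negative errno value on failure. */
int sketch_reread_partitions(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;

    /* Capture errno before close() has a chance to overwrite it. */
    int ret = ioctl(fd, BLKRRPART) < 0 ? -errno : 0;

    close(fd);
    return ret;
}
```

A caller would pass a namespace node such as /dev/nvmeXnY, with X and Y filled in for the device in question.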

NVMe: Fix nvmeq waitqueue entry initialization · 7be50e93

Authored by Keith Busch on Sep 10, 2014

We need to update the nvme queue's wait_queue_t entry during each
initialization since the nvme_thread may be ended and restarted when
the device is reset. If a device reset occurs during a large amount
of buffered IO, it would take a lot longer to complete the outstanding
requests due to the 1 second polling instead of waking up as completions
occur.

Fixes: b9afca3e
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: Translate NVMe status to errno · b4ff9c8d

Authored by Keith Busch on Aug 29, 2014

This returns a more appropriate error for the "capacity exceeded"
status. In case other NVMe statuses have a better errno, this patch adds
a convenience function to translate an NVMe status code to an errno for
IO commands, defaulting to the current -EIO.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
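A status-to-errno helper of the kind this message describes might look roughly like the sketch below. It is an illustration only: the `SKETCH_*` constants and the function name are defined locally and are assumptions, as is the choice of -ENOSPC for "capacity exceeded", which the message does not name.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed values for this sketch, defined locally rather than taken from
 * the driver's headers. */
#define SKETCH_SC_SUCCESS      0x000
#define SKETCH_SC_CAP_EXCEEDED 0x081
#define SKETCH_SC_MASK         0x7ff  /* assumed: keep code and code type only */

/* Translate an NVMe completion status into 0 or a negative errno. */
int sketch_nvme_status_to_errno(uint16_t status)
{
    switch (status & SKETCH_SC_MASK) {
    case SKETCH_SC_SUCCESS:
        return 0;
    case SKETCH_SC_CAP_EXCEEDED:
        return -ENOSPC;  /* assumed errno for "capacity exceeded" */
    default:
        return -EIO;     /* the default the message calls "the current -EIO" */
    }
}
```

Keeping the mapping in one switch means a better errno for another status is a one-line addition.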

NVMe: Remove duplicate compat SG_IO code · e179729a

Authored by Keith Busch on Aug 27, 2014

We can return -ENOIOCTLCMD and the ioctl will be handled by
fs/compat_ioctl.c instead. This removes a lot of duplicate code in the
nvme driver.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: Reference count pci device · a96d4f5c

Authored by Keith Busch on Aug 19, 2014

If an nvme device is removed but user space has an open reference,
the nvme driver would have been holding an invalid reference to its pci
device. You may get a general protection fault on x86 h/w when the driver
uses that reference in dma_map_sg(), as is done in nvme_map_user_pages()
from the IOCTL interface.

This patch fixes the fault by taking a reference on the pci device and
holding it even after device removal until all opens on the nvme device
are closed.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reported-by: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

nvme: Replace rcu_assign_pointer() with RCU_INIT_POINTER() · 062261be

Authored by Andreea-Cristina Bernat on Aug 18, 2014

The use of "rcu_assign_pointer()" is NULLing out the pointer.
According to RCU_INIT_POINTER()'s block comment:
"1.   This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.

The following Coccinelle semantic patch was used:
@@
@@

- rcu_assign_pointer
+ RCU_INIT_POINTER
  (..., NULL)
Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: Correctly handle IOCTL_SUBMIT_IO when cpus > online queues · 59055356

Authored by Sam Bradshaw on Jul 29, 2014

nvme_submit_io_cmd() uses smp_processor_id() to pick an IO queue index.
This patch fixes the case where there are more cpus from which the ioctl
call can originate than online queues, which can happen when a device
supports or was allocated fewer interrupt vectors than exist cpu cores.

Thanks to Keith Busch for the implementation suggestion.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
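The message does not say how the patch picks a queue when CPUs outnumber online queues. The sketch below only illustrates the problem, using a simple modulo mapping as a stand-in. The helper name and the convention that IO queue ids start at 1 (with 0 reserved for the admin queue) are assumptions.

```c
#include <assert.h>

/* Hypothetical helper: map a CPU number to a 1-based IO queue id so that
 * every CPU lands on an online queue, even when there are fewer queues
 * than CPUs. Queue 0 is assumed to be the admin queue. */
unsigned int sketch_cpu_to_io_queue(unsigned int cpu,
                                    unsigned int online_io_queues)
{
    assert(online_io_queues > 0);
    return (cpu % online_io_queues) + 1;
}
```

With 2 online queues, CPU 7 maps to queue 2 instead of indexing past the end of the queue array.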

NVMe: Fix filesystem sync deadlock on removal · 302c6727

Authored by Keith Busch on Jul 18, 2014

This changes the order of deleting the gendisks so it happens after the
nvme IO queues are freed. If a device is removed while a filesystem has
associated dirty data, the removal will wait on these to complete before
proceeding from del_gendisk, which could have caused deadlock before.

The implication of this is that an orderly removal of a responsive
device won't necessarily wait for dirty data to be written, but we are
not guaranteed the device is even going to respond at this point either.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: Call nvme_free_queue directly · f435c282

Authored by Keith Busch on Jul 07, 2014

Rather than relying on call_rcu, this patch directly frees the
nvme_queue's memory after ensuring no readers exist. Some arch specific
dma_free_coherent implementations may not be called from a call_rcu's
soft interrupt context, hence the change.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reported-by: Matthew Minter <matthew_minter@xyratex.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: Add shutdown timeout as module parameter. · 2484f407

Authored by Dan McLeran on Jul 01, 2014

The current implementation hard-codes the shutdown timeout to 2 seconds.
Some devices take longer than this to complete a normal shutdown.
Changing the shutdown timeout to a module parameter with a default
timeout of 5 seconds.
Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: Skip orderly shutdown on failed devices · 7c1b2450

Authored by Keith Busch on Jun 25, 2014

Rather than skipping shutdown only for devices that have been removed,
skip the orderly shutdown on failed devices to avoid the long timeout
handling that inevitably happens when deleting queues on such a device.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: Whitespace fixes · a6739479

Authored by Keith Busch on Jun 23, 2014

Fixing tabs inadvertently converted to spaces.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: Use pci_stop_and_remove_bus_device_locked() · c81f4975

Authored by Keith Busch on Jun 23, 2014

Race conditions are theoretically possible between the NVMe PCI device
removal and the generic PCI bus rescan and device removal that can be
triggered via sysfs.

To avoid those race conditions make the NVMe code use
pci_stop_and_remove_bus_device_locked().
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: Handling devices incapable of I/O · badc34d4

Authored by Keith Busch on Jun 23, 2014

This is a minor refactor for handling devices that are incapable of IO.
The driver previously used special error codes to know that IO queues
are unavailable, but we have an online queue count now.

This also fixes an issue where the driver successfully sets the queue
count, but either is unable to allocate an IO queue or the device can't
create one for some reason.

If the driver can successfully enable the device and get responses to
admin commands, the driver will bring up a character device for management
but not create block devices.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: Change nvme_enable_ctrl to set EN and manage CC thru ctrl_config. · 01079522

Authored by Dan McLeran on Jun 23, 2014

Change the behavior of nvme_enable_ctrl to set EN.
Clear CC.SH for both nvme_enable_ctrl and nvme_disable_ctrl.
Remove reading of the CC register and manage the state in
dev->ctrl_config.
Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
[removed an unwanted write to CC]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

NVMe: Mismatched host/device page size support · 1d090624

Authored by Keith Busch on Jun 23, 2014

Adds support for devices with max page size smaller than the host's.
In the case we encounter such a host/device combination, the driver will
split a page into as many PRP entries as necessary for the device's page
size capabilities. If the device's reported minimum page size is greater
than the host's, the driver will not attempt to enable the device and
return an error instead.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
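The page-size rule in this message can be modelled in a few lines of userspace C. This is a sketch, not the driver's code: the function name is hypothetical, page sizes are assumed to be powers of two, and picking the largest device page size that fits in a host page is an assumption about how the split is chosen.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of device-page-sized PRP entries needed to
 * describe one host page. Returns -1 when the device's minimum page size
 * is larger than the host page, the case in which the device is not
 * enabled. All sizes are in bytes and assumed to be powers of two. */
int sketch_prps_per_host_page(size_t host_page, size_t dev_min_page,
                              size_t dev_max_page)
{
    assert(host_page > 0 && dev_min_page > 0);
    assert(dev_min_page <= dev_max_page);

    if (dev_min_page > host_page)
        return -1;

    /* Assumed policy: use the largest device page that fits a host page. */
    size_t dev_page = dev_max_page < host_page ? dev_max_page : host_page;

    return (int)(host_page / dev_page);
}
```

For example, a 64 KiB host page on a device limited to 4 KiB pages needs 16 PRP entries per host page.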

NVMe: Async event request · 6fccf938

Authored by Keith Busch on Jun 18, 2014

Submits NVMe asynchronous event requests, one event up to the controller
maximum or number of possible different event types (8), whichever is
smaller. Events successfully returned by the controller are logged.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
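The "whichever is smaller" rule reduces to a single min(). In the sketch below the helper name is hypothetical, and treating the controller's reported limit (`aerl`) as a zero-based value is an assumption.

```c
/* Number of distinct event types, as stated in the commit message. */
#define SKETCH_MAX_EVENT_TYPES 8u

/* Hypothetical helper: how many asynchronous event requests to keep
 * outstanding. `aerl` is the controller's reported limit, assumed here
 * to be zero-based (0 means one request). */
unsigned int sketch_event_limit(unsigned int aerl)
{
    unsigned int ctrl_max = aerl + 1;

    return ctrl_max < SKETCH_MAX_EVENT_TYPES ? ctrl_max
                                             : SKETCH_MAX_EVENT_TYPES;
}
```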

block: Use dma_zalloc_coherent · 4d51abf9

Authored by Joe Perches on Jun 15, 2014

Use the zeroing function instead of dma_alloc_coherent & memset(,0,)
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

05 Oct, 2014 1 commit

block: disable entropy contributions for nonrot devices · b277da0a

Authored by Mike Snitzer on Oct 04, 2014

Clear QUEUE_FLAG_ADD_RANDOM in all block drivers that set
QUEUE_FLAG_NONROT.

Historically, all block devices have automatically made entropy
contributions.  But as previously stated in commit e2e1a148 ("block: add
sysfs knob for turning off disk entropy contributions"):
    - On SSD disks, the completion times aren't as random as they
      are for rotational drives. So it's questionable whether they
      should contribute to the random pool in the first place.
    - Calling add_disk_randomness() has a lot of overhead.

There are more reliable sources for randomness than non-rotational block
devices.  From a security perspective it is better to err on the side of
caution than to allow entropy contributions from unreliable "random"
sources.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

13 Jun, 2014 1 commit

NVMe: Fix hot cpu notification dead lock · f3db22fe

Authored by Keith Busch on Jun 11, 2014

There is a potential deadlock if a cpu event occurs during nvme probe
since it registered with hot cpu notification. This fixes the race by
having the module register with notification outside of probe rather
than have each device register.

The actual work is done in a scheduled work queue instead of in the
notifier since assigning IO queues has the potential to block if the
driver creates additional queues.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

04 Jun, 2014 6 commits

NVMe: Rename io_timeout to nvme_io_timeout · bd67608a

Authored by Matthew Wilcox on Jun 03, 2014

It's positively immoral to have a global variable called 'io_timeout'.
Keep the module parameter called io_timeout, though.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

NVMe: Adhere to request queue block accounting enable/disable · b4e75cbf

Authored by Sam Bradshaw on May 09, 2014

Recently, a new sysfs control "iostats" was added to selectively
enable or disable io statistics collection for request queues.  This
patch hooks that control.

IO statistics collection is rather expensive on large, multi-node
machines with drives pushing millions of iops.  Having the ability to
disable collection if not needed can improve throughput significantly.

As a data point, on a quad E5-4640, I see more than 50% throughput
improvement when io statistics accounting is disabled during heavily
multi-threaded small block random read benchmarks where device
performance is in the million iops+ range.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

NVMe: Fix nvme get/put queue semantics · a51afb54

Authored by Keith Busch on May 13, 2014

The routines to get and lock nvme queues required the caller to "put"
or "unlock" them even if getting one returned NULL. This patch fixes that.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

NVMe: Make admin timeout a module parameter · 9d43cf64

Authored by Keith Busch on May 13, 2014

Signed-off-by: Keith Busch <keith.busch@intel.com>
[made admin_timeout static]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

NVMe: Make iod bio timeout a parameter · 61e4ce08

Authored by Keith Busch on May 13, 2014

This was originally set to 4 times the IO timeout, but that was when
the IO timeout was 5 seconds instead of 30. 20 seconds for total time
to failure seemed more reasonable than 2 minutes for most, but other
users have requested to make this a module parameter instead.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[renamed the module parameter to retry_time]
[made retry_time static]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

NVMe: Prevent possible NULL pointer dereference · 6808c5fb

Authored by Santosh Y on May 29, 2014

kmalloc() used by the nvme_alloc_iod() to allocate memory for 'iod'
can fail. So check the return value.
Signed-off-by: Santosh Y <santosh.sy@samsung.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

28 May, 2014 1 commit

NVMe: Implement PCIe reset notification callback · f0d54a54

Authored by Keith Busch on May 02, 2014

Quiesce and shut down the device prior to reset, then restart the device and
resume IO after.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

10 May, 2014 1 commit

NVMe: Enable BUILD_BUG_ON checks · 21bd78bc

Authored by Matthew Wilcox on May 09, 2014

Since _nvme_check_size() wasn't being called from anywhere, the compiler
was optimising it away ... along with all the link-time build failures
that would result if any of the structures were the wrong size. Call it
from nvme_exit() for no particular reason.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
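A structure-size check of this kind can be approximated in userspace as sketched below. The macro is a local stand-in that uses the negative-array-size trick, so it fails at compile time. It does not reproduce the link-time failure mode the message mentions, which can disappear when the enclosing function is never called and is discarded. The structure and all `sketch_` names are hypothetical.

```c
#include <stdint.h>

/* Userspace stand-in for BUILD_BUG_ON(): a negative array size makes the
 * compiler reject the translation unit when `cond` is true. */
#define SKETCH_BUILD_BUG_ON(cond) ((void)sizeof(char[1 - 2 * !!(cond)]))

/* Hypothetical 16-byte structure standing in for a hardware-defined
 * layout whose size must never change. */
struct sketch_completion {
    uint32_t result;
    uint32_t reserved;
    uint16_t sq_head;
    uint16_t sq_id;
    uint16_t command_id;
    uint16_t status;
};

/* Collects the size checks in one place, mirroring the idea of a
 * dedicated check function that some other code path calls. */
void sketch_check_size(void)
{
    SKETCH_BUILD_BUG_ON(sizeof(struct sketch_completion) != 16);
}
```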

05 May, 2014 6 commits

NVMe: Flush with data support · 53562be7

Authored by Keith Busch on Apr 29, 2014

It is possible a filesystem may send a flush flagged bio with write
data. There is no such composite NVMe command, so the driver sends flush
and write separately.

The device is allowed to execute these commands in any order, so it was
possible the driver ends the bio after the write completes, but while the
flush is still active. We don't want to let a filesystem believe flush
succeeded before it really has; this could cause data corruption on a
power loss between these events. To fix, this patch splits the flush
and write into chained bios.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

NVMe: Configure support for block flush · a7d2ce28

Authored by Keith Busch on Apr 29, 2014

This configures an nvme request_queue as flush capable if the device
has a volatile write cache present.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

NVMe: Add tracepoints · 3291fa57

Authored by Keith Busch on Apr 28, 2014

Adding tracepoints for bio_complete and block_split into nvme to help
with gathering IO info using blktrace and blkparse.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

NVMe: Protect against badly formatted CQEs · 94bbac40

Authored by Keith Busch on Apr 24, 2014

If a misbehaving device posts a CQE with a command id < depth but for
one that was never allocated, the command info will have a callback
function set to NULL and we don't want to try invoking that.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
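The guard this message describes amounts to two checks before a completion callback is invoked. The sketch below is a simplified userspace model with hypothetical types and names, not the driver's data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*sketch_completion_fn)(void *ctx);

/* Hypothetical per-command bookkeeping: `fn` stays NULL for a slot that
 * was never allocated. */
struct sketch_cmd_info {
    sketch_completion_fn fn;
    void *ctx;
};

/* Example callback for callers of this sketch: counts completions. */
void sketch_count_completion(void *ctx)
{
    ++*(int *)ctx;
}

/* Returns 1 if the callback ran, 0 if the CQE was rejected as malformed. */
int sketch_handle_cqe(struct sketch_cmd_info *info, uint16_t depth,
                      uint16_t command_id)
{
    assert(info != NULL);

    if (command_id >= depth)
        return 0;   /* id outside the queue */
    if (info[command_id].fn == NULL)
        return 0;   /* id in range, but the slot was never allocated */

    info[command_id].fn(info[command_id].ctx);
    return 1;
}
```

The second check is the case the commit addresses: the id is in range, so a bounds check alone would not catch it.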

NVMe: Improve error messages · 27e8166c

Authored by Matthew Wilcox on Apr 11, 2014

Help people diagnose what is going wrong at initialisation time by
printing out which command has gone wrong and what the device returned.
Also fix the error message printed while waiting for reset.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>

NVMe: Update copyright headers · 8757ad65

Authored by Matthew Wilcox on Apr 11, 2014

Make the copyright dates accurate and remove the final paragraph that
includes the address of the FSF.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

11 Apr, 2014 2 commits

NVMe: Retry failed commands with non-fatal errors · edd10d33

Authored by Keith Busch on Apr 03, 2014

For commands returned with failed status, queue these for resubmission
and continue retrying them until success or for a limited amount of
time. The final timeout was arbitrarily chosen so requests can't be
retried indefinitely.

Since these are requeued on the nvmeq that submitted the command, the
callbacks have to take an nvmeq instead of an nvme_dev as a parameter
so that we can use the locked queue to append the iod to retry later.

The nvme_iod conveniently can be used to track how long we've been trying
to successfully complete an iod request. The nvme_iod also provides the
nvme prp dma mappings, so I had to move a few things around so we can
keep those mappings.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[fixed checkpatch issue with long line]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
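The retry policy described here (retry non-fatal failures, but only for a limited time) can be sketched as a single predicate. The function name, the `fatal` flag, and the tick-based time arguments are assumptions made for illustration; the message does not give the actual condition or timeout value.

```c
#include <stdbool.h>

/* Hypothetical predicate: should a failed command be queued for another
 * attempt? `started` and `now` are timestamps in the same unit (for
 * example ticks), and `retry_window` is the total time allowed. Unsigned
 * subtraction keeps the comparison correct if the counter wraps. */
bool sketch_should_retry(bool fatal, unsigned long started,
                         unsigned long now, unsigned long retry_window)
{
    if (fatal)
        return false;   /* fatal errors are never retried */

    return (now - started) < retry_window;
}
```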

NVMe: Add getgeo to block ops · 4cc09e2d

Authored by Keith Busch on Apr 02, 2014

Some programs require HDIO_GETGEO to work, which requires that we implement
getgeo.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

openeuler / Kernel synced successfully over 1 year ago