1. 22 Aug 2020, 1 commit
    • nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth · 7442ddce
      Authored by John Garry
      Recently nvme_dev.q_depth was changed from an int to a u16 type.
      
      This falls over for the queue depth calculation in nvme_pci_enable(),
      where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow as a u16, since
      NVME_CAP_MQES() is also a 16-bit field. That happens for me, and this
      is the result:
      
      root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer
      dereference at virtual address 0000000000000010
      Mem abort info:
      ESR = 0x96000004
      EC = 0x25: DABT (current EL), IL = 32 bits
      SET = 0, FnV = 0
      EA = 0, S1PTW = 0
      Data abort info:
      ISV = 0, ISS = 0x00000004
      CM = 0, WnR = 0
      user pgtable: 4k pages, 48-bit VAs, pgdp=00000a27bf3c9000
      [0000000000000010] pgd=0000000000000000, p4d=0000000000000000
      Internal error: Oops: 96000004 [#1] PREEMPT SMP
      Modules linked in: nvme nvme_core
      CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted
      5.8.0-next-20200812 #27
      Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
      V1.16.01 03/15/2019
      Workqueue: nvme-reset-wq nvme_reset_work [nvme]
      pstate: 80c00009 (Nzcv daif +PAN +UAO BTYPE=--)
      pc : __sg_alloc_table_from_pages+0xec/0x238
      lr : __sg_alloc_table_from_pages+0xc8/0x238
      sp : ffff800013ccbad0
      x29: ffff800013ccbad0 x28: ffff0a27b3d380a8
      x27: 0000000000000000 x26: 0000000000002dc2
      x25: 0000000000000dc0 x24: 0000000000000000
      x23: 0000000000000000 x22: ffff800013ccbbe8
      x21: 0000000000000010 x20: 0000000000000000
      x19: 00000000fffff000 x18: ffffffffffffffff
      x17: 00000000000000c0 x16: fffffe289eaf6380
      x15: ffff800011b59948 x14: ffff002bc8fe98f8
      x13: ff00000000000000 x12: ffff8000114ca000
      x11: 0000000000000000 x10: ffffffffffffffff
      x9 : ffffffffffffffc0 x8 : ffff0a27b5f9b6a0
      x7 : 0000000000000000 x6 : 0000000000000001
      x5 : ffff0a27b5f9b680 x4 : 0000000000000000
      x3 : ffff0a27b5f9b680 x2 : 0000000000000000
       x1 : 0000000000000001 x0 : 0000000000000000
       Call trace:
      __sg_alloc_table_from_pages+0xec/0x238
      sg_alloc_table_from_pages+0x18/0x28
      iommu_dma_alloc+0x474/0x678
      dma_alloc_attrs+0xd8/0xf0
      nvme_alloc_queue+0x114/0x160 [nvme]
      nvme_reset_work+0xb34/0x14b4 [nvme]
      process_one_work+0x1e8/0x360
      worker_thread+0x44/0x478
      kthread+0x150/0x158
      ret_from_fork+0x10/0x34
       Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5)
      ---[ end trace 89bb2b72d59bf925 ]---
      
      Fix by making it a u32.
      
      Also use u32 for nvme_queue.q_depth, as we assign this value from
      nvme_dev.q_depth, and nvme_dev.q_depth will possibly hold 65536 - this
      avoids the same crash as above.
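
      As a standalone illustration (not the driver code itself), the sketch
      below mirrors the kernel's NVME_CAP_MQES() definition, which masks the
      low 16 bits of the CAP register; with MQES at its maximum of 0xffff,
      MQES + 1 truncates to 0 in a u16 but survives in a u32:

      #include <stdint.h>
      #include <stdio.h>

      /* Mirrors the kernel macro: MQES is bits 15:0 of the CAP register. */
      #define NVME_CAP_MQES(cap) ((cap) & 0xffff)

      int main(void)
      {
              uint64_t cap = 0xffff;   /* controller reports MQES = 65535 */

              uint16_t depth16 = NVME_CAP_MQES(cap) + 1; /* wraps to 0 */
              uint32_t depth32 = NVME_CAP_MQES(cap) + 1; /* 65536, as intended */

              printf("u16: %u, u32: %u\n", depth16, depth32);
              return 0;
      }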
      
      Fixes: 61f3b896 ("nvme-pci: use unsigned for io queue depth")
      Signed-off-by: John Garry <john.garry@huawei.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 29 Jul 2020, 4 commits
  3. 26 Jul 2020, 1 commit
  4. 08 Jul 2020, 9 commits
  5. 25 Jun 2020, 2 commits
  6. 24 Jun 2020, 1 commit
  7. 11 Jun 2020, 1 commit
  8. 28 May 2020, 1 commit
    • nvme-pci: avoid race between nvme_reap_pending_cqes() and nvme_poll() · 9210c075
      Authored by Dongli Zhang
      There may be a race between nvme_reap_pending_cqes() and nvme_poll(), e.g.,
      when doing live reset while polling the nvme device.
      
            CPU X                        CPU Y
                                     nvme_poll()
      nvme_dev_disable()
      -> nvme_stop_queues()
      -> nvme_suspend_io_queues()
      -> nvme_suspend_queue()
                                     -> spin_lock(&nvmeq->cq_poll_lock);
      -> nvme_reap_pending_cqes()
         -> nvme_process_cq()        -> nvme_process_cq()
      
      In the above scenario, the nvme_process_cq() for the same queue may be
      running on both CPU X and CPU Y concurrently.
      
      It is much easier to reproduce the issue when CONFIG_PREEMPT is
      enabled in the kernel. When CONFIG_PREEMPT is disabled, it takes
      longer for nvme_stop_queues() -> blk_mq_quiesce_queue() to wait out
      the grace period, which narrows the race window.
      
      This patch protects nvme_process_cq() with nvmeq->cq_poll_lock in
      nvme_reap_pending_cqes().
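
      A sketch of the resulting function, assuming the nvme-pci structures
      of that kernel (dev->queues[] entries carry a cq_poll_lock spinlock):
      taking cq_poll_lock here serializes against nvme_poll(), which holds
      the same lock around its own nvme_process_cq() call:

      static void nvme_reap_pending_cqes(struct nvme_dev *dev)
      {
              int i;

              /* Reap each IO queue under the lock nvme_poll() also takes. */
              for (i = dev->ctrl.queue_count - 1; i > 0; i--) {
                      spin_lock(&dev->queues[i].cq_poll_lock);
                      nvme_process_cq(&dev->queues[i]);
                      spin_unlock(&dev->queues[i].cq_poll_lock);
              }
      }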
      
      Fixes: fa46c6fb ("nvme/pci: move cqe check after device shutdown")
      Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  9. 27 May 2020, 2 commits
    • nvme: introduce max_integrity_segments ctrl attribute · 95093350
      Authored by Max Gurtovoy
      This patch doesn't change any logic; it is preparation for adding PI
      support to fabrics drivers that will use an extended LBA format for
      metadata and will support more than one integrity segment.
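
      A hedged sketch of the plumbing this enables (field placement and
      call sites are illustrative, not the exact diff): the controller
      gains a max_integrity_segments attribute, nvme-pci keeps it at 1,
      and the core hands the value to the block layer's integrity limit:

      /* New controller-wide attribute on struct nvme_ctrl. */
      struct nvme_ctrl {
              /* ... existing fields ... */
              u32 max_integrity_segments;
      };

      /* nvme-pci: PRP-based metadata still allows a single segment. */
      dev->ctrl.max_integrity_segments = 1;

      /* nvme core: propagate the limit when setting up integrity. */
      blk_queue_max_integrity_segments(disk->queue,
                                       ctrl->max_integrity_segments);
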
      Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
      Signed-off-by: Israel Rukshin <israelr@mellanox.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-pci: make sure write/poll_queues are less than or equal to cpu count · 9c9e76d5
      Authored by Weiping Zhang
      Check the write_queues and poll_queues module parameters before using
      them, to reject values larger than the CPU count (see the sketch
      after the trace below).
      
      Reproducer:
      
      modprobe -r nvme
      modprobe nvme write_queues=`nproc`
      echo $((`nproc`+1)) > /sys/module/nvme/parameters/write_queues
      echo 1 > /sys/block/nvme0n1/device/reset_controller
      
      [  657.069000] ------------[ cut here ]------------
      [  657.069022] WARNING: CPU: 10 PID: 1163 at kernel/irq/affinity.c:390 irq_create_affinity_masks+0x47c/0x4a0
      [  657.069056]  dm_region_hash dm_log dm_mod
      [  657.069059] CPU: 10 PID: 1163 Comm: kworker/u193:9 Kdump: loaded Tainted: G        W         5.6.0+ #8
      [  657.069060] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
      [  657.069064] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
      [  657.069066] RIP: 0010:irq_create_affinity_masks+0x47c/0x4a0
      [  657.069067] Code: fe ff ff 48 c7 c0 b0 89 14 95 48 89 46 20 e9 e9 fb ff ff 31 c0 e9 90 fc ff ff 0f 0b 48 c7 44 24 08 00 00 00 00 e9 e9 fc ff ff <0f> 0b e9 87 fe ff ff 48 8b 7c 24 28 e8 33 a0 80 00 e9 b6 fc ff ff
      [  657.069068] RSP: 0018:ffffb505ce1ffc78 EFLAGS: 00010202
      [  657.069069] RAX: 0000000000000060 RBX: ffff9b97921fe5c0 RCX: 0000000000000000
      [  657.069069] RDX: ffff9b67bad80000 RSI: 00000000ffffffa0 RDI: 0000000000000000
      [  657.069070] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff9b97921fe718
      [  657.069070] R10: ffff9b97921fe710 R11: 0000000000000001 R12: 0000000000000064
      [  657.069070] R13: 0000000000000060 R14: 0000000000000000 R15: 0000000000000001
      [  657.069071] FS:  0000000000000000(0000) GS:ffff9b67c0880000(0000) knlGS:0000000000000000
      [  657.069072] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  657.069072] CR2: 0000559eac6fc238 CR3: 000000057860a002 CR4: 00000000007606e0
      [  657.069073] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  657.069073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  657.069073] PKRU: 55555554
      [  657.069074] Call Trace:
      [  657.069080]  __pci_enable_msix_range+0x233/0x5a0
      [  657.069085]  ? kernfs_put+0xec/0x190
      [  657.069086]  pci_alloc_irq_vectors_affinity+0xbb/0x130
      [  657.069089]  nvme_reset_work+0x6e6/0xeab [nvme]
      [  657.069093]  ? __switch_to_asm+0x34/0x70
      [  657.069094]  ? __switch_to_asm+0x40/0x70
      [  657.069095]  ? nvme_irq_check+0x30/0x30 [nvme]
      [  657.069098]  process_one_work+0x1a7/0x370
      [  657.069101]  worker_thread+0x1c9/0x380
      [  657.069102]  ? max_active_store+0x80/0x80
      [  657.069103]  kthread+0x112/0x130
      [  657.069104]  ? __kthread_parkme+0x70/0x70
      [  657.069105]  ret_from_fork+0x35/0x40
      [  657.069106] ---[ end trace f4f06b7d24513d06 ]---
      [  657.077110] nvme nvme0: 95/1/0 default/read/poll queues
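
      A sketch of the kind of bounds check the patch adds (helper names are
      illustrative): a kernel_param_ops .set hook that rejects values above
      num_possible_cpus(), so both the modprobe option and the sysfs write
      above fail with -EINVAL instead of feeding an oversized count to
      irq_create_affinity_masks():

      static int io_queue_count_set(const char *val,
                                    const struct kernel_param *kp)
      {
              unsigned int n;
              int ret;

              ret = kstrtouint(val, 10, &n);
              if (ret != 0 || n > num_possible_cpus())
                      return -EINVAL;

              return param_set_uint(val, kp);
      }

      static const struct kernel_param_ops io_queue_count_ops = {
              .set = io_queue_count_set,
              .get = param_get_uint,
      };

      /* Hook the validating ops up to the existing parameters. */
      module_param_cb(write_queues, &io_queue_count_ops, &write_queues, 0644);
      module_param_cb(poll_queues, &io_queue_count_ops, &poll_queues, 0644);
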
      Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  10. 13 May 2020, 1 commit
  11. 10 May 2020, 4 commits
  12. 26 Mar 2020, 8 commits
  13. 28 Feb 2020, 1 commit
  14. 19 Feb 2020, 2 commits
  15. 15 Feb 2020, 1 commit
  16. 04 Feb 2020, 1 commit