Commits · 8446546cc21819478542e9d728bce85e39898fc5 · openeuler / Kernel

24 Jun, 2020 · 2 commits

nvme-rdma: factor out a nvme_rdma_end_request helper · 8446546c

Committed by Christoph Hellwig on Jun 11, 2020

Factor a small snippet of duplicated code into a new helper in
preparation for making this snippet a little bit less trivial.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: move failure injection out of blk_mq_complete_request · 15f73f5b

Committed by Christoph Hellwig on Jun 11, 2020

Move the call to blk_should_fake_timeout out of blk_mq_complete_request
and into the drivers, skipping call sites that are obvious error
handlers, and remove the now superfluous blk_mq_force_complete_rq helper.
This ensures we don't keep injecting errors into completions that just
terminate the Linux request after the hardware has been reset or the
command has been aborted.
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

14 Jun, 2020 · 1 commit

treewide: replace '---help---' in Kconfig files with 'help' · a7f7f624

Committed by Masahiro Yamada on Jun 14, 2020

Since commit 84af7a61 ("checkpatch: kconfig: prefer 'help' over
'---help---'"), the number of '---help---' has been gradually
decreasing, but there are still more than 2400 instances.

This commit finishes the conversion. While I touched the lines,
I also fixed the indentation.

There are a variety of indentation styles found.

  a) 4 spaces + '---help---'
  b) 7 spaces + '---help---'
  c) 8 spaces + '---help---'
  d) 1 space + 1 tab + '---help---'
  e) 1 tab + '---help---'    (correct indentation)
  f) 1 tab + 1 space + '---help---'
  g) 1 tab + 2 spaces + '---help---'

In order to convert all of them to 1 tab + 'help', I ran the
following command:

  $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>

11 Jun, 2020 · 4 commits

nvme-pci: use simple suspend when a HMB is enabled · b97120b1

Committed by Christoph Hellwig on Jun 03, 2020

While the NVMe specification allows the device to access the host memory
buffer in host DRAM from all power states, hosts will fail accesses to
DRAM during S3 and similar power states.

Fixes: d916b1be ("nvme-pci: use host managed power state for suspend")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme-fc: don't call nvme_cleanup_cmd() for AENs · c9c12e51

Committed by Daniel Wagner on May 29, 2020

Asynchronous event notifications do not have an associated request.
When fcp_io() fails we unconditionally call nvme_cleanup_cmd(), which
leads to a crash.

Fixes: 16686f3a ("nvme: move common call to nvme_cleanup_cmd to core layer")
Signed-off-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Himanshu Madhani <hmadhani2024@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme-tcp: constify nvme_tcp_mq_ops and nvme_tcp_admin_mq_ops · 6acbd961

Committed by Rikard Falkeborn on May 29, 2020

nvme_tcp_mq_ops and nvme_tcp_admin_mq_ops are never modified and can be
made const to allow the compiler to put them in read-only memory.

Before:
   text    data     bss     dec     hex filename
  53102    6885     576   60563    ec93 drivers/nvme/host/tcp.o

After:
   text    data     bss     dec     hex filename
  53422    6565     576   60563    ec93 drivers/nvme/host/tcp.o
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
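
The size output in the nvme-tcp constify commit shows 320 bytes moving from data to text while the total stays at 60563. The following user-space sketch illustrates the same idea with a const table of function pointers; struct demo_ops and the demo_* functions are hypothetical stand-ins, not the driver's actual blk_mq_ops definitions.

```c
#include <assert.h>

/* Hypothetical stand-in for a table of driver callbacks. */
struct demo_ops {
	int (*queue)(int tag);
	int (*complete)(int tag);
};

static int demo_queue(int tag)
{
	return tag + 1;
}

static int demo_complete(int tag)
{
	return tag - 1;
}

/*
 * Declared const with a constant initializer, so the toolchain is free
 * to place the object in a read-only section instead of writable data.
 */
const struct demo_ops demo_mq_ops = {
	.queue = demo_queue,
	.complete = demo_complete,
};

/* Callers only ever read the table, so a const pointer is sufficient. */
int demo_submit(const struct demo_ops *ops, int tag)
{
	assert(ops && ops->queue);
	return ops->queue(tag);
}
```

Any attempt to assign to a member of demo_mq_ops is rejected at compile time, which is the property the commit relies on.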

nvme: do not call del_gendisk() on a disk that was never added · 108a5858

Committed by Niklas Cassel on Jun 07, 2020

device_add_disk() is negated by del_gendisk().
alloc_disk_node() is negated by put_disk().

In nvme_alloc_ns(), device_add_disk() is one of the last things being
called in the success case, and only void functions are being called
after this. Therefore this call should not be negated in the error path.

The superfluous call to del_gendisk() leads to the following prints:
[    7.839975] kobject: '(null)' (000000001ff73734): is not initialized, yet kobject_put() is being called.
[    7.840865] WARNING: CPU: 2 PID: 361 at lib/kobject.c:736 kobject_put+0x70/0x120

Fixes: 33cfdc2a ("nvme: enforce extended LBA format for fabrics metadata")
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

05 Jun, 2020 · 1 commit

block: remove the error argument to the block_bio_complete tracepoint · d24de76a

Committed by Christoph Hellwig on Jun 03, 2020

The status can be trivially derived from the bio itself.  That also avoids
callers like NVMe incorrectly passing a blk_status_t instead of the errno,
and the overhead of translating the blk_status_t to the errno in the I/O
completion fast path when no tracing is enabled.

Fixes: 35fe0d12 ("nvme: trace bio completion")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

30 May, 2020 · 1 commit

nvme: force complete cancelled requests · 3382a567

Committed by Keith Busch on May 29, 2020

Use blk_mq_force_complete_rq() to bypass fake timeout error injection so
that request reclaim may proceed.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

29 May, 2020 · 5 commits

ipv4: add ip_sock_set_tos · 6ebf71ba

Committed by Christoph Hellwig on May 28, 2020

Add a helper to directly set the IP_TOS sockopt from kernel space without
going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: add tcp_sock_set_syncnt · 557eadfc

Committed by Christoph Hellwig on May 28, 2020

Add a helper to directly set the TCP_SYNCNT sockopt from kernel space
without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: add tcp_sock_set_nodelay · 12abc5ee

Committed by Christoph Hellwig on May 28, 2020

Add a helper to directly set the TCP_NODELAY sockopt from kernel space
without going through a fake uaccess.  Clean up the callers to avoid
pointless wrappers now that this is a simple function call.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add sock_set_priority · 6e434967

Committed by Christoph Hellwig on May 28, 2020

Add a helper to directly set the SO_PRIORITY sockopt from kernel space
without going through a fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add sock_no_linger · c433594c

Committed by Christoph Hellwig on May 28, 2020

Add a helper to directly set the SO_LINGER sockopt from kernel space
with onoff set to true and a linger time of 0 without going through a
fake uaccess.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>

28 May, 2020 · 1 commit

nvme-pci: avoid race between nvme_reap_pending_cqes() and nvme_poll() · 9210c075

Committed by Dongli Zhang on May 27, 2020

There may be a race between nvme_reap_pending_cqes() and nvme_poll(), e.g.,
when doing live reset while polling the nvme device.

      CPU X                        CPU Y
                               nvme_poll()
nvme_dev_disable()
-> nvme_stop_queues()
-> nvme_suspend_io_queues()
-> nvme_suspend_queue()
                               -> spin_lock(&nvmeq->cq_poll_lock);
-> nvme_reap_pending_cqes()
   -> nvme_process_cq()        -> nvme_process_cq()

In the above scenario, the nvme_process_cq() for the same queue may be
running on both CPU X and CPU Y concurrently.

It is much easier to reproduce the issue when CONFIG_PREEMPT is
enabled in the kernel. When CONFIG_PREEMPT is disabled, it takes longer
for nvme_stop_queues()-->blk_mq_quiesce_queue() to wait for the grace
period.

This patch protects nvme_process_cq() with nvmeq->cq_poll_lock in
nvme_reap_pending_cqes().

Fixes: fa46c6fb ("nvme/pci: move cqe check after device shutdown")
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

27 May, 2020 · 16 commits

nvme: set dma alignment to qword · 3b2a1ebc

Committed by Keith Busch on May 20, 2020

The default dma alignment mask is 511, which is much larger than any nvme
controller requires. NVMe controllers accept qword aligned DMA addresses,
so set the request_queue constraints to that. This can help avoid bounce
buffers on user passthrough commands.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
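
To see why a smaller alignment mask can avoid bounce buffers, here is a minimal user-space sketch of an alignment check. It assumes that "qword aligned" corresponds to a mask of 7 (8-byte alignment); demo_dma_aligned() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when addr satisfies the alignment described by align_mask.
 * The mask must be of the form 2^n - 1 (for example 511 or 7).
 */
bool demo_dma_aligned(uintptr_t addr, unsigned int align_mask)
{
	assert(((align_mask + 1) & align_mask) == 0);
	return (addr & align_mask) == 0;
}
```

A user buffer at address 0x1008 fails the check with the default mask of 511 but passes with a mask of 7, so under the relaxed constraint it could be used directly rather than copied.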

nvme-rdma: add metadata/T10-PI support · 5ec5d3bd

Committed by Max Gurtovoy on May 19, 2020

For capable HCAs (e.g. ConnectX-5/ConnectX-6) this will allow end-to-end
protection information passthrough and validation for NVMe over RDMA
transport. Metadata offload support was implemented over the new RDMA
signature verbs API and it is enabled for capable controllers.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

nvme-rdma: introduce nvme_rdma_sgl structure · 324d9e78

Committed by Israel Rukshin on May 19, 2020

Remove first_sgl pointer from struct nvme_rdma_request and use pointer
arithmetic instead. The inline scatterlist, if it exists, will be located
right after the nvme_rdma_request. This patch is needed as a preparation
for adding PI support.
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
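
The pointer-arithmetic layout described in this commit can be sketched in user space as follows. The structures and helpers (demo_request, demo_sg, demo_inline_sgl, demo_request_alloc) are hypothetical and much simpler than the driver's real types; the sketch only shows how trailing data is located without storing a pointer to it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical scatterlist entry. */
struct demo_sg {
	void *addr;
	unsigned int len;
};

/* Hypothetical request; note that it stores no pointer to its SGL. */
struct demo_request {
	void *ctx;
	int tag;
};

/* The trailing entries are only correctly aligned if this holds. */
_Static_assert(sizeof(struct demo_request) % _Alignof(struct demo_sg) == 0,
	       "inline SGL would be misaligned");

/* The inline SGL starts immediately after the request structure. */
struct demo_sg *demo_inline_sgl(struct demo_request *req)
{
	assert(req);
	return (struct demo_sg *)(req + 1);
}

/* One allocation covers the request and nents trailing SGL entries. */
struct demo_request *demo_request_alloc(size_t nents)
{
	return calloc(1, sizeof(struct demo_request) +
			 nents * sizeof(struct demo_sg));
}
```

Dropping the stored pointer saves space in each request, at the cost of requiring every allocation site to reserve the trailing space consistently.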

nvme: introduce NVME_INLINE_METADATA_SG_CNT · ba7ca2ae

Committed by Israel Rukshin on May 19, 2020

SGL size of metadata is usually small. Thus, 1 inline sg should cover
most cases. The macro will be used to pre-allocate a single SGL entry
for metadata. The preallocation of small inline SGLs depends on SG_CHAIN
capability, so if the ARCH doesn't support SG_CHAIN, use the runtime
allocation for the SGL. This patch is a preparation for adding metadata
(T10-PI) over fabric support.
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

nvme: enforce extended LBA format for fabrics metadata · 33cfdc2a

Committed by Max Gurtovoy on May 19, 2020

An extended LBA is a larger LBA that is created when metadata associated
with the LBA is transferred contiguously with the LBA data (AKA
interleaved). The metadata may be either transferred as part of the LBA
(creating an extended LBA) or it may be transferred as a separate
contiguous buffer of data. According to the NVMeoF spec, a fabrics ctrl
supports only an Extended LBA format. Fail revalidation in case we have a
spec violation. Also add a flag that indicates capable transports and
controllers as part of a preparation for allowing end-to-end protection
information for fabric controllers.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

nvme: introduce max_integrity_segments ctrl attribute · 95093350

Committed by Max Gurtovoy on May 19, 2020

This patch doesn't change any logic, and is needed as a preparation
for adding PI support for fabrics drivers that will use an extended
LBA format for metadata and will support more than 1 integrity segment.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

nvme: make nvme_ns_has_pi accessible to transports · 4d2ce688

Committed by James Smart on May 19, 2020

Move the nvme_ns_has_pi() inline from core.c to the nvme.h header.
This allows use by the transports.
Signed-off-by: James Smart <jsmart2021@gmail.com>
[maxg: added a comment for nvme_ns_has_pi()]
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

nvme: introduce NVME_NS_METADATA_SUPPORTED flag · b29f8485

Committed by Max Gurtovoy on May 19, 2020

This is a preparation for adding support for metadata in fabric
controllers. The new flag indicates that the NVMe namespace supports
getting metadata that was originally generated by the host's block layer.
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

nvme: introduce namespace features flag · ffc89b1d

Committed by Max Gurtovoy on May 19, 2020

Replace the specific ext boolean (which indicates an extended LBA format)
with a feature in the new namespace features flag. This is a preparation
for adding more namespace features (such as metadata specific features).
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
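
Replacing a single boolean with a features bitmask can be sketched as below. The names demo_ns, DEMO_NS_EXT_LBAS and DEMO_NS_METADATA_SUPPORTED are illustrative assumptions for this sketch and are not copied from the driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-namespace feature bits, one bit per capability. */
enum demo_ns_features {
	DEMO_NS_EXT_LBAS = 1 << 0,		/* extended LBA format in use */
	DEMO_NS_METADATA_SUPPORTED = 1 << 1,	/* metadata can be passed through */
};

/* Hypothetical namespace object: one word replaces several booleans. */
struct demo_ns {
	unsigned long features;
};

/* Equivalent of testing the old "ext" boolean. */
bool demo_ns_has_ext_lbas(const struct demo_ns *ns)
{
	assert(ns);
	return (ns->features & DEMO_NS_EXT_LBAS) != 0;
}
```

Adding a further capability then only needs a new bit in the enum rather than a new structure member.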

nvme: delete an unnecessary declaration · ec0862ac

Committed by Dan Carpenter on May 15, 2020

The nvme_put_ctrl() is implemented earlier as an inline function so
this declaration isn't required.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

nvme: replace zero-length array with flexible-array · f1e71d75

Committed by Gustavo A. R. Silva on May 07, 2020

The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
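
As a small illustration of the flexible array member described in this commit, the sketch below allocates the struct foo from the commit message together with its trailing elements. The definition of struct boo and the helper foo_alloc() are assumptions added for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed element type; the commit message does not define struct boo. */
struct boo {
	int value;
};

struct foo {
	int stuff;
	struct boo array[];	/* flexible array member, must come last */
};

/*
 * Allocate a struct foo with room for n trailing elements.
 * sizeof(*f) does not include the flexible array, so the element
 * storage is added explicitly, with an overflow check.
 */
struct foo *foo_alloc(size_t n)
{
	struct foo *f;

	if (n > (SIZE_MAX - sizeof(*f)) / sizeof(f->array[0]))
		return NULL;
	f = calloc(1, sizeof(*f) + n * sizeof(f->array[0]));
	if (f)
		f->stuff = (int)n;
	return f;
}
```

Moving array[] ahead of stuff in the structure makes the compiler reject the declaration, which is the diagnostic the commit message refers to.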

nvme: fix io_opt limit setting · 68ab60ca

Committed by Damien Le Moal on May 14, 2020

Currently, a namespace io_opt queue limit is set by default to the
physical sector size of the namespace and to the write optimal
size (NOWS) when the namespace reports optimal IO sizes. This causes
problems with block limits stacking in blk_stack_limits() when a
namespace block device is combined with HDDs, which generally do not
report any optimal transfer size (io_opt limit is 0). The code:

/* Optimal I/O a multiple of the physical block size? */
if (t->io_opt & (t->physical_block_size - 1)) {
	t->io_opt = 0;
	t->misaligned = 1;
	ret = -1;
}

in blk_stack_limits() results in an error return for this function when
the combined devices have different but compatible physical sector
sizes (e.g. 512B sector SSD with 4KB sector disks).

Fix this by not setting the optimal IO size queue limit if the namespace
does not report an optimal write size value.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
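
The alignment check quoted in this commit can be exercised in user space with the reduced sketch below. struct demo_limits and demo_check_io_opt() are hypothetical stand-ins that keep only the fields involved. The example assumes that stacking a 512B-sector SSD whose io_opt defaults to 512 with a 4KB-sector disk yields io_opt = 512 and physical_block_size = 4096.

```c
#include <assert.h>

/* Hypothetical, reduced stand-in for the kernel's queue limits. */
struct demo_limits {
	unsigned int physical_block_size;
	unsigned int io_opt;
	int misaligned;
};

/*
 * Mirrors only the quoted check: io_opt must be a multiple of the
 * (power-of-two) physical block size. Returns -1 on misalignment.
 */
int demo_check_io_opt(struct demo_limits *t)
{
	assert(t->physical_block_size != 0 &&
	       (t->physical_block_size & (t->physical_block_size - 1)) == 0);

	if (t->io_opt & (t->physical_block_size - 1)) {
		t->io_opt = 0;
		t->misaligned = 1;
		return -1;
	}
	return 0;
}
```

With io_opt left at 0 when the namespace reports no optimal write size, the mask test is trivially satisfied and the stacking succeeds.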

nvme: disable streams when get stream params failed · 84e4c204

Committed by Wu Bo on May 13, 2020

Disable streams again if getting the stream params fails.
Signed-off-by: Wu Bo <wubo40@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

nvme-fc: print proper nvme-fc devloss_tmo value · 614fc1c0

Committed by Martin George on May 12, 2020

The nvme-fc devloss_tmo is computed as the min of either the
ctrl_loss_tmo (max_retries * reconnect_delay) or the remote port's
devloss_tmo. But what gets printed as the nvme-fc devloss_tmo in
nvme_fc_reconnect_or_delete() is always the remote port's devloss_tmo
value. So correct this by printing the min value instead.
Signed-off-by: Martin George <marting@netapp.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
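
The computation described in this commit reduces to taking the smaller of two timeouts. A minimal sketch, assuming finite, non-negative values in seconds and using a hypothetical helper name:

```c
#include <assert.h>

/*
 * Effective devloss_tmo: the smaller of ctrl_loss_tmo
 * (max_retries * reconnect_delay) and the remote port's devloss_tmo.
 */
unsigned int demo_effective_devloss_tmo(unsigned int max_retries,
					unsigned int reconnect_delay,
					unsigned int rport_devloss_tmo)
{
	unsigned int ctrl_loss_tmo = max_retries * reconnect_delay;

	assert(reconnect_delay != 0);
	return ctrl_loss_tmo < rport_devloss_tmo ?
	       ctrl_loss_tmo : rport_devloss_tmo;
}
```

For example, 3 retries with a 10 second delay against a remote port value of 60 gives 30, which is the value the corrected message should print rather than 60.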

nvme-pci: make sure write/poll_queues less or equal then cpu count · 9c9e76d5

Committed by Weiping Zhang on May 09, 2020

Check module parameter write/poll_queues before using it to catch
too large values.

Reproducer:

modprobe -r nvme
modprobe nvme write_queues=`nproc`
echo $((`nproc`+1)) > /sys/module/nvme/parameters/write_queues
echo 1 > /sys/block/nvme0n1/device/reset_controller

[  657.069000] ------------[ cut here ]------------
[  657.069022] WARNING: CPU: 10 PID: 1163 at kernel/irq/affinity.c:390 irq_create_affinity_masks+0x47c/0x4a0
[  657.069056]  dm_region_hash dm_log dm_mod
[  657.069059] CPU: 10 PID: 1163 Comm: kworker/u193:9 Kdump: loaded Tainted: G        W         5.6.0+ #8
[  657.069060] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
[  657.069064] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[  657.069066] RIP: 0010:irq_create_affinity_masks+0x47c/0x4a0
[  657.069067] Code: fe ff ff 48 c7 c0 b0 89 14 95 48 89 46 20 e9 e9 fb ff ff 31 c0 e9 90 fc ff ff 0f 0b 48 c7 44 24 08 00 00 00 00 e9 e9 fc ff ff <0f> 0b e9 87 fe ff ff 48 8b 7c 24 28 e8 33 a0 80 00 e9 b6 fc ff ff
[  657.069068] RSP: 0018:ffffb505ce1ffc78 EFLAGS: 00010202
[  657.069069] RAX: 0000000000000060 RBX: ffff9b97921fe5c0 RCX: 0000000000000000
[  657.069069] RDX: ffff9b67bad80000 RSI: 00000000ffffffa0 RDI: 0000000000000000
[  657.069070] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff9b97921fe718
[  657.069070] R10: ffff9b97921fe710 R11: 0000000000000001 R12: 0000000000000064
[  657.069070] R13: 0000000000000060 R14: 0000000000000000 R15: 0000000000000001
[  657.069071] FS:  0000000000000000(0000) GS:ffff9b67c0880000(0000) knlGS:0000000000000000
[  657.069072] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  657.069072] CR2: 0000559eac6fc238 CR3: 000000057860a002 CR4: 00000000007606e0
[  657.069073] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  657.069073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  657.069073] PKRU: 55555554
[  657.069074] Call Trace:
[  657.069080]  __pci_enable_msix_range+0x233/0x5a0
[  657.069085]  ? kernfs_put+0xec/0x190
[  657.069086]  pci_alloc_irq_vectors_affinity+0xbb/0x130
[  657.069089]  nvme_reset_work+0x6e6/0xeab [nvme]
[  657.069093]  ? __switch_to_asm+0x34/0x70
[  657.069094]  ? __switch_to_asm+0x40/0x70
[  657.069095]  ? nvme_irq_check+0x30/0x30 [nvme]
[  657.069098]  process_one_work+0x1a7/0x370
[  657.069101]  worker_thread+0x1c9/0x380
[  657.069102]  ? max_active_store+0x80/0x80
[  657.069103]  kthread+0x112/0x130
[  657.069104]  ? __kthread_parkme+0x70/0x70
[  657.069105]  ret_from_fork+0x35/0x40
[  657.069106] ---[ end trace f4f06b7d24513d06 ]---
[  657.077110] nvme nvme0: 95/1/0 default/read/poll queues
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
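
One plausible way to catch too-large values is to validate the parameter when it is written. The sketch below is a user-space approximation under that assumption; demo_queue_count_set() and its nr_cpus argument are hypothetical and do not reproduce the patch's actual code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a decimal queue count and reject anything that is not a plain
 * number or that exceeds the CPU count. On success the value is stored
 * in *out and 0 is returned; otherwise -EINVAL is returned and *out is
 * left untouched.
 */
int demo_queue_count_set(const char *val, unsigned int nr_cpus,
			 unsigned int *out)
{
	char *end;
	unsigned long n;

	assert(val && out);
	errno = 0;
	n = strtoul(val, &end, 10);
	if (errno || end == val || *end != '\0' || n > nr_cpus)
		return -EINVAL;
	*out = (unsigned int)n;
	return 0;
}
```

With 8 CPUs, writing "9" is refused and the previously stored value stays in effect, so a later controller reset never sees a count above the CPU count.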

nvme-tcp: set MSG_SENDPAGE_NOTLAST with MSG_MORE when we have more to send · 5bb052d7

Committed by Sagi Grimberg on May 04, 2020

We can signal the stack that this is not the last page coming and the
stack can build a larger TSO segment, so go ahead and use it.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

13 May, 2020 · 1 commit

nvme-pci: dma read memory barrier for completions · b69e2ef2

Committed by Keith Busch on May 08, 2020

Control dependencies do not guarantee load order across the condition,
allowing a CPU to predict and speculate memory reads.

Commit 324b494c inlined verifying a new completion with its
handling. At least one architecture was observed to access the contents
out of order, resulting in the driver using stale data for the
completion.

Add a dma read barrier before reading the completion queue entry and
after the condition its contents depend on to ensure the read order is
deterministic.
Reported-by: John Garry <john.garry@huawei.com>
Suggested-by: Will Deacon <will@kernel.org>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Tested-by: John Garry <john.garry@huawei.com>
Acked-by: Will Deacon <will@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

10 May, 2020 · 8 commits

nvme: define constants for identification values · 92decf11

Committed by Keith Busch on Apr 03, 2020

Improve code readability by defining the specification's constants that
the driver is using when decoding identification payloads.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Bart van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme-multipath: stop using ->queuedata · 7890b970

Committed by Christoph Hellwig on Mar 29, 2020

nvme-multipath already uses the gendisk private data, so there is no need
to also set up the request_queue queuedata and use it in one place only.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme-tcp: try to send request in queue_rq context · db5ad6b7

Committed by Sagi Grimberg on May 01, 2020

Today, nvme-tcp automatically schedules a send request
to a workqueue context, which is 1 more than we'd need
in case the socket buffer is wide open.

However, because we have async send activity (as a result
of r2t, or write_space callbacks), we need to synchronize
sends from possibly multiple contexts (ideally all running
on the same cpu though).

Thus, we only try to send directly from queue_rq in cases:
1. the send_list is empty
2. we can send it synchronously (i.e. not from the RX path)
3. we run on the same cpu as the queue->io_cpu to avoid
   contention on the send operation.
Proposed-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
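
The three conditions listed in this commit for sending directly from queue_rq can be read as a single predicate. The sketch below expresses them with hypothetical types and names (demo_queue, pending_sends, demo_can_send_inline); it is a user-space illustration, not the driver's logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced view of a per-queue send state. */
struct demo_queue {
	int io_cpu;		/* CPU the queue's I/O work is bound to */
	int pending_sends;	/* requests already waiting on the send list */
};

/*
 * Direct send is attempted only when all three conditions hold:
 * 1. nothing is queued ahead of this request,
 * 2. the caller may send synchronously (not the RX path),
 * 3. the caller runs on the queue's io_cpu.
 */
bool demo_can_send_inline(const struct demo_queue *q, bool sync,
			  int current_cpu)
{
	assert(q);
	return q->pending_sends == 0 && sync && q->io_cpu == current_cpu;
}
```

When any condition fails, the request would fall back to the workqueue path the commit message describes as the previous default.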

nvme-tcp: avoid scheduling io_work if we are already polling · 72e5d757

Committed by Sagi Grimberg on May 01, 2020

When the user runs polled I/O, we shouldn't have to trigger
the workqueue to generate the receive work upon the .data_ready
upcall. This prevents a redundant context switch when the
application is already polling for completions.
Proposed-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme-tcp: use bh_lock in data_ready · 386e5e6e

Committed by Sagi Grimberg on Apr 30, 2020

data_ready may be invoked from send context or from
softirq, so we need bh locking for that.

Fixes: 3f2304f8 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme-pci: align io queue count with allocted nvme_queue in nvme_probe · 2a5bcfdd

Committed by Weiping Zhang on May 02, 2020

Since commit 147b27e4 ("nvme-pci: allocate device queues storage
space at probe"), nvme_alloc_queue does not alloc the nvme queues
itself anymore.

If the write/poll_queues module parameters are changed at runtime to
values larger than the number of allocated queues in nvme_probe,
nvme_alloc_queue will access unallocated memory.

Add a new nr_allocated_queues member to struct nvme_dev to record how
many queues were allocated in nvme_probe to avoid using more than the
allocated queues after a reset following a change to the
write/poll_queues module parameters.

Also add nr_write_queues and nr_poll_queues members to allow refreshing
the number of write and poll queues based on a change to the module
parameters when resetting the controller.

Fixes: 147b27e4 ("nvme-pci: allocate device queues storage space at probe")
Signed-off-by: Weiping Zhang <zhangweiping@didiglobal.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
[hch: add nvme_max_io_queues, update the commit message]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme-pci: remove last_sq_tail · 54b2fcee

Committed by Keith Busch on Apr 27, 2020

The nvme driver does not have enough tags to wrap the queue, and blk-mq
will no longer call commit_rqs() when there are no new submissions to
notify.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme-pci: remove volatile cqes · 74943d45

Committed by Keith Busch on Apr 28, 2020

The completion queue entry is not volatile once the phase is confirmed.
Remove the volatile keywords and check the phase using the appropriate
READ_ONCE() accessor, allowing the compiler to optimize the remaining
completion path.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

openeuler / Kernel · synced successfully more than 1 year ago
