- 06 December 2022, 1 commit
-
-
Submitted by Pankaj Raghav
A device might have a core quirk for NVME_QUIRK_IGNORE_DEV_SUBNQN (such as the Samsung X5), but it would still give a "missing or invalid SUBNQN field" warning, as core quirks are filled in only after calling nvme_init_subnqn. Fill ctrl->quirks from struct core_quirks before calling nvme_init_subsystem to fix this. Tested on a Samsung X5. Fixes: ab9e00cc ("nvme: track subsystems") Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
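A minimal sketch of the ordering described above, assuming the core driver's existing core_quirks[] match table and quirk_matches() helper; the only point is that the quirk loop runs before nvme_init_subsystem():

        /* Sketch: apply core quirks before the subsystem/SUBNQN is parsed, so
         * NVME_QUIRK_IGNORE_DEV_SUBNQN is already visible to nvme_init_subnqn(). */
        if (!ctrl->identified) {
                unsigned int i;

                for (i = 0; i < ARRAY_SIZE(core_quirks); i++) {
                        if (quirk_matches(id, &core_quirks[i]))
                                ctrl->quirks |= core_quirks[i].quirks;
                }

                ret = nvme_init_subsystem(ctrl, id);    /* now sees the quirk */
                if (ret)
                        goto out_free;
        }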
-
- 30 November 2022, 2 commits
-
-
Submitted by Caleb Sander
Walking the nvme_ns_head siblings list is protected by the head's srcu in nvme_ns_head_submit_bio() but not in nvme_mpath_revalidate_paths(). Removing namespaces from the list also fails to synchronize the srcu. Concurrent scan work can therefore cause use-after-frees. Hold the head's srcu lock in nvme_mpath_revalidate_paths() and synchronize with the srcu, not the global RCU, in nvme_ns_remove(). Observed the following panic when making NVMe/RDMA connections with native multipath on the Rocky Linux 8.6 kernel (it seems the upstream kernel has the same race condition). Disassembly shows the faulting instruction is cmp 0x50(%rdx),%rcx; computing capacity != get_capacity(ns->disk). Address 0x50 is dereferenced because ns->disk is NULL. The NULL disk appears to be the result of concurrent scan work freeing the namespace (note the log line in the middle of the panic).
[37314.206036] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[37314.206036] nvme0n3: detected capacity change from 0 to 11811160064
[37314.299753] PGD 0 P4D 0
[37314.299756] Oops: 0000 [#1] SMP PTI
[37314.299759] CPU: 29 PID: 322046 Comm: kworker/u98:3 Kdump: loaded Tainted: G W X --------- - - 4.18.0-372.32.1.el8test86.x86_64 #1
[37314.299762] Hardware name: Dell Inc. PowerEdge R720/0JP31P, BIOS 2.7.0 05/23/2018
[37314.299763] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[37314.299783] RIP: 0010:nvme_mpath_revalidate_paths+0x26/0xb0 [nvme_core]
[37314.299790] Code: 1f 44 00 00 66 66 66 66 90 55 53 48 8b 5f 50 48 8b 83 c8 c9 00 00 48 8b 13 48 8b 48 50 48 39 d3 74 20 48 8d 42 d0 48 8b 50 20 <48> 3b 4a 50 74 05 f0 80 60 70 ef 48 8b 50 30 48 8d 42 d0 48 39 d3
[37315.058803] RSP: 0018:ffffabe28f913d10 EFLAGS: 00010202
[37315.121316] RAX: ffff927a077da800 RBX: ffff92991dd70000 RCX: 0000000001600000
[37315.206704] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff92991b719800
[37315.292106] RBP: ffff929a6b70c000 R08: 000000010234cd4a R09: c0000000ffff7fff
[37315.377501] R10: 0000000000000001 R11: ffffabe28f913a30 R12: 0000000000000000
[37315.462889] R13: ffff92992716600c R14: ffff929964e6e030 R15: ffff92991dd70000
[37315.548286] FS: 0000000000000000(0000) GS:ffff92b87fb80000(0000) knlGS:0000000000000000
[37315.645111] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37315.713871] CR2: 0000000000000050 CR3: 0000002208810006 CR4: 00000000000606e0
[37315.799267] Call Trace:
[37315.828515]  nvme_update_ns_info+0x1ac/0x250 [nvme_core]
[37315.892075]  nvme_validate_or_alloc_ns+0x2ff/0xa00 [nvme_core]
[37315.961871]  ? __blk_mq_free_request+0x6b/0x90
[37316.015021]  nvme_scan_work+0x151/0x240 [nvme_core]
[37316.073371]  process_one_work+0x1a7/0x360
[37316.121318]  ? create_worker+0x1a0/0x1a0
[37316.168227]  worker_thread+0x30/0x390
[37316.212024]  ? create_worker+0x1a0/0x1a0
[37316.258939]  kthread+0x10a/0x120
[37316.297557]  ? set_kthread_struct+0x50/0x50
[37316.347590]  ret_from_fork+0x35/0x40
[37316.390360] Modules linked in: nvme_rdma nvme_tcp(X) nvme_fabrics nvme_core netconsole iscsi_tcp libiscsi_tcp dm_queue_length dm_service_time nf_conntrack_netlink br_netfilter bridge stp llc overlay nft_chain_nat ipt_MASQUERADE nf_nat xt_addrtype xt_CT nft_counter xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment xt_multiport nft_compat nf_tables libcrc32c nfnetlink dm_multipath tg3 rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm intel_rapl_msr iTCO_wdt iTCO_vendor_support dcdbas intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass crct10dif_pclmul crc32_pclmul mlx5_ib ghash_clmulni_intel ib_uverbs rapl intel_cstate intel_uncore ib_core ipmi_si joydev mei_me pcspkr ipmi_devintf mei lpc_ich wmi ipmi_msghandler acpi_power_meter ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 mlx5_core drm_kms_helper syscopyarea
[37316.390419]  sysfillrect ahci sysimgblt fb_sys_fops libahci drm crc32c_intel libata mlxfw pci_hyperv_intf tls i2c_algo_bit psample dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: nvme_core]
[37317.645908] CR2: 0000000000000050
Fixes: e7d65803 ("nvme-multipath: revalidate paths during rescan") Signed-off-by: Caleb Sander <csander@purestorage.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
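A hedged sketch of the locking change described above; the field and flag names (head->srcu, head->list, siblings, NVME_NS_READY) are assumed from the rest of drivers/nvme/host and this is not a verbatim copy of the patch:

        /* Sketch: walk the siblings list under the head's SRCU so a concurrent
         * namespace removal cannot free a sibling out from under us. */
        void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
        {
                struct nvme_ns_head *head = ns->head;
                sector_t capacity = get_capacity(head->disk);
                struct nvme_ns *sibling;
                int srcu_idx;

                srcu_idx = srcu_read_lock(&head->srcu);
                list_for_each_entry_rcu(sibling, &head->list, siblings) {
                        if (capacity != get_capacity(sibling->disk))
                                clear_bit(NVME_NS_READY, &sibling->flags);
                }
                srcu_read_unlock(&head->srcu, srcu_idx);
                /* ... */
        }

        /* And in nvme_ns_remove(): wait for the head's SRCU readers, not just RCU. */
        synchronize_srcu(&ns->head->srcu);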
-
Submitted by Lei Rao
If the prp2 field is not filled in by nvme_setup_prp_simple(), it contains garbage data. According to the NVMe spec, prp2 is reserved if the data transfer does not cross a memory page boundary, so clear it to zero when it is not used. Signed-off-by: Lei Rao <lei.rao@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
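A minimal sketch of the fix; variable and field names follow the existing nvme_setup_prp_simple() code in drivers/nvme/host/pci.c:

        cmnd->dptr.prp1 = cpu_to_le64(iod->first_dma);
        if (bv->bv_len > first_prp_len)
                cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma + first_prp_len);
        else
                cmnd->dptr.prp2 = 0;    /* reserved: transfer stays within one page */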
-
- 16 November 2022, 1 commit
-
-
Submitted by Tiago Dias Ferreira
Added a quirk to fix the Netac NV7000 SSD reporting duplicate NGUIDs. Cc: <stable@vger.kernel.org> Signed-off-by: Tiago Dias Ferreira <tiagodfer@gmail.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
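Quirks like this one (and the Micron Nitro and Lexar NM760 entries below) are typically a one-line addition to the PCI ID table in drivers/nvme/host/pci.c; a sketch with a placeholder vendor/device ID, since the real IDs are not quoted here:

        static const struct pci_device_id nvme_id_table[] = {
                /* ... existing entries ... */
                { PCI_DEVICE(0x1234, 0x5678),   /* placeholder, not the real NV7000 ID */
                        .driver_data = NVME_QUIRK_BOGUS_NID, }, /* ignore duplicate NGUIDs */
                /* ... */
                { 0, }
        };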
-
- 15 November 2022, 1 commit
-
-
Submitted by Bean Huo
Added a quirk to fix the Micron Nitro NVMe reporting duplicate NGUIDs. Cc: <stable@vger.kernel.org> Signed-off-by: Bean Huo <beanhuo@micron.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 09 November 2022, 1 commit
-
-
Submitted by Keith Busch
The driver is spamming the kernel logs with entirely harmless errors from user space submitting unsupported commands. Just silence the errors. The application has direct access to the command status, so there's no need to log these. And since every passthrough command now uses the quiet flag, move the setting into the common initializer. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Tested-by: Alan Adamson <alan.adamson@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 25 October 2022, 3 commits
-
-
Submitted by Keith Busch
The NVMe spec requires all transports to support dword-aligned addresses, and this limit is already set in the namespace request_queue. Set the same limit in the multipath device's request_queue as well. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
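A one-line sketch of the change, assuming the multipath head's queue is reachable as head->disk->queue as elsewhere in multipath.c:

        /* Mirror the dword DMA alignment already set on per-namespace queues. */
        blk_queue_dma_alignment(head->disk->queue, 3);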
-
Submitted by Sagi Grimberg
When destroying a queue, when calling sock_release, the network stack might need to allocate an skb to send a FIN/RST. When that happens during memory pressure, there is a need to reclaim memory, which in turn may ask the nvme-tcp device to write out dirty pages; however, this is not possible due to the ctrl teardown that is going on. Set PF_MEMALLOC for the task that releases the socket to grant it access to the PF_MEMALLOC reserves. In addition, do the same for the nvme-tcp thread, as this may also originate from the swap itself and should be more resilient to memory pressure situations. This fixes the following lockdep complaint:
======================================================
WARNING: possible circular locking dependency detected
6.0.0-rc2+ #25 Tainted: G W
------------------------------------------------------
kswapd0/92 is trying to acquire lock:
ffff888114003240 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_sendpage+0x23/0xa0
but task is already holding lock:
ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (fs_reclaim){+.+.}-{0:0}:
   fs_reclaim_acquire+0x11e/0x160
   kmem_cache_alloc_node+0x44/0x530
   __alloc_skb+0x158/0x230
   tcp_send_active_reset+0x7e/0x730
   tcp_disconnect+0x1272/0x1ae0
   __tcp_close+0x707/0xd90
   tcp_close+0x26/0x80
   inet_release+0xfa/0x220
   sock_release+0x85/0x1a0
   nvme_tcp_free_queue+0x1fd/0x470 [nvme_tcp]
   nvme_do_delete_ctrl+0x130/0x13d [nvme_core]
   nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
   kernfs_fop_write_iter+0x356/0x530
   vfs_write+0x4e8/0xce0
   ksys_write+0xfd/0x1d0
   do_syscall_64+0x58/0x80
   entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
   __lock_acquire+0x2a0c/0x5690
   lock_acquire+0x18e/0x4f0
   lock_sock_nested+0x37/0xc0
   tcp_sendpage+0x23/0xa0
   inet_sendpage+0xad/0x120
   kernel_sendpage+0x156/0x440
   nvme_tcp_try_send+0x48a/0x2630 [nvme_tcp]
   nvme_tcp_queue_rq+0xefb/0x17e0 [nvme_tcp]
   __blk_mq_try_issue_directly+0x452/0x660
   blk_mq_plug_issue_direct.constprop.0+0x207/0x700
   blk_mq_flush_plug_list+0x6f5/0xc70
   __blk_flush_plug+0x264/0x410
   blk_finish_plug+0x4b/0xa0
   shrink_lruvec+0x1263/0x1ea0
   shrink_node+0x736/0x1a80
   balance_pgdat+0x740/0x10d0
   kswapd+0x5f2/0xaf0
   kthread+0x256/0x2f0
   ret_from_fork+0x1f/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(sk_lock-AF_INET-NVME);
                               lock(fs_reclaim);
  lock(sk_lock-AF_INET-NVME);
*** DEADLOCK ***
3 locks held by kswapd0/92:
#0: ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0
#1: ffff88811f21b0b0 (q->srcu){....}-{0:0}, at: blk_mq_flush_plug_list+0x6b3/0xc70
#2: ffff888170b11470 (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0xeb9/0x17e0 [nvme_tcp]
Fixes: 3f2304f8 ("nvme-tcp: add NVMe over TCP host driver") Reported-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
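A sketch of the socket-release side, assuming the memalloc_noreclaim_save()/restore() helpers (which set and clear PF_MEMALLOC) are wrapped around sock_release(); the helper name below is illustrative, not the patch's:

        #include <linux/sched/mm.h>
        #include <linux/net.h>

        /* Sketch: releasing the socket may allocate an skb for FIN/RST; give the
         * task access to memory reserves so that allocation cannot recurse into
         * reclaim and back into nvme-tcp writeback during controller teardown. */
        static void nvme_tcp_release_sock(struct socket *sock)  /* illustrative name */
        {
                unsigned int noreclaim_flag;

                noreclaim_flag = memalloc_noreclaim_save();     /* sets PF_MEMALLOC */
                sock_release(sock);
                memalloc_noreclaim_restore(noreclaim_flag);
        }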
-
Submitted by Nam Cao
In nvme_tcp_ddgst_update(), sg_init_marker() is called with an uninitialized scatterlist. This is probably fine, but gcc complains:
CC [M] drivers/nvme/host/tcp.o
In file included from ./include/linux/dma-mapping.h:10,
 from ./include/linux/skbuff.h:31,
 from ./include/net/net_namespace.h:43,
 from ./include/linux/netdevice.h:38,
 from ./include/net/sock.h:46,
 from drivers/nvme/host/tcp.c:12:
In function ‘sg_mark_end’,
 inlined from ‘sg_init_marker’ at ./include/linux/scatterlist.h:356:2,
 inlined from ‘nvme_tcp_ddgst_update’ at drivers/nvme/host/tcp.c:390:2:
./include/linux/scatterlist.h:234:11: error: ‘sg.page_link’ is used uninitialized [-Werror=uninitialized]
 234 |   sg->page_link |= SG_END;
     |   ~~^~~~~~~~~~~
drivers/nvme/host/tcp.c: In function ‘nvme_tcp_ddgst_update’:
drivers/nvme/host/tcp.c:388:28: note: ‘sg’ declared here
 388 |   struct scatterlist sg;
     |                      ^~
cc1: all warnings being treated as errors
Use sg_init_table() instead, which memsets the scatterlist to zero before calling sg_init_marker(). Signed-off-by: Nam Cao <namcaov@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
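A short sketch of the silenced pattern: sg_init_table() zeroes the on-stack entry before marking its end, so the compiler no longer sees an uninitialized read (page/len/off stand in for the values already used at this call site):

        struct scatterlist sg;

        sg_init_table(&sg, 1);                  /* memset + sg_init_marker() */
        sg_set_page(&sg, page, len, off);       /* as before in nvme_tcp_ddgst_update() */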
-
- 19 October 2022, 5 commits
-
-
Submitted by Serge Semin
Recent commit 52fde2c0 ("nvme: set dma alignment to dword") has caused a regression on our platform. It turned out that the nvme_get_log() method invocation caused corruption of the nvme_hwmon_data structure instance. In particular, the nvme_hwmon_data.ctrl pointer was overwritten either with zeros or with garbage. After some research we discovered that the problem happened even before the actual NVME DMA execution, during the buffer mapping. Since our platform is DMA-noncoherent, the mapping implied cache-line invalidations or write-backs depending on the DMA-direction parameter. In the case of getting the NVME SMART log, the DMA was performed from device to memory, so cache invalidation was activated during the buffer mapping. Since the log buffer isn't cache-line aligned, the cache invalidation caused the neighbouring data to be discarded. The neighbouring data turned out to be the data surrounding the buffer in the framework of the nvme_hwmon_data structure. In order to fix that, we need to make sure that the whole log buffer is defined within a cache-line-aligned memory region, so the cache-invalidation procedure doesn't involve the adjacent data. One of the options to guarantee that is to kmalloc the DMA buffer [1]. Seeing that the rest of the NVME core driver prefers that method, it has been chosen to fix this problem too. Note that after deeper research we found out that the denoted commit wasn't the root cause of the problem. It just revealed the invalidity by activating the DMA-based NVME SMART log retrieval performed in the framework of the NVME hwmon driver. The problem has been there since the initial commit of the driver. [1] Documentation/core-api/dma-api-howto.rst Fixes: 400b6a7b ("nvme: Add hardware monitoring support") Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru> Signed-off-by: Christoph Hellwig <hch@lst.de>
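A hedged sketch of the shape of the fix, with field names assumed for the hwmon driver's private data; the key point is that the SMART log lives in its own kmalloc'ed (and therefore DMA-safe, suitably aligned) allocation rather than being embedded next to other fields:

        struct nvme_hwmon_data {
                struct nvme_ctrl        *ctrl;
                struct nvme_smart_log   *log;   /* was: embedded struct nvme_smart_log */
                struct mutex            read_lock;
        };

        /* at init time */
        data->log = kzalloc(sizeof(*data->log), GFP_KERNEL);
        if (!data->log)
                return -ENOMEM;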
-
Submitted by Christoph Hellwig
An NVMe controller works perfectly fine even when the hwmon initialization fails. Stop returning errors from nvme_hwmon_init that do not come from a controller reset, so that this case is handled consistently. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Serge Semin <fancer.lancer@gmail.com>
-
Submitted by Russell King (Oracle)
NVMe uses PRPs for data transfers and has no specific limit for a single DMA segment. Limiting the size will cause problems because the block layer assumes PRP-ish devices using a virt boundary mask don't have a segment limit. And while this is true, we also really need to tell the DMA mapping layer about it, otherwise dma-debug will trip over it. Fixes: 5bd2927a ("nvme-apple: Add initial Apple SoC NVMe driver") Suggested-by: Sven Peter <sven@svenpeter.dev> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [hch: rewrote the commit message based on the PCIe commit] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Curtin <ecurtin@redhat.com> Reviewed-by: Sven Peter <sven@svenpeter.dev>
-
Submitted by Xander Li
Kingston SSDs do support the NVMe Write_Zeroes cmd but take a long time to process it. The firmware version is locked on these SSDs, so we cannot expect a firmware improvement; disable the Write_Zeroes cmd. Signed-off-by: Xander Li <xander_li@kingston.com.tw> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Submitted by Dan Carpenter
There is a typo here, so the wrong variable is released: "ctrl->admin_q" was intended instead of "ctrl->fabrics_q". Fixes: fe60e8c5 ("nvme: add common helpers to allocate and free tagsets") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
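The fix itself is a one-identifier change in the error unwinding; a sketch under the assumption that the queue is torn down with blk_mq_destroy_queue() (the label name is illustrative):

        out_cleanup_queue:
                blk_mq_destroy_queue(ctrl->admin_q);    /* was mistakenly ctrl->fabrics_q */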
-
- 12 October 2022, 5 commits
-
-
Submitted by Sagi Grimberg
When we revalidate paths as part of an ns size change (as of commit e7d65803), it is possible that during the path revalidation the only paths that are IO capable (i.e. optimized/non-optimized) are the ones for which the host has not yet been informed of the ns resize, which will cause inflight requests to be requeued (as we have available paths but none are IO capable). These requests on the requeue list are waiting for someone to resubmit them at some point. The IO capable paths will eventually notify the ns resize change to the host, but there is nothing that will kick the requeue list to resubmit the queued requests. Fix this by always kicking the requeue list; if no IO capable path exists, these requests will simply be queued again. A typical log that indicates that IOs are requeued:
nvme nvme1: creating 4 I/O queues.
nvme nvme1: new ctrl: "testnqn1"
nvme nvme2: creating 4 I/O queues.
nvme nvme2: mapped 4/0/0 default/read/poll queues.
nvme nvme2: new ctrl: NQN "testnqn1", addr 127.0.0.1:8009
nvme nvme1: rescanning namespaces.
nvme1n1: detected capacity change from 2097152 to 4194304
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
nvme nvme2: rescanning namespaces.
Reported-by: Yogev Cohen <yogev@lightbitslabs.com> Fixes: e7d65803 ("nvme-multipath: revalidate paths during rescan") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Cc: <stable@vger.kernel.org> # v5.15+ Signed-off-by: Christoph Hellwig <hch@lst.de>
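A sketch of the kick, assuming the head's requeue_work is scheduled via kblockd as elsewhere in multipath.c; bios parked on the requeue list are then resubmitted, and are simply requeued again if no IO capable path exists yet:

        /* Always kick the requeue list after revalidating paths, even if no
         * path currently looks usable; a later ANA/size update will re-kick. */
        kblockd_schedule_work(&ns->head->requeue_work);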
-
Submitted by Xi Ruoyao
ZHITAI TiPlus5000 SSDs have the same APST sleep problem as their cousin, the TiPro7000. The quirk for the TiPro7000 was added in commit 6b961bce ("nvme-pci: avoid the deepest sleep state on ZHITAI TiPro7000 SSDs"); use the same quirk for the TiPlus5000. The APST data from "nvme id-ctrl /dev/nvme1":
vid : 0x1e49
ssvid : 0x1e49
sn : ZTA21T0KA2227304LM
mn : ZHITAI TiPlus5000 1TB
fr : ZTA09139
[...]
ps 0 : mp:6.50W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:5.80W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:3.60W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0500W non-operational enlat:5000 exlat:10000 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0025W non-operational enlat:8000 exlat:45000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:-
Reported-and-tested-by: Chang Feng <flukehn@gmail.com> Signed-off-by: Xi Ruoyao <xry111@xry111.site> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Submitted by Abhijit
Add a quirk to fix Lexar NM760 SSD drives reporting duplicate nsids. Signed-off-by: Abhijit <abhijit@abhijittomar.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Submitted by Sagi Grimberg
When we delete a controller, we execute the following:
1. nvme_stop_ctrl() - stop some work elements that may be inflight or scheduled (specifically also .stop_ctrl, which cancels ctrl error recovery work)
2. nvme_remove_namespaces() - which first flushes scan_work to avoid competing ns addition/removal
3. continue to tear down the controller
However, if err_work was scheduled to run in (1), it is designed to cancel any inflight I/O, particularly I/O that originates from ns scan_work in (2), but because it is cancelled in .stop_ctrl(), we can prevent forward progress of (2), as ns scanning is blocking on I/O (that will never be cancelled). The race is:
1. transport layer error observed -> err_work is scheduled
2. scan_work executes, discovers ns, generates I/O to it
3. nvme_stop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work) - err_work never executed
4. nvme_remove_namespaces() -> flush_work(scan_work) --> deadlock, because scan_work is blocked on I/O that was supposed to be cancelled by err_work, but was cancelled before executing (see stack trace [1]).
Fix this by flushing err_work instead of cancelling it, to force it to execute and cancel all inflight I/O.
[1]:
Call Trace:
<TASK>
__schedule+0x390/0x910
? scan_shadow_nodes+0x40/0x40
schedule+0x55/0xe0
io_schedule+0x16/0x40
do_read_cache_page+0x55d/0x850
? __page_cache_alloc+0x90/0x90
read_cache_page+0x12/0x20
read_part_sector+0x3f/0x110
amiga_partition+0x3d/0x3e0
? osf_partition+0x33/0x220
? put_partition+0x90/0x90
bdev_disk_changed+0x1fe/0x4d0
blkdev_get_whole+0x7b/0x90
blkdev_get_by_dev+0xda/0x2d0
device_add_disk+0x356/0x3b0
nvme_mpath_set_live+0x13c/0x1a0 [nvme_core]
? nvme_parse_ana_log+0xae/0x1a0 [nvme_core]
nvme_update_ns_ana_state+0x3a/0x40 [nvme_core]
nvme_mpath_add_disk+0x120/0x160 [nvme_core]
nvme_alloc_ns+0x594/0xa00 [nvme_core]
nvme_validate_or_alloc_ns+0xb9/0x1a0 [nvme_core]
? __nvme_submit_sync_cmd+0x1d2/0x210 [nvme_core]
nvme_scan_work+0x281/0x410 [nvme_core]
process_one_work+0x1be/0x380
worker_thread+0x37/0x3b0
? process_one_work+0x380/0x380
kthread+0x12d/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x1f/0x30
</TASK>
INFO: task nvme:6725 blocked for more than 491 seconds.
Not tainted 5.15.65-f0.el7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:nvme state:D stack: 0 pid: 6725 ppid: 1761 flags:0x00004000
Call Trace:
<TASK>
__schedule+0x390/0x910
? sched_clock+0x9/0x10
schedule+0x55/0xe0
schedule_timeout+0x24b/0x2e0
? try_to_wake_up+0x358/0x510
? finish_task_switch+0x88/0x2c0
wait_for_completion+0xa5/0x110
__flush_work+0x144/0x210
? worker_attach_to_pool+0xc0/0xc0
flush_work+0x10/0x20
nvme_remove_namespaces+0x41/0xf0 [nvme_core]
nvme_do_delete_ctrl+0x47/0x66 [nvme_core]
nvme_sysfs_delete.cold.96+0x8/0xd [nvme_core]
dev_attr_store+0x14/0x30
sysfs_kf_write+0x38/0x50
kernfs_fop_write_iter+0x146/0x1d0
new_sync_write+0x114/0x1b0
? intel_pmu_handle_irq+0xe0/0x420
vfs_write+0x18d/0x270
ksys_write+0x61/0xe0
__x64_sys_write+0x1a/0x20
do_syscall_64+0x37/0x90
entry_SYSCALL_64_after_hwframe+0x61/0xcb
Fixes: 3f2304f8 ("nvme-tcp: add NVMe over TCP host driver") Reported-by: Jonathan Nicklin <jnicklin@blockbridge.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Tested-by: Jonathan Nicklin <jnicklin@blockbridge.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
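A sketch of the teardown-ordering change in the nvme-tcp .stop_ctrl callback (structure and field names assumed from the tcp transport): force err_work to run so it cancels inflight I/O, instead of cancelling it before it ever executes:

        static void nvme_tcp_stop_ctrl(struct nvme_ctrl *ctrl)
        {
                struct nvme_tcp_ctrl *tcp_ctrl = to_tcp_ctrl(ctrl);

                flush_work(&tcp_ctrl->err_work);        /* was: cancel_work_sync() */
                cancel_delayed_work_sync(&tcp_ctrl->connect_work);
        }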
-
Submitted by Sagi Grimberg
When we delete a controller, we execute the following:
1. nvme_stop_ctrl() - stop some work elements that may be inflight or scheduled (specifically also .stop_ctrl, which cancels ctrl error recovery work)
2. nvme_remove_namespaces() - which first flushes scan_work to avoid competing ns addition/removal
3. continue to tear down the controller
However, if err_work was scheduled to run in (1), it is designed to cancel any inflight I/O, particularly I/O that originates from ns scan_work in (2), but because it is cancelled in .stop_ctrl(), we can prevent forward progress of (2), as ns scanning is blocking on I/O (that will never be cancelled). The race is:
1. transport layer error observed -> err_work is scheduled
2. scan_work executes, discovers ns, generates I/O to it
3. nvme_stop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work) - err_work never executed
4. nvme_remove_namespaces() -> flush_work(scan_work) --> deadlock, because scan_work is blocked on I/O that was supposed to be cancelled by err_work, but was cancelled before executing.
Fix this by flushing err_work instead of cancelling it, to force it to execute and cancel all inflight I/O.
Fixes: b435ecea ("nvme: Add .stop_ctrl to nvme ctrl ops") Fixes: f6c8e432 ("nvme: flush namespace scanning work just before removing namespaces") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 30 September 2022, 8 commits
-
-
Submitted by Kanchan Joshi
If io_uring sends a passthrough command with the IORING_URING_CMD_FIXED flag, use the pre-registered buffer for the IO (non-vectored variant). Pass the buffer/length to io_uring and get the bvec iterator for the range. Next, pass this bvec to the block layer and obtain a bio/request for subsequent processing. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220930062749.152261-13-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Kanchan Joshi
This is a prep patch. Modify nvme_submit_user_cmd and nvme_map_user_request to take ubuffer as a plain integer argument, and do away with the nvme_to_user_ptr conversion in callers. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220930062749.152261-12-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Kanchan Joshi
nvme_alloc_request expects a large number of parameters. Split it into two functions to reduce the number of parameters: the first one retains the name nvme_alloc_request, while the second one is named nvme_map_user_request. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220930062749.152261-8-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Kanchan Joshi
Pass a struct request rather than a bio. It helps kill a parameter and allows some processing cleanup too. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220930062749.152261-7-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Anuj Gupta
Use blk_rq_map_user_io instead of duplicating the same code in different places. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/r/20220930062749.152261-6-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Jens Axboe
Now that the normal passthrough end_io path doesn't need the request anymore, we can kill the explicit blk_mq_free_request() and just pass back RQ_END_IO_FREE instead. This enables the batched completion path to free batches of requests at a time. It brings passthrough IO performance at least on par with bdev-based O_DIRECT with io_uring. With this and batched allocations, peak performance goes from 110M IOPS to 122M IOPS. For IRQ-based completions, passthrough is now also about 10% faster than before, going from ~61M to ~67M IOPS. Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <kbusch@kernel.org> Co-developed-by: Stefan Roesch <shr@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
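A sketch of the end_io side of this change, using the rq_end_io_ret values named above; the handler name and omitted bookkeeping are illustrative:

        static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
                                                        blk_status_t err)
        {
                /* ... stash status/result for the io_uring completion ... */

                /* Hand the request back to blk-mq so it can batch the free,
                 * instead of calling blk_mq_free_request() here. */
                return RQ_END_IO_FREE;  /* was: explicit free + RQ_END_IO_NONE */
        }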
-
Submitted by Jens Axboe
By splitting up the metadata and non-metadata end_io handling, we can remove any request dependencies on the normal non-metadata IO path. This is in preparation for enabling the normal IO passthrough path to pass ownership of the request back to the block layer. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <kbusch@kernel.org> Co-developed-by: Stefan Roesch <shr@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Submitted by Jens Axboe
Everything is just converted to returning RQ_END_IO_NONE, and there should be no functional changes with this patch. This is in preparation for allowing the end_io handler to pass ownership back to the block layer, rather than retaining ownership of the request. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 27 September 2022, 13 commits
-
-
Submitted by Christoph Hellwig
Unused now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-
Submitted by Christoph Hellwig
Use the common helpers to allocate and free the tagsets. To make this work, the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_fc_ctrl. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <jsmart2021@gmail.com>
-
Submitted by Christoph Hellwig
Point the private data to the generic controller structure in preparation for using the common tagset init/exit code, and use the chance to clean up the init_hctx methods a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <jsmart2021@gmail.com>
-
Submitted by Christoph Hellwig
Also update the sqsize field when capping the queue size, and remove the check for a queue size larger than sqsize, given that sqsize is only initialized from opts->queue_size. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: James Smart <jsmart2021@gmail.com>
-
Submitted by Christoph Hellwig
Use the common helpers to allocate and free the tagsets. To make this work, the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_rdma_ctrl. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-
Submitted by Christoph Hellwig
Point the private data to the generic controller structure in preparation for using the common tagset init/exit code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-
Submitted by Christoph Hellwig
Use the common helpers to allocate and free the tagsets. To make this work, the generic nvme_ctrl now needs to be stored in the hctx private data instead of the nvme_tcp_ctrl. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-
Submitted by Christoph Hellwig
Point the private data to the generic controller structure in preparation for using the common tagset init/exit code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-
Submitted by Christoph Hellwig
The ->nvme_tcp_queue member is not used anywhere, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-
Submitted by Christoph Hellwig
Add common helpers to allocate and tear down the admin and I/O tag sets, including the special queues allocated with them. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
-
Submitted by Keith Busch
We've been reporting 2 maps regardless of whether the module parameter asked for anything beyond the default queues. A consequence of this is that blk-mq will reinitialize all the hardware contexts and io schedulers on every controller reset even when the mapping is exactly the same as before. This unnecessary overhead adds several milliseconds to a reset for environments that don't need it. Report the actual number of mappings in use. Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Submitted by Rishabh Bhatnagar
If swiotlb is force enabled, dma_max_mapping_size ends up calling swiotlb_max_mapping_size, which takes into account the min align mask for the device. Set the min align mask for the nvme driver before calling dma_max_mapping_size while calculating max hw sectors. Signed-off-by: Rishabh Bhatnagar <risbhat@amazon.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
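A sketch of where the call lands, with constants assumed from drivers/nvme/host/pci.c; the ordering (set the mask, then query the mapping size) is what matters:

        /* Tell the DMA layer about NVMe's alignment requirement first, so
         * swiotlb_max_mapping_size() accounts for it when swiotlb=force. */
        dma_set_min_align_mask(dev->dev, NVME_CTRL_PAGE_SIZE - 1);

        dev->ctrl.max_hw_sectors = min_t(u32, NVME_MAX_KB_SZ << 1,
                                         dma_max_mapping_size(dev->dev) >> 9);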
-
Submitted by Sagi Grimberg
When a discovery controller is disconnected, no AENs will arrive to notify the host about discovery log change events. In order to solve this, send a uevent notification when a persistent discovery controller reconnects. We add a new ctrl flag, NVME_CTRL_STARTED_ONCE, that will be set on the first start; consecutive calls will find it set and send the event to userspace if the controller is a discovery controller. Upon receiving the event, userspace will re-read the discovery log page and act upon changes as it sees fit. Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: James Smart <jsmart2021@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
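A hedged sketch of the reconnect notification described above; the NVME_CTRL_STARTED_ONCE flag comes from the commit message, while the discovery-controller check and the event string are illustrative assumptions:

        /* In nvme_start_ctrl(): on any start after the first one, tell userspace
         * that a (persistent) discovery controller is back so it re-reads the
         * discovery log page. */
        if (test_and_set_bit(NVME_CTRL_STARTED_ONCE, &ctrl->flags) &&
            nvme_discovery_ctrl(ctrl)) {
                char *envp[2] = { "NVME_EVENT=rediscover", NULL }; /* string illustrative */

                kobject_uevent_env(&ctrl->device->kobj, KOBJ_CHANGE, envp);
        }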
-