提交 · 1b96f862ecccb3e6f950eba584bebf22955cecc5 · openeuler / Kernel

15 11月, 2022 4 次提交

nvme: implement the DEAC bit for the Write Zeroes command · 1b96f862

由 Christoph Hellwig 提交于 10月 30, 2022

While the specification allows devices to either deallocate data
or to actually write zeroes on any Write Zeroes command, many SSDs
only do the sensible thing and deallocate data when the DEAC bit
is specific.  Set it when it is supported and the caller doesn't
explicitly opt out of deallocation.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>

1b96f862

nvme: identify-namespace without CAP_SYS_ADMIN · e4fbcf32

由 Kanchan Joshi 提交于 11月 01, 2022

Allow all identify-namespace variants (CNS 00h, 05h and 08h) without
requiring CAP_SYS_ADMIN. The information (retrieved using id-ns) is
needed to form IO commands for passthrough interface.
Signed-off-by: NKanchan Joshi <joshi.k@samsung.com>
Reviewed-by: NJens Axboe <axboe@kernel.dk>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

e4fbcf32

nvme: fine-granular CAP_SYS_ADMIN for nvme io commands · 855b7717

由 Kanchan Joshi 提交于 10月 31, 2022

Currently both io and admin commands are kept under a
coarse-granular CAP_SYS_ADMIN check, disregarding file mode completely.

$ ls -l /dev/ng*
crw-rw-rw- 1 root root 242, 0 Sep  9 19:20 /dev/ng0n1
crw------- 1 root root 242, 1 Sep  9 19:20 /dev/ng0n2

In the example above, ng0n1 appears as if it may allow unprivileged
read/write operation but it does not and behaves same as ng0n2.

This patch implements a shift from CAP_SYS_ADMIN to more fine-granular
control for io-commands.
If CAP_SYS_ADMIN is present, nothing else is checked as before.
Otherwise, following rules are in place
- any admin-cmd is not allowed
- vendor-specific and fabric commmand are not allowed
- io-commands that can write are allowed if matching FMODE_WRITE
permission is present
- io-commands that read are allowed

Add a helper nvme_cmd_allowed that implements above policy.
Change all the callers of CAP_SYS_ADMIN to go through nvme_cmd_allowed
for any decision making.
Since file open mode is counted for any approval/denial, change at
various places to keep file-mode information handy.
Signed-off-by: NKanchan Joshi <joshi.k@samsung.com>
Reviewed-by: NJens Axboe <axboe@kernel.dk>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

855b7717

nvme-fc: improve memory usage in nvme_fc_rcv_ls_req() · cf3d0084

由 Christophe JAILLET 提交于 10月 02, 2022

sizeof( struct nvmefc_ls_rcv_op ) = 64
sizeof( union nvmefc_ls_requests ) = 1024
sizeof( union nvmefc_ls_responses ) = 128

So, in nvme_fc_rcv_ls_req(), 1216 bytes of memory are requested when
kzalloc() is called.

Because of the way memory allocations are performed, 2048 bytes are
allocated. So about 800 bytes are wasted for each request.

Switch to 3 distinct memory allocations, in order to:
   - save these 800 bytes
   - avoid zeroing this extra memory
   - make sure that memory is properly aligned in case of DMA access
    ("fc_dma_map_single(lsop->rspbuf)" just a few lines below)
Signed-off-by: NChristophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: NJames Smart <jsmart2021@gmail.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

cf3d0084

02 11月, 2022 11 次提交

nvme: use blk_mq_[un]quiesce_tagset · 98d81f0d

由 Chao Leng 提交于 11月 01, 2022

All controller namespaces share the same tagset, so we can use this
interface which does the optimal operation for parallel quiesce based on
the tagset type(e.g. blocking tagsets and non-blocking tagsets).

nvme connect_q should not be quiesced when quiesce tagset, so set the
QUEUE_FLAG_SKIP_TAGSET_QUIESCE to skip it when init connect_q.

Currently we use NVME_NS_STOPPED to ensure pairing quiescing and
unquiescing. If use blk_mq_[un]quiesce_tagset, NVME_NS_STOPPED will be
invalided, so introduce NVME_CTRL_STOPPED to replace NVME_NS_STOPPED.
In addition, we never really quiesce a single namespace. It is a better
choice to move the flag from ns to ctrl.
Signed-off-by: NChao Leng <lengchao@huawei.com>
[hch: rebased on top of prep patches]
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NChao Leng <lengchao@huawei.com>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-15-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

98d81f0d

blk-mq: pass a tagset to blk_mq_wait_quiesce_done · 483239c7

由 Christoph Hellwig 提交于 11月 01, 2022

Nothing in blk_mq_wait_quiesce_done needs the request_queue now, so just
pass the tagset, and move the non-mq check into the only caller that
needs it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChao Leng <lengchao@huawei.com>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-13-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

483239c7

nvme-apple: don't unquiesce the I/O queues in apple_nvme_reset_work · 2b4c2355

由 Christoph Hellwig 提交于 11月 01, 2022

apple_nvme_reset_work schedules apple_nvme_remove, to be called, which
will call apple_nvme_disable and unquiesce the I/O queues.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-10-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

2b4c2355

nvme-pci: don't unquiesce the I/O queues in nvme_remove_dead_ctrl · bad3e021

由 Christoph Hellwig 提交于 11月 01, 2022

nvme_remove_dead_ctrl schedules nvme_remove to be called, which will
call nvme_dev_disable and unquiesce the I/O queues.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-9-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

bad3e021

nvme: split nvme_kill_queues · cd50f9b2

由 Christoph Hellwig 提交于 11月 01, 2022

nvme_kill_queues does two things:

 1) mark the gendisk of all namespaces dead
 2) unquiesce all I/O queues

These used to be be intertwined due to block layer issues, but aren't
any more.  So move the unquiscing of the I/O queues into the callers,
and rename the rest of the function to the now more descriptive
nvme_mark_namespaces_dead.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-8-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

cd50f9b2

nvme: don't unquiesce the admin queue in nvme_kill_queues · 6bcd5089

由 Christoph Hellwig 提交于 11月 01, 2022

None of the callers of nvme_kill_queues needs it to unquiesce the
admin queues, as all of them already do it themselves:

 1) nvme_reset_work explicit call nvme_start_admin_queue toward the
    beginning of the function.  The extra call to nvme_start_admin_queue
    in nvme_reset_work this won't do anything as
    NVME_CTRL_ADMIN_Q_STOPPED will already be cleared.
 2) nvme_remove calls nvme_dev_disable with shutdown flag set to true at
    the very beginning of the function if the PCIe device was not present,
    which is the precondition for the call to nvme_kill_queues.
    nvme_dev_disable already calls nvme_start_admin_queue toward the
    end of the function when the shutdown flag is set to true, so the
    admin queue is already enabled at this point.
 3) nvme_remove_dead_ctrl schedules a workqueue to unbind the driver,
    which will end up in nvme_remove, which calls nvme_dev_disable with
    the shutdown flag.  This case will call nvme_start_admin_queue a bit
    later than before.
 4) apple_nvme_remove uses the same sequence as nvme_remove_dead_ctrl
    above.
 5) nvme_remove_namespaces only calls nvme_kill_queues when the
    controller is in the DEAD state.  That can only happen in the PCIe
    driver, and only from nvme_remove. See item 2) above for the
    conditions there.

So it is safe to just remove the call to nvme_start_admin_queue in
nvme_kill_queues without replacement.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-7-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

6bcd5089

nvme: remove the NVME_NS_DEAD check in nvme_validate_ns · fde776af

由 Christoph Hellwig 提交于 11月 01, 2022

At the point where namespaces are marked dead, the controller is in a
non-live state and we won't get pass the identify commands.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-6-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

fde776af

nvme: remove the NVME_NS_DEAD check in nvme_remove_invalid_namespaces · 4f17344e

由 Christoph Hellwig 提交于 11月 01, 2022

The NVME_NS_DEAD check only made sense when we revalidated namespaces
in nvme_passthrough_end for commands that affected the namespace inventory.
These days NVME_NS_DEAD is only set during reset or when tearing down
namespaces, and we always remove all namespaces right after that.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-5-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

4f17344e

nvme: don't remove namespaces in nvme_passthru_end · 23a90864

由 Christoph Hellwig 提交于 11月 01, 2022

The call to nvme_remove_invalid_namespaces made sense when
nvme_passthru_end revalidated all namespaces and had to remove those that
didn't exist any more. Since we don't revalidate from nvme_passthru_end
now, this call is entirely spurious.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-4-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

23a90864

nvme-pci: refactor the tagset handling in nvme_reset_work · 0ffc7e98

由 Christoph Hellwig 提交于 11月 01, 2022

The code to create, update or delete a tagset and namespaces in
nvme_reset_work is a bit convoluted.  Refactor it with a two high-level
conditionals for first probe vs reset and I/O queues vs no I/O queues
to make the code flow more clear.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-3-hch@lst.de
[axboe: fix whitespace issue]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0ffc7e98

block: set the disk capacity to 0 in blk_mark_disk_dead · 71b26083

由 Christoph Hellwig 提交于 11月 01, 2022

nvme and xen-blkfront are already doing this to stop buffered writes from
creating dirty pages that can't be written out later.  Move it to the
common code.

This also removes the comment about the ordering from nvme, as bd_mutex
not only is gone entirely, but also hasn't been used for locking updates
to the disk size long before that, and thus the ordering requirement
documented there doesn't apply any more.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NChao Leng <lengchao@huawei.com>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-2-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

71b26083

25 10月, 2022 3 次提交

nvme-apple: remove an extra queue reference · 941f7298

由 Christoph Hellwig 提交于 10月 18, 2022

Now that blk_mq_destroy_queue does not release the queue reference, there
is no need for a second admin queue reference to be held by the
apple_nvme structure.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NSven Peter <sven@svenpeter.dev>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-5-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

941f7298

nvme-pci: remove an extra queue reference · 7dcebef9

由 Christoph Hellwig 提交于 10月 18, 2022

Now that blk_mq_destroy_queue does not release the queue reference, there
is no need for a second admin queue reference to be held by the nvme_dev.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-4-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

7dcebef9

blk-mq: move the call to blk_put_queue out of blk_mq_destroy_queue · 2b3f056f

由 Christoph Hellwig 提交于 10月 18, 2022

The fact that blk_mq_destroy_queue also drops a queue reference leads
to various places having to grab an extra reference.  Move the call to
blk_put_queue into the callers to allow removing the extra references.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-2-hch@lst.de
[axboe: fix fabrics_q vs admin_q conflict in nvme core.c]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2b3f056f

19 10月, 2022 5 次提交

nvme-hwmon: kmalloc the NVME SMART log buffer · c94b7f9b

由 Serge Semin 提交于 10月 18, 2022

Recent commit 52fde2c0 ("nvme: set dma alignment to dword") has
caused a regression on our platform.

It turned out that the nvme_get_log() method invocation caused the
nvme_hwmon_data structure instance corruption. In particular the
nvme_hwmon_data.ctrl pointer was overwritten either with zeros or with
garbage. After some research we discovered that the problem happened
even before the actual NVME DMA execution, but during the buffer mapping.
Since our platform is DMA-noncoherent, the mapping implied the cache-line
invalidations or write-backs depending on the DMA-direction parameter.
In case of the NVME SMART log getting the DMA was performed
from-device-to-memory, thus the cache-invalidation was activated during
the buffer mapping. Since the log-buffer isn't cache-line aligned, the
cache-invalidation caused the neighbour data to be discarded. The
neighbouring data turned to be the data surrounding the buffer in the
framework of the nvme_hwmon_data structure.

In order to fix that we need to make sure that the whole log-buffer is
defined within the cache-line-aligned memory region so the
cache-invalidation procedure wouldn't involve the adjacent data. One of
the option to guarantee that is to kmalloc the DMA-buffer [1]. Seeing the
rest of the NVME core driver prefer that method it has been chosen to fix
this problem too.

Note after a deeper researches we found out that the denoted commit wasn't
a root cause of the problem. It just revealed the invalidity by activating
the DMA-based NVME SMART log getting performed in the framework of the
NVME hwmon driver. The problem was here since the initial commit of the
driver.

[1] Documentation/core-api/dma-api-howto.rst

Fixes: 400b6a7b ("nvme: Add hardware monitoring support")
Signed-off-by: NSerge Semin <Sergey.Semin@baikalelectronics.ru>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

c94b7f9b

nvme-hwmon: consistently ignore errors from nvme_hwmon_init · 6b8cf940

由 Christoph Hellwig 提交于 10月 18, 2022

An NVMe controller works perfectly fine even when the hwmon
initialization fails.  Stop returning errors that do not come from a
controller reset from nvme_hwmon_init to handle this case consistently.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NGuenter Roeck <linux@roeck-us.net>
Reviewed-by: NSerge Semin <fancer.lancer@gmail.com>

6b8cf940

nvme-apple: don't limit DMA segement size · d622f847

由 Russell King (Oracle) 提交于 10月 12, 2022

NVMe uses PRPs for data transfers and has no specific limit for a single
DMA segement.  Limiting the size will cause problems because the block
layer assumes PRP-ish devices using a virt boundary mask don't have a
segment limit.  And while this is true, we also really need to tell the
DMA mapping layer about it, otherwise dma-debug will trip over it.

Fixes: 5bd2927a ("nvme-apple: Add initial Apple SoC NVMe driver")
Suggested-by: NSven Peter <sven@svenpeter.dev>
Signed-off-by: NRussell King (Oracle) <rmk+kernel@armlinux.org.uk>
[hch: rewrote the commit message based on the PCIe commit]
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NEric Curtin <ecurtin@redhat.com>
Reviewed-by: NSven Peter <sven@svenpeter.dev>

d622f847

nvme-pci: disable write zeroes on various Kingston SSD · ac9b57d4

由 Xander Li 提交于 10月 11, 2022

Kingston SSDs do support NVMe Write_Zeroes cmd but take long time to
process.  The firmware version is locked by these SSDs, we can not expect
firmware improvement, so disable Write_Zeroes cmd.
Signed-off-by: NXander Li <xander_li@kingston.com.tw>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

ac9b57d4

nvme: fix error pointer dereference in error handling · 4739824e

由 Dan Carpenter 提交于 10月 15, 2022

There is typo here so it releases the wrong variable.  "ctrl->admin_q"
was intended instead of "ctrl->fabrics_q".

Fixes: fe60e8c5 ("nvme: add common helpers to allocate and free tagsets")
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

4739824e

12 10月, 2022 5 次提交

nvme-multipath: fix possible hang in live ns resize with ANA access · 72e3b888

由 Sagi Grimberg 提交于 9月 29, 2022

When we revalidate paths as part of ns size change (as of commit
e7d65803), it is possible that during the path revalidation, the
only paths that is IO capable (i.e. optimized/non-optimized) are the
ones that ns resize was not yet informed to the host, which will cause
inflight requests to be requeued (as we have available paths but none
are IO capable). These requests on the requeue list are waiting for
someone to resubmit them at some point.

The IO capable paths will eventually notify the ns resize change to the
host, but there is nothing that will kick the requeue list to resubmit
the queued requests.

Fix this by always kicking the requeue list, and if no IO capable path
exists, these requests will be queued again.

A typical log that indicates that IOs are requeued:
--
nvme nvme1: creating 4 I/O queues.
nvme nvme1: new ctrl: "testnqn1"
nvme nvme2: creating 4 I/O queues.
nvme nvme2: mapped 4/0/0 default/read/poll queues.
nvme nvme2: new ctrl: NQN "testnqn1", addr 127.0.0.1:8009
nvme nvme1: rescanning namespaces.
nvme1n1: detected capacity change from 2097152 to 4194304
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
block nvme1n1: no usable path - requeuing I/O
nvme nvme2: rescanning namespaces.
--
Reported-by: NYogev Cohen <yogev@lightbitslabs.com>
Fixes: e7d65803 ("nvme-multipath: revalidate paths during rescan")
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Cc: <stable@vger.kernel.org> # v5.15+
Signed-off-by: NChristoph Hellwig <hch@lst.de>

72e3b888

nvme-pci: avoid the deepest sleep state on ZHITAI TiPro5000 SSDs · d5d3c100

由 Xi Ruoyao 提交于 9月 28, 2022

ZHITAI TiPro5000 SSDs has the same APST sleep problem as its cousin,
TiPro7000.  The quirk for TiPro7000 has been added in
commit 6b961bce ("nvme-pci: avoid the deepest sleep state on
ZHITAI TiPro7000 SSDs"), use the same quirk for TiPro5000.

The ASPT data from "nvme id-ctrl /dev/nvme1":

vid       : 0x1e49
ssvid     : 0x1e49
sn        : ZTA21T0KA2227304LM
mn        : ZHITAI TiPlus5000 1TB
fr        : ZTA09139
[...]
ps    0 : mp:6.50W operational enlat:0 exlat:0 rrt:0 rrl:0
         rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:5.80W operational enlat:0 exlat:0 rrt:1 rrl:1
         rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.60W operational enlat:0 exlat:0 rrt:2 rrl:2
         rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0500W non-operational enlat:5000 exlat:10000 rrt:3 rrl:3
         rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0025W non-operational enlat:8000 exlat:45000 rrt:4 rrl:4
         rwt:4 rwl:4 idle_power:- active_power:-
Reported-and-tested-by: NChang Feng <flukehn@gmail.com>
Signed-off-by: NXi Ruoyao <xry111@xry111.site>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

d5d3c100

nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM760 · 80b26240

由 Abhijit 提交于 10月 10, 2022

Add a quirk to fix Lexar NM760 SSD drives reporting duplicate nsids.
Signed-off-by: NAbhijit <abhijit@abhijittomar.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

80b26240

nvme-tcp: fix possible hang caused during ctrl deletion · c4abd875

由 Sagi Grimberg 提交于 9月 28, 2022

When we delete a controller, we execute the following:
1. nvme_stop_ctrl() - stop some work elements that may be
	inflight or scheduled (specifically also .stop_ctrl
	which cancels ctrl error recovery work)
2. nvme_remove_namespaces() - which first flushes scan_work
	to avoid competing ns addition/removal
3. continue to teardown the controller

However, if err_work was scheduled to run in (1), it is designed to
cancel any inflight I/O, particularly I/O that is originating from ns
scan_work in (2), but because it is cancelled in .stop_ctrl(), we can
prevent forward progress of (2) as ns scanning is blocking on I/O
(that will never be cancelled).

The race is:
1. transport layer error observed -> err_work is scheduled
2. scan_work executes, discovers ns, generate I/O to it
3. nvme_ctop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work)
   - err_work never executed
4. nvme_remove_namespaces() -> flush_work(scan_work)
--> deadlock, because scan_work is blocked on I/O that was supposed
to be cancelled by err_work, but was cancelled before executing (see
stack trace [1]).

Fix this by flushing err_work instead of cancelling it, to force it
to execute and cancel all inflight I/O.

[1]:
--
Call Trace:
 <TASK>
 __schedule+0x390/0x910
 ? scan_shadow_nodes+0x40/0x40
 schedule+0x55/0xe0
 io_schedule+0x16/0x40
 do_read_cache_page+0x55d/0x850
 ? __page_cache_alloc+0x90/0x90
 read_cache_page+0x12/0x20
 read_part_sector+0x3f/0x110
 amiga_partition+0x3d/0x3e0
 ? osf_partition+0x33/0x220
 ? put_partition+0x90/0x90
 bdev_disk_changed+0x1fe/0x4d0
 blkdev_get_whole+0x7b/0x90
 blkdev_get_by_dev+0xda/0x2d0
 device_add_disk+0x356/0x3b0
 nvme_mpath_set_live+0x13c/0x1a0 [nvme_core]
 ? nvme_parse_ana_log+0xae/0x1a0 [nvme_core]
 nvme_update_ns_ana_state+0x3a/0x40 [nvme_core]
 nvme_mpath_add_disk+0x120/0x160 [nvme_core]
 nvme_alloc_ns+0x594/0xa00 [nvme_core]
 nvme_validate_or_alloc_ns+0xb9/0x1a0 [nvme_core]
 ? __nvme_submit_sync_cmd+0x1d2/0x210 [nvme_core]
 nvme_scan_work+0x281/0x410 [nvme_core]
 process_one_work+0x1be/0x380
 worker_thread+0x37/0x3b0
 ? process_one_work+0x380/0x380
 kthread+0x12d/0x150
 ? set_kthread_struct+0x50/0x50
 ret_from_fork+0x1f/0x30
 </TASK>
INFO: task nvme:6725 blocked for more than 491 seconds.
      Not tainted 5.15.65-f0.el7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:nvme            state:D
 stack:    0 pid: 6725 ppid:  1761 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x390/0x910
 ? sched_clock+0x9/0x10
 schedule+0x55/0xe0
 schedule_timeout+0x24b/0x2e0
 ? try_to_wake_up+0x358/0x510
 ? finish_task_switch+0x88/0x2c0
 wait_for_completion+0xa5/0x110
 __flush_work+0x144/0x210
 ? worker_attach_to_pool+0xc0/0xc0
 flush_work+0x10/0x20
 nvme_remove_namespaces+0x41/0xf0 [nvme_core]
 nvme_do_delete_ctrl+0x47/0x66 [nvme_core]
 nvme_sysfs_delete.cold.96+0x8/0xd [nvme_core]
 dev_attr_store+0x14/0x30
 sysfs_kf_write+0x38/0x50
 kernfs_fop_write_iter+0x146/0x1d0
 new_sync_write+0x114/0x1b0
 ? intel_pmu_handle_irq+0xe0/0x420
 vfs_write+0x18d/0x270
 ksys_write+0x61/0xe0
 __x64_sys_write+0x1a/0x20
 do_syscall_64+0x37/0x90
 entry_SYSCALL_64_after_hwframe+0x61/0xcb
--

Fixes: 3f2304f8 ("nvme-tcp: add NVMe over TCP host driver")
Reported-by: NJonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Tested-by: NJonathan Nicklin <jnicklin@blockbridge.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

c4abd875

nvme-rdma: fix possible hang caused during ctrl deletion · a1ae8d4d

由 Sagi Grimberg 提交于 9月 28, 2022

When we delete a controller, we execute the following:
1. nvme_stop_ctrl() - stop some work elements that may be
        inflight or scheduled (specifically also .stop_ctrl
        which cancels ctrl error recovery work)
2. nvme_remove_namespaces() - which first flushes scan_work
        to avoid competing ns addition/removal
3. continue to teardown the controller

However, if err_work was scheduled to run in (1), it is designed to
cancel any inflight I/O, particularly I/O that is originating from ns
scan_work in (2), but because it is cancelled in .stop_ctrl(), we can
prevent forward progress of (2) as ns scanning is blocking on I/O
(that will never be cancelled).

The race is:
1. transport layer error observed -> err_work is scheduled
2. scan_work executes, discovers ns, generate I/O to it
3. nvme_ctop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work)
   - err_work never executed
4. nvme_remove_namespaces() -> flush_work(scan_work)
--> deadlock, because scan_work is blocked on I/O that was supposed
to be cancelled by err_work, but was cancelled before executing.

Fix this by flushing err_work instead of cancelling it, to force it
to execute and cancel all inflight I/O.

Fixes: b435ecea ("nvme: Add .stop_ctrl to nvme ctrl ops")
Fixes: f6c8e432 ("nvme: flush namespace scanning work just before removing namespaces")
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

a1ae8d4d

30 9月, 2022 8 次提交

nvme: wire up fixed buffer support for nvme passthrough · 23fd22e5

由 Kanchan Joshi 提交于 9月 30, 2022

if io_uring sends passthrough command with IORING_URING_CMD_FIXED flag,
use the pre-registered buffer for IO (non-vectored variant). Pass the
buffer/length to io_uring and get the bvec iterator for the range. Next,
pass this bvec to block-layer and obtain a bio/request for subsequent
processing.
Signed-off-by: NKanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20220930062749.152261-13-anuj20.g@samsung.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

23fd22e5

nvme: pass ubuffer as an integer · 4d174486

由 Kanchan Joshi 提交于 9月 30, 2022

This is a prep patch. Modify nvme_submit_user_cmd and
nvme_map_user_request to take ubuffer as plain integer
argument, and do away with nvme_to_user_ptr conversion in callers.
Signed-off-by: NAnuj Gupta <anuj20.g@samsung.com>
Signed-off-by: NKanchan Joshi <joshi.k@samsung.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220930062749.152261-12-anuj20.g@samsung.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

4d174486

nvme: refactor nvme_alloc_request · 470e900c

由 Kanchan Joshi 提交于 9月 30, 2022

nvme_alloc_request expects a large number of parameters.
Split this out into two functions to reduce number of parameters.
First one retains the name nvme_alloc_request, while second one is
named nvme_map_user_request.
Signed-off-by: NKanchan Joshi <joshi.k@samsung.com>
Signed-off-by: NAnuj Gupta <anuj20.g@samsung.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220930062749.152261-8-anuj20.g@samsung.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

470e900c

nvme: refactor nvme_add_user_metadata · 38c0ddab

由 Kanchan Joshi 提交于 9月 30, 2022

Pass struct request rather than bio. It helps to kill a parameter, and
some processing clean-up too.
Signed-off-by: NKanchan Joshi <joshi.k@samsung.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220930062749.152261-7-anuj20.g@samsung.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

38c0ddab

nvme: Use blk_rq_map_user_io helper · 7f056357

由 Anuj Gupta 提交于 9月 30, 2022

User blk_rq_map_user_io instead of duplicating the same code at
different places
Signed-off-by: NAnuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/20220930062749.152261-6-anuj20.g@samsung.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

7f056357

nvme: enable batched completions of passthrough IO · 851eb780

由 Jens Axboe 提交于 9月 22, 2022

Now that the normal passthrough end_io path doesn't need the request
anymore, we can kill the explicit blk_mq_free_request() and just pass
back RQ_END_IO_FREE instead. This enables the batched completion from
freeing batches of requests at the time.

This brings passthrough IO performance at least on par with bdev based
O_DIRECT with io_uring. With this and batche allocations, peak performance
goes from 110M IOPS to 122M IOPS. For IRQ based, passthrough is now also
about 10% faster than previously, going from ~61M to ~67M IOPS.
Reviewed-by: NAnuj Gupta <anuj20.g@samsung.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Co-developed-by: NStefan Roesch <shr@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

851eb780

nvme: split out metadata vs non metadata end_io uring_cmd completions · c0a7ba77

由 Jens Axboe 提交于 9月 21, 2022

By splitting up the metadata and non-metadata end_io handling, we can
remove any request dependencies on the normal non-metadata IO path. This
is in preparation for enabling the normal IO passthrough path to pass
the ownership of the request back to the block layer.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NAnuj Gupta <anuj20.g@samsung.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Co-developed-by: NStefan Roesch <shr@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c0a7ba77

block: change request end_io handler to pass back a return value · de671d61

由 Jens Axboe 提交于 9月 21, 2022

Everything is just converted to returning RQ_END_IO_NONE, and there
should be no functional changes with this patch.

In preparation for allowing the end_io handler to pass ownership back
to the block layer, rather than retain ownership of the request.
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

de671d61

27 9月, 2022 4 次提交

nvme: remove nvme_ctrl_init_connect_q · fe6f04c0

由 Christoph Hellwig 提交于 9月 20, 2022

Unused now.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>

fe6f04c0

nvme-fc: use the tagset alloc/free helpers · 6dfba1c0

由 Christoph Hellwig 提交于 9月 20, 2022

Use the common helpers to allocate and free the tagsets.  To make this
work the generic nvme_ctrl now needs to be stored in the hctx private
data instead of the nvme_fc_ctrl.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NJames Smart <jsmart2021@gmail.com>

6dfba1c0

nvme-fc: store the generic nvme_ctrl in set->driver_data · 1864ea46

由 Christoph Hellwig 提交于 9月 20, 2022

Point the private data to the generic controller structure in preparation
of using the common tagset init/exit code and use the chance the cleanup
the init_hctx methods a bit.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NJames Smart <jsmart2021@gmail.com>

1864ea46

nvme-fc: keep ctrl->sqsize in sync with opts->queue_size · 18ecd975

由 Christoph Hellwig 提交于 9月 20, 2022

Also update the sqsize field when capping the queue size, and remove the
check a queue size that is larger than sqsize given that sqsize is only
initialized from opts->queue_size.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NJames Smart <jsmart2021@gmail.com>

18ecd975

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功