1. 23 Feb 2022, 1 commit
  2. 17 Feb 2022, 1 commit
  3. 09 Feb 2022, 2 commits
  4. 03 Feb 2022, 1 commit
    • nvme-fabrics: fix state check in nvmf_ctlr_matches_baseopts() · 6a51abde
      Authored by Uday Shankar
      Controller deletion or reset, immediately followed by or concurrent with
      a reconnect, hard-fails the connect attempt, resulting in a complete
      loss of connectivity to the controller.
      
      In the connect request, fabrics looks for an existing controller with
      the same address components and aborts the connect if a controller
      already exists and the duplicate connect option isn't set. The match
      routine filters out controllers that are dead or dying, so they don't
      interfere with the new connect request.
      
      When NVME_CTRL_DELETING_NOIO was added, it missed updating the state
      filters in the nvmf_ctlr_matches_baseopts() routine. Thus, when in this
      new state, it's seen as a live controller and fails the connect request.
      
      Correct this by adding the DELETING_NOIO state to the match checks.
      
      Fixes: ecca390e ("nvme: fix deadlock in disconnect during scan_work and/or ana_work")
      Cc: <stable@vger.kernel.org> # v5.7+
      Signed-off-by: Uday Shankar <ushankar@purestorage.com>
      Reviewed-by: James Smart <jsmart2021@gmail.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
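      The effect of the missing state filter can be sketched in a small,
      self-contained userspace model. This is not the kernel code: the enum
      and the two helper functions below are hypothetical simplifications of
      the kernel's controller states and of the filter inside
      nvmf_ctlr_matches_baseopts(), used only to illustrate why a controller
      in DELETING_NOIO blocked the reconnect before the fix.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Hypothetical, simplified controller states; names mirror the
       * kernel's enum nvme_ctrl_state but this is only a model. */
      enum ctrl_state {
      	CTRL_LIVE,
      	CTRL_RESETTING,
      	CTRL_DELETING,
      	CTRL_DELETING_NOIO,
      	CTRL_DEAD,
      };

      /* Before the fix: only DELETING and DEAD were filtered out of the
       * duplicate-address match, so a controller in DELETING_NOIO was
       * treated as live and failed the new connect request. */
      static bool ctrl_is_dying_before_fix(enum ctrl_state s)
      {
      	return s == CTRL_DELETING || s == CTRL_DEAD;
      }

      /* After the fix: DELETING_NOIO is filtered too, so the reconnect no
       * longer sees it as a duplicate live controller. */
      static bool ctrl_is_dying_after_fix(enum ctrl_state s)
      {
      	return s == CTRL_DELETING || s == CTRL_DELETING_NOIO ||
      	       s == CTRL_DEAD;
      }

      int main(void)
      {
      	assert(!ctrl_is_dying_before_fix(CTRL_DELETING_NOIO)); /* the bug */
      	assert(ctrl_is_dying_after_fix(CTRL_DELETING_NOIO));   /* the fix */
      	assert(!ctrl_is_dying_after_fix(CTRL_LIVE));
      	return 0;
      }
      ```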
  5. 02 Feb 2022, 3 commits
    • nvme-rdma: fix possible use-after-free in transport error_recovery work · b6bb1722
      Authored by Sagi Grimberg
      nvme_rdma_submit_async_event_work checks the ctrl and queue state
      before preparing the AER command and scheduling io_work, but that check
      alone cannot close the race. The error recovery work must also flush
      async_event_work after setting the ctrl state to RESETTING and before
      destroying the admin queue, so that .submit_async_event cannot race
      with the error recovery handler itself changing the ctrl state.
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    • nvme-tcp: fix possible use-after-free in transport error_recovery work · ff9fc7eb
      Authored by Sagi Grimberg
      nvme_tcp_submit_async_event_work checks the ctrl and queue state
      before preparing the AER command and scheduling io_work, but that check
      alone cannot close the race. The error recovery work must also flush
      async_event_work after setting the ctrl state to RESETTING and before
      destroying the admin queue, so that .submit_async_event cannot race
      with the error recovery handler itself changing the ctrl state.
      Tested-by: Chris Leech <cleech@redhat.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
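      The ordering that both the rdma and tcp fixes enforce can be modeled in
      a few lines of single-threaded C. This is a hypothetical sketch, not
      the kernel code: the flags and function names below stand in for the
      queue lifetime, the queued async_event_work, and the flush that the
      error recovery handler now performs before tearing down the admin
      queue.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Hypothetical single-threaded model of the race and the fix. */
      static bool queue_alive;
      static bool aer_pending;
      static bool touched_freed_queue;

      static void submit_async_event(void)
      {
      	if (!queue_alive)
      		touched_freed_queue = true;	/* the use-after-free */
      	aer_pending = false;
      }

      /* Stand-in for flush_work(&ctrl->async_event_work): run any queued
       * AER submission to completion now. */
      static void flush_async_event_work(void)
      {
      	if (aer_pending)
      		submit_async_event();
      }

      static void error_recovery(bool with_fix)
      {
      	/* ctrl state is set to RESETTING first (elided here) */
      	if (with_fix)
      		flush_async_event_work();	/* the fix: flush first */
      	queue_alive = false;			/* destroy admin queue */
      	if (!with_fix)
      		flush_async_event_work();	/* pending work runs too late */
      }

      int main(void)
      {
      	queue_alive = true; aer_pending = true; touched_freed_queue = false;
      	error_recovery(false);
      	assert(touched_freed_queue);	/* race reproduced without the fix */

      	queue_alive = true; aer_pending = true; touched_freed_queue = false;
      	error_recovery(true);
      	assert(!touched_freed_queue);	/* flushing first closes the race */
      	return 0;
      }
      ```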
    • nvme: fix a possible use-after-free in controller reset during load · 0fa0f99f
      Authored by Sagi Grimberg
      Unlike .queue_rq, in .submit_async_event drivers may not check the ctrl
      readiness for AER submission. This may lead to a use-after-free
      condition that was observed with nvme-tcp.
      
      The race condition may happen in the following scenario:
      1. driver executes its reset_ctrl_work
      2. -> nvme_stop_ctrl - flushes ctrl async_event_work
      3. ctrl sends AEN which is received by the host, which in turn
         schedules AEN handling
      4. teardown admin queue (which releases the queue socket)
      5. AEN processed, submits another AER, calling the driver to submit
      6. driver attempts to send the cmd
      ==> use-after-free
      
      To fix this, add a ctrl state check to validate that the ctrl is
      actually able to accept the AER submission.
      
      This addresses the above race in controller resets because the driver
      during teardown should:
      1. change ctrl state to RESETTING
      2. flush async_event_work (as well as other async work elements)
      
      After steps 1 and 2, any subsequent AER command will find the ctrl
      state to be RESETTING and bail out without submitting the AER.
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
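      The added check can be sketched as a tiny model. The enum and the
      submit_aer() helper below are hypothetical simplifications, not the
      kernel's nvme_submit_async_event; they only illustrate the bail-out
      path once the ctrl state has left LIVE.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Hypothetical subset of the controller states relevant here. */
      enum ctrl_state { CTRL_LIVE, CTRL_RESETTING };

      /* Model of the added check: only submit the AER when the controller
       * is LIVE; in RESETTING (set before async_event_work is flushed)
       * the submission bails out instead of touching a dying queue. */
      static bool submit_aer(enum ctrl_state state)
      {
      	if (state != CTRL_LIVE)
      		return false;	/* bail out without submitting the AER */
      	/* ... build and send the AER command (elided) ... */
      	return true;
      }

      int main(void)
      {
      	assert(submit_aer(CTRL_LIVE));
      	assert(!submit_aer(CTRL_RESETTING));	/* after steps 1 and 2 */
      	return 0;
      }
      ```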
  6. 27 Jan 2022, 2 commits
  7. 06 Jan 2022, 1 commit
  8. 23 Dec 2021, 4 commits
  9. 17 Dec 2021, 3 commits
  10. 08 Dec 2021, 2 commits
  11. 06 Dec 2021, 3 commits
  12. 29 Nov 2021, 2 commits
  13. 24 Nov 2021, 5 commits
  14. 09 Nov 2021, 1 commit
  15. 27 Oct 2021, 3 commits
  16. 26 Oct 2021, 1 commit
    • nvme-tcp: fix H2CData PDU send accounting (again) · 25e1f67e
      Authored by Sagi Grimberg
      We should not access request members after the last send, even to
      determine whether it was indeed the last data payload send. The reason
      is that a completion could have arrived and triggered a new execution
      of the request, which overwrote these members. This was fixed by commit
      825619b0 ("nvme-tcp: fix possible use-after-completion").
      
      Commit e371af03 broke that assumption again to address cases where
      multiple R2T PDUs are sent per request. To fix it, record the
      request's data_sent and data_len before the payload network send, and
      afterwards reference these recorded counters to determine whether to
      advance the request iterator.
      
      Fixes: e371af03 ("nvme-tcp: fix incorrect h2cdata pdu offset accounting")
      Reported-by: Keith Busch <kbusch@kernel.org>
      Cc: stable@vger.kernel.org # 5.10+
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
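      The snapshot-before-send pattern can be illustrated with a small
      hypothetical model. The struct and send_payload() below are not the
      kernel's nvme_tcp code; they only show the idea of copying data_sent
      and data_len into locals before the send and deciding from the locals,
      since a completion may re-run the request and overwrite its members.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical, simplified request accounting. */
      struct req_model {
      	size_t data_sent;
      	size_t data_len;
      };

      static bool send_payload(struct req_model *req, size_t sent_now)
      {
      	/* record the counters before the network send */
      	size_t data_sent = req->data_sent + sent_now;
      	size_t data_len = req->data_len;

      	req->data_sent = data_sent;
      	/* ... network send; a completion may re-run and overwrite *req ... */

      	/* decide from the recorded values, never from req itself */
      	return data_sent == data_len;	/* was this the last payload send? */
      }

      int main(void)
      {
      	struct req_model req = { .data_sent = 0, .data_len = 8 };
      	assert(!send_payload(&req, 4));	/* partial: not the last send */
      	assert(send_payload(&req, 4));	/* full payload now sent */
      	return 0;
      }
      ```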
  17. 21 Oct 2021, 5 commits