提交 · 764075fdcb2f09baa23fdc4df4b79741e5c39b57 · openeuler / Kernel

08 7月, 2020 19 次提交

nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs · 764075fd

由 Sagi Grimberg 提交于 7月 05, 2020

This is useful information, and moreover its it's useful to
be able to alter these parameters per controller after it
has been established.
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

764075fd

nvme: support for zoned namespaces · 240e6ee2

由 Keith Busch 提交于 6月 29, 2020

Add support for NVM Express Zoned Namespaces (ZNS) Command Set defined
in NVM Express TP4053. Zoned namespaces are discovered based on their
Command Set Identifier reported in the namespaces Namespace
Identification Descriptor list. A successfully discovered Zoned
Namespace will be registered with the block layer as a host managed
zoned block device with Zone Append command support. A namespace that
does not support append is not supported by the driver.
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NJavier González <javier.gonz@samsung.com>
Reviewed-by: NHimanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: NHans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: NDmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: NAjay Joshi <ajay.joshi@wdc.com>
Signed-off-by: NAravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: NNiklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: NMatias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NKeith Busch <keith.busch@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

240e6ee2

nvme: support for multiple Command Sets Supported and Effects log pages · be93e87e

由 Keith Busch 提交于 6月 29, 2020

The Commands Supported and Effects log page was extended with a CSI
field that enables the host to query the log page for each command set
supported. Retrieve this log page for each command set that an attached
namespace supports, and save a pointer to that log in the namespace head.
Reviewed-by: NMatias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: NJavier González <javier.gonz@samsung.com>
Reviewed-by: NHimanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: NDaniel Wagner <dwagner@suse.de>
Signed-off-by: NKeith Busch <keith.busch@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

be93e87e

nvme: implement multiple I/O Command Set support · 71010c30

由 Niklas Cassel 提交于 6月 29, 2020

Implements support for multiple I/O Command Sets.  NVMe TP 4056
introduces a method to enumerate multiple command sets per namespace. If
the command set is exposed, this method for enumeration will be used
instead of the traditional method that uses the CC.CSS register command
set register for command set identification.

For namespaces where the Command Set Identifier is not supported or
recognized, the specific namespace will not be created.
Reviewed-by: NJavier González <javier.gonz@samsung.com>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: NMatias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: NDaniel Wagner <dwagner@suse.de>
Reviewed-by: NHimanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Signed-off-by: NNiklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

71010c30

nvme: use USEC_PER_SEC instead of magic numbers · f5af577d

由 Baolin Wang 提交于 6月 24, 2020

Use USEC_PER_SEC instead of magic numbers to make code more readable.
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

f5af577d

nvmet-tcp: simplify nvmet_process_resp_list · b8a12e93

由 Sagi Grimberg 提交于 6月 24, 2020

We can make it shorter and simpler without some redundant checks.
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

b8a12e93

nvme-tcp: optimize network stack with setting msg flags according to batch size · 122e5b9f

由 Sagi Grimberg 提交于 6月 18, 2020

If we have a long list of request to send, signal the network stack
that more is coming (MSG_MORE). If we have nothing else, signal MSG_EOR.
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Tested-by: NMark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

122e5b9f

nvme-tcp: leverage request plugging · 86f0348a

由 Sagi Grimberg 提交于 6月 18, 2020

blk-mq request plugging can improve the execution of our pipeline.
When we queue a request we actually trigger our I/O worker thread
yielding a context switch by definition. However if we know that
there are more requests in the pipe that are coming, we are better
off not trigger our I/O worker and only do that for the last request
in the batch (bd->last). By having it, we improve efficiency by
amortizing context switches over a batch of requests.
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Tested-by: NMark Wunderlich <mark.wunderlich@intel.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

86f0348a

nvme-tcp: have queue prod/cons send list become a llist · 15ec928a

由 Sagi Grimberg 提交于 6月 18, 2020

The queue processing will splice to a queue local list, this should
alleviate some contention on the send_list lock, but also prepares
us to the next patch where we look on these lists for network stack
flag optimization.

Remove queue lock as its not used anymore.
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Tested-by: NMark Wunderlich <mark.wunderlich@intel.com>
[hch: simplified a loop]
Signed-off-by: NChristoph Hellwig <hch@lst.de>

15ec928a

nvme-fcloop: verify wwnn and wwpn format · ca8f4bee

由 Dongli Zhang 提交于 5月 25, 2020

The nvme host and target verify the wwnn and wwpn format via
nvme_fc_parse_traddr(). For instance, it is required that the length of
wwnn to be either 21 ("nn-0x") or 19 (nn-).

Add this verification to nvme-fcloop so that the input should always be in
hex and the length of input should always be 18.

Otherwise, the user may use e.g. 0x2 to create fcloop local port, while
0x0000000000000002 is required for nvme host and target. This makes the
requirement of format confusing.
Signed-off-by: NDongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

ca8f4bee

nvmet: use unsigned type for u64 · a0f0dbaa

由 Chaitanya Kulkarni 提交于 6月 11, 2020

In function nvmet_subsys_atte_version_show() which uses the NVME_XXX()
macros related to version (of type u64) get rid of the int type cast
when printing subsys version and use appropriate format specifier for
u64.
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

a0f0dbaa

nvmet: introduce flags member in nvmet_fabrics_ops · 6fa350f7

由 Max Gurtovoy 提交于 6月 02, 2020

Replace has_keyed_sgls and metadata_support booleans with a flags member
that will be used for adding more features in the future.
Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NHimanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: NIsrael Rukshin <israelr@mellanox.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

6fa350f7

nvmet-tcp: remove has_keyed_sgls initialization · 4e102559

由 Max Gurtovoy 提交于 6月 02, 2020

Since the nvmet_tcp_ops is static, there is no need to initialize values
to zero.
Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NIsrael Rukshin <israelr@mellanox.com>
Reviewed-by: NHimanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

4e102559

nvmet-loop: remove unused 'target_ctrl' in nvme_loop_ctrl · b261b61c

由 Dongli Zhang 提交于 6月 08, 2020

This field is never used since the introduction of nvme loopback by
commit 3a85a5de ("nvme-loop: add a NVMe loopback host driver").
Signed-off-by: NDongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

b261b61c

nvme-pci: remove the empty line at the beginning of nvme_should_reset() · ad509996

由 Dongli Zhang 提交于 6月 08, 2020

Just cleanup by removing the empty line.
Signed-off-by: NDongli Zhang <dongli.zhang@oracle.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

ad509996

nvme-pci: code cleanup for nvme_alloc_host_mem() · 9dc54a0d

由 Chaitanya Kulkarni 提交于 6月 01, 2020

Although use of for loop is preferred it is not a common practice to
have 80 char long for loop initialization and comparison section.

Use temp variables for calculating values and replace them in the
for loop with size of all variables to set to u64 since preferred
variable is declared as u64.
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

9dc54a0d

nvme-pci: use unsigned for io queue depth · 61f3b896

由 Chaitanya Kulkarni 提交于 6月 17, 2020

The NVMe PCIe declares module parameter io_queue_depth as int. Change
this to u16 as queue depth can never be negative. Now to reflect this
update module parameter getter function from param_get_int() ->
param_get_uint() and respective setter function with type of n changed
from int to u16 with param_set_int() to param_set_ushort(). Finally
update struct nvme_dev q_depth member to u16 and use u16 in min_t()
when calculating dev->q_depth in the nvme_pci_enable() (since q_depth is
now u16) and use unsigned int instead of int when calculating
dev->tagset.queue_depth as target variable tagset->queue_depth is of type
unsigned int in nvme_dev_add().
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

61f3b896

nvme-core: use u16 type for ctrl->sqsize · d4047cf9

由 Chaitanya Kulkarni 提交于 6月 01, 2020

In nvme_init_identify() when calculating submission queue size use u16
instead of int type in the min_t() since target variable ctrl->sqsize is
of type u16.
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

d4047cf9

nvme-core: use u16 type for directives · a87835e9

由 Chaitanya Kulkarni 提交于 6月 01, 2020

In nvme_configure_directives() when calculating number of streams use
u16 instead of unsigned type in the min_t() since target variable
ctrl->nr_streams is of type u16.
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

a87835e9

02 7月, 2020 2 次提交

nvme: fix a crash in nvme_mpath_add_disk · 72d44711

由 Christoph Hellwig 提交于 6月 29, 2020

For private namespaces ns->head_disk is NULL, so add a NULL check
before updating the BDI capabilities.

Fixes: b2ce4d90 ("nvme-multipath: set bdi capabilities once")
Reported-by: NAvinash M N <Avinash.M.N@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>

72d44711

nvme: fix identify error status silent ignore · ea43d970

由 Sagi Grimberg 提交于 6月 26, 2020

Commit 59c7c3ca intended to only silently ignore non retry-able
errors (DNR bit set) such that we can still identify misbehaving
controllers, and in the other hand propagate retry-able errors (DNR bit
cleared) so we don't wrongly abandon a namespace just because it happens
to be temporarily inaccessible.

The goal remains the same as the original commit where this was
introduced but unfortunately had the logic backwards.

Fixes: 59c7c3ca ("nvme: fix possible hang when ns scanning fails during error recovery")
Reported-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

ea43d970

01 7月, 2020 5 次提交

block: remove the bd_queue field from struct block_device · e556f6ba

由 Christoph Hellwig 提交于 6月 26, 2020

Just use bd_disk->queue instead.
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e556f6ba

block: remove direct_make_request · 5a6c35f9

由 Christoph Hellwig 提交于 7月 01, 2020

Now that submit_bio_noacct has a decent blk-mq fast path there is no
more need for this bypass.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

5a6c35f9

block: rename generic_make_request to submit_bio_noacct · ed00aabd

由 Christoph Hellwig 提交于 7月 01, 2020

generic_make_request has always been very confusingly misnamed, so rename
it to submit_bio_noacct to make it clear that it is submit_bio minus
accounting and a few checks.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ed00aabd

block: move ->make_request_fn to struct block_device_operations · c62b37d9

由 Christoph Hellwig 提交于 7月 01, 2020

The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector.  Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does).  Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c62b37d9

block: remove the request_queue argument from blk_queue_split · f695ca38

由 Christoph Hellwig 提交于 7月 01, 2020

The queue can be trivially derived from the bio, so pass one less
argument.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f695ca38

25 6月, 2020 11 次提交

nvme-multipath: fix bogus request queue reference put · c3124466

由 Sagi Grimberg 提交于 6月 24, 2020

The mpath disk node takes a reference on the request mpath
request queue when adding live path to the mpath gendisk.
However if we connected to an inaccessible path device_add_disk
is not called, so if we disconnect and remove the mpath gendisk
we endup putting an reference on the request queue that was
never taken [1].

Fix that to check if we ever added a live path (using
NVME_NS_HEAD_HAS_DISK flag) and if not, clear the disk->queue
reference.

[1]:
------------[ cut here ]------------
refcount_t: underflow; use-after-free.
WARNING: CPU: 1 PID: 1372 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
CPU: 1 PID: 1372 Comm: nvme Tainted: G           O      5.7.0-rc2+ #3
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
RIP: 0010:refcount_warn_saturate+0xa6/0xf0
RSP: 0018:ffffb29e8053bdc0 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff8b7a2f4fc060 RCX: 0000000000000007
RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8b7a3ec99980
RBP: ffff8b7a2f4fc000 R08: 00000000000002e1 R09: 0000000000000004
R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
R13: fffffffffffffff2 R14: ffffb29e8053bf08 R15: ffff8b7a320e2da0
FS:  00007f135d4ca800(0000) GS:ffff8b7a3ec80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00005651178c0c30 CR3: 000000003b650005 CR4: 0000000000360ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 disk_release+0xa2/0xc0
 device_release+0x28/0x80
 kobject_put+0xa5/0x1b0
 nvme_put_ns_head+0x26/0x70 [nvme_core]
 nvme_put_ns+0x30/0x60 [nvme_core]
 nvme_remove_namespaces+0x9b/0xe0 [nvme_core]
 nvme_do_delete_ctrl+0x43/0x5c [nvme_core]
 nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
 kernfs_fop_write+0xc1/0x1a0
 vfs_write+0xb6/0x1a0
 ksys_write+0x5f/0xe0
 do_syscall_64+0x52/0x1a0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported-by: NAnton Eidelman <anton@lightbitslabs.com>
Tested-by: NAnton Eidelman <anton@lightbitslabs.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

c3124466

nvme-multipath: fix deadlock due to head->lock · d8a22f85

由 Anton Eidelman 提交于 6月 24, 2020

In the following scenario scan_work and ana_work will deadlock:

When scan_work calls nvme_mpath_add_disk() this holds ana_lock
and invokes nvme_parse_ana_log(), which may issue IO
in device_add_disk() and hang waiting for an accessible path.

While nvme_mpath_set_live() only called when nvme_state_is_live(),
a transition may cause NVME_SC_ANA_TRANSITION and requeue the IO.

Since nvme_mpath_set_live() holds ns->head->lock, an ana_work on
ANY ctrl will not be able to complete nvme_mpath_set_live()
on the same ns->head, which is required in order to update
the new accessible path and remove NVME_NS_ANA_PENDING..
Therefore IO never completes: deadlock [1].

Fix:
Move device_add_disk out of the head->lock and protect it with an
atomic test_and_set for a new NVME_NS_HEAD_HAS_DISK bit.

[1]:
kernel: INFO: task kworker/u8:2:160 blocked for more than 120 seconds.
kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kworker/u8:2    D    0   160      2 0x80004000
kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  schedule_preempt_disabled+0xe/0x10
kernel:  __mutex_lock.isra.0+0x182/0x4f0
kernel:  __mutex_lock_slowpath+0x13/0x20
kernel:  mutex_lock+0x2e/0x40
kernel:  nvme_update_ns_ana_state+0x22/0x60 [nvme_core]
kernel:  nvme_update_ana_state+0xca/0xe0 [nvme_core]
kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel:  nvme_read_ana_log+0x76/0x100 [nvme_core]
kernel:  nvme_ana_work+0x15/0x20 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x4d/0x400
kernel:  kthread+0x104/0x140
kernel:  ret_from_fork+0x35/0x40
kernel: INFO: task kworker/u8:4:439 blocked for more than 120 seconds.
kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kworker/u8:4    D    0   439      2 0x80004000
kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  io_schedule+0x16/0x40
kernel:  do_read_cache_page+0x438/0x830
kernel:  read_cache_page+0x12/0x20
kernel:  read_dev_sector+0x27/0xc0
kernel:  read_lba+0xc1/0x220
kernel:  efi_partition+0x1e6/0x708
kernel:  check_partition+0x154/0x244
kernel:  rescan_partitions+0xae/0x280
kernel:  __blkdev_get+0x40f/0x560
kernel:  blkdev_get+0x3d/0x140
kernel:  __device_add_disk+0x388/0x480
kernel:  device_add_disk+0x13/0x20
kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel:  nvme_mpath_add_disk+0xbe/0x100 [nvme_core]
kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
kernel:  nvme_scan_work+0x256/0x390 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x4d/0x400
kernel:  kthread+0x104/0x140
kernel:  ret_from_fork+0x35/0x40

Fixes: 0d0b660f ("nvme: add ANA support")
Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

d8a22f85

nvme: don't protect ns mutation with ns->head->lock · e164471d

由 Sagi Grimberg 提交于 6月 24, 2020

Right now ns->head->lock is protecting namespace mutation
which is wrong and unneeded. Move it to only protect
against head mutations. While we're at it, remove unnecessary
ns->head reference as we already have head pointer.

The problem with this is that the head->lock spans
mpath disk node I/O that may block under some conditions (if
for example the controller is disconnecting or the path
became inaccessible), The locking scheme does not allow any
other path to enable itself, preventing blocked I/O to complete
and forward-progress from there.

This is a preparation patch for the fix in a subsequent patch
where the disk I/O will also be done outside the head->lock.

Fixes: 0d0b660f ("nvme: add ANA support")
Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

e164471d

nvme-multipath: fix deadlock between ana_work and scan_work · 489dd102

由 Anton Eidelman 提交于 6月 24, 2020

When scan_work calls nvme_mpath_add_disk() this holds ana_lock
and invokes nvme_parse_ana_log(), which may issue IO
in device_add_disk() and hang waiting for an accessible path.
While nvme_mpath_set_live() only called when nvme_state_is_live(),
a transition may cause NVME_SC_ANA_TRANSITION and requeue the IO.

In order to recover and complete the IO ana_work on the same ctrl
should be able to update the path state and remove NVME_NS_ANA_PENDING.

The deadlock occurs because scan_work keeps holding ana_lock,
so ana_work hangs [1].

Fix:
Now nvme_mpath_add_disk() uses nvme_parse_ana_log() to obtain a copy
of the ANA group desc, and then calls nvme_update_ns_ana_state() without
holding ana_lock.

[1]:
kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  io_schedule+0x16/0x40
kernel:  do_read_cache_page+0x438/0x830
kernel:  read_cache_page+0x12/0x20
kernel:  read_dev_sector+0x27/0xc0
kernel:  read_lba+0xc1/0x220
kernel:  efi_partition+0x1e6/0x708
kernel:  check_partition+0x154/0x244
kernel:  rescan_partitions+0xae/0x280
kernel:  __blkdev_get+0x40f/0x560
kernel:  blkdev_get+0x3d/0x140
kernel:  __device_add_disk+0x388/0x480
kernel:  device_add_disk+0x13/0x20
kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x249/0x400
kernel:  kthread+0x104/0x140

kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  schedule_preempt_disabled+0xe/0x10
kernel:  __mutex_lock.isra.0+0x182/0x4f0
kernel:  ? __switch_to_asm+0x34/0x70
kernel:  ? select_task_rq_fair+0x1aa/0x5c0
kernel:  ? kvm_sched_clock_read+0x11/0x20
kernel:  ? sched_clock+0x9/0x10
kernel:  __mutex_lock_slowpath+0x13/0x20
kernel:  mutex_lock+0x2e/0x40
kernel:  nvme_read_ana_log+0x3a/0x100 [nvme_core]
kernel:  nvme_ana_work+0x15/0x20 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x4d/0x400
kernel:  kthread+0x104/0x140
kernel:  ? process_one_work+0x380/0x380
kernel:  ? kthread_park+0x80/0x80
kernel:  ret_from_fork+0x35/0x40

Fixes: 0d0b660f ("nvme: add ANA support")
Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

489dd102

nvme: fix possible deadlock when I/O is blocked · 3b4b1972

由 Sagi Grimberg 提交于 6月 24, 2020

Revert fab7772b ("nvme-multipath: revalidate nvme_ns_head gendisk
in nvme_validate_ns")

When adding a new namespace to the head disk (via nvme_mpath_set_live)
we will see partition scan which triggers I/O on the mpath device node.
This process will usually be triggered from the scan_work which holds
the scan_lock. If I/O blocks (if we got ana change currently have only
available paths but none are accessible) this can deadlock on the head
disk bd_mutex as both partition scan I/O takes it, and head disk revalidation
takes it to check for resize (also triggered from scan_work on a different
path). See trace [1].

The mpath disk revalidation was originally added to detect online disk
size change, but this is no longer needed since commit cb224c3a
("nvme: Convert to use set_capacity_revalidate_and_notify") which already
updates resize info without unnecessarily revalidating the disk (the
mpath disk doesn't even implement .revalidate_disk fop).

[1]:
--
kernel: INFO: task kworker/u65:9:494 blocked for more than 241 seconds.
kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kworker/u65:9   D    0   494      2 0x80004000
kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  schedule_preempt_disabled+0xe/0x10
kernel:  __mutex_lock.isra.0+0x182/0x4f0
kernel:  __mutex_lock_slowpath+0x13/0x20
kernel:  mutex_lock+0x2e/0x40
kernel:  revalidate_disk+0x63/0xa0
kernel:  __nvme_revalidate_disk+0xfe/0x110 [nvme_core]
kernel:  nvme_revalidate_disk+0xa4/0x160 [nvme_core]
kernel:  ? evict+0x14c/0x1b0
kernel:  revalidate_disk+0x2b/0xa0
kernel:  nvme_validate_ns+0x49/0x940 [nvme_core]
kernel:  ? blk_mq_free_request+0xd2/0x100
kernel:  ? __nvme_submit_sync_cmd+0xbe/0x1e0 [nvme_core]
kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x249/0x400
kernel:  kthread+0x104/0x140
kernel:  ? process_one_work+0x380/0x380
kernel:  ? kthread_park+0x80/0x80
kernel:  ret_from_fork+0x1f/0x40
...
kernel: INFO: task kworker/u65:1:2630 blocked for more than 241 seconds.
kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kworker/u65:1   D    0  2630      2 0x80004000
kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  io_schedule+0x16/0x40
kernel:  do_read_cache_page+0x438/0x830
kernel:  ? __switch_to_asm+0x34/0x70
kernel:  ? file_fdatawait_range+0x30/0x30
kernel:  read_cache_page+0x12/0x20
kernel:  read_dev_sector+0x27/0xc0
kernel:  read_lba+0xc1/0x220
kernel:  ? kmem_cache_alloc_trace+0x19c/0x230
kernel:  efi_partition+0x1e6/0x708
kernel:  ? vsnprintf+0x39e/0x4e0
kernel:  ? snprintf+0x49/0x60
kernel:  check_partition+0x154/0x244
kernel:  rescan_partitions+0xae/0x280
kernel:  __blkdev_get+0x40f/0x560
kernel:  blkdev_get+0x3d/0x140
kernel:  __device_add_disk+0x388/0x480
kernel:  device_add_disk+0x13/0x20
kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel:  ? nvme_update_ns_ana_state+0x60/0x60 [nvme_core]
kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
kernel:  ? blk_mq_free_request+0xd2/0x100
kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x249/0x400
kernel:  kthread+0x104/0x140
kernel:  ? process_one_work+0x380/0x380
kernel:  ? kthread_park+0x80/0x80
kernel:  ret_from_fork+0x1f/0x40
--

Fixes: fab7772b ("nvme-multipath: revalidate nvme_ns_head gendisk
in nvme_validate_ns")
Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

3b4b1972

nvme-rdma: assign completion vector correctly · 032a9966

由 Max Gurtovoy 提交于 6月 23, 2020

The completion vector index that is given during CQ creation can't
exceed the number of support vectors by the underlying RDMA device. This
violation currently can accure, for example, in case one will try to
connect with N regular read/write queues and M poll queues and the sum
of N + M > num_supported_vectors. This will lead to failure in establish
a connection to remote target. Instead, in that case, share a completion
vector between queues.
Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

032a9966

nvme-loop: initialize tagset numa value to the value of the ctrl · 1b4ad7a5