Commits · aa41d2fe60ee2e4452b0f9ca9f0f6d80a4ff9f9d · openeuler / Kernel

31 May, 2022 2 commits

nvme: set controller enable bit in a separate write · aa41d2fe

Authored by Niklas Cassel on May 26, 2022

The NVM Express Base Specification 2.0 specifies in the description
of the CC – Controller Configuration register:
"Host software shall set the Arbitration Mechanism Selected (CC.AMS),
the Memory Page Size (CC.MPS), and the I/O Command Set Selected (CC.CSS)
to valid values prior to enabling the controller by setting CC.EN to ‘1’."

While we haven't seen any controller misbehaving while setting all bits
in a single write, let's do it in the order that it is written in the
spec, as there could potentially be controllers that are implemented to
rely on the configuration bits being set before enabling the controller.
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

aa41d2fe
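
The ordering described in the commit above can be sketched in plain C against a simulated register. This is only an illustration under stated assumptions: `fake_ctrl`, `cc_write()` and `enable_controller()` are hypothetical names, not the driver's functions, and the field offsets follow the CC register layout in the NVMe base specification (EN at bit 0, CSS at bits 6:4, MPS at bits 10:7, AMS at bits 13:11).

```c
#include <assert.h>
#include <stdint.h>

/* CC register fields (NVMe base specification). */
#define CC_EN         (1u << 0)
#define CC_CSS_SHIFT  4
#define CC_MPS_SHIFT  7
#define CC_AMS_SHIFT  11

/* Hypothetical stand-in for a controller: records every CC write. */
struct fake_ctrl {
	uint32_t cc;          /* current register value */
	uint32_t history[4];  /* values written, in order */
	unsigned int nwrites;
};

/* Stand-in for an MMIO write to the CC register. */
void cc_write(struct fake_ctrl *c, uint32_t val)
{
	assert(c->nwrites < 4);
	c->history[c->nwrites++] = val;
	c->cc = val;
}

/* Write the configuration fields first, then set EN in a second write. */
void enable_controller(struct fake_ctrl *c, uint32_t css, uint32_t mps,
		       uint32_t ams)
{
	uint32_t cfg = (css << CC_CSS_SHIFT) | (mps << CC_MPS_SHIFT) |
		       (ams << CC_AMS_SHIFT);

	cc_write(c, cfg);          /* CC.EN is still 0 here */
	cc_write(c, cfg | CC_EN);  /* same configuration, now enabled */
}
```

Recording the write history makes the property easy to inspect: the first value never has the enable bit set, and the second differs from it only in that bit.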

nvme-pci: disable namespace identifiers for the MAXIO MAP1001 · 70ce3455

Authored by Christoph Hellwig on May 27, 2022

The MAXIO MAP1001 controllers report completely bogus Namespace
identifiers that even change after suspend cycles.  Disable using
the Identifiers entirely.
Reported-by: Arman Hajishafieha <arman.hajishafieha@hotmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Arman Hajishafieha <arman.hajishafieha@hotmail.com>

70ce3455

28 May, 2022 7 commits

bcache: avoid unnecessary soft lockup in kworker update_writeback_rate() · a1a2d8f0

Authored by Coly Li on May 28, 2022

The kworker routine update_writeback_rate() is scheduled to update the
writeback rate every 5 seconds by default. Before calling
__update_writeback_rate() to do the real work, the semaphore
dc->writeback_lock should be held by the kworker routine.

At the same time, the bcache writeback thread routine
bch_writeback_thread() also needs to hold dc->writeback_lock before
flushing dirty data back into the backing device. If the dirty data set
is large, it might take bch_writeback_thread() a very long time to scan
all dirty buckets and release dc->writeback_lock. In such a case
update_writeback_rate() can be starved long enough that the kernel
reports a soft lockup warning that starts like:
  watchdog: BUG: soft lockup - CPU#246 stuck for 23s! [kworker/246:31:179713]

Such a soft lockup condition is unnecessary, because after the writeback
thread finishes its job and releases dc->writeback_lock, the kworker
update_writeback_rate() can continue to work and everything is fine.

This patch avoids the unnecessary soft lockup by the following method:
- Add a new member to struct cached_dev
  - dc->rate_update_retry (0 by default)
- In update_writeback_rate(), call down_read_trylock(&dc->writeback_lock)
  first; if it fails, lock contention is happening.
- If dc->rate_update_retry <= BCH_WBRATE_UPDATE_MAX_SKIPS (15), don't
  acquire the lock and reschedule the kworker for the next try.
- If dc->rate_update_retry > BCH_WBRATE_UPDATE_MAX_SKIPS, stop retrying
  and call down_read(&dc->writeback_lock) to wait for the lock.

With the above method, in the worst case update_writeback_rate() may
retry for 1+ minutes before blocking on dc->writeback_lock by calling
down_read(). For a 4TB cache device with 1TB dirty data, 90%+ of the
unnecessary soft lockup warning messages can be avoided.

While retrying to acquire dc->writeback_lock in update_writeback_rate(),
the writeback rate of course cannot be updated. That is fair, because
when the kworker is blocked on contention for dc->writeback_lock, the
writeback rate cannot be updated either.

This change follows Jens Axboe's suggestion for a clearer and simpler
version.
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220528124550.32834-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a1a2d8f0
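
The skip-then-block policy from the commit above can be modelled in userspace roughly as follows. This is a sketch, not the kernel code: `wb_model` and `wb_rate_tick()` are hypothetical names, and a boolean stands in for the outcome of the trylock. With a 5-second interval, 15 skipped runs add up to 75 seconds, which matches the "1+ minutes" figure in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SKIPS 15  /* same limit as BCH_WBRATE_UPDATE_MAX_SKIPS in the message */

/* Hypothetical model of the state involved; not the kernel's cached_dev. */
struct wb_model {
	bool lock_contended;         /* what a trylock would report */
	unsigned int retry;          /* plays the role of rate_update_retry */
	unsigned int rate_updates;   /* how often the rate was recomputed */
	unsigned int blocking_waits; /* how often we fell back to a blocking lock */
};

/*
 * One run of the periodic worker. Returns true if the rate was updated,
 * false if this run was skipped because of lock contention.
 */
bool wb_rate_tick(struct wb_model *m)
{
	assert(m != NULL);

	if (m->lock_contended) {        /* trylock failed */
		m->retry++;
		if (m->retry <= MAX_SKIPS)
			return false;   /* skip; the worker runs again later */
		m->blocking_waits++;    /* too many skips: wait for the lock */
		m->retry = 0;
	}
	m->rate_updates++;              /* lock held: update the rate */
	return true;
}
```

Under permanent contention the first 15 calls return false and the 16th falls back to the blocking path, so the worker only sleeps on the lock after it has been starved for a sustained period.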

nbd: use pr_err to output error message · 1243172d

Authored by Yu Kuai on May 21, 2022

Instead of using the long printk(KERN_ERR "nbd: ...") to
output error messages, define pr_fmt and use
the short pr_err("") to do that. The replacement is done
by using the following command:

  sed -i 's/printk(KERN_ERR "nbd: /pr_err("/g' \
		  drivers/block/nbd.c

This patch also rewraps to 80 columns where possible.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220521073749.3146892-7-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1243172d

nbd: fix possible overflow on 'first_minor' in nbd_dev_add() · 858f1bf6

Authored by Zhang Wensheng on May 21, 2022

When 'index' is a big number, it may become negative when forced
to 'int'. Then 'index << part_shift' might overflow to a positive
value that is not greater than '0xfffff', and sysfs might complain
about duplicate creation. Moving the 'index' check to the front
fixes this.

Fixes: b0d9111a ("nbd: use an idr to keep track of nbd devices")
Fixes: 940c2649 ("nbd: fix possible overflow for 'first_minor' in nbd_dev_add()")
Signed-off-by: Zhang Wensheng <zhangwensheng5@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220521073749.3146892-6-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

858f1bf6
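
The idea of validating the index before it is shifted, as the commit above describes, might look like this in isolation. It is a minimal sketch: `nbd_index_ok()` is a hypothetical helper, and the 20-bit minor space is taken from the '0xfffff' limit quoted in the message.

```c
#include <assert.h>
#include <stdbool.h>

#define MINOR_BITS 20
#define MINOR_MASK ((1u << MINOR_BITS) - 1)   /* 0xfffff */

/*
 * Return true if 'index << part_shift' is guaranteed to fit in the
 * 20-bit minor space. The check runs before any shift is performed,
 * so a negative or oversized index is rejected up front.
 */
bool nbd_index_ok(int index, unsigned int part_shift)
{
	assert(part_shift < MINOR_BITS);
	return index >= 0 &&
	       (unsigned int)index <= (MINOR_MASK >> part_shift);
}
```

Comparing against `MINOR_MASK >> part_shift` avoids shifting the untrusted value at all, which is what keeps the check itself free of overflow.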

nbd: fix io hung while disconnecting device · 09dadb59

Authored by Yu Kuai on May 21, 2022

In our tests, "qemu-nbd" triggers an I/O hang:

INFO: task qemu-nbd:11445 blocked for more than 368 seconds.
      Not tainted 5.18.0-rc3-next-20220422-00003-g2176915513ca #884
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:qemu-nbd        state:D stack:    0 pid:11445 ppid:     1 flags:0x00000000
Call Trace:
 <TASK>
 __schedule+0x480/0x1050
 ? _raw_spin_lock_irqsave+0x3e/0xb0
 schedule+0x9c/0x1b0
 blk_mq_freeze_queue_wait+0x9d/0xf0
 ? ipi_rseq+0x70/0x70
 blk_mq_freeze_queue+0x2b/0x40
 nbd_add_socket+0x6b/0x270 [nbd]
 nbd_ioctl+0x383/0x510 [nbd]
 blkdev_ioctl+0x18e/0x3e0
 __x64_sys_ioctl+0xac/0x120
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fd8ff706577
RSP: 002b:00007fd8fcdfebf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000040000000 RCX: 00007fd8ff706577
RDX: 000000000000000d RSI: 000000000000ab00 RDI: 000000000000000f
RBP: 000000000000000f R08: 000000000000fbe8 R09: 000055fe497c62b0
R10: 00000002aff20000 R11: 0000000000000246 R12: 000000000000006d
R13: 0000000000000000 R14: 00007ffe82dc5e70 R15: 00007fd8fcdff9c0

"qemu-nbd -d" will call ioctl 'NBD_DISCONNECT' first; however, the
following message was found:

block nbd0: Send disconnect failed -32

This indicates that something is wrong with the server. Then,
"qemu-nbd -d" will call ioctl 'NBD_CLEAR_SOCK'; however, the ioctl can't
clear requests after commit 2516ab15 ("nbd: only clear the queue on
device teardown"). In the meantime, requests can't complete through
timeout because nbd_xmit_timeout() will always return
'BLK_EH_RESET_TIMER', which means such requests will never be completed
in this situation.

Now that the flag 'NBD_CMD_INFLIGHT' can make sure requests won't
complete multiple times, switch back to call nbd_clear_sock() in
nbd_clear_sock_ioctl(), so that inflight requests can be cleared.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220521073749.3146892-5-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

09dadb59

nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed · 2895f183

Authored by Yu Kuai on May 21, 2022

Otherwise I/O will hang, because a request will only be completed if the
cmd has the flag 'NBD_CMD_INFLIGHT'.

Fixes: 07175cb1 ("nbd: make sure request completion won't concurrent")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220521073749.3146892-4-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2895f183

nbd: fix race between nbd_alloc_config() and module removal · c55b2b98

Authored by Yu Kuai on May 21, 2022

When the nbd module is being removed, nbd_alloc_config() may be
called concurrently by nbd_genl_connect(). Although try_module_get()
will return false, nbd_alloc_config() doesn't handle it.

The race may lead to a leak of nbd_config and its related
resources (e.g., recv_workq) and an oops in nbd_read_stat() due
to the unload of the nbd module, as shown below:

  BUG: kernel NULL pointer dereference, address: 0000000000000040
  Oops: 0000 [#1] SMP PTI
  CPU: 5 PID: 13840 Comm: kworker/u17:33 Not tainted 5.14.0+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  Workqueue: knbd16-recv recv_work [nbd]
  RIP: 0010:nbd_read_stat.cold+0x130/0x1a4 [nbd]
  Call Trace:
   recv_work+0x3b/0xb0 [nbd]
   process_one_work+0x1ed/0x390
   worker_thread+0x4a/0x3d0
   kthread+0x12a/0x150
   ret_from_fork+0x22/0x30

Fix it by checking the return value of try_module_get()
in nbd_alloc_config(). As nbd_alloc_config() may return ERR_PTR(-ENODEV),
assign nbd->config only when nbd_alloc_config() succeeds to ensure
the value of nbd->config is binary (valid or NULL).

Also add a debug message to check the reference counter
of nbd_config during module removal.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220521073749.3146892-3-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c55b2b98

nbd: call genl_unregister_family() first in nbd_cleanup() · 06c4da89

Authored by Yu Kuai on May 21, 2022

Otherwise there may be a race between module removal and the handling of
a netlink command, which can lead to an oops as shown below:

  BUG: kernel NULL pointer dereference, address: 0000000000000098
  Oops: 0002 [#1] SMP PTI
  CPU: 1 PID: 31299 Comm: nbd-client Tainted: G            E     5.14.0-rc4
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  RIP: 0010:down_write+0x1a/0x50
  Call Trace:
   start_creating+0x89/0x130
   debugfs_create_dir+0x1b/0x130
   nbd_start_device+0x13d/0x390 [nbd]
   nbd_genl_connect+0x42f/0x748 [nbd]
   genl_family_rcv_msg_doit.isra.0+0xec/0x150
   genl_rcv_msg+0xe5/0x1e0
   netlink_rcv_skb+0x55/0x100
   genl_rcv+0x29/0x40
   netlink_unicast+0x1a8/0x250
   netlink_sendmsg+0x21b/0x430
   ____sys_sendmsg+0x2a4/0x2d0
   ___sys_sendmsg+0x81/0xc0
   __sys_sendmsg+0x62/0xb0
   __x64_sys_sendmsg+0x1f/0x30
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  Modules linked in: nbd(E-)
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220521073749.3146892-2-yukuai3@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

06c4da89

27 May, 2022 3 commits

md: bcache: check the return value of kzalloc() in detached_dev_do_request() · 40f567bb

Authored by Jia-Ju Bai on May 27, 2022

The function kzalloc() in detached_dev_do_request() can fail, so its
return value should be checked.

Fixes: bc082a55 ("bcache: fix inaccurate io state for detached bcache devices")
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220527152818.27545-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

40f567bb

bcache: memset on stack variables in bch_btree_check() and bch_sectors_dirty_init() · 7d6b902e

Authored by Coly Li on May 27, 2022

The local variables check_state (in bch_btree_check()) and state (in
bch_sectors_dirty_init()) should be fully zero-filled, because before
they were allocated on the stack, they were dynamically allocated by
kzalloc().
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220527152818.27545-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7d6b902e

block, loop: support partitions without scanning · b9684a71

Authored by Christoph Hellwig on May 27, 2022

Historically we did distinguish between a flag that suppressed partition
scanning, and a combination of the minors variable and another flag if
any partitions were supported. This was generally confusing and doesn't
make much sense, but some corner case uses of the loop driver actually
do want to support manually added partitions on a device that does not
actively scan for partitions. To make things worse the loop driver
also wants to dynamically toggle the scanning for partitions on a live
gendisk, which makes the disk->flags updates non-atomic.

Introduce a new GD_SUPPRESS_PART_SCAN bit in disk->state that disables
just scanning for partitions, and toggle that instead of GENHD_FL_NO_PART
in the loop driver.

Fixes: 1ebe2e5f ("block: remove GENHD_FL_EXT_DEVT")
Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220527055806.1972352-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b9684a71

24 May, 2022 4 commits

bcache: avoid journal no-space deadlock by reserving 1 journal bucket · 32feee36

Authored by Coly Li on May 24, 2022

The journal no-space deadlock has been reported from time to time. Such
a deadlock can happen in the following situation.

When all journal buckets are fully filled by active jsets under heavy
write I/O load, the cache set registration (after a reboot) will load
all active jsets and insert them into the btree again (which is
called journal replay). If a journaled bkey is inserted into a btree
node and results in a btree node split, a new journal request might be
triggered. For example, if the btree grows one more level after the node
split, the root node record in the cache device super block will be
updated by bch_journal_meta() from bch_btree_set_root(). But there is no
space in the journal buckets, so the journal replay has to wait for a
new journal bucket to be reclaimed after at least one journal bucket is
replayed. This is one example of how the journal no-space deadlock
happens.

The solution to avoid the deadlock is to reserve 1 journal bucket at
run time, and only permit the reserved journal bucket to be used during
the cache set registration procedure for things like journal replay.
Then the journal space will never be fully filled, and there is no
chance for the journal no-space deadlock to happen anymore.

This patch adds a new member "bool do_reserve" in struct journal. It is
initialized to 0 (false) when struct journal is allocated, and set to
1 (true) by bch_journal_space_reserve() when all initialization is done
in run_cache_set(). At run time, when journal_reclaim() tries to
allocate a new journal bucket, free_journal_buckets() is called to check
whether there are enough free journal buckets to use. If there is only
1 free journal bucket and journal->do_reserve is 1 (true), the last
bucket is reserved and free_journal_buckets() will return 0 to indicate
no free journal bucket. Then journal_reclaim() will give up, and try
next time to see whether there is a free journal bucket to allocate. By
this method, there is always 1 journal bucket reserved at run time.

During the cache set registration, journal->do_reserve is 0 (false), so
the reserved journal bucket can be used to avoid the no-space deadlock.
Reported-by: Nikhil Kshirsagar <nkshirsagar@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

32feee36
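
The reservation rule in the commit above comes down to a small piece of arithmetic. The following is a simplified sketch, assuming only what the message states: `journal_model` and `usable_journal_buckets()` are hypothetical names, and the real free_journal_buckets() works on ring indexes rather than a plain counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the journal state. */
struct journal_model {
	unsigned int free_buckets; /* journal buckets not in use */
	bool do_reserve;           /* false during registration, true at run time */
};

/* Number of buckets an allocation may use; the reserved one is excluded. */
unsigned int usable_journal_buckets(const struct journal_model *j)
{
	unsigned int reserved;

	assert(j != NULL);
	reserved = j->do_reserve ? 1 : 0;
	return j->free_buckets > reserved ? j->free_buckets - reserved : 0;
}
```

At run time the last free bucket is reported as unavailable, while during registration (`do_reserve` false) the same bucket can be handed out for journal replay.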

bcache: remove incremental dirty sector counting for bch_sectors_dirty_init() · 80db4e47

Authored by Coly Li on May 24, 2022

After bch_sectors_dirty_init() was made multithreaded, the existing
incremental dirty sector counting in bch_root_node_dirty_init() no
longer releases its hold on the btree after iterating 500000
(INIT_KEYS_EACH_TIME) bkeys. Because a read lock is taken on the btree
root node to prevent the btree from being split during the dirty sector
counting, other I/O requesters have no chance to gain the write lock,
even if bcache_btree() is restarted.

That is to say, the incremental dirty sector counting is incompatible
with the multithreaded bch_sectors_dirty_init(). We have to choose one
and drop the other.

In my testing, with 512-byte random writes, I generated 1.2T of dirty
data and a btree with 400K nodes. With a single thread and incremental
dirty sector counting, it takes 30+ minutes to register the backing
device. With multithreaded dirty sector counting, the backing device
registration can be completed within 2 minutes.

The difference between 30+ minutes and under 2 minutes made me decide to
keep the multithreaded bch_sectors_dirty_init() and drop the incremental
dirty sector counting. This is what this patch does.

But INIT_KEYS_EACH_TIME is kept: in sectors_dirty_init_fn() the CPU
is released by cond_resched() after every INIT_KEYS_EACH_TIME keys
iterated. This keeps the watchdog from reporting a bogus soft lockup
warning.

Fixes: b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

80db4e47
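
The "release the CPU every N keys" pattern that the commit above keeps can be approximated in userspace. This is a sketch only: `walk_keys()` is a hypothetical function, and POSIX `sched_yield()` stands in for the kernel's cond_resched(), which it only loosely resembles.

```c
#include <assert.h>
#include <sched.h>

/*
 * Walk 'nkeys' keys, yielding the CPU after every 'batch' of them.
 * Returns how many times the CPU was yielded.
 */
unsigned long walk_keys(unsigned long nkeys, unsigned long batch)
{
	unsigned long yields = 0;
	unsigned long since_yield = 0;

	assert(batch > 0);

	for (unsigned long i = 0; i < nkeys; i++) {
		/* ... per-key work would go here ... */
		if (++since_yield == batch) {
			since_yield = 0;
			sched_yield();  /* userspace stand-in for cond_resched() */
			yields++;
		}
	}
	return yields;
}
```

With a batch of 500000 (the INIT_KEYS_EACH_TIME value quoted in the message), a walk over 1.2 million keys gives up the CPU twice, so a long iteration never monopolizes a core for the whole run.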

bcache: improve multithreaded bch_sectors_dirty_init() · 4dc34ae1

Authored by Coly Li on May 24, 2022

Commit b144e45f ("bcache: make bch_sectors_dirty_init() to be
multithreaded") makes bch_sectors_dirty_init() much faster
when counting dirty sectors by iterating all dirty keys in the btree.
But it isn't in ideal shape yet and can still be improved.

This patch makes the following changes to improve the current parallel
dirty key iteration on the btree:
- Add a read lock on the root node while multiple threads iterate the
  btree, to prevent the root node from being split by I/Os from other
  registered bcache devices.
- Remove the local variable "char name[32]" and generate the kernel
  thread name string directly when calling kthread_run().
- Allocate "struct bch_dirty_init_state state" directly on the stack and
  avoid the unnecessary dynamic memory allocation for it.
- Decrease BCH_DIRTY_INIT_THRD_MAX from 64 to 12, which is enough.
- Increase &state->started to count a created kernel thread only after
  it is successfully created.
- When waiting for all dirty key counting threads to finish, use
  wait_event() instead of wait_event_interruptible().

With the above changes, the code is clearer, and some potential error
conditions are avoided.

Fixes: b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4dc34ae1

bcache: improve multithreaded bch_btree_check() · 62253644

Authored by Coly Li on May 24, 2022

Commit 8e710227 ("bcache: make bch_btree_check() to be
multithreaded") makes bch_btree_check() much faster when checking
all btree nodes during cache device registration. But it isn't in ideal
shape yet and can still be improved.

This patch does the following to improve the current parallel btree
node check by multiple threads in bch_btree_check():
- Add a read lock on the root node while checking all the btree nodes
  with multiple threads. Although it is currently not mandatory, it is
  good to have a read lock in the code logic.
- Remove the local variable 'char name[32]', and generate the kernel
  thread name string directly when calling kthread_run().
- Allocate the local variable "struct btree_check_state check_state" on
  the stack and avoid unnecessary dynamic memory allocation for it.
- Reduce BCH_BTR_CHKTHREAD_MAX from 64 to 12, which is enough.
- Increase check_state->started to count a created kernel thread only
  after it is successfully created.
- When waiting for all checking kernel threads to finish, use
  wait_event() instead of wait_event_interruptible().

With this change, the code is clearer, and some potential error
conditions are avoided.

Fixes: 8e710227 ("bcache: make bch_btree_check() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

62253644

23 May, 2022 5 commits

md: fix double free of io_acct_set bioset · 42b805af

Authored by Xiao Ni on May 12, 2022

Now io_acct_set is allocated and freed in the personality. Remove the
code that frees io_acct_set in md_free and md_stop.

Fixes: 0c031fd3 (md: Move alloc/free acct bioset in to personality)
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>

42b805af

md: Don't set mddev private to NULL in raid0 pers->free · 0f2571ad

Authored by Xiao Ni on May 12, 2022

In the normal stop process, the call flow is:
   do_md_stop
      |
   __md_stop (pers->free(); mddev->private=NULL)
      |
   md_free (free mddev)
__md_stop sets mddev->private to NULL after pers->free. The raid device
is stopped and the mddev memory is freed. But a reshape doesn't
free the mddev, and the mddev is still used by the new raid.

In a reshape, the code first sets mddev->private to new_pers and then
runs old_pers->free(). Now raid0 sets mddev->private to NULL in
raid0_free, so the new raid can't work anymore. It panics when
dereferencing mddev->private because of a NULL pointer dereference.

It can panic like this:
[63010.814972] kernel BUG at drivers/md/raid10.c:928!
[63010.819778] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[63010.825011] CPU: 3 PID: 44437 Comm: md0_resync Kdump: loaded Not tainted 5.14.0-86.el9.x86_64 #1
[63010.833789] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.15.0 09/11/2020
[63010.841440] RIP: 0010:raise_barrier+0x161/0x170 [raid10]
[63010.865508] RSP: 0018:ffffc312408bbc10 EFLAGS: 00010246
[63010.870734] RAX: 0000000000000000 RBX: ffffa00bf7d39800 RCX: 0000000000000000
[63010.877866] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffa00bf7d39800
[63010.884999] RBP: 0000000000000000 R08: fffffa4945e74400 R09: 0000000000000000
[63010.892132] R10: ffffa00eed02f798 R11: 0000000000000000 R12: ffffa00bbc435200
[63010.899266] R13: ffffa00bf7d39800 R14: 0000000000000400 R15: 0000000000000003
[63010.906399] FS:  0000000000000000(0000) GS:ffffa00eed000000(0000) knlGS:0000000000000000
[63010.914485] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[63010.920229] CR2: 00007f5cfbe99828 CR3: 0000000105efe000 CR4: 00000000003506e0
[63010.927363] Call Trace:
[63010.929822]  ? bio_reset+0xe/0x40
[63010.933144]  ? raid10_alloc_init_r10buf+0x60/0xa0 [raid10]
[63010.938629]  raid10_sync_request+0x756/0x1610 [raid10]
[63010.943770]  md_do_sync.cold+0x3e4/0x94c
[63010.947698]  md_thread+0xab/0x160
[63010.951024]  ? md_write_inc+0x50/0x50
[63010.954688]  kthread+0x149/0x170
[63010.957923]  ? set_kthread_struct+0x40/0x40
[63010.962107]  ret_from_fork+0x22/0x30

Removing the code that sets mddev->private to NULL in raid0 fixes the
problem.

Fixes: 0c031fd3 (md: Move alloc/free acct bioset in to personality)
Reported-by: Fine Fan <ffan@redhat.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>

0f2571ad
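
The ordering problem in the commit above can be reproduced in miniature. The sketch below is not the md code: `mddev_model`, `takeover()` and the two `old_free_*` functions are hypothetical, and they only model the sequence "install the new private data, then call the old personality's free".

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical miniature of the structure involved. */
struct mddev_model {
	void *private;   /* per-personality data */
};

/* Buggy variant: the old personality's free also clears mddev->private. */
void old_free_buggy(struct mddev_model *m, void *old_priv)
{
	free(old_priv);
	m->private = NULL;   /* clobbers the pointer installed for the new level */
}

/* Fixed variant: only release the old personality's own data. */
void old_free_fixed(struct mddev_model *m, void *old_priv)
{
	(void)m;
	free(old_priv);
}

/* Takeover order from the message: install new data, then free the old. */
void takeover(struct mddev_model *m, void *new_priv,
	      void (*old_free)(struct mddev_model *, void *))
{
	void *old_priv = m->private;

	assert(new_priv != NULL);
	m->private = new_priv;
	old_free(m, old_priv);
}
```

With the buggy free, `m->private` ends up NULL after the takeover even though new data was just installed; with the fixed free, the new pointer survives.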

md: remove most calls to bdevname · 913cce5a

Authored by Christoph Hellwig on May 12, 2022

Use the %pg format specifier to save on stack consumption and code size.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>

913cce5a

md: protect md_unregister_thread from reentrancy · 1e267742

Authored by Guoqing Jiang on Apr 29, 2022

Generally, md_unregister_thread is called with reconfig_mutex held, but
raid_message in dm-raid doesn't hold reconfig_mutex to unregister the
thread, so md_unregister_thread can in theory be called simultaneously
from two call sites.

After the previous commit, which removes the protection of
reconfig_mutex for md_unregister_thread completely, the potential issue
could be worse than before.

Let's take pers_lock at the beginning of the function to protect it
against reentrancy.
Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>

1e267742

md: don't unregister sync_thread with reconfig_mutex held · 8b48ec23

Authored by Guoqing Jiang on Feb 13, 2021

Unregistering sync_thread doesn't need to hold reconfig_mutex since it
doesn't reconfigure the array.

And it could cause a deadlock problem for raid5 as follows:

1. process A tried to reap the sync thread with reconfig_mutex held
   after echoing idle to sync_action.
2. the raid5 sync thread was blocked if there were too many active
   stripes.
3. SB_CHANGE_PENDING was set (because write I/O came from the upper
   layer), which meant the number of active stripes couldn't be decreased.
4. SB_CHANGE_PENDING couldn't be cleared since md_check_recovery was not
   able to hold reconfig_mutex.

More details in the link:
https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t

Also add one parameter to md_reap_sync_thread since it could be called by
dm-raid, which doesn't hold reconfig_mutex.
Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <song@kernel.org>

8b48ec23

21 May, 2022 1 commit

mtip32xx: fix typo in comment · 537b9f2b

Authored by Julia Lawall on May 21, 2022

Spelling mistake (triple letters) in comment.
Detected with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20220521111145.81697-28-Julia.Lawall@inria.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>

537b9f2b

20 May, 2022 1 commit

nvme: set non-mdts limits in nvme_scan_work · 78288665

Authored by Chaitanya Kulkarni on May 18, 2022

In the current implementation we set the non-mdts limits by calling
nvme_init_non_mdts_limits() from nvme_init_ctrl_finish().
This also tries to set the limits for the discovery controller, which
has no I/O queues, resulting in the warning message reported by
nvme_log_error() when running blktests nvme/002:

[ 2005.155946] run blktests nvme/002 at 2022-04-09 16:57:47
[ 2005.192223] loop: module loaded
[ 2005.196429] nvmet: adding nsid 1 to subsystem blktests-subsystem-0
[ 2005.200334] nvmet: adding nsid 1 to subsystem blktests-subsystem-1

<------------------------------SNIP---------------------------------->

[ 2008.958108] nvmet: adding nsid 1 to subsystem blktests-subsystem-997
[ 2008.962082] nvmet: adding nsid 1 to subsystem blktests-subsystem-998
[ 2008.966102] nvmet: adding nsid 1 to subsystem blktests-subsystem-999
[ 2008.973132] nvmet: creating discovery controller 1 for subsystem nqn.2014-08.org.nvmexpress.discovery for NQN testhostnqn.
*[ 2008.973196] nvme1: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE DNR*
[ 2008.974595] nvme nvme1: new ctrl: "nqn.2014-08.org.nvmexpress.discovery"
[ 2009.103248] nvme nvme1: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"

Move the call of nvme_init_non_mdts_limits() to nvme_scan_work() after
we verify that I/O queues are created since that is a converging point
for each transport where these limits are actually used.

1. FC :
nvme_fc_create_association()
 ...
 nvme_fc_create_io_queues(ctrl);
 ...
 nvme_start_ctrl()
  nvme_scan_queue()
   nvme_scan_work()

2. PCIe:-
nvme_reset_work()
 ...
 nvme_setup_io_queues()
  nvme_create_io_queues()
   nvme_alloc_queue()
 ...
 nvme_start_ctrl()
  nvme_scan_queue()
   nvme_scan_work()

3. RDMA :-
nvme_rdma_setup_ctrl
 ...
  nvme_rdma_configure_io_queues
  ...
  nvme_start_ctrl()
   nvme_scan_queue()
    nvme_scan_work()

4. TCP :-
nvme_tcp_setup_ctrl
 ...
  nvme_tcp_configure_io_queues
  ...
  nvme_start_ctrl()
   nvme_scan_queue()
    nvme_scan_work()

* nvme_scan_work()
...
nvme_validate_or_alloc_ns()
  nvme_alloc_ns()
   nvme_update_ns_info()
    nvme_update_disk_info()
     nvme_config_discard() <---
     blk_queue_max_write_zeroes_sectors() <---
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>

78288665

19 May, 2022 1 commit

nvme: add support for TP4084 - Time-to-Ready Enhancements · 354201c5

Authored by Christoph Hellwig on May 16, 2022

Add support for using longer timeouts during controller initialization
and letting the controller come up with namespaces that are not ready
for I/O yet.  We skip these not ready namespaces during scanning and
only bring them online once another scan is kicked off by the AEN that
is set when the NRDY bit gets set in the I/O Command Set Independent
Identify Namespace Data Structure.  This asynchronous probing avoids
blocking the kernel boot when controllers take a very long time to
recover after unclean shutdowns (up to minutes).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>

354201c5

16 May, 2022 9 commits

nbd: Fix hung on disconnect request if socket is closed before · 491bf8f2

Authored by Xie Yongji on Mar 22, 2022

When userspace closes the socket before sending a disconnect
request, the following I/O requests will be blocked in
wait_for_reconnect() until the dead timeout. This will cause the
following disconnect request to also hang on blk_mq_quiesce_queue().
That means we have no way to disconnect an nbd device if there
are some I/O requests waiting to reconnect until the dead timeout.
That is not expected. So let's directly wake up the threads waiting to
reconnect when a disconnect request is sent.
Reported-by: Xu Jianhai <zero.xu@bytedance.com>
Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220322080639.142-1-xieyongji@bytedance.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

491bf8f2

nvme-fabrics: add a request timeout helper · 93ba75c9

Authored by Chaitanya Kulkarni on Mar 30, 2022

The RDMA and TCP transports both complete the timed out request in the
same manner and hence code is duplicated. Add and use the helper
nvmf_complete_timed_out_request() to remove the duplicate code.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

93ba75c9

nvme-pci: harden drive presence detect in nvme_dev_disable() · b98235d3

Authored by Stefan Roese on May 06, 2022

On our ZynqMP system we observe that an NVMe drive that resets itself
while doing a firmware update causes a kernel crash like this:

[ 67.720772] pcieport 0000:02:02.0: pciehp: Slot(2): Link Down
[ 67.720783] pcieport 0000:02:02.0: pciehp: Slot(2): Card not present
[ 67.720795] nvme 0000:04:00.0: PME# disabled
[ 67.720849] Internal error: synchronous external abort: 96000010 [#1] PREEMPT SMP
[ 67.720853] nwl-pcie fd0e0000.pcie: Slave error

Analysis: When nvme_dev_disable() is called because of this PCIe hotplug
event, pci_is_enabled() is still true. Accessing the NVMe drive,
which is currently not available because it is rebooting, causes this
"synchronous external abort" on this ARM64 platform.

This patch adds the pci_device_is_present() check as well, which returns
false in this "Card not present" hot-plug case. With this change, the
NVMe driver does not try to access the NVMe registers any more and the
FW update finishes without any problems.
Signed-off-by: Stefan Roese <sr@denx.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>

b98235d3
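
The guard described in the commit above amounts to requiring both conditions before touching device registers. The sketch below models that in userspace; `dev_model`, `read_status()` and `disable_device()` are hypothetical names, and the two booleans merely stand in for what pci_is_enabled() and pci_device_is_present() would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device model; the real driver queries the PCI core. */
struct dev_model {
	bool enabled;           /* what pci_is_enabled() would report */
	bool present;           /* what pci_device_is_present() would report */
	uint32_t csts;          /* simulated status register */
	unsigned int reg_reads; /* number of register accesses performed */
};

/* Stand-in for an MMIO read; only legal while the device is present. */
uint32_t read_status(struct dev_model *d)
{
	assert(d->present);   /* models the external abort on a missing device */
	d->reg_reads++;
	return d->csts;
}

/* Returns true if the registers were consulted during disable. */
bool disable_device(struct dev_model *d)
{
	if (d->enabled && d->present) {
		(void)read_status(d);
		return true;
	}
	return false;   /* device gone: skip all register access */
}
```

An enabled but absent device takes the second branch, so no register read is attempted during the hot-unplug window.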

nvme-pci: fix a NULL pointer dereference in nvme_alloc_admin_tags · da427611

Authored by Smith, Kyle Miller (Nimble Kernel) on Apr 22, 2022

In nvme_alloc_admin_tags, the admin_q can be set to an error (typically
-ENOMEM) if the blk_mq_init_queue call fails to set up the queue, which
is checked immediately after the call. However, when we return the error
up the stack to nvme_reset_work, the error takes us to
nvme_remove_dead_ctrl()
  nvme_dev_disable()
   nvme_suspend_queue(&dev->queues[0]).

Here, we only check that the admin_q is non-NULL, rather than not
an error or NULL, and begin quiescing a queue that never existed, leading
to bad / NULL pointer dereference.
Signed-off-by: Kyle Smith <kyles@hpe.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
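
To see why a plain non-NULL test lets an error pointer through, here is a minimal userspace sketch of the error-pointer convention. The helpers err_ptr() and is_err_or_null() are hypothetical re-implementations written for illustration, modeled on the kernel's ERR_PTR() and IS_ERR_OR_NULL(); they are not the kernel's own definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <errno.h>

#define MAX_ERRNO 4095  /* error codes occupy the top 4095 addresses */

/* Encode a negative errno value as a pointer (illustrative helper). */
static void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* True if ptr is NULL or falls in the encoded-error address range. */
static bool is_err_or_null(const void *ptr)
{
    return ptr == NULL || (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```

An encoded -ENOMEM is not NULL, so a check such as `if (q)` would treat it as a valid queue, whereas is_err_or_null() rejects it.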

da427611

nvme: mark internal passthru request RQF_QUIET · 128126a7

Committed by Chaitanya Kulkarni on Apr 19, 2022

Most of the internal passthru commands use the __nvme_submit_sync_cmd()
interface. There are a few places where we open-code the request submission:

1. nvme_keep_alive_work(struct work_struct *work)
2. nvme_timeout(struct request *req, bool reserved)
3. nvme_delete_queue(struct nvme_queue *nvmeq, u8 opcode)

Mark the internal passthru requests quiet so that we can skip the verbose
error message from nvme_log_error() in the nvme_end_req() completion path.
This is consistent with what we have in __nvme_submit_sync_cmd().
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Alan Adamson <alan.adamson@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
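
The effect of a quiet flag on the completion path can be sketched as follows. Everything here is a simplified assumption for illustration: struct fake_request, FAKE_RQF_QUIET and should_log_error() are hypothetical, and the bit value does not correspond to the kernel's RQF_QUIET definition.

```c
#include <stdbool.h>

/* Hypothetical flag bit; the kernel's RQF_QUIET has its own value. */
#define FAKE_RQF_QUIET (1u << 0)

/* Hypothetical, heavily reduced stand-in for a block layer request. */
struct fake_request {
    unsigned int rq_flags;
    int status;  /* 0 means success, anything else is an error */
};

/* Log verbosely only for failed requests that were not marked quiet. */
static bool should_log_error(const struct fake_request *req)
{
    return req->status != 0 && !(req->rq_flags & FAKE_RQF_QUIET);
}
```

A failed request that carries the quiet flag is completed without the verbose message, while an unflagged failure is still logged.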

128126a7

nvme: remove unneeded include from constants file · da3340e7

Committed by Max Gurtovoy on Apr 28, 2022

No usage of blkdev.h elements.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

da3340e7

nvme: add missing status values to verbose logging · ca2d8992

Committed by Max Gurtovoy on Apr 28, 2022

Log a few more path-related status codes.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>

ca2d8992

nvme: set dma alignment to dword · 52fde2c0

Committed by Keith Busch on May 04, 2022

The NVMe specification only requires qword alignment for segment
descriptors, and the driver already guarantees that. The spec has always
allowed user data to be dword aligned, which is what the queue's
attribute is for, so relax the alignment requirement to that value.

While we could allow byte alignment for some controllers when using
SGLs, we still need to support PRP, and that only allows dword.

Fixes: 3b2a1ebc ("nvme: set dma alignment to qword")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
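
The difference between the two alignment requirements is easiest to see when each is written as a bit mask. The sketch below assumes the usual NVMe sizes (a dword is 4 bytes, a qword is 8 bytes); the macro names and is_aligned() are hypothetical and chosen only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define DWORD_ALIGN_MASK 0x3u  /* 4-byte alignment: low two bits clear */
#define QWORD_ALIGN_MASK 0x7u  /* 8-byte alignment: low three bits clear */

/* An address satisfies an alignment mask when none of the masked bits are set. */
static bool is_aligned(uint64_t addr, uint64_t mask)
{
    return (addr & mask) == 0;
}
```

A buffer at address 0x1004 passes the dword check but fails the qword check, so relaxing the requirement to dword admits buffers that the stricter mask would have rejected.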

52fde2c0

nvme: fix interpretation of DMRSL · 1a86924e

Committed by Tom Yan on Apr 29, 2022

DMRSL is in units of logical blocks, while max_discard_sectors is
in units of Linux sectors (512 bytes).
Signed-off-by: Tom Yan <tom.ty89@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
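
The unit mismatch can be illustrated with a small conversion helper. This is a sketch under assumptions: lba_shift here means log2 of the logical block size in bytes, a Linux sector is 512 bytes (shift of 9), and the function name dmrsl_to_sectors() is hypothetical rather than taken from the driver.

```c
#include <stdint.h>

#define SECTOR_SHIFT_BITS 9  /* a Linux sector is 512 bytes */

/*
 * Convert a DMRSL value (counted in logical blocks) into 512-byte
 * sectors. lba_shift is log2 of the logical block size and is assumed
 * to be at least 9.
 */
static uint64_t dmrsl_to_sectors(uint32_t dmrsl, unsigned int lba_shift)
{
    return (uint64_t)dmrsl << (lba_shift - SECTOR_SHIFT_BITS);
}
```

With 4096-byte logical blocks (lba_shift of 12), a DMRSL of 100 corresponds to 800 sectors; treating the raw value as sectors would understate the limit by a factor of eight.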

1a86924e

10 May, 2022 3 commits

loop: remove most of the top-of-file boilerplate comment · eb04bb15

Committed by Christoph Hellwig on Apr 19, 2022

Remove the irrelevant changelogs and todo notes and just leave the SPDX
marker and the copyright notice.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220419063303.583106-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

eb04bb15

loop: add a SPDX header · f21e6e18

Committed by Christoph Hellwig on Apr 19, 2022

The copyright statement says:

"Redistribution of this file is permitted under the GNU General Public
 License." and was added by Ted in 1993, at which point GPLv2 only
 was the default Linux license.

Replace it with the usual GPLv2 only SPDX header.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220419063303.583106-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f21e6e18

loop: remove loop.h · 754d9679

Committed by Christoph Hellwig on Apr 19, 2022

Merge loop.h into loop.c as all the content is only used there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220419063303.583106-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

754d9679

04 May, 2022 4 commits

block: null_blk: Improve device creation with configfs · 49c3b926

Committed by Damien Le Moal on Apr 20, 2022

Currently, the directory name used to create a nullb device through
configfs is not used as the device name, potentially causing headaches for
users if devices are already created through the modprobe operation
with the nr_devices module parameter not set to 0. E.g. a user can do
"mkdir /sys/kernel/config/nullb/nullb0" to create a nullb device even
though /dev/nullb0 was already created by modprobe. In this case, the
configfs nullb device will be named nullb1, causing confusion for the
user.

Simplify this by using the configfs directory name as the nullb device
name, always, unless another nullb device is already using the same
name. E.g. if modprobe created nullb0, then:

$ mkdir /sys/kernel/config/nullb/nullb0
mkdir: cannot create directory '/sys/kernel/config/nullb/nullb0': File
exists

will be reported to the user.

To implement this, the function null_find_dev_by_name() is added to
check for the existence of a nullb device with the name used for a new
configfs device directory. nullb_group_make_item() uses this new
function to check if the directory name can be used as the disk name.
Finally, null_add_dev() is modified to use the device config item name
as the disk name for a new nullb device created using configfs.
The naming of devices created through modprobe remains unchanged.

Of note is that it is possible for a user to create through configfs a
nullb device with the same name as an existing device. E.g.

$ mkdir /sys/kernel/config/nullb/null

will successfully create the nullb device named "null", but this block
device will not appear under /dev/ since /dev/null already
exists.
Suggested-by: Joseph Bacik <josef@toxicpanda.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220420005718.3780004-5-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
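
The name lookup described above can be sketched as a simple list walk. This is a userspace illustration under assumptions: struct fake_nullb and find_dev_by_name() are hypothetical stand-ins and do not reproduce the data structures or locking that the driver's null_find_dev_by_name() uses.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a registered nullb device. */
struct fake_nullb {
    const char *disk_name;
    struct fake_nullb *next;
};

/* Return the device already using this name, or NULL if the name is free. */
static struct fake_nullb *find_dev_by_name(struct fake_nullb *head,
                                           const char *name)
{
    for (struct fake_nullb *nb = head; nb != NULL; nb = nb->next) {
        if (strcmp(nb->disk_name, name) == 0)
            return nb;
    }
    return NULL;
}
```

A creation path built on such a helper would refuse a new configfs directory when the lookup returns a device, which corresponds to the "File exists" error shown in the commit message.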