提交 · abfc426d1b2fb2176df59851a64223b58ddae7e7 · openeuler / Kernel

04 2月, 2022 1 次提交

block: pass a block_device to bio_clone_fast · abfc426d

由 Christoph Hellwig 提交于 2月 02, 2022

Pass a block_device to bio_clone_fast and __bio_clone_fast and give
the functions more suitable names.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMike Snitzer <snitzer@redhat.com>
Link: https://lore.kernel.org/r/20220202160109.108149-14-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

abfc426d

02 2月, 2022 2 次提交

block: pass a block_device and opf to bio_init · 49add496

由 Christoph Hellwig 提交于 1月 24, 2022

Pass the block_device that we plan to use this bio for and the
operation to bio_init to optimize the assignment. A NULL block_device
can be passed, both for the passthrough case on a raw request_queue and
to temporarily avoid refactoring some nasty code.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-19-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

49add496

block: pass a block_device and opf to bio_alloc_bioset · 609be106

由 Christoph Hellwig 提交于 1月 24, 2022

Pass the block_device and operation that we plan to use this bio for to
bio_alloc_bioset to optimize the assigment. NULL/0 can be passed, both
for the passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.

Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-16-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

609be106

07 1月, 2022 4 次提交

md: use default_groups in kobj_type · 1745e857

由 Greg Kroah-Hartman 提交于 1月 06, 2022

There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field.  Move the md rdev sysfs code to use default_groups field which
has been the preferred way since commit aa30f47c ("kobject: Add
support for default attribute groups to kobj_type") so that we can soon
get rid of the obsolete default_attrs field.

Cc: Song Liu <song@kernel.org>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NSong Liu <song@kernel.org>

1745e857

md: Move alloc/free acct bioset in to personality · 0c031fd3

由 Xiao Ni 提交于 12月 10, 2021

bioset acct is only needed for raid0 and raid5. Therefore, md_run only
allocates it for raid0 and raid5. However, this does not cover
personality takeover, which may cause uninitialized bioset. For example,
the following repro steps:

  mdadm -CR /dev/md0 -l1 -n2 /dev/loop0 /dev/loop1
  mdadm --wait /dev/md0
  mkfs.xfs /dev/md0
  mdadm /dev/md0 --grow -l5
  mount /dev/md0 /mnt

causes panic like:

[  225.933939] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  225.934903] #PF: supervisor instruction fetch in kernel mode
[  225.935639] #PF: error_code(0x0010) - not-present page
[  225.936361] PGD 0 P4D 0
[  225.936677] Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI
[  225.937525] CPU: 27 PID: 1133 Comm: mount Not tainted 5.16.0-rc3+ #706
[  225.938416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.module_el8.4.0+547+a85d02ba 04/01/2014
[  225.939922] RIP: 0010:0x0
[  225.940289] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[  225.941196] RSP: 0018:ffff88815897eff0 EFLAGS: 00010246
[  225.941897] RAX: 0000000000000000 RBX: 0000000000092800 RCX: ffffffff81370a39
[  225.942813] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000092800
[  225.943772] RBP: 1ffff1102b12fe04 R08: fffffbfff0b43c01 R09: fffffbfff0b43c01
[  225.944807] R10: ffffffff85a1e007 R11: fffffbfff0b43c00 R12: ffff88810eaaaf58
[  225.945757] R13: 0000000000000000 R14: ffff88810eaaafb8 R15: ffff88815897f040
[  225.946709] FS:  00007ff3f2505080(0000) GS:ffff888fb5e00000(0000) knlGS:0000000000000000
[  225.947814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  225.948556] CR2: ffffffffffffffd6 CR3: 000000015aa5a006 CR4: 0000000000370ee0
[  225.949537] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  225.950455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  225.951414] Call Trace:
[  225.951787]  <TASK>
[  225.952120]  mempool_alloc+0xe5/0x250
[  225.952625]  ? mempool_resize+0x370/0x370
[  225.953187]  ? rcu_read_lock_sched_held+0xa1/0xd0
[  225.953862]  ? rcu_read_lock_bh_held+0xb0/0xb0
[  225.954464]  ? sched_clock_cpu+0x15/0x120
[  225.955019]  ? find_held_lock+0xac/0xd0
[  225.955564]  bio_alloc_bioset+0x1ed/0x2a0
[  225.956080]  ? lock_downgrade+0x3a0/0x3a0
[  225.956644]  ? bvec_alloc+0xc0/0xc0
[  225.957135]  bio_clone_fast+0x19/0x80
[  225.957651]  raid5_make_request+0x1370/0x1b70
[  225.958286]  ? sched_clock_cpu+0x15/0x120
[  225.958797]  ? __lock_acquire+0x8b2/0x3510
[  225.959339]  ? raid5_get_active_stripe+0xce0/0xce0
[  225.959986]  ? lock_is_held_type+0xd8/0x130
[  225.960528]  ? rcu_read_lock_sched_held+0xa1/0xd0
[  225.961135]  ? rcu_read_lock_bh_held+0xb0/0xb0
[  225.961703]  ? sched_clock_cpu+0x15/0x120
[  225.962232]  ? lock_release+0x27a/0x6c0
[  225.962746]  ? do_wait_intr_irq+0x130/0x130
[  225.963302]  ? lock_downgrade+0x3a0/0x3a0
[  225.963815]  ? lock_release+0x6c0/0x6c0
[  225.964348]  md_handle_request+0x342/0x530
[  225.964888]  ? set_in_sync+0x170/0x170
[  225.965397]  ? blk_queue_split+0x133/0x150
[  225.965988]  ? __blk_queue_split+0x8b0/0x8b0
[  225.966524]  ? submit_bio_checks+0x3b2/0x9d0
[  225.967069]  md_submit_bio+0x127/0x1c0
[...]

Fix this by moving alloc/free of acct bioset to pers->run and pers->free.

While we are on this, properly handle md_integrity_register() error in
raid0_run().

Fixes: daee2024 (md: check level before create and exit io_acct_set)
Cc: stable@vger.kernel.org
Acked-by: NGuoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NSong Liu <song@kernel.org>

0c031fd3

md: fix spelling of "its" · dd3dc5f4

由 Randy Dunlap 提交于 12月 25, 2021

Use the possessive "its" instead of the contraction "it's"
in printed messages.
Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
Cc: Song Liu <song@kernel.org>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NSong Liu <song@kernel.org>

dd3dc5f4

md: add support for REQ_NOWAIT · f51d46d0

由 Vishal Verma 提交于 12月 21, 2021

commit 021a2446 ("block: add QUEUE_FLAG_NOWAIT") added support
for checking whether a given bdev supports handling of REQ_NOWAIT or not.
Since then commit 6abc4946 ("dm: add support for REQ_NOWAIT and enable
it for linear target") added support for REQ_NOWAIT for dm. This uses
a similar approach to incorporate REQ_NOWAIT for md based bios.

This patch was tested using t/io_uring tool within FIO. A nvme drive
was partitioned into 2 partitions and a simple raid 0 configuration
/dev/md0 was created.

md0 : active raid0 nvme4n1p1[1] nvme4n1p2[0]
      937423872 blocks super 1.2 512k chunks

Before patch:

$ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100

Running top while the above runs:

$ ps -eL | grep $(pidof io_uring)

  38396   38396 pts/2    00:00:00 io_uring
  38396   38397 pts/2    00:00:15 io_uring
  38396   38398 pts/2    00:00:13 iou-wrk-38397

We can see iou-wrk-38397 io worker thread created which gets created
when io_uring sees that the underlying device (/dev/md0 in this case)
doesn't support nowait.

After patch:

$ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100

Running top while the above runs:

$ ps -eL | grep $(pidof io_uring)

  38341   38341 pts/2    00:10:22 io_uring
  38341   38342 pts/2    00:10:37 io_uring

After running this patch, we don't see any io worker thread
being created which indicated that io_uring saw that the
underlying device does support nowait. This is the exact behaviour
noticed on a dm device which also supports nowait.

For all the other raid personalities except raid0, we would need
to train pieces which involves make_request fn in order for them
to correctly handle REQ_NOWAIT.
Reviewed-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NVishal Verma <vverma@digitalocean.com>
Signed-off-by: NSong Liu <song@kernel.org>

f51d46d0

11 12月, 2021 2 次提交

md: fix double free of mddev->private in autorun_array() · 07641b5f

由 zhangyue 提交于 11月 16, 2021

In driver/md/md.c, if the function autorun_array() is called,
the problem of double free may occur.

In function autorun_array(), when the function do_md_run() returns an
error, the function do_md_stop() will be called.

The function do_md_run() called function md_run(), but in function
md_run(), the pointer mddev->private may be freed.

The function do_md_stop() called the function __md_stop(), but in
function __md_stop(), the pointer mddev->private also will be freed
without judging null.

At this time, the pointer mddev->private will be double free, so it
needs to be judged null or not.
Signed-off-by: Nzhangyue <zhangyue1@kylinos.cn>
Signed-off-by: NSong Liu <songliubraving@fb.com>

07641b5f

md: fix update super 1.0 on rdev size change · 55df1ce0

由 Markus Hochholdinger 提交于 11月 16, 2021

The superblock of version 1.0 doesn't get moved to the new position on a
device size change. This leads to a rdev without a superblock on a known
position, the raid can't be re-assembled.

The line was removed by mistake and is re-added by this patch.

Fixes: d9c0fa50 ("md: fix max sectors calculation for super 1.0")
Cc: stable@vger.kernel.org
Signed-off-by: NMarkus Hochholdinger <markus@hochholdinger.net>
Reviewed-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

55df1ce0

29 11月, 2021 1 次提交

block: remove GENHD_FL_EXT_DEVT · 1ebe2e5f

由 Christoph Hellwig 提交于 11月 22, 2021

All modern drivers can support extra partitions using the extended
dev_t.  In fact except for the ioctl method drivers never even see
partitions in normal operation.

So remove the GENHD_FL_EXT_DEVT and allow extra partitions for all
block devices that do support partitions, and require those that
do not support partitions to explicit disallow them using
GENHD_FL_NO_PART.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-12-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

1ebe2e5f

19 10月, 2021 7 次提交

md: update superblock after changing rdev flags in state_store · 8b9e2291

由 Xiao Ni 提交于 10月 13, 2021

When the in memory flag is changed, we need to persist the change in the
rdev superblock flags. This is needed for "writemostly" and "failfast".
Reviewed-by: NLi Feng <fengli@smartx.com>
Signed-off-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8b9e2291

md: remove unused argument from md_new_event · 54679486

由 Guoqing Jiang 提交于 10月 04, 2021

Actually, mddev is not used by md_new_event.
Signed-off-by: NGuoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

54679486

md: properly unwind when failing to add the kobject in md_alloc · 7ad10691

由 Christoph Hellwig 提交于 9月 01, 2021

Add proper error handling to delete the gendisk when failing to add
the md kobject and clean up the error unwinding in general.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7ad10691

md: extend disks_mutex coverage · 94f3cd7d

由 Christoph Hellwig 提交于 9月 01, 2021

disks_mutex is intended to serialize md_alloc.  Extended it to also cover
the kobject_uevent call and getting the sysfs dirent to help reducing
error handling complexity.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

94f3cd7d

md: add the bitmap group to the default groups for the md kobject · 51238e7f

由 Christoph Hellwig 提交于 9月 01, 2021

Replace the deprecated default_attrs with the default_groups mechanism,
and add the always visible bitmap group to the groups created add
kobject_add time.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

51238e7f

md: add error handling support for add_disk() · 9be68dd7

由 Luis Chamberlain 提交于 9月 01, 2021

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.

We just do the unwinding of what was not done before, and are
sure to unlock prior to bailing.
Signed-off-by: NLuis Chamberlain <mcgrof@kernel.org>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9be68dd7

md: use bdev_nr_sectors instead of open coding it · 0fe80347

由 Christoph Hellwig 提交于 10月 18, 2021

Use the proper helper to read the block device size.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKees Cook <keescook@chromium.org>
Acked-by: NSong Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20211018101130.1838532-7-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

0fe80347

18 10月, 2021 3 次提交

block: switch polling to be bio based · 3e08773c

由 Christoph Hellwig 提交于 10月 12, 2021

Replace the blk_poll interface that requires the caller to keep a queue
and cookie from the submissions with polling based on the bio.

Polling for the bio itself leads to a few advantages:

 - the cookie construction can made entirely private in blk-mq.c
 - the caller does not need to remember the request_queue and cookie
   separately and thus sidesteps their lifetime issues
 - keeping the device and the cookie inside the bio allows to trivially
   support polling BIOs remapping by stacking drivers
 - a lot of code to propagate the cookie back up the submission path can
   be removed entirely.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NMark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

3e08773c

block: move integrity handling out of <linux/blkdev.h> · fe45e630

由 Christoph Hellwig 提交于 9月 20, 2021

Split the integrity/metadata handling definitions out into a new header.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-17-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

fe45e630

block: drop unused includes in <linux/genhd.h> · b81e0c23

由 Christoph Hellwig 提交于 9月 20, 2021

Drop various include not actually used in genhd.h itself, and
move the remaning includes closer together.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-15-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

b81e0c23

22 9月, 2021 1 次提交

md: fix a lock order reversal in md_alloc · 7df835a3

由 Christoph Hellwig 提交于 9月 01, 2021

Commit b0140891 ("md: Fix race when creating a new md device.")
not only moved assigning mddev->gendisk before calling add_disk, which
fixes the races described in the commit log, but also added a
mddev->open_mutex critical section over add_disk and creation of the
md kobj.  Adding a kobject after add_disk is racy vs deleting the gendisk
right after adding it, but md already prevents against that by holding
a mddev->active reference.

On the other hand taking this lock added a lock order reversal with what
is not disk->open_mutex (used to be bdev->bd_mutex when the commit was
added) for partition devices, which need that lock for the internal open
for the partition scan, and a recent commit also takes it for
non-partitioned devices, leading to further lockdep splatter.

Fixes: b0140891 ("md: Fix race when creating a new md device.")
Fixes: d6263387 ("block: support delayed holder registration")
Reported-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Tested-by: syzbot+fadc0aaf497e6a493b9f@syzkaller.appspotmail.com
Reviewed-by: NNeilBrown <neilb@suse.de>
Signed-off-by: NSong Liu <songliubraving@fb.com>

7df835a3

15 6月, 2021 5 次提交

md: add comments in md_integrity_register · de3ea66e

由 Guoqing Jiang 提交于 6月 03, 2021

Given it is not obvious for the error handling, let's try to add some
comments here to make it clear.
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: NSong Liu <song@kernel.org>

de3ea66e

md: check level before create and exit io_acct_set · daee2024

由 Guoqing Jiang 提交于 6月 03, 2021

The bio_set (io_acct_set) is used by personalities to clone bio and
trace the timestamp of bio. Some personalities such as raid1/10 don't
need the bio_set, so add check to not create it unconditionally.

Also update the comment for md_account_bio to make it more clear.
Suggested-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: NSong Liu <song@kernel.org>

daee2024

md: Constify attribute_group structs · c32dc040

由 Rikard Falkeborn 提交于 5月 29, 2021

The attribute_group structs are never modified, they're only passed to
sysfs_create_group() and sysfs_remove_group(). Make them const to allow
the compiler to put them in read-only memory.
Signed-off-by: NRikard Falkeborn <rikard.falkeborn@gmail.com>
Signed-off-by: NSong Liu <song@kernel.org>

c32dc040

md: add io accounting for raid0 and raid5 · 10764815

由 Guoqing Jiang 提交于 5月 25, 2021

We introduce a new bioset (io_acct_set) for raid0 and raid5 since they
don't own clone infrastructure to accounting io. And the bioset is added
to mddev instead of to raid0 and raid5 layer, because with this way, we
can put common functions to md.h and reuse them in raid0 and raid5.

Also struct md_io_acct is added accordingly which includes io start_time,
the origin bio and cloned bio. Then we can call bio_{start,end}_io_acct
to get related io status.
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: NSong Liu <song@kernel.org>

10764815

md: revert io stats accounting · ad3fc798

由 Guoqing Jiang 提交于 5月 25, 2021

The commit 41d2d848 ("md: improve io stats accounting") could cause
double fault problem per the report [1], and also it is not correct to
change ->bi_end_io if md don't own it, so let's revert it.

And io stats accounting will be replemented in later commits.

[1]. https://lore.kernel.org/linux-raid/3bf04253-3fad-434a-63a7-20214e38cf26@gmail.com/T/#t

Fixes: 41d2d848 ("md: improve io stats accounting")
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: NSong Liu <song@kernel.org>

ad3fc798

01 6月, 2021 1 次提交

md: convert to blk_alloc_disk/blk_cleanup_disk · 0f1d2e06

由 Christoph Hellwig 提交于 5月 21, 2021

Convert the md driver to use the blk_alloc_disk and blk_cleanup_disk
helpers to simplify gendisk and request_queue allocation.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NUlf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210521055116.1053587-15-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

0f1d2e06

24 4月, 2021 1 次提交

md-cluster: fix use-after-free issue when removing rdev · f7c7a2f9

由 Heming Zhao 提交于 4月 08, 2021

md_kick_rdev_from_array will remove rdev, so we should
use rdev_for_each_safe to search list.

How to trigger:

env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns).

```
node2=192.168.0.3

for i in {1..20}; do
    echo ==== $i `date` ====;

    mdadm -Ss && ssh ${node2} "mdadm -Ss"
    wipefs -a /dev/sda /dev/sdb

    mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
       /dev/sdb --assume-clean
    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
    mdadm --wait /dev/md0
    ssh ${node2} "mdadm --wait /dev/md0"

    mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
    sleep 1
done
```

Crash stack:

```
stack segment: 0000 [#1] SMP
... ...
RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod]
... ...
RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207
RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50
RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246
RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800
FS:  0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 raid1d+0x5c/0xd40 [raid1]
 ? finish_task_switch+0x75/0x2a0
 ? lock_timer_base+0x67/0x80
 ? try_to_del_timer_sync+0x4d/0x80
 ? del_timer_sync+0x41/0x50
 ? schedule_timeout+0x254/0x2d0
 ? md_start_sync+0xe0/0xe0 [md_mod]
 ? md_thread+0x127/0x160 [md_mod]
 md_thread+0x127/0x160 [md_mod]
 ? wait_woken+0x80/0x80
 kthread+0x10d/0x130
 ? kthread_park+0xa0/0xa0
 ret_from_fork+0x1f/0x40
```

Fixes: dbb64f86 ("md-cluster: Fix adding of new disk with new reload code")
Fixes: 659b254f ("md-cluster: remove a disk asynchronously from cluster environment")
Cc: stable@vger.kernel.org
Reviewed-by: NGang He <ghe@suse.com>
Signed-off-by: NHeming Zhao <heming.zhao@suse.com>
Signed-off-by: NSong Liu <song@kernel.org>

f7c7a2f9

16 4月, 2021 3 次提交

md: do not return existing mddevs from mddev_find_or_alloc · 0d809b38

由 Christoph Hellwig 提交于 4月 12, 2021

Instead of returning an existing mddev, just for it to be discarded
later directly return -EEXIST.  Rename the function to mddev_alloc now
that it doesn't find an existing mddev.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSong Liu <song@kernel.org>

0d809b38

md: refactor mddev_find_or_alloc · d144fe6f

由 Christoph Hellwig 提交于 4月 12, 2021

Allocate the new mddev first speculatively, which greatly simplifies
the code flow.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSong Liu <song@kernel.org>

d144fe6f

md: factor out a mddev_alloc_unit helper from mddev_find · 85c8c3c1

由 Christoph Hellwig 提交于 4月 12, 2021

Split out a self contained helper to find a free minor for the md
"unit" number.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSong Liu <song@kernel.org>

85c8c3c1

08 4月, 2021 3 次提交

md: split mddev_find · 65aa97c4

由 Christoph Hellwig 提交于 4月 03, 2021

Split mddev_find into a simple mddev_find that just finds an existing
mddev by the unit number, and a more complicated mddev_find that deals
with find or allocating a mddev.

This turns out to fix this bug reported by Zhao Heming.

----------------------------- snip ------------------------------
commit d3374825 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating & removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With currently code logic, there will be very easy to trigger
soft lockup in non-preempt env.

*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt

*** script ***

about trigger 1 time with 10 tests

`1  node1="15sp3-mdcluster1"
2  node2="15sp3-mdcluster2"
3
4  mdadm -Ss
5  ssh ${node2} "mdadm -Ss"
6  wipefs -a /dev/sda /dev/sdb
7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
   /dev/sdb --assume-clean
8
9  for i in {1..100}; do
10    echo ==== $i ====;
11
12    echo "test  ...."
13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14    sleep 1
15
16    echo "clean  ....."
17    ssh ${node2} "mdadm -Ss"
18 done
`
I use mdcluster env to trigger soft lockup, but it isn't mdcluster
speical bug. To stop md array in mdcluster env will do more jobs than
non-cluster array, which will leave enough time/gap to allow kernel to
run md_open.

*** stack ***

`ID: 2831   TASK: ffff8dd7223b5040  CPU: 0   COMMAND: "mdadm"
 #0 [ffffa15d00a13b90] __schedule at ffffffffb8f1935f
 #1 [ffffa15d00a13ba8] exact_lock at ffffffffb8a4a66d
 #2 [ffffa15d00a13bb0] kobj_lookup at ffffffffb8c62fe3
 #3 [ffffa15d00a13c28] __blkdev_get at ffffffffb89273b9
 #4 [ffffa15d00a13c98] blkdev_get at ffffffffb8927964
 #5 [ffffa15d00a13cb0] do_dentry_open at ffffffffb88dc4b4
 #6 [ffffa15d00a13ce0] path_openat at ffffffffb88f0ccc
 #7 [ffffa15d00a13db8] do_filp_open at ffffffffb88f32bb
 #8 [ffffa15d00a13ee0] do_sys_open at ffffffffb88ddc7d
 #9 [ffffa15d00a13f38] do_syscall_64 at ffffffffb86053cb ffffffffb900008c

or:
[  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
[  884.226515]  md_open+0x3c/0xe0 [md_mod]
[  884.226518]  __blkdev_get+0x30d/0x710
[  884.226520]  ? bd_acquire+0xd0/0xd0
[  884.226522]  blkdev_get+0x14/0x30
[  884.226524]  do_dentry_open+0x204/0x3a0
[  884.226531]  path_openat+0x2fc/0x1520
[  884.226534]  ? seq_printf+0x4e/0x70
[  884.226536]  do_filp_open+0x9b/0x110
[  884.226542]  ? md_release+0x20/0x20 [md_mod]
[  884.226543]  ? seq_read+0x1d8/0x3e0
[  884.226545]  ? kmem_cache_alloc+0x18a/0x270
[  884.226547]  ? do_sys_open+0x1bd/0x260
[  884.226548]  do_sys_open+0x1bd/0x260
[  884.226551]  do_syscall_64+0x5b/0x1e0
[  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
`
*** rootcause ***

"mdadm -A" (or other array assemble commands) will start a daemon "mdadm
--monitor" by default. When "mdadm -Ss" is running, the stop action will
wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
info from /proc/mdstat. This time mddev in kernel still exist, so
/proc/mdstat still show md device, which makes "mdadm --monitor" to open
/dev/md0.

The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
open action will trigger md_open which is creating action. Racing is
happening.

`<thread 1>: "mdadm -Ss"
md_release
  mddev_put deletes mddev from all_mddevs
  queue_work for mddev_delayed_delete
  at this time, "/dev/md0" is still available for opening

<thread 2>: "mdadm --monitor ..."
md_open
 + mddev_find can't find mddev of /dev/md0, and create a new mddev and
 |    return.
 + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
      -ERESTARTSYS.
`
In non-preempt kernel, <thread 2> is occupying on current CPU. and
mddev_delayed_delete which was created in <thread 1> also can't be
schedule.

In preempt kernel, it can also trigger above racing. But kernel doesn't
allow one thread running on a CPU all the time. after <thread 2> running
some time, the later "mdadm -A" (refer above script line 13) will call
md_alloc to alloc a new gendisk for mddev. it will break md_open
statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller,
the soft lockup is broken.
------------------------------ snip ------------------------------

Cc: stable@vger.kernel.org
Fixes: d3374825 ("md: make devices disappear when they are no longer needed.")
Reported-by: NHeming Zhao <heming.zhao@suse.com>
Reviewed-by: NHeming Zhao <heming.zhao@suse.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSong Liu <song@kernel.org>

65aa97c4

md: factor out a mddev_find_locked helper from mddev_find · 8b57251f

由 Christoph Hellwig 提交于 4月 03, 2021

Factor out a self-contained helper to just lookup a mddev by the dev_t
"unit".

Cc: stable@vger.kernel.org
Reviewed-by: NHeming Zhao <heming.zhao@suse.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSong Liu <song@kernel.org>

8b57251f

md: md_open returns -EBUSY when entering racing area · 6a4db2a6

由 Zhao Heming 提交于 4月 03, 2021

commit d3374825 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creating & removing. The
md_open shouldn't create mddev when all_mddevs list doesn't contain
mddev. With currently code logic, there will be very easy to trigger
soft lockup in non-preempt env.

This patch changes md_open returning from -ERESTARTSYS to -EBUSY, which
will break the infinitely retry when md_open enter racing area.

This patch is partly fix soft lockup issue, full fix needs mddev_find
is split into two functions: mddev_find & mddev_find_or_alloc. And
md_open should call new mddev_find (it only does searching job).

For more detail, please refer with Christoph's "split mddev_find" patch
in later commits.

*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt

*** script ***

about trigger every time with below script

```
1  node1="mdcluster1"
2  node2="mdcluster2"
3
4  mdadm -Ss
5  ssh ${node2} "mdadm -Ss"
6  wipefs -a /dev/sda /dev/sdb
7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
   /dev/sdb --assume-clean
8
9  for i in {1..10}; do
10    echo ==== $i ====;
11
12    echo "test  ...."
13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14    sleep 1
15
16    echo "clean  ....."
17    ssh ${node2} "mdadm -Ss"
18 done
```

I use mdcluster env to trigger soft lockup, but it isn't mdcluster
speical bug. To stop md array in mdcluster env will do more jobs than
non-cluster array, which will leave enough time/gap to allow kernel to
run md_open.

*** stack ***

```
[  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
[  884.226515]  md_open+0x3c/0xe0 [md_mod]
[  884.226518]  __blkdev_get+0x30d/0x710
[  884.226520]  ? bd_acquire+0xd0/0xd0
[  884.226522]  blkdev_get+0x14/0x30
[  884.226524]  do_dentry_open+0x204/0x3a0
[  884.226531]  path_openat+0x2fc/0x1520
[  884.226534]  ? seq_printf+0x4e/0x70
[  884.226536]  do_filp_open+0x9b/0x110
[  884.226542]  ? md_release+0x20/0x20 [md_mod]
[  884.226543]  ? seq_read+0x1d8/0x3e0
[  884.226545]  ? kmem_cache_alloc+0x18a/0x270
[  884.226547]  ? do_sys_open+0x1bd/0x260
[  884.226548]  do_sys_open+0x1bd/0x260
[  884.226551]  do_syscall_64+0x5b/0x1e0
[  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
```

*** rootcause ***

"mdadm -A" (or other array assemble commands) will start a daemon "mdadm
--monitor" by default. When "mdadm -Ss" is running, the stop action will
wakeup "mdadm --monitor". The "--monitor" daemon will immediately get
info from /proc/mdstat. This time mddev in kernel still exist, so
/proc/mdstat still show md device, which makes "mdadm --monitor" to open
/dev/md0.

The previously "mdadm -Ss" is removing action, the "mdadm --monitor"
open action will trigger md_open which is creating action. Racing is
happening.

```
<thread 1>: "mdadm -Ss"
md_release
  mddev_put deletes mddev from all_mddevs
  queue_work for mddev_delayed_delete
  at this time, "/dev/md0" is still available for opening

<thread 2>: "mdadm --monitor ..."
md_open
 + mddev_find can't find mddev of /dev/md0, and create a new mddev and
 |    return.
 + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
      -ERESTARTSYS.
```

In non-preempt kernel, <thread 2> is occupying on current CPU. and
mddev_delayed_delete which was created in <thread 1> also can't be
schedule.

In preempt kernel, it can also trigger above racing. But kernel doesn't
allow one thread running on a CPU all the time. after <thread 2> running
some time, the later "mdadm -A" (refer above script line 13) will call
md_alloc to alloc a new gendisk for mddev. it will break md_open
statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller,
the soft lockup is broken.

Cc: stable@vger.kernel.org
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NZhao Heming <heming.zhao@suse.com>
Signed-off-by: NSong Liu <song@kernel.org>

6a4db2a6

25 3月, 2021 2 次提交

md: Fix missing unused status line of /proc/mdstat · 7abfabaf

由 Jan Glauber 提交于 3月 17, 2021

Reading /proc/mdstat with a read buffer size that would not
fit the unused status line in the first read will skip this
line from the output.

So 'dd if=/proc/mdstat bs=64 2>/dev/null' will not print something
like: unused devices: <none>

Don't return NULL immediately in start() for v=2 but call
show() once to print the status line also for multiple reads.

Cc: stable@vger.kernel.org
Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code and interface")
Signed-off-by: NJan Glauber <jglauber@digitalocean.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

7abfabaf

md: add md_submit_discard_bio() for submitting discard bio · cf78408f

由 Xiao Ni 提交于 2月 04, 2021

Move these logic from raid0.c to md.c, so that we can also use it in
raid10.c.
Reviewed-by: NColy Li <colyli@suse.de>
Reviewed-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Tested-by: NAdrian Huang <ahuang12@lenovo.com>
Signed-off-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

cf78408f

02 2月, 2021 2 次提交

md: use rdev_read_only in restart_array · a42e0d70

由 Christoph Hellwig 提交于 2月 01, 2021

Make the read-only check in restart_array identical to the other two
read-only checks.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a42e0d70

md: check for NULL ->meta_bdev before calling bdev_read_only · d7a47838

由 Christoph Hellwig 提交于 2月 01, 2021

->meta_bdev is optional and not set for most arrays.  Add a
rdev_read_only helper that calls bdev_read_only for both devices
in a safe way.

Fixes: 6f0d9689 ("block: remove the NULL bdev check in bdev_read_only")
Reported-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d7a47838

28 1月, 2021 2 次提交

md: remove md_bio_alloc_sync · 6a596569

由 Christoph Hellwig 提交于 1月 26, 2021

md_bio_alloc_sync is never called with a NULL mddev, and ->sync_set is
initialized in md_run, so it always must be initialized as well.  Just
open code the remaining call to bio_alloc_bioset.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NSong Liu <song@kernel.org>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

6a596569

md: simplify sync_page_io · 32637385

由 Christoph Hellwig 提交于 1月 26, 2021

Use an on-stack bio and biovec for the single page synchronous I/O.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NSong Liu <song@kernel.org>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

32637385

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功