1. 18 March 2020, 5 commits
    • md: make sure desc_nr less than MD_SB_DISKS · 659772b2
      Authored by Yufen Yu
      commit 3b7436cc9449d5ff7fa1c1fd5bc3edb6402ff5b8 upstream.
      
      For super_90_load, we need to make sure 'desc_nr' is less than
      MD_SB_DISKS, avoiding an invalid memory access of 'sb->disks'.
      
      Fixes: 228fc7d76db6 ("md: avoid invalid memory access for array sb->dev_roles")
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
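The bounds check this fix describes can be sketched in Python. This is an illustrative model, not the kernel code; `load_disk_role` is a hypothetical name, and MD_SB_DISKS is 27 in the 0.90 superblock format:

```python
MD_SB_DISKS = 27  # capacity of the fixed-size sb->disks array (0.90 format)

def load_disk_role(disks, desc_nr):
    """Return the entry for desc_nr, rejecting out-of-range indices.

    Mirrors the idea of the fix: validate desc_nr before indexing the
    fixed-size array instead of reading past its end.
    """
    if desc_nr < 0 or desc_nr >= MD_SB_DISKS:
        return None  # invalid descriptor: caller treats the disk as unusable
    return disks[desc_nr]

disks = list(range(MD_SB_DISKS))
assert load_disk_role(disks, 3) == 3
assert load_disk_role(disks, MD_SB_DISKS) is None  # rejected, no overrun
assert load_disk_role(disks, -1) is None
```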
    • md: avoid invalid memory access for array sb->dev_roles · 3126899a
      Authored by Yufen Yu
      commit 228fc7d76db68732677230a3c64337908fd298e3 upstream.
      
      We need to guarantee that 'desc_nr' is valid before accessing the
      sb->dev_roles array.

      In addition, we should avoid .load_super always returning '0'
      when the level is LEVEL_MULTIPATH, which is not expected.
      Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
      Addresses-Coverity-ID: 1487373 ("Memory - illegal accesses")
      Fixes: 6a5cb53aaa4e ("md: no longer compare spare disk superblock events in super_load")
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
    • md: no longer compare spare disk superblock events in super_load · 432bc300
      Authored by Yufen Yu
      commit 6a5cb53aaa4ef515ddeffa04ce18b771121127b4 upstream.
      
      We have a test case as follow:
      
        mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] \
      	--assume-clean --bitmap=internal
        mdadm -S /dev/md1
        mdadm -A /dev/md1 /dev/sd[b-c] --run --force
      
        mdadm --zero /dev/sda
        mdadm /dev/md1 -a /dev/sda
      
        echo offline > /sys/block/sdc/device/state
        echo offline > /sys/block/sdb/device/state
        sleep 5
        mdadm -S /dev/md1
      
        echo running > /sys/block/sdb/device/state
        echo running > /sys/block/sdc/device/state
        mdadm -A /dev/md1 /dev/sd[a-c] --run --force
      
      When we re-add /dev/sda to the array, it starts recovery. After
      the other two disks in md1 are taken offline, the recovery is
      interrupted and the superblock update cannot be written to the
      offline disks, while the spare disk (/dev/sda) can continue to
      update its superblock.

      After stopping the array and reassembling it, we found that the
      array fails to run, with the following kernel message:
      
      [  172.986064] md: kicking non-fresh sdb from array!
      [  173.004210] md: kicking non-fresh sdc from array!
      [  173.022383] md/raid1:md1: active with 0 out of 4 mirrors
      [  173.022406] md1: failed to create bitmap (-5)
      [  173.023466] md: md1 stopped.
      
      Since both sdb and sdc have a 'sb->events' value smaller than that
      of sda, they have been kicked from the array. However, the only
      remaining disk, sda, was in the 'spare' state before the stop and
      cannot be added to the conf->mirrors[] array. In the end, the raid
      array assembly fails.

      In fact, we could use the older disk sdb or sdc to assemble the
      array. That means we should not choose the 'spare' disk as the
      fresh disk in analyze_sbs().

      To fix the problem, we do not compare superblock events when the
      disk is a spare, the same as validate_super does.
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
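The freshest-disk selection described above can be sketched in Python. This is a simplified model under stated assumptions (dicts stand in for rdev/superblock state; `pick_freshest` is a hypothetical name), not the analyze_sbs() code itself:

```python
def pick_freshest(disks):
    """Pick the reference disk by highest superblock event count,
    skipping disks whose saved state is 'spare' (as validate_super
    effectively does)."""
    freshest = None
    for d in disks:
        if d["state"] == "spare":
            continue  # a spare's events counter must not make it "freshest"
        if freshest is None or d["events"] > freshest["events"]:
            freshest = d
    return freshest

disks = [
    {"name": "sda", "state": "spare",  "events": 100},  # newest, but a spare
    {"name": "sdb", "state": "active", "events": 90},
    {"name": "sdc", "state": "active", "events": 90},
]
# sdb/sdc are older but usable; sda must not win despite more events.
assert pick_freshest(disks)["name"] == "sdb"
```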
    • md: return -ENODEV if rdev has no mddev assigned · d6d6b831
      Authored by Pawel Baldysiak
      commit c42d3240990814eec1e4b2b93fa0487fc4873aed upstream.
      
      Mdadm expects that setting drive as faulty will fail with -EBUSY only if
      this operation will cause RAID to be failed. If this happens, it will
      try to stop the array. Currently -EBUSY might also be returned if
      rdev is in the middle of the removal process, for example when
      there is a race with mdmon that has already requested the drive to
      be failed/removed.
      
      If rdev does not contain mddev, return -ENODEV instead, so the caller
      can distinguish between those two cases and behave accordingly.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
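The error-code contract described above can be sketched in Python. This is an illustrative model (dicts stand in for rdev; `set_faulty` and `would_fail_array` are hypothetical names), not the kernel's state_store() path:

```python
import errno

def set_faulty(rdev, would_fail_array):
    """Sketch of the contract: -EBUSY only when failing the drive would
    fail the RAID (so mdadm stops the array); -ENODEV when the rdev has
    already been detached from its mddev mid-removal."""
    if rdev.get("mddev") is None:
        return -errno.ENODEV   # mid-removal: no array attached any more
    if would_fail_array:
        return -errno.EBUSY    # caller will try to stop the array
    rdev["faulty"] = True
    return 0

assert set_faulty({"mddev": None}, False) == -errno.ENODEV
assert set_faulty({"mddev": "md0"}, True) == -errno.EBUSY
assert set_faulty({"mddev": "md0"}, False) == 0
```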
    • md/raid10: Fix raid10 replace hang when new added disk faulty · 52c7cc01
      Authored by Alex Wu
      commit ee37d7314a32ab6809eacc3389bad0406c69a81f upstream.
      
      [Symptom]
      
      The resync thread hangs when a newly added disk becomes faulty
      during replacement.
      
      [Root Cause]
      
      In raid10_sync_request(), we expect to issue a bio with callback
      end_sync_read(), and a bio with callback end_sync_write().
      
      In the normal case, we add the resyncing sectors to
      mddev->recovery_active when raid10_sync_request() returns, and
      subtract the resynced sectors from mddev->recovery_active when
      end_sync_write() calls end_sync_request().

      If the newly added disk, which is replacing the old disk, is set
      faulty, there is a race condition:
          1. In the first RCU-protected section, the resync thread does
             not detect that mreplace is set faulty and passes the
             condition.
          2. In the second RCU-protected section, mreplace is set faulty.
          3. But the resync thread prepares the read object first, and
             only then checks the write condition.
          4. It finds that mreplace is set faulty and does not prepare
             the write object.
      As a result, we add resync sectors but never subtract them.
      
      [How to Reproduce]
      
      This issue can be easily reproduced by the following steps:
          mdadm -C /dev/md0 --assume-clean -l 10 -n 4 /dev/sd[abcd]
          mdadm /dev/md0 -a /dev/sde
          mdadm /dev/md0 --replace /dev/sdd
          sleep 1
          mdadm /dev/md0 -f /dev/sde
      
      [How to Fix]
      
      This issue can be fixed by using local variables to record the
      results of the test conditions. Once the conditions are satisfied,
      we can be sure that we issue both a bio for the read and a bio for
      the write.

      The earlier commit 24afd80d ("md/raid10: handle recovery of
      replacement devices.") also checked whether the bio is NULL, but
      left a comment saying that it is a pointless test, so we remove
      that dummy check.
      Reported-by: Alex Chen <alexchen@synology.com>
      Reviewed-by: Allen Peng <allenpeng@synology.com>
      Reviewed-by: BingJing Chang <bingjingc@synology.com>
      Signed-off-by: Alex Wu <alexwu@synology.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
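The "snapshot the condition once" idea behind this fix can be sketched in Python. This is a heavily simplified illustration (the `plan_sync_ios` function and its return values are hypothetical), showing only that the read and its paired write are decided from one snapshot, so they are always issued, or skipped, together:

```python
def plan_sync_ios(replacement_alive):
    """Evaluate the racy condition exactly once and keep the result in a
    local; both the read and the paired write follow from that single
    snapshot. Re-checking it between the two decisions is what allowed a
    read with no matching write (leaking recovery_active sectors)."""
    do_replace = replacement_alive()   # evaluated exactly once
    if not do_replace:
        return []                      # nothing issued, nothing accounted
    return ["read", "write"]           # always issued as a pair

assert plan_sync_ios(lambda: True) == ["read", "write"]
assert plan_sync_ios(lambda: False) == []
```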
  2. 15 January 2020, 1 commit
    • dm thin: wakeup worker only when deferred bios exist · 88c0b3cc
      Authored by Jeffle Xu
      commit d256d796279de0bdc227ff4daef565aa7e80c898 upstream.
      
      Single thread fio test (read, bs=4k, ioengine=libaio, iodepth=128,
      numjobs=1) over dm-thin device has poor performance versus bare nvme
      device.
      
      Further investigation with perf indicates that queue_work_on() consumes
      over 20% CPU time when doing IO over dm-thin device. The call stack is
      as follows.
      
      - 40.57% thin_map
          + 22.07% queue_work_on
          + 9.95% dm_thin_find_block
          + 2.80% cell_defer_no_holder
            1.91% inc_all_io_entry.isra.33.part.34
          + 1.78% bio_detain.isra.35
      
      In cell_defer_no_holder(), wakeup_worker() is always called, no
      matter whether the tc->deferred_bio_list list is empty or not. In
      the single-thread IO model, this list is most likely empty. So skip
      waking up the worker thread if tc->deferred_bio_list is empty.
      
      Single thread IO performance improves from 448 MiB/s to 646 MiB/s (+44%)
      once the needless wake_worker() calls are properly skipped.
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
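The fix can be sketched in Python. This is an illustrative stand-in (the `Thin` class and its counter are hypothetical; `wake_worker` stands in for `queue_work_on()`), showing only the "wake the worker only when there is deferred work" condition:

```python
from collections import deque

class Thin:
    def __init__(self):
        self.deferred_bios = deque()
        self.wakeups = 0

    def wake_worker(self):
        self.wakeups += 1  # stands in for the costly queue_work_on()

    def cell_defer_no_holder(self, bios):
        self.deferred_bios.extend(bios)
        # The fix: only pay for a wakeup when work was actually queued.
        if self.deferred_bios:
            self.wake_worker()

t = Thin()
t.cell_defer_no_holder([])       # empty deferred list: no wakeup
assert t.wakeups == 0
t.cell_defer_no_holder(["bio"])  # work exists: wake the worker
assert t.wakeups == 1
```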
  3. 27 December 2019, 5 commits
  4. 21 December 2019, 2 commits
    • dm btree: increase rebalance threshold in __rebalance2() · 6970c159
      Authored by Hou Tao
      commit 474e559567fa631dea8fb8407ab1b6090c903755 upstream.
      
      We got the following warnings from thin_check during thin-pool setup:
      
        $ thin_check /dev/vdb
        examining superblock
        examining devices tree
          missing devices: [1, 84]
            too few entries in btree_node: 41, expected at least 42 (block 138, max_entries = 126)
        examining mapping tree
      
      The symptom is that the number of entries in one node of the
      details_info tree is less than (max_entries / 3). It can easily be
      reproduced by the following procedure:
      
        $ create a new thin pool
        $ assume the max entries of the details_info tree is 126
        $ create 127 thin devices (e.g. 1~127) to make the root node full
          and then split
        $ remove the first 43 (e.g. 1~43) thin devices to make the
          children rebalance repeatedly
        $ stop the thin pool
        $ thin_check
      
      The root cause is that the B-tree removal procedure in __rebalance2()
      doesn't guarantee the invariance: the minimal number of entries in
      non-root node should be >= (max_entries / 3).
      
      Simply fix the problem by increasing the rebalance threshold to
      make sure the number of entries in each child will be greater
      than or equal to (max_entries / 3 + 1), so no matter which
      child is used for removal, the number will still be valid.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Acked-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
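The threshold arithmetic can be sketched in Python. This is an illustrative model of the two-sibling case only (lists stand in for btree nodes; `rebalance2` is a simplified stand-in for __rebalance2()), under the assumption that merging happens when the total fits the raised threshold and redistribution otherwise:

```python
def rebalance2(left, right, max_entries):
    """Merge two siblings if their total is small enough, else
    redistribute. The fix raises the merge threshold so that after
    redistribution each child keeps at least max_entries // 3 + 1
    entries: the minimum-occupancy invariant survives one more removal
    no matter which child it hits."""
    threshold = 2 * (max_entries // 3 + 1)   # was 2 * (max_entries // 3)
    total = len(left) + len(right)
    merged = left + right
    if total <= threshold:
        return [merged]                      # merge into a single node
    half = total // 2
    return [merged[:half], merged[half:]]    # redistribute evenly

max_entries = 126
# 41 + 43 = 84 entries, as in the thin_check report: 84 <= 86, so merge.
nodes = rebalance2(list(range(41)), list(range(43)), max_entries)
assert len(nodes) == 1 and len(nodes[0]) == 84
# Above the threshold, each child keeps >= max_entries // 3 + 1 entries.
for n in rebalance2(list(range(50)), list(range(50)), max_entries):
    assert len(n) >= max_entries // 3 + 1
```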
    • dm mpath: remove harmful bio-based optimization · 980b632f
      Authored by Mike Snitzer
      commit dbaf971c9cdf10843071a60dcafc1aaab3162354 upstream.
      
      Remove the branching for the edge case where no SCSI device handler
      exists. The __map_bio_fast() method was far too limited: it only
      selected a new pathgroup or path if there was a path failure. Fix
      this by eliminating it in favor of __map_bio(). __map_bio()'s extra
      SCSI device handler specific MPATHF_PG_INIT_REQUIRED test is not in
      the fast path anyway.
      
      This change restores full path selector functionality for bio-based
      configurations that don't have a SCSI device handler. But it should
      be noted that the path selectors do have an impact on performance
      for certain networks that are extremely fast (and don't require
      frequent switching).
      
      Fixes: 8d47e659 ("dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks")
      Cc: stable@vger.kernel.org
      Reported-by: Drew Hastings <dhastings@crucialwebhost.com>
      Suggested-by: Martin Wilck <mwilck@suse.de>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 18 December 2019, 4 commits
  6. 13 December 2019, 1 commit
  7. 05 December 2019, 4 commits
    • dm raid: fix false -EBUSY when handling check/repair message · fe8e594c
      Authored by Heinz Mauelshagen
      [ Upstream commit 74694bcbdf7e28a5ad548cdda9ac56d30be00d13 ]
      
      Sending a check/repair message infrequently leads to -EBUSY instead of
      properly identifying an active resync.  This occurs because
      raid_message() is testing recovery bits in a racy way.
      
      Fix by calling decipher_sync_action() from raid_message() to properly
      identify the idle state of the RAID device.
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • dm flakey: Properly corrupt multi-page bios. · d4c637af
      Authored by Sweet Tea
      [ Upstream commit a00f5276e26636cbf72f24f79831026d2e2868e7 ]
      
      The flakey target is documented to be able to corrupt the Nth byte in
      a bio, but does not corrupt byte indices after the first biovec in the
      bio. Change the corrupting function to actually corrupt the Nth byte
      no matter in which biovec that index falls.
      
      A test device generating two-page bios, atop a flakey device configured
      to corrupt a byte index on the second page, verified both the failure
      to corrupt before this patch and the expected corruption after this
      change.
      Signed-off-by: John Dorminy <jdorminy@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
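The corrected corruption logic can be sketched in Python. This is an illustrative model (lists of bytearrays stand in for the bio's biovecs; `corrupt_nth_byte` is a hypothetical name), showing a walk over all segments rather than only the first:

```python
def corrupt_nth_byte(segments, n, value):
    """Walk every segment (biovec) of a bio-like structure and corrupt
    the n-th byte overall (1-based), wherever it falls, instead of only
    looking inside the first segment."""
    remaining = n - 1
    for seg in segments:
        if remaining < len(seg):
            seg[remaining] = value
            return True
        remaining -= len(seg)   # byte index lies in a later segment
    return False                # bio is shorter than n bytes

PAGE = 4096
bio = [bytearray(PAGE), bytearray(PAGE)]       # a two-page bio
assert corrupt_nth_byte(bio, PAGE + 10, 0xFF)  # index on the second page
assert bio[1][9] == 0xFF                       # corrupted where expected
assert bio[0] == bytearray(PAGE)               # first page untouched
```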
    • bcache: do not mark writeback_running too early · b2d84967
      Authored by Shenghui Wang
      [ Upstream commit 79b791466e525c98f6aeee9acf5726e7b27f4231 ]
      
      A fresh backing device is not attached to any cache_set, and
      has no writeback kthread created until first attached to some
      cache_set.
      
      But bch_cached_dev_writeback_init() runs
      "
      	dc->writeback_running		= true;
      	WARN_ON(test_and_clear_bit(BCACHE_DEV_WB_RUNNING,
      			&dc->disk.flags));
      "
      for any newly formatted backing devices.
      
      For a fresh standalone backing device, we can get output like the
      following even though no writeback kthread has been created:
      ------------------------
      /sys/block/bcache0/bcache# cat writeback_running
      1
      /sys/block/bcache0/bcache# cat writeback_rate_debug
      rate:		512.0k/sec
      dirty:		0.0k
      target:		0.0k
      proportional:	0.0k
      integral:	0.0k
      change:		0.0k/sec
      next io:	-15427384ms
      
      The non-zero fields are misleading, as no writeback kthread is
      alive yet.
      
      Set dc->writeback_running to false, as no writeback thread is
      created in bch_cached_dev_writeback_init().

      The writeback thread is created and woken up in
      bch_cached_dev_writeback_start(). Set dc->writeback_running to true
      before bch_writeback_queue() is called, as the writeback thread
      checks whether dc->writeback_running is true before writing back
      dirty data, and hangs if it is false.
      
      After the change, we can get the following output for a fresh standalone
      backing device:
      -----------------------
      /sys/block/bcache0/bcache$ cat writeback_running
      0
      /sys/block/bcache0/bcache# cat writeback_rate_debug
      rate:		0.0k/sec
      dirty:		0.0k
      target:		0.0k
      proportional:	0.0k
      integral:	0.0k
      change:		0.0k/sec
      next io:	0ms
      
      v1 -> v2:
        Set dc->writeback_running before bch_writeback_queue() is called.
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • bcache: do not check if debug dentry is ERR or NULL explicitly on remove · 6f48e238
      Authored by Shenghui Wang
      [ Upstream commit ae17102316550b4b230a283febe31b2a9ff30084 ]
      
      debugfs_remove and debugfs_remove_recursive will check if the dentry
      pointer is NULL or ERR, and will do nothing in that case.
      
      Remove the check in cache_set_free and bch_debug_init.
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  8. 01 December 2019, 2 commits
  9. 24 November 2019, 3 commits
  10. 06 November 2019, 3 commits
    • bcache: fix input overflow to writeback_rate_minimum · 437de041
      Authored by Coly Li
      [ Upstream commit dab71b2db98dcdd4657d151b01a7be88ce10f9d1 ]
      
      dc->writeback_rate_minimum is an unsigned integer variable. It is
      set via the sysfs interface and converted from the input string to
      an unsigned integer by d_strtoul_nonzero(). When the converted
      input value is larger than UINT_MAX, an unsigned integer overflow
      happens.

      This patch fixes the overflow by using sysfs_strtoul_clamp() to
      convert the input string and limit the value to the range
      [1, UINT_MAX], so the overflow is avoided.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
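The clamping behavior can be sketched in Python. This is a minimal illustrative model of parse-then-clamp (the function name `strtoul_clamp` is hypothetical; it only mirrors the [1, UINT_MAX] range the commit describes):

```python
UINT_MAX = 2**32 - 1

def strtoul_clamp(s, lo=1, hi=UINT_MAX):
    """Parse a sysfs-style decimal string and clamp the result to
    [lo, hi], so an oversized input cannot wrap a 32-bit variable."""
    v = int(s)
    return max(lo, min(v, hi))

assert strtoul_clamp("4096") == 4096
assert strtoul_clamp("0") == 1                # clamped up to the minimum
assert strtoul_clamp(str(2**40)) == UINT_MAX  # no 32-bit wraparound
```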
    • dm snapshot: rework COW throttling to fix deadlock · a8afda77
      Authored by Mikulas Patocka
      [ Upstream commit b21555786f18cd77f2311ad89074533109ae3ffa ]
      
      Commit 721b1d98fb517a ("dm snapshot: Fix excessive memory usage and
      workqueue stalls") introduced a semaphore to limit the maximum number of
      in-flight kcopyd (COW) jobs.
      
      The implementation of this throttling mechanism is prone to a deadlock:
      
      1. One or more threads write to the origin device causing COW, which is
         performed by kcopyd.
      
      2. At some point some of these threads might reach the s->cow_count
         semaphore limit and block in down(&s->cow_count), holding a read lock
         on _origins_lock.
      
      3. Someone tries to acquire a write lock on _origins_lock, e.g.,
         snapshot_ctr(), which blocks because the threads at step (2) already
         hold a read lock on it.
      
      4. A COW operation completes and kcopyd runs dm-snapshot's completion
         callback, which ends up calling pending_complete().
         pending_complete() tries to resubmit any deferred origin bios. This
         requires acquiring a read lock on _origins_lock, which blocks.
      
         This happens because the read-write semaphore implementation gives
         priority to writers, meaning that as soon as a writer tries to enter
         the critical section, no readers will be allowed in, until all
         writers have completed their work.
      
         So, pending_complete() waits for the writer at step (3) to acquire
         and release the lock. This writer waits for the readers at step (2)
         to release the read lock and those readers wait for
         pending_complete() (the kcopyd thread) to signal the s->cow_count
         semaphore: DEADLOCK.
      
      The above was thoroughly analyzed and documented by Nikos Tsironis as
      part of his initial proposal for fixing this deadlock, see:
      https://www.redhat.com/archives/dm-devel/2019-October/msg00001.html
      
      Fix this deadlock by reworking COW throttling so that it waits without
      holding any locks. Add a variable 'in_progress' that counts how many
      kcopyd jobs are running. A function wait_for_in_progress() will sleep if
      'in_progress' is over the limit. It drops _origins_lock in order to
      avoid the deadlock.
      Reported-by: Guruswamy Basavaiah <guru2018@gmail.com>
      Reported-by: Nikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
      Tested-by: Nikos Tsironis <ntsironis@arrikto.com>
      Fixes: 721b1d98fb51 ("dm snapshot: Fix excessive memory usage and workqueue stalls")
      Cc: stable@vger.kernel.org # v5.0+
      Depends-on: 4a3f111a73a8c ("dm snapshot: introduce account_start_copy() and account_end_copy()")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
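The reworked throttle can be sketched in Python. This is an illustrative model only (a `CowThrottle` class with a condition variable stands in for the 'in_progress' counter and wait_for_in_progress(); the real code additionally drops _origins_lock before sleeping):

```python
import threading

class CowThrottle:
    """A plain counter plus a condition variable. A writer over the
    limit sleeps here WITHOUT holding any other lock, so kcopyd
    completions can always make progress and wake the waiters."""
    def __init__(self, limit):
        self.limit = limit
        self.in_progress = 0
        self.cond = threading.Condition()

    def start_copy(self):
        with self.cond:
            while self.in_progress >= self.limit:
                self.cond.wait()      # sleeps with no external lock held
            self.in_progress += 1

    def end_copy(self):
        with self.cond:
            self.in_progress -= 1
            self.cond.notify()        # wake one throttled writer

t = CowThrottle(limit=2)
t.start_copy(); t.start_copy()
assert t.in_progress == 2
t.end_copy()                          # one kcopyd job completes
t.start_copy()                        # a waiter would now be admitted
assert t.in_progress == 2
```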
    • dm snapshot: introduce account_start_copy() and account_end_copy() · 223f1af6
      Authored by Mikulas Patocka
      [ Upstream commit a2f83e8b0c82c9500421a26c49eb198b25fcdea3 ]
      
      This simple refactoring moves code for modifying the semaphore cow_count
      into separate functions to prepare for changes that will extend these
      methods to provide for a more sophisticated mechanism for COW
      throttling.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  11. 29 October 2019, 2 commits
    • dm cache: fix bugs when a GFP_NOWAIT allocation fails · e49c84c5
      Authored by Mikulas Patocka
      commit 13bd677a472d534bf100bab2713efc3f9e3f5978 upstream.
      
      A GFP_NOWAIT allocation can fail at any time: it doesn't wait for
      memory to become available, and it fails if the mempool is
      exhausted and there is not enough free memory.
      
      If we go down this path:
        map_bio -> mg_start -> alloc_migration -> mempool_alloc(GFP_NOWAIT)
      we can see that map_bio() doesn't check the return value of mg_start(),
      and the bio is leaked.
      
      If we go down this path:
        map_bio -> mg_start -> mg_lock_writes -> alloc_prison_cell ->
        dm_bio_prison_alloc_cell_v2 -> mempool_alloc(GFP_NOWAIT) ->
        mg_lock_writes -> mg_complete
      the bio is ended with an error - it is unacceptable because it could
      cause filesystem corruption if the machine ran out of memory
      temporarily.
      
      Change GFP_NOWAIT to GFP_NOIO, so that the mempool code will properly
      wait until memory becomes available. mempool_alloc with GFP_NOIO can't
      fail, so remove the code paths that deal with allocation failure.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • md/raid0: fix warning message for parameter default_layout · 9457994a
      Authored by Song Liu
      [ Upstream commit 3874d73e06c9b9dc15de0b7382fc223986d75571 ]
      
      The message should match the parameter, i.e. raid0.default_layout.
      
      Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
      Cc: NeilBrown <neilb@suse.de>
      Reported-by: Ivan Topolsky <doktor.yak@gmail.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  12. 05 October 2019, 8 commits
    • md/raid0: avoid RAID0 data corruption due to layout confusion. · bbe3e205
      Authored by NeilBrown
      [ Upstream commit c84a1372df929033cb1a0441fb57bd3932f39ac9 ]
      
      If the drives in a RAID0 are not all the same size, the array is
      divided into zones.
      The first zone covers all drives, to the size of the smallest.
      The second zone covers all drives larger than the smallest, up to
      the size of the second smallest - etc.
      
      A change in Linux 3.14 unintentionally changed the layout for the
      second and subsequent zones.  All the correct data is still stored, but
      each chunk may be assigned to a different device than in pre-3.14 kernels.
      This can lead to data corruption.
      
      It is not possible to determine automatically which layout to use:
      it depends on which kernel the data was written by.
      So we add a module parameter to allow the old (0) or new (1) layout
      to be specified, and refuse to assemble an affected array if that
      parameter is not set.
      
      Fixes: 20d0189b ("block: Introduce new bio_split()")
      cc: stable@vger.kernel.org (3.14+)
      Acked-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
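The zone layout described above can be sketched in Python. This is an illustrative model of the zone boundaries only (the `raid0_zones` function and its dict fields are hypothetical; it says nothing about the pre- vs post-3.14 chunk-to-device mapping that the parameter selects):

```python
def raid0_zones(drive_sizes):
    """Compute the zones: zone 0 spans all drives up to the smallest
    size; each later zone spans the drives larger than the previous
    boundary, up to the next distinct size."""
    zones = []
    prev = 0
    for size in sorted(set(drive_sizes)):
        members = [d for d in drive_sizes if d > prev]
        zones.append({"width": len(members), "sectors": size - prev})
        prev = size
    return zones

zones = raid0_zones([100, 200, 200, 400])
assert zones[0] == {"width": 4, "sectors": 100}  # all drives, to smallest
assert zones[1] == {"width": 3, "sectors": 100}  # drives larger than 100
assert zones[2] == {"width": 1, "sectors": 200}  # only the largest drive
```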
    • md: only call set_in_sync() when it is expected to succeed. · 5dc86e95
      Authored by NeilBrown
      commit 480523feae581ab714ba6610388a3b4619a2f695 upstream.
      
      Since commit 4ad23a97 ("MD: use per-cpu counter for
      writes_pending"), set_in_sync() is substantially more expensive: it
      can wait for a full RCU grace period which can be 10s of milliseconds.
      
      So we should only call it when the cost is justified.
      
      md_check_recovery() currently calls set_in_sync() every time it finds
      anything to do (on non-external active arrays).  For an array
      performing resync or recovery, this will be quite often.
      Each call introduces a delay to the md thread, which can noticeably
      affect IO submission latency.

      In md_check_recovery() we only need to call set_in_sync() if
      'safemode' was non-zero on entry, meaning that there has been no
      recent IO. So we save this "safemode was non-zero" state, and only
      call set_in_sync() if it was non-zero.
      
      This measurably reduces mean and maximum IO submission latency during
      resync/recovery.
      Reported-and-tested-by: Jack Wang <jinpu.wang@cloud.ionos.com>
      Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
      Cc: stable@vger.kernel.org (v4.12+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • md: don't report active array_state until after revalidate_disk() completes. · 598a2cda
      Authored by NeilBrown
      commit 9d4b45d6af442237560d0bb5502a012baa5234b7 upstream.
      
      Until revalidate_disk() has completed, the size of a new md array will
      appear to be zero.
      So we shouldn't report, through array_state, that the array is
      active until that time.
      udev rules check array_state to see if the array is ready. As soon
      as it appears to be active, fsck can be run. If fsck finds the size
      to be zero, it will fail.
      
      So add a new flag to provide an interlock between do_md_run() and
      array_state_show().  This flag is set while do_md_run() is active and
      it prevents array_state_show() from reporting that the array is
      active.
      
      Before do_md_run() is called, ->pers will be NULL so array is
      definitely not active.
      After do_md_run() is called, revalidate_disk() will have run and the
      array will be completely ready.
      
      We also move various sysfs_notify*() calls out of md_run() into
      do_md_run() after MD_NOT_READY is cleared.  This ensure the
      information is ready before the notification is sent.
      
      Prior to v4.12, array_state_show() was called with the
      mddev->reconfig_mutex held, which provided exclusion with do_md_run().
      
      Note that MD_NOT_READY is cleared twice. This is deliberate, to
      cover both the success and error paths with minimal noise.
      
      Fixes: b7b17c9b ("md: remove mddev_lock() from md_attr_show()")
      Cc: stable@vger.kernel.org (v4.12++)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • md/raid6: Set R5_ReadError when there is read failure on parity disk · e8323e0d
      Authored by Xiao Ni
      commit 143f6e733b73051cd22dcb80951c6c929da413ce upstream.
      
      Commit 7471fb77 ("md/raid6: Fix anomily when recovering a single
      device in RAID6.") avoids rereading P when it can be computed from
      the other members. However, this misses the chance to re-write the
      right data to P. This patch sets R5_ReadError if the re-read fails.

      Also, when the re-read is skipped, we miss the chance to reset
      rdev->read_errors to 0. That can fail the disk when there are many
      read errors on the P member disk (while the other disks have no
      read errors).

      V2: upper-layer read requests don't read the parity/Q data, so
      there is no need to consider that situation.
      Reported-by: kbuild test robot <lkp@intel.com>
      
      Fixes: 7471fb77 ("md/raid6: Fix anomily when recovering a single device in RAID6.")
      Cc: <stable@vger.kernel.org> #4.4+
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • blk-mq: add callback of .cleanup_rq · 4ec3ca27
      Authored by Ming Lei
      [ Upstream commit 226b4fc75c78f9c497c5182d939101b260cfb9f3 ]
      
      SCSI maintains its own driver-private data hooked off of each SCSI
      request, and that private data won't be freed after scsi_queue_rq()
      returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver
      (e.g. dm-rq) may need to retry these SCSI requests, before SCSI has
      fully dispatched them, due to a lower level SCSI driver's resource
      limitation identified in scsi_queue_rq(). Currently SCSI's per-request
      private data is leaked when the upper layer driver (dm-rq) frees and
      then retries these requests in response to BLK_STS_RESOURCE or
      BLK_STS_DEV_RESOURCE returns from scsi_queue_rq().
      
      This usecase is so specialized that it doesn't warrant training an
      existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to
      account for freeing its driver private data -- doing so would add an
      extra branch for handling a special case that all other consumers of
      SCSI (and blk-mq) won't ever need to worry about.
      
      So the most pragmatic way forward is to delegate freeing SCSI driver
      private data to the upper layer driver (dm-rq).  Do so by adding
      new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method
      from dm-rq.  A following commit will implement the .cleanup_rq() hook
      in scsi_mq_ops.
      
      Cc: Ewan D. Milne <emilne@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: <stable@vger.kernel.org>
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • raid5: don't increment read_errors on EILSEQ return · 0a43d5d4
      Authored by Nigel Croxon
      [ Upstream commit b76b4715eba0d0ed574f58918b29c1b2f0fa37a8 ]
      
      MD counts read errors returned by the lower layer; if those errors
      are -EILSEQ, instead of -EIO, it should NOT increase the
      read_errors count.

      When RAID6 is set up on a dm-integrity target that detects massive
      corruption, the leg will be ejected from the array, even if the
      issue is correctable with a sector re-write and the array has the
      necessary redundancy to correct it.

      The leg is ejected because it runs rdev->read_errors up beyond
      conf->max_nr_stripes. The return status in dm-crypt when there is
      a data integrity error is -EILSEQ (BLK_STS_PROTECTION).
      Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • G
      raid5: don't set STRIPE_HANDLE to stripe which is in batch list · a5443cd2
      Submitted by Guoqing Jiang
      [ Upstream commit 6ce220dd2f8ea71d6afc29b9a7524c12e39f374a ]
      
      If a stripe in a batch list is set with the STRIPE_HANDLE flag,
      the stripe could be set with STRIPE_ACTIVE by the handle_stripe
      function.  And if an error happens to the batch_head at the
      same time, break_stripe_batch_list is called, then the warning
      below could happen (the same report as in [1]); it means a
      member of the batch list was set with STRIPE_ACTIVE.
      
      [7028915.431770] stripe state: 2001
      [7028915.431815] ------------[ cut here ]------------
      [7028915.431828] WARNING: CPU: 18 PID: 29089 at drivers/md/raid5.c:4614 break_stripe_batch_list+0x203/0x240 [raid456]
      [...]
      [7028915.431879] CPU: 18 PID: 29089 Comm: kworker/u82:5 Tainted: G           O    4.14.86-1-storage #4.14.86-1.2~deb9
      [7028915.431881] Hardware name: Supermicro SSG-2028R-ACR24L/X10DRH-iT, BIOS 3.1 06/18/2018
      [7028915.431888] Workqueue: raid5wq raid5_do_work [raid456]
      [7028915.431890] task: ffff9ab0ef36d7c0 task.stack: ffffb72926f84000
      [7028915.431896] RIP: 0010:break_stripe_batch_list+0x203/0x240 [raid456]
      [7028915.431898] RSP: 0018:ffffb72926f87ba8 EFLAGS: 00010286
      [7028915.431900] RAX: 0000000000000012 RBX: ffff9aaa84a98000 RCX: 0000000000000000
      [7028915.431901] RDX: 0000000000000000 RSI: ffff9ab2bfa15458 RDI: ffff9ab2bfa15458
      [7028915.431902] RBP: ffff9aaa8fb4e900 R08: 0000000000000001 R09: 0000000000002eb4
      [7028915.431903] R10: 00000000ffffffff R11: 0000000000000000 R12: ffff9ab1736f1b00
      [7028915.431904] R13: 0000000000000000 R14: ffff9aaa8fb4e900 R15: 0000000000000001
      [7028915.431906] FS:  0000000000000000(0000) GS:ffff9ab2bfa00000(0000) knlGS:0000000000000000
      [7028915.431907] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [7028915.431908] CR2: 00007ff953b9f5d8 CR3: 0000000bf4009002 CR4: 00000000003606e0
      [7028915.431909] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [7028915.431910] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [7028915.431910] Call Trace:
      [7028915.431923]  handle_stripe+0x8e7/0x2020 [raid456]
      [7028915.431930]  ? __wake_up_common_lock+0x89/0xc0
      [7028915.431935]  handle_active_stripes.isra.58+0x35f/0x560 [raid456]
      [7028915.431939]  raid5_do_work+0xc6/0x1f0 [raid456]
      
      Also commit 59fc630b ("RAID5: batch adjacent full stripe write")
      said "If a stripe is added to batch list, then only the first stripe
      of the list should be put to handle_list and run handle_stripe."
      
      So don't set STRIPE_HANDLE on a stripe which is already in a
      batch list; otherwise the stripe could be put on handle_list
      and run through handle_stripe, and the above warning could be
      triggered.
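      The rule this enforces, only the batch head may be scheduled,
      can be sketched as a tiny model (struct, flag, and function name
      are illustrative stand-ins for raid5's real ones):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model: a stripe hanging off another stripe's batch_head must not
 * get STRIPE_HANDLE, because handle_stripe() could then mark it
 * STRIPE_ACTIVE and break_stripe_batch_list() would warn. Only the
 * head of a batch (batch_head pointing at itself) or a lone stripe
 * may be scheduled.
 */
struct stripe_model {
	struct stripe_model *batch_head; /* NULL, or head of the batch */
	unsigned long state;
};

#define STRIPE_HANDLE (1UL << 0)

static void maybe_set_handle(struct stripe_model *sh)
{
	/* Members of a batch are processed through their head. */
	if (sh->batch_head && sh->batch_head != sh)
		return;
	sh->state |= STRIPE_HANDLE;
}
```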
      
      [1]. https://www.spinics.net/lists/raid/msg62552.html
      Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      a5443cd2
    • Y
      md/raid1: fail run raid1 array when active disk less than one · f1db7562
      Submitted by Yufen Yu
      [ Upstream commit 07f1a6850c5d5a65c917c3165692b5179ac4cb6b ]
      
      When running the test case:
        mdadm -CR /dev/md1 -l 1 -n 4 /dev/sd[a-d] --assume-clean --bitmap=internal
        mdadm -S /dev/md1
        mdadm -A /dev/md1 /dev/sd[b-c] --run --force
      
        mdadm --zero /dev/sda
        mdadm /dev/md1 -a /dev/sda
      
        echo offline > /sys/block/sdc/device/state
        echo offline > /sys/block/sdb/device/state
        sleep 5
        mdadm -S /dev/md1
      
        echo running > /sys/block/sdb/device/state
        echo running > /sys/block/sdc/device/state
        mdadm -A /dev/md1 /dev/sd[a-c] --run --force
      
      mdadm fails to run, with kernel messages as follows:
      [  172.986064] md: kicking non-fresh sdb from array!
      [  173.004210] md: kicking non-fresh sdc from array!
      [  173.022383] md/raid1:md1: active with 0 out of 4 mirrors
      [  173.022406] md1: failed to create bitmap (-5)
      
      In fact, when the raid1 array has less than one active disk, we
      need to return failure from raid1_run().
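      The check added to raid1_run() boils down to a simple count (the
      function name and parameters below are illustrative, not the
      kernel's):

```c
#include <assert.h>

#define EINVAL 22

/*
 * Model of the added check: with every mirror failed, i.e.
 * active = raid_disks - degraded == 0, the array must not start, so
 * later steps such as bitmap creation never run against a dead
 * array (the -5 bitmap failure in the log above).
 */
static int raid1_active_check(int raid_disks, int degraded)
{
	if (raid_disks - degraded < 1)
		return -EINVAL;	/* refuse to run the array */
	return 0;
}
```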
      Reviewed-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NYufen Yu <yuyufen@huawei.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f1db7562