提交 · a22a9602b88fabf10847f238ff81fde5f906fef7 · openeuler / Kernel

28 8月, 2019 1 次提交

md: don't report active array_state until after revalidate_disk() completes. · 9d4b45d6

由 NeilBrown 提交于 8月 20, 2019

Until revalidate_disk() has completed, the size of a new md array will
appear to be zero.
So we shouldn't report, through array_state, that the array is active
until that time.
udev rules check array_state to see if the array is ready.  As soon as
it appear to be zero, fsck can be run.  If it find the size to be
zero, it will fail.

So add a new flag to provide an interlock between do_md_run() and
array_state_show().  This flag is set while do_md_run() is active and
it prevents array_state_show() from reporting that the array is
active.

Before do_md_run() is called, ->pers will be NULL so array is
definitely not active.
After do_md_run() is called, revalidate_disk() will have run and the
array will be completely ready.

We also move various sysfs_notify*() calls out of md_run() into
do_md_run() after MD_NOT_READY is cleared.  This ensure the
information is ready before the notification is sent.

Prior to v4.12, array_state_show() was called with the
mddev->reconfig_mutex held, which provided exclusion with do_md_run().

Note that MD_NOT_READY cleared twice.  This is deliberate to cover
both success and error paths with minimal noise.

Fixes: b7b17c9b ("md: remove mddev_lock() from md_attr_show()")
Cc: stable@vger.kernel.org (v4.12++)
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

9d4b45d6

08 8月, 2019 1 次提交

md: allow last device to be forcibly removed from RAID1/RAID10. · 9a567843

由 Guoqing Jiang 提交于 7月 24, 2019

When the 'last' device in a RAID1 or RAID10 reports an error,
we do not mark it as failed.  This would serve little purpose
as there is no risk of losing data beyond that which is obviously
lost (as there is with RAID5), and there could be other sectors
on the device which are readable, and only readable from this device.
This in general this maximises access to data.

However the current implementation also stops an admin from removing
the last device by direct action.  This is rarely useful, but in many
case is not harmful and can make automation easier by removing special
cases.

Also, if an attempt to write metadata fails the device must be marked
as faulty, else an infinite loop will result, attempting to update
the metadata on all non-faulty devices.

So add 'fail_last_dev' member to 'struct mddev', then we can bypasses
the 'last disk' checks for RAID1 and RAID10, and control the behavior
per array by change sysfs node.
Signed-off-by: NNeilBrown <neilb@suse.de>
[add sysfs node for fail_last_dev by Guoqing]
Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

9a567843

21 6月, 2019 2 次提交

md: introduce mddev_create/destroy_wb_pool for the change of member device · 963c555e

由 Guoqing Jiang 提交于 6月 14, 2019

Previously, we called rdev_init_wb to avoid potential data
inconsistency when array is created.

Now, we need to call the function and create mempool if a
device is added or just be flaged as "writemostly". So
mddev_create_wb_pool is introduced and called accordingly.
And for safety reason, we mark implicit GFP_NOIO allocation
scope for create mempool during mddev_suspend/mddev_resume.

And mempool should be removed conversely after remove a
member device or its's "writemostly" flag, which is done
by call mddev_destroy_wb_pool.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

963c555e

md/raid1: fix potential data inconsistency issue with write behind device · 3e148a32

由 Guoqing Jiang 提交于 6月 19, 2019

For write-behind mode, we think write IO is complete once it has
reached all the non-writemostly devices. It works fine for single
queue devices.

But for multiqueue device, if there are lots of IOs come from upper
layer, then the write-behind device could issue those IOs to different
queues, depends on the each queue's delay, so there is no guarantee
that those IOs can arrive in order.

To address the issue, we need to check the collision among write
behind IOs, we can only continue without collision, otherwise wait
for the completion of previous collisioned IO.

And WBCollision is introduced for multiqueue device which is worked
under write-behind mode.

But this patch doesn't handle below cases which could have the data
inconsistency issue as well, these cases will be handled in later
patches.

1. modify max_write_behind by write backlog node.
2. add or remove array's bitmap dynamically.
3. the change of member disk.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

3e148a32

24 5月, 2019 1 次提交

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 47 · af1a8899

由 Thomas Gleixner 提交于 5月 20, 2019

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 or at your option any
later version you should have received a copy of the gnu general
public license for example usr src linux copying if not write to the
free software foundation inc 675 mass ave cambridge ma 02139 usa

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 20 file(s).
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NAllison Randal <allison@lohutok.net>
Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.552543146@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

af1a8899

02 4月, 2019 2 次提交

md: batch flush requests. · 2bc13b83

由 NeilBrown 提交于 3月 29, 2019

Currently if many flush requests are submitted to an md device is quick
succession, they are serialized and can take a long to process them all.
We don't really need to call flush all those times - a single flush call
can satisfy all requests submitted before it started.
So keep track of when the current flush started and when it finished,
allow any pending flush that was requested before the flush started
to complete without waiting any more.

Test results from Xiao:

Test is done on a raid10 device which is created by 4 SSDs. The tool is
dbench.

1. The latest linux stable kernel
  Operation                Count    AvgLat    MaxLat
  --------------------------------------------------
  Deltree                    768    10.509    78.305
  Flush                  2078376     0.013    10.094
  Close                  21787697     0.019    18.821
  LockX                    96580     0.007     3.184
  Mkdir                      384     0.008     0.062
  Rename                 1255883     0.191    23.534
  ReadX                  46495589     0.020    14.230
  WriteX                 14790591     7.123    60.706
  Unlink                 5989118     0.440    54.551
  UnlockX                  96580     0.005     2.736
  FIND_FIRST             10393845     0.042    12.079
  SET_FILE_INFORMATION   2415558     0.129    10.088
  QUERY_FILE_INFORMATION 4711725     0.005     8.462
  QUERY_PATH_INFORMATION 26883327     0.032    21.715
  QUERY_FS_INFORMATION   4929409b     0.010     8.238
  NTCreateX              29660080     0.100    53.268

Throughput 1034.88 MB/sec (sync open)  128 clients  128 procs
max_latency=60.712 ms

2. With patch1 "Revert "MD: fix lock contention for flush bios""
  Operation                Count    AvgLat    MaxLat
  --------------------------------------------------
  Deltree                    256     8.326    36.761
  Flush                   693291     3.974   180.269
  Close                  7266404     0.009    36.929
  LockX                    32160     0.006     0.840
  Mkdir                      128     0.008     0.021
  Rename                  418755     0.063    29.945
  ReadX                  15498708     0.007     7.216
  WriteX                 4932310    22.482   267.928
  Unlink                 1997557     0.109    47.553
  UnlockX                  32160     0.004     1.110
  FIND_FIRST             3465791     0.036     7.320
  SET_FILE_INFORMATION    805825     0.015     1.561
  QUERY_FILE_INFORMATION 1570950     0.005     2.403
  QUERY_PATH_INFORMATION 8965483     0.013    14.277
  QUERY_FS_INFORMATION   1643626     0.009     3.314
  NTCreateX              9892174     0.061    41.278

Throughput 345.009 MB/sec (sync open)  128 clients  128 procs
max_latency=267.939 m

3. With patch1 and patch2
  Operation                Count    AvgLat    MaxLat
  --------------------------------------------------
  Deltree                    768     9.570    54.588
  Flush                  2061354     0.666    15.102
  Close                  21604811     0.012    25.697
  LockX                    95770     0.007     1.424
  Mkdir                      384     0.008     0.053
  Rename                 1245411     0.096    12.263
  ReadX                  46103198     0.011    12.116
  WriteX                 14667988     7.375    60.069
  Unlink                 5938936     0.173    30.905
  UnlockX                  95770     0.005     4.147
  FIND_FIRST             10306407     0.041    11.715
  SET_FILE_INFORMATION   2395987     0.048     7.640
  QUERY_FILE_INFORMATION 4672371     0.005     9.291
  QUERY_PATH_INFORMATION 26656735     0.018    19.719
  QUERY_FS_INFORMATION   4887940     0.010     7.654
  NTCreateX              29410811     0.059    28.551

Throughput 1026.21 MB/sec (sync open)  128 clients  128 procs
max_latency=60.075 ms

Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2bc13b83

Revert "MD: fix lock contention for flush bios" · 4bc034d3

由 NeilBrown 提交于 3月 29, 2019

This reverts commit 5a409b4f.

This patch has two problems.

1/ it make multiple calls to submit_bio() from inside a make_request_fn.
 The bios thus submitted will be queued on current->bio_list and not
 submitted immediately.  As the bios are allocated from a mempool,
 this can theoretically result in a deadlock - all the pool of requests
 could be in various ->bio_list queues and a subsequent mempool_alloc
 could block waiting for one of them to be released.

2/ It aims to handle a case when there are many concurrent flush requests.
  It handles this by submitting many requests in parallel - all of which
  are identical and so most of which do nothing useful.
  It would be more efficient to just send one lower-level request, but
  allow that to satisfy multiple upper-level requests.

Fixes: 5a409b4f ("MD: fix lock contention for flush bios")
Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4bc034d3

19 10月, 2018 1 次提交

md-cluster/raid10: support add disk under grow mode · 7564beda

由 Guoqing Jiang 提交于 10月 18, 2018

For clustered raid10 scenario, we need to let all the nodes
know about that a new disk is added to the array, and the
reshape caused by add new member just need to be happened in
one node, but other nodes should know about the change.

Since reshape means read data from somewhere (which is already
used by array) and write data to unused region. Obviously, it
is awful if one node is reading data from address while another
node is writing to the same address. Considering we have
implemented suspend writes in the resyncing area, so we can
just broadcast the reading address to other nodes to avoid the
trouble.

For master node, it would call reshape_request then update sb
during the reshape period. To avoid above trouble, we call
resync_info_update to send RESYNC message in reshape_request.

Then from slave node's view, it receives two type messages:
1. RESYNCING message
Slave node add the address (where master node reading data from)
to suspend list.

2. METADATA_UPDATED message
Once slave nodes know the reshaping is started in master node,
it is time to update reshape position and call start_reshape to
follow master node's step. After reshape is done, only reshape
position is need to be updated, so the majority task of reshaping
is happened on the master node.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

7564beda

06 7月, 2018 1 次提交

md-cluster: show array's status more accurate · 0357ba27

由 Guoqing Jiang 提交于 7月 02, 2018

When resync or recovery is happening in one node,
other nodes don't show the appropriate info now.

For example, when create an array in master node
without "--assume-clean", then assemble the array
in slave nodes, you can see "resync=PENDING" when
read /proc/mdstat in slave nodes. However, the info
is confusing since "PENDING" status is introduced
for start array in read-only mode.

We introduce RESYNCING_REMOTE flag to indicate that
resync thread is running in remote node. The flags
is set when node receive RESYNCING msg. And we clear
the REMOTE flag in following cases:

1. resync or recover is finished in master node,
   which means slaves receive msg with both lo
   and hi are set to 0.
2. node continues resync/recovery in recover_bitmaps.
3. when resync_finish is called.

Then we show accurate information in status_resync
by check REMOTE flags and with other conditions.
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

0357ba27

31 5月, 2018 1 次提交

md: convert to bioset_init()/mempool_init() · afeee514

由 Kent Overstreet 提交于 5月 20, 2018

Convert md to embedded bio sets.
Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

afeee514

22 5月, 2018 1 次提交

MD: fix lock contention for flush bios · 5a409b4f

由 Xiao Ni 提交于 5月 21, 2018

There is a lock contention when there are many processes which send flush bios
to md device. eg. Create many lvs on one raid device and mkfs.xfs on each lv.

Now it just can handle flush request sequentially. It needs to wait mddev->flush_bio
to be NULL, otherwise get mddev->lock.

This patch remove mddev->flush_bio and handle flush bio asynchronously.
I did a test with command dbench -s 128 -t 300. This is the test result:

=================Without the patch============================
 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                    11165   167.595  5879.560
 Close                   107469     1.391  2231.094
 LockX                      384     0.003     0.019
 Rename                    5944     2.141  1856.001
 ReadX                   208121     0.003     0.074
 WriteX                   98259  1925.402 15204.895
 Unlink                   25198    13.264  3457.268
 UnlockX                    384     0.001     0.009
 FIND_FIRST               47111     0.012     0.076
 SET_FILE_INFORMATION     12966     0.007     0.065
 QUERY_FILE_INFORMATION   27921     0.004     0.085
 QUERY_PATH_INFORMATION  124650     0.005     5.766
 QUERY_FS_INFORMATION     22519     0.003     0.053
 NTCreateX               141086     4.291  2502.812

Throughput 3.7181 MB/sec (sync open)  128 clients  128 procs  max_latency=15204.905 ms

=================With the patch============================
 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                     4500   174.134   406.398
 Close                    48195     0.060   467.062
 LockX                      256     0.003     0.029
 Rename                    2324     0.026     0.360
 ReadX                    78846     0.004     0.504
 WriteX                   66832   562.775  1467.037
 Unlink                    5516     3.665  1141.740
 UnlockX                    256     0.002     0.019
 FIND_FIRST               16428     0.015     0.313
 SET_FILE_INFORMATION      6400     0.009     0.520
 QUERY_FILE_INFORMATION   17865     0.003     0.089
 QUERY_PATH_INFORMATION   47060     0.078   416.299
 QUERY_FS_INFORMATION      7024     0.004     0.032
 NTCreateX                55921     0.854  1141.452

Throughput 11.744 MB/sec (sync open)  128 clients  128 procs  max_latency=1467.041 ms

The test is done on raid1 disk with two rotational disks

V5: V4 is more complicated than the version with memory pool. So revert to the memory pool
version

V4: use address of fbio to do hash to choose free flush info.
V3:
Shaohua suggests mempool is overkill. In v3 it allocs memory during creating raid device
and uses a simple bitmap to record which resource is free.

Fix a bug from v2. It should set flush_pending to 1 at first.

V2:
Neil pointed out two problems. One is counting error problem and another is return value
when allocat memory fails.
1. counting error problem
This isn't safe.  It is only safe to call rdev_dec_pending() on rdevs
that you previously called
                          atomic_inc(&rdev->nr_pending);
If an rdev was added to the list between the start and end of the flush,
this will do something bad.

Now it doesn't use bio_chain. It uses specified call back function for each
flush bio.
2. Returned on IO error when kmalloc fails is wrong.
I use mempool suggested by Neil in V2
3. Fixed some places pointed by Guoqing
Suggested-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NShaohua Li <shli@fb.com>

5a409b4f

19 2月, 2018 1 次提交

md: fix md_write_start() deadlock w/o metadata devices · 4b6c1060

由 Heinz Mauelshagen 提交于 2月 02, 2018

If no metadata devices are configured on raid1/4/5/6/10
(e.g. via dm-raid), md_write_start() unconditionally waits
for superblocks to be written thus deadlocking.

Fix introduces mddev->has_superblocks bool, defines it in md_run()
and checks for it in md_write_start() to conditionally avoid waiting.

Once on it, check for non-existing superblocks in md_super_write().

Link: https://bugzilla.kernel.org/show_bug.cgi?id=198647
Fixes: cc27b0c7 ("md: fix deadlock between mddev_suspend() and md_write_start()")
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NShaohua Li <sh.li@alibaba-inc.com>

4b6c1060

16 1月, 2018 1 次提交

raid5-ppl: PPL support for disks with write-back cache enabled · 1532d9e8

由 Tomasz Majchrzak 提交于 12月 27, 2017

In order to provide data consistency with PPL for disks with write-back
cache enabled all data has to be flushed to disks before next PPL
entry. The disks to be flushed are marked in the bitmap. It's modified
under a mutex and it's only read after PPL io unit is submitted.

A limitation of 64 disks in the array has been introduced to keep data
structures and implementation simple. RAID5 arrays with so many disks are
not likely due to high risk of multiple disks failure. Such restriction
should not be a real life limitation.

With write-back cache disabled next PPL entry is submitted when data write
for current one completes. Data flush defers next log submission so trigger
it when there are no stripes for handling found.

As PPL assures all data is flushed to disk at request completion, just
acknowledge flush request when PPL is enabled.
Signed-off-by: NTomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: NShaohua Li <sh.li@alibaba-inc.com>

1532d9e8

12 12月, 2017 1 次提交

md: introduce new personality funciton start() · d5d885fd

由 Song Liu 提交于 11月 19, 2017

In do_md_run(), md threads should not wake up until the array is fully
initialized in md_run(). However, in raid5_run(), raid5-cache may wake
up mddev->thread to flush stripes that need to be written back. This
design doesn't break badly right now. But it could lead to bad bug in
the future.

This patch tries to resolve this problem by splitting start up work
into two personality functions, run() and start(). Tasks that do not
require the md threads should go into run(), while task that require
the md threads go into start().

r5l_load_log() is moved to raid5_start(), so it is not called until
the md threads are started in do_md_run().
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

d5d885fd

02 11月, 2017 3 次提交

md: use lockdep_assert_held · efa4b77b

由 Shaohua Li 提交于 10月 18, 2017

lockdep_assert_held is a better way to assert lock held, and it works
for UP.
Signed-off-by: NShaohua Li <shli@fb.com>

efa4b77b

md: remove special meaning of ->quiesce(.., 2) · b03e0ccb

由 NeilBrown 提交于 10月 19, 2017

The '2' argument means "wake up anything that is waiting".
This is an inelegant part of the design and was added
to help support management of suspend_lo/suspend_hi setting.
Now that suspend_lo/hi is managed in mddev_suspend/resume,
that need is gone.
These is still a couple of places where we call 'quiesce'
with an argument of '2', but they can safely be changed to
call ->quiesce(.., 1); ->quiesce(.., 0) which
achieve the same result at the small cost of pausing IO
briefly.

This removes a small "optimization" from suspend_{hi,lo}_store,
but it isn't clear that optimization served a useful purpose.
The code now is a lot clearer.
Suggested-by: NShaohua Li <shli@kernel.org>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

b03e0ccb

md: allow metadata update while suspending. · 35bfc521

由 NeilBrown 提交于 10月 17, 2017

There are various deadlocks that can occur
when a thread holds reconfig_mutex and calls
->quiesce(mddev, 1).
As some write request block waiting for
metadata to be updated (e.g. to record device
failure), and as the md thread updates the metadata
while the reconfig mutex is held, holding the mutex
can stop write requests completing, and this prevents
->quiesce(mddev, 1) from completing.

->quiesce() is now usually called from mddev_suspend(),
and it is always called with reconfig_mutex held.  So
at this time it is safe for the thread to update metadata
without explicitly taking the lock.

So add 2 new flags, one which says the unlocked updates is
allowed, and one which ways it is happening.  Then allow it
while the quiesce completes, and then wait for it to finish.
Reported-and-tested-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

35bfc521

28 9月, 2017 1 次提交

md: separate request handling · 393debc2

由 Shaohua Li 提交于 9月 21, 2017

With commit cc27b0c7, pers->make_request could bail out without handling
the bio. If that happens, we should retry.  The commit fixes md_make_request
but not other call sites. Separate the request handling part, so other call
sites can use it.
Reported-by: NNate Dailey <nate.dailey@stratus.com>
Fix: cc27b0c7(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: stable@vger.kernel.org
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

393debc2

28 8月, 2017 1 次提交

md: Runtime support for multiple ppls · ddc08823

由 Pawel Baldysiak 提交于 8月 16, 2017

Increase PPL area to 1MB and use it as circular buffer to store PPL. The
entry with highest generation number is the latest one. If PPL to be
written is larger then space left in a buffer, rewind the buffer to the
start (don't wrap it).
Signed-off-by: NPawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

ddc08823

24 8月, 2017 1 次提交

block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992

由 Christoph Hellwig 提交于 8月 23, 2017

This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

74d46992

22 7月, 2017 2 次提交

md: raid1-10: move raid1/raid10 common code into raid1-10.c · be453e77

由 Ming Lei 提交于 7月 14, 2017

No function change, just move 'struct resync_pages' and related
helpers into raid1-10.c
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NShaohua Li <shli@fb.com>

be453e77

md: remove 'idx' from 'struct resync_pages' · 022e510f

由 Ming Lei 提交于 7月 14, 2017

bio_add_page() won't fail for resync bio, and the page index for each
bio is same, so remove it.

More importantly the 'idx' of 'struct resync_pages' is initialized in
mempool allocator function, the current way is wrong since mempool is
only responsible for allocation, we can't use that for initialization.
Suggested-by: NNeilBrown <neilb@suse.com>
Reported-by: NNeilBrown <neilb@suse.com>
Reported-and-tested-by: NPatrick <dto@gmx.net>
Fixes: f0250618(md: raid10: don't use bio's vec table to manage resync pages)
Fixes: 98d30c58(md: raid1: don't use bio's vec table to manage resync pages)
Cc: stable@vger.kernel.org (4.12+)
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NShaohua Li <shli@fb.com>

022e510f

11 7月, 2017 1 次提交

md/bitmap: don't read page from device with Bitmap_sync · 4aaf7694

由 Guoqing Jiang 提交于 7月 04, 2017

The device owns Bitmap_sync flag needs recovery
to become in sync, and read page from this type
device could get stale status.

Also add comments for Bitmap_sync bit per the
suggestion from Shaohua and Neil.

Previous disscussion can be found here:
https://marc.info/?t=149760428900004&r=1&w=2Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

4aaf7694

22 6月, 2017 1 次提交

md: use a separate bio_set for synchronous IO. · 5a85071c

由 NeilBrown 提交于 6月 21, 2017

md devices allocate a bio_set and use it for two
distinct purposes.
mddev->bio_set is used to clone bios as part of sending
upper level requests down to lower level devices,
and it is also use for synchronous IO such as superblock
and bitmap updates, and for correcting read errors.

This multiple usage can lead to deadlocks.  It is likely
that cloned bios might be queued for write and to be
waiting for a metadata update before the write can be permitted.
If the cloning exhausted mddev->bio_set, the metadata update
may not be able to proceed.

This scenario has been seen during heavy testing, with lots of IO and
lots of memory pressure.

Address this by adding a new bio_set specifically for synchronous IO.
All synchronous IO goes directly to the underlying device and is not
queued at the md level, so request using entries from the new
mddev->sync_set will complete in a timely fashion.
Requests that use mddev->bio_set will sometimes need to wait
for synchronous IO, but will no longer risk deadlocking that iO.

Also: small simplification in mddev_put(): there is no need to
wait until the spinlock is released before calling bioset_free().
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

5a85071c

14 6月, 2017 1 次提交

md: fix deadlock between mddev_suspend() and md_write_start() · cc27b0c7

由 NeilBrown 提交于 6月 05, 2017

If mddev_suspend() races with md_write_start() we can deadlock
with mddev_suspend() waiting for the request that is currently
in md_write_start() to complete the ->make_request() call,
and md_write_start() waiting for the metadata to be updated
to mark the array as 'dirty'.
As metadata updates done by md_check_recovery() only happen then
the mddev_lock() can be claimed, and as mddev_suspend() is often
called with the lock held, these threads wait indefinitely for each
other.

We fix this by having md_write_start() abort if mddev_suspend()
is happening, and ->make_request() aborts if md_write_start()
aborted.
md_make_request() can detect this abort, decrease the ->active_io
count, and wait for mddev_suspend().
Reported-by: NNix <nix@esperi.org.uk>
Fix: 68866e42(MD: no sync IO while suspended)
Cc: stable@vger.kernel.org
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

cc27b0c7

06 6月, 2017 1 次提交

md: initialise ->writes_pending in personality modules. · a415c0f1

由 NeilBrown 提交于 6月 05, 2017

The new per-cpu counter for writes_pending is initialised in
md_alloc(), which is not called by dm-raid.
So dm-raid fails when md_write_start() is called.

Move the initialization to the personality modules
that need it.  This way it is always initialised when needed,
but isn't unnecessarily initialized (requiring memory allocation)
when the personality doesn't use writes_pending.
Reported-by: NHeinz Mauelshagen <heinzm@redhat.com>
Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

a415c0f1

09 5月, 2017 1 次提交

md: don't return -EAGAIN in md_allow_write for external metadata arrays · 2214c260

由 Artur Paszkiewicz 提交于 5月 08, 2017

This essentially reverts commit b5470dc5 ("md: resolve external
metadata handling deadlock in md_allow_write") with some adjustments.

Since commit 6791875e ("md: make reconfig_mutex optional for writes
to md sysfs files.") changing array_state to 'active' does not use
mddev_lock() and will not cause a deadlock with md_allow_write(). This
revert simplifies userspace tools that write to sysfs attributes like
"stripe_cache_size" or "consistency_policy" because it removes the need
for special handling for external metadata arrays, checking for EAGAIN
and retrying the write.
Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

2214c260

09 4月, 2017 1 次提交

md: support REQ_OP_WRITE_ZEROES · 3deff1a7

由 Christoph Hellwig 提交于 4月 05, 2017

Copy & paste from the REQ_OP_WRITE_SAME code.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

3deff1a7

25 3月, 2017 2 次提交

md: prepare for managing resync I/O pages in clean way · 513e2faa

由 Ming Lei 提交于 3月 17, 2017

Now resync I/O use bio's bec table to manage pages,
this way is very hacky, and may not work any more
once multipage bvec is introduced.

So introduce helpers and new data structure for
managing resync I/O pages more cleanly.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

513e2faa

md: move two macros into md.h · d8e29fbc

由 Ming Lei 提交于 3月 17, 2017

Both raid1 and raid10 share common resync
block size and page count, so move them into md.h.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

d8e29fbc

23 3月, 2017 2 次提交

MD: use per-cpu counter for writes_pending · 4ad23a97

由 NeilBrown 提交于 3月 15, 2017

The 'writes_pending' counter is used to determine when the
array is stable so that it can be marked in the superblock
as "Clean".  Consequently it needs to be updated frequently
but only checked for zero occasionally.  Recent changes to
raid5 cause the count to be updated even more often - once
per 4K rather than once per bio.  This provided
justification for making the updates more efficient.

So we replace the atomic counter a percpu-refcount.
This can be incremented and decremented cheaply most of the
time, and can be switched to "atomic" mode when more
precise counting is needed.  As it is possible for multiple
threads to want a precise count, we introduce a
"sync_checker" counter to count the number of threads
in "set_in_sync()", and only switch the refcount back
to percpu mode when that is zero.

We need to be careful about races between set_in_sync()
setting ->in_sync to 1, and md_write_start() setting it
to zero.  md_write_start() holds the rcu_read_lock()
while checking if the refcount is in percpu mode.  If
it is, then we know a switch to 'atomic' will not happen until
after we call rcu_read_unlock(), in which case set_in_sync()
will see the elevated count, and not set in_sync to 1.
If it is not in percpu mode, we take the mddev->lock to
ensure proper synchronization.

It is no longer possible to quickly check if the count is zero, which
we previously did to update a timer or to schedule the md_thread.
So now we do these every time we decrement that counter, but make
sure they are fast.

mod_timer() already optimizes the case where the timeout value doesn't
actually change.  We leverage that further by always rounding off the
jiffies to the timeout value.  This may delay the marking of 'clean'
slightly, but ensure we only perform atomic operation here when absolutely
needed.

md_wakeup_thread() current always calls wake_up(), even if
THREAD_WAKEUP is already set.  That too can be optimised to avoid
calls to wake_up().
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

4ad23a97

md/raid5: use md_write_start to count stripes, not bios · 49728050

由 NeilBrown 提交于 3月 15, 2017

We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count.  We currently count bios
submitted to md/raid5.  Change it count stripe_heads that a WRITE bio
has been attached to.

So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.

add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request also calls md_write_start/end().

The parallel between md_write_{start,end} and use of bi_phys_segments
can be seen in that:
 Whenever we set bi_phys_segments to 1, we now call md_write_start.
 Whenever we increment it on non-read requests with
   raid5_inc_bi_active_stripes(), we now call md_write_start().
 Whenever we decrement bi_phys_segments on non-read requsts with
    raid5_dec_bi_active_stripes(), we now call md_write_end().

This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.

md_write_inc() is added which parallels md_write_start(), but requires
that a write has already been started, and is certain never to sleep.
This can be used inside a spinlocked region when adding to a write
request.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

49728050

17 3月, 2017 3 次提交

raid5-ppl: runtime PPL enabling or disabling · ba903a3e

由 Artur Paszkiewicz 提交于 3月 09, 2017

Allow writing to 'consistency_policy' attribute when the array is
active. Add a new function 'change_consistency_policy' to the
md_personality operations structure to handle the change in the
personality code. Values "ppl" and "resync" are accepted and
turn PPL on and off respectively.

When enabling PPL its location and size should first be set using
'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be
written at this location on each member device.

Enabling or disabling PPL is performed under a suspended array.  The
raid5_reset_stripe_cache function frees the stripe cache and allocates
it again in order to allocate or free the ppl_pages for the stripes in
the stripe cache.
Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

ba903a3e

md: superblock changes for PPL · ea0213e0

由 Artur Paszkiewicz 提交于 3月 09, 2017

Include information about PPL location and size into mdp_superblock_1
and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
'feature_map', analogically to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
to mddev->flags to indicate that PPL is enabled on an array.
Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

ea0213e0

md-cluster: use sync way to handle METADATA_UPDATED msg · 0ba95977

由 Guoqing Jiang 提交于 3月 01, 2017

Previously, when node received METADATA_UPDATED msg, it just
need to wakeup mddev->thread, then md_reload_sb will be called
eventually.

We taken the asynchronous way to avoid a deadlock issue, the
deadlock issue could happen when one node is receiving the
METADATA_UPDATED msg (wants reconfig_mutex) and trying to run
the path:

md_check_recovery -> mddev_trylock(hold reconfig_mutex)
                  -> md_update_sb-metadata_update_start
		     (want EX on token however token is
		      got by the sending node)

Since we will support resizing for clustered raid, and we
need the metadata update handling to be synchronous so that
the initiating node can detect failure, so we need to change
the way for handling METADATA_UPDATED msg.

But, we obviously need to avoid above deadlock with the
sync way. To make this happen, we considered to not hold
reconfig_mutex to call md_reload_sb, if some other thread
has already taken reconfig_mutex and waiting for the 'token',
then process_recvd_msg() can safely call md_reload_sb()
without taking the mutex. This is because we can be certain
that no other thread will take the mutex, and we also certain
that the actions performed by md_reload_sb() won't interfere
with anything that the other thread is in the middle of.

To make this more concrete, we added a new cinfo->state bit
        MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD

Which is set in lock_token() just before dlm_lock_sync() is
called, and cleared just after. As lock_token() is always
called with reconfig_mutex() held (the specific case is the
resync_info_update which is distinguished well in previous
patch), if process_recvd_msg() finds that the new bit is set,
then the mutex must be held by some other thread, and it will
keep waiting.

So process_metadata_update() can call md_reload_sb() if either
mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
is set. The tricky bit is what to do if neither of these apply.
We need to wait. Fortunately mddev_unlock() always calls wake_up()
on mddev->thread->wqueue. So we can get lock_token() to call
wake_up() on that when it sets the bit.

There are also some related changes inside this commit:
1. remove RELOAD_SB related codes since there are not valid anymore.
2. mddev is added into md_cluster_info then we can get mddev inside
   lock_token.
3. add new parameter for lock_token to distinguish reconfig_mutex
   is held or not.

And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in below:
1. set it before unregister thread, otherwise a deadlock could
   appear if stop a resyncing array.
   This is because md_unregister_thread(&cinfo->recv_thread) is
   blocked by recv_daemon -> process_recvd_msg
			  -> process_metadata_update.
   To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
   also need to be set before unregister thread.
2. set it in metadata_update_start to fix another deadlock.
	a. Node A sends METADATA_UPDATED msg (held Token lock).
	b. Node B wants to do resync, and is blocked since it can't
	   get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
	   not set since the callchain
	   (md_do_sync -> sync_request
        	       -> resync_info_update
		       -> sendmsg
		       -> lock_comm -> lock_token)
	   doesn't hold reconfig_mutex.
	c. Node B trys to update sb (held reconfig_mutex), but stopped
	   at wait_event() in metadata_update_start since we have set
	   MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2).
	d. Then Node B receives METADATA_UPDATED msg from A, of course
	   recv_daemon is blocked forever.
   Since metadata_update_start always calls lock_token with reconfig_mutex,
   we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
   lock_token don't need to set it twice unless lock_token is invoked from
   lock_comm.

Finally, thanks to Neil for his great idea and help!
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

0ba95977

10 3月, 2017 1 次提交

md: delete dead code · 99b3d74e

由 Shaohua Li 提交于 2月 23, 2017

Nobody is using mddev_check_plugged(), so delete the dead code
Signed-off-by: NShaohua Li <shli@fb.com>

99b3d74e

16 2月, 2017 1 次提交

md: fast clone bio in bio_clone_mddev() · d7a10308

由 Ming Lei 提交于 2月 14, 2017

Firstly bio_clone_mddev() is used in raid normal I/O and isn't
in resync I/O path.

Secondly all the direct access to bvec table in raid happens on
resync I/O except for write behind of raid1, in which we still
use bio_clone() for allocating new bvec table.

So this patch replaces bio_clone() with bio_clone_fast()
in bio_clone_mddev().

Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
suggested by Christoph Hellwig.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

d7a10308

14 2月, 2017 1 次提交

md: disable WRITE SAME if it fails in underlayer disks · 26483819

由 Shaohua Li 提交于 2月 13, 2017

This makes md do the same thing as dm for write same IO failure. Please
see 7eee4ae2(dm: disable WRITE SAME if it fails) for details why we need
this.

We did a little bit different than dm. Instead of disabling writesame in
the first IO error, we disable it till next writesame IO coming after
the first IO error. This way we don't need to clone a bio.

Also reported here: https://bugzilla.kernel.org/show_bug.cgi?id=118581Suggested-by: NNeilBrown <neilb@suse.com>
Acked-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

26483819

06 1月, 2017 1 次提交

md: cleanup mddev flag clear for takeover · 394ed8e4

由 Shaohua Li 提交于 1月 04, 2017

Commit 6995f0b2 (md: takeover should clear unrelated bits) clear
unrelated bits, but it's quite fragile. To avoid error in the future,
define a macro for unsupported mddev flags for each raid type and use it
to clear unsupported mddev flags. This should be less error-prone.
Suggested-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

394ed8e4

09 12月, 2016 1 次提交

md: separate flags for superblock changes · 2953079c

由 Shaohua Li 提交于 12月 08, 2016

The mddev->flags are used for different purposes. There are a lot of
places we check/change the flags without masking unrelated flags, we
could check/change unrelated flags. These usage are most for superblock
write, so spearate superblock related flags. This should make the code
clearer and also fix real bugs.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

2953079c

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功