提交 · a1d6bc018911390274e3904bdd28240cd96ddc54 · openeuler / Kernel

16 7月, 2020 1 次提交

md: replace the RAID_AUTORUN ioctl with a direct function call · d82fa81c

由 Christoph Hellwig 提交于 6月 06, 2020

Instead of using a spcial RAID_AUTORUN ioctl that only exists for
non-modular builds and is only called from the early init code, just
call the actual function directly.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NNeilBrown <neilb@suse.de>
Acked-by: NSong Liu <song@kernel.org>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>

d82fa81c

14 5月, 2020 1 次提交

md: use memalloc scope APIs in mddev_suspend()/mddev_resume() · 78f57ef9

由 Coly Li 提交于 4月 09, 2020

In raid5.c:resize_chunk(), scribble_alloc() is called with GFP_NOIO
flag, then it is sent into kvmalloc_array() inside scribble_alloc().

The problem is kvmalloc_array() eventually calls kvmalloc_node() which
does not accept non GFP_KERNEL compatible flag like GFP_NOIO, then
kmalloc_node() is called indeed to allocate physically continuous
pages. When system memory is under heavy pressure, and the requesting
size is large, there is high probability that allocating continueous
pages will fail.

But simply using GFP_KERNEL flag to call kvmalloc_array() is also
progblematic. In the code path where scribble_alloc() is called, the
raid array is suspended, if kvmalloc_node() triggers memory reclaim I/Os
and such I/Os go back to the suspend raid array, deadlock will happen.

What is desired here is to allocate non-physically (a.k.a virtually)
continuous pages and avoid memory reclaim I/Os. Michal Hocko suggests
to use the mmealloc sceope APIs to restrict memory reclaim I/O in
allocating context, specifically to call memalloc_noio_save() when
suspend the raid array and to call memalloc_noio_restore() when
resume the raid array.

This patch adds the memalloc scope APIs in mddev_suspend() and
mddev_resume(), to restrict memory reclaim I/Os during the raid array
is suspended. The benifit of adding the memalloc scope API in the
unified entry point mddev_suspend()/mddev_resume() is, no matter which
md raid array type (personality), we are sure the deadlock by recursive
memory reclaim I/O won't happen on the suspending context.

Please notice that the memalloc scope APIs only take effect on the raid
array suspending context, if the memory allocation is from another new
created kthread after raid array suspended, the recursive memory reclaim
I/Os won't be restricted. The mddev_suspend()/mddev_resume() entries are
used for the critical section where the raid metadata is modifying,
creating a kthread to allocate memory inside the critical section is
queer and very probably being buggy.

Fixes: b330e6a4 ("md: convert to kvmalloc")
Suggested-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NColy Li <colyli@suse.de>
Signed-off-by: NSong Liu <songliubraving@fb.com>

78f57ef9

14 1月, 2020 4 次提交

md: introduce a new struct for IO serialization · 69b00b5b

由 Guoqing Jiang 提交于 12月 23, 2019

Obviously, IO serialization could cause the degradation of
performance a lot. In order to reduce the degradation, so a
rb interval tree is added in raid1 to speed up the check of
collision.

So, a rb root is needed in md_rdev, then abstract all the
serialize related members to a new struct (serial_in_rdev),
embed it into md_rdev.

Of course, we need to free the struct if it is not needed
anymore, so rdev/rdevs_uninit_serial are added accordingly.
And they should be called when destroty memory pool or can't
alloc memory.

And we need to consider to call mddev_destroy_serial_pool
in case serialize_policy/write-behind is disabled, bitmap
is destroyed or in __md_stop_writes.
Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

69b00b5b

md: add serialize_policy sysfs node for raid1 · 3938f5fb

由 Guoqing Jiang 提交于 12月 23, 2019

With the new sysfs node, we can use it to control if raid1 array
wants io serialization or not. So mddev_create_serial_pool and
mddev_destroy_serial_pool are called in serialize_policy_store
to enable or disable the serialization.
Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

3938f5fb

md: prepare for enable raid1 io serialization · 11d3a9f6

由 Guoqing Jiang 提交于 12月 23, 2019

1. The related resources (spin_lock, list and waitqueue) are needed for
address raid1 reorder overlap issue too, in this case, rdev is set to
NULL for mddev_create/destroy_serial_pool which implies all rdevs need
to handle these resources.

And also add "is_suspend" to mddev_destroy_serial_pool since it will
be called under suspended situation, which also makes both create and
destroy pool have same arguments.

2. Introduce rdevs_init_serial which is called if raid1 io serialization
is enabled since all rdevs need to init related stuffs.

3. rdev_init_serial and clear_bit(CollisionCheck, &rdev->flags) should
be called between suspend and resume.

No need to export mddev_create_serial_pool since it is only called in
md-mod module.
Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

11d3a9f6

md: rename wb stuffs · 404659cf

由 Guoqing Jiang 提交于 12月 23, 2019

Previously, wb_info_pool and wb_list stuffs are introduced
to address potential data inconsistence issue for write
behind device.

Now rename them to serial related name, since the same
mechanism will be used to address reorder overlap write
issue for raid1.
Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

404659cf

25 10月, 2019 1 次提交

md: improve handling of bio with REQ_PREFLUSH in md_flush_request() · 775d7831

由 David Jeffery 提交于 9月 16, 2019

If pers->make_request fails in md_flush_request(), the bio is lost. To
fix this, pass back a bool to indicate if the original make_request call
should continue to handle the I/O and instead of assuming the flush logic
will push it to completion.

Convert md_flush_request to return a bool and no longer calls the raid
driver's make_request function.  If the return is true, then the md flush
logic has or will complete the bio and the md make_request call is done.
If false, then the md make_request function needs to keep processing like
it is a normal bio. Let the original call to md_handle_request handle any
need to retry sending the bio to the raid driver's make_request function
should it be needed.

Also mark md_flush_request and the make_request function pointer as
__must_check to issue warnings should these critical return values be
ignored.

Fixes: 2bc13b83 ("md: batch flush requests.")
Cc: stable@vger.kernel.org # # v4.19+
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: NDavid Jeffery <djeffery@redhat.com>
Reviewed-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

775d7831

04 9月, 2019 1 次提交

md raid0/linear: Mark array as 'broken' and fail BIOs if a member is gone · 62f7b198

由 Guilherme G. Piccoli 提交于 9月 03, 2019

Currently md raid0/linear are not provided with any mechanism to validate
if an array member got removed or failed. The driver keeps sending BIOs
regardless of the state of array members, and kernel shows state 'clean'
in the 'array_state' sysfs attribute. This leads to the following
situation: if a raid0/linear array member is removed and the array is
mounted, some user writing to this array won't realize that errors are
happening unless they check dmesg or perform one fsync per written file.
Despite udev signaling the member device is gone, 'mdadm' cannot issue the
STOP_ARRAY ioctl successfully, given the array is mounted.

In other words, no -EIO is returned and writes (except direct ones) appear
normal. Meaning the user might think the wrote data is correctly stored in
the array, but instead garbage was written given that raid0 does stripping
(and so, it requires all its members to be working in order to not corrupt
data). For md/linear, writes to the available members will work fine, but
if the writes go to the missing member(s), it'll cause a file corruption
situation, whereas the portion of the writes to the missing devices aren't
written effectively.

This patch changes this behavior: we check if the block device's gendisk
is UP when submitting the BIO to the array member, and if it isn't, we flag
the md device as MD_BROKEN and fail subsequent I/Os to that device; a read
request to the array requiring data from a valid member is still completed.
While flagging the device as MD_BROKEN, we also show a rate-limited warning
in the kernel log.

A new array state 'broken' was added too: it mimics the state 'clean' in
every aspect, being useful only to distinguish if the array has some member
missing. We rely on the MD_BROKEN flag to put the array in the 'broken'
state. This state cannot be written in 'array_state' as it just shows
one or more members of the array are missing but acts like 'clean', it
wouldn't make sense to write it.

With this patch, the filesystem reacts much faster to the event of missing
array member: after some I/O errors, ext4 for instance aborts the journal
and prevents corruption. Without this change, we're able to keep writing
in the disk and after a machine reboot, e2fsck shows some severe fs errors
that demand fixing. This patch was tested in ext4 and xfs filesystems, and
requires a 'mdadm' counterpart to handle the 'broken' state.

Cc: Song Liu <songliubraving@fb.com>
Reviewed-by: NNeilBrown <neilb@suse.de>
Signed-off-by: NGuilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

62f7b198

28 8月, 2019 1 次提交

md: don't report active array_state until after revalidate_disk() completes. · 9d4b45d6

由 NeilBrown 提交于 8月 20, 2019

Until revalidate_disk() has completed, the size of a new md array will
appear to be zero.
So we shouldn't report, through array_state, that the array is active
until that time.
udev rules check array_state to see if the array is ready.  As soon as
it appear to be zero, fsck can be run.  If it find the size to be
zero, it will fail.

So add a new flag to provide an interlock between do_md_run() and
array_state_show().  This flag is set while do_md_run() is active and
it prevents array_state_show() from reporting that the array is
active.

Before do_md_run() is called, ->pers will be NULL so array is
definitely not active.
After do_md_run() is called, revalidate_disk() will have run and the
array will be completely ready.

We also move various sysfs_notify*() calls out of md_run() into
do_md_run() after MD_NOT_READY is cleared.  This ensure the
information is ready before the notification is sent.

Prior to v4.12, array_state_show() was called with the
mddev->reconfig_mutex held, which provided exclusion with do_md_run().

Note that MD_NOT_READY cleared twice.  This is deliberate to cover
both success and error paths with minimal noise.

Fixes: b7b17c9b ("md: remove mddev_lock() from md_attr_show()")
Cc: stable@vger.kernel.org (v4.12++)
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

9d4b45d6

08 8月, 2019 1 次提交

md: allow last device to be forcibly removed from RAID1/RAID10. · 9a567843

由 Guoqing Jiang 提交于 7月 24, 2019

When the 'last' device in a RAID1 or RAID10 reports an error,
we do not mark it as failed.  This would serve little purpose
as there is no risk of losing data beyond that which is obviously
lost (as there is with RAID5), and there could be other sectors
on the device which are readable, and only readable from this device.
This in general this maximises access to data.

However the current implementation also stops an admin from removing
the last device by direct action.  This is rarely useful, but in many
case is not harmful and can make automation easier by removing special
cases.

Also, if an attempt to write metadata fails the device must be marked
as faulty, else an infinite loop will result, attempting to update
the metadata on all non-faulty devices.

So add 'fail_last_dev' member to 'struct mddev', then we can bypasses
the 'last disk' checks for RAID1 and RAID10, and control the behavior
per array by change sysfs node.
Signed-off-by: NNeilBrown <neilb@suse.de>
[add sysfs node for fail_last_dev by Guoqing]
Signed-off-by: NGuoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

9a567843

21 6月, 2019 2 次提交

md: introduce mddev_create/destroy_wb_pool for the change of member device · 963c555e

由 Guoqing Jiang 提交于 6月 14, 2019

Previously, we called rdev_init_wb to avoid potential data
inconsistency when array is created.

Now, we need to call the function and create mempool if a
device is added or just be flaged as "writemostly". So
mddev_create_wb_pool is introduced and called accordingly.
And for safety reason, we mark implicit GFP_NOIO allocation
scope for create mempool during mddev_suspend/mddev_resume.

And mempool should be removed conversely after remove a
member device or its's "writemostly" flag, which is done
by call mddev_destroy_wb_pool.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

963c555e

md/raid1: fix potential data inconsistency issue with write behind device · 3e148a32

由 Guoqing Jiang 提交于 6月 19, 2019

For write-behind mode, we think write IO is complete once it has
reached all the non-writemostly devices. It works fine for single
queue devices.

But for multiqueue device, if there are lots of IOs come from upper
layer, then the write-behind device could issue those IOs to different
queues, depends on the each queue's delay, so there is no guarantee
that those IOs can arrive in order.

To address the issue, we need to check the collision among write
behind IOs, we can only continue without collision, otherwise wait
for the completion of previous collisioned IO.

And WBCollision is introduced for multiqueue device which is worked
under write-behind mode.

But this patch doesn't handle below cases which could have the data
inconsistency issue as well, these cases will be handled in later
patches.

1. modify max_write_behind by write backlog node.
2. add or remove array's bitmap dynamically.
3. the change of member disk.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>

3e148a32

24 5月, 2019 1 次提交

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 47 · af1a8899

由 Thomas Gleixner 提交于 5月 20, 2019

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 or at your option any
later version you should have received a copy of the gnu general
public license for example usr src linux copying if not write to the
free software foundation inc 675 mass ave cambridge ma 02139 usa

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 20 file(s).
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NAllison Randal <allison@lohutok.net>
Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520170858.552543146@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

af1a8899

02 4月, 2019 2 次提交

md: batch flush requests. · 2bc13b83

由 NeilBrown 提交于 3月 29, 2019

Currently if many flush requests are submitted to an md device is quick
succession, they are serialized and can take a long to process them all.
We don't really need to call flush all those times - a single flush call
can satisfy all requests submitted before it started.
So keep track of when the current flush started and when it finished,
allow any pending flush that was requested before the flush started
to complete without waiting any more.

Test results from Xiao:

Test is done on a raid10 device which is created by 4 SSDs. The tool is
dbench.

1. The latest linux stable kernel
  Operation                Count    AvgLat    MaxLat
  --------------------------------------------------
  Deltree                    768    10.509    78.305
  Flush                  2078376     0.013    10.094
  Close                  21787697     0.019    18.821
  LockX                    96580     0.007     3.184
  Mkdir                      384     0.008     0.062
  Rename                 1255883     0.191    23.534
  ReadX                  46495589     0.020    14.230
  WriteX                 14790591     7.123    60.706
  Unlink                 5989118     0.440    54.551
  UnlockX                  96580     0.005     2.736
  FIND_FIRST             10393845     0.042    12.079
  SET_FILE_INFORMATION   2415558     0.129    10.088
  QUERY_FILE_INFORMATION 4711725     0.005     8.462
  QUERY_PATH_INFORMATION 26883327     0.032    21.715
  QUERY_FS_INFORMATION   4929409b     0.010     8.238
  NTCreateX              29660080     0.100    53.268

Throughput 1034.88 MB/sec (sync open)  128 clients  128 procs
max_latency=60.712 ms

2. With patch1 "Revert "MD: fix lock contention for flush bios""
  Operation                Count    AvgLat    MaxLat
  --------------------------------------------------
  Deltree                    256     8.326    36.761
  Flush                   693291     3.974   180.269
  Close                  7266404     0.009    36.929
  LockX                    32160     0.006     0.840
  Mkdir                      128     0.008     0.021
  Rename                  418755     0.063    29.945
  ReadX                  15498708     0.007     7.216
  WriteX                 4932310    22.482   267.928
  Unlink                 1997557     0.109    47.553
  UnlockX                  32160     0.004     1.110
  FIND_FIRST             3465791     0.036     7.320
  SET_FILE_INFORMATION    805825     0.015     1.561
  QUERY_FILE_INFORMATION 1570950     0.005     2.403
  QUERY_PATH_INFORMATION 8965483     0.013    14.277
  QUERY_FS_INFORMATION   1643626     0.009     3.314
  NTCreateX              9892174     0.061    41.278

Throughput 345.009 MB/sec (sync open)  128 clients  128 procs
max_latency=267.939 m

3. With patch1 and patch2
  Operation                Count    AvgLat    MaxLat
  --------------------------------------------------
  Deltree                    768     9.570    54.588
  Flush                  2061354     0.666    15.102
  Close                  21604811     0.012    25.697
  LockX                    95770     0.007     1.424
  Mkdir                      384     0.008     0.053
  Rename                 1245411     0.096    12.263
  ReadX                  46103198     0.011    12.116
  WriteX                 14667988     7.375    60.069
  Unlink                 5938936     0.173    30.905
  UnlockX                  95770     0.005     4.147
  FIND_FIRST             10306407     0.041    11.715
  SET_FILE_INFORMATION   2395987     0.048     7.640
  QUERY_FILE_INFORMATION 4672371     0.005     9.291
  QUERY_PATH_INFORMATION 26656735     0.018    19.719
  QUERY_FS_INFORMATION   4887940     0.010     7.654
  NTCreateX              29410811     0.059    28.551

Throughput 1026.21 MB/sec (sync open)  128 clients  128 procs
max_latency=60.075 ms

Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2bc13b83

Revert "MD: fix lock contention for flush bios" · 4bc034d3

由 NeilBrown 提交于 3月 29, 2019

This reverts commit 5a409b4f.

This patch has two problems.

1/ it make multiple calls to submit_bio() from inside a make_request_fn.
 The bios thus submitted will be queued on current->bio_list and not
 submitted immediately.  As the bios are allocated from a mempool,
 this can theoretically result in a deadlock - all the pool of requests
 could be in various ->bio_list queues and a subsequent mempool_alloc
 could block waiting for one of them to be released.

2/ It aims to handle a case when there are many concurrent flush requests.
  It handles this by submitting many requests in parallel - all of which
  are identical and so most of which do nothing useful.
  It would be more efficient to just send one lower-level request, but
  allow that to satisfy multiple upper-level requests.

Fixes: 5a409b4f ("MD: fix lock contention for flush bios")
Cc: <stable@vger.kernel.org> # v4.19+
Tested-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4bc034d3

19 10月, 2018 1 次提交

md-cluster/raid10: support add disk under grow mode · 7564beda

由 Guoqing Jiang 提交于 10月 18, 2018

For clustered raid10 scenario, we need to let all the nodes
know about that a new disk is added to the array, and the
reshape caused by add new member just need to be happened in
one node, but other nodes should know about the change.

Since reshape means read data from somewhere (which is already
used by array) and write data to unused region. Obviously, it
is awful if one node is reading data from address while another
node is writing to the same address. Considering we have
implemented suspend writes in the resyncing area, so we can
just broadcast the reading address to other nodes to avoid the
trouble.

For master node, it would call reshape_request then update sb
during the reshape period. To avoid above trouble, we call
resync_info_update to send RESYNC message in reshape_request.

Then from slave node's view, it receives two type messages:
1. RESYNCING message
Slave node add the address (where master node reading data from)
to suspend list.

2. METADATA_UPDATED message
Once slave nodes know the reshaping is started in master node,
it is time to update reshape position and call start_reshape to
follow master node's step. After reshape is done, only reshape
position is need to be updated, so the majority task of reshaping
is happened on the master node.
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

7564beda

06 7月, 2018 1 次提交

md-cluster: show array's status more accurate · 0357ba27

由 Guoqing Jiang 提交于 7月 02, 2018

When resync or recovery is happening in one node,
other nodes don't show the appropriate info now.

For example, when create an array in master node
without "--assume-clean", then assemble the array
in slave nodes, you can see "resync=PENDING" when
read /proc/mdstat in slave nodes. However, the info
is confusing since "PENDING" status is introduced
for start array in read-only mode.

We introduce RESYNCING_REMOTE flag to indicate that
resync thread is running in remote node. The flags
is set when node receive RESYNCING msg. And we clear
the REMOTE flag in following cases:

1. resync or recover is finished in master node,
   which means slaves receive msg with both lo
   and hi are set to 0.
2. node continues resync/recovery in recover_bitmaps.
3. when resync_finish is called.

Then we show accurate information in status_resync
by check REMOTE flags and with other conditions.
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

0357ba27

31 5月, 2018 1 次提交

md: convert to bioset_init()/mempool_init() · afeee514

由 Kent Overstreet 提交于 5月 20, 2018

Convert md to embedded bio sets.
Signed-off-by: NKent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

afeee514

22 5月, 2018 1 次提交

MD: fix lock contention for flush bios · 5a409b4f

由 Xiao Ni 提交于 5月 21, 2018

There is a lock contention when there are many processes which send flush bios
to md device. eg. Create many lvs on one raid device and mkfs.xfs on each lv.

Now it just can handle flush request sequentially. It needs to wait mddev->flush_bio
to be NULL, otherwise get mddev->lock.

This patch remove mddev->flush_bio and handle flush bio asynchronously.
I did a test with command dbench -s 128 -t 300. This is the test result:

=================Without the patch============================
 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                    11165   167.595  5879.560
 Close                   107469     1.391  2231.094
 LockX                      384     0.003     0.019
 Rename                    5944     2.141  1856.001
 ReadX                   208121     0.003     0.074
 WriteX                   98259  1925.402 15204.895
 Unlink                   25198    13.264  3457.268
 UnlockX                    384     0.001     0.009
 FIND_FIRST               47111     0.012     0.076
 SET_FILE_INFORMATION     12966     0.007     0.065
 QUERY_FILE_INFORMATION   27921     0.004     0.085
 QUERY_PATH_INFORMATION  124650     0.005     5.766
 QUERY_FS_INFORMATION     22519     0.003     0.053
 NTCreateX               141086     4.291  2502.812

Throughput 3.7181 MB/sec (sync open)  128 clients  128 procs  max_latency=15204.905 ms

=================With the patch============================
 Operation                Count    AvgLat    MaxLat
 --------------------------------------------------
 Flush                     4500   174.134   406.398
 Close                    48195     0.060   467.062
 LockX                      256     0.003     0.029
 Rename                    2324     0.026     0.360
 ReadX                    78846     0.004     0.504
 WriteX                   66832   562.775  1467.037
 Unlink                    5516     3.665  1141.740
 UnlockX                    256     0.002     0.019
 FIND_FIRST               16428     0.015     0.313
 SET_FILE_INFORMATION      6400     0.009     0.520
 QUERY_FILE_INFORMATION   17865     0.003     0.089
 QUERY_PATH_INFORMATION   47060     0.078   416.299
 QUERY_FS_INFORMATION      7024     0.004     0.032
 NTCreateX                55921     0.854  1141.452

Throughput 11.744 MB/sec (sync open)  128 clients  128 procs  max_latency=1467.041 ms

The test is done on raid1 disk with two rotational disks

V5: V4 is more complicated than the version with memory pool. So revert to the memory pool
version

V4: use address of fbio to do hash to choose free flush info.
V3:
Shaohua suggests mempool is overkill. In v3 it allocs memory during creating raid device
and uses a simple bitmap to record which resource is free.

Fix a bug from v2. It should set flush_pending to 1 at first.

V2:
Neil pointed out two problems. One is counting error problem and another is return value
when allocat memory fails.
1. counting error problem
This isn't safe.  It is only safe to call rdev_dec_pending() on rdevs
that you previously called
                          atomic_inc(&rdev->nr_pending);
If an rdev was added to the list between the start and end of the flush,
this will do something bad.

Now it doesn't use bio_chain. It uses specified call back function for each
flush bio.
2. Returned on IO error when kmalloc fails is wrong.
I use mempool suggested by Neil in V2
3. Fixed some places pointed by Guoqing
Suggested-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NShaohua Li <shli@fb.com>

5a409b4f

19 2月, 2018 1 次提交

md: fix md_write_start() deadlock w/o metadata devices · 4b6c1060

由 Heinz Mauelshagen 提交于 2月 02, 2018

If no metadata devices are configured on raid1/4/5/6/10
(e.g. via dm-raid), md_write_start() unconditionally waits
for superblocks to be written thus deadlocking.

Fix introduces mddev->has_superblocks bool, defines it in md_run()
and checks for it in md_write_start() to conditionally avoid waiting.

Once on it, check for non-existing superblocks in md_super_write().

Link: https://bugzilla.kernel.org/show_bug.cgi?id=198647
Fixes: cc27b0c7 ("md: fix deadlock between mddev_suspend() and md_write_start()")
Signed-off-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NShaohua Li <sh.li@alibaba-inc.com>

4b6c1060

16 1月, 2018 1 次提交

raid5-ppl: PPL support for disks with write-back cache enabled · 1532d9e8

由 Tomasz Majchrzak 提交于 12月 27, 2017

In order to provide data consistency with PPL for disks with write-back
cache enabled all data has to be flushed to disks before next PPL
entry. The disks to be flushed are marked in the bitmap. It's modified
under a mutex and it's only read after PPL io unit is submitted.

A limitation of 64 disks in the array has been introduced to keep data
structures and implementation simple. RAID5 arrays with so many disks are
not likely due to high risk of multiple disks failure. Such restriction
should not be a real life limitation.

With write-back cache disabled next PPL entry is submitted when data write
for current one completes. Data flush defers next log submission so trigger
it when there are no stripes for handling found.

As PPL assures all data is flushed to disk at request completion, just
acknowledge flush request when PPL is enabled.
Signed-off-by: NTomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: NShaohua Li <sh.li@alibaba-inc.com>

1532d9e8

12 12月, 2017 1 次提交

md: introduce new personality funciton start() · d5d885fd

由 Song Liu 提交于 11月 19, 2017

In do_md_run(), md threads should not wake up until the array is fully
initialized in md_run(). However, in raid5_run(), raid5-cache may wake
up mddev->thread to flush stripes that need to be written back. This
design doesn't break badly right now. But it could lead to bad bug in
the future.

This patch tries to resolve this problem by splitting start up work
into two personality functions, run() and start(). Tasks that do not
require the md threads should go into run(), while task that require
the md threads go into start().

r5l_load_log() is moved to raid5_start(), so it is not called until
the md threads are started in do_md_run().
Signed-off-by: NSong Liu <songliubraving@fb.com>
Signed-off-by: NShaohua Li <shli@fb.com>

d5d885fd

02 11月, 2017 3 次提交

md: use lockdep_assert_held · efa4b77b

由 Shaohua Li 提交于 10月 18, 2017

lockdep_assert_held is a better way to assert lock held, and it works
for UP.
Signed-off-by: NShaohua Li <shli@fb.com>

efa4b77b

md: remove special meaning of ->quiesce(.., 2) · b03e0ccb

由 NeilBrown 提交于 10月 19, 2017

The '2' argument means "wake up anything that is waiting".
This is an inelegant part of the design and was added
to help support management of suspend_lo/suspend_hi setting.
Now that suspend_lo/hi is managed in mddev_suspend/resume,
that need is gone.
These is still a couple of places where we call 'quiesce'
with an argument of '2', but they can safely be changed to
call ->quiesce(.., 1); ->quiesce(.., 0) which
achieve the same result at the small cost of pausing IO
briefly.

This removes a small "optimization" from suspend_{hi,lo}_store,
but it isn't clear that optimization served a useful purpose.
The code now is a lot clearer.
Suggested-by: NShaohua Li <shli@kernel.org>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

b03e0ccb

md: allow metadata update while suspending. · 35bfc521

由 NeilBrown 提交于 10月 17, 2017

There are various deadlocks that can occur
when a thread holds reconfig_mutex and calls
->quiesce(mddev, 1).
As some write request block waiting for
metadata to be updated (e.g. to record device
failure), and as the md thread updates the metadata
while the reconfig mutex is held, holding the mutex
can stop write requests completing, and this prevents
->quiesce(mddev, 1) from completing.

->quiesce() is now usually called from mddev_suspend(),
and it is always called with reconfig_mutex held.  So
at this time it is safe for the thread to update metadata
without explicitly taking the lock.

So add 2 new flags, one which says the unlocked updates is
allowed, and one which ways it is happening.  Then allow it
while the quiesce completes, and then wait for it to finish.
Reported-and-tested-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

35bfc521

28 9月, 2017 1 次提交

md: separate request handling · 393debc2

由 Shaohua Li 提交于 9月 21, 2017

With commit cc27b0c7, pers->make_request could bail out without handling
the bio. If that happens, we should retry.  The commit fixes md_make_request
but not other call sites. Separate the request handling part, so other call
sites can use it.
Reported-by: NNate Dailey <nate.dailey@stratus.com>
Fix: cc27b0c7(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: stable@vger.kernel.org
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

393debc2

28 8月, 2017 1 次提交

md: Runtime support for multiple ppls · ddc08823

由 Pawel Baldysiak 提交于 8月 16, 2017

Increase PPL area to 1MB and use it as circular buffer to store PPL. The
entry with highest generation number is the latest one. If PPL to be
written is larger then space left in a buffer, rewind the buffer to the
start (don't wrap it).
Signed-off-by: NPawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

ddc08823

24 8月, 2017 1 次提交

block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992

由 Christoph Hellwig 提交于 8月 23, 2017

This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

74d46992

22 7月, 2017 2 次提交

md: raid1-10: move raid1/raid10 common code into raid1-10.c · be453e77

由 Ming Lei 提交于 7月 14, 2017

No function change, just move 'struct resync_pages' and related
helpers into raid1-10.c
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NShaohua Li <shli@fb.com>

be453e77

md: remove 'idx' from 'struct resync_pages' · 022e510f

由 Ming Lei 提交于 7月 14, 2017

bio_add_page() won't fail for resync bio, and the page index for each
bio is same, so remove it.

More importantly the 'idx' of 'struct resync_pages' is initialized in
mempool allocator function, the current way is wrong since mempool is
only responsible for allocation, we can't use that for initialization.
Suggested-by: NNeilBrown <neilb@suse.com>
Reported-by: NNeilBrown <neilb@suse.com>
Reported-and-tested-by: NPatrick <dto@gmx.net>
Fixes: f0250618(md: raid10: don't use bio's vec table to manage resync pages)
Fixes: 98d30c58(md: raid1: don't use bio's vec table to manage resync pages)
Cc: stable@vger.kernel.org (4.12+)
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NShaohua Li <shli@fb.com>

022e510f

11 7月, 2017 1 次提交

md/bitmap: don't read page from device with Bitmap_sync · 4aaf7694

由 Guoqing Jiang 提交于 7月 04, 2017

The device owns Bitmap_sync flag needs recovery
to become in sync, and read page from this type
device could get stale status.

Also add comments for Bitmap_sync bit per the
suggestion from Shaohua and Neil.

Previous disscussion can be found here:
https://marc.info/?t=149760428900004&r=1&w=2Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

4aaf7694

22 6月, 2017 1 次提交

md: use a separate bio_set for synchronous IO. · 5a85071c

由 NeilBrown 提交于 6月 21, 2017

md devices allocate a bio_set and use it for two
distinct purposes.
mddev->bio_set is used to clone bios as part of sending
upper level requests down to lower level devices,
and it is also use for synchronous IO such as superblock
and bitmap updates, and for correcting read errors.

This multiple usage can lead to deadlocks.  It is likely
that cloned bios might be queued for write and to be
waiting for a metadata update before the write can be permitted.
If the cloning exhausted mddev->bio_set, the metadata update
may not be able to proceed.

This scenario has been seen during heavy testing, with lots of IO and
lots of memory pressure.

Address this by adding a new bio_set specifically for synchronous IO.
All synchronous IO goes directly to the underlying device and is not
queued at the md level, so request using entries from the new
mddev->sync_set will complete in a timely fashion.
Requests that use mddev->bio_set will sometimes need to wait
for synchronous IO, but will no longer risk deadlocking that iO.

Also: small simplification in mddev_put(): there is no need to
wait until the spinlock is released before calling bioset_free().
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

5a85071c

14 6月, 2017 1 次提交

md: fix deadlock between mddev_suspend() and md_write_start() · cc27b0c7

由 NeilBrown 提交于 6月 05, 2017

If mddev_suspend() races with md_write_start() we can deadlock
with mddev_suspend() waiting for the request that is currently
in md_write_start() to complete the ->make_request() call,
and md_write_start() waiting for the metadata to be updated
to mark the array as 'dirty'.
As metadata updates done by md_check_recovery() only happen then
the mddev_lock() can be claimed, and as mddev_suspend() is often
called with the lock held, these threads wait indefinitely for each
other.

We fix this by having md_write_start() abort if mddev_suspend()
is happening, and ->make_request() aborts if md_write_start()
aborted.
md_make_request() can detect this abort, decrease the ->active_io
count, and wait for mddev_suspend().
Reported-by: NNix <nix@esperi.org.uk>
Fix: 68866e42(MD: no sync IO while suspended)
Cc: stable@vger.kernel.org
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

cc27b0c7

06 6月, 2017 1 次提交

md: initialise ->writes_pending in personality modules. · a415c0f1

由 NeilBrown 提交于 6月 05, 2017

The new per-cpu counter for writes_pending is initialised in
md_alloc(), which is not called by dm-raid.
So dm-raid fails when md_write_start() is called.

Move the initialization to the personality modules
that need it.  This way it is always initialised when needed,
but isn't unnecessarily initialized (requiring memory allocation)
when the personality doesn't use writes_pending.
Reported-by: NHeinz Mauelshagen <heinzm@redhat.com>
Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

a415c0f1

09 5月, 2017 1 次提交

md: don't return -EAGAIN in md_allow_write for external metadata arrays · 2214c260

由 Artur Paszkiewicz 提交于 5月 08, 2017

This essentially reverts commit b5470dc5 ("md: resolve external
metadata handling deadlock in md_allow_write") with some adjustments.

Since commit 6791875e ("md: make reconfig_mutex optional for writes
to md sysfs files.") changing array_state to 'active' does not use
mddev_lock() and will not cause a deadlock with md_allow_write(). This
revert simplifies userspace tools that write to sysfs attributes like
"stripe_cache_size" or "consistency_policy" because it removes the need
for special handling for external metadata arrays, checking for EAGAIN
and retrying the write.
Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

2214c260

09 4月, 2017 1 次提交

md: support REQ_OP_WRITE_ZEROES · 3deff1a7

由 Christoph Hellwig 提交于 4月 05, 2017

Copy & paste from the REQ_OP_WRITE_SAME code.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

3deff1a7

25 3月, 2017 2 次提交

md: prepare for managing resync I/O pages in clean way · 513e2faa

由 Ming Lei 提交于 3月 17, 2017

Now resync I/O use bio's bec table to manage pages,
this way is very hacky, and may not work any more
once multipage bvec is introduced.

So introduce helpers and new data structure for
managing resync I/O pages more cleanly.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

513e2faa

md: move two macros into md.h · d8e29fbc

由 Ming Lei 提交于 3月 17, 2017

Both raid1 and raid10 share common resync
block size and page count, so move them into md.h.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

d8e29fbc

23 3月, 2017 2 次提交

MD: use per-cpu counter for writes_pending · 4ad23a97

由 NeilBrown 提交于 3月 15, 2017

The 'writes_pending' counter is used to determine when the
array is stable so that it can be marked in the superblock
as "Clean".  Consequently it needs to be updated frequently
but only checked for zero occasionally.  Recent changes to
raid5 cause the count to be updated even more often - once
per 4K rather than once per bio.  This provided
justification for making the updates more efficient.

So we replace the atomic counter a percpu-refcount.
This can be incremented and decremented cheaply most of the
time, and can be switched to "atomic" mode when more
precise counting is needed.  As it is possible for multiple
threads to want a precise count, we introduce a
"sync_checker" counter to count the number of threads
in "set_in_sync()", and only switch the refcount back
to percpu mode when that is zero.

We need to be careful about races between set_in_sync()
setting ->in_sync to 1, and md_write_start() setting it
to zero.  md_write_start() holds the rcu_read_lock()
while checking if the refcount is in percpu mode.  If
it is, then we know a switch to 'atomic' will not happen until
after we call rcu_read_unlock(), in which case set_in_sync()
will see the elevated count, and not set in_sync to 1.
If it is not in percpu mode, we take the mddev->lock to
ensure proper synchronization.

It is no longer possible to quickly check if the count is zero, which
we previously did to update a timer or to schedule the md_thread.
So now we do these every time we decrement that counter, but make
sure they are fast.

mod_timer() already optimizes the case where the timeout value doesn't
actually change.  We leverage that further by always rounding off the
jiffies to the timeout value.  This may delay the marking of 'clean'
slightly, but ensure we only perform atomic operation here when absolutely
needed.

md_wakeup_thread() current always calls wake_up(), even if
THREAD_WAKEUP is already set.  That too can be optimised to avoid
calls to wake_up().
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

4ad23a97

md/raid5: use md_write_start to count stripes, not bios · 49728050

由 NeilBrown 提交于 3月 15, 2017

We use md_write_start() to increase the count of pending writes, and
md_write_end() to decrement the count.  We currently count bios
submitted to md/raid5.  Change it count stripe_heads that a WRITE bio
has been attached to.

So now, raid5_make_request() calls md_write_start() and then
md_write_end() to keep the count elevated during the setup of the
request.

add_stripe_bio() calls md_write_start() for each stripe_head, and the
completion routines always call md_write_end(), instead of only
calling it when raid5_dec_bi_active_stripes() returns 0.
make_discard_request also calls md_write_start/end().

The parallel between md_write_{start,end} and use of bi_phys_segments
can be seen in that:
 Whenever we set bi_phys_segments to 1, we now call md_write_start.
 Whenever we increment it on non-read requests with
   raid5_inc_bi_active_stripes(), we now call md_write_start().
 Whenever we decrement bi_phys_segments on non-read requsts with
    raid5_dec_bi_active_stripes(), we now call md_write_end().

This reduces our dependence on keeping a per-bio count of active
stripes in bi_phys_segments.

md_write_inc() is added which parallels md_write_start(), but requires
that a write has already been started, and is certain never to sleep.
This can be used inside a spinlocked region when adding to a write
request.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

49728050

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功