1. 26 May 2019, 1 commit
    • Revert "MD: fix lock contention for flush bios" · 7f6b9285
      Committed by NeilBrown
      commit 4bc034d35377196c854236133b07730a777c4aba upstream.
      
      This reverts commit 5a409b4f.
      
      This patch has two problems.
      
       1/ it makes multiple calls to submit_bio() from inside a make_request_fn.
       The bios thus submitted will be queued on current->bio_list and not
       submitted immediately.  As the bios are allocated from a mempool,
       this can theoretically result in a deadlock - all the pool of requests
       could be in various ->bio_list queues and a subsequent mempool_alloc
       could block waiting for one of them to be released.
      
      2/ It aims to handle a case when there are many concurrent flush requests.
        It handles this by submitting many requests in parallel - all of which
        are identical and so most of which do nothing useful.
        It would be more efficient to just send one lower-level request, but
        allow that to satisfy multiple upper-level requests.
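       The batching idea in point 2/ — one low-level flush satisfying every
       request that arrived before it started — can be modelled in a few lines
       of userspace C. All names here (flush_state, flush_request, and so on)
       are invented for illustration; this is a sketch of the idea, not the md
       implementation:

       ```c
       #include <assert.h>

       /* Ticket-based flush coalescing: a caller is satisfied by any
        * low-level flush that *starts* after its request was made. */
       struct flush_state {
           unsigned long issued;    /* flushes started */
           unsigned long completed; /* flushes finished */
       };

       /* Record a flush request; returns a ticket to wait on. */
       static unsigned long flush_request(struct flush_state *s)
       {
           return s->issued;
       }

       /* Run one low-level flush; it covers all earlier tickets. */
       static void flush_run(struct flush_state *s)
       {
           s->issued++;
           s->completed++;
       }

       static int flush_done(const struct flush_state *s, unsigned long ticket)
       {
           return s->completed > ticket;
       }

       int main(void)
       {
           struct flush_state s = {0, 0};
           unsigned long t1 = flush_request(&s);
           unsigned long t2 = flush_request(&s); /* concurrent request */
           flush_run(&s);                        /* one device flush */
           assert(flush_done(&s, t1));           /* both are satisfied */
           assert(flush_done(&s, t2));
           return 0;
       }
       ```

       With this scheme many concurrent upper-level flushes cost one
       lower-level request, which is the efficiency argument made above.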
      
      Fixes: 5a409b4f ("MD: fix lock contention for flush bios")
      Cc: <stable@vger.kernel.org> # v4.19+
       Tested-by: Xiao Ni <xni@redhat.com>
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Song Liu <songliubraving@fb.com>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
       Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7f6b9285
  2. 14 November 2018, 4 commits
  3. 02 August 2018, 1 commit
  4. 25 July 2018, 1 commit
  5. 18 July 2018, 2 commits
    • block: Add and use op_stat_group() for indexing disk_stat fields. · ddcf35d3
      Committed by Michael Callahan
      Add and use a new op_stat_group() function for indexing partition stat
      fields rather than indexing them by rq_data_dir() or bio_data_dir().
      This function works similarly to op_is_sync() in that it takes the
      request::cmd_flags or bio::bi_opf flags and determines which stats
       should get updated.
      
      In addition, the second parameter to generic_start_io_acct() and
      generic_end_io_acct() is now a REQ_OP rather than simply a read or
      write bit and it uses op_stat_group() on the parameter to determine
      the stat group.
      
      Note that the partition in_flight counts are not part of the per-cpu
      statistics and as such are not indexed via this function.  It's now
      indexed by op_is_write().
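       As a rough userspace sketch of what an op_stat_group()-style helper
       does — mapping an operation code to the read or write statistics
       bucket — consider the following. The enum values and names are
       simplified stand-ins, not the kernel's definitions:

       ```c
       #include <assert.h>

       /* Simplified operation codes and stat groups (invented values). */
       enum { OP_READ = 0, OP_WRITE = 1, OP_FLUSH = 2, OP_DISCARD = 3 };
       enum { STAT_READ = 0, STAT_WRITE = 1 };

       /* In this model, anything that is not a read modifies media. */
       static int op_is_write(int op) { return op != OP_READ; }

       /* Map an op to the stat group it should be accounted in. */
       static int op_stat_group(int op)
       {
           return op_is_write(op) ? STAT_WRITE : STAT_READ;
       }

       int main(void)
       {
           assert(op_stat_group(OP_READ) == STAT_READ);
           assert(op_stat_group(OP_WRITE) == STAT_WRITE);
           assert(op_stat_group(OP_FLUSH) == STAT_WRITE);
           return 0;
       }
       ```

       The point of the real helper is the same: callers index stat fields by
       the op itself rather than by rq_data_dir() or bio_data_dir().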
      
      tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.
       Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
       Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Matias Bjorling <mb@lightnvm.io>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Alasdair Kergon <agk@redhat.com>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ddcf35d3
    • block: Add part_stat_read_accum to read across field entries. · 59767fbd
      Committed by Michael Callahan
      Add a part_stat_read_accum macro to genhd.h to read and sum across
       field entries.  For example, to sum up the number of read and write
       sectors completed.  In addition to being a reasonable cleanup by
       itself, this will make it easier to add new stat fields in the future.
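       A toy model of the macro's idea — summing one statistics field across
       the stat groups — might look like this. The struct layout is invented
       for illustration, not the kernel's per-cpu disk_stats:

       ```c
       #include <assert.h>

       #define NR_STAT_GROUPS 2 /* [0] = read, [1] = write (simplified) */

       struct disk_stats {
           unsigned long sectors[NR_STAT_GROUPS];
           unsigned long ios[NR_STAT_GROUPS];
       };

       /* Sum one field across all stat groups, as the real
        * part_stat_read_accum macro does for a partition. */
       #define part_stat_read_accum(st, field) \
           ((st)->field[0] + (st)->field[1])

       int main(void)
       {
           struct disk_stats st = {
               .sectors = { 100, 40 },
               .ios     = { 9, 3 },
           };
           assert(part_stat_read_accum(&st, sectors) == 140);
           assert(part_stat_read_accum(&st, ios) == 12);
           return 0;
       }
       ```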
      
      tj: Refreshed on top of v4.17.
       Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
       Signed-off-by: Tejun Heo <tj@kernel.org>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      59767fbd
  6. 06 July 2018, 1 commit
    • md-cluster: show array's status more accurate · 0357ba27
      Committed by Guoqing Jiang
      When resync or recovery is happening in one node,
      other nodes don't show the appropriate info now.
      
       For example, when an array is created on the master node
       without "--assume-clean" and then assembled on the slave
       nodes, reading /proc/mdstat on a slave shows
       "resync=PENDING". This info is confusing, since the
       "PENDING" status was introduced for arrays started
       in read-only mode.
      
       We introduce a RESYNCING_REMOTE flag to indicate that the
       resync thread is running on a remote node. The flag is
       set when a node receives a RESYNCING msg, and the REMOTE
       flag is cleared in the following cases:
      
       1. resync or recovery is finished on the master node,
          which means the slaves receive a msg with both lo
          and hi set to 0.
      2. node continues resync/recovery in recover_bitmaps.
      3. when resync_finish is called.
      
       Then we show accurate information in status_resync
       by checking the REMOTE flag along with other conditions.
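       The flag handling described above can be sketched as a small userspace
       model. The names, the flag bit, and the (lo, hi) message format are
       invented stand-ins for illustration, not the md-cluster code:

       ```c
       #include <assert.h>
       #include <string.h>

       enum { RESYNCING_REMOTE = 1 << 0 };

       /* Model of receiving a RESYNCING msg: lo == hi == 0 means the
        * master has finished, so the REMOTE flag is cleared. */
       static void recv_resyncing_msg(unsigned *flags, unsigned long lo,
                                      unsigned long hi)
       {
           if (lo == 0 && hi == 0)
               *flags &= ~RESYNCING_REMOTE;
           else
               *flags |= RESYNCING_REMOTE;
       }

       /* Status reporting can now distinguish a remote resync from a
        * local read-only PENDING state. */
       static const char *resync_status(unsigned flags, int local_pending)
       {
           if (flags & RESYNCING_REMOTE)
               return "resync=REMOTE";
           return local_pending ? "resync=PENDING" : "idle";
       }

       int main(void)
       {
           unsigned flags = 0;
           recv_resyncing_msg(&flags, 0, 1024); /* resync running remotely */
           assert(strcmp(resync_status(flags, 1), "resync=REMOTE") == 0);
           recv_resyncing_msg(&flags, 0, 0);    /* master finished */
           assert(strcmp(resync_status(flags, 1), "resync=PENDING") == 0);
           return 0;
       }
       ```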
       Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
       Reviewed-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      0357ba27
  7. 19 June 2018, 1 commit
  8. 08 June 2018, 1 commit
    • md: Unify mddev destruction paths · 28dec870
      Committed by Kent Overstreet
       Previously, mddev_put() had a couple of different paths for freeing an
       mddev, due to the fact that the kobject wasn't initialized when the
      mddev was first allocated. If we move the kobject_init() to when it's
      first allocated and just use kobject_add() later, we can clean all this
      up.
      
      This also removes a hack in mddev_put() to avoid freeing biosets under a
      spinlock, which involved copying biosets on the stack after the reset
      bioset_init() changes.
       Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
       Signed-off-by: Jens Axboe <axboe@kernel.dk>
      28dec870
  9. 31 May 2018, 1 commit
  10. 22 May 2018, 1 commit
    • MD: fix lock contention for flush bios · 5a409b4f
      Committed by Xiao Ni
       There is lock contention when many processes send flush bios
       to an md device, e.g. when creating many lvs on one raid device and
       running mkfs.xfs on each lv.
       
       Currently flush requests are handled sequentially: each one has to wait for
       mddev->flush_bio to become NULL before it can get mddev->lock.
       
       This patch removes mddev->flush_bio and handles flush bios asynchronously.
       I did a test with the command dbench -s 128 -t 300. This is the test result:
      
      =================Without the patch============================
       Operation                Count    AvgLat    MaxLat
       --------------------------------------------------
       Flush                    11165   167.595  5879.560
       Close                   107469     1.391  2231.094
       LockX                      384     0.003     0.019
       Rename                    5944     2.141  1856.001
       ReadX                   208121     0.003     0.074
       WriteX                   98259  1925.402 15204.895
       Unlink                   25198    13.264  3457.268
       UnlockX                    384     0.001     0.009
       FIND_FIRST               47111     0.012     0.076
       SET_FILE_INFORMATION     12966     0.007     0.065
       QUERY_FILE_INFORMATION   27921     0.004     0.085
       QUERY_PATH_INFORMATION  124650     0.005     5.766
       QUERY_FS_INFORMATION     22519     0.003     0.053
       NTCreateX               141086     4.291  2502.812
      
      Throughput 3.7181 MB/sec (sync open)  128 clients  128 procs  max_latency=15204.905 ms
      
      =================With the patch============================
       Operation                Count    AvgLat    MaxLat
       --------------------------------------------------
       Flush                     4500   174.134   406.398
       Close                    48195     0.060   467.062
       LockX                      256     0.003     0.029
       Rename                    2324     0.026     0.360
       ReadX                    78846     0.004     0.504
       WriteX                   66832   562.775  1467.037
       Unlink                    5516     3.665  1141.740
       UnlockX                    256     0.002     0.019
       FIND_FIRST               16428     0.015     0.313
       SET_FILE_INFORMATION      6400     0.009     0.520
       QUERY_FILE_INFORMATION   17865     0.003     0.089
       QUERY_PATH_INFORMATION   47060     0.078   416.299
       QUERY_FS_INFORMATION      7024     0.004     0.032
       NTCreateX                55921     0.854  1141.452
      
      Throughput 11.744 MB/sec (sync open)  128 clients  128 procs  max_latency=1467.041 ms
      
       The test was done on a raid1 array of two rotational disks.
      
       V5: V4 is more complicated than the version with the memory pool, so revert to the
       memory pool version.
      
      V4: use address of fbio to do hash to choose free flush info.
      V3:
       Shaohua suggests mempool is overkill. In v3 it allocates memory when creating the raid
       device and uses a simple bitmap to record which resources are free.
      
      Fix a bug from v2. It should set flush_pending to 1 at first.
      
      V2:
       Neil pointed out two problems. One is a counting error and the other is the return
       value when allocating memory fails.
      1. counting error problem
      This isn't safe.  It is only safe to call rdev_dec_pending() on rdevs
      that you previously called
                                atomic_inc(&rdev->nr_pending);
      If an rdev was added to the list between the start and end of the flush,
      this will do something bad.
      
      Now it doesn't use bio_chain. It uses specified call back function for each
      flush bio.
       2. Returning an IO error when kmalloc fails is wrong.
       I use the mempool suggested by Neil in V2.
       3. Fixed some places pointed out by Guoqing.
       Suggested-by: Ming Lei <ming.lei@redhat.com>
       Signed-off-by: Xiao Ni <xni@redhat.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      5a409b4f
  11. 18 May 2018, 1 commit
    • md: fix NULL dereference of mddev->pers in remove_and_add_spares() · c42a0e26
      Committed by Yufen Yu
      We met NULL pointer BUG as follow:
      
      [  151.760358] BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
      [  151.761340] PGD 80000001011eb067 P4D 80000001011eb067 PUD 1011ea067 PMD 0
      [  151.762039] Oops: 0000 [#1] SMP PTI
      [  151.762406] Modules linked in:
      [  151.762723] CPU: 2 PID: 3561 Comm: mdadm-test Kdump: loaded Not tainted 4.17.0-rc1+ #238
      [  151.763542] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
      [  151.764432] RIP: 0010:remove_and_add_spares.part.56+0x13c/0x3a0
      [  151.765061] RSP: 0018:ffffc90001d7fcd8 EFLAGS: 00010246
      [  151.765590] RAX: 0000000000000000 RBX: ffff88013601d600 RCX: 0000000000000000
      [  151.766306] RDX: 0000000000000000 RSI: ffff88013601d600 RDI: ffff880136187000
      [  151.767014] RBP: ffff880136187018 R08: 0000000000000003 R09: 0000000000000051
      [  151.767728] R10: ffffc90001d7fed8 R11: 0000000000000000 R12: ffff88013601d600
      [  151.768447] R13: ffff8801298b1300 R14: ffff880136187000 R15: 0000000000000000
      [  151.769160] FS:  00007f2624276700(0000) GS:ffff88013ae80000(0000) knlGS:0000000000000000
      [  151.769971] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  151.770554] CR2: 0000000000000060 CR3: 0000000111aac000 CR4: 00000000000006e0
      [  151.771272] Call Trace:
      [  151.771542]  md_ioctl+0x1df2/0x1e10
      [  151.771906]  ? __switch_to+0x129/0x440
      [  151.772295]  ? __schedule+0x244/0x850
      [  151.772672]  blkdev_ioctl+0x4bd/0x970
      [  151.773048]  block_ioctl+0x39/0x40
      [  151.773402]  do_vfs_ioctl+0xa4/0x610
      [  151.773770]  ? dput.part.23+0x87/0x100
      [  151.774151]  ksys_ioctl+0x70/0x80
      [  151.774493]  __x64_sys_ioctl+0x16/0x20
      [  151.774877]  do_syscall_64+0x5b/0x180
      [  151.775258]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
       For raid6, when two disks of the array are offline, two spare disks can
       be added into the array. If the system reboots before the spare
       recovery completes, mdadm thinks it is ok to restart the degraded
       array via md_ioctl(). Since the disks in raid6 are not only_parity(),
       raid5_run() will abort when there is no PPL feature and the
       'start_dirty_degraded' parameter is not set. Therefore, mddev->pers is NULL.
       
       But mddev->raid_disks has been set and is not cleared when
       raid5_run() aborts. md_ioctl() can then execute the 'HOT_REMOVE_DISK' cmd
       from mdadm to remove a disk, which finally causes the NULL pointer
       dereference in remove_and_add_spares().
       Signed-off-by: Yufen Yu <yuyufen@huawei.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      c42a0e26
  12. 02 May 2018, 1 commit
    • md: fix two problems with setting the "re-add" device state. · 011abdc9
      Committed by NeilBrown
      If "re-add" is written to the "state" file for a device
      which is faulty, this has an effect similar to removing
      and re-adding the device.  It should take up the
      same slot in the array that it previously had, and
      an accelerated (e.g. bitmap-based) rebuild should happen.
      
      The slot that "it previously had" is determined by
      rdev->saved_raid_disk.
      However this is not set when a device fails (only when a device
      is added), and it is cleared when resync completes.
      This means that "re-add" will normally work once, but may not work a
      second time.
      
      This patch includes two fixes.
      1/ when a device fails, record the ->raid_disk value in
          ->saved_raid_disk before clearing ->raid_disk
      2/ when "re-add" is written to a device for which
          ->saved_raid_disk is not set, fail.
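       The two fixes can be sketched with invented types in a few lines of
       userspace C. This is a model of the logic only, not the md
       implementation (struct rdev here is a stand-in, and -1 stands for
       "no slot known"):

       ```c
       #include <assert.h>

       struct rdev {
           int raid_disk;       /* slot in the array, -1 if none */
           int saved_raid_disk; /* slot to reclaim on re-add, -1 if unset */
       };

       /* Fix 1: on failure, record the slot before clearing it. */
       static void rdev_fail(struct rdev *r)
       {
           r->saved_raid_disk = r->raid_disk;
           r->raid_disk = -1;
       }

       /* Fix 2: refuse "re-add" when no saved slot was recorded,
        * rather than silently doing a full rebuild in a new slot. */
       static int rdev_readd(struct rdev *r)
       {
           if (r->saved_raid_disk < 0)
               return -1;
           r->raid_disk = r->saved_raid_disk;
           return 0;
       }

       int main(void)
       {
           struct rdev r = { .raid_disk = 2, .saved_raid_disk = -1 };
           rdev_fail(&r);
           assert(rdev_readd(&r) == 0 && r.raid_disk == 2);

           struct rdev unset = { .raid_disk = -1, .saved_raid_disk = -1 };
           assert(rdev_readd(&unset) == -1); /* re-add must fail */
           return 0;
       }
       ```

       Because the slot is now always recorded on failure, "re-add" keeps
       working on the second and later failures, not just the first.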
      
      I think this is suitable for stable as it can
      cause re-adding a device to be forced to do a full
      resync which takes a lot longer and so puts data at
      more risk.
      
      Cc: <stable@vger.kernel.org> (v4.1)
      Fixes: 97f6cd39 ("md-cluster: re-add capabilities")
       Signed-off-by: NeilBrown <neilb@suse.com>
       Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      011abdc9
  13. 09 April 2018, 1 commit
  14. 09 March 2018, 1 commit
  15. 01 March 2018, 1 commit
  16. 26 February 2018, 1 commit
    • md: fix a potential deadlock of raid5/raid10 reshape · 8876391e
      Committed by BingJing Chang
       There is a potential deadlock if mount/umount happens when
       raid5_finish_reshape() tries to grow the size of the emulated disk.
       
       How does the deadlock happen?
      1) The raid5 resync thread finished reshape (expanding array).
       2) The mount or umount thread holds the VFS sb->s_umount lock and tries
          to write critical data through to the raid5 emulated block device,
          so it waits for the raid5 kernel thread to handle stripes in order
          to finish its I/Os.
      3) In the routine of raid5 kernel thread, md_check_recovery() will be
         called first in order to reap the raid5 resync thread. That is,
         raid5_finish_reshape() will be called. In this function, it will try
         to update conf and call VFS revalidate_disk() to grow the raid5
         emulated block device. It will try to acquire VFS sb->s_umount lock.
      The raid5 kernel thread cannot continue, so no one can handle mount/
      umount I/Os (stripes). Once the write-through I/Os cannot be finished,
      mount/umount will not release sb->s_umount lock. The deadlock happens.
      
       The raid5 kernel thread services an emulated block device. It is
       responsible for handling I/Os (stripes) from upper layers. The emulated
       block device should not request any I/Os on itself. That is, it should
       not call VFS layer functions. (If it did, it would try to acquire VFS
       locks to guarantee the I/O sequence.) So we have the resync thread to
       send resync I/O requests and to wait for the results.
      
      For solving this potential deadlock, we can put the size growth of the
      emulated block device as the final step of reshape thread.
      
      2017/12/29:
      Thanks to Guoqing Jiang <gqjiang@suse.com>,
      we confirmed that there is the same deadlock issue in raid10. It's
       reproducible and can be fixed by this patch. For raid10.c, we can remove
       the similar code to prevent the deadlock as well, since it has been
       called before.
       Reported-by: Alex Wu <alexwu@synology.com>
       Reviewed-by: Alex Wu <alexwu@synology.com>
       Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
       Signed-off-by: BingJing Chang <bingjingc@synology.com>
       Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
      8876391e
  17. 20 February 2018, 1 commit
    • md: only allow remove_and_add_spares when no sync_thread running. · 39772f0a
      Committed by NeilBrown
      The locking protocols in md assume that a device will
      never be removed from an array during resync/recovery/reshape.
      When that isn't happening, rcu or reconfig_mutex is needed
      to protect an rdev pointer while taking a refcount.  When
      it is happening, that protection isn't needed.
      
       Unfortunately there are cases where remove_and_add_spares() is
       called when recovery might be happening: in state_store(),
       slot_store() and hot_remove_disk().
      In each case, this is just an optimization, to try to expedite
      removal from the personality so the device can be removed from
      the array.  If resync etc is happening, we just have to wait
       for md_check_recovery() to find a suitable time to call
      remove_and_add_spares().
      
       This optimization is not essential, so it doesn't
       matter if it fails.
      So change remove_and_add_spares() to abort early if
      resync/recovery/reshape is happening, unless it is called
      from md_check_recovery() as part of a newly started recovery.
      The parameter "this" is only NULL when called from
      md_check_recovery() so when it is NULL, there is no need to abort.
      
      As this can result in a NULL dereference, the fix is suitable
      for -stable.
      
      cc: yuyufen <yuyufen@huawei.com>
      Cc: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Fixes: 8430e7e0 ("md: disconnect device from personality before trying to remove it.")
       Cc: stable@vger.kernel.org (v4.8+)
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
      39772f0a
  18. 19 February 2018, 1 commit
  19. 18 February 2018, 1 commit
  20. 12 February 2018, 1 commit
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Committed by Linus Torvalds
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
       values as the POLL* constants do.  But the key word here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
       Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9a08845
  21. 16 January 2018, 1 commit
    • raid5-ppl: PPL support for disks with write-back cache enabled · 1532d9e8
      Committed by Tomasz Majchrzak
       In order to provide data consistency with PPL for disks with write-back
       cache enabled, all data has to be flushed to disks before the next PPL
       entry. The disks to be flushed are marked in a bitmap. It's modified
       under a mutex and it's only read after the PPL io unit is submitted.
      
       A limitation of 64 disks in the array has been introduced to keep the
       data structures and implementation simple. RAID5 arrays with so many
       disks are not likely due to the high risk of multiple disk failures.
       Such a restriction should not be a real-life limitation.
      
       With write-back cache disabled, the next PPL entry is submitted when the
       data write for the current one completes. A data flush defers the next
       log submission, so trigger it when no stripes to handle are found.
      
       As PPL assures all data is flushed to disk at request completion, just
       acknowledge flush requests when PPL is enabled.
       Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
       Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
      1532d9e8
  22. 12 December 2017, 1 commit
    • md: introduce new personality function start() · d5d885fd
      Committed by Song Liu
      In do_md_run(), md threads should not wake up until the array is fully
      initialized in md_run(). However, in raid5_run(), raid5-cache may wake
      up mddev->thread to flush stripes that need to be written back. This
       design doesn't break badly right now. But it could lead to bad bugs in
      the future.
      
       This patch tries to resolve this problem by splitting the startup work
       into two personality functions, run() and start(). Tasks that do not
       require the md threads go into run(), while tasks that require
       the md threads go into start().
      
      r5l_load_log() is moved to raid5_start(), so it is not called until
      the md threads are started in do_md_run().
       Signed-off-by: Song Liu <songliubraving@fb.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      d5d885fd
  23. 02 December 2017, 1 commit
    • md: limit mdstat resync progress to max_sectors · d2e2ec82
      Committed by Nate Dailey
      There is a small window near the end of md_do_sync where mddev->curr_resync
      can be equal to MaxSector.
      
      If status_resync is called during this window, the resulting /proc/mdstat
      output contains a HUGE number of = signs due to the very large curr_resync:
      
      Personalities : [raid1]
      md123 : active raid1 sdd3[2] sdb3[0]
        204736 blocks super 1.0 [2/1] [U_]
        [=====================================================================
         ... (82 MB more) ...
         ================>]  recovery =429496729.3% (9223372036854775807/204736)
         finish=0.2min speed=12796K/sec
        bitmap: 0/1 pages [0KB], 65536KB chunk
      
      Modify status_resync to ensure the resync variable doesn't exceed
      the array's max_sectors.
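       The fix amounts to clamping the resync position before computing the
       progress bar. Here is a simplified userspace sketch of that idea; the
       types and the function name are approximations for illustration, not
       the actual status_resync() code:

       ```c
       #include <assert.h>

       typedef unsigned long long sector_t;
       #define MaxSector ((sector_t)~0ULL)

       /* Clamp curr_resync so a transient MaxSector value (the window
        * near the end of md_do_sync) cannot yield an absurd percentage
        * like 429496729.3% in /proc/mdstat. */
       static sector_t resync_progress(sector_t curr_resync,
                                       sector_t max_sectors)
       {
           sector_t resync = curr_resync;

           if (resync > max_sectors)
               resync = max_sectors;
           return resync;
       }

       int main(void)
       {
           /* normal mid-resync position is untouched */
           assert(resync_progress(1000, 204736) == 1000);
           /* the transient MaxSector value is clamped */
           assert(resync_progress(MaxSector, 204736) == 204736);
           return 0;
       }
       ```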
       Signed-off-by: Nate Dailey <nate.dailey@stratus.com>
       Acked-by: Guoqing Jiang <gqjiang@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      d2e2ec82
  24. 29 November 2017, 1 commit
  25. 15 November 2017, 1 commit
  26. 11 November 2017, 1 commit
  27. 09 November 2017, 1 commit
    • md: be cautious about using ->curr_resync_completed for ->recovery_offset · db0505d3
      Committed by NeilBrown
       The ->recovery_offset shows how much of a non-InSync device is actually
       in sync - how much has been recovered.
      
      When performing a recovery, ->curr_resync and ->curr_resync_completed
      follow the device address being recovered and so can be used to update
      ->recovery_offset.
      
      When performing a reshape, ->curr_resync* might follow the device
      addresses (raid5) or might follow array addresses (raid10), so cannot
      in general be used to set ->recovery_offset.  When reshaping backwards,
       ->curr_resync* measures from the *end* of the array-or-device, so is
      particularly unhelpful.
      
       So change the common code in md.c to only use ->curr_resync_completed
      for the simple recovery case, and add code to raid5.c to update
      ->recovery_offset during a forwards reshape.
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      db0505d3
  28. 02 November 2017, 9 commits
    • md: don't check MD_SB_CHANGE_CLEAN in md_allow_write · b90f6ff0
      Committed by Artur Paszkiewicz
      Only MD_SB_CHANGE_PENDING should be used to wait for transition from
       clean to dirty. Also checking MD_SB_CHANGE_CLEAN is unnecessary and can
      race with e.g. md_do_sync(). This sporadically causes a hang when
      changing consistency policy during resync:
      
      INFO: task mdadm:6183 blocked for more than 30 seconds.
            Not tainted 4.14.0-rc3+ #391
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      mdadm           D12752  6183   6022 0x00000000
      Call Trace:
       __schedule+0x93f/0x990
       schedule+0x6b/0x90
       md_allow_write+0x100/0x130 [md_mod]
       ? do_wait_intr_irq+0x90/0x90
       resize_stripes+0x3a/0x5b0 [raid456]
       ? kernfs_fop_write+0xbe/0x180
       raid5_change_consistency_policy+0xa6/0x200 [raid456]
       consistency_policy_store+0x2e/0x70 [md_mod]
       md_attr_store+0x90/0xc0 [md_mod]
       sysfs_kf_write+0x42/0x50
       kernfs_fop_write+0x119/0x180
       __vfs_write+0x28/0x110
       ? rcu_sync_lockdep_assert+0x12/0x60
       ? __sb_start_write+0x15a/0x1c0
       ? vfs_write+0xa3/0x1a0
       vfs_write+0xb4/0x1a0
       SyS_write+0x49/0xa0
       entry_SYSCALL_64_fastpath+0x18/0xad
      
      Fixes: 2214c260 ("md: don't return -EAGAIN in md_allow_write for external metadata arrays")
      Cc: <stable@vger.kernel.org>
       Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      b90f6ff0
    • md: use lockdep_assert_held · efa4b77b
      Committed by Shaohua Li
       lockdep_assert_held is a better way to assert that a lock is held, and
       it also works on UP.
       Signed-off-by: Shaohua Li <shli@fb.com>
      efa4b77b
    • md: remove special meaning of ->quiesce(.., 2) · b03e0ccb
      Committed by NeilBrown
      The '2' argument means "wake up anything that is waiting".
      This is an inelegant part of the design and was added
      to help support management of suspend_lo/suspend_hi setting.
      Now that suspend_lo/hi is managed in mddev_suspend/resume,
      that need is gone.
       There are still a couple of places where we call 'quiesce'
       with an argument of '2', but they can safely be changed to
       call ->quiesce(.., 1); ->quiesce(.., 0), which
       achieves the same result at the small cost of pausing IO
       briefly.
      
      This removes a small "optimization" from suspend_{hi,lo}_store,
      but it isn't clear that optimization served a useful purpose.
      The code now is a lot clearer.
       Suggested-by: Shaohua Li <shli@kernel.org>
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      b03e0ccb
    • md: allow metadata update while suspending. · 35bfc521
      Committed by NeilBrown
      There are various deadlocks that can occur
      when a thread holds reconfig_mutex and calls
      ->quiesce(mddev, 1).
       As some write requests block waiting for
      metadata to be updated (e.g. to record device
      failure), and as the md thread updates the metadata
      while the reconfig mutex is held, holding the mutex
      can stop write requests completing, and this prevents
      ->quiesce(mddev, 1) from completing.
      
      ->quiesce() is now usually called from mddev_suspend(),
      and it is always called with reconfig_mutex held.  So
      at this time it is safe for the thread to update metadata
      without explicitly taking the lock.
      
       So add 2 new flags, one which says that unlocked updates are
       allowed, and one which says an update is happening.  Then allow
       updates while the quiesce completes, and then wait for them to finish.
       Reported-and-tested-by: Xiao Ni <xni@redhat.com>
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      35bfc521
    • md: use mddev_suspend/resume instead of ->quiesce() · 9e1cc0a5
      Committed by NeilBrown
      mddev_suspend() is a more general interface than
       calling ->quiesce() and so is more extensible.  A
      future patch will make use of this.
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      9e1cc0a5
    • md: move suspend_hi/lo handling into core md code · b3143b9a
      Committed by NeilBrown
       Responding to ->suspend_lo and ->suspend_hi is similar
       to responding to ->suspended.  It is best to wait in
       the common core code without incrementing ->active_io.
       This allows mddev_suspend()/mddev_resume() to work while
       requests are waiting for suspend_lo/hi to change.
       This will be important after a subsequent patch
       which uses mddev_suspend() to synchronize updating of
       suspend_lo/hi.
      
      So move the code for testing suspend_lo/hi out of raid1.c
      and raid5.c, and place it in md.c
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      b3143b9a
    • md: don't call bitmap_create() while array is quiesced. · 52a0d49d
      Committed by NeilBrown
      bitmap_create() allocates memory with GFP_KERNEL and
      so can wait for IO.
      If called while the array is quiesced, it could wait indefinitely
      for write out to the array - deadlock.
      So call bitmap_create() before quiescing the array.
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      52a0d49d
    • md: always hold reconfig_mutex when calling mddev_suspend() · 4d5324f7
      Committed by NeilBrown
      Most often mddev_suspend() is called with
      reconfig_mutex held.  Make this a requirement in
       preparation for a subsequent patch.  Also require
      reconfig_mutex to be held for mddev_resume(),
      partly for symmetry and partly to guarantee
      no races with incr/decr of mddev->suspend.
      
      Taking the mutex in r5c_disable_writeback_async() is
      a little tricky as this is called from a work queue
      via log->disable_writeback_work, and flush_work()
      is called on that while holding ->reconfig_mutex.
      If the work item hasn't run before flush_work()
      is called, the work function will not be able to
      get the mutex.
      
      So we use mddev_trylock() inside the wait_event() call, and have that
      abort when conf->log is set to NULL, which happens before
      flush_work() is called.
      We wait in mddev->sb_wait and ensure this is woken
      when any of the conditions change.  This requires
      waking mddev->sb_wait in mddev_unlock().  This is only
       likely to trigger extra wake_ups of threads that needn't
      be woken when metadata is being written, and that
      doesn't happen often enough that the cost would be
      noticeable.
       Signed-off-by: NeilBrown <neilb@suse.com>
       Signed-off-by: Shaohua Li <shli@fb.com>
      4d5324f7
    • md: forbid a RAID5 from having both a bitmap and a journal. · 230b55fa
      Committed by NeilBrown
      Having both a bitmap and a journal is pointless.
      Attempting to do so can corrupt the bitmap if the journal
      replay happens before the bitmap is initialized.
      Rather than try to avoid this corruption, simply
      refuse to allow arrays with both a bitmap and a journal.
      So:
       - if raid5_run sees both are present, fail.
       - if adding a bitmap finds a journal is present, fail
       - if adding a journal finds a bitmap is present, fail.
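       The three refusal points above reduce to one invariant, sketched here
       as an invented predicate (md_config_valid is a name made up for this
       illustration, not a kernel function):

       ```c
       #include <assert.h>

       /* Invariant: an md array may have a bitmap or a journal, never
        * both, enforced at raid5_run and at bitmap/journal add time. */
       static int md_config_valid(int has_bitmap, int has_journal)
       {
           return !(has_bitmap && has_journal);
       }

       int main(void)
       {
           assert(md_config_valid(1, 0));  /* bitmap only: allowed */
           assert(md_config_valid(0, 1));  /* journal only: allowed */
           assert(!md_config_valid(1, 1)); /* both: refused */
           return 0;
       }
       ```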
      
      Cc: stable@vger.kernel.org (4.10+)
       Signed-off-by: NeilBrown <neilb@suse.com>
       Tested-by: Joshua Kinard <kumba@gentoo.org>
       Acked-by: Joshua Kinard <kumba@gentoo.org>
       Signed-off-by: Shaohua Li <shli@fb.com>
      230b55fa