提交 · b03e0ccb5ab9df3efbe51c87843a1ffbecbafa1f · openeuler / Kernel

02 11月, 2017 7 次提交

md: remove special meaning of ->quiesce(.., 2) · b03e0ccb

由 NeilBrown 提交于 10月 19, 2017

The '2' argument means "wake up anything that is waiting".
This is an inelegant part of the design and was added
to help support management of suspend_lo/suspend_hi setting.
Now that suspend_lo/hi is managed in mddev_suspend/resume,
that need is gone.
These is still a couple of places where we call 'quiesce'
with an argument of '2', but they can safely be changed to
call ->quiesce(.., 1); ->quiesce(.., 0) which
achieve the same result at the small cost of pausing IO
briefly.

This removes a small "optimization" from suspend_{hi,lo}_store,
but it isn't clear that optimization served a useful purpose.
The code now is a lot clearer.
Suggested-by: NShaohua Li <shli@kernel.org>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

b03e0ccb

md: allow metadata update while suspending. · 35bfc521

由 NeilBrown 提交于 10月 17, 2017

There are various deadlocks that can occur
when a thread holds reconfig_mutex and calls
->quiesce(mddev, 1).
As some write request block waiting for
metadata to be updated (e.g. to record device
failure), and as the md thread updates the metadata
while the reconfig mutex is held, holding the mutex
can stop write requests completing, and this prevents
->quiesce(mddev, 1) from completing.

->quiesce() is now usually called from mddev_suspend(),
and it is always called with reconfig_mutex held.  So
at this time it is safe for the thread to update metadata
without explicitly taking the lock.

So add 2 new flags, one which says the unlocked updates is
allowed, and one which ways it is happening.  Then allow it
while the quiesce completes, and then wait for it to finish.
Reported-and-tested-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

35bfc521

md: use mddev_suspend/resume instead of ->quiesce() · 9e1cc0a5

由 NeilBrown 提交于 10月 17, 2017

mddev_suspend() is a more general interface than
calling ->quiesce() and is so more extensible.  A
future patch will make use of this.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

9e1cc0a5

md: move suspend_hi/lo handling into core md code · b3143b9a

由 NeilBrown 提交于 10月 17, 2017

responding to ->suspend_lo and ->suspend_hi is similar
to responding to ->suspended.  It is best to wait in
the common core code without incrementing ->active_io.
This allows mddev_suspend()/mddev_resume() to work while
requests are waiting for suspend_lo/hi to change.
This is will be important after a subsequent patch
which uses mddev_suspend() to synchronize updating for
suspend_lo/hi.

So move the code for testing suspend_lo/hi out of raid1.c
and raid5.c, and place it in md.c
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

b3143b9a

md: don't call bitmap_create() while array is quiesced. · 52a0d49d

由 NeilBrown 提交于 10月 17, 2017

bitmap_create() allocates memory with GFP_KERNEL and
so can wait for IO.
If called while the array is quiesced, it could wait indefinitely
for write out to the array - deadlock.
So call bitmap_create() before quiescing the array.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

52a0d49d

md: always hold reconfig_mutex when calling mddev_suspend() · 4d5324f7

由 NeilBrown 提交于 10月 19, 2017

Most often mddev_suspend() is called with
reconfig_mutex held.  Make this a requirement in
preparation a subsequent patch.  Also require
reconfig_mutex to be held for mddev_resume(),
partly for symmetry and partly to guarantee
no races with incr/decr of mddev->suspend.

Taking the mutex in r5c_disable_writeback_async() is
a little tricky as this is called from a work queue
via log->disable_writeback_work, and flush_work()
is called on that while holding ->reconfig_mutex.
If the work item hasn't run before flush_work()
is called, the work function will not be able to
get the mutex.

So we use mddev_trylock() inside the wait_event() call, and have that
abort when conf->log is set to NULL, which happens before
flush_work() is called.
We wait in mddev->sb_wait and ensure this is woken
when any of the conditions change.  This requires
waking mddev->sb_wait in mddev_unlock().  This is only
like to trigger extra wake_ups of threads that needn't
be woken when metadata is being written, and that
doesn't happen often enough that the cost would be
noticeable.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

4d5324f7

md: forbid a RAID5 from having both a bitmap and a journal. · 230b55fa

由 NeilBrown 提交于 10月 17, 2017

Having both a bitmap and a journal is pointless.
Attempting to do so can corrupt the bitmap if the journal
replay happens before the bitmap is initialized.
Rather than try to avoid this corruption, simply
refuse to allow arrays with both a bitmap and a journal.
So:
 - if raid5_run sees both are present, fail.
 - if adding a bitmap finds a journal is present, fail
 - if adding a journal finds a bitmap is present, fail.

Cc: stable@vger.kernel.org (4.10+)
Signed-off-by: NNeilBrown <neilb@suse.com>
Tested-by: NJoshua Kinard <kumba@gentoo.org>
Acked-by: NJoshua Kinard <kumba@gentoo.org>
Signed-off-by: NShaohua Li <shli@fb.com>

230b55fa

17 10月, 2017 1 次提交

md: rename some drivers/md/ files to have an "md-" prefix · 935fe098

由 Mike Snitzer 提交于 10月 10, 2017

Motivated by the desire to illiminate the imprecise nature of
DM-specific patches being unnecessarily sent to both the MD maintainer
and mailing-list.  Which is born out of the fact that DM files also
reside in drivers/md/

Now all MD-specific files in drivers/md/ start with either "raid" or
"md-" and the MAINTAINERS file has been updated accordingly.

Shaohua: don't change module name
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NShaohua Li <shli@fb.com>

935fe098

09 10月, 2017 1 次提交

md: always set THREAD_WAKEUP and wake up wqueue if thread existed · d1d90147

由 Guoqing Jiang 提交于 10月 09, 2017

Since commit 4ad23a97 ("MD: use per-cpu counter for writes_pending"),
the wait_queue is only got invoked if THREAD_WAKEUP is not set previously.

With above change, I can see process_metadata_update could always hang on
the wait queue, because mddev->thread could stay on 'D' status and the
THREAD_WAKEUP flag is not cleared since there are lots of place to wake up
mddev->thread. Then deadlock happened as follows:

linux175:~ # ps aux|grep md|grep D
root    20117   0.0 0.0         0   0 ? D   03:45   0:00 [md0_raid1]
root    20125   0.0 0.0         0   0 ? D   03:45   0:00 [md0_cluster_rec]
linux175:~ # cat /proc/20117/stack
[<ffffffffa0635604>] dlm_lock_sync+0x94/0xd0 [md_cluster]
[<ffffffffa0635674>] lock_token+0x34/0xd0 [md_cluster]
[<ffffffffa0635804>] metadata_update_start+0x64/0x110 [md_cluster]
[<ffffffffa04d985b>] md_update_sb.part.58+0x9b/0x860 [md_mod]
[<ffffffffa04da035>] md_update_sb+0x15/0x30 [md_mod]
[<ffffffffa04dc066>] md_check_recovery+0x266/0x490 [md_mod]
[<ffffffffa06450e2>] raid1d+0x42/0x810 [raid1]
[<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod]
[<ffffffff81091741>] kthread+0x101/0x140
linux175:~ # cat /proc/20125/stack
[<ffffffffa0636679>] recv_daemon+0x3f9/0x5c0 [md_cluster]
[<ffffffffa04d2252>] md_thread+0x122/0x150 [md_mod]
[<ffffffff81091741>] kthread+0x101/0x140

So let's revert the part of code in the commit to resovle the problem since
we can't get lots of benefits of previous change.

Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

d1d90147

06 10月, 2017 1 次提交

md: fix deadlock error in recent patch. · d47c8ad2

由 NeilBrown 提交于 10月 05, 2017

A recent patch aimed to cause md_write_start() to fail (rather than
block) when the mddev was suspending, so as to avoid deadlocks.
Unfortunately the test in wait_event() was wrong, and it didn't change
behaviour at all.

We wait_event() must wait until the metadata is written OR the array is
suspending.

Fixes: cc27b0c7 ("md: fix deadlock between mddev_suspend() and md_write_start()")
Cc: stable@vger.kernel.org
Reported-by: NXiao Ni <xni@redhat.com>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

d47c8ad2

28 9月, 2017 2 次提交

md: fix a race condition for flush request handling · 79bf31a3

由 Shaohua Li 提交于 9月 21, 2017

md_submit_flush_data calls pers->make_request, which missed the suspend check.
Fix it with the new md_handle_request API.
Reported-by: NNate Dailey <nate.dailey@stratus.com>
Tested-by: NNate Dailey <nate.dailey@stratus.com>
Fix: cc27b0c7(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: stable@vger.kernel.org
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

79bf31a3

md: separate request handling · 393debc2

由 Shaohua Li 提交于 9月 21, 2017

With commit cc27b0c7, pers->make_request could bail out without handling
the bio. If that happens, we should retry.  The commit fixes md_make_request
but not other call sites. Separate the request handling part, so other call
sites can use it.
Reported-by: NNate Dailey <nate.dailey@stratus.com>
Fix: cc27b0c7(md: fix deadlock between mddev_suspend() and md_write_start())
Cc: stable@vger.kernel.org
Reviewed-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

393debc2

28 8月, 2017 1 次提交

md: Runtime support for multiple ppls · ddc08823

由 Pawel Baldysiak 提交于 8月 16, 2017

Increase PPL area to 1MB and use it as circular buffer to store PPL. The
entry with highest generation number is the latest one. If PPL to be
written is larger then space left in a buffer, rewind the buffer to the
start (don't wrap it).
Signed-off-by: NPawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

ddc08823

26 8月, 2017 2 次提交

md: replace seq_release_private with seq_release · 26e13043

由 Cihangir Akturk 提交于 7月 29, 2017

Since commit f1514638 ("fs: seq_file - add event counter to simplify
poll() support"), md.c code has been no longer used the private field of
the struct seq_file, but seq_release_private() has been continued to be
used to release the allocated seq_file. This seems to have been
forgotten. So convert it to use seq_release() instead of
seq_release_private().
Signed-off-by: NCihangir Akturk <cakturk@gmail.com>
Signed-off-by: NShaohua Li <shli@fb.com>

26e13043

md: notify about new spare disk in the container · 5492c46e

由 Alexey Obitotskiy 提交于 7月 28, 2017

In case of external metadata arrays spare disks are added to containers
first. mdadm keeps monitoring /proc/mdstat output and when spare disk is
available, it moves it from the container to the array. The problem is
there is no notification of new spare disk in the container and mdadm
waits a long time (until timeout) before it takes the action.
Signed-off-by: NAlexey Obitotskiy <aleksey.obitotskiy@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

5492c46e

24 8月, 2017 1 次提交

block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992

由 Christoph Hellwig 提交于 8月 23, 2017

This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

74d46992

12 8月, 2017 1 次提交

MD: not clear ->safemode for external metadata array · afc1f55c

由 Shaohua Li 提交于 8月 11, 2017

->safemode should be triggered by mdadm for external metadaa array, otherwise
array's state confuses mdadm.

Fixes: 33182d15(md: always clear ->safemode when md_check_recovery gets the mddev lock.)
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

afc1f55c

08 8月, 2017 2 次提交

md: fix test in md_write_start() · 81fe48e9

由 NeilBrown 提交于 8月 08, 2017

md_write_start() needs to clear the in_sync flag is it is set, or if
there might be a race with set_in_sync() such that the later will
set it very soon.  In the later case it is sufficient to take the
spinlock to synchronize with set_in_sync(), and then set the flag
if needed.

The current test is incorrect.
It should be:
  if "flag is set" or "race is possible"

"flag is set" is trivially "mddev->in_sync".
"race is possible" should be tested by "mddev->sync_checkers".

If sync_checkers is 0, then there can be no race.  set_in_sync() will
wait in percpu_ref_switch_to_atomic_sync() for an RCU grace period,
and as md_write_start() holds the rcu_read_lock(), set_in_sync() will
be sure ot see the update to writes_pending.

If sync_checkers is > 0, there could be race.  If md_write_start()
happened entirely between
		if (!mddev->in_sync &&
		    percpu_ref_is_zero(&mddev->writes_pending)) {
and
			mddev->in_sync = 1;
in set_in_sync(), then it would not see that is_sync had been set,
and set_in_sync() would not see that writes_pending had been
incremented.

This bug means that in_sync is sometimes not set when it should be.
Consequently there is a small chance that the array will be marked as
"clean" when in fact it is inconsistent.

Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
cc: stable@vger.kernel.org (v4.12+)
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

81fe48e9

md: always clear ->safemode when md_check_recovery gets the mddev lock. · 33182d15

由 NeilBrown 提交于 8月 08, 2017

If ->safemode == 1, md_check_recovery() will try to get the mddev lock
and perform various other checks.
If mddev->in_sync is zero, it will call set_in_sync, and clear
->safemode.  However if mddev->in_sync is not zero, ->safemode will not
be cleared.

When md_check_recovery() drops the mddev lock, the thread is woken
up again.  Normally it would just check if there was anything else to
do, find nothing, and go to sleep.  However as ->safemode was not
cleared, it will take the mddev lock again, then wake itself up
when unlocking.

This results in an infinite loop, repeatedly calling
md_check_recovery(), which RCU or the soft-lockup detector
will eventually complain about.

Prior to commit 4ad23a97 ("MD: use per-cpu counter for
writes_pending"), safemode would only be set to one when the
writes_pending counter reached zero, and would be cleared again
when writes_pending is incremented.  Since that patch, safemode
is set more freely, but is not reliably cleared.

So in md_check_recovery() clear ->safemode before checking ->in_sync.

Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
Cc: stable@vger.kernel.org (4.12+)
Reported-by: NDominik Brodowski <linux@dominikbrodowski.net>
Reported-by: NDavid R <david@unsolicited.net>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

33182d15

26 7月, 2017 1 次提交

MD: fix warnning for UP case · ed9b66d2

由 Shaohua Li 提交于 7月 25, 2017

spin_is_locked always returns 0 for UP case, so ignores it
Reported-by: NJoshua Kinard <kumba@gentoo.org>
Signed-off-by: NShaohua Li <shli@fb.com>

ed9b66d2

04 7月, 2017 1 次提交

MD: fix sleep in atomic · 7184ef8b

由 Shaohua Li 提交于 7月 03, 2017

bioset_free() will take a mutex, so can't get called with spinlock hold.

Fix: 5a85071c(md: use a separate bio_set for synchronous IO.)
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

7184ef8b

24 6月, 2017 1 次提交

MD: fix a null dereference · 7f053a6a

由 Shaohua Li 提交于 6月 23, 2017

rdev->mddev could be null in start time.
Reported-by: NMing Lei <ming.lei@redhat.com>
Fix: 5a85071c(md: use a separate bio_set for synchronous IO.)
Cc: NeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

7f053a6a

22 6月, 2017 1 次提交

md: use a separate bio_set for synchronous IO. · 5a85071c

由 NeilBrown 提交于 6月 21, 2017

md devices allocate a bio_set and use it for two
distinct purposes.
mddev->bio_set is used to clone bios as part of sending
upper level requests down to lower level devices,
and it is also use for synchronous IO such as superblock
and bitmap updates, and for correcting read errors.

This multiple usage can lead to deadlocks.  It is likely
that cloned bios might be queued for write and to be
waiting for a metadata update before the write can be permitted.
If the cloning exhausted mddev->bio_set, the metadata update
may not be able to proceed.

This scenario has been seen during heavy testing, with lots of IO and
lots of memory pressure.

Address this by adding a new bio_set specifically for synchronous IO.
All synchronous IO goes directly to the underlying device and is not
queued at the md level, so request using entries from the new
mddev->sync_set will complete in a timely fashion.
Requests that use mddev->bio_set will sometimes need to wait
for synchronous IO, but will no longer risk deadlocking that iO.

Also: small simplification in mddev_put(): there is no need to
wait until the spinlock is released before calling bioset_free().
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

5a85071c

19 6月, 2017 3 次提交

block: remove bio_clone() and all references. · 9b10f6a9

由 NeilBrown 提交于 6月 18, 2017

bio_clone() is no longer used.
Only bio_clone_bioset() or bio_clone_fast().
This is for the best, as bio_clone() used fs_bio_set,
and filesystems are unlikely to want to use bio_clone().

So remove bio_clone() and all references.
This includes a fix to some incorrect documentation.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9b10f6a9

blk: replace bioset_create_nobvec() with a flags arg to bioset_create() · 011067b0

由 NeilBrown 提交于 6月 18, 2017

"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().

To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().

Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().
Suggested-by: NChristoph Hellwig <hch@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@infradead.org>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

011067b0

blk: remove bio_set arg from blk_queue_split() · af67c31f

由 NeilBrown 提交于 6月 18, 2017

blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.

Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.

This is inconsistent and unnecessary.  Remove the last arg and always use
q->bio_split inside blk_queue_split()
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed)
Reviewed-by: NJavier González <javier@cnexlabs.com>
Tested-by: NJavier González <javier@cnexlabs.com>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

af67c31f

17 6月, 2017 1 次提交

md: change the initialization value for a spare device spot to MD_DISK_ROLE_SPARE · 8df72024

由 Lidong Zhong 提交于 6月 12, 2017

The value for spare spot of sb->dev_roles is changed from
MD_DISK_ROLE_FAULTY to MD_DISK_ROLE_SPARE to keep align
with the value when the superblock is firstly created in
userspace.
Signed-off-by: NLidong Zhong <lzhong@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

8df72024

14 6月, 2017 1 次提交

md: fix deadlock between mddev_suspend() and md_write_start() · cc27b0c7

由 NeilBrown 提交于 6月 05, 2017

If mddev_suspend() races with md_write_start() we can deadlock
with mddev_suspend() waiting for the request that is currently
in md_write_start() to complete the ->make_request() call,
and md_write_start() waiting for the metadata to be updated
to mark the array as 'dirty'.
As metadata updates done by md_check_recovery() only happen then
the mddev_lock() can be claimed, and as mddev_suspend() is often
called with the lock held, these threads wait indefinitely for each
other.

We fix this by having md_write_start() abort if mddev_suspend()
is happening, and ->make_request() aborts if md_write_start()
aborted.
md_make_request() can detect this abort, decrease the ->active_io
count, and wait for mddev_suspend().
Reported-by: NNix <nix@esperi.org.uk>
Fix: 68866e42(MD: no sync IO while suspended)
Cc: stable@vger.kernel.org
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

cc27b0c7

09 6月, 2017 1 次提交

block: switch bios to blk_status_t · 4e4cbee9

由 Christoph Hellwig 提交于 6月 03, 2017

Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

4e4cbee9

06 6月, 2017 1 次提交

md: initialise ->writes_pending in personality modules. · a415c0f1

由 NeilBrown 提交于 6月 05, 2017

The new per-cpu counter for writes_pending is initialised in
md_alloc(), which is not called by dm-raid.
So dm-raid fails when md_write_start() is called.

Move the initialization to the personality modules
that need it.  This way it is always initialised when needed,
but isn't unnecessarily initialized (requiring memory allocation)
when the personality doesn't use writes_pending.
Reported-by: NHeinz Mauelshagen <heinzm@redhat.com>
Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

a415c0f1

05 6月, 2017 1 次提交

md: namespace private helper names · e6fd2093

由 Amir Goldstein 提交于 5月 04, 2017

The md private helper uuid_equal() collides with a generic helper
of the same name.

Rename the md private helper to md_uuid_equal() and do the same for
md_sb_equal().
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NShaohua Li <shli@kernel.org>
Reviewed-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>

e6fd2093

01 6月, 2017 1 次提交

md: Make flush bios explicitely sync · 5a8948f8

由 Jan Kara 提交于 5月 31, 2017

Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

CC: linux-raid@vger.kernel.org
CC: Shaohua Li <shli@kernel.org>
Fixes: b685d3d6
CC: stable@vger.kernel.org
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NShaohua Li <shli@fb.com>

5a8948f8

09 5月, 2017 1 次提交

md: don't return -EAGAIN in md_allow_write for external metadata arrays · 2214c260

由 Artur Paszkiewicz 提交于 5月 08, 2017

This essentially reverts commit b5470dc5 ("md: resolve external
metadata handling deadlock in md_allow_write") with some adjustments.

Since commit 6791875e ("md: make reconfig_mutex optional for writes
to md sysfs files.") changing array_state to 'active' does not use
mddev_lock() and will not cause a deadlock with md_allow_write(). This
revert simplifies userspace tools that write to sysfs attributes like
"stripe_cache_size" or "consistency_policy" because it removes the need
for special handling for external metadata arrays, checking for EAGAIN
and retrying the write.
Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NShaohua Li <shli@fb.com>

2214c260

21 4月, 2017 1 次提交

md: handle read-only member devices better. · 97b20ef7

由 NeilBrown 提交于 4月 13, 2017

1/ If an array has any read-only devices when it is started,
   the array itself must be read-only
2/ A read-only device cannot be added to an array after it is
   started.
3/ Setting an array to read-write should not succeed
   if any member devices are read-only
Reported-and-Tested-by: NNanda Kishore Chinnaram <Nanda_Kishore_Chinna@dell.com>
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

97b20ef7

13 4月, 2017 2 次提交

md: support disabling of create-on-open semantics. · 78b6350d

由 NeilBrown 提交于 4月 12, 2017

md allows a new array device to be created by simply
opening a device file.  This make it difficult to
remove the device and udev is likely to open the device file
as part of processing the REMOVE event.

There is an alternate mechanism for creating arrays
by writing to the new_array module parameter.
When using tools that work with this parameter, it is
best to disable the old semantics.
This new module parameter allows that.
Signed-off-by: NNeilBrown <neilb@suse.com>
Acted-by: NColy Li <colyli@suse.de>
Signed-off-by: NShaohua Li <shli@fb.com>

78b6350d

md: allow creation of mdNNN arrays via md_mod/parameters/new_array · 039b7225

由 NeilBrown 提交于 4月 12, 2017

The intention when creating the "new_array" parameter and the
possibility of having array names line "md_HOME" was to transition
away from the old way of creating arrays and to eventually only use
this new way.

The "old" way of creating array is to create a device node in /dev
and then open it.  The act of opening creates the array.
This is problematic because sometimes the device node can be opened
when we don't want to create an array.  This can easily happen
when some rule triggered by udev looks at a device as it is being
destroyed.  The node in /dev continues to exist for a short period
after an array is stopped, and opening it during this time recreates
the array (as an inactive array).

Unfortunately no clear plan for the transition was created.  It is now
time to fix that.

This patch allows devices with numeric names, like "md999" to be
created by writing to "new_array".  This will only work if the minor
number given is not already in use.  This will allow mdadm to
support the creation of arrays with numbers > 511 (currently not
possible) by writing to new_array.
mdadm can, at some point, use this approach to create *all* arrays,
which will allow the transition to only using the new-way.
Signed-off-by: NNeilBrown <neilb@suse.com>
Acted-by: NColy Li <colyli@suse.de>
Signed-off-by: NShaohua Li <shli@fb.com>

039b7225

11 4月, 2017 2 次提交

md.c:didn't unlock the mddev before return EINVAL in array_size_store · b670883b

由 Zhilong Liu 提交于 4月 10, 2017

md.c: it needs to release the mddev lock before
the array_size_store() returns.

Fixes: ab5a98b1 ("md-cluster: change array_sectors and update size are not supported")
Signed-off-by: NZhilong Liu <zlliu@suse.com>
Reviewed-by: NGuoqing Jiang <gqjiang@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

b670883b

md: MD_CLOSING needs to be cleared after called md_set_readonly or do_md_stop · 065e519e

由 NeilBrown 提交于 4月 06, 2017

if called md_set_readonly and set MD_CLOSING bit, the mddev cannot
be opened any more due to the MD_CLOING bit wasn't cleared. Thus it
needs to be cleared in md_ioctl after any call to md_set_readonly()
or do_md_stop().
Signed-off-by: NNeilBrown <neilb@suse.com>
Fixes: af8d8e6f ("md: changes for MD_STILL_CLOSED flag")
Cc: stable@vger.kernel.org (v4.9+)
Signed-off-by: NZhilong Liu <zlliu@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

065e519e

23 3月, 2017 2 次提交

MD: use per-cpu counter for writes_pending · 4ad23a97

由 NeilBrown 提交于 3月 15, 2017

The 'writes_pending' counter is used to determine when the
array is stable so that it can be marked in the superblock
as "Clean".  Consequently it needs to be updated frequently
but only checked for zero occasionally.  Recent changes to
raid5 cause the count to be updated even more often - once
per 4K rather than once per bio.  This provided
justification for making the updates more efficient.

So we replace the atomic counter a percpu-refcount.
This can be incremented and decremented cheaply most of the
time, and can be switched to "atomic" mode when more
precise counting is needed.  As it is possible for multiple
threads to want a precise count, we introduce a
"sync_checker" counter to count the number of threads
in "set_in_sync()", and only switch the refcount back
to percpu mode when that is zero.

We need to be careful about races between set_in_sync()
setting ->in_sync to 1, and md_write_start() setting it
to zero.  md_write_start() holds the rcu_read_lock()
while checking if the refcount is in percpu mode.  If
it is, then we know a switch to 'atomic' will not happen until
after we call rcu_read_unlock(), in which case set_in_sync()
will see the elevated count, and not set in_sync to 1.
If it is not in percpu mode, we take the mddev->lock to
ensure proper synchronization.

It is no longer possible to quickly check if the count is zero, which
we previously did to update a timer or to schedule the md_thread.
So now we do these every time we decrement that counter, but make
sure they are fast.

mod_timer() already optimizes the case where the timeout value doesn't
actually change.  We leverage that further by always rounding off the
jiffies to the timeout value.  This may delay the marking of 'clean'
slightly, but ensure we only perform atomic operation here when absolutely
needed.

md_wakeup_thread() current always calls wake_up(), even if
THREAD_WAKEUP is already set.  That too can be optimised to avoid
calls to wake_up().
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

4ad23a97

md: close a race with setting mddev->in_sync · 55cc39f3

由 NeilBrown 提交于 3月 15, 2017

If ->in_sync is being set just as md_write_start() is being called,
it is possible that set_in_sync() won't see the elevated
->writes_pending, and md_write_start() won't see the set ->in_sync.

To close this race, re-test ->writes_pending after setting ->in_sync,
and add memory barriers to ensure the increment of ->writes_pending
will be seen by the time of this second test, or the new ->in_sync
will be seen by md_write_start().

Add a spinlock to array_state_show() to ensure this temporary
instability is never visible from userspace.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NShaohua Li <shli@fb.com>

55cc39f3

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功