Commits · ea0213e0c7cc1c1b52badf27bd7db4f50a67baaa · openanolis / cloud-kernel

17 Mar, 2017 3 commits

md: superblock changes for PPL · ea0213e0

Authored by Artur Paszkiewicz on Mar 09, 2017

Include information about PPL location and size into mdp_superblock_1
and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
'feature_map', analogously to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
to mddev->flags to indicate that PPL is enabled on an array.
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>

ea0213e0

md-cluster: add the support for resize · 818da59f

Authored by Guoqing Jiang on Mar 01, 2017

To update the size of a clustered raid, we need to make
sure all nodes can perform the change successfully.
However, some of them may be unable to do so because
of a failure (bitmap_resize could fail). So we need
to account for this before we set the capacity
unconditionally, and we use the steps below to
perform a sanity check.

1. A changes the size, then broadcasts a METADATA_UPDATED
   msg.
2. B and C receive METADATA_UPDATED and change the size,
   except that they do not call set_capacity; sync_size is
   not updated if the change failed. They also call
   bitmap_update_sb to sync the sb to disk.
3. A checks the other nodes' sync_size. If sync_size has
   been updated on all nodes, A sends a CHANGE_CAPACITY
   msg; otherwise it sends a msg to revert the previous change.
4. B and C call set_capacity if they receive a CHANGE_CAPACITY
   msg; otherwise pers->resize is called to restore
   the old value.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

818da59f
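
The four resize steps in 818da59f resemble a two-phase update. The following is a minimal userspace sketch of that flow, not the md-cluster code: `struct demo_node`, its fields and both functions are hypothetical names, messaging is replaced by direct function calls, and all nodes are assumed to start at the same size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node state; field names are illustrative only. */
struct demo_node {
    uint64_t size;       /* size the node currently uses */
    uint64_t sync_size;  /* size the node has recorded as synced */
    uint64_t capacity;   /* capacity exposed to users (set_capacity) */
    bool resize_fails;   /* simulates a failing bitmap_resize */
};

/* Step 2: a receiving node changes its size but not its capacity.
 * sync_size is left untouched when the change fails. */
void demo_on_metadata_updated(struct demo_node *n, uint64_t new_size)
{
    if (n->resize_fails)
        return;
    n->size = new_size;
    n->sync_size = new_size;
}

/* Steps 1, 3 and 4 driven from the initiating node, nodes[0].
 * Returns true when every node accepted the new size. */
bool demo_cluster_resize(struct demo_node *nodes, size_t count,
                         uint64_t new_size)
{
    uint64_t old_size = nodes[0].size;
    bool all_synced = true;

    /* Step 1: the initiator changes its size and "broadcasts". */
    nodes[0].size = new_size;
    nodes[0].sync_size = new_size;
    for (size_t i = 1; i < count; i++)
        demo_on_metadata_updated(&nodes[i], new_size);

    /* Step 3: the initiator checks every other node's sync_size. */
    for (size_t i = 1; i < count; i++)
        if (nodes[i].sync_size != new_size)
            all_synced = false;

    /* Step 4: commit the capacity everywhere, or revert everywhere. */
    for (size_t i = 0; i < count; i++) {
        if (all_synced) {
            nodes[i].capacity = new_size;
        } else {
            nodes[i].size = old_size;
            nodes[i].sync_size = old_size;
        }
    }
    return all_synced;
}
```

With three nodes where one has `resize_fails` set, the function returns false and every node keeps its old size and capacity, which is the outcome the commit message aims for.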

md-cluster: use sync way to handle METADATA_UPDATED msg · 0ba95977

Authored by Guoqing Jiang on Mar 01, 2017

Previously, when a node received a METADATA_UPDATED msg, it just
needed to wake up mddev->thread, and md_reload_sb would be called
eventually.

We took the asynchronous approach to avoid a deadlock, which
could happen when a node is receiving the METADATA_UPDATED msg
(and wants reconfig_mutex) while also trying to run
the path:

md_check_recovery -> mddev_trylock(hold reconfig_mutex)
                  -> md_update_sb-metadata_update_start
		     (want EX on token however token is
		      got by the sending node)

Since we will support resizing for clustered raid, we need
the metadata update handling to be synchronous so that
the initiating node can detect failure. So we need to change
the way the METADATA_UPDATED msg is handled.

But we obviously need to avoid the above deadlock with the
synchronous approach. To make this happen, we allow md_reload_sb
to be called without holding reconfig_mutex in one case: if some
other thread has already taken reconfig_mutex and is waiting for
the 'token', then process_recvd_msg() can safely call md_reload_sb()
without taking the mutex. This is because we can be certain
that no other thread will take the mutex, and we are also certain
that the actions performed by md_reload_sb() won't interfere
with anything that the other thread is in the middle of.

To make this more concrete, we added a new cinfo->state bit
        MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD

which is set in lock_token() just before dlm_lock_sync() is
called, and cleared just after. As lock_token() is always
called with reconfig_mutex held (the one exception is
resync_info_update, which is distinguished in the previous
patch), if process_recvd_msg() finds that the new bit is set,
then the mutex must be held by some other thread, and that
thread will keep waiting.

So process_metadata_update() can call md_reload_sb() if either
mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
is set. The tricky bit is what to do if neither of these applies.
We need to wait. Fortunately mddev_unlock() always calls wake_up()
on mddev->thread->wqueue. So we can get lock_token() to call
wake_up() on that when it sets the bit.

There are also some related changes inside this commit:
1. remove the RELOAD_SB related code since it is not valid anymore.
2. mddev is added into md_cluster_info so that we can get mddev inside
   lock_token.
3. add a new parameter to lock_token to distinguish whether
   reconfig_mutex is held or not.

We also need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in the cases below:
1. set it before unregistering the thread, otherwise a deadlock could
   appear when stopping a resyncing array.
   This is because md_unregister_thread(&cinfo->recv_thread) is
   blocked by recv_daemon -> process_recvd_msg
			  -> process_metadata_update.
   To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
   also needs to be set before unregistering the thread.
2. set it in metadata_update_start to fix another deadlock.
	a. Node A sends METADATA_UPDATED msg (held Token lock).
	b. Node B wants to do resync, and is blocked since it can't
	   get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
	   not set since the callchain
	   (md_do_sync -> sync_request
        	       -> resync_info_update
		       -> sendmsg
		       -> lock_comm -> lock_token)
	   doesn't hold reconfig_mutex.
	c. Node B tries to update sb (holding reconfig_mutex), but stops
	   at wait_event() in metadata_update_start since we have set
	   the MD_CLUSTER_SEND_LOCK flag in lock_comm (step b).
	d. Then Node B receives the METADATA_UPDATED msg from A, and
	   recv_daemon is blocked forever.
   Since metadata_update_start always calls lock_token with reconfig_mutex held,
   we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
   lock_token doesn't need to set it twice unless lock_token is invoked from
   lock_comm.

Finally, thanks to Neil for his great idea and help!
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

0ba95977
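
The rule that 0ba95977 describes for process_metadata_update() can be summarised as a three-way decision. The sketch below models only that decision in plain C; the enum, the function name and the boolean inputs are hypothetical, and the real code waits on a wait queue rather than returning a value.

```c
#include <stdbool.h>

/* Hypothetical outcome of one evaluation of the wait condition. */
enum demo_action {
    DEMO_RELOAD_WITH_MUTEX,    /* mddev_trylock() succeeded */
    DEMO_RELOAD_WITHOUT_MUTEX, /* the mutex holder is parked waiting for the token */
    DEMO_WAIT                  /* neither applies: sleep until woken up */
};

/* trylock_ok stands in for the result of mddev_trylock();
 * holding_bit stands in for MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD. */
enum demo_action demo_process_metadata_update(bool trylock_ok, bool holding_bit)
{
    if (trylock_ok)
        return DEMO_RELOAD_WITH_MUTEX;
    if (holding_bit)
        return DEMO_RELOAD_WITHOUT_MUTEX;
    return DEMO_WAIT;
}
```

The `DEMO_WAIT` case is the one the commit message calls tricky: it relies on a wake-up arriving either from mddev_unlock() or from lock_token() when the bit is set.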

11 Mar, 2017 2 commits

md: fix incorrect use of lexx_to_cpu in does_sb_need_changing · 13459213

Authored by Jason Yan on Mar 10, 2017

The sb->layout is of type __le32, so we should use le32_to_cpu.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>

13459213
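
To show what a 32-bit little-endian conversion such as the one in 13459213 amounts to, here is a minimal userspace sketch. `demo_le32_to_cpu` is a hypothetical stand-in, not the kernel's le32_to_cpu helper; it decodes the field byte by byte so the result does not depend on the host's byte order.

```c
#include <stdint.h>

/* Hypothetical stand-in for le32_to_cpu(): decode a 32-bit
 * little-endian on-disk field into a host-order value. */
uint32_t demo_le32_to_cpu(const uint8_t raw[4])
{
    return (uint32_t)raw[0]
         | ((uint32_t)raw[1] << 8)
         | ((uint32_t)raw[2] << 16)
         | ((uint32_t)raw[3] << 24);
}
```

Using a helper of the wrong width on a field like this would read bytes that do not belong to it, or drop bytes that do, on at least some hosts.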

md: fix super_offset endianness in super_1_rdev_size_change · 3fb632e4

Authored by Jason Yan on Mar 10, 2017

The sb->super_offset should be little-endian, but the rdev->sb_start is in
host byte order, so fix this by adding cpu_to_le64.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Shaohua Li <shli@fb.com>

3fb632e4
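
The opposite direction, storing a host-order value into an on-disk little-endian field as in 3fb632e4, can be sketched the same way. `demo_put_le64` is a hypothetical userspace helper and only illustrates the byte layout that a conversion like cpu_to_le64 produces.

```c
#include <stdint.h>

/* Hypothetical stand-in for storing cpu_to_le64(host_value):
 * write the value least significant byte first. */
void demo_put_le64(uint8_t out[8], uint64_t host_value)
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint8_t)(host_value >> (8 * i));
}
```

On a little-endian host a plain assignment happens to give the same bytes, which is one reason a missing conversion like this can go unnoticed until the code runs on a big-endian machine.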

10 Mar, 2017 3 commits

md: don't impose the MD_SB_DISKS limit on arrays without metadata. · 1b3bae49

Authored by NeilBrown on Mar 01, 2017

These arrays, created with "mdadm --build", don't benefit from a limit.
The default will be used, which is '0' and is interpreted as "don't
impose a limit".

Reported-by: ian_bruce@mail.ru
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

1b3bae49
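
The "0 means no limit" convention from 1b3bae49 can be sketched as a small predicate. The function and parameter names below are hypothetical; only the interpretation of zero is taken from the commit message.

```c
#include <stdbool.h>

/* Returns true when a device slot number is acceptable.
 * max_disks == 0 is read as "don't impose a limit". */
bool demo_slot_allowed(int max_disks, int desc_nr)
{
    if (max_disks == 0)
        return true;
    return desc_nr < max_disks;
}
```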

md: move funcs from pers->resize to update_size · c9483634

Authored by Guoqing Jiang on Feb 24, 2017

raid1_resize and raid5_resize should also check
mddev->queue when run underneath dm-raid.

Both set_capacity and revalidate_disk are called in the
pers->resize implementations of raid1, raid10 and raid5,
so move them from the personality files to common code.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

c9483634

md: delete dead code · 99b3d74e

Authored by Shaohua Li on Feb 23, 2017

Nobody is using mddev_check_plugged(), so delete the dead code.
Signed-off-by: Shaohua Li <shli@fb.com>

99b3d74e

02 Mar, 2017 1 commit

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h> · 3f07c014

Authored by Ingo Molnar on Feb 08, 2017

We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

3f07c014

16 Feb, 2017 3 commits

md: fast clone bio in bio_clone_mddev() · d7a10308

Authored by Ming Lei on Feb 14, 2017

Firstly, bio_clone_mddev() is used in the raid normal I/O path and not
in the resync I/O path.

Secondly, all direct access to the bvec table in raid happens in
resync I/O, except for write behind in raid1, in which we still
use bio_clone() for allocating a new bvec table.

So this patch replaces bio_clone() with bio_clone_fast()
in bio_clone_mddev().

Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
suggested by Christoph Hellwig.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>

d7a10308

md: remove unnecessary check on mddev · ed7ef732

Authored by Ming Lei on Feb 14, 2017

mddev is never NULL and neither is ->bio_set, so
remove the check.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>

ed7ef732

md: fail if mddev->bio_set can't be created · 10273170

Authored by Ming Lei on Feb 14, 2017

The current behaviour is to fall back to allocating
bios from 'fs_bio_set', which isn't correct
because it might cause a deadlock.

So this patch simply returns failure if mddev->bio_set
can't be created.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>

10273170

14 Feb, 2017 1 commit

md: ensure md devices are freed before module is unloaded. · 9356863c

Authored by NeilBrown on Feb 06, 2017

Commit: cbd19983 ("md: Fix unfortunate interaction with evms")
changed mddev_put() so that it would not destroy an md device while
->ctime was non-zero.

Unfortunately, we didn't make sure to clear ->ctime when unloading
the module, so it is possible for an md device to remain after
module unload.  An attempt to open such a device will trigger
an invalid memory reference in:
  get_gendisk -> kobj_lookup -> exact_lock -> get_disk

when trying to access disk->fops, which was in the module that has
been removed.

So ensure we clear ->ctime in md_exit(), and explain how that is useful,
as it isn't immediately obvious when looking at the code.

Fixes: cbd19983 ("md: Fix unfortunate interaction with evms")
Tested-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

9356863c

02 Feb, 2017 1 commit

block: Use pointer to backing_dev_info from request_queue · dc3b17cc

Authored by Jan Kara on Feb 02, 2017

We will want to have struct backing_dev_info allocated separately from
struct request_queue. As a first step, add a pointer to backing_dev_info
to request_queue and convert all users touching it. No functional
changes in this patch.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>

dc3b17cc

25 Jan, 2017 1 commit

md/r5cache: flush data only stripes in r5l_recovery_log() · a85dd7b8

Authored by Song Liu on Jan 23, 2017

For safer operation, all arrays start in write-through mode, which has been
better tested and is more mature. And actually the write-through/write-back
mode isn't persistent after the array is restarted, so we always start the
array in write-through mode. However, if recovery found data-only stripes before the
shutdown (from previous write-back mode), it is not safe to start the array in
write-through mode, as write-through mode cannot handle stripes with data in
the write-back cache. To solve this problem, we flush all data-only stripes in
r5l_recovery_log(). When r5l_recovery_log() returns, the array starts with
an empty cache in write-through mode.

This logic is implemented in r5c_recovery_flush_data_only_stripes():

1. enable write back cache
2. flush all stripes
3. wake up conf->mddev->thread
4. wait for all stripes to get flushed (reuse wait_for_quiescent)
5. disable write back cache

The wait in 4 will be woken up in release_inactive_stripe_list()
when conf->active_stripes reaches 0.

It is safe to wake up mddev->thread here because all the resources
required for the thread have been initialized.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>

a85dd7b8

09 Dec, 2016 2 commits

md: separate flags for superblock changes · 2953079c

Authored by Shaohua Li on Dec 08, 2016

The mddev->flags are used for different purposes. There are a lot of
places where we check/change the flags without masking unrelated flags, so
we could check/change unrelated flags. These usages are mostly for superblock
writes, so separate the superblock-related flags. This should make the code
clearer and also fix real bugs.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

2953079c

md: MD_RECOVERY_NEEDED is set for mddev->recovery · 82a301cb

Authored by Shaohua Li on Dec 08, 2016

Fixes: 90f5f7ad ("md: Wait for md_check_recovery before attempting device removal.")
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

82a301cb

06 Dec, 2016 1 commit

md: fix refcount problem on mddev when stopping array. · e2342ca8

Authored by NeilBrown on Dec 05, 2016

md_open() gets a counted reference on an mddev using mddev_find().
If it ends up returning an error, it must drop this reference.

There are two error paths where the reference is not dropped.
One only happens if the process is signalled at an awkward time,
which is quite unlikely.
The other was introduced recently in commit af8d8e6f.

Change the code to ensure we drop the reference when returning an error,
and make it harder to re-introduce this sort of bug in the future.
Reported-by: Marc Smith <marc.smith@mcc.edu>
Fixes: af8d8e6f ("md: changes for MD_STILL_CLOSED flag")
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

e2342ca8
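
One common way to make a leak like the one fixed in e2342ca8 harder to re-introduce is to route every error path through a single exit label that drops the reference. The sketch below illustrates that pattern only; it is not the actual md_open() code, and `struct demo_mddev`, the helpers and the error codes chosen here are assumptions made for the example.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, much reduced stand-in for struct mddev. */
struct demo_mddev {
    int refcount;
    bool closing;
};

void demo_get(struct demo_mddev *m) { m->refcount++; }
void demo_put(struct demo_mddev *m) { m->refcount--; }

/* Takes a counted reference and keeps it only on success. */
int demo_open(struct demo_mddev *m, bool interrupted)
{
    int err = 0;

    demo_get(m);

    if (interrupted) {          /* e.g. signalled while taking a lock */
        err = -EINTR;
        goto out;
    }
    if (m->closing) {           /* array is being stopped */
        err = -ENODEV;
        goto out;
    }

out:
    if (err)
        demo_put(m);            /* single place that drops the reference */
    return err;
}
```

Because the drop happens in one place, a newly added error path only needs to set `err` and jump to `out`.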

24 Nov, 2016 2 commits

md: stop write should stop journal reclaim · 034e33f5

Authored by Shaohua Li on Nov 21, 2016

__md_stop_writes currently doesn't stop the raid5-cache reclaim thread. It's
possible the reclaim thread is still running and doing writes, which
doesn't match what __md_stop_writes should do. The extra ->quiesce()
call should not harm any raid types. For raid5-cache, this will
guarantee we reclaim all caches before we update the superblock.
Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Cc: Song Liu <songliubraving@fb.com>

034e33f5

raid5-cache: suspend reclaim thread instead of shutdown · ce1ccd07

Authored by Shaohua Li on Nov 21, 2016

There is a mechanism to suspend a kernel thread. Use it instead of playing
the create/destroy game.
Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Cc: Song Liu <songliubraving@fb.com>

ce1ccd07

23 Nov, 2016 2 commits

md: Use REQ_FAILFAST_* on metadata writes where appropriate · 46533ff7

Authored by NeilBrown on Nov 18, 2016

This can only be supported on personalities which ensure
that md_error() never causes an array to enter the 'failed'
state.  i.e. if marking a device Faulty would cause some
data to be inaccessible, the device's status is left as
non-Faulty.  This is true for RAID1 and RAID10.

If we get a failure writing metadata but the device doesn't
fail, it must be the last device so we re-write without
FAILFAST to improve the chance of success.  We also flag the
device as LastDev so that future metadata updates don't
waste time on failfast writes.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

46533ff7
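
The retry rule described in 46533ff7 can be modelled in a few lines. This is a sketch under stated assumptions, not the md implementation: `struct demo_rdev`, the callback type, and the `can_fail_device` parameter (standing in for whether md_error() would actually mark the device Faulty) are all hypothetical.

```c
#include <stdbool.h>

/* Hypothetical per-device state. */
struct demo_rdev {
    bool failfast;  /* device permits failfast requests */
    bool last_dev;  /* LastDev: skip failfast for future metadata writes */
    bool faulty;
};

/* Hypothetical write callback: returns true when the write succeeds. */
typedef bool (*demo_write_fn)(bool use_failfast);

/* Example medium that rejects failfast writes but accepts normal ones. */
bool demo_flaky_write(bool use_failfast)
{
    return !use_failfast;
}

/* Write metadata, first with failfast if allowed. If that fails and the
 * device cannot be failed, mark it LastDev and retry without failfast. */
bool demo_write_metadata(struct demo_rdev *rdev, demo_write_fn write,
                         bool can_fail_device)
{
    bool use_failfast = rdev->failfast && !rdev->last_dev;

    if (write(use_failfast))
        return true;
    if (!use_failfast)
        return false;           /* a normal write failed: nothing to retry */

    if (can_fail_device) {
        rdev->faulty = true;    /* array stays functional without it */
        return false;
    }

    rdev->last_dev = true;      /* must be the last device */
    return write(false);
}
```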

md/failfast: add failfast flag for md to be used by some personalities. · 688834e6

Authored by NeilBrown on Nov 18, 2016

This patch just adds a 'failfast' per-device flag which can be stored
in v0.90 or v1.x metadata.
The flag is not used yet but the intent is that it can be used for
mirrored (raid1/raid10) arrays where low latency is more important
than keeping all devices on-line.

Setting the flag for a device effectively gives permission for that
device to be marked as Faulty and excluded from the array on the first
error.  The underlying driver will be directed not to retry requests
that result in failures.  There is a proviso that the device must not
be marked faulty if that would cause the array as a whole to fail, it
may only be marked Faulty if the array remains functional, but is
degraded.

Failures on read requests will cause the device to be marked
as Faulty immediately so that further reads will avoid that
device.  No attempt will be made to correct read errors by
over-writing with the correct data.

It is expected that if transient errors, such as cable unplug, are
possible, then something in user-space will revalidate failed
devices and re-add them when they appear to be working again.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

688834e6

19 Nov, 2016 1 commit

md: add blktrace event for writes to superblock · 504634f6

Authored by Shaohua Li on Nov 18, 2016

A superblock write is an expensive operation. With raid5-cache, it can be called
regularly. Add tracing to help with performance debugging.
Signed-off-by: Shaohua Li <shli@fb.com>
Cc: NeilBrown <neilb@suse.com>

504634f6

10 Nov, 2016 1 commit

md: remove md_super_wait() call after bitmap_flush() · 6119e679

Authored by NeilBrown on Nov 09, 2016

bitmap_flush() finishes with bitmap_update_sb(), and that finishes
with write_page(..., 1), so write_page() will wait for all writes
to complete.  So there is no point calling md_super_wait()
immediately afterwards.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

6119e679

08 Nov, 2016 6 commits

md: perform async updates for metadata where possible. · 060b0689

Authored by NeilBrown on Nov 04, 2016

When adding devices to, or removing devices from, an array we need to
update the metadata.  However we don't need to do it synchronously as
data integrity doesn't depend on these changes being recorded
instantly.  So avoid the synchronous call to md_update_sb and just set
a flag so that the thread will do it.

This can reduce the number of updates performed when lots of devices
are being added or removed.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

060b0689
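
The saving that 060b0689 describes comes from coalescing: many device changes set one flag, and the thread performs one write. A minimal sketch, with hypothetical names (`struct demo_array`, `sb_change_pending`) that are not the md flag names:

```c
/* Hypothetical array state for illustration. */
struct demo_array {
    int sb_change_pending;  /* stands in for a "metadata is stale" flag */
    int sb_writes;          /* number of superblock writes performed */
};

/* Hot add/remove path: record that metadata is stale, do not write now. */
void demo_device_changed(struct demo_array *a)
{
    a->sb_change_pending = 1;
}

/* Array thread: one write covers every change recorded since the last run. */
void demo_thread_run(struct demo_array *a)
{
    if (a->sb_change_pending) {
        a->sb_change_pending = 0;
        a->sb_writes++;
    }
}
```

Five calls to `demo_device_changed()` followed by one `demo_thread_run()` result in a single write, where a synchronous update per change would have produced five.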

md: change all printk() to pr_err() or pr_warn() etc. · 9d48739e

Authored by NeilBrown on Nov 02, 2016

1/ using pr_debug() for a number of messages reduces the noise of
   md, but still allows them to be enabled when needed.
2/ try to be consistent in the usage of pr_err() and pr_warn(), and
   document the intention
3/ When strings have been split onto multiple lines, rejoin into
   a single string.
   The cost of having lines > 80 chars is less than the cost of not
   being able to easily search for a particular message.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

9d48739e

md: fix some issues with alloc_disk_sb() · 7f0f0d87

Authored by NeilBrown on Nov 02, 2016

1/ don't print a warning if allocation fails.
 page_alloc() does that already.
2/ always check return status for error.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

7f0f0d87

md: wake up personality thread after array state update · 91a6c4ad

Authored by Tomasz Majchrzak on Oct 25, 2016

When a raid1/raid10 array fails to write to one of the drives, the request
is added to bio_end_io_list and finished by the personality thread. The
thread doesn't handle it as long as the MD_CHANGE_PENDING flag is set. In
the case of external metadata this flag is cleared, however the thread is
not woken up. This causes the request to be blocked for a few seconds (until
another action on the array wakes up the thread) or to get stuck
indefinitely.

Wake up the personality thread once MD_CHANGE_PENDING has been cleared.
Moving the 'restart_array' call after the flag is cleared is not a solution
because in read-write mode the call doesn't wake up the thread.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>

91a6c4ad

md: don't fail an array if there are unacknowledged bad blocks · dcbcb486

Authored by Tomasz Majchrzak on Oct 21, 2016

If the external metadata handler supports bad blocks and unacknowledged bad
blocks are present, don't report the disk via sysfs as faulty. Such a
situation can still be handled, so the disk just has to be blocked for a
moment. This makes it consistent with the kernel state, as the corresponding
rdev flag is also not set.

When the disk is being unblocked there are a few cases:
1. Disk has been in blocked and faulty state, it is being unblocked but
it still remains in faulty state. Metadata handler will remove it from
array in the next call.
2. There is no bad block support in external metadata handler and bad
blocks are present - put the disk in blocked and faulty state (see
case 1).
3. There is bad block support in external metadata handler and all bad
blocks are acknowledged - clear all flags, continue.
4. There is bad block support in external metadata handler but there are
still unacknowledged bad blocks - clear all flags, continue. It is fine
to clear Blocked flag because it was probably not set anyway (if it was
it is case 1). BlockedBadBlocks flag can also be cleared because the
request waiting for it will set it again when it finds out that some bad
block is still not acknowledged. Recovery is not necessary but there are
no problems if the flag is set. Sysfs rdev state is still reported as
blocked (due to unacknowledged bad blocks) so metadata handler will
process remaining bad blocks and unblock disk again.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>

dcbcb486

md: add bad block support for external metadata · 35b785f7

Authored by Tomasz Majchrzak on Oct 21, 2016

Add a new rdev flag which the external metadata handler can use to switch
bad block support on/off. If a new bad block is encountered, notify it via
the rdev 'unacknowledged_bad_blocks' sysfs file. If a bad block has been
cleared, notify the update via the rdev 'bad_blocks' sysfs file.

When bad blocks support is being removed, just clear rdev flag. It is
not necessary to reset badblocks->shift field. If there are bad blocks
cleared or added at the same time, it is ok for those changes to be
applied to the structure. The array is in blocked state and the drive
which cannot handle bad blocks any more will be removed from the array
before it is unlocked.

Simplify state_show function by adding a separator at the end of each
string and overwrite last separator with new line.
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>

35b785f7
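
The state_show simplification mentioned in 35b785f7 (append a separator after every entry, then overwrite the last separator with a newline) can be sketched in userspace C as follows. `demo_state_show` is a hypothetical function, and the comma separator is an assumption of this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Append each name followed by ',' and then turn the final separator
 * into a newline. Returns the length of the resulting string.
 * Entries that do not fit in the buffer are dropped. */
size_t demo_state_show(char *buf, size_t size,
                       const char *const *names, size_t count)
{
    size_t len = 0;

    if (size < 2)
        return 0;
    buf[0] = '\0';

    for (size_t i = 0; i < count; i++) {
        int n = snprintf(buf + len, size - len, "%s,", names[i]);
        if (n < 0 || (size_t)n >= size - len) {
            buf[len] = '\0';    /* discard the truncated entry */
            break;
        }
        len += (size_t)n;
    }

    if (len == 0)
        buf[len++] = ',';       /* nothing written: placeholder to overwrite */
    buf[len - 1] = '\n';        /* last separator becomes the newline */
    buf[len] = '\0';
    return len;
}
```

Compared with tracking a "first entry" flag to decide whether a separator is needed before each entry, this form keeps every append identical and fixes up the output once at the end.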

01 Nov, 2016 1 commit

block,fs: use REQ_* flags directly · 70fd7614

Authored by Christoph Hellwig on Nov 01, 2016

Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly.  Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

70fd7614

29 Oct, 2016 1 commit

md: be careful not lot leak internal curr_resync value into metadata. -- (all) · 1217e1d1

Authored by NeilBrown on Oct 28, 2016

mddev->curr_resync usually records where the current resync is up to,
but during the starting phase it has some "magic" values.

 1 - means that the array is trying to start a resync, but has yielded
     to another array which shares physical devices, and also needs to
     start a resync
 2 - means the array is trying to start resync, but has found another
     array which shares physical devices and has already started resync.

 3 - means that resync has commenced, but it is possible that nothing
     has actually been resynced yet.

It is important that this value not be visible to user-space and
particularly that it doesn't get written to the metadata, as the
resync or recovery checkpoint.  In part, this is because it may be
slightly higher than the correct value, though this is very rare.
In part, because it is not a multiple of 4K, and some devices only
support 4K aligned accesses.

There are two places where this value is propagated into either
->curr_resync_completed or ->recovery_cp or ->recovery_offset.
These currently avoid the propagation of values 1 and 2, but will
allow 3 to leak through.

Change them to only propagate the value if it is > 3.

As this can cause an array to fail, the patch is suitable for -stable.

Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Viswesh <viswesh.vichu@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

1217e1d1
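
The guard introduced by 1217e1d1 amounts to treating 1, 2 and 3 as markers rather than positions. A minimal sketch, where `demo_checkpoint` and `DEMO_RESYNC_LAST_MAGIC` are hypothetical names:

```c
#include <stdint.h>

/* Values 1..3 of curr_resync are start-up markers, not sector offsets. */
#define DEMO_RESYNC_LAST_MAGIC 3

/* Returns the checkpoint to record: the current position if it is a real
 * offset, otherwise the previously recorded checkpoint. */
uint64_t demo_checkpoint(uint64_t curr_resync, uint64_t previous_checkpoint)
{
    if (curr_resync > DEMO_RESYNC_LAST_MAGIC)
        return curr_resync;
    return previous_checkpoint;
}
```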

25 Oct, 2016 1 commit

md: report 'write_pending' state when array in sync · 16f88949

Authored by Tomasz Majchrzak on Oct 24, 2016

If there is a bad block on a disk and there is a recovery performed from
this disk, the same bad block is reported for a new disk. It involves
setting MD_CHANGE_PENDING flag in rdev_set_badblocks. For external
metadata this flag is not being cleared as array state is reported as
'clean'. The read request to bad block in RAID5 array gets stuck as it
is waiting for a flag to be cleared - as per commit c3cce6cd
("md/raid5: ensure device failure recorded before write request
returns.").

The meaning of MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags has been
clarified in commit 070dc6dd ("md: resolve confusion of
MD_CHANGE_CLEAN"), however MD_CHANGE_PENDING flag has been used in
personality error handlers since and it doesn't fully comply with
initial purpose. It was supposed to notify that write request is about
to start, however now it is also used to request metadata update.
Initially (in md_allow_write, md_write_start) MD_CHANGE_PENDING flag has
been set and in_sync has been set to 0 at the same time. Error handlers
just set the flag without modifying in_sync value. Sysfs array state is
a single value so now it reports 'clean' when MD_CHANGE_PENDING flag is
set and in_sync is set to 1. Userspace has no idea it is expected to
take some action.

Swap the order that array state is checked so 'write_pending' is
reported ahead of 'clean' ('write_pending' is a misleading name but it
is too late to rename it now).
Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>

16f88949
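
The reordering in 16f88949 matters because the sysfs array state is a single value, so the first matching condition wins. The sketch below is a simplified model with only three states; `demo_array_state` and its parameters are hypothetical, and the real state list is longer.

```c
#include <stdbool.h>

/* change_pending stands in for MD_CHANGE_PENDING, in_sync for
 * mddev->in_sync. 'write_pending' is checked ahead of 'clean'. */
const char *demo_array_state(bool change_pending, bool in_sync)
{
    if (change_pending)
        return "write_pending";
    if (in_sync)
        return "clean";
    return "active";
}
```

With the checks in the opposite order, the combination of a pending change and `in_sync` set would be reported as "clean", which is the case where userspace had no indication that it should act.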

04 Oct, 2016 1 commit

md: set rotational bit · bb086a89

Authored by Shaohua Li on Sep 30, 2016

If all disks in an array are non-rotational, set the array
non-rotational.

This only works for arrays with all disks populated at startup. Support
for disk hotadd/hotremove could be added later if necessary.
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>

bb086a89
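
The rule in bb086a89 is an "all of" check over the member disks. A minimal sketch with a hypothetical function name, taking one boolean per disk:

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true only if every member disk is non-rotational.
 * A single rotational disk makes the whole array rotational. */
bool demo_array_is_nonrot(const bool *disk_nonrot, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (!disk_nonrot[i])
            return false;
    return true;
}
```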

22 Sep, 2016 4 commits

md: fix a potential deadlock · 90bcf133

Authored by Shaohua Li on Sep 14, 2016

lockdep reports a potential deadlock. Fix this by dropping the mutex
before md_import_device.

[ 1137.126601] ======================================================
[ 1137.127013] [ INFO: possible circular locking dependency detected ]
[ 1137.127013] 4.8.0-rc4+ #538 Not tainted
[ 1137.127013] -------------------------------------------------------
[ 1137.127013] mdadm/16675 is trying to acquire lock:
[ 1137.127013]  (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff81243cf3>] __blkdev_get+0x63/0x450
[ 1137.127013]
but task is already holding lock:
[ 1137.127013]  (detected_devices_mutex){+.+.+.}, at: [<ffffffff81a5138c>] md_ioctl+0x2ac/0x1f50
[ 1137.127013]
which lock already depends on the new lock.

[ 1137.127013]
the existing dependency chain (in reverse order) is:
[ 1137.127013]
-> #1 (detected_devices_mutex){+.+.+.}:
[ 1137.127013]        [<ffffffff810b6f19>] lock_acquire+0xb9/0x220
[ 1137.127013]        [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0
[ 1137.127013]        [<ffffffff81a4eeaf>] md_autodetect_dev+0x3f/0x90
[ 1137.127013]        [<ffffffff81595be8>] rescan_partitions+0x1a8/0x2c0
[ 1137.127013]        [<ffffffff81590081>] __blkdev_reread_part+0x71/0xb0
[ 1137.127013]        [<ffffffff815900e5>] blkdev_reread_part+0x25/0x40
[ 1137.127013]        [<ffffffff81590c4b>] blkdev_ioctl+0x51b/0xa30
[ 1137.127013]        [<ffffffff81242bf1>] block_ioctl+0x41/0x50
[ 1137.127013]        [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0
[ 1137.127013]        [<ffffffff81215321>] SyS_ioctl+0x41/0x70
[ 1137.127013]        [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8
[ 1137.127013]
-> #0 (&bdev->bd_mutex){+.+.+.}:
[ 1137.127013]        [<ffffffff810b6af2>] __lock_acquire+0x1662/0x1690
[ 1137.127013]        [<ffffffff810b6f19>] lock_acquire+0xb9/0x220
[ 1137.127013]        [<ffffffff81c51647>] mutex_lock_nested+0x67/0x3d0
[ 1137.127013]        [<ffffffff81243cf3>] __blkdev_get+0x63/0x450
[ 1137.127013]        [<ffffffff81244307>] blkdev_get+0x227/0x350
[ 1137.127013]        [<ffffffff812444f6>] blkdev_get_by_dev+0x36/0x50
[ 1137.127013]        [<ffffffff81a46d65>] lock_rdev+0x35/0x80
[ 1137.127013]        [<ffffffff81a49bb4>] md_import_device+0xb4/0x1b0
[ 1137.127013]        [<ffffffff81a513d6>] md_ioctl+0x2f6/0x1f50
[ 1137.127013]        [<ffffffff815909b3>] blkdev_ioctl+0x283/0xa30
[ 1137.127013]        [<ffffffff81242bf1>] block_ioctl+0x41/0x50
[ 1137.127013]        [<ffffffff81214c96>] do_vfs_ioctl+0x96/0x6e0
[ 1137.127013]        [<ffffffff81215321>] SyS_ioctl+0x41/0x70
[ 1137.127013]        [<ffffffff81c56825>] entry_SYSCALL_64_fastpath+0x18/0xa8
[ 1137.127013]
other info that might help us debug this:

[ 1137.127013]  Possible unsafe locking scenario:

[ 1137.127013]        CPU0                    CPU1
[ 1137.127013]        ----                    ----
[ 1137.127013]   lock(detected_devices_mutex);
[ 1137.127013]                                lock(&bdev->bd_mutex);
[ 1137.127013]                                lock(detected_devices_mutex);
[ 1137.127013]   lock(&bdev->bd_mutex);
[ 1137.127013]
 *** DEADLOCK ***

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Shaohua Li <shli@fb.com>

90bcf133

md-cluster: clean related infos of cluster · c20c33f0

Authored by Guoqing Jiang on Aug 12, 2016

cluster_info and bitmap_info.nodes also need to be
cleared when the array is stopped.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

c20c33f0

md: changes for MD_STILL_CLOSED flag · af8d8e6f

Authored by Guoqing Jiang on Aug 12, 2016

When stopping a clustered raid while a resync is pending,
the MD_STILL_CLOSED flag could be cleared since a udev rule
is triggered to open the mddev. So the array can't
be stopped and EBUSY is returned.

	mdadm -Ss          md-raid-arrays.rules
  set MD_STILL_CLOSED          md_open()
	... ... ...          clear MD_STILL_CLOSED
	do_md_stop

We make the changes below to resolve this issue:

1. rename MD_STILL_CLOSED to MD_CLOSING since it is set
   when stopping an array and it means we are stopping the array.
2. let md_open return early if CLOSING is set, so no
   other threads will open the array if one thread is trying
   to close it.
3. no need to clear the CLOSING bit in md_open because 1 has
   ensured the bit is cleared, then we also don't need to
   test the CLOSING bit in do_md_stop.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

af8d8e6f

md-cluster: call md_kick_rdev_from_array once ack failed · e566aef1

Authored by Guoqing Jiang on Aug 12, 2016

The new_disk_ack could return failure if WAITING_FOR_NEWDISK
is not set, so we need to kick the dev from the array in case
a failure happened.

We also failed to check err before calling new_disk_ack; otherwise
we could kick an rdev which isn't in the array. Thanks for the
reminder from Shaohua.
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

e566aef1

09 Sep, 2016 1 commit

md-cluster: make md-cluster also can work when compiled into kernel · 47a7b0d8

Authored by Guoqing Jiang on Sep 04, 2016

md-cluster is compiled as a module by default;
if it is built into the kernel instead, then
md-cluster does not work.

[64782.630008] md/raid1:md127: active with 2 out of 2 mirrors
[64782.630528] md-cluster module not found.
[64782.630530] md127: Could not setup cluster service (-2)

Fixes: edb39c9d ("Introduce md_cluster_operations to handle cluster functions")
Cc: stable@vger.kernel.org (v4.1+)
Reported-by: Marc Smith <marc.smith@mcc.edu>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>

47a7b0d8

25 Aug, 2016 1 commit

r5cache: set MD_JOURNAL_CLEAN correctly · 486b0f7b

Authored by Song Liu on Aug 19, 2016

Currently, the code sets MD_JOURNAL_CLEAN when the array has
MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array
will be MD_JOURNAL_CLEAN even if the journal device is missing.

With this patch, MD_JOURNAL_CLEAN is only set when the journal
device is present.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>

486b0f7b
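
The condition change in 486b0f7b can be written as a single predicate. This is a sketch of the rule as the commit message states it, not of where the kernel evaluates it; `demo_journal_clean`, its parameters and `DEMO_MAX_SECTOR` (standing in for MaxSector) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_SECTOR UINT64_MAX  /* stands in for MaxSector */

/* The journal is considered clean only if the array uses a journal,
 * recovery has completed, and the journal device is actually present. */
bool demo_journal_clean(bool has_journal_feature, uint64_t recovery_cp,
                        bool journal_present)
{
    return has_journal_feature
        && recovery_cp == DEMO_MAX_SECTOR
        && journal_present;
}
```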

openanolis / cloud-kernel synced successfully over 1 year ago
