Commits · 7bfec5f35c68121e7b1849f3f4166dd96c8da5b3 · openeuler / Kernel

23 Dec, 2011 6 commits

md/raid5: If there is a spare and a want_replacement device, start replacement. · 7bfec5f3

Committed by NeilBrown on Dec 23, 2011

When attempting to add a spare to a RAID[456] array, also consider
adding it as a replacement for a want_replacement device.

This requires that common md code attempt hot_add even when the array
is not formally degraded.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: create externally visible flags for supporting hot-replace. · 2d78f8c4

Committed by NeilBrown on Dec 23, 2011

hot-replace is a feature being added to md which will allow a
device to be replaced without removing it from the array first.

With hot-replace a spare can be activated and recovery can start while
the original device is still in place, thus allowing a transition from
an unreliable device to a reliable device without leaving the array
degraded during the transition.  It can also be used when the original
device is still reliable but is not wanted for some reason.

This will eventually be supported in RAID4/5/6 and RAID10.

This patch adds a super-block flag to distinguish the replacement
device.  If an old kernel sees this flag it will reject the device.

It also adds two per-device flags which are viewable and settable via
sysfs.
   "want_replacement" can be set to request that a device be replaced.
   "replacement" is set to show that this device is replacing another
   device.

The "rd%d" links in /sys/block/mdXx/md only apply to the original
device, not the replacement.  We currently don't make links for the
replacement - there doesn't seem to be a need.

Signed-off-by: NeilBrown <neilb@suse.de>

md: change hot_remove_disk to take an rdev rather than a number. · b8321b68

Committed by NeilBrown on Dec 23, 2011

Soon an array will be able to have multiple devices with the
same raid_disk number (an original and a replacement).  So removing
a device based on the number won't work.  So pass the actual device
handle instead.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: remove test for duplicate device when setting slot number. · 476a7abb

Committed by NeilBrown on Dec 23, 2011

When setting the slot number on a device in an active array we
currently check that the number is not already in use.
We then call into the personality's hot_add_disk function
which performs the same test and returns the same error.

Thus the common test is not needed.

As we will shortly be changing some personalities to allow duplicates
in some cases (to support hot-replace), the common test will become
inconvenient.

So remove the common test.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: allow non-privileged uses to GET_*_INFO about raid arrays. · 506c9e44

Committed by NeilBrown on Dec 23, 2011

The info is already available in /proc/mdstat and /sys/block in
an accessible form so there is no point in putting a road-block in
the ioctl for information gathering.

Signed-off-by: NeilBrown <neilb@suse.de>

md: don't give up looking for spares on first failure-to-add · 60fc1370

Committed by NeilBrown on Dec 23, 2011

Before performing a recovery we try to remove any spares that
might not be working, then add any that might have become relevant.

Currently we abort on the first spare that cannot be added.
This is a false optimisation.
It is conceivable that - depending on rules in the personality - a
subsequent spare might be accepted.
Also the loop does other things like count the available spares and
reset the 'recovery_offset' value.

If we abort early these might not happen properly.

So remove the early abort.

In particular if you have an array that is undergoing recovery and
which has extra spares, then the recovery may not restart after a
reboot as the count of 'spares' might end up as zero.

Reported-by: Anssi Hannula <anssi.hannula@iki.fi>
Signed-off-by: NeilBrown <neilb@suse.de>
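The effect of the early abort on the spare count can be illustrated with a small user-space sketch. This is not the kernel's loop: `spare_sketch`, `add_spares_abort_early` and `add_spares_keep_going` are hypothetical names, and the `acceptable` field stands in for whatever rule the personality applies in its hot_add_disk function.

```c
#include <stddef.h>

/* Hypothetical stand-in for a candidate spare device. */
struct spare_sketch {
    int acceptable;   /* would the personality accept this device? */
    int added;        /* set once the device has been added */
};

/* Old behaviour: stop at the first spare that cannot be added,
 * so later candidates are never tried or counted. */
int add_spares_abort_early(struct spare_sketch *s, size_t n)
{
    int spares = 0;
    for (size_t i = 0; i < n; i++) {
        if (!s[i].acceptable)
            break;
        s[i].added = 1;
        spares++;
    }
    return spares;
}

/* New behaviour: skip the failing spare and keep looking. */
int add_spares_keep_going(struct spare_sketch *s, size_t n)
{
    int spares = 0;
    for (size_t i = 0; i < n; i++) {
        if (!s[i].acceptable)
            continue;
        s[i].added = 1;
        spares++;
    }
    return spares;
}
```

With one rejected device at the front of the list, the first variant reports zero spares while the second still finds the usable ones.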

08 Dec, 2011 4 commits

md: ensure new badblocks are handled promptly. · 8bd2f0a0

Committed by NeilBrown on Dec 08, 2011

When we mark blocks as bad we need them to be acknowledged by the
metadata handler promptly.

For an in-kernel metadata handler that was already being done.  But
for an external metadata handler we need to alert it of the change by
sending a notification through the sysfs file.  This adds that
notification.

Signed-off-by: NeilBrown <neilb@suse.de>

md: bad blocks shouldn't cause a Blocked status on a Faulty device. · 52c64152

Committed by NeilBrown on Dec 08, 2011

Once a device is marked Faulty the badblocks - whether acknowledged or
not - become irrelevant.  So they shouldn't cause the device to be
marked as Blocked.

Without this patch, a process might write "-blocked" to clear the
Blocked status, but while that will correctly fail the device, it
won't remove the apparent 'blocked' status.

Signed-off-by: NeilBrown <neilb@suse.de>

md: take a reference to mddev during sysfs access. · af8a2434

Committed by NeilBrown on Dec 08, 2011

When we are accessing an mddev via sysfs we know that the
mddev cannot disappear because it has an embedded kobj which
is refcounted by sysfs.
And we also take the mddev_lock.
However this is not enough.

The final mddev_put could have been called and the
mddev_delayed_delete is waiting for sysfs to let go so it can destroy
the kobj and mddev.
In this state there are a lot of changes that should not be attempted.

To guard against this we:
 - initialise mddev->all_mddevs on last put so the state can be
   easily detected.
 - in md_attr_show and md_attr_store, check ->all_mddevs under
   all_mddevs_lock and mddev_get the mddev if it still appears to
   be active.

This means that if we get to sysfs as the mddev is being deleted we
will get -EBUSY.

rdev_attr_store and rdev_attr_show are similar but already have
sufficient protection.  They check that rdev->mddev still points to
mddev after taking mddev_lock.  As this is cleared before delayed
removal which can only be requested under the mddev_lock, this
ensures the rdev and mddev are still alive.

Signed-off-by: NeilBrown <neilb@suse.de>

md: refine interpretation of "hold_active == UNTIL_IOCTL". · 1d23f178

Committed by NeilBrown on Dec 08, 2011

We like md devices to disappear when they really are not needed.
However it is not possible to tell from the current state whether it
is needed or not.  We can only tell from recent history of changes.

In particular immediately after we create an md device it looks very
similar to immediately after we have finished with it.

So we always preserve a newly created md device until something
significant happens.  This state is stored in 'hold_active'.

The normal case is to keep it until an ioctl happens, as that will
normally either activate it, or explicitly de-activate it.  If it
doesn't then it was probably created by mistake and it is now time to
get rid of it.

We can also modify an array via sysfs (instead of via ioctl) and we
currently treat any change via sysfs like an ioctl as a sign that if
it now isn't more active, it should be destroyed.
However this is not appropriate as changes made via sysfs are more
gradual so we should look for a more definitive change.

So this patch only clears 'hold_active' from UNTIL_IOCTL to clear when
the array_state is changed via sysfs.  Other changes via sysfs
are ignored.

Signed-off-by: NeilBrown <neilb@suse.de>

01 Nov, 2011 1 commit

md: Add module.h to all files using it implicitly · 056075c7

Committed by Paul Gortmaker on Jul 03, 2011

A pending cleanup will mean that module.h won't be implicitly
everywhere anymore. Make sure the modular drivers in md dir
are actually calling out for <module.h> explicitly in advance.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>

19 Oct, 2011 1 commit

md.c: trivial comment fix · 751e67ca

Committed by Chris Dunlop on Oct 19, 2011

Trivial comment fix

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: NeilBrown <neilb@suse.de>

18 Oct, 2011 2 commits

MD: Allow restarting an interrupted incremental recovery. · d70ed2e4

Committed by Andrei Warkentin on Oct 18, 2011

If an incremental recovery was interrupted, a subsequent
re-add will result in a full recovery, even though an
incremental should be possible (seen with raid1).

Solve this problem by not updating the superblock on the
recovering device until array is not degraded any longer.

Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: clear In_sync bit on devices added to an active array. · d30519fc

Committed by NeilBrown on Oct 18, 2011

When we add a device to an active array it can be meaningful to set
the 'insync' flag.  This indicates that the device is in-sync with the
array except for locations recorded in the bitmap.
A bitmap-based recovery can then bring it completely in-sync.

Internally we move that flag to 'saved_raid_disk' but forgot to clear
In_sync like we do in add_new_disk.

So clear In_sync after moving its value to saved_raid_disk.

Reported-by: Andrei Warkentin <andreiw@vmware.com>
Signed-off-by: NeilBrown <neilb@suse.de>

11 Oct, 2011 4 commits

md: rename "mdk_personality" to "md_personality" · 84fc4b56

Committed by NeilBrown on Oct 11, 2011

"mdk" doesn't mean anything any more.

Signed-off-by: NeilBrown <neilb@suse.de>

md: remove typedefs: mdk_thread_t -> struct md_thread · 2b8bf345

Committed by NeilBrown on Oct 11, 2011

Signed-off-by: NeilBrown <neilb@suse.de>

md: remove typedefs: mddev_t -> struct mddev · fd01b88c

Committed by NeilBrown on Oct 11, 2011

Having mddev_t and 'struct mddev_s' is ugly and not preferred

Signed-off-by: NeilBrown <neilb@suse.de>

md: removing typedefs: mdk_rdev_t -> struct md_rdev · 3cb03002

Committed by NeilBrown on Oct 11, 2011

The typedefs are just annoying. 'mdk' probably refers to 'md_k.h'
which used to be an include file that defined this thing.

Signed-off-by: NeilBrown <neilb@suse.de>

07 Oct, 2011 1 commit

md: remove PRINTK and dprintk debugging and use pr_debug · 36a4e1fe

Committed by NeilBrown on Oct 07, 2011

Being able to dynamically enable these makes them much more useful.

Signed-off-by: NeilBrown <neilb@suse.de>

23 Sep, 2011 1 commit

md: don't delay reboot by 1 second if no MD devices exist · 2dba6a91

Committed by Daniel P. Berrange on Sep 23, 2011

The md_notify_reboot() method includes a call to mdelay(1000),
to deal with "exotic SCSI devices" which are too volatile on
reboot. The delay is unconditional. Even if the machine does
not have any block devices, let alone MD devices, the kernel
shutdown sequence is slowed down.

1 second does not matter much with physical hardware, but with
certain virtualization use cases any wasted time in the bootup
& shutdown sequence counts for a lot.

* drivers/md/md.c: md_notify_reboot() - only impose a delay if
  there was at least one MD device to be stopped during reboot

Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

21 Sep, 2011 1 commit

md: Avoid waking up a thread after it has been freed. · 01f96c0a

Committed by NeilBrown on Sep 21, 2011

Two related problems:

1/ some error paths call "md_unregister_thread(mddev->thread)"
   without subsequently clearing ->thread.  A subsequent call
   to mddev_unlock will try to wake the thread, and crash.

2/ Most calls to md_wakeup_thread are protected against the thread
   disappearing either by:
      - holding the ->mutex
      - having an active request, so something else must be keeping
        the array active.
   However mddev_unlock calls md_wakeup_thread after dropping the
   mutex and without any certainty of an active request, so the
   ->thread could theoretically disappear.
   So we need a spinlock to provide some protections.

So change md_unregister_thread to take a pointer to the thread
pointer, and ensure that it always does the required locking, and
clears the pointer properly.

Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
cc: stable@kernel.org
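The pointer-to-pointer pattern described in this commit can be sketched in user space as follows. This is only an illustration under stated assumptions: `thread_sketch`, `unregister_thread` and `wakeup_thread` are hypothetical names, a C11 `atomic_flag` stands in for the kernel spinlock, and freeing the structure stands in for stopping the thread.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the md thread structure. */
struct thread_sketch {
    int wakeups;
};

/* A minimal spinlock built on atomic_flag; the kernel uses its own spinlock type. */
static atomic_flag thread_lock = ATOMIC_FLAG_INIT;

static void sketch_lock(void)
{
    while (atomic_flag_test_and_set(&thread_lock)) {
        /* spin until the flag is released */
    }
}

static void sketch_unlock(void)
{
    atomic_flag_clear(&thread_lock);
}

/* Takes the address of the owner's pointer so that the pointer is
 * always cleared, under the lock, before the thread goes away. */
void unregister_thread(struct thread_sketch **threadp)
{
    sketch_lock();
    struct thread_sketch *t = *threadp;
    *threadp = NULL;
    sketch_unlock();
    free(t);    /* free(NULL) is a no-op, so a second call is harmless */
}

/* Returns 1 if a thread was woken, 0 if it had already been unregistered. */
int wakeup_thread(struct thread_sketch **threadp)
{
    int woke = 0;

    sketch_lock();
    if (*threadp) {
        (*threadp)->wakeups++;
        woke = 1;
    }
    sketch_unlock();
    return woke;
}
```

Because callers can no longer unregister a thread without handing over the location of their pointer, the "forgot to clear ->thread" error path described in 1/ cannot be written by accident.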

12 Sep, 2011 1 commit

block: remove support for bio remapping from ->make_request · 5a7bbad2

Committed by Christoph Hellwig on Sep 12, 2011

There is very little benefit in letting a ->make_request
instance update the bio's device and sector and loop around it in
__generic_make_request when we can achieve the same through calling
generic_make_request from the driver and letting the loop in
generic_make_request handle it.

Note that various drivers got the return value from ->make_request and
returned non-zero values for errors.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

10 Sep, 2011 1 commit

md: Fix handling for devices from 2TB to 4TB in 0.90 metadata. · 27a7b260

Committed by NeilBrown on Sep 10, 2011

0.90 metadata uses an unsigned 32bit number to count the number of
kilobytes used from each device.
This should allow up to 4TB per device.
However we multiply this by 2 (to get sectors) before casting to a
larger type, so sizes above 2TB get truncated.

Also we allow rdev->sectors to be larger than 4TB, so it is possible
for the array to be resized larger than the metadata can handle.
So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in
use.

Also the sanity check at the end of super_90_load should include level
1 as it uses ->size too. (RAID0 and Linear don't use ->size at all).

Reported-by: Pim Zandbergen <P.Zandbergen@macroscoop.nl>
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
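The truncation described above comes down to the order of the multiply and the cast. The following is a minimal sketch with hypothetical helper names (`kib_to_sectors_truncating`, `kib_to_sectors`, `clamp_sectors_090`); it does not reproduce the kernel's superblock code, and the exact clamp value the kernel uses is not taken from the commit message.

```c
#include <stdint.h>

/* Buggy form: the doubling is done in 32 bits, so the top bit is
 * lost before the value is widened to 64 bits. */
uint64_t kib_to_sectors_truncating(uint32_t size_kib)
{
    return (uint32_t)(size_kib * 2);
}

/* Fixed form: widen to 64 bits first, then double. */
uint64_t kib_to_sectors(uint32_t size_kib)
{
    return (uint64_t)size_kib * 2;
}

/* Assumed clamp: never report more sectors than a 32-bit KiB count
 * can describe (just under 4TB). */
uint64_t clamp_sectors_090(uint64_t sectors)
{
    const uint64_t max_sectors = (uint64_t)UINT32_MAX * 2;

    return sectors > max_sectors ? max_sectors : sectors;
}
```

For a 3TB device (3221225472 KiB) the truncating form yields 2147483648 sectors, which is only 1TB, while the widened form yields the expected 6442450944 sectors.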

30 Aug, 2011 1 commit

md: fix clearing of 'blocked' flag in the presence of bad blocks. · 7da64a0a

Committed by NeilBrown on Aug 30, 2011

When the 'blocked' flag on a device is cleared while there are
unacknowledged bad blocks we must fail the device.  This is needed for
backwards compatibility of the interface.

The code currently uses the wrong test for "unacknowledged bad blocks
exist".  Change it to the right test.

Signed-off-by: NeilBrown <neilb@suse.de>

25 Aug, 2011 3 commits

md: use REQ_NOIDLE flag in md_super_write() · a5bf4df0

Committed by Namhyung Kim on Aug 25, 2011

Queue idling is used for the anticipation of immediate
sequential I/O's but md_super_write() is a kind of one-shot
operation, coupled with md_super_wait(), so the
idling in this case will be just a waste of time.

Specifying REQ_NOIDLE prevents it. Instead of adding
the flag to submit_bio() directly, use pre-defined
macro WRITE_FLUSH_FUA.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: ensure changes to 'write-mostly' are reflected in metadata. · aeb9b211

Committed by NeilBrown on Aug 25, 2011

The 'write-mostly' flag can be changed through sysfs.
With 0.90 metadata, those changes are reflected in the metadata.
For 1.x metadata, they aren't.

So fix super_1_sync to record 'write-mostly' status.

Signed-off-by: NeilBrown <neilb@suse.de>

md: report failure if a 'set faulty' request doesn't. · 5ef56c8f

Committed by NeilBrown on Aug 25, 2011

Sometimes a device will refuse to be set faulty. e.g. RAID1 will
never let the last working device become faulty.

So check if "md_error()" did manage to set the faulty flag and fail
with EBUSY if it didn't.

Resolves-Debian-Bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=601198
Reported-by: Mike Hommey <mh+reportbug@glandium.org>
Signed-off-by: NeilBrown <neilb@suse.de>
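The check-after-the-fact approach can be sketched like this. The two-device mirror, `mirror_error` (a stand-in for md_error()) and `set_faulty` are hypothetical; only the idea of testing the faulty flag afterwards and returning EBUSY comes from the commit message.

```c
#include <errno.h>

/* Hypothetical two-device mirror used only for illustration. */
struct dev_sketch {
    int faulty;
};

struct mirror_sketch {
    struct dev_sketch dev[2];
};

/* Stand-in for md_error(): marks a device faulty unless it is the
 * last working device, in which case the request is silently refused. */
void mirror_error(struct mirror_sketch *m, int idx)
{
    int working = 0;

    for (int i = 0; i < 2; i++)
        working += !m->dev[i].faulty;

    if (m->dev[idx].faulty || working <= 1)
        return;
    m->dev[idx].faulty = 1;
}

/* Ask for the device to be failed, then report whether it really was. */
int set_faulty(struct mirror_sketch *m, int idx)
{
    mirror_error(m, idx);
    return m->dev[idx].faulty ? 0 : -EBUSY;
}
```

Failing the first device succeeds, while a later request to fail the remaining device returns -EBUSY instead of pretending to have worked.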

28 Jul, 2011 9 commits

md/raid10 record bad blocks as needed during recovery. · e875ecea

Committed by NeilBrown on Jul 28, 2011

When recovering one or more devices, if all the good devices have
bad blocks we should record a bad block on the device being rebuilt.

If this fails, we need to abort the recovery.

To ensure we don't think that we aborted later than we actually did,
we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync,
in particular before mddev->curr_resync is updated.

Signed-off-by: NeilBrown <neilb@suse.de>

md: make it easier to wait for bad blocks to be acknowledged. · de393cde

Committed by NeilBrown on Jul 28, 2011

It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.

If it hasn't we need to wait for the acknowledgement.

We support that using rdev->blocked_wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlocks'.

This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested in.

It should be set by a caller when they find they need to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.

When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
was set incorrectly (see above race).

We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata.   This requires
that each raidXd loop checks if the metadata needs to be written and
triggers a write (md_check_recovery) if needed.  Otherwise a queued
write request might cause raidXd to wait for the metadata to write,
and only that thread can write it.

Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.

The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks.  So user-space which does not
understand bad blocks can continue to function correctly.
User space which does, should not assume a device is faulty until it
sees the 'faulty' flag, and then sees the list of unacknowledged bad
blocks is empty.

Signed-off-by: NeilBrown <neilb@suse.de>

md: add 'write_error' flag to component devices. · d7a9d443

Committed by NeilBrown on Jul 28, 2011

If a device has ever seen a write error, we will want to handle
known-bad-blocks differently.
So create an appropriate state flag and export it via sysfs.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid1: avoid reading from known bad blocks. · d2eb35ac

Committed by NeilBrown on Jul 28, 2011

Now that we have a bad block list, we should not read from those
blocks.
There are several main parts to this:
  1/ read_balance needs to check for bad blocks, and return not only
     the chosen device, but also how many good blocks are available
     there.
  2/ fix_read_error needs to avoid trying to read from bad blocks.
  3/ read submission must be ready to issue multiple reads to
     different devices as different bad blocks on different devices
     could mean that a single large read cannot be served by any one
     device, but can still be served by the array.
     This requires keeping count of the number of outstanding requests
     per bio.  This count is stored in 'bi_phys_segments'
  4/ retrying a read needs to also be ready to submit a smaller read
     and queue another request for the rest.

This does not yet handle bad blocks when reading to perform resync,
recovery, or check.

'md_trim_bio' will also be used for RAID10, so put it in md.c and
export it.

Signed-off-by: NeilBrown <neilb@suse.de>

md: Disable bad blocks and v0.90 metadata. · 9f2f3830

Committed by NeilBrown on Jul 28, 2011

v0.90 metadata cannot record bad blocks, so when loading metadata
for such a device, set shift to -1.

Signed-off-by: NeilBrown <neilb@suse.de>

md: load/store badblock list from v1.x metadata · 2699b672

Committed by NeilBrown on Jul 28, 2011

Space must have been allocated when array was created.
A feature flag is set when the badblock list is non-empty, to
ensure old kernels don't load and trust the whole device.

We only update the on-disk badblocklist when it has changed.
If the badblocklist (or other metadata) is stored on a bad block, we
don't cope very well.

If metadata has no room for bad block, flag bad-blocks as disabled,
and do the same for 0.90 metadata.

Signed-off-by: NeilBrown <neilb@suse.de>

md/bad-block-log: add sysfs interface for accessing bad-block-log. · 16c791a5

Committed by NeilBrown on Jul 28, 2011

This can show the log (providing it fits in one page) and
allows bad blocks to be 'acknowledged' meaning that they
have safely been recorded in metadata.

Clearing bad blocks is not allowed via sysfs (except for
code testing).  A bad block can only be cleared when
a write to the block succeeds.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md: beginnings of bad block management. · 2230dfe4

Committed by NeilBrown on Jul 28, 2011

This is the first step in allowing md to track bad-blocks per-device so
that we can fail individual blocks rather than the whole device.

This patch just adds a data structure for recording bad blocks, with
routines to add, remove, search the list.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
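A per-device bad-block list with add, remove and search routines might look roughly like the sketch below. It is a deliberately simplified illustration: the fixed-size unsorted table, the names `badblocks_sketch`, `bb_add`, `bb_remove` and `bb_is_bad`, and the capacity of 16 are all assumptions, and the kernel's actual layout and algorithms are not described in the commit message.

```c
#include <stddef.h>
#include <stdint.h>

#define BB_SKETCH_MAX 16   /* arbitrary capacity for the sketch */

/* Hypothetical per-device table of bad sector ranges. */
struct badblocks_sketch {
    size_t count;
    struct {
        uint64_t start;   /* first bad sector */
        uint32_t len;     /* number of bad sectors */
    } range[BB_SKETCH_MAX];
};

/* Record [start, start + len) as bad. Returns 0, or -1 if the table is full. */
int bb_add(struct badblocks_sketch *bb, uint64_t start, uint32_t len)
{
    if (bb->count == BB_SKETCH_MAX)
        return -1;
    bb->range[bb->count].start = start;
    bb->range[bb->count].len = len;
    bb->count++;
    return 0;
}

/* Forget the first range that begins at `start`. Returns 1 if one was removed. */
int bb_remove(struct badblocks_sketch *bb, uint64_t start)
{
    for (size_t i = 0; i < bb->count; i++) {
        if (bb->range[i].start == start) {
            bb->range[i] = bb->range[bb->count - 1];
            bb->count--;
            return 1;
        }
    }
    return 0;
}

/* Returns 1 if [start, start + len) overlaps any recorded bad range. */
int bb_is_bad(const struct badblocks_sketch *bb, uint64_t start, uint32_t len)
{
    for (size_t i = 0; i < bb->count; i++) {
        uint64_t s = bb->range[i].start;
        uint64_t e = s + bb->range[i].len;

        if (start < e && s < start + len)
            return 1;
    }
    return 0;
}
```

With such a list, a read or write can be checked against the table and only the affected sectors treated as failed, rather than the whole device.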

md: remove suspicious size_of() · a519b26d

Committed by NeilBrown on Jul 28, 2011

When calling bioset_create we pass the size of the front_pad as
   sizeof(mddev)
which looks suspicious as mddev is a pointer and so it looks like a
common mistake where
   sizeof(*mddev)
was intended.
The size is actually correct as we want to store a pointer in the
front padding of the bios created by the bioset, so make the intent
more explicit by using
   sizeof(mddev_t *)

Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
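The three spellings discussed in this commit can be compared directly. In this sketch `struct mddev_sketch` is a hypothetical stand-in with invented fields; only the `sizeof` expressions mirror the commit message.

```c
#include <stddef.h>

/* Stand-in for the kernel structure; the fields are invented for illustration. */
struct mddev_sketch {
    char name[32];
    long state[16];
};

/* sizeof applied to the pointer variable: correct here, but it looks
 * like the common mistake of forgetting the '*'. */
size_t size_via_variable(struct mddev_sketch *mddev)
{
    return sizeof(mddev);
}

/* The same value, spelled with the pointer type so the intent is explicit. */
size_t size_via_pointer_type(void)
{
    return sizeof(struct mddev_sketch *);
}

/* What a reader might assume was intended: the size of the whole structure.
 * The operand of sizeof is not evaluated, so nothing is dereferenced. */
size_t size_via_deref(struct mddev_sketch *mddev)
{
    return sizeof(*mddev);
}
```

The first two functions always agree, which is why the change is purely about readability; the third returns the much larger structure size.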

27 Jul, 2011 4 commits

MD: generate an event when array sync is complete · 768e587e

Committed by Jonathan Brassow on Jul 27, 2011

This patch causes MD to generate an event (for device-mapper) when the
synchronization thread is reaped. This is expected behavior for device-mapper.

Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: get rid of unnecessary casts on page_address() · 65a06f06

Committed by Namhyung Kim on Jul 27, 2011

page_address() returns a void pointer, so the casts can be removed.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: change managed of recovery_disabled. · 5389042f

Committed by NeilBrown on Jul 27, 2011

If we hit a read error while recovering a mirror, we want to abort the
recovery without necessarily failing the disk - as having a disk with
a read error is better than not having an array at all.

Currently this is managed with a per-array flag "recovery_disabled"
and is only implemented for RAID1.  For RAID10 we will need finer
grained control as we might want to disable recovery for individual
devices separately.

So push more of the decision making into the personality.
'recovery_disabled' is now a 'cookie' which is copied when the
personality wants to disable recovery and is changed when a device is
added to the array as this is used as a trigger to 'try recovery
again'.

This will allow RAID10 to get the control that it needs.

Signed-off-by: NeilBrown <neilb@suse.de>
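The cookie scheme can be sketched as a pair of counters that are compared for equality. All names here (`array_sketch`, `pers_sketch`, `pers_init`, `disable_recovery`, `device_added`, `recovery_is_disabled`) are hypothetical, and starting the personality's copy one below the array's value is an assumption made so that recovery begins enabled.

```c
/* Hypothetical array-wide state: the cookie changes when a device is added. */
struct array_sketch {
    unsigned int recovery_disabled;
};

/* Hypothetical personality state: a private copy of the cookie. */
struct pers_sketch {
    unsigned int recovery_disabled;
};

/* Start with a copy that differs from the array's cookie, so recovery is allowed. */
void pers_init(struct pers_sketch *p, const struct array_sketch *a)
{
    p->recovery_disabled = a->recovery_disabled - 1;
}

/* The personality disables recovery by copying the current cookie. */
void disable_recovery(struct pers_sketch *p, const struct array_sketch *a)
{
    p->recovery_disabled = a->recovery_disabled;
}

/* Adding a device changes the cookie, which re-enables recovery attempts. */
void device_added(struct array_sketch *a)
{
    a->recovery_disabled++;
}

/* Recovery stays disabled only while the two cookies match. */
int recovery_is_disabled(const struct pers_sketch *p, const struct array_sketch *a)
{
    return p->recovery_disabled == a->recovery_disabled;
}
```

Since each personality (or each device within it) can hold its own copy, this leaves room for the finer-grained control the commit says RAID10 will need.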

md: remove ro check in md_check_recovery() · a478a069

Committed by Namhyung Kim on Jul 27, 2011

Commit c89a8eee ("Allow faulty devices to be removed from a
readonly array.") added some work on ro array in the function,
but it couldn't be done since we didn't allow the ro array to be
handled from the beginning. Fix it.

Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

openeuler / Kernel synced successfully more than 1 year ago
