1. 23 December 2011, 6 commits
    • md/raid5: preferentially read from replacement device if possible. · 14a75d3e
      NeilBrown committed
      If a replacement device is present and has been recovered far enough,
      then use it for reading into the stripe cache.
      
      If we get an error we don't try to repair it, we just fail the device.
      A replacement device that gives errors does not sound sensible.
      
      This requires removing the setting of R5_ReadError when we get
      a read error during a read that bypasses the cache.  It was probably
      a bad idea anyway as we don't know that every block in the read
      caused an error, and it could cause ReadError to be set for the
      replacement device, which is bad.
      Signed-off-by: NeilBrown <neilb@suse.de>
      14a75d3e
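      A minimal self-contained sketch of the selection described above; the types
      and names (including STRIPE_SECTORS and recovery_offset) are simplified
      stand-ins modelled on md conventions, not the kernel code:

          #include <stdbool.h>

          #define STRIPE_SECTORS 8ULL   /* one 4KiB stripe page, in 512-byte sectors */

          struct dev_model {
              bool present;
              bool faulty;
              unsigned long long recovery_offset;   /* recovered up to this sector */
          };

          struct slot_model {
              struct dev_model rdev;          /* original device in this slot */
              struct dev_model replacement;   /* optional replacement device */
          };

          /* Prefer the replacement for a stripe-cache read once its recovery
           * has passed the end of this stripe; otherwise read the original. */
          static const struct dev_model *read_source(const struct slot_model *slot,
                                                     unsigned long long sector)
          {
              const struct dev_model *rep = &slot->replacement;

              if (rep->present && !rep->faulty &&
                  rep->recovery_offset >= sector + STRIPE_SECTORS)
                  return rep;
              return &slot->rdev;
          }
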
    • md/raid5: remove redundant bio initialisations. · 995c4275
      NeilBrown committed
      We currently initialise some fields of a bio when preparing a
      stripe_head, and again just before submitting the request.
      
      Remove the duplication by only setting the fields that lower level
      devices don't touch in raid5_build_block, and only setting the changeable
      fields in ops_run_io.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      995c4275
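      Roughly, the split looks like the sketch below; the field names echo the bio
      fields of that era (bi_io_vec, bi_vcnt, bi_sector, bi_bdev, bi_rw, bi_end_io)
      but the struct is a stand-in, and exactly which fields land in which helper
      differs in detail in the real driver:

          /* Stand-in bio: just enough fields to show the two groups. */
          struct bio_model {
              /* set once, when the stripe_head is built */
              void *bi_io_vec;
              int   bi_vcnt;
              /* reset on every submission, because the request differs and
               * lower layers may have modified them */
              unsigned long long bi_sector;
              void *bi_bdev;
              unsigned long bi_rw;
              void (*bi_end_io)(struct bio_model *, int);
          };

          /* raid5_build_block-style setup: only the invariant fields. */
          static void build_block(struct bio_model *b, void *vec)
          {
              b->bi_io_vec = vec;
              b->bi_vcnt = 1;
          }

          /* ops_run_io-style setup: only the per-request fields. */
          static void prep_submit(struct bio_model *b, unsigned long long sector,
                                  void *bdev, unsigned long rw,
                                  void (*end_io)(struct bio_model *, int))
          {
              b->bi_sector = sector;
              b->bi_bdev = bdev;
              b->bi_rw = rw;
              b->bi_end_io = end_io;
          }
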
    • md/raid5: allow each slot to have an extra replacement device · 671488cc
      NeilBrown committed
      Just enhance data structures to record a second device per slot to be
      used as a 'replacement' device, replacing the original.
      We also have a second bio in each slot in each stripe_head.  This will
      only be used when writing to the array - we need to write to both the
      original and the replacement at the same time, so will need two bios.
      
      For now, only try using the replacement drive for aligned-reads.
      In this case, we prefer the replacement if it has been recovered far
      enough, otherwise use the original.
      
      This includes a small enhancement.  Previously we would only do
      aligned reads if the target device was fully recovered.  Now we also
      do them if it has recovered far enough.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      671488cc
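      The shape of the change, as a sketch with stand-in declarations rather than
      the actual drivers/md/raid5.h definitions:

          struct md_rdev;   /* a member device (stand-in declaration) */
          struct bio;

          /* Each slot can now hold two devices ... */
          struct disk_info_sketch {
              struct md_rdev *rdev;          /* the original device */
              struct md_rdev *replacement;   /* device being rebuilt to replace it */
          };

          /* ... and each per-device part of a stripe_head carries two bios,
           * so a write can go to the original and the replacement at once. */
          struct r5dev_sketch {
              struct bio *req;    /* I/O to the original device */
              struct bio *rreq;   /* I/O to the replacement (writes only) */
              /* pages, flags, etc. omitted */
          };
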
    • md: change hot_remove_disk to take an rdev rather than a number. · b8321b68
      NeilBrown committed
      Soon an array will be able to have multiple devices with the
      same raid_disk number (an original and a replacement).  So removing
      a device based on the number won't work.  So pass the actual device
      handle instead.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      b8321b68
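      A sketch of the interface change (stand-in types; the kernel prototypes may
      differ in detail):

          struct mddev;
          struct md_rdev;

          /* Before: remove by slot number -- ambiguous once a slot can hold
           * both an original and a replacement device. */
          int hot_remove_disk_old(struct mddev *mddev, int number);

          /* After: pass the device itself, so exactly that rdev is removed. */
          int hot_remove_disk_new(struct mddev *mddev, struct md_rdev *rdev);
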
    • md/raid5: be more thorough in calculating 'degraded' value. · 908f4fbd
      NeilBrown committed
      When an array is being reshaped to change the number of devices,
      the two halves can be differently degraded.  e.g. one could be
      missing a device and the other not.
      
      So we need to be more careful about calculating the 'degraded'
      attribute.
      
      Instead of just inc/dec at appropriate times, perform a full
      re-calculation examining both possible cases.  This doesn't happen
      often so it is not a big cost, and we already have most of the code to
      do it.
      Signed-off-by: NeilBrown <neilb@suse.de>
      908f4fbd
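      A self-contained model of the "recalculate both halves" idea; the real
      calculation also folds in recovery progress relative to the reshape
      position, which is omitted here:

          #include <stdbool.h>

          struct dev_state {
              bool present;
              bool in_sync;
          };

          static int count_degraded(const struct dev_state *devs, int raid_disks)
          {
              int i, degraded = 0;

              for (i = 0; i < raid_disks; i++)
                  if (!devs[i].present || !devs[i].in_sync)
                      degraded++;
              return degraded;
          }

          /* During a reshape the array has two geometries at once.  Each can
           * be degraded differently, so recompute both and report the worse
           * of the two instead of keeping a running inc/dec counter. */
          static int calc_degraded_model(const struct dev_state *devs,
                                         int prev_raid_disks, int raid_disks)
          {
              int before = count_degraded(devs, prev_raid_disks);
              int after  = count_degraded(devs, raid_disks);

              return before > after ? before : after;
          }
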
    • md/raid5: ensure correct assessment of drives during degraded reshape. · 30d7a483
      NeilBrown committed
      While reshaping a degraded array (as when reshaping a RAID0 by first
      converting it to a degraded RAID4) we currently get confused about
      which devices are in_sync.  In most cases we get it right, but in the
      region that is being reshaped we need to treat non-failed devices as
      in-sync when we have the data but haven't actually written it out yet.
      Reported-by: Adam Kwolek <adam.kwolek@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      30d7a483
  2. 09 December 2011, 1 commit
  3. 08 December 2011, 1 commit
    • md/raid5: never wait for bad-block acks on failed device. · 9283d8c5
      NeilBrown committed
      Once a device is failed we really want to completely ignore it.
      It should go away soon anyway.
      
      In particular the presence of bad blocks on it should not cause us to
      block as we won't be trying to write there anyway.
      
      So as soon as we can check if a device is Faulty, do so and pretend
      that it is already gone if it is Faulty.
      Signed-off-by: NeilBrown <neilb@suse.de>
      9283d8c5
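      In sketch form the change is just an early Faulty check (stand-in types and
      names, not the kernel code):

          #include <stdbool.h>

          struct dev_bb_state {
              bool faulty;
              bool unacked_badblocks;   /* bad blocks not yet recorded in metadata */
          };

          /* Should a write wait for this device's bad-block list to be
           * acknowledged?  Not if the device is already Faulty: we will not
           * write to it, and it is about to be removed anyway. */
          static bool must_wait_for_bb_ack(const struct dev_bb_state *d)
          {
              if (d->faulty)
                  return false;
              return d->unacked_badblocks;
          }
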
  4. 08 November 2011, 2 commits
  5. 01 November 2011, 1 commit
  6. 26 October 2011, 2 commits
    • md: Fix some bugs in recovery_disabled handling. · d890fa2b
      NeilBrown committed
      In 3.0 we changed the way recovery_disabled was handled so that instead
      of testing against zero, we test an mddev-> value against a conf->
      value.
      Two problems:
        1/ one place in raid1 was missed and still sets it to '1'.
        2/ We didn't explicitly set the conf-> value at array creation
           time.
           It defaulted to '0' just like the mddev value does so they
           could appear equal and thus disable recovery.
           This did not affect normal 'md' as it calls bind_rdev_to_array
           which changes the mddev value.  However the dmraid interface
           doesn't call this and so doesn't change ->recovery_disabled; so at
           array start all recovery is incorrectly disabled.
      
      So initialise the 'conf' value to one less than the mddev value, so
      they will only be the same when explicitly set that way.
      Reported-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      d890fa2b
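      The fix itself amounts to one assignment at conf setup time; sketched here
      with stand-in types:

          struct mddev_sketch  { int recovery_disabled; };
          struct r5conf_sketch { int recovery_disabled; };

          /* Start the conf-> value one below the mddev-> value, so the two
           * only compare equal after recovery is explicitly disabled -- even
           * on paths (such as dm-raid) that never call bind_rdev_to_array. */
          static void init_recovery_disabled(struct r5conf_sketch *conf,
                                             const struct mddev_sketch *mddev)
          {
              conf->recovery_disabled = mddev->recovery_disabled - 1;
          }
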
    • md/raid5: fix bug that could result in reads from a failed device. · 355840e7
      NeilBrown committed
      This bug was introduced in 415e72d0
      which was in 2.6.36.
      
      There is a small window of time between when a device fails and when
      it is removed from the array.  During this time we might still read
      from it, but we won't write to it - so it is possible that we could
      read stale data.
      
      We didn't need the test of 'Faulty' before because the test on
      In_sync was sufficient.  Since we started allowing reads from the early
      part of non-In_sync devices we need a test on Faulty too.
      
      This is suitable for any kernel from 2.6.36 onwards, though the patch
      might need a bit of tweaking in 3.0 and earlier.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      355840e7
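      A sketch of the resulting read-eligibility test (stand-in struct; the kernel
      checks rdev flags and recovery_offset):

          #include <stdbool.h>

          struct read_candidate {
              bool present;
              bool faulty;
              bool in_sync;
              unsigned long long recovery_offset;   /* recovered up to this sector */
          };

          /* A fully In_sync device is readable, and a recovering device is
           * readable for stripes its recovery has already passed -- but only
           * if it has not been marked Faulty in the meantime, because a
           * failed device may hold stale data in the window before removal. */
          static bool can_read_stripe_from(const struct read_candidate *d,
                                           unsigned long long sector,
                                           unsigned long long stripe_sectors)
          {
              if (!d->present || d->faulty)
                  return false;
              if (d->in_sync)
                  return true;
              return d->recovery_offset >= sector + stripe_sectors;
          }
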
  7. 11 October 2011, 5 commits
  8. 07 October 2011, 3 commits
  9. 21 September 2011, 1 commit
    • md: Avoid waking up a thread after it has been freed. · 01f96c0a
      NeilBrown committed
      Two related problems:
      
      1/ some error paths call "md_unregister_thread(mddev->thread)"
         without subsequently clearing ->thread.  A subsequent call
         to mddev_unlock will try to wake the thread, and crash.
      
      2/ Most calls to md_wakeup_thread are protected against the thread
         disappearing, either by:
            - holding the ->mutex
            - having an active request, so something else must be keeping
              the array active.
         However mddev_unlock calls md_wakeup_thread after dropping the
         mutex and without any certainty of an active request, so the
         ->thread could theoretically disappear.
         So we need a spinlock to provide some protection.
      
      So change md_unregister_thread to take a pointer to the thread
      pointer, and ensure that it always does the required locking, and
      clears the pointer properly.
      Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      cc: stable@kernel.org
      01f96c0a
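      A sketch of the new calling convention (stand-in types; the real function
      does the pointer swap under a spinlock):

          struct md_thread;                           /* stand-in declaration */
          void stop_and_free(struct md_thread *t);    /* stand-in for kthread_stop + kfree */

          /* The caller passes the address of its pointer; the pointer is
           * cleared before the thread is torn down, so a later wakeup sees
           * NULL rather than a freed thread. */
          static void md_unregister_thread_sketch(struct md_thread **threadp)
          {
              struct md_thread *thread = *threadp;

              *threadp = NULL;                        /* locking elided here */
              if (thread)
                  stop_and_free(thread);
          }
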
  10. 12 September 2011, 1 commit
  11. 31 August 2011, 1 commit
    • md/raid5: fix a hang on device failure. · 43220aa0
      NeilBrown committed
      Waiting for a 'blocked' rdev to become unblocked in the raid5d thread
      cannot work with internal metadata as it is the raid5d thread which
      will clear the blocked flag.
      This wasn't a problem in 3.0 and earlier, as we only set the blocked
      flag when external metadata was used.
      However we now set it always, so we need to be more careful.
      Signed-off-by: NeilBrown <neilb@suse.de>
      43220aa0
  12. 28 July 2011, 7 commits
    • md/raid5: Clear bad blocks on successful write. · b84db560
      NeilBrown committed
      On a successful write to a known bad block, flag the sh
      so that raid5d can remove the known bad block from the list.
      Signed-off-by: NeilBrown <neilb@suse.de>
      b84db560
    • md/raid5. Don't write to known bad block on doubtful devices. · 73e92e51
      NeilBrown committed
      If a device has seen write errors, don't write to any known
      bad blocks on that device.
      Signed-off-by: NeilBrown <neilb@suse.de>
      73e92e51
    • md/raid5: write errors should be recorded as bad blocks if possible. · bc2607f3
      NeilBrown committed
      When a write error is detected, don't mark the device as failed
      immediately but rather record the fact for handle_stripe to deal with.
      
      Handle_stripe then attempts to record a bad block.  Only if that fails
      does the device get marked as faulty.
      Signed-off-by: NeilBrown <neilb@suse.de>
      bc2607f3
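      The decision in sketch form (stand-in helpers; the kernel records the error
      in a per-device flag that handle_stripe later acts on):

          #include <stdbool.h>

          struct md_rdev_sketch;
          /* returns true if the range was recorded in the bad-block log */
          bool record_badblock(struct md_rdev_sketch *rdev,
                               unsigned long long sector,
                               unsigned long long sectors);
          void fail_device(struct md_rdev_sketch *rdev);

          /* A write error no longer fails the device immediately: first try
           * to remember the failed range as a bad block, and only fail the
           * device if that is not possible. */
          static void handle_write_error(struct md_rdev_sketch *rdev,
                                         unsigned long long sector,
                                         unsigned long long sectors)
          {
              if (!record_badblock(rdev, sector, sectors))
                  fail_device(rdev);
          }
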
    • md/raid5: use bad-block log to improve handling of uncorrectable read errors. · 7f0da59b
      NeilBrown committed
      If we get an uncorrectable read error - record a bad block rather than
      failing the device.
      And if these errors (which may be due to known bad blocks) cause
      recovery to be impossible, record a bad block on the recovering
      devices, or abort the recovery.
      
      As we might abort a recovery without failing a device we need to teach
      RAID5 about recovery_disabled handling.
      Signed-off-by: NeilBrown <neilb@suse.de>
      7f0da59b
    • md/raid5: avoid reading from known bad blocks. · 31c176ec
      NeilBrown committed
      There are two times that we might read in raid5:
      1/ when a read request fits within a chunk on a single
         working device.
         In this case, if there is any bad block in the range of
         the read, we simply fail the cache-bypass read and
         perform the read through the stripe cache.
      
      2/ when reading into the stripe cache.  In this case we
         mark as failed any device which has a bad block in that
         strip (1 page wide).
         Note that we will both avoid reading and avoid writing.
         This is correct (as we will never read from the block, there
         is no point writing), but not optimal (as writing could 'fix'
         the error) - that will be addressed later.
      
      If we have not seen any write errors on the device yet, we treat a bad
      block like a recent read error.  This will encourage an attempt to fix
      the read error which will either generate a write error, or will
      ensure good data is stored there.  We don't yet forget the bad block
      in that case.  That comes later.
      
      Now that we honour bad blocks when reading we can allow devices with
      bad blocks into the array.
      Signed-off-by: NeilBrown <neilb@suse.de>
      31c176ec
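      Both read paths reduce to an overlap check against the bad-block list; a
      sketch with stand-in helpers:

          #include <stdbool.h>

          struct md_rdev_sk;
          /* stand-in: does [sector, sector+sectors) overlap a recorded bad block? */
          bool range_has_badblock(struct md_rdev_sk *rdev,
                                  unsigned long long sector,
                                  unsigned long long sectors);

          /* Path 1: a read that would bypass the stripe cache.  Any bad block
           * in the range means falling back to the stripe cache, where the
           * missing data can be reconstructed from the other devices. */
          static bool can_bypass_cache(struct md_rdev_sk *rdev,
                                       unsigned long long sector,
                                       unsigned long long sectors)
          {
              return !range_has_badblock(rdev, sector, sectors);
          }

          /* Path 2: reading into the stripe cache.  A device with a bad block
           * in this one-page strip is treated as failed for this stripe, so
           * it is neither read from nor written to. */
          static bool usable_for_stripe(struct md_rdev_sk *rdev,
                                        unsigned long long stripe_sector,
                                        unsigned long long stripe_sectors)
          {
              return !range_has_badblock(rdev, stripe_sector, stripe_sectors);
          }
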
    • md: make it easier to wait for bad blocks to be acknowledged. · de393cde
      NeilBrown committed
      It is only safe to choose not to write to a bad block if that bad
      block is safely recorded in metadata - i.e. if it has been
      'acknowledged'.
      
      If it hasn't we need to wait for the acknowledgement.
      
      We support that using rdev->blocked wait and
      md_wait_for_blocked_rdev by introducing a new device flag
      'BlockedBadBlocks'.
      
      This flag is only advisory.
      It is cleared whenever we acknowledge a bad block, so that a waiter
      can re-check the particular bad blocks that it is interested in.
      
      It should be set by a caller when they find they need to wait.
      This (set after test) is inherently racy, but as
      md_wait_for_blocked_rdev already has a timeout, losing the race will
      have minimal impact.
      
      When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
      was set incorrectly (see above race).
      
      We also modify the way we manage 'Blocked' to fit better with the new
      handling of 'BlockedBadBlocks' and to make it consistent between
      externally managed and internally managed metadata.   This requires
      that each raidXd loop checks if the metadata needs to be written and
      triggers a write (md_check_recovery) if needed.  Otherwise a queued
      write request might cause raidXd to wait for the metadata to write,
      and only that thread can write it.
      
      Before writing metadata, we set FaultRecorded for all devices that
      are Faulty, then after writing the metadata we clear Blocked for any
      device for which the Fault was certainly Recorded.
      
      The 'faulty' device flag now appears in sysfs if the device is faulty
      *or* it has unacknowledged bad blocks.  So user-space which does not
      understand bad blocks can continue to function correctly.
      User space which does, should not assume a device is faulty until it
      sees the 'faulty' flag, and then sees the list of unacknowledged bad
      blocks is empty.
      Signed-off-by: NeilBrown <neilb@suse.de>
      de393cde
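      The set-after-test wait pattern described above, in sketch form (stand-in
      types; the real code uses rdev flag bits and md_wait_for_blocked_rdev, which
      already has a timeout):

          #include <stdbool.h>

          struct rdev_sketch {
              bool blocked_badblocks;   /* advisory "waiting for an ack" flag */
          };
          /* stand-ins for the bad-block lookup and the (timed) wait */
          bool badblock_unacked(struct rdev_sketch *rdev,
                                unsigned long long sector,
                                unsigned long long sectors);
          void wait_until_unblocked(struct rdev_sketch *rdev);

          /* A writer that would hit an unacknowledged bad block sets the
           * advisory flag and waits.  The flag is cleared whenever a bad
           * block is acknowledged, so after waking we re-check the specific
           * range; losing the set-after-test race only costs a short delay
           * because the wait times out. */
          static void wait_for_badblock_ack(struct rdev_sketch *rdev,
                                            unsigned long long sector,
                                            unsigned long long sectors)
          {
              while (badblock_unacked(rdev, sector, sectors)) {
                  rdev->blocked_badblocks = true;
                  wait_until_unblocked(rdev);
              }
          }
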
    • md: don't allow arrays to contain devices with bad blocks. · 34b343cf
      NeilBrown committed
      As no personality understands bad block lists yet, we must
      reject any device that is known to contain bad blocks.
      As the personalities get taught, these tests can be removed.
      
      This only applies to raid1/raid5/raid10.
      For linear/raid0/multipath/faulty the whole concept of bad blocks
      doesn't mean anything so there is no point adding the checks.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
      34b343cf
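      The check is as simple as it sounds; a sketch with a stand-in bad-block
      list summary:

          struct badblocks_sketch { int count; };   /* stand-in list summary */

          /* A personality that does not yet understand bad blocks refuses
           * any incoming device whose list is non-empty. */
          static int validate_new_device(const struct badblocks_sketch *bb)
          {
              return bb->count ? -1 /* would be -EINVAL in the kernel */ : 0;
          }
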
  13. 27 July 2011, 9 commits