Commits · f70cac8d9c7125f83048f8b3d1c60f5a041a165c · openanolis / cloud-kernel

31 Aug, 2011 1 commit

md/raid5: fix a hang on device failure. · 43220aa0

Committed by NeilBrown on Aug 31, 2011

Waiting for a 'blocked' rdev to become unblocked in the raid5d thread
cannot work with internal metadata as it is the raid5d thread which
will clear the blocked flag.
This wasn't a problem in 3.0 and earlier as we only set the blocked
flag when external metadata was used then.
However we now set it always, so we need to be more careful.
Signed-off-by: NeilBrown <neilb@suse.de>

28 Jul, 2011 7 commits

md/raid5: Clear bad blocks on successful write. · b84db560

Committed by NeilBrown on Jul 28, 2011

On a successful write to a known bad block, flag the sh
so that raid5d can remove the known bad block from the list.
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5. Don't write to known bad block on doubtful devices. · 73e92e51

Committed by NeilBrown on Jul 28, 2011

If a device has seen write errors, don't write to any known
bad blocks on that device.
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: write errors should be recorded as bad blocks if possible. · bc2607f3

Committed by NeilBrown on Jul 28, 2011

When a write error is detected, don't mark the device as failed
immediately but rather record the fact for handle_stripe to deal with.

handle_stripe then attempts to record a bad block.  Only if that fails
does the device get marked as faulty.
Signed-off-by: NeilBrown <neilb@suse.de>
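
The deferred handling described in this commit can be sketched in user-space C. This is only an illustration under stated assumptions: `toy_rdev`, `toy_record_bad_block`, `toy_handle_write_error` and the fixed-size list are hypothetical stand-ins, not the kernel's structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capacity; the real bad-block log has its own limits. */
#define TOY_MAX_BAD_BLOCKS 4

/* Hypothetical, simplified stand-in for a member device. */
struct toy_rdev {
    unsigned long long bad[TOY_MAX_BAD_BLOCKS]; /* recorded bad sectors */
    size_t nr_bad;                              /* entries in use */
    bool faulty;                                /* device marked failed */
};

/* Try to log one bad sector; fails once the list is full. */
static bool toy_record_bad_block(struct toy_rdev *rdev, unsigned long long sector)
{
    assert(rdev != NULL);
    if (rdev->nr_bad == TOY_MAX_BAD_BLOCKS)
        return false;
    rdev->bad[rdev->nr_bad++] = sector;
    return true;
}

/* A write error first tries to become a bad-block record;
 * the device is marked faulty only if recording fails. */
static void toy_handle_write_error(struct toy_rdev *rdev, unsigned long long sector)
{
    if (!toy_record_bad_block(rdev, sector))
        rdev->faulty = true;
}
```

In this sketch a zero-initialised `struct toy_rdev` absorbs four write errors as bad-block records and is marked faulty on the fifth.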

md/raid5: use bad-block log to improve handling of uncorrectable read errors. · 7f0da59b

Committed by NeilBrown on Jul 28, 2011

If we get an uncorrectable read error - record a bad block rather than
failing the device.
And if these errors (which may be due to known bad blocks) cause
recovery to be impossible, record a bad block on the recovering
devices, or abort the recovery.

As we might abort a recovery without failing a device we need to teach
RAID5 about recovery_disabled handling.
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: avoid reading from known bad blocks. · 31c176ec

Committed by NeilBrown on Jul 28, 2011

There are two times that we might read in raid5:
1/ when a read request fits within a chunk on a single
   working device.
   In this case, if there is any bad block in the range of
   the read, we simply fail the cache-bypass read and
   perform the read through the stripe cache.

2/ when reading into the stripe cache.  In this case we
   mark as failed any device which has a bad block in that
   strip (1 page wide).
   Note that we will both avoid reading and avoid writing.
   This is correct (as we will never read from the block, there
   is no point writing), but not optimal (as writing could 'fix'
   the error) - that will be addressed later.

If we have not seen any write errors on the device yet, we treat a bad
block like a recent read error.  This will encourage an attempt to fix
the read error which will either generate a write error, or will
ensure good data is stored there.  We don't yet forget the bad block
in that case.  That comes later.

Now that we honour bad blocks when reading we can allow devices with
bad blocks into the array.
Signed-off-by: NeilBrown <neilb@suse.de>
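
Case 1 above (a cache-bypass read that falls back to the stripe cache when the range touches a bad block) might look roughly like the following sketch. All names here (`toy_bad_range`, `toy_range_has_bad_block`, `toy_choose_read_path`) are hypothetical; the kernel's own bad-block lookup is not reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bad-block entry: sectors [start, start + len). */
struct toy_bad_range {
    unsigned long long start;
    unsigned long long len;
};

enum toy_read_path { TOY_READ_BYPASS, TOY_READ_VIA_STRIPE_CACHE };

/* True if [sector, sector + sectors) overlaps any recorded bad range. */
static bool toy_range_has_bad_block(const struct toy_bad_range *bad, size_t n,
                                    unsigned long long sector,
                                    unsigned long long sectors)
{
    for (size_t i = 0; i < n; i++) {
        /* Two half-open intervals overlap iff each starts before the other ends. */
        if (bad[i].start < sector + sectors &&
            sector < bad[i].start + bad[i].len)
            return true;
    }
    return false;
}

/* Bypass the stripe cache only when the whole range is free of bad blocks. */
static enum toy_read_path toy_choose_read_path(const struct toy_bad_range *bad,
                                               size_t n,
                                               unsigned long long sector,
                                               unsigned long long sectors)
{
    assert(sectors > 0);
    return toy_range_has_bad_block(bad, n, sector, sectors)
               ? TOY_READ_VIA_STRIPE_CACHE
               : TOY_READ_BYPASS;
}
```

With one bad range covering sectors 100 to 107, a read of sectors 92 to 99 takes the bypass path in this sketch, while a read of sectors 96 to 103 is routed through the stripe cache.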

md: make it easier to wait for bad blocks to be acknowledged. · de393cde

Committed by NeilBrown on Jul 28, 2011

It is only safe to choose not to write to a bad block if that bad
block is safely recorded in metadata - i.e. if it has been
'acknowledged'.

If it hasn't we need to wait for the acknowledgement.

We support that using rdev->blocked wait and
md_wait_for_blocked_rdev by introducing a new device flag
'BlockedBadBlocks'.

This flag is only advisory.
It is cleared whenever we acknowledge a bad block, so that a waiter
can re-check the particular bad blocks that it is interested in.

It should be set by a caller when they find they need to wait.
This (set after test) is inherently racy, but as
md_wait_for_blocked_rdev already has a timeout, losing the race will
have minimal impact.

When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
was set incorrectly (see above race).

We also modify the way we manage 'Blocked' to fit better with the new
handling of 'BlockedBadBlocks' and to make it consistent between
externally managed and internally managed metadata.   This requires
that each raidXd loop checks if the metadata needs to be written and
triggers a write (md_check_recovery) if needed.  Otherwise a queued
write request might cause raidXd to wait for the metadata to write,
and only that thread can write it.

Before writing metadata, we set FaultRecorded for all devices that
are Faulty, then after writing the metadata we clear Blocked for any
device for which the Fault was certainly Recorded.

The 'faulty' device flag now appears in sysfs if the device is faulty
*or* it has unacknowledged bad blocks.  So user-space which does not
understand bad blocks can continue to function correctly.
User space which does, should not assume a device is faulty until it
sees the 'faulty' flag, and then sees the list of unacknowledged bad
blocks is empty.
Signed-off-by: NeilBrown <neilb@suse.de>

md: don't allow arrays to contain devices with bad blocks. · 34b343cf

Committed by NeilBrown on Jul 28, 2011

As no personality understands bad block lists yet, we must
reject any device that is known to contain bad blocks.
As the personalities get taught, these tests can be removed.

This only applies to raid1/raid5/raid10.
For linear/raid0/multipath/faulty the whole concept of bad blocks
doesn't mean anything so there is no point adding the checks.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

27 Jul, 2011 13 commits

md/raid5: Avoid BUG caused by multiple failures. · 8cfa7b0f

Committed by NeilBrown on Jul 27, 2011

While preparing to write a stripe we keep the parity block or blocks
locked (R5_LOCKED) - towards the end of schedule_reconstruction.

If the array is discovered to have failed before this write completes
we can leave those blocks LOCKED, and init_stripe will notice that a
free stripe still has a locked block and will complain.

So clear the R5_LOCKED flag in handle_failed_stripe, and demote the
'BUG' to a 'WARN_ON'.
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: move rdev->corrected_errors counting · ddd5115f

Committed by Namhyung Kim on Jul 27, 2011

Read errors are considered to be corrected if the write-back and re-read
cycle finishes without further problems. Thus moving the
rdev->corrected_errors counting after the re-read looks more reasonable
IMHO.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md: introduce link/unlink_rdev() helpers · 36fad858

Committed by Namhyung Kim on Jul 27, 2011

There are places where sysfs links to rdev are handled
in a same way. Add the helper functions to consolidate
them.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid: use printk_ratelimited instead of printk_ratelimit · 8bda470e

Committed by Christian Dietrich on Jul 27, 2011

As per printk_ratelimit comment, it should not be used.
Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: finalise new merged handle_stripe. · acfe726b

Committed by NeilBrown on Jul 27, 2011

handle_stripe5() and handle_stripe6() are now virtually identical.
So discard one and rename the other to 'analyse_stripe()'.

It always returns 0, so change it to 'void' and remove the 'done'
variable in handle_stripe().
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: move some more common code into handle_stripe · 474af965

Committed by NeilBrown on Jul 27, 2011

The RAID6 version of this code is usable for RAID5 providing:
  - we test "conf->max_degraded" rather than "2" as appropriate
  - we make sure s->failed_num[1] is meaningful (and not '-1')
    when s->failed > 1

The 'return 1' must become 'goto finish' in the new location.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: move more common code into handle_stripe · 84789554

Committed by NeilBrown on Jul 27, 2011

Apart from 'prexor' which can only be set for RAID5, and
'qd_idx' which can only be meaningful for RAID6, these two
chunks of code are nearly the same.

So combine them into one adding a test to call either
handle_parity_checks5 or handle_parity_checks6 as appropriate.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: unite handle_stripe_dirtying5 and handle_stripe_dirtying6 · c8ac1803

Committed by NeilBrown on Jul 27, 2011

RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
also allowed 'read-modify-write'.
Apart from this difference, handle_stripe_dirtying[56] are nearly
identical.  So resolve these differences and create just one function.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: unite fetch_block5 and fetch_block6 · 93b3dbce

Committed by NeilBrown on Jul 27, 2011

Provided that ->failed_num[1] is not a valid device number (which is
easily achieved) fetch_block6 provides all the functionality of
fetch_block5.

So remove the latter and rename the former to simply "fetch_block".

Then handle_stripe_fill5 and handle_stripe_fill6 become the same and
can similarly be united.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: rearrange a test in fetch_block6. · 5d35e09c

Committed by NeilBrown on Jul 27, 2011

Next patch will unite fetch_block5 and fetch_block6.
First I want to make the differences a little more clear.

For RAID6 if we are writing at all and there is a failed device, then
we need to load or compute every block so we can do a
reconstruct-write.
This case isn't needed for RAID5 - we will do a read-modify-write in
that case.
So make that test a separate test in fetch_block6 rather than merged
with two other tests.

Make a similar change in fetch_block5 so the one bit that is not
needed for RAID6 is clearly separate.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: move more code into common handle_stripe · c5a31000

Committed by NeilBrown on Jul 27, 2011

The difference between the RAID5 and RAID6 code here is easily
resolved using conf->max_degraded.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: Move code for finishing a reconstruction into handle_stripe. · 3687c061

Committed by NeilBrown on Jul 27, 2011

Prior to commit ab69ae12 the code in handle_stripe5 and
handle_stripe6 to "Finish reconstruct operations initiated by the
expansion process" was identical.
That commit added an identical stanza of code to each function, but in
different places.  That was careless.

The raid5 code was correct, so move that out into handle_stripe and
remove raid6 version.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: Remove stripe_head_state arg from handle_stripe_expansion. · 86c374ba

Committed by NeilBrown on Jul 27, 2011

This arg is only used to differentiate between RAID5 and RAID6 but
that is not needed.  For RAID5, raid5_compute_sector will set qd_idx
to "~0" so j will certainly not equal qd_idx, so there is no need
for a guard on that condition.

So remove the guard and remove the arg from the declaration and
callers of handle_stripe_expansion.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

26 Jul, 2011 7 commits

md/raid5: move stripe_head_state and more code into handle_stripe. · cc94015a

Committed by NeilBrown on Jul 26, 2011

By defining the 'stripe_head_state' in 'handle_stripe', we can move
some common code out of handle_stripe[56]() and into handle_stripe.

This means that all accesses for stripe_head_state in handle_stripe[56]
need to be 's->' instead of 's.', but the compiler should inline
those functions and just use a direct stack reference, and future
patches will hoist most of this code up into handle_stripe()
so we will revert to "s.".
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: add some more fields to stripe_head_state · c5709ef6

Committed by NeilBrown on Jul 26, 2011

Adding these three fields will allow more common code to be moved
to handle_stripe()

struct field rearrangement by Namhyung Kim.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: unify stripe_head_state and r6_state · f2b3b44d

Committed by NeilBrown on Jul 26, 2011

'struct stripe_head_state' stores state about the 'current' stripe
that is passed around while handling the stripe.
For RAID6 there is an extension structure: r6_state, which is also
passed around.
There is no value in keeping these separate, so move the fields from
the latter into the former.

This means that all code now needs to treat s->failed_num as a small
array, but this is a small cost.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: move common code into handle_stripe · 82e5a171

Committed by NeilBrown on Jul 26, 2011

There is common code at the start of handle_stripe5 and
handle_stripe6.  Move it into handle_stripe.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: replace sh->lock with an 'active' flag. · c4c1663b

Committed by NeilBrown on Jul 26, 2011

sh->lock is now mainly used to ensure that two threads aren't running
in the locked part of handle_stripe[56] at the same time.

That can more neatly be achieved with an 'active' flag which we set
while running handle_stripe.  If we find the flag is set, we simply
requeue the stripe for later by setting STRIPE_HANDLE.

For safety we take ->device_lock while examining the state of the
stripe and creating a summary in 'stripe_head_state / r6_state'.
This possibly isn't needed but as shared fields like ->toread,
->towrite are checked it is safer for now at least.

We leave the label after the old 'unlock' called "unlock" because it
will disappear in a few patches, so renaming seems pointless.

This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
later, but that is not a problem.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: Protect some more code with ->device_lock. · cbe47ec5

Committed by NeilBrown on Jul 26, 2011

Other places that change or follow dev->towrite and dev->written take
the device_lock as well as the sh->lock.
So it should really be held in these places too.
Also, doing so will allow sh->lock to be discarded.

with merged fixes by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

md/raid5: Remove use of sh->lock in sync_request · 83206d66

Committed by NeilBrown on Jul 26, 2011

This is the start of a series of patches to remove sh->lock.

sync_request takes sh->lock before setting STRIPE_SYNCING to ensure
there is no race with testing it in handle_stripe[56].

Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early
in handle_stripe[56] (after getting the same lock) and perform the
same set/clear operations if it was set.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>

18 Jul, 2011 2 commits

md/raid5: get rid of duplicated call to bio_data_dir() · ffd96e35

Committed by Namhyung Kim on Jul 18, 2011

In raid5::make_request(), once bio_data_dir(@bi) has been determined
it never changes (and cannot change). Always use the saved result.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: use kmem_cache_zalloc() · 6ce32846

Committed by Namhyung Kim on Jul 18, 2011

Replace kmem_cache_alloc + memset(,0,) with kmem_cache_zalloc.
I think it's not harmful since @conf->slab_cache already knows
actual size of struct stripe_head.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

14 Jun, 2011 3 commits

md/raid5: remove unusual use of bio_iovec_idx() · fcde9075

Committed by Namhyung Kim on Jun 14, 2011

In the bio_for_each_segment loop, bvl always points to the current
bio_vec, which is the same as bio_iovec_idx(, i). Let's get rid of
it.

Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: fix FUA request handling in ops_run_io() · b062962e

Committed by Namhyung Kim on Jun 14, 2011

Commit e9c7469b ("md: implment REQ_FLUSH/FUA support")
introduced R5_WantFUA flag and set rw to WRITE_FUA in that case.
However the remaining code still checks whether rw is exactly equal
to WRITE, so a FUA write ends up being treated as a
READ. Fix it.

This bug has been present since 2.6.37 and the fix is suitable for any
-stable kernel since then.  It is not clear why this has not caused
more problems.

Cc: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
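
The class of bug fixed here (an exact comparison against WRITE that stops matching once extra flag bits are OR-ed in) can be shown with a small sketch. The `TOY_*` constants are made-up values for illustration only and do not match the kernel's definitions; the helper names are hypothetical as well.

```c
#include <assert.h>
#include <stdbool.h>

/* Made-up flag layout, for illustration only. */
#define TOY_READ       0u
#define TOY_WRITE      1u                     /* direction bit */
#define TOY_FUA        (1u << 4)              /* extra request flag */
#define TOY_WRITE_FUA  (TOY_WRITE | TOY_FUA)  /* write carrying the FUA flag */

/* Buggy test: only a plain write compares equal, so a FUA write
 * falls through to whatever the read branch does. */
static bool toy_is_write_buggy(unsigned int rw)
{
    return rw == TOY_WRITE;
}

/* Fixed test: look at the direction bit and ignore the other flags. */
static bool toy_is_write_fixed(unsigned int rw)
{
    return (rw & TOY_WRITE) != 0;
}
```

With these definitions `toy_is_write_buggy(TOY_WRITE_FUA)` is false while `toy_is_write_fixed(TOY_WRITE_FUA)` is true, and both agree on plain reads and plain writes.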

md/raid5: fix raid5_set_bi_hw_segments · 9b2dc8b6

Committed by Namhyung Kim on Jun 13, 2011

The @bio->bi_phys_segments consists of active stripes count in the
lower 16 bits and processed stripes count in the upper 16 bits. So
logical-OR operator should be bitwise one.

This bug has been present since 2.6.27 and the fix is suitable for any
-stable kernel since then.  Fortunately the bad code is only used on
error paths and is relatively unlikely to be hit.

Cc: stable@kernel.org
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
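
To see why the logical OR was wrong, here is a minimal sketch of the two-counter packing the message describes (active stripes in the lower 16 bits, processed stripes in the upper 16 bits). The helper names are hypothetical and only mirror the idea, not the kernel's code.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy packing: '||' is a logical OR, so the result is only ever 0 or 1
 * and both counters are lost. */
static uint32_t toy_pack_buggy(uint16_t active, uint16_t processed)
{
    return active || ((uint32_t)processed << 16);
}

/* Fixed packing: '|' merges the two 16-bit fields bit by bit. */
static uint32_t toy_pack_fixed(uint16_t active, uint16_t processed)
{
    return active | ((uint32_t)processed << 16);
}

/* Lower 16 bits: active stripe count. */
static uint16_t toy_active(uint32_t packed)
{
    return (uint16_t)(packed & 0xffff);
}

/* Upper 16 bits: processed stripe count. */
static uint16_t toy_processed(uint32_t packed)
{
    return (uint16_t)((packed >> 16) & 0xffff);
}
```

For 3 active and 5 processed stripes, `toy_pack_buggy` yields 1, whereas `toy_pack_fixed` yields a value from which `toy_active` recovers 3 and `toy_processed` recovers 5.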

09 Jun, 2011 1 commit

MD: raid5 do not set fullsync · d6b212f4

Committed by Jonathan Brassow on Jun 08, 2011

Add check to determine if a device needs full resync or if partial resync will do

RAID 5 was assuming that if a device was not In_sync, it must undergo a full
resync.  We add a check to see if 'saved_raid_disk' is the same as 'raid_disk'.
If it is, we can safely skip the full resync and rely on the bitmap for
partial recovery instead.  This is the legitimate purpose of 'saved_raid_disk',
from md.h:
int saved_raid_disk;            /* role that device used to have in the
                                 * array and could again if we did a partial
                                 * resync from the bitmap
                                 */
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
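
The decision described above reduces to one comparison. The sketch below uses a hypothetical `toy_member` struct and `toy_needs_full_resync` helper; only the field names `raid_disk` and `saved_raid_disk` come from the commit message, and the surrounding kernel logic is not reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of a member device. */
struct toy_member {
    int raid_disk;        /* role the device has in the array now */
    int saved_raid_disk;  /* role it used to have, or -1 if none */
    bool in_sync;         /* device is already in sync */
};

/* A device that is not in sync needs a full resync unless it is
 * returning to its old role, in which case the bitmap can drive a
 * partial recovery. */
static bool toy_needs_full_resync(const struct toy_member *m)
{
    assert(m != NULL);
    if (m->in_sync)
        return false;
    return m->saved_raid_disk != m->raid_disk;
}
```

In this sketch an out-of-sync device with `raid_disk == saved_raid_disk == 2` skips the full resync, while one with `saved_raid_disk == -1` does not.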

11 May, 2011 2 commits

md: allow resync_start to be set while an array is active. · b098636c

Committed by NeilBrown on May 11, 2011

The sysfs attribute 'resync_start' (known internally as recovery_cp),
records where a resync is up to.  A value of 0 means the array is
not known to be in-sync at all.  A value of MaxSector means the array
is believed to be fully in-sync.

When the size of member devices of an array (RAID1,RAID4/5/6) is
increased, the array can be increased to match.  This process sets
resync_start to the old end-of-device offset so that the new part of
the array gets resynced.

However with RAID1 (and RAID6) a resync is not technically necessary
and may be undesirable.  So it would be good if the implied resync
after the array is resized could be avoided.

So: change 'resync_start' so the value can be changed while the array
is active, and as a precaution only allow it to be changed while
resync/recovery is 'frozen'.  Changing it once resync has started is
not going to be useful anyway.

This allows the array to be resized without a resync by:
  write 'frozen' to 'sync_action'
  write new size to 'component_size' (this will set resync_start)
  write 'none' to 'resync_start'
  write 'idle' to 'sync_action'.

Also slightly improve some tests on recovery_cp when resizing
raid1/raid5.  Now that an arbitrary value could be set we should be
more careful in our tests.
Signed-off-by: NeilBrown <neilb@suse.de>
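
The four-step resize procedure listed in this commit could be driven from a small C helper along these lines. This is a sketch under assumptions: `toy_write_attr` and `toy_resize_without_resync` are hypothetical names, `md_dir` is assumed to be the array's sysfs directory (for example something like /sys/block/md0/md), and the format of `new_size` is left to the caller. Only the attribute names and the order of the writes come from the message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write 'value' to <md_dir>/<attr>. Returns 0 on success, -1 on failure. */
static int toy_write_attr(const char *md_dir, const char *attr, const char *value)
{
    char path[512];
    int n = snprintf(path, sizeof(path), "%s/%s", md_dir, attr);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;                      /* path would be truncated */

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;

    int ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)                 /* write errors may surface at close */
        ok = 0;
    return ok ? 0 : -1;
}

/* Perform the four writes in the order given in the commit message,
 * stopping at the first failure. */
static int toy_resize_without_resync(const char *md_dir, const char *new_size)
{
    assert(md_dir != NULL && new_size != NULL);
    if (toy_write_attr(md_dir, "sync_action", "frozen") != 0)
        return -1;
    if (toy_write_attr(md_dir, "component_size", new_size) != 0)
        return -1;
    if (toy_write_attr(md_dir, "resync_start", "none") != 0)
        return -1;
    return toy_write_attr(md_dir, "sync_action", "idle");
}
```

Note that this sketch stops at the first failed write and does not attempt to undo a 'frozen' state that was already set; a real tool would probably want to write 'idle' back on the error path.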

md: make error_handler functions more uniform and correct. · 6f8d0c77

Committed by NeilBrown on May 11, 2011

- there is no need to test_bit Faulty, as that was already done in
  md_error which is the only caller of these functions.
- MD_CHANGE_DEVS should be set *after* faulty is set to ensure
  metadata is updated correctly.
- spinlock should be held while updating ->degraded.
Signed-off-by: NeilBrown <neilb@suse.de>

10 May, 2011 1 commit

md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course'). · aeb878b0

Committed by Jesper Juhl on Apr 10, 2011

There's a small typo in a comment in drivers/md/raid5.c - 'Of course' is
misspelled as 'Ofcourse'. This patch fixes the spelling error.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>

22 Apr, 2011 1 commit

raid5: fix build error, sector_t usage · d76c8420

Committed by Randy Dunlap on Apr 21, 2011

Change <sectors> from unsigned long long to sector_t.
This matches its source field.

  ERROR: "__udivdi3" [drivers/md/raid456.ko] undefined!
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

20 Apr, 2011 2 commits

md: Fix dev_sectors on takeover from raid0 to raid4/5 · 3b71bd93

Committed by NeilBrown on Apr 20, 2011

A raid0 array doesn't set 'dev_sectors' as each device might
contribute a different number of sectors.
So when converting to a RAID4 or RAID5 we need to set dev_sectors
as they need the number.
We have already verified that in fact all devices do contribute
the same number of sectors, so use that number.
Signed-off-by: NeilBrown <neilb@suse.de>

md/raid5: remove setting of ->queue_lock · 2b7da309

Committed by NeilBrown on Apr 20, 2011

We previously needed to set ->queue_lock to match the raid5
device_lock so we could safely use queue_flag_* operations (e.g. for
plugging), which test that ->queue_lock is in fact locked.

However that need has completely gone away and is unlikely to come
back, so remove this now-pointless setting.
Signed-off-by: NeilBrown <neilb@suse.de>
