1. 02 Aug 2012 (2 commits)
  2. 31 Jul 2012 (3 commits)
    • md: remove plug_cnt feature of plugging. · 0021b7bc
      NeilBrown authored
      This seemed like a good idea at the time, but after further thought I
      cannot see it making a difference other than very occasionally, and
      testing intended to exercise the case it is most likely to help did
      not show any performance difference from removing it.
      
      So remove the counting of active plugs and allow 'pending writes' to
      be activated at any time, not just when no plugs are active.
      
      This is only relevant when there is a write-intent bitmap, and the
      updating of the bitmap will likely introduce enough delay that
      the single-threading of bitmap updates will be enough to collect large
      numbers of updates together.
      
      Removing this will make it easier to centralise the unplug code, and
      will clear the way for other unplug enhancements which have a
      measurable effect.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE. · 895e3c5c
      majianpeng authored
      'sync' writes set both REQ_SYNC and REQ_NOIDLE.
      O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE.
      
      We currently assume that a REQ_SYNC request will not be followed by
      more requests and so set STRIPE_PREREAD_ACTIVE to expedite the
      request.
      This is appropriate for sync requests, but not for O_DIRECT requests.
      
      So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE
      rather than REQ_SYNC.  This is consistent with the documented meaning
      of REQ_NOIDLE:
      
              __REQ_NOIDLE,           /* don't anticipate more IO after this one */
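      
      A minimal sketch of the resulting test (hedged; based on the raid5
      make_request path of this era, with surrounding code omitted):
      
              /* Sketch, not the verbatim patch: gate the preread
               * expedite on REQ_NOIDLE instead of REQ_SYNC, so O_DIRECT
               * writes (REQ_SYNC without REQ_NOIDLE) no longer set it. */
              if (bi->bi_rw & REQ_NOIDLE)     /* was: bi->bi_rw & REQ_SYNC */
                      set_bit(STRIPE_PREREAD_ACTIVE, &sh->state);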
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: Add R5_ReadNoMerge flag which prevents bios from merging at the block layer · 3f9e7c14
      majianpeng authored
      Because bios are merged at the block layer, a bio error may be caused
      by another bio that was merged into the same request. With this flag
      set, the exact error sector can be found, avoiding redundant
      operations such as re-write and re-read.
      
      V0->V1: Use REQ_FLUSH instead of REQ_NOMERGE to avoid bio merging at
      the block layer.
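      
      A hedged sketch of how the flag takes effect at IO submission (shape
      per the V1 note above; surrounding ops_run_io() code omitted):
      
              /* When retrying a read to pinpoint the failing sector,
               * mark the bio so the block layer will not merge it. */
              if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags))
                      bi->bi_rw |= REQ_FLUSH;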
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  3. 19 Jul 2012 (4 commits)
    • raid5: add a per-stripe lock · b17459c0
      Shaohua Li authored
      Add a per-stripe lock to protect stripe-specific data. The purpose is
      to reduce lock contention on conf->device_lock.
      
      stripe ->toread and ->towrite are protected by the per-stripe lock.
      Access to a stripe's bio lists is always serialized by this lock, so
      adding a bio to the lists (add_stripe_bio()) and removing a bio from
      them (as in ops_run_biofill()) do not race.
      
      If the bios on the ->read, ->written ... lists are not shared by
      multiple stripes, we don't need any lock to protect ->read and
      ->written, because STRIPE_ACTIVE protects them. If a bio is shared,
      there are two protections:
      1. bi_phys_segments acts as a reference count
      2. list traversal uses r5_next_bio(), which ensures the traversal
      never touches a bio that does not belong to the stripe
      
      Let's have an example:
      |  stripe1 |  stripe2    |  stripe3  |
      ...bio1......|bio2|bio3|....bio4.....
      
      stripe2 has 4 bios; when it finishes, it will decrement
      bi_phys_segments for all of them, but only call end_bio for bio2 and
      bio3. bio1->bi_next still points to bio2, but this doesn't matter.
      When stripe1 finishes, it will not touch bio2 because of the
      r5_next_bio check. Later, stripe1 will call end_bio for bio1 and
      stripe3 will call end_bio for bio4.
      
      Before add_stripe_bio() adds a bio to a stripe, we have already
      incremented the bio's bi_phys_segments, so we need not worry about
      another stripe releasing the bio.
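      
      For reference, the bounded traversal relies on a helper shaped
      roughly like this (a sketch matching the raid5.h of this era):
      
              /* Return the next bio only while it still overlaps this
               * stripe; stop otherwise, so we never walk into a bio
               * belonging to another stripe. */
              static inline struct bio *r5_next_bio(struct bio *bio,
                                                    sector_t sector)
              {
                      int sectors = bio->bi_size >> 9;
      
                      if (bio->bi_sector + sectors < sector + STRIPE_SECTORS)
                              return bio->bi_next;
                      return NULL;
              }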
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: remove unnecessary bitmap write optimization · 7eaf7e8e
      Shaohua Li authored
      Neil pointed out that the bitmap write optimization in
      handle_stripe_clean_event() is unnecessary, because the chance that
      one stripe gets written twice in the meantime is small. We can always
      do a bitmap_startwrite when a write request is added to a stripe and a
      bitmap_endwrite after the write request is done. Delete the
      optimization. With it gone, we can drop some uses of device_lock.
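      
      The resulting pairing is simply (a sketch; the signatures follow the
      md bitmap API of this era and the degraded test is illustrative):
      
              /* when a write request is added to the stripe: */
              bitmap_startwrite(conf->mddev->bitmap, sh->sector,
                                STRIPE_SECTORS, 0);
      
              /* when the write to that stripe region completes: */
              bitmap_endwrite(conf->mddev->bitmap, sh->sector,
                              STRIPE_SECTORS,
                              !test_bit(STRIPE_DEGRADED, &sh->state), 0);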
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: lockless access to raid5's overridden bi_phys_segments · e7836bd6
      Shaohua Li authored
      raid5 overrides bio->bi_phys_segments, and it is currently accessed
      with device_lock held. That is unnecessary; we can actually make the
      access lockless.
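      
      The approach, sketched (the accessor names are assumptions): view the
      overridden 32-bit field as two 16-bit counts and update it with
      atomic operations instead of under device_lock.
      
              static inline int raid5_bi_processed_stripes(struct bio *bio)
              {
                      atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
      
                      /* high 16 bits: processed stripe count */
                      return (atomic_read(segments) >> 16) & 0xffff;
              }
      
              static inline int raid5_dec_bi_active_stripes(struct bio *bio)
              {
                      atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
      
                      /* low 16 bits: active stripe (reference) count */
                      return atomic_sub_return(1, segments) & 0xffff;
              }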
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: reduce the chance of release_stripe() taking device_lock · 4eb788df
      Shaohua Li authored
      release_stripe() is a place where conf->device_lock is heavily
      contended. We take the lock even when the stripe count isn't 1, which
      isn't required.
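      
      A sketch of the pattern: atomic_dec_and_lock() takes device_lock only
      when the count actually drops to zero (helper names abbreviated):
      
              static void release_stripe(struct stripe_head *sh)
              {
                      struct r5conf *conf = sh->raid_conf;
                      unsigned long flags;
      
                      local_irq_save(flags);
                      /* the lock is taken only for the final reference */
                      if (atomic_dec_and_lock(&sh->count, &conf->device_lock)) {
                              do_release_stripe(conf, sh);
                              spin_unlock(&conf->device_lock);
                      }
                      local_irq_restore(flags);
              }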
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  4. 03 Jul 2012 (8 commits)
    • md: fix up plugging (again). · b357f04a
      NeilBrown authored
      The value returned by "mddev_check_plug" is only valid until the
      next 'schedule' as that will unplug things.  This could happen at any
      call to mempool_alloc.
      So just calling mddev_check_plug at the start doesn't really make
      sense.
      
      So call it just before, or just after, queuing things for the thread.
      As the action that happens at unplug is to wake the thread, this makes
      lots of sense.
      If we cannot add a plug (which requires a small GFP_ATOMIC alloc) we
      wake the thread immediately.
      
      RAID5 is a bit different.  Requests are queued for the thread and the
      thread is woken by release_stripe.  So we don't need to wake the
      thread on failure.
      However the thread doesn't perform certain actions when there is any
      active plug, so it is important to install a plug before waking the
      thread.  So for RAID5 we install the plug *before* queuing the request
      and waking the thread.
      
      Without this patch it is possible for raid1 or raid10 to queue a
      request without then waking the thread, resulting in the array locking
      up.
      
      Also change raid10 to only call flush_pending_writes when there are
      no active plugs, just like raid1.
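      
      For raid1/raid10 the resulting pattern looks roughly like this (a
      sketch, not the exact diff):
      
              /* queue first, then either rely on the plug's unplug
               * callback or wake the thread directly if no plug could
               * be installed (the GFP_ATOMIC allocation failed) */
              spin_lock_irqsave(&conf->device_lock, flags);
              bio_list_add(&conf->pending_bio_list, mbio);
              conf->pending_count++;
              spin_unlock_irqrestore(&conf->device_lock, flags);
      
              if (!mddev_check_plug(mddev))
                      md_wakeup_thread(mddev->thread);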
      
      This patch is suitable for 3.0 or later.  I plan to submit it to
      -stable, but I'd like to let it spend a few weeks in mainline
      first to be sure it is completely safe.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: delayed stripe fix · fab363b5
      Shaohua Li authored
      There is no locking when setting the STRIPE_DELAYED and
      STRIPE_PREREAD_ACTIVE bits, but the two bits are related. A delayed
      stripe can be moved to the hold list only when the preread-active
      stripe count is below IO_THRESHOLD. If a stripe has both bits set, it
      will sit on the delayed list while the preread count stays non-zero,
      and so will never leave the delayed list.
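      
      The core of the fix, sketched from the __release_stripe() path: a
      stripe with STRIPE_PREREAD_ACTIVE set is no longer parked on the
      delayed list.
      
              if (test_bit(STRIPE_DELAYED, &sh->state) &&
                  !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
                      /* only park stripes whose preread is not active */
                      list_add_tail(&sh->lru, &conf->delayed_list);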
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid456: When a read error cannot be recovered, record a bad block · 2e8ac303
      majianpeng authored
      We may not be able to fix a bad block if:
       - the array is degraded
       - the over-write fails.
      
      In these cases we currently eject the device, but we should
      record a bad block if possible.
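      
      The fallback, sketched: try to record a bad block first and only
      eject the device if recording fails (the sector/length arguments are
      illustrative):
      
              if (!rdev_set_badblocks(rdev, sh->sector,
                                      STRIPE_SECTORS, 0))
                      /* could not record it: fail the device as before */
                      md_error(conf->mddev, rdev);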
      Signed-off-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make 'name' arg to md_register_thread non-optional. · 0232605d
      NeilBrown authored
      Having the 'name' arg optional and defaulting to the current
      personality name is not necessary and leads to errors: when
      changing the level of an array we can end up using the
      name of the old level instead of the new one.
      
      So make it non-optional and always explicitly pass the name
      of the level that the array will be.
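      
      After this change every caller spells the name out, along these
      lines (a sketch; error handling abbreviated):
      
              /* the name no longer defaults to mddev->pers->name */
              mddev->thread = md_register_thread(raid5d, mddev, "raid5");
              if (!mddev->thread)
                      goto abort;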
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: fix refcount problem when blocked_rdev is set. · 5f066c63
      NeilBrown authored
      commit 43220aa0
          md/raid5: fix a hang on device failure.
      
      fixed a hang, but introduced a refcounting imbalance:
      if the presence of bad blocks ever caused an rdev to
      be 'blocked' we would increment the refcount on the rdev and
      never decrement it.
      
      So add the needed rdev_dec_pending when md_wait_for_blocked_rdev
      is not called.
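      
      The shape of the fix in handle_stripe(), sketched with the wait
      condition abbreviated:
      
              if (s.blocked_rdev) {
                      if (still_need_to_wait)         /* abbreviated */
                              md_wait_for_blocked_rdev(s.blocked_rdev,
                                                       conf->mddev);
                      else {
                              /* md_wait_for_blocked_rdev (which would
                               * drop the reference) is not called, so
                               * drop it explicitly */
                              rdev_dec_pending(s.blocked_rdev, conf->mddev);
                              s.blocked_rdev = NULL;
                      }
              }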
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: In ops_run_io, inc nr_pending before calling md_wait_for_blocked_rdev · 1850753d
      majianpeng authored
      In ops_run_io(), the call to md_wait_for_blocked_rdev will decrement
      nr_pending so we lose the reference we hold on the rdev.
      So atomic_inc it first to maintain the reference.
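      
      Sketched, the fix in ops_run_io():
      
              if (rdev && test_bit(Blocked, &rdev->flags)) {
                      /* md_wait_for_blocked_rdev() drops nr_pending,
                       * so take an extra reference first to preserve
                       * the one we already hold */
                      atomic_inc(&rdev->nr_pending);
                      md_wait_for_blocked_rdev(rdev, conf->mddev);
              }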
      
      This bug was introduced by commit  73e92e51
          md/raid5.  Don't write to known bad block on doubtful devices.
      
      which appeared in 3.0, so the patch is suitable for stable kernels
      since then.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: Do not add data_offset before call to is_badblock · 6c0544e2
      majianpeng authored
      In chunk_aligned_read() we are adding data_offset before calling
      is_badblock.  But is_badblock also adds data_offset, so that is bad.
      
      So move the addition of data_offset to after the call to
      is_badblock.
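      
      In sketch form ('raw_sector' is an illustrative name for the sector
      before the offset is applied):
      
              /* after the fix: test with the raw sector; is_badblock()
               * adds rdev->data_offset internally ... */
              if (!is_badblock(rdev, raw_sector, bio_sectors(align_bi),
                               &first_bad, &bad_sectors)) {
                      /* ... and add data_offset only when issuing IO */
                      align_bi->bi_sector = raw_sector + rdev->data_offset;
                      generic_make_request(align_bi);
              }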
      
      This bug was introduced by commit 31c176ec
           md/raid5: avoid reading from known bad blocks.
      which first appeared in 3.0.  So this patch is suitable for any
      -stable kernel from 3.0.y onwards.  However it will need minor
      revision for most of those (as the comment didn't appear until
      recently).
      
      Cc: stable@vger.kernel.org
      Signed-off-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: prefer replacing failed devices over want-replacement devices. · 5cfb22a1
      NeilBrown authored
      If a RAID5 has both a failed device and a device marked as
      'WantReplacement', then we should preferentially replace the failed
      device.
      However the current code replaces whichever is found first.
      So split into 2 loops: check for failed/missing first, and only check
      for WantReplacement if nothing is failed or missing.
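      
      Sketched, the two-pass scan (slot bookkeeping abbreviated):
      
              /* pass 1: prefer a slot whose device failed or is missing */
              for (disk = first; disk <= last; disk++)
                      if (conf->disks[disk].rdev == NULL)
                              break;  /* rebuild into this slot */
      
              if (disk > last)
                      /* pass 2: nothing failed or missing, so now
                       * consider devices marked WantReplacement */
                      for (disk = first; disk <= last; disk++)
                              if (test_bit(WantReplacement,
                                           &conf->disks[disk].rdev->flags) &&
                                  conf->disks[disk].replacement == NULL)
                                      break;  /* install as replacement */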
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  5. 22 May 2012 (5 commits)
  6. 21 May 2012 (4 commits)
    • md/raid5: allow for change in data_offset while managing a reshape. · b5254dd5
      NeilBrown authored
      The important issue here is incorporating the difference in data_offset
      into calculations concerning when we might need to over-write data
      that is still thought to be valid.
      
      To this end we find the minimum offset difference across all devices
      and add that where appropriate.
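      
      A sketch of that computation as it might appear in raid5's run()
      (variable names illustrative):
      
              sector_t min_offset_diff = 0;
              int first = 1;
              struct md_rdev *rdev;
      
              rdev_for_each(rdev, mddev) {
                      /* per-device difference between new and old offsets */
                      sector_t diff = rdev->new_data_offset -
                                      rdev->data_offset;
      
                      if (first || diff < min_offset_diff)
                              min_offset_diff = diff;
                      first = 0;
              }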
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: Use correct data_offset for all IO. · 05616be5
      NeilBrown authored
      As there can now be two different data_offsets - an 'old' and
      a 'new' - we need to carefully choose between them.
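      
      A sketch of the selection (the helper name and the progress test are
      assumptions, shaped by the description above):
      
              static sector_t raid5_data_offset(struct r5conf *conf,
                                                struct md_rdev *rdev,
                                                sector_t stripe_sector)
              {
                      /* sectors the reshape has already passed use the
                       * new offset (direction handling abbreviated) */
                      if (stripe_sector < conf->reshape_progress)
                              return rdev->new_data_offset;
                      return rdev->data_offset;
              }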
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: add possibility to change data-offset for devices. · c6563a8c
      NeilBrown authored
      When reshaping we can avoid costly intermediate backup by
      changing the 'start' address of the array on the device
      (if there is enough room).
      
      So as a first step, allow such a change to be requested
      through sysfs, and recorded in v1.x metadata.
      
      (As we didn't previously check that all 'pad' fields were zero,
       we need a new FEATURE flag for this.
       We now (belatedly) check that all remaining 'pad' fields are
       zero, to avoid a repeat of this.)
      
      The new data offset must be requested separately for each device.
      This allows each to have a different change in the data offset.
      This is not likely to be used often but as data_offset can be
      set per-device, new_data_offset should be too.
      
      This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
      it is never used and never will be.  At the same time we add a new
      arg ('in_new') which is currently always zero but will be used more
      soon.
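      
      In sketch form, the signature change just described (the new
      argument's name is taken from the text above):
      
              /* before: */
              int rdev_set_badblocks(struct md_rdev *rdev, sector_t s,
                                     int sectors, int acknowledged);
              /* after: 'acknowledged' dropped; 'in_new' (always zero
               * for now) will later select new_data_offset: */
              int rdev_set_badblocks(struct md_rdev *rdev, sector_t s,
                                     int sectors, int in_new);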
      
      When a reshape finishes we will need to update the data_offset
      and rdev->sectors.  So provide an exported function to do that.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: allow a reshape operation to be reversed. · 2c810cdd
      NeilBrown authored
      Currently a reshape operation always progresses from the start
      of the array to the end unless the number of devices is being
      reduced, in which case it progresses in the opposite direction.
      
      To reverse a partial reshape which changes the number of devices
      you can stop the array and re-assemble with the raid-disks numbers
      reversed and it will undo.
      
      However for a reshape that does not change the number of devices
      it is not possible to reverse the reshape in the middle - you have to
      wait until it completes.
      
      So add a 'reshape_direction' attribute which is either 'forwards' or
      'backwards' and can be explicitly set when delta_disks is zero.
      
      This will become more important when we allow the data_offset to
      change in a reshape.  Then the explicit statement of what direction is
      being used will be more useful.
      
      This can be enabled in raid5 trivially as it already supports
      reverse reshape and just needs to use a different trigger to request it.
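      
      A sketch of the attribute's store path (names assumed from md's
      usual sysfs helpers):
      
              /* accept exactly 'forwards' or 'backwards' and record
               * the choice on the mddev */
              if (cmd_match(buf, "forwards"))
                      mddev->reshape_backwards = 0;
              else if (cmd_match(buf, "backwards"))
                      mddev->reshape_backwards = 1;
              else
                      return -EINVAL;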
      Signed-off-by: NeilBrown <neilb@suse.de>
  7. 03 Apr 2012 (2 commits)
  8. 19 Mar 2012 (2 commits)
    • md: tidy up rdev_for_each usage. · dafb20fa
      NeilBrown authored
      md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
      mddev.  However it uses the 'safe' version of list_for_each_entry,
      and so requires the extra variable, but doesn't include 'safe' in the
      name, which would be useful documentation.
      
      Consequently some places use this safe version without needing it, and
      many use an explicit list_for_each_entry.
      
      So:
       - rename rdev_for_each to rdev_for_each_safe
       - create a new rdev_for_each which uses the plain
         list_for_each_entry,
       - use the 'safe' version only where needed, and convert all other
         list_for_each_entry calls to use rdev_for_each (the two macros
         are sketched below).
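      
      The two macros then look like this (per the md.h of this era;
      'same_set' is the list member used there):
      
              /* plain iteration; the body must not remove entries */
              #define rdev_for_each(rdev, mddev)                        \
                      list_for_each_entry(rdev, &((mddev)->disks), same_set)
      
              /* 'safe' variant for loops that may remove the current entry */
              #define rdev_for_each_safe(rdev, tmp, mddev)              \
                      list_for_each_entry_safe(rdev, tmp, &((mddev)->disks), same_set)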
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: allow re-add to failed arrays. · dc10c643
      NeilBrown authored
      When an array has failed (some data inaccessible), there is no
      point attempting to add a spare, as it could not possibly be recovered.
      
      However there may be value in re-adding a recently removed device:
      e.g. if there is a write-intent bitmap and it is clear, then access
      to the data could be restored by this action.
      
      So don't reject a re-add to a failed array for RAID10 and RAID5 (the
      only array types that check for a failed array).
      Signed-off-by: NeilBrown <neilb@suse.de>
  9. 13 Mar 2012 (3 commits)
  10. 23 Dec 2011 (7 commits)