提交 · f73a1c7d117d07a96d89475066188a2b79e53c48 · openanolis / cloud-kernel

24 3月, 2013 1 次提交

由 Kent Overstreet 提交于 9月 25, 2012

Just a little convenience macro - main reason to add it now is preparing
for immutable bio vecs, it'll reduce the size of the patch that puts
bi_sector/bi_size/bi_idx into a struct bvec_iter.
Signed-off-by: NKent Overstreet <koverstreet@google.com>
CC: Jens Axboe <axboe@kernel.dk>
CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
CC: Jiri Kosina <jkosina@suse.cz>
CC: Alasdair Kergon <agk@redhat.com>
CC: dm-devel@redhat.com
CC: Neil Brown <neilb@suse.de>
CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
CC: Heiko Carstens <heiko.carstens@de.ibm.com>
CC: linux-s390@vger.kernel.org
CC: Chris Mason <chris.mason@fusionio.com>
CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: NSteven Whitehouse <swhiteho@redhat.com>

f73a1c7d

28 2月, 2013 2 次提交

hlist: drop the node parameter from iterators · b67bfe0d

由 Sasha Levin 提交于 2月 27, 2013

I'm not sure why, but the hlist for each entry iterators were conceived

        list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small amount of places were using the 'node' parameter, this
 was modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
 properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
    <+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b67bfe0d

md: remove CONFIG_MULTICORE_RAID456 · 51acbcec

由 NeilBrown 提交于 2月 28, 2013

This doesn't seem to actually help and we have an alternate
multi-threading approach waiting in the wings, so just get
rid of this config option and associated code.

As a bonus, we remove one use of CONFIG_EXPERIMENTAL

Cc: Dan Williams <djbw@fb.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: NNeilBrown <neilb@suse.de>

51acbcec

14 1月, 2013 1 次提交

block: add missing block_bio_complete() tracepoint · 3a366e61

由 Tejun Heo 提交于 1月 11, 2013

bio completion didn't kick block_bio_complete TP.  Only dm was
explicitly triggering the TP on IO completion.  This makes
block_bio_complete TP useless for tracers which want to know about
bios, and all other bio based drivers skip generating blktrace
completion events.

This patch makes all bio completions via bio_endio() generate
block_bio_complete TP.

* Explicit trace_block_bio_complete() invocation removed from dm and
  the trace point is unexported.

* @rq dropped from trace_block_bio_complete().  bios may fly around
  w/o queue associated.  Verifying and accessing the assocaited queue
  belongs to TP probes.

* blktrace now gets both request and bio completions.  Make it ignore
  bio completions if request completion path is happening.

This makes all bio based drivers generate blktrace completion events
properly and makes the block_bio_complete TP actually useful.

v2: With this change, block_bio_complete TP could be invoked on sg
    commands which have bio's with %NULL bi_bdev.  Update TP
    assignment code to check whether bio->bi_bdev is %NULL before
    dereferencing.
Signed-off-by: NTejun Heo <tj@kernel.org>
Original-patch-by: NNamhyung Kim <namhyung@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: dm-devel@redhat.com
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3a366e61

18 12月, 2012 1 次提交

md/raid5: add blktrace calls · a9add5d9

由 NeilBrown 提交于 10月 31, 2012

This makes it easier to trace what raid5 is doing.
Signed-off-by: NNeilBrown <neilb@suse.de>

a9add5d9

13 12月, 2012 1 次提交

md/raid5: use async_tx_quiesce() instead of open-coding it. · 749586b7

由 NeilBrown 提交于 11月 20, 2012

handle_stripe_expansion contains:

        if (tx) {
                async_tx_ack(tx);
                dma_wait_for_async_tx(tx);
        }

which is very similar to the body of async_tx_quiesce(),
except that the later handles an error from dma_wait_for_async_tx()
(admittedly by panicing, but that decision belongs in the dma
code, not the md code).

So just us async_tx_quiesce().
Acked-by: NDan Williams <djbw@fb.com>
Reported-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

749586b7

30 11月, 2012 1 次提交

wait: add wait_event_lock_irq() interface · eed8c02e

由 Lukas Czerner 提交于 11月 30, 2012

New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit
moves the private wait_event_lock_irq() macro from MD to regular wait
includes, introduces new macro wait_event_lock_irq_cmd() instead of using
the old method with omitting cmd parameter which is ugly and makes a use
of new macros in the MD. It also introduces the _interruptible_ variant.

The use of new interface is when one have a special lock to protect data
structures used in the condition, or one also needs to invoke "cmd"
before putting it to sleep.

All new macros are expected to be called with the lock taken. The lock
is released before sleep and is reacquired afterwards. We will leave the
macro with the lock held.

Note to DM: IMO this should also fix theoretical race on waitqueue while
using simultaneously wait_event_lock_irq() and wait_event() because of
lack of locking around current state setting and wait queue removal.
Signed-off-by: NLukas Czerner <lczerner@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

eed8c02e

22 11月, 2012 2 次提交

md/raid5: Make sure we clear R5_Discard when discard is finished. · ca64cae9

由 NeilBrown 提交于 11月 21, 2012

commit 9e444768
    MD: raid5 avoid unnecessary zero page for trim

change raid5 to clear R5_Discard when the complete request is
handled rather than when submitting the per-device discard request.
However it did not clear R5_Discard for the parity device.

This means that if the stripe_head was reused before it expired from
the cache, the setting would be wrong and a hang would result.

Also if the R5_Uptodate bit happens to be set, R5_Discard again
won't be cleared.  But R5_Uptodate really should be clear at this point.

So make sure R5_Discard is cleared in all cases, and clear
R5_Uptodate when a 'discard' completes.
Signed-off-by: NNeilBrown <neilb@suse.de>

ca64cae9

md/raid5: move resolving of reconstruct_state earlier in · ef5b7c69

由 NeilBrown 提交于 11月 22, 2012

stripe_handle.

The chunk of code in stripe_handle which responds to a
*_result value in reconstruct_state is really the completion
of some processing that happened outside of handle_stripe
(possibly asynchronously) and so should be one of the first
things done in handle_stripe().

After the next patch it will be important that it happens before
handle_stripe_clean_event(), as that will clear some dev->flags
bit that this code tests.
Signed-off-by: NNeilBrown <neilb@suse.de>

ef5b7c69

20 11月, 2012 1 次提交

md/raid5: round discard alignment up to power of 2. · 4ac6875e

由 NeilBrown 提交于 11月 19, 2012

blkdev_issue_discard currently assumes that the granularity
is a power of 2.  So in raid5, round the chosen number up to
avoid embarrassment.

Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: NNeilBrown <neilb@suse.de>

4ac6875e

30 10月, 2012 1 次提交

md: Fix typo in drivers/md · 83f0d77a

由 Masanari Iida 提交于 10月 30, 2012

Correct spelling typo in drivers/md.
Signed-off-by: NMasanari Iida <standby24x7@gmail.com>
Signed-off-by: NJiri Kosina <jkosina@suse.cz>

83f0d77a

11 10月, 2012 9 次提交

md/raid5: be careful not to resize_stripes too big. · e56108d6

由 NeilBrown 提交于 10月 11, 2012

When a RAID5 is reshaping, conf->raid_disks is increased
before mddev->delta_disks becomes zero.
This can result in check_reshape calling resize_stripes with a
number that is too large.  This particularly happens
when md_check_recovery calls ->check_reshape().

If we use ->previous_raid_disks, we don't risk this.
Signed-off-by: NNeilBrown <neilb@suse.de>

e56108d6

Subject: [PATCH] md:change resync_mismatches to atomic64_t to avoid races · 7f7583d4

由 Jianpeng Ma 提交于 10月 11, 2012

Now that multiple threads can handle stripes, it is safer to
use an atomic64_t for resync_mismatches, to avoid update races.
Signed-off-by: NJianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

7f7583d4

md/raid5: make sure to_read and to_write never go negative. · 1ed850f3

由 NeilBrown 提交于 10月 11, 2012

to_read and to_write are part of the result of analysing
a stripe before handling it.
Their use is to avoid some loops and tests if the values are
known to be zero.  Thus it is not a problem if they are a
little bit larger than they should be.

So decrementing them in handle_failed_stripe serves little value, and
due to races it could cause some loops to be skipped incorrectly.

So remove those decrements.
Reported-by: N"Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

1ed850f3

md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write. · a7854487

由 Alexander Lyakas 提交于 10月 11, 2012

Signed-off-by: NAlex Lyakas <alex@zadarastorage.com>
Suggested-by: NYair Hershko <yair@zadarastorage.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

a7854487

md/raid5: protect debug message against NULL derefernce. · b97390ae

由 NeilBrown 提交于 10月 11, 2012

The pr_debug in add_stripe_bio could race with something
changing *bip, so it is best to hold the lock until
after the pr_debug.
Reported-by: N"Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

b97390ae

md/raid5: add some missing locking in handle_failed_stripe. · 143c4d05

由 NeilBrown 提交于 10月 11, 2012

We really should hold the stripe_lock while accessing
'toread' else we could race with add_stripe_bio and corrupt
a list.
Reported-by: N"Jianpeng Ma" <majianpeng@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

143c4d05

MD: raid5 avoid unnecessary zero page for trim · 9e444768

由 Shaohua Li 提交于 10月 11, 2012

We want to avoid zero discarded dev page, because it's useless for discard.
But if we don't zero it, another read/write hit such page in the cache and will
get inconsistent data.

To avoid zero the page, we don't set R5_UPTODATE flag after construction is
done. In this way, discard write request is still issued and finished, but read
will not hit the page. If the stripe gets accessed soon, we need reread the
stripe, but since the chance is low, the reread isn't a big deal.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

9e444768

MD: raid5 trim support · 620125f2

由 Shaohua Li 提交于 10月 11, 2012


Discard for raid4/5/6 has limitation. If discard request size is
small, we do discard for one disk, but we need calculate parity and
write parity disk.  To correctly calculate parity, zero_after_discard
must be guaranteed. Even it's true, we need do discard for one disk
but write another disks, which makes the parity disks wear out
fast. This doesn't make sense. So an efficient discard for raid4/5/6
should discard all data disks and parity disks, which requires the
write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
is smaller than chunk_size, such pattern is almost impossible in
practice. So in this patch, I only handle the case that A's size
equals to chunk_size. That is discard request should be aligned to
stripe size and its size is multiple of stripe size.

Since we can only handle request with specific alignment and size (or
part of the request fitting stripes), we can't guarantee
zero_after_discard even zero_after_discard is true in low level
drives.

The block layer doesn't send down correctly aligned requests even
correct discard alignment is set, so I must filter out.

For raid4/5/6 parity calculation, if data is 0, parity is 0. So if
zero_after_discard is true for all disks, data is consistent after
discard.  Otherwise, data might be lost. Let's consider a scenario:
discard a stripe, write data to one disk and write parity disk. The
stripe could be still inconsistent till then depending on using data
from other data disks or parity disks to calculate new parity. If the
disk is broken, we can't restore it. So in this patch, we only enable
discard support if all disks have zero_after_discard.

If discard fails in one disk, we face the similar inconsistent issue
above. The patch will make discard follow the same path as normal
write request. If discard fails, a resync will be scheduled to make
the data consistent. This isn't good to have extra writes, but data
consistency is important.

If a subsequent read/write request hits raid5 cache of a discarded
stripe, the discarded dev page should have zero filled, so the data is
consistent. This patch will always zero dev page for discarded request
stripe. This isn't optimal because discard request doesn't need such
payload. Next patch will avoid it.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

620125f2

MD: change the parameter of md thread · 4ed8731d

由 Shaohua Li 提交于 10月 11, 2012

Change the thread parameter, so the thread can carry extra info. Next patch
will use it.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

4ed8731d

24 9月, 2012 1 次提交

md/raid5: add missing spin_lock_init. · cb13ff69

由 NeilBrown 提交于 9月 24, 2012

commit b17459c0
   raid5: add a per-stripe lock

added a spin_lock to the 'stripe_head' struct.
Unfortunately there are two places where this struct is allocated
but the spin lock was only initialised in one of them.

So add the missing spin_lock_init.
Signed-off-by: NNeilBrown <neilb@suse.de>

cb13ff69

19 9月, 2012 2 次提交

md/raid5: fix calculate of 'degraded' when a replacement becomes active. · e5c86471

由 NeilBrown 提交于 9月 19, 2012

When a replacement device becomes active, we mark the device that it
replaces as 'faulty' so that it can subsequently get removed.
However 'calc_degraded' only pays attention to the primary device, not
the replacement, so the array appears to become degraded, which is
wrong.

So teach 'calc_degraded' to consider any replacement if a primary
device is faulty.

This is suitable for -stable as an incorrect 'degraded' value can
confuse md and could lead to data corruption.
This is only relevant for 3.3 and later.

Cc: stable@vger.kernel.org
Reported-by: NRobin Hill <robin@robinhill.me.uk>
Reported-by: NJohn Drescher <drescherjm@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

e5c86471

Revert "md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE." · a852d7b8

由 NeilBrown 提交于 9月 19, 2012

This reverts commit 895e3c5c.

While this patch seemed like a good idea and did help some workloads,
it hurts other workloads.
Large sequential O_DIRECT writes were faster,
Small random O_DIRECT writes were slower.

Other changes (batching RAID5 writes) have improved the sequential
writes using a different mechanism, so the net result of this patch
is definitely negative.  So revert it.
Reported-by: NShaohua Li <shli@kernel.org>
Tested-by: NJianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

a852d7b8

02 8月, 2012 2 次提交

raid5: raid5d handle stripe in batch way · 46a06401

由 Shaohua Li 提交于 8月 02, 2012

Let raid5d handle stripe in batch way to reduce conf->device_lock locking.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

46a06401

raid5: make_request use batch stripe release · 8811b596

由 Shaohua Li 提交于 8月 02, 2012

make_request() does stripe release for every stripe and the stripe usually has
count 1, which makes previous release_stripe() optimization not work. In my
test, this release_stripe() becomes the heaviest pleace to take
conf->device_lock after previous patches applied.

Below patch makes stripe release batch. All the stripes will be released in
unplug. The STRIPE_ON_UNPLUG_LIST bit is to protect concurrent access stripe
lru.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

8811b596

31 7月, 2012 3 次提交

md: remove plug_cnt feature of plugging. · 0021b7bc

由 NeilBrown 提交于 7月 31, 2012

This seemed like a good idea at the time, but after further thought I
cannot see it making a difference other than very occasionally and
testing to try to exercise the case it is most likely to help did not
show any performance difference by removing it.

So remove the counting of active plugs and allow 'pending writes' to
be activated at any time, not just when no plugs are active.

This is only relevant when there is a write-intent bitmap, and the
updating of the bitmap will likely introduce enough delay that
the single-threading of bitmap updates will be enough to collect large
numbers of updates together.

Removing this will make it easier to centralise the unplug code, and
will clear the other for other unplug enhancements which have a
measurable effect.
Signed-off-by: NNeilBrown <neilb@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

0021b7bc

md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE. · 895e3c5c

由 majianpeng 提交于 7月 31, 2012

'sync' writes set both REQ_SYNC and REQ_NOIDLE.
O_DIRECT writes set REQ_SYNC but not REQ_NOIDLE.

We currently assume that a REQ_SYNC request will not be followed by
more requests and so set STRIPE_PREREAD_ACTIVE to expedite the
request.
This is appropriate for sync requests, but not for O_DIRECT requests.

So make the setting of STRIPE_PREREAD_ACTIVE conditional on REQ_NOIDLE
rather than REQ_SYNC.  This is consistent with the documented meaning
of REQ_NOIDLE:

        __REQ_NOIDLE,           /* don't anticipate more IO after this one */
Signed-off-by: NJianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

895e3c5c

raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer · 3f9e7c14

由 majianpeng 提交于 7月 31, 2012

Because bios will merge at block-layer,so bios-error may caused by other
bio which be merged into to the same request.
Using this flag,it will find exactly error-sector and not do redundant
operation like re-write and re-read.

V0->V1:Using REQ_FLUSH instead REQ_NOMERGE avoid bio merging at block
layer.
Signed-off-by: NJianpeng Ma <majianpeng@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

3f9e7c14

19 7月, 2012 4 次提交

raid5: add a per-stripe lock · b17459c0

由 Shaohua Li 提交于 7月 19, 2012

Add a per-stripe lock to protect stripe specific data. The purpose is to reduce
lock contention of conf->device_lock.

stripe ->toread, ->towrite are protected by per-stripe lock.  Accessing bio
list of the stripe is always serialized by this lock, so adding bio to the
lists (add_stripe_bio()) and removing bio from the lists (like
ops_run_biofill()) not race.

If bio in ->read, ->written ... list are not shared by multiple stripes, we
don't need any lock to protect ->read, ->written, because STRIPE_ACTIVE will
protect them. If the bio are shared,  there are two protections:
1. bi_phys_segments acts as a reference count
2. traverse the list uses r5_next_bio, which makes traverse never access bio
not belonging to the stripe

Let's have an example:
|  stripe1 |  stripe2    |  stripe3  |
...bio1......|bio2|bio3|....bio4.....

stripe2 has 4 bios, when it's finished, it will decrement bi_phys_segments for
all bios, but only end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but this doesn't matter. When stripe1 is finished, it will not touch bio2
because of r5_next_bio check. Next time stripe1 will end_bio for bio1 and
stripe3 will end_bio bio4.

before add_stripe_bio() addes a bio to a stripe, we already increament the bio
bi_phys_segments, so don't worry other stripes release the bio.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

b17459c0

raid5: remove unnecessary bitmap write optimization · 7eaf7e8e

由 Shaohua Li 提交于 7月 19, 2012

Neil pointed out the bitmap write optimization in handle_stripe_clean_event()
is unnecessary, because the chance one stripe gets written twice in the mean
time is rare. We can always do a bitmap_startwrite when a write request is
added to a stripe and bitmap_endwrite after write request is done. Delete the
optimization. With it, we can delete some cases of device_lock.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

7eaf7e8e

raid5: lockless access raid5 overrided bi_phys_segments · e7836bd6

由 Shaohua Li 提交于 7月 19, 2012

Raid5 overrides bio->bi_phys_segments, accessing it is with device_lock hold,
which is unnecessary, We can make it lockless actually.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

e7836bd6

raid5: reduce chance release_stripe() taking device_lock · 4eb788df

由 Shaohua Li 提交于 7月 19, 2012

release_stripe() is a place conf->device_lock is heavily contended. We take the
lock even stripe count isn't 1, which isn't required.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

4eb788df

03 7月, 2012 8 次提交

md: fix up plugging (again). · b357f04a

由 NeilBrown 提交于 7月 03, 2012

The value returned by "mddev_check_plug" is only valid until the
next 'schedule' as that will unplug things.  This could happen at any
call to mempool_alloc.
So just calling mddev_check_plug at the start doesn't really make
sense.

So call it just before, or just after, queuing things for the thread.
As the action that happens at unplug is to wake the thread, this makes
lots of sense.
If we cannot add a plug (which requires a small GFP_ATOMIC alloc) we
wake thread immediately.

RAID5 is a bit different.  Requests are queued for the thread and the
thread is woken by release_stripe.  So we don't need to wake the
thread on failure.
However the thread doesn't perform certain actions when there is any
active plug, so it is important to install a plug before waking the
thread.  So for RAID5 we install the plug *before* queuing the request
and waking the thread.

Without this patch it is possible for raid1 or raid10 to queue a
request without then waking the thread, resulting in the array locking
up.

Also change raid10 to only flush_pending_write when there are not
active plugs, just like raid1.

This patch is suitable for 3.0 or later.  I plan to submit it to
-stable, but I'll like to let it spend a few weeks in mainline
first to be sure it is completely safe.
Signed-off-by: NNeilBrown <neilb@suse.de>

b357f04a

raid5: delayed stripe fix · fab363b5

由 Shaohua Li 提交于 7月 03, 2012

There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but
the two bits have relationship. A delayed stripe can be moved to hold list only
when preread active stripe count is below IO_THRESHOLD. If a stripe has both
the bits set, such stripe will be in delayed list and preread count not 0,
which will make such stripe never leave delayed list.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

fab363b5

md/raid456: When read error cannot be recovered, record bad block · 2e8ac303

由 majianpeng 提交于 7月 03, 2012

We may not be able to fix a bad block if:
 - the array is degraded
 - the over-write fails.

In these cases we currently eject the device, but we should
record a bad block if possible.
Signed-off-by: Nmajianpeng <majianpeng@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

2e8ac303

md: make 'name' arg to md_register_thread non-optional. · 0232605d

由 NeilBrown 提交于 7月 03, 2012

Having the 'name' arg optional and defaulting to the current
personality name is no necessary and leads to errors, as when
changing the level of an array we can end up using the
name of the old level instead of the new one.

So make it non-optional and always explicitly pass the name
of the level that the array will be.
Reported-by: Nmajianpeng <majianpeng@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

0232605d

md/raid5: fix refcount problem when blocked_rdev is set. · 5f066c63

由 NeilBrown 提交于 7月 03, 2012

commit 43220aa0
    md/raid5: fix a hang on device failure.

fixed a hang, but introduced a refcounting in-balance so
that if the presence of bad-blocks ever caused an rdev to
be 'blocked' we would increment the refcount on the rdev and
never decrement it.

So added the needed rdev_dec_pending when md_wait_for_blocked_rdev
is not called.
Reported-by: Nmajianpeng <majianpeng@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

5f066c63

md/raid5: In ops_run_io, inc nr_pending before calling md_wait_for_blocked_rdev · 1850753d

由 majianpeng 提交于 7月 03, 2012

In ops_run_io(), the call to md_wait_for_blocked_rdev will decrement
nr_pending so we lose the reference we hold on the rdev.
So atomic_inc it first to maintain the reference.

This bug was introduced by commit  73e92e51
    md/raid5.  Don't write to known bad block on doubtful devices.

which appeared in 3.0, so patch is suitable for stable kernels since
then.

Cc: stable@vger.kernel.org
Signed-off-by: Nmajianpeng <majianpeng@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

1850753d

md/raid5: Do not add data_offset before call to is_badblock · 6c0544e2

由 majianpeng 提交于 6月 12, 2012

In chunk_aligned_read() we are adding data_offset before calling
is_badblock.  But is_badblock also adds data_offset, so that is bad.

So move the addition of data_offset to after the call to
is_badblock.

This bug was introduced by commit 31c176ec
     md/raid5: avoid reading from known bad blocks.
which first appeared in 3.0.  So that patch is suitable for any
-stable kernel from 3.0.y onwards.  However it will need minor
revision for most of those (as the comment didn't appear until
recently).

Cc: stable@vger.kernel.org
Signed-off-by: Nmajianpeng <majianpeng@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

6c0544e2

md/raid5: prefer replacing failed devices over want-replacement devices. · 5cfb22a1

由 NeilBrown 提交于 7月 03, 2012

If a RAID5 has both a failed device and a device marked as
'WantReplacement', then we should preferentially replace the failed
device.
However the current code replaces whichever is found first.
So split into 2 loops, check fail failed/missing first, and only check
for WantReplacement if nothing is failed or missing.
Reported-by: Nmajianpeng <majianpeng@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

5cfb22a1

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功