提交 · 59fc630b8b5f9f21c8ce3ba153341c107dce1b0c · openanolis / cloud-kernel

22 4月, 2015 5 次提交

RAID5: batch adjacent full stripe write · 59fc630b

由 shli@kernel.org 提交于 12月 15, 2014

stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k
unit. Idealy we should use big size for adjacent full stripe writes. Bigger
stripe cache size means less stripes runing in the state machine so can reduce
cpu overhead. And also bigger size can cause bigger IO size dispatched to under
layer disks.

With below patch, we will automatically batch adjacent full stripe write
together. Such stripes will be added to the batch list. Only the first stripe
of the list will be put to handle_list and so run handle_stripe(). Some steps
of handle_stripe() are extended to cover all stripes of the list, including
ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes
running in handle_stripe() and we send IO of whole stripe list together to
increase IO size.

Stripes added to a batch list have some limitations. A batch list can only
include full stripe write and can't cross chunk boundary to make sure stripes
have the same parity disks. Stripes in a batch list must be in the same state
(no written, toread and so on). If a stripe is in a batch list, all new
read/write to add_stripe_bio will be blocked to overlap conflict till the batch
list is handled. The limitations will make sure stripes in a batch list be in
exactly the same state in the life circly.

I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6
PCIe SSD. This patch improves around 30% performance and IO size to under layer
disk is exactly 32k. I also run a 4k randwrite test in the same array to make
sure the performance isn't changed with the patch.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

59fc630b

raid5: track overwrite disk count · 7a87f434

由 shli@kernel.org 提交于 12月 15, 2014

Track overwrite disk count, so we can know if a stripe is a full stripe write.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

7a87f434

raid5: add a new flag to track if a stripe can be batched · da41ba65

由 shli@kernel.org 提交于 12月 15, 2014

A freshly new stripe with write request can be batched. Any time the stripe is
handled or new read is queued, the flag will be cleared.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

da41ba65

raid5: use flex_array for scribble data · 46d5b785

由 shli@kernel.org 提交于 12月 15, 2014

Use flex_array for scribble data. Next patch will batch several stripes
together, so scribble data should be able to cover several stripes, so this
patch also allocates scribble data for stripes across a chunk.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

46d5b785

md: remove 'go_faster' option from ->sync_request() · 09314799

由 NeilBrown 提交于 2月 19, 2015

This option is not well justified and testing suggests that
it hardly ever makes any difference.

The comment suggests there might be a need to wait for non-resync
activity indicated by ->nr_waiting, however raise_barrier()
already waits for all of that.

So just remove it to simplify reasoning about speed limiting.

This allows us to remove a 'FIXME' comment from raid5.c as that
never used the flag.
Signed-off-by: NNeilBrown <neilb@suse.de>

09314799

25 2月, 2015 1 次提交

raid5: check faulty flag for array status during recovery. · 16d9cfab

由 Eric Mei 提交于 1月 06, 2015

When we have more than 1 drive failure, it's possible we start
rebuild one drive while leaving another faulty drive in array.
To determine whether array will be optimal after building, current
code only check whether a drive is missing, which could potentially
lead to data corruption. This patch is to add checking Faulty flag.
Signed-off-by: NNeilBrown <neilb@suse.de>

16d9cfab

18 2月, 2015 1 次提交

md/raid5: Fix livelock when array is both resyncing and degraded. · 26ac1073

由 NeilBrown 提交于 2月 18, 2015

Commit a7854487:
  md: When RAID5 is dirty, force reconstruct-write instead of read-modify-write.

Causes an RCW cycle to be forced even when the array is degraded.
A degraded array cannot support RCW as that requires reading all data
blocks, and one may be missing.

Forcing an RCW when it is not possible causes a live-lock and the code
spins, repeatedly deciding to do something that cannot succeed.

So change the condition to only force RCW on non-degraded arrays.
Reported-by: NManibalan P <pmanibalan@amiindia.co.in>
Bisected-by: NJes Sorensen <Jes.Sorensen@redhat.com>
Tested-by: NJes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NNeilBrown <neilb@suse.de>
Fixes: a7854487
Cc: stable@vger.kernel.org (v3.7+)

26ac1073

06 2月, 2015 2 次提交

md: make reconfig_mutex optional for writes to md sysfs files. · 6791875e

由 NeilBrown 提交于 12月 15, 2014

Rather than using mddev_lock() to take the reconfig_mutex
when writing to any md sysfs file, we only take mddev_lock()
in the particular _store() functions that require it.
Admittedly this is most, but it isn't all.

This also allows us to remove special-case handling for new_dev_store
(in md_attr_store).
Signed-off-by: NNeilBrown <neilb@suse.de>

6791875e

md/raid5: use ->lock to protect accessing raid5 sysfs attributes. · 7b1485ba

由 NeilBrown 提交于 12月 15, 2014

It is important that mddev->private isn't freed while
a sysfs attribute function is accessing it.

So use mddev->lock to protect the setting of ->private to NULL, and
take that lock when checking ->private for NULL and de-referencing it
in the sysfs access functions.

This only applies to the read ('show') side of access.  Write
access will be handled separately.
Signed-off-by: NNeilBrown <neilb@suse.de>

7b1485ba

04 2月, 2015 9 次提交

md: rename ->stop to ->free · afa0f557

由 NeilBrown 提交于 12月 15, 2014

Now that the ->stop function only frees the private data,
rename is accordingly.

Also pass in the private pointer as an arg rather than using
mddev->private.  This flexibility will be useful in level_store().

Finally, don't clear ->private.  It doesn't make sense to clear
it seeing that isn't what we free, and it is no longer necessary
to clear ->private (it was some time ago before  ->to_remove was
introduced).

Setting ->to_remove in ->free() is a bit of a wart, but not a
big problem at the moment.
Signed-off-by: NNeilBrown <neilb@suse.de>

afa0f557

md: split detach operation out from ->stop. · 5aa61f42

由 NeilBrown 提交于 12月 15, 2014

Each md personality has a 'stop' operation which does two
things:
 1/ it finalizes some aspects of the array to ensure nothing
    is accessing the ->private data
 2/ it frees the ->private data.

All the steps in '1' can apply to all arrays and so can be
performed in common code.

This is useful as in the case where we change the personality which
manages an array (in level_store()), it would be helpful to do
step 1 early, and step 2 later.

So split the 'step 1' functionality out into a new mddev_detach().
Signed-off-by: NNeilBrown <neilb@suse.de>

5aa61f42

md: make merge_bvec_fn more robust in face of personality changes. · 64590f45

由 NeilBrown 提交于 12月 15, 2014

There is no locking around calls to merge_bvec_fn(), so
it is possible that calls which coincide with a level (or personality)
change could go wrong.

So create a central dispatch point for these functions and use
rcu_read_lock().
If the array is suspended, reject any merge that can be rejected.
If not, we know it is safe to call the function.
Signed-off-by: NNeilBrown <neilb@suse.de>

64590f45

md: make ->congested robust against personality changes. · 5c675f83

由 NeilBrown 提交于 12月 15, 2014

There is currently no locking around calls to the 'congested'
bdi function.  If called at an awkward time while an array is
being converted from one level (or personality) to another, there
is a tiny chance of running code in an unreferenced module etc.

So add a 'congested' function to the md_personality operations
structure, and call it with appropriate locking from a central
'mddev_congested'.

When the array personality is changing the array will be 'suspended'
so no IO is processed.
If mddev_congested detects this, it simply reports that the
array is congested, which is a safe guess.
As mddev_suspend calls synchronize_rcu(), mddev_congested can
avoid races by included the whole call inside an rcu_read_lock()
region.
This require that the congested functions for all subordinate devices
can be run under rcu_lock.  Fortunately this is the case.
Signed-off-by: NNeilBrown <neilb@suse.de>

5c675f83

md/raid5: need_this_block: tidy/fix last condition. · ea664c82

由 NeilBrown 提交于 2月 02, 2015

That last condition is unclear and over cautious.

There are two related issues here.

If a partial write is destined for a missing device, then
either RMW or RCW can work.  We must read all the available
block.  Only then can the missing blocks be calculated, and
then the parity update performed.

If RMW is not an option, then there is a complication even
without partial writes.  If we would need to read a missing
device to perform the reconstruction, then we must first read every
block so the missing device data can be computed.
This is the case for RAID6 (Which currently does not support
RMW) and for times when we don't trust the parity (after a crash)
and so are in the process of resyncing it.

So make these two cases more clear and separate, and perform
the relevant tests more  thoroughly.
Signed-off-by: NNeilBrown <neilb@suse.de>

ea664c82

md/raid5: need_this_block: start simplifying the last two conditions. · a9d56950

由 NeilBrown 提交于 2月 02, 2015

Both the last two cases are only relevant if something has failed and
something needs to be written (but not over-written), and if it is OK
to pre-read blocks at this point.  So factor out those tests and
explain them.
Signed-off-by: NNeilBrown <neilb@suse.de>

a9d56950

md/raid5: separate out the easy conditions in need_this_block. · a79cfe12

由 NeilBrown 提交于 2月 02, 2015

Some of the conditions in need_this_block have very straight
forward motivation.  Separate those out and document them.
Signed-off-by: NNeilBrown <neilb@suse.de>

a79cfe12

md/raid5: separate large if clause out of fetch_block(). · 2c58f06e

由 NeilBrown 提交于 2月 02, 2015

fetch_block() has a very large and hard to read 'if' condition.

Separate it into its own function so that it can be
made more readable.
Signed-off-by: NNeilBrown <neilb@suse.de>

2c58f06e

md: do_release_stripe(): No need to call md_wakeup_thread() twice · ad3ab8b6

由 Jes Sorensen 提交于 1月 29, 2015

67f45548 introduced a call to
md_wakeup_thread() when adding to the delayed_list. However the md
thread is woken up unconditionally just below.

Remove the unnecessary wakeup call.
Signed-off-by: NJes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

ad3ab8b6

02 2月, 2015 1 次提交

md/raid5: fix another livelock caused by non-aligned writes. · b1b02fe9

由 NeilBrown 提交于 2月 02, 2015

If a non-page-aligned write is destined for a device which
is missing/faulty, we can deadlock.

As the target device is missing, a read-modify-write cycle
is not possible.
As the write is not for a full-page, a recontruct-write cycle
is not possible.

This should be handled by logic in fetch_block() which notices
there is a non-R5_OVERWRITE write to a missing device, and so
loads all blocks.

However since commit 67f45548, that code requires
STRIPE_PREREAD_ACTIVE before it will active, and those circumstances
never set STRIPE_PREREAD_ACTIVE.

So: in handle_stripe_dirtying, if neither rmw or rcw was possible,
set STRIPE_DELAYED, which will cause STRIPE_PREREAD_ACTIVE be set
after a suitable delay.

Fixes: 67f45548
Cc: stable@vger.kernel.org (v3.16+)
Reported-by: NMikulas Patocka <mpatocka@redhat.com>
Tested-by: NHeinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

b1b02fe9

03 12月, 2014 1 次提交

md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants. · 108cef3a

由 NeilBrown 提交于 12月 03, 2014

It is critical that fetch_block() and handle_stripe_dirtying()
are consistent in their analysis of what needs to be loaded.
Otherwise raid5 can wait forever for a block that won't be loaded.

Currently when writing to a RAID5 that is resyncing, to a location
beyond the resync offset, handle_stripe_dirtying chooses a
reconstruct-write cycle, but fetch_block() assumes a
read-modify-write, and a lockup can happen.

So treat that case just like RAID6, just as we do in
handle_stripe_dirtying.  RAID6 always does reconstruct-write.

This bug was introduced when the behaviour of handle_stripe_dirtying
was changed in 3.7, so the patch is suitable for any kernel since,
though it will need careful merging for some versions.

Cc: stable@vger.kernel.org (v3.7+)
Fixes: a7854487Reported-by: NHenry Cai <henryplusplus@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

108cef3a

14 10月, 2014 2 次提交

N
md: remove unwanted white space from md.c · f72ffdd6
由 NeilBrown 提交于 9月 30, 2014
```
My editor shows much of this is RED.
Signed-off-by: NNeilBrown <neilb@suse.de>
```
f72ffdd6

md/raid5: fix init_stripe() inconsistencies · b8e6a15a

由 Markus Stockhausen 提交于 8月 23, 2014

raid5: fix init_stripe() inconsistencies

1) remove_hash() is not necessary. We will only be called right after
   get_free_stripe(). There we have already a call to remove_hash().

2) Tracing prints out the sector of the freed stripe and not the sector
   that we want to initialize.
Signed-off-by: NNeilBrown <neilb@suse.de>

b8e6a15a

09 10月, 2014 1 次提交

md: use set_bit/clear_bit instead of shift/mask for bi_flags changes. · 3fd83717

由 NeilBrown 提交于 8月 23, 2014

Using {set,clear}_bit is more consistent than shifting and masking.

No functional change.
Signed-off-by: NNeilBrown <neilb@suse.de>

3fd83717

02 10月, 2014 1 次提交

md/raid5: disable 'DISCARD' by default due to safety concerns. · 8e0e99ba

由 NeilBrown 提交于 10月 02, 2014

It has come to my attention (thanks Martin) that 'discard_zeroes_data'
is only a hint.  Some devices in some cases don't do what it
says on the label.

The use of DISCARD in RAID5 depends on reads from discarded regions
being predictably zero.  If a write to a previously discarded region
performs a read-modify-write cycle it assumes that the parity block
was consistent with the data blocks.  If all were zero, this would
be the case.  If some are and some aren't this would not be the case.
This could lead to data corruption after a device failure when
data needs to be reconstructed from the parity.

As we cannot trust 'discard_zeroes_data', ignore it by default
and so disallow DISCARD on all raid4/5/6 arrays.

As many devices are trustworthy, and as there are benefits to using
DISCARD, add a module parameter to over-ride this caution and cause
DISCARD to work if discard_zeroes_data is set.

If a site want to enable DISCARD on some arrays but not on others they
should select DISCARD support at the filesystem level, and set the
raid456 module parameter.
    raid456.devices_handle_discard_safely=Y

As this is a data-safety issue, I believe this patch is suitable for
-stable.
DISCARD support for RAID456 was added in 3.7

Cc: Shaohua Li <shli@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Heinz Mauelshagen <heinzm@redhat.com>
Cc: stable@vger.kernel.org (3.7+)
Acked-by: NMartin K. Petersen <martin.petersen@oracle.com>
Acked-by: NMike Snitzer <snitzer@redhat.com>
Fixes: 620125f2Signed-off-by: NNeilBrown <neilb@suse.de>

8e0e99ba

18 8月, 2014 2 次提交

md/raid6: avoid data corruption during recovery of double-degraded RAID6 · 9c4bdf69

由 NeilBrown 提交于 8月 13, 2014

During recovery of a double-degraded RAID6 it is possible for
some blocks not to be recovered properly, leading to corruption.

If a write happens to one block in a stripe that would be written to a
missing device, and at the same time that stripe is recovering data
to the other missing device, then that recovered data may not be written.

This patch skips, in the double-degraded case, an optimisation that is
only safe for single-degraded arrays.

Bug was introduced in 2.6.32 and fix is suitable for any kernel since
then.  In an older kernel with separate handle_stripe5() and
handle_stripe6() functions the patch must change handle_stripe6().

Cc: stable@vger.kernel.org (2.6.32+)
Fixes: 6c0069c0
Cc: Yuri Tikhonov <yur@emcraft.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Reported-by: N"Manibalan P" <pmanibalan@amiindia.co.in>
Tested-by: N"Manibalan P" <pmanibalan@amiindia.co.in>
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423Signed-off-by: NNeilBrown <neilb@suse.de>
Acked-by: NDan Williams <dan.j.williams@intel.com>

9c4bdf69

md/raid5: avoid livelock caused by non-aligned writes. · a40687ff

由 NeilBrown 提交于 8月 13, 2014

If a stripe in a raid6 array received a write to each data block while
the array is degraded, and if any of these writes to a missing device
are not page-aligned, then a live-lock happens.

In this case the P and Q blocks need to be read so that the part of
the missing block which is *not* being updated by the write can be
constructed.  Due to a logic error, these blocks are not loaded, so
the update cannot proceed and the stripe is 'handled' repeatedly in an
infinite loop.

This bug is unlikely as most writes are page aligned.  However as it
can lead to a livelock it is suitable for -stable.  It was introduced
in 3.16.

Cc: stable@vger.kernel.org (v3.16)
Fixed: 67f45548Signed-off-by: NNeilBrown <neilb@suse.de>

a40687ff

10 6月, 2014 1 次提交

raid5: speedup sync_request processing · 053f5b65

由 Eivind Sarto 提交于 6月 09, 2014

The raid5 sync_request() processing calls handle_stripe() within the context of
the resync-thread.  The resync-thread issues the first set of read requests
and this adds execution latency and slows down the scheduling of the next
sync_request().
The current rebuild/resync speed of raid5 is not much faster than what
rotational HDDs can sustain.
Testing the following patch on a 6-drive array, I can increase the rebuild
speed from 100 MB/s to 175 MB/s.
The sync_request() now just sets STRIPE_HANDLE and releases the stripe.  This
creates some more parallelism between the resync-thread and raid5 kernel daemon.
Signed-off-by: NEivind Sarto <esarto@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

053f5b65

05 6月, 2014 1 次提交

md/raid5: deadlock between retry_aligned_read with barrier io · 2844dc32

由 hui jiao 提交于 6月 05, 2014

A chunk aligned read increases counter active_aligned_reads and
decreases it after sub-device handle it successfully. But when a read
error occurs,  the read redispatched by raid5d, and the
active_aligned_reads will not be decreased until we can grab a stripe
head in retry_aligned_read. Now suppose, a barrier io comes, set
conf->quiesce to 2, and wait until both active_stripes and
active_aligned_reads are zero. The retried chunk aligned read gets
stuck at get_active_stripe waiting until conf->quiesce becomes 0.
Retry_aligned_read and barrier io are waiting each other now.
One possible solution is that we ignore conf->quiesce, let the retried
aligned read finish. I reproduced this deadlock and test this patch on
centos6.0
Signed-off-by: NNeilBrown <neilb@suse.de>

2844dc32

29 5月, 2014 3 次提交

raid5: add an option to avoid copy data from bio to stripe cache · d592a996

由 Shaohua Li 提交于 5月 21, 2014

The stripe cache has two goals:
1. cache data, so next time if data can be found in stripe cache, disk access
can be avoided.
2. stable data. data is copied from bio to stripe cache and calculated parity.
data written to disk is from stripe cache, so if upper layer changes bio data,
data written to disk isn't impacted.

In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
can guarantee 2 too. For 1, it's not common too. block plug mechanism will
dispatch a bunch of sequentail small requests together. And since I'm using
SSD, I'm using small chunk size. It's rare case stripe cache is really useful.

So I'd like to avoid the copy from bio to stripe cache and it's very helpful
for performance. In my 1M randwrite tests, avoid the copy can increase the
performance more than 30%.

Of course, this shouldn't be enabled by default. It's reported enabling
BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
control it.

Neilb:
  changed BUG_ON to WARN_ON
  Removed some assignments from raid5_build_block which are now not needed.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

d592a996

raid5: avoid release list until last reference of the stripe · cf170f3f

由 Eivind Sarto 提交于 5月 28, 2014

The (lockless) release_list reduces lock contention, but there is excessive
queueing and dequeuing of stripes on this list. A stripe will currently be
queued on the release_list with a stripe reference count > 1. This can cause
the raid5 kernel thread(s) to dequeue the stripe and decrement the refcount
without doing any other useful processing of the stripe. The are two cases
when the stripe can be put on the release_list multiple times before it is
actually handled by the kernel thread(s).
1) make_request() activates the stripe processing in 4k increments. When a
write request is large enough to span multiple chunks of a stripe_head, the
first 4k chunk adds the stripe to the plug list. The next 4k chunk that is
processed for the same stripe puts the stripe on the release_list with a
refcount=2. This can cause the kernel thread to process and decrement the
stripe before the stripe us unplugged, which again will put it back on the
release_list.
2) Whenever IO is scheduled on a stripe (pre-read and/or write), the stripe
refcount is set to the number of active IO (for each chunk). The stripe is
released as each IO complete, and can be queued and dequeued multiple times
on the release_list, until its refcount finally reached zero.

This simple patch will ensure a stripe is only queued on the release_list when
its refcount=1 and is ready to be handled by the kernel thread(s). I added some
instrumentation to raid5 and counted the number of times striped were queued on
the release_list for a variety of write IO sizes. Without this patch the number
of times stripes got queued on the release_list was 100-500% higher than with
the patch. The excess queuing will increase with the IO size. The patch also
improved throughput by 5-10%.
Signed-off-by: NEivind Sarto <esarto@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

cf170f3f

md/raid56: Don't perform reads to support writes until stripe is ready. · 67f45548

由 NeilBrown 提交于 5月 28, 2014

If it is found that we need to pre-read some blocks before a write
can succeed, we normally set STRIPE_DELAYED and don't actually perform
the read until STRIPE_PREREAD_ACTIVE subsequently gets set.

However for a degraded RAID6 we currently perform the reads as soon
as we see that a write is pending.  This significantly hurts
throughput.

So:
 - when handle_stripe_dirtying find a block that it wants on a device
   that is failed, set STRIPE_DELAY, instead of doing nothing, and
 - when fetch_block detects that a read might be required to satisfy a
   write, only perform the read if STRIPE_PREREAD_ACTIVE is set,
   and if we would actually need to read something to complete the write.

This also helps RAID5, though less often as RAID5 supports a
read-modify-write cycle.  For RAID5 the read is performed too early
only if the write is not a full 4K aligned write (i.e. no an
R5_OVERWRITE).

Also clean up a couple of horrible bits of formatting.
Reported-by: NPatrik Horník <patrik@dsl.sk>
Signed-off-by: NNeilBrown <neilb@suse.de>

67f45548

18 4月, 2014 1 次提交

arch: Mass conversion of smp_mb__*() · 4e857c58

由 Peter Zijlstra 提交于 3月 17, 2014

Mostly scripted conversion of the smp_mb__* barriers.
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arch@vger.kernel.org
Signed-off-by: NIngo Molnar <mingo@kernel.org>

4e857c58

17 4月, 2014 1 次提交

raid5: fix a race of stripe count check · c7a6d35e

由 Shaohua Li 提交于 4月 15, 2014

I hit another BUG_ON with e240c183. In __get_priority_stripe(),
stripe count equals to 0 initially. Between atomic_inc and BUG_ON,
get_active_stripe() finds the stripe. So the stripe count isn't 1 any more.

V2: keeps the BUG_ON suggested by Neil.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

c7a6d35e

09 4月, 2014 2 次提交

raid5: get_active_stripe avoids device_lock · e240c183

由 Shaohua Li 提交于 4月 09, 2014

For sequential workload (or request size big workload), get_active_stripe can
find cached stripe. In this case, we always hold device_lock, which exposes a
lot of lock contention for such workload. If stripe count isn't 0, we don't
need hold the lock actually, since we just increase its count. And this is the
hot code path for such workload. Unfortunately we must delete the BUG_ON.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

e240c183

raid5: make_request does less prepare wait · 27c0f68f

由 Shaohua Li 提交于 4月 09, 2014

In NUMA machine, prepare_to_wait/finish_wait in make_request exposes a
lot of contention for sequential workload (or big request size
workload). For such workload, each bio includes several stripes. So we
can just do prepare_to_wait/finish_wait once for the whold bio instead
of every stripe.  This reduces the lock contention completely for such
workload. Random workload might have the similar lock contention too,
but I didn't see it yet, maybe because my stroage is still not fast
enough.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

27c0f68f

13 2月, 2014 1 次提交

md/raid5: Fix CPU hotplug callback registration · 789b5e03

由 Oleg Nesterov 提交于 2月 06, 2014

Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Interestingly, the raid5 code can actually prevent double initialization and
hence can use the following simplified form of callback registration:

	register_cpu_notifier(&foobar_cpu_notifier);

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	put_online_cpus();

A hotplug operation that occurs between registering the notifier and calling
get_online_cpus(), won't disrupt anything, because the code takes care to
perform the memory allocations only once.

So reorganize the code in raid5 this way to fix the deadlock with callback
registration.

Cc: linux-raid@vger.kernel.org
Cc: stable@vger.kernel.org (v2.6.32+)
Fixes: 36d1c647Signed-off-by: NOleg Nesterov <oleg@redhat.com>
[Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
free_scratch_buffer() helper to condense code further and wrote the changelog.]
Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

789b5e03

22 1月, 2014 1 次提交

md/raid5: close recently introduced race in stripe_head management. · 7da9d450

由 NeilBrown 提交于 1月 22, 2014

As release_stripe and __release_stripe decrement ->count and then
manipulate ->lru both under ->device_lock, it is important that
get_active_stripe() increments ->count and clears ->lru also under
->device_lock.

However we currently list_del_init ->lru under the lock, but increment
the ->count outside the lock.  This can lead to races and list
corruption.

So move the atomic_inc(&sh->count) up inside the ->device_lock
protected region.

Note that we still increment ->count without device lock in the case
where get_free_stripe() was called, and in fact don't take
->device_lock at all in that path.
This is safe because if the stripe_head can be found by
get_free_stripe, then the hash lock assures us the no-one else could
possibly be calling release_stripe() at the same time.

Fixes: 566c09c5
Cc: stable@vger.kernel.org (3.13)
Reported-and-tested-by: NIan Kumlien <ian.kumlien@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

7da9d450

16 1月, 2014 1 次提交

md/raid5: fix long-standing problem with bitmap handling on write failure. · 9f97e4b1

由 NeilBrown 提交于 1月 16, 2014

Before a write starts we set a bit in the write-intent bitmap.
When the write completes we clear that bit if the write was successful
to all devices.  However if the write wasn't fully successful we
should not clear the bit.  If the faulty drive is subsequently
re-added, the fact that the bit is still set ensure that we will
re-write the data that is missing.

This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
bitmap bit when this flag is not set.
Currently we correctly set the flag if a write starts when some
devices are failed or missing.  But we do *not* set the flag if some
device failed during the write attempt.
This is wrong and can result in clearing the bit inappropriately.

So: set the flag when a write fails.

This bug has been present since bitmaps were introduces, so the fix is
suitable for any -stable kernel.
Reported-by: NEthan Wilson <ethan.wilson@shiftmail.org>
Cc: stable@vger.kernel.org
Signed-off-by: NNeilBrown <neilb@suse.de>

9f97e4b1

14 1月, 2014 2 次提交

md/raid5: fix a recently broken BUG_ON(). · 5af9bef7

由 NeilBrown 提交于 1月 14, 2014

commit 6d183de4
    md/raid5: fix newly-broken locking in get_active_stripe.

simplified a BUG_ON, but removed too much so now it sometimes fires
when it shouldn't.

When the STRIPE_EXPANDING flag is set, the stripe_head might be on a
special list while multiple stripe_heads are collected, or it might
not be on any list, even a 'free' list when the refcount is zero.  As
long as STRIPE_EXPANDING is set, it will be found and added back to a
list eventually.

So both of the BUG_ONs which test for the ->lru being empty or not
need to avoid the case where STRIPE_EXPANDING is set.

The patch which broke this was marked for -stable, so this patch needs
to be applied to any branch that received 6d183de4

Fixes: 6d183de4
Cc: stable@vger.kernel.org (any release to which above was applied)
Signed-off-by: NNeilBrown <neilb@suse.de>

5af9bef7

md/raid5: Fix possible confusion when multiple write errors occur. · 1cc03eb9

由 NeilBrown 提交于 1月 06, 2014

commit 5d8c71f9
    md: raid5 crash during degradation

Fixed a crash in an overly simplistic way which could leave
R5_WriteError or R5_MadeGood set in the stripe cache for devices
for which it is no longer relevant.
When those devices are removed and spares added the flags are still
set and can cause incorrect behaviour.

commit 14a75d3e
    md/raid5: preferentially read from replacement device if possible.

Fixed the same bug if a more effective way, so we can now revert
the original commit.
Reported-and-tested-by: NAlexander Lyakas <alex.bolshoy@gmail.com>
Cc: stable@vger.kernel.org (3.2+ - 3.2 will need a different fix though)
Fixes: 5d8c71f9Signed-off-by: NNeilBrown <neilb@suse.de>

1cc03eb9

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功