提交 · 34a6f80e1639b124f24b5fadc1d45d69417cbace · openeuler / Kernel

01 9月, 2015 8 次提交

md/raid5: use bio_list for the list of bios to return. · 34a6f80e

由 NeilBrown 提交于 8月 14, 2015

This will make it easier to splice two lists together which will
be needed in future patch.
Signed-off-by: NNeilBrown <neilb@suse.com>

34a6f80e

md/raid5: handle possible race as reshape completes. · 6cbd8148

由 NeilBrown 提交于 7月 24, 2015

It is possible (though unlikely) for a reshape to be
interrupted between the time that end_reshape is called
and the time when raid5_finish_reshape is called.

This can leave conf->reshape_progress set to MaxSector,
but mddev->reshape_position not.

This combination confused reshape_request() when ->reshape_backwards.
As conf->reshape_progress is so high, it seems the reshape hasn't
really begun.  But assuming MaxSector is a valid address only
leads to sorrow.

So ensure reshape_position and reshape_progress both agree,
and add an extra check in reshape_request() just in case they don't.
Signed-off-by: NNeilBrown <neilb@suse.com>

6cbd8148

md: be careful when testing resync_max against curr_resync_completed. · c5e19d90

由 NeilBrown 提交于 7月 17, 2015

While it generally shouldn't happen, it is not impossible for
curr_resync_completed to exceed resync_max.
This can particularly happen when reshaping RAID5 - the current
status isn't copied to curr_resync_completed promptly, so when it
is, it can exceed resync_max.
This happens when the reshape is 'frozen', resync_max is set low,
and reshape is re-enabled.

Taking a difference between two unsigned numbers is always dangerous
anyway, so add a test to behave correctly if
   curr_resync_completed > resync_max
Signed-off-by: NNeilBrown <neilb@suse.com>

c5e19d90

md/raid5: remove incorrect "min_t()" when calculating writepos. · c74c0d76

由 NeilBrown 提交于 7月 15, 2015

This code is calculating:
  writepos, which is the furthest along address (device-space) that we
     *will* be writing to
  readpos, which is the earliest address that we *could* possible read
     from, and
  safepos, which is the earliest address in the 'old' section that we
     might read from after a crash when the reshape position is
     recovered from metadata.

  The first is a precise calculation, so clipping at zero doesn't
  make sense.  As the reshape position is now guaranteed to always be
  a multiple of reshape_sectors and as we already BUG_ON when
  reshape_progress is zero, there is no point in this min_t() call.

  The readpos and safepos are worst case - actual value depends on
  precise geometry.  That worst case could be negative, which is only
  a problem because we are storing the value in an unsigned.
  So leave the min_t() for those.
Signed-off-by: NNeilBrown <neilb@suse.com>

c74c0d76

md/raid5: strengthen check on reshape_position at run. · 05256d98

由 NeilBrown 提交于 7月 15, 2015

When reshaping, we work in units of the largest chunk size.
If changing from a larger to a smaller chunk size, that means we
reshape more than one stripe at a time.  So the required alignment
of reshape_position needs to take into account both the old
and new chunk size.

This means that both 'here_new' and 'here_old' are calculated with
respect to the same (maximum) chunk size, so testing if they are the
same when delta_disks is zero becomes pointless.
Signed-off-by: NNeilBrown <neilb@suse.com>

05256d98

md/raid5: switch to use conf->chunk_sectors in place of mddev->chunk_sectors where possible · 3cb5edf4

由 NeilBrown 提交于 7月 15, 2015

The chunk_sectors and new_chunk_sectors fields of mddev can be changed
any time (via sysfs) that the reconfig mutex can be taken.  So raid5
keeps internal copies in 'conf' which are stable except for a short
locked moment when reshape stops/starts.

So any access that does not hold reconfig_mutex should use the 'conf'
values, not the 'mddev' values.
Several don't.

This could result in corruption if new values were written at awkward
times.

Also use min() or max() rather than open-coding.
Signed-off-by: NNeilBrown <neilb@suse.com>

3cb5edf4

md/raid5: always set conf->prev_chunk_sectors and ->prev_algo · 5cac6bcb

由 NeilBrown 提交于 7月 17, 2015

These aren't really needed when no reshape is happening,
but it is safer to have them always set to a meaningful value.
The next patch will use ->prev_chunk_sectors without checking
if a reshape is happening (because that makes the code simpler),
and this patch makes that safe.
Signed-off-by: NNeilBrown <neilb@suse.com>

5cac6bcb

md/raid5: consider updating reshape_position at start of reshape. · 92140480

由 NeilBrown 提交于 7月 06, 2015

md/raid5 only updates ->reshape_position (which is stored in
metadata and is authoritative) occasionally, but particularly
when getting closed to ->resync_max as it must be correct
when ->resync_max is reached.

When mdadm tries to stop an array which is reshaping it will:
 - freeze the reshape,
 - set resync_max to where the reshape has reached.
 - unfreeze the reshape.
When this happens, the reshape is aborted and then restarted.

The restart doesn't check that resync_max is close, and so doesn't
update ->reshape_position like it should.
This results in the reshape stopping, but ->reshape_position being
incorrect.

So on that first call to reshape_request, make sure ->reshape_position
is updated if needed.
Signed-off-by: NNeilBrown <neilb@suse.com>

92140480

03 8月, 2015 1 次提交

md/raid5: don't let shrink_slab shrink too far. · 49895bcc

由 NeilBrown 提交于 8月 03, 2015

I have a report of drop_one_stripe() called from
raid5_cache_scan() apparently finding ->max_nr_stripes == 0.

This should not be allowed.

So add a test to keep max_nr_stripes above min_nr_stripes.

Also use a 'mask' rather than a 'mod' in drop_one_stripe
to ensure 'hash' is valid even if max_nr_stripes does reach zero.


Fixes: edbe83ab ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org (4.1 - please release with 2d5b569b)
Reported-by: NTomas Papan <tomas.papan@gmail.com>
Signed-off-by: NNeilBrown <neilb@suse.com>

49895bcc

24 7月, 2015 1 次提交

md/raid5: clear R5_NeedReplace when no longer needed. · e6030cb0

由 NeilBrown 提交于 7月 17, 2015

This flag is currently never cleared, which can in rare cases
trigger a warn-on if it is still set but the block isn't
InSync.

So clear it when it isn't need, which includes if the replacement
device has failed.
Signed-off-by: NNeilBrown <neilb@suse.com>

e6030cb0

22 7月, 2015 1 次提交

md/raid5: avoid races when changing cache size. · 2d5b569b

由 NeilBrown 提交于 7月 06, 2015

Cache size can grow or shrink due to various pressures at
any time.  So when we resize the cache as part of a 'grow'
operation (i.e. change the size to allow more devices) we need
to blocks that automatic growing/shrinking.

So introduce a mutex.  auto grow/shrink uses mutex_trylock()
and just doesn't bother if there is a blockage.
Resizing the whole cache holds the mutex to ensure that
the correct number of new stripes is allocated.

This bug can result in some stripes not being freed when an
array is stopped.  This leads to the kmem_cache not being
freed and a subsequent array can try to use the same kmem_cache
and get confused.

Fixes: edbe83ab ("md/raid5: allow the stripe_cache to grow and shrink.")
Cc: stable@vger.kernel.org (4.1 - please delay until 2 weeks after release of 4.2)
Signed-off-by: NNeilBrown <neilb@suse.com>

2d5b569b

17 6月, 2015 3 次提交

md/raid5: ignore released_stripes check · 713bc5c2

由 Shaohua Li 提交于 5月 28, 2015

conf->released_stripes list isn't always related to where there are
free stripes pending. Active stripes can be in the list too.
And even free stripes were active very recently.
Signed-off-by: NShaohua Li <shli@fb.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

713bc5c2

md/raid5: per hash value and exclusive wait_for_stripe · e9e4c377

由 Yuanhan Liu 提交于 5月 08, 2015

I noticed heavy spin lock contention at get_active_stripe() with fsmark
multiple thread write workloads.

Here is how this hot contention comes from. We have limited stripes, and
it's a multiple thread write workload. Hence, those stripes will be taken
soon, which puts later processes to sleep for waiting free stripes. When
enough stripes(>= 1/4 total stripes) are released, all process are woken,
trying to get the lock. But there is one only being able to get this lock
for each hash lock, making other processes spinning out there for acquiring
the lock.

Thus, it's effectiveless to wakeup all processes and let them battle for
a lock that permits one to access only each time. Instead, we could make
it be a exclusive wake up: wake up one process only. That avoids the heavy
spin lock contention naturally.

To do the exclusive wake up, we've to split wait_for_stripe into multiple
wait queues, to make it per hash value, just like the hash lock.

Here are some test results I have got with this patch applied(all test run
3 times):

`fsmark.files_per_sec'
=====================

next-20150317                 this patch
-------------------------     -------------------------
metric_value     ±stddev      metric_value     ±stddev     change      testbox/benchmark/testcase-params
-------------------------     -------------------------   --------     ------------------------------
      25.600     ±0.0              92.700     ±2.5          262.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
      25.600     ±0.0              77.800     ±0.6          203.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
      32.000     ±0.0              93.800     ±1.7          193.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
      32.000     ±0.0              81.233     ±1.7          153.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
      48.800     ±14.5             99.667     ±2.0          104.2%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
       6.400     ±0.0              12.800     ±0.0          100.0%     ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
      63.133     ±8.2              82.800     ±0.7           31.2%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
     245.067     ±0.7             306.567     ±7.9           25.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-f2fs-4M-30G-fsyncBeforeClose
      17.533     ±0.3              21.000     ±0.8           19.8%     ivb44/fsmark/1x-1t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose
     188.167     ±1.9             215.033     ±3.1           14.3%     ivb44/fsmark/1x-1t-4BRD_12G-RAID5-btrfs-4M-30G-NoSync
     254.500     ±1.8             290.733     ±2.4           14.2%     ivb44/fsmark/1x-1t-9BRD_6G-RAID5-btrfs-4M-30G-NoSync

`time.system_time'
=====================

next-20150317                 this patch
-------------------------    -------------------------
metric_value     ±stddev     metric_value     ±stddev     change       testbox/benchmark/testcase-params
-------------------------    -------------------------    --------     ------------------------------
    7235.603     ±1.2             185.163     ±1.9          -97.4%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
    7666.883     ±2.9             202.750     ±1.0          -97.4%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
   14567.893     ±0.7             421.230     ±0.4          -97.1%     ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
    3697.667     ±14.0            148.190     ±1.7          -96.0%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
    5572.867     ±3.8             310.717     ±1.4          -94.4%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
    5565.050     ±0.5             313.277     ±1.5          -94.4%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
    2420.707     ±17.1            171.043     ±2.7          -92.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
    3743.300     ±4.6             379.827     ±3.5          -89.9%     ivb44/fsmark/1x-64t-3HDD-RAID5-ext4-4M-40G-fsyncBeforeClose
    3308.687     ±6.3             363.050     ±2.0          -89.0%     ivb44/fsmark/1x-64t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose

Where,

     1x: where 'x' means iterations or loop, corresponding to the 'L' option of fsmark

     1t, 64t: where 't' means thread

     4M: means the single file size, corresponding to the '-s' option of fsmark
     40G, 30G, 120G: means the total test size

     4BRD_12G: BRD is the ramdisk, where '4' means 4 ramdisk, and where '12G' means
               the size of one ramdisk. So, it would be 48G in total. And we made a
               raid on those ramdisk

As you can see, though there are no much performance gain for hard disk
workload, the system time is dropped heavily, up to 97%. And as expected,
the performance increased a lot, up to 260%, for fast device(ram disk).

v2: use bits instead of array to note down wait queue need to wake up.
Signed-off-by: NYuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

e9e4c377

md/raid5: split wait_for_stripe and introduce wait_for_quiescent · b1b46486

由 Yuanhan Liu 提交于 5月 08, 2015

I noticed heavy spin lock contention at get_active_stripe(), introduced
at being wake up stage, where a bunch of processes try to re-hold the
spin lock again.

After giving some thoughts on this issue, I found the lock could be
relieved(and even avoided) if we turn the wait_for_stripe to per
waitqueue for each lock hash and make the wake up exclusive: wake up
one process each time, which avoids the lock contention naturally.

Before go hacking with wait_for_stripe, I found it actually has 2
usages: for the array to enter or leave the quiescent state, and also
to wait for an available stripe in each of the hash lists.

So this patch splits the first usage off into a separate wait_queue,
wait_for_quiescent, and the next patch will turn the second usage into
one waitqueue for each hash value, and make it exclusive, to relieve
the lock contention.

v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
    Commit log refactor suggestion from Neil.
Signed-off-by: NYuanhan Liu <yuanhan.liu@linux.intel.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

b1b46486

12 6月, 2015 1 次提交

md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync · ea358cd0

由 NeilBrown 提交于 6月 12, 2015

MD_RECOVERY_DONE is normally cleared by md_check_recovery after a
resync etc finished.  However it is possible for raid5_start_reshape
to race and start a reshape before MD_RECOVERY_DONE is cleared.  This
can lean to multiple reshapes running at the same time, which isn't
good.

To make sure it is cleared before starting a reshape, and also clear
it when reaping a thread, just to be safe.
Signed-off-by: NNeilBrown  <neilb@suse.de>

ea358cd0

28 5月, 2015 9 次提交

md/raid5: break stripe-batches when the array has failed. · 626f2092

由 NeilBrown 提交于 5月 22, 2015

Once the array has too much failure, we need to break
stripe-batches up so they can all be dealt with.
Signed-off-by: NNeilBrown <neilb@suse.de>

626f2092

md/raid5: call break_stripe_batch_list from handle_stripe_clean_event · 787b76fa

由 NeilBrown 提交于 5月 21, 2015

Now that the code in break_stripe_batch_list() is nearly identical
to the end of handle_stripe_clean_event, replace the later
with a function call.

The only remaining difference of any interest is the masking that is
applieds to dev[i].flags copied from head_sh.
R5_WriteError certainly isn't wanted as it is set per-stripe, not
per-patch.  R5_Overlap isn't wanted as it is explicitly handled.
Signed-off-by: NNeilBrown <neilb@suse.de>

787b76fa

md/raid5: be more selective about distributing flags across batch. · 1b956f7a

由 NeilBrown 提交于 5月 21, 2015

When a batch of stripes is broken up, we keep some of the flags
that were per-stripe, and copy other flags from the head to all
others.

This only happens while a stripe is being handled, so many of the
flags are irrelevant.

The "SYNC_FLAGS" (which I've renamed to make it clear there are
several) and STRIPE_DEGRADED are set per-stripe and so need to be
preserved.  STRIPE_INSYNC is the only flag that is set on the head
that needs to be propagated to all others.

For safety, add a WARN_ON if others are set, except:
 STRIPE_HANDLE - this is safe and per-stripe and we are going to set
      in several cases anyway
 STRIPE_INSYNC
 STRIPE_IO_STARTED - this is just a hint and doesn't hurt.
 STRIPE_ON_PLUG_LIST
 STRIPE_ON_RELEASE_LIST - It is a point pointless for a batched
           stripe to be on one of these lists, but it can happen
           as can be safely ignored.
Signed-off-by: NNeilBrown <neilb@suse.de>

1b956f7a

md/raid5: add handle_flags arg to break_stripe_batch_list. · 3960ce79

由 NeilBrown 提交于 5月 21, 2015

When we break a stripe_batch_list we sometimes want to set
STRIPE_HANDLE on the individual stripes, and sometimes not.

So pass a 'handle_flags' arg.  If it is zero, always set STRIPE_HANDLE
(on non-head stripes).  If not zero, only set it if any of the given
flags are present.
Signed-off-by: NNeilBrown <neilb@suse.de>

3960ce79

md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list · fb642b92

由 NeilBrown 提交于 5月 21, 2015

break_stripe_batch list didn't clear head_sh->batch_head.
This was probably a bug.

Also clear all R5_Overlap flags and if any were cleared, wake up
'wait_for_overlap'.
This isn't always necessary but the worst effect is a little
extra checking for code that is waiting on wait_for_overlap.

Also, don't use wake_up_nr() because that does the wrong thing
if 'nr' is zero, and it number of flags cleared doesn't
strongly correlate with the number of threads to wake.
Signed-off-by: NNeilBrown <neilb@suse.de>

fb642b92

md/raid5: remove condition test from check_break_stripe_batch_list. · 4e3d62ff

由 NeilBrown 提交于 5月 21, 2015

handle_stripe_clean_event() contains a chunk of code very
similar to check_break_stripe_batch_list().
If we make the latter more like the former, we can end up
with just one copy of this code.

This  first step removed the condition (and the 'check_') part
of the name.  This has the added advantage of making it clear
what check is being performed at the point where the function is
called.
Signed-off-by: NNeilBrown <neilb@suse.de>

4e3d62ff

md/raid5: Ensure a batch member is not handled prematurely. · b15a9dbd

由 NeilBrown 提交于 5月 22, 2015

If a stripe is a member of a batch, but not the head, it must
not be handled separately from the rest of the batch.

'clear_batch_ready()' handles this requirement to some
extent but not completely.  If a member is passed to handle_stripe()
a second time it returns '0' indicating the stripe can be handled,
which is wrong.
So add an extra test.
Signed-off-by: NNeilBrown <neilb@suse.de>

b15a9dbd

md/raid5: close race between STRIPE_BIT_DELAY and batching. · d0852df5

由 NeilBrown 提交于 5月 27, 2015

When we add a write to a stripe we need to make sure the bitmap
bit is set.  While doing that the stripe is not locked so it could
be added to a batch after which further changes to STRIPE_BIT_DELAY
and ->bm_seq are ineffective.

So we need to hold off adding to a stripe until bitmap_startwrite has
completed at least once, and we need to avoid further changes to
STRIPE_BIT_DELAY once the stripe has been added to a batch.

If a bitmap_startwrite() completes after the stripe was added to a
batch, it will not have set the bit, only incremented a counter, so no
extra delay of the stripe is needed.
Reported-by: NShaohua Li <shli@kernel.org>
Signed-off-by: NNeilBrown <neilb@suse.de>

d0852df5

md/raid5: ensure whole batch is delayed for all required bitmap updates. · 2b6b2457

由 NeilBrown 提交于 5月 21, 2015

When we add a stripe to a batch, we need to be sure that
head stripe will wait for the bitmap update required for the new
stripe.
Signed-off-by: NNeilBrown <neilb@suse.de>

2b6b2457

21 5月, 2015 1 次提交

raid5: fix broken async operation chain · 48769695

由 Shaohua Li 提交于 5月 13, 2015

ops_run_reconstruct6() doesn't correctly chain asyn operations. The tx returned
by async_gen_syndrome should be added as the dependent tx of next stripe.

The issue is introduced by commit 59fc630b
    RAID5: batch adjacent full stripe write
Reported-and-tested-by: NMaxime Ripard <maxime.ripard@free-electrons.com>
Signed-off-by: NShaohua Li <shli@fb.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

48769695

08 5月, 2015 6 次提交

md/raid5: fix handling of degraded stripes in batches. · bb27051f

由 NeilBrown 提交于 5月 08, 2015

There is no need for special handling of stripe-batches when the array
is degraded.

There may be if there is a failure in the batch, but STRIPE_DEGRADED
does not imply an error.

So don't set STRIPE_BATCH_ERR in ops_run_io just because the array is
degraded.
This actually causes a bug: the STRIPE_DEGRADED flag gets cleared in
check_break_stripe_batch_list() and so the bitmap bit gets cleared
when it shouldn't.

So in check_break_stripe_batch_list(), split the batch up completely -
again STRIPE_DEGRADED isn't meaningful.

Also don't set STRIPE_BATCH_ERR when there is a write error to a
replacement device.  This simply removes the replacement device and
requires no extra handling.
Signed-off-by: NNeilBrown <neilb@suse.de>

bb27051f

md/raid5: fix allocation of 'scribble' array. · 738a2738

由 NeilBrown 提交于 5月 08, 2015

As the new 'scribble' array is sized based on chunk size,
we need to make sure the size matches the largest of 'old'
and 'new' chunk sizes when the array is undergoing reshape.

We also potentially need to resize it even when not resizing
the stripe cache, as chunk size can change without changing
number of devices.

So move the 'resize' code into a separate function, and
consider old and new sizes when allocating.
Signed-off-by: NNeilBrown <neilb@suse.de>
Fixes: 46d5b785 ("raid5: use flex_array for scribble data")

738a2738

md/raid5: don't record new size if resize_stripes fails. · 6e9eac2d

由 NeilBrown 提交于 5月 08, 2015

If any memory allocation in resize_stripes fails we will return
-ENOMEM, but in some cases we update conf->pool_size anyway.

This means that if we try again, the allocations will be assumed
to be larger than they are, and badness results.

So only update pool_size if there is no error.

This bug was introduced in 2.6.17 and the patch is suitable for
-stable.

Fixes: ad01c9e3 ("[PATCH] md: Allow stripes to be expanded in preparation for expanding an array")
Cc: stable@vger.kernel.org (v2.6.17+)
Signed-off-by: NNeilBrown <neilb@suse.de>

6e9eac2d

md/raid5: avoid reading parity blocks for full-stripe write to degraded array · 10d82c5f

由 NeilBrown 提交于 5月 08, 2015

When performing a reconstruct write, we need to read all blocks
that are not being over-written .. except the parity (P and Q) blocks.

The code currently reads these (as they are not being over-written!)
unnecessarily.
Signed-off-by: NNeilBrown <neilb@suse.de>
Fixes: ea664c82 ("md/raid5: need_this_block: tidy/fix last condition.")

10d82c5f

md/raid5: more incorrect BUG_ON in handle_stripe_fill. · b0c783b3

由 NeilBrown 提交于 5月 08, 2015

It is not incorrect to call handle_stripe_fill() when
a batch of full-stripe writes is active.
It is, however, a BUG if fetch_block() then decides
it needs to actually fetch anything.

So move the 'BUG_ON' to where it belongs.
Signed-off-by: NNeilBrown  <neilb@suse.de>
Fixes: 59fc630b ("RAID5: batch adjacent full stripe write")

b0c783b3

md/raid5: new alloc_stripe() to allocate an initialize a stripe. · f18c1a35

由 NeilBrown 提交于 5月 08, 2015

The new batch_lock and batch_list fields are being initialized in
grow_one_stripe() but not in resize_stripes().  This causes a crash
on resize.

So separate the core initialization into a new function and call it
from both allocation sites.
Signed-off-by: NNeilBrown <neilb@suse.de>
Fixes: 59fc630b ("RAID5: batch adjacent full stripe write")

f18c1a35

22 4月, 2015 9 次提交

md/raid5: don't do chunk aligned read on degraded array. · 9ffc8f7c

由 Eric Mei 提交于 3月 18, 2015

When array is degraded, read data landed on failed drives will result in
reading rest of data in a stripe. So a single sequential read would
result in same data being read twice.

This patch is to avoid chunk aligned read for degraded array. The
downside is to involve stripe cache which means associated CPU overhead
and extra memory copy.

Test Results:
Following test are done on a enterprise storage node with Seagate 6T SAS
drives and Xeon E5-2648L CPU (10 cores, 1.9Ghz), 10 disks MD RAID6 8+2,
chunk size 128 KiB.

I use FIO, using direct-io with various bs size, enough queue depth,
tested sequential and 100% random read against 3 array config:
 1) optimal, as baseline;
 2) degraded;
 3) degraded with this patch.
Kernel version is 4.0-rc3.

Each individual test I only did once so there might be some variations,
but we just focus on big trend.

Sequential Read:
  bs=(KiB)  optimal(MiB/s)  degraded(MiB/s)  degraded-with-patch (MiB/s)
   1024       1608            656              995
    512       1624            710              956
    256       1635            728              980
    128       1636            771              983
     64       1612           1119             1000
     32       1580           1420             1004
     16       1368            688              986
      8        768            647              953
      4        411            413              850

Random Read:
  bs=(KiB)  optimal(IOPS)  degraded(IOPS)  degraded-with-patch (IOPS)
   1024        163            160              156
    512        274            273              272
    256        426            428              424
    128        576            592              591
     64        726            724              726
     32        849            848              837
     16        900            970              971
      8        927            940              929
      4        948            940              955

Some notes:
  * In sequential + optimal, as bs size getting smaller, the FIO thread
become CPU bound.
  * In sequential + degraded, there's big increase when bs is 64K and
32K, I don't have explanation.
  * In sequential + degraded-with-patch, the MD thread mostly become CPU
bound.

If you want to we can discuss specific data point in those data. But in
general it seems with this patch, we have more predictable and in most
cases significant better sequential read performance when array is
degraded, and almost no noticeable impact on random read.

Performance is a complicated thing, the patch works well for this
particular configuration, but may not be universal. For example I
imagine testing on all SSD array may have very different result. But I
personally think in most cases IO bandwidth is more scarce resource than
CPU.
Signed-off-by: NEric Mei <eric.mei@seagate.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

9ffc8f7c

md/raid5: allow the stripe_cache to grow and shrink. · edbe83ab

由 NeilBrown 提交于 2月 26, 2015

The default setting of 256 stripe_heads is probably
much too small for many configurations.  So it is best to make it
auto-configure.

Shrinking the cache under memory pressure is easy.  The only
interesting part here is that we put a fairly high cost
('seeks') on shrinking the cache as the cost is greater than
just having to read more data, it reduces parallelism.

Growing the cache on demand needs to be done carefully.  If we allow
fast growth, that can upset memory balance as lots of dirty memory can
quickly turn into lots of memory queued in the stripe_cache.
It is important for the raid5 block device to appear congested to
allow write-throttling to work.

So we only add stripes slowly. We set a flag when an allocation
fails because all stripes are in use, allocate at a convenient
time when that flag is set, and don't allow it to be set again
until at least one stripe_head has been released for re-use.

This means that a spurt of requests will only cause one stripe_head
to be allocated, but a steady stream of requests will slowly
increase the cache size - until memory pressure puts it back again.

It could take hours to reach a steady state.

The value written to, and displayed in, stripe_cache_size is
used as a minimum.  The cache can grow above this and shrink back
down to it.  The actual size is not directly visible, though it can
be deduced to some extent by watching stripe_cache_active.
Signed-off-by: NNeilBrown <neilb@suse.de>

edbe83ab

N
md/raid5: change ->inactive_blocked to a bit-flag. · 5423399a
由 NeilBrown 提交于 2月 26, 2015
```
This allows us to easily add more (atomic) flags.
Signed-off-by: NNeilBrown <neilb@suse.de>
```
5423399a

md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe · 486f0644

由 NeilBrown 提交于 2月 25, 2015

Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe()
succeeds, do it inside the functions.

Also choose the correct hash to handle next inside the functions.

This removes duplication and will help with future new uses of
{grow,drop}_one_stripe.

This also fixes a minor bug where the "md/raid:%md: allocate XXkB"
message always said "0kB".
Signed-off-by: NNeilBrown <neilb@suse.de>

486f0644

md/raid5: pass gfp_t arg to grow_one_stripe() · a9683a79

由 NeilBrown 提交于 2月 25, 2015

This is needed for future improvement to stripe cache management.
Signed-off-by: NNeilBrown <neilb@suse.de>

a9683a79

md/raid5: introduce configuration option rmw_level · d06f191f

由 Markus Stockhausen 提交于 12月 15, 2014

Depending on the available coding we allow optimized rmw logic for write
operations. To support easier testing this patch allows manual control
of the rmw/rcw descision through the interface /sys/block/mdX/md/rmw_level.

The configuration can handle three levels of control.

rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q
calculation has no implementation path yet to factor in/out chunks of
a syndrome. Enforcing this level can be benefical for slow CPUs with
hardware syndrome support and fast SSDs.

rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will
save IOs. This equals the "old" unpatched behaviour and will be the
default.

rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are
equal. We might have higher CPU consumption because of calculating the
parity twice but it can be benefical otherwise. E.g. RAID4 with fast
dedicated parity disk/SSD. The option is implemented just to be
forward-looking and will ONLY work with this patch!
Signed-off-by: NMarkus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NNeilBrown <neilb@suse.de>

d06f191f

md/raid5: activate raid6 rmw feature · 584acdd4

由 Markus Stockhausen 提交于 12月 15, 2014

Glue it altogehter. The raid6 rmw path should work the same as the
already existing raid5 logic. So emulate the prexor handling/flags
and split functions as needed.

1) Enable xor_syndrome() in the async layer.

2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
at the start of a rmw run as we did it before for the single parity.

3) Take care of rmw run in ops_run_reconstruct6(). Again process only
the changed pages to get syndrome back into sync.

4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
run. The lower layers will calculate start & end pages from that and
call the xor_syndrome() correspondingly.

5) Adapt the several places where we ignored Q handling up to now.

Performance numbers for a single E5630 system with a mix of 10 7200k
desktop/server disks. 300 seconds random write with 8 threads onto a
3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)

bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
        skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
   4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
   8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
  16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
  32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
  64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
 128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
 256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
 512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 Kb/s
Signed-off-by: NMarkus Stockhausen <stockhausen@collogia.de>
Signed-off-by: NNeilBrown <neilb@suse.de>

584acdd4

raid5: handle expansion/resync case with stripe batching · dabc4ec6

由 shli@kernel.org 提交于 12月 15, 2014

expansion/resync can grab a stripe when the stripe is in batch list. Since all
stripes in batch list must be in the same state, we can't allow some stripes
run into expansion/resync. So we delay expansion/resync for stripe in batch
list.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

dabc4ec6

raid5: handle io error of batch list · 72ac7330

由 shli@kernel.org 提交于 12月 15, 2014

If io error happens in any stripe of a batch list, the batch list will be
split, then normal process will run for the stripes in the list.
Signed-off-by: NShaohua Li <shli@fusionio.com>
Signed-off-by: NNeilBrown <neilb@suse.de>

72ac7330

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功