1. 12 Jun, 2015 1 commit
    • md: make sure MD_RECOVERY_DONE is clear before starting recovery/resync · ea358cd0
      NeilBrown committed
      MD_RECOVERY_DONE is normally cleared by md_check_recovery after a
      resync etc. has finished.  However, it is possible for raid5_start_reshape
      to race and start a reshape before MD_RECOVERY_DONE is cleared.  This
      can lead to multiple reshapes running at the same time, which isn't
      good.
      
      So make sure it is cleared before starting a reshape, and also clear
      it when reaping a thread, just to be safe.
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 28 May, 2015 9 commits
    • md/raid5: break stripe-batches when the array has failed. · 626f2092
      NeilBrown committed
      Once the array has too many failures, we need to break
      stripe-batches up so they can all be dealt with.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: call break_stripe_batch_list from handle_stripe_clean_event · 787b76fa
      NeilBrown committed
      Now that the code in break_stripe_batch_list() is nearly identical
      to the end of handle_stripe_clean_event, replace the latter
      with a function call.
      
      The only remaining difference of any interest is the masking that is
      applied to dev[i].flags copied from head_sh.
      R5_WriteError certainly isn't wanted as it is set per-stripe, not
      per-batch.  R5_Overlap isn't wanted as it is explicitly handled.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: be more selective about distributing flags across batch. · 1b956f7a
      NeilBrown committed
      When a batch of stripes is broken up, we keep some of the flags
      that were per-stripe, and copy other flags from the head to all
      others.
      
      This only happens while a stripe is being handled, so many of the
      flags are irrelevant.
      
      The "SYNC_FLAGS" (which I've renamed to make it clear there are
      several) and STRIPE_DEGRADED are set per-stripe and so need to be
      preserved.  STRIPE_INSYNC is the only flag that is set on the head
      that needs to be propagated to all others.
      
      For safety, add a WARN_ON if others are set, except:
       STRIPE_HANDLE - this is safe and per-stripe and we are going to set
            it in several cases anyway
       STRIPE_INSYNC
       STRIPE_IO_STARTED - this is just a hint and doesn't hurt.
       STRIPE_ON_PLUG_LIST
       STRIPE_ON_RELEASE_LIST - It is pointless for a batched
                 stripe to be on one of these lists, but it can happen
                 and can be safely ignored.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: add handle_flags arg to break_stripe_batch_list. · 3960ce79
      NeilBrown committed
      When we break a stripe_batch_list we sometimes want to set
      STRIPE_HANDLE on the individual stripes, and sometimes not.
      
      So pass a 'handle_flags' arg.  If it is zero, always set STRIPE_HANDLE
      (on non-head stripes).  If not zero, only set it if any of the given
      flags are present.
      Signed-off-by: NeilBrown <neilb@suse.de>
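The 'handle_flags' rule described in the commit above (add handle_flags arg to break_stripe_batch_list) can be sketched in plain user-space C. This is only an illustration of the rule as the commit message states it, not the kernel code: the `DEMO_` flag values and both function names are hypothetical, and the real bit definitions live in the raid5 driver.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration; not the kernel's values. */
#define DEMO_STRIPE_HANDLE   (1UL << 0)
#define DEMO_STRIPE_DEGRADED (1UL << 1)
#define DEMO_STRIPE_INSYNC   (1UL << 2)

/*
 * Decide whether a non-head stripe should get STRIPE_HANDLE when its
 * batch is broken up: a zero handle_flags means "always", otherwise at
 * least one of the given flags must be present in the stripe state.
 */
bool demo_should_set_handle(unsigned long state, unsigned long handle_flags)
{
    return handle_flags == 0 || (state & handle_flags) != 0;
}

/* Return the stripe state as it would look after the batch is broken. */
unsigned long demo_break_member(unsigned long state, unsigned long handle_flags)
{
    if (demo_should_set_handle(state, handle_flags))
        state |= DEMO_STRIPE_HANDLE;
    return state;
}
```

For example, `demo_break_member(DEMO_STRIPE_DEGRADED, DEMO_STRIPE_INSYNC)` leaves the state unchanged, because the stripe carries none of the requested flags, while passing `0` as `handle_flags` always adds `DEMO_STRIPE_HANDLE`.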
    • md/raid5: duplicate some more handle_stripe_clean_event code in break_stripe_batch_list · fb642b92
      NeilBrown committed
      break_stripe_batch_list didn't clear head_sh->batch_head.
      This was probably a bug.
      
      Also clear all R5_Overlap flags and if any were cleared, wake up
      'wait_for_overlap'.
      This isn't always necessary but the worst effect is a little
      extra checking for code that is waiting on wait_for_overlap.
      
      Also, don't use wake_up_nr() because that does the wrong thing
      if 'nr' is zero, and the number of flags cleared doesn't
      strongly correlate with the number of threads to wake.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: remove condition test from check_break_stripe_batch_list. · 4e3d62ff
      NeilBrown committed
      handle_stripe_clean_event() contains a chunk of code very
      similar to check_break_stripe_batch_list().
      If we make the latter more like the former, we can end up
      with just one copy of this code.
      
      This first step removes the condition (and the 'check_' part
      of the name).  This has the added advantage of making it clear
      what check is being performed at the point where the function is
      called.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: Ensure a batch member is not handled prematurely. · b15a9dbd
      NeilBrown committed
      If a stripe is a member of a batch, but not the head, it must
      not be handled separately from the rest of the batch.
      
      'clear_batch_ready()' handles this requirement to some
      extent but not completely.  If a member is passed to handle_stripe()
      a second time it returns '0' indicating the stripe can be handled,
      which is wrong.
      So add an extra test.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: close race between STRIPE_BIT_DELAY and batching. · d0852df5
      NeilBrown committed
      When we add a write to a stripe we need to make sure the bitmap
      bit is set.  While doing that the stripe is not locked so it could
      be added to a batch after which further changes to STRIPE_BIT_DELAY
      and ->bm_seq are ineffective.
      
      So we need to hold off adding a stripe to a batch until bitmap_startwrite has
      completed at least once, and we need to avoid further changes to
      STRIPE_BIT_DELAY once the stripe has been added to a batch.
      
      If a bitmap_startwrite() completes after the stripe was added to a
      batch, it will not have set the bit, only incremented a counter, so no
      extra delay of the stripe is needed.
      Reported-by: Shaohua Li <shli@kernel.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: ensure whole batch is delayed for all required bitmap updates. · 2b6b2457
      NeilBrown committed
      When we add a stripe to a batch, we need to be sure that
      the head stripe will wait for the bitmap update required for the new
      stripe.
      Signed-off-by: NeilBrown <neilb@suse.de>
  3. 21 May, 2015 1 commit
  4. 08 May, 2015 6 commits
    • md/raid5: fix handling of degraded stripes in batches. · bb27051f
      NeilBrown committed
      There is no need for special handling of stripe-batches when the array
      is degraded.
      
      There may be if there is a failure in the batch, but STRIPE_DEGRADED
      does not imply an error.
      
      So don't set STRIPE_BATCH_ERR in ops_run_io just because the array is
      degraded.
      This actually causes a bug: the STRIPE_DEGRADED flag gets cleared in
      check_break_stripe_batch_list() and so the bitmap bit gets cleared
      when it shouldn't.
      
      So in check_break_stripe_batch_list(), split the batch up completely -
      again STRIPE_DEGRADED isn't meaningful.
      
      Also don't set STRIPE_BATCH_ERR when there is a write error to a
      replacement device.  This simply removes the replacement device and
      requires no extra handling.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: fix allocation of 'scribble' array. · 738a2738
      NeilBrown committed
      As the new 'scribble' array is sized based on chunk size,
      we need to make sure the size matches the largest of 'old'
      and 'new' chunk sizes when the array is undergoing reshape.
      
      We also potentially need to resize it even when not resizing
      the stripe cache, as chunk size can change without changing
      number of devices.
      
      So move the 'resize' code into a separate function, and
      consider old and new sizes when allocating.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Fixes: 46d5b785 ("raid5: use flex_array for scribble data")
    • md/raid5: don't record new size if resize_stripes fails. · 6e9eac2d
      NeilBrown committed
      If any memory allocation in resize_stripes fails we will return
      -ENOMEM, but in some cases we update conf->pool_size anyway.
      
      This means that if we try again, the allocations will be assumed
      to be larger than they are, and badness results.
      
      So only update pool_size if there is no error.
      
      This bug was introduced in 2.6.17 and the patch is suitable for
      -stable.
      
      Fixes: ad01c9e3 ("[PATCH] md: Allow stripes to be expanded in preparation for expanding an array")
      Cc: stable@vger.kernel.org (v2.6.17+)
      Signed-off-by: NeilBrown <neilb@suse.de>
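The "only update pool_size if there is no error" rule from the commit above (don't record new size if resize_stripes fails) can be illustrated with a small user-space sketch. Everything here is an assumption made for the example: `struct demo_conf`, `demo_resize()` and the injectable allocator are hypothetical stand-ins, and the real resize_stripes() performs several allocations rather than one.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the part of the raid5 conf that tracks size. */
struct demo_conf {
    int pool_size;   /* number of slots that 'slots' is known to hold */
    void **slots;
};

/* Allocator type, injectable so a failure can be simulated. */
typedef void *(*demo_alloc_fn)(size_t);

/* An allocator that always fails, standing in for an -ENOMEM condition. */
void *demo_failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}

/* Grow the slot array. Returns 0 on success, -1 on allocation failure. */
int demo_resize(struct demo_conf *conf, int newsize, demo_alloc_fn alloc)
{
    void **bigger;
    int i;

    if (newsize <= conf->pool_size)
        return 0;   /* nothing to grow */

    bigger = alloc((size_t)newsize * sizeof(*bigger));
    if (!bigger)
        return -1;  /* failure: pool_size must stay as it was */

    for (i = 0; i < conf->pool_size; i++)
        bigger[i] = conf->slots[i];
    for (; i < newsize; i++)
        bigger[i] = NULL;

    free(conf->slots);
    conf->slots = bigger;
    conf->pool_size = newsize;   /* recorded only after success */
    return 0;
}
```

Calling `demo_resize(&conf, 4, demo_failing_alloc)` returns -1 and leaves `conf.pool_size` untouched, so a later retry with a working allocator such as `malloc` starts from an accurate size.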
    • md/raid5: avoid reading parity blocks for full-stripe write to degraded array · 10d82c5f
      NeilBrown committed
      When performing a reconstruct write, we need to read all blocks
      that are not being over-written .. except the parity (P and Q) blocks.
      
      The code currently reads these (as they are not being over-written!)
      unnecessarily.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Fixes: ea664c82 ("md/raid5: need_this_block: tidy/fix last condition.")
    • md/raid5: more incorrect BUG_ON in handle_stripe_fill. · b0c783b3
      NeilBrown committed
      It is not incorrect to call handle_stripe_fill() when
      a batch of full-stripe writes is active.
      It is, however, a BUG if fetch_block() then decides
      it needs to actually fetch anything.
      
      So move the 'BUG_ON' to where it belongs.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Fixes: 59fc630b ("RAID5: batch adjacent full stripe write")
    • md/raid5: new alloc_stripe() to allocate an initialize a stripe. · f18c1a35
      NeilBrown committed
      The new batch_lock and batch_list fields are being initialized in
      grow_one_stripe() but not in resize_stripes().  This causes a crash
      on resize.
      
      So separate the core initialization into a new function and call it
      from both allocation sites.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Fixes: 59fc630b ("RAID5: batch adjacent full stripe write")
  5. 22 Apr, 2015 14 commits
    • md/raid5: don't do chunk aligned read on degraded array. · 9ffc8f7c
      Eric Mei committed
      When an array is degraded, a read that lands on a failed drive results in
      reading the rest of the data in the stripe. So a single sequential read
      can result in the same data being read twice.
      
      This patch avoids chunk-aligned reads on a degraded array. The
      downside is that the stripe cache gets involved, which means extra CPU
      overhead and an extra memory copy.
      
      Test Results:
      The following tests were done on an enterprise storage node with Seagate 6T SAS
      drives and a Xeon E5-2648L CPU (10 cores, 1.9 GHz), 10-disk MD RAID6 8+2,
      chunk size 128 KiB.
      
      I used FIO with direct I/O, various bs sizes, and enough queue depth, and
      tested sequential and 100% random reads against 3 array configurations:
       1) optimal, as baseline;
       2) degraded;
       3) degraded with this patch.
      Kernel version is 4.0-rc3.
      
      I ran each individual test only once, so there might be some variation,
      but we just focus on the big trend.
      
      Sequential Read:
        bs=(KiB)  optimal(MiB/s)  degraded(MiB/s)  degraded-with-patch (MiB/s)
         1024       1608            656              995
          512       1624            710              956
          256       1635            728              980
          128       1636            771              983
           64       1612           1119             1000
           32       1580           1420             1004
           16       1368            688              986
            8        768            647              953
            4        411            413              850
      
      Random Read:
        bs=(KiB)  optimal(IOPS)  degraded(IOPS)  degraded-with-patch (IOPS)
         1024        163            160              156
          512        274            273              272
          256        426            428              424
          128        576            592              591
           64        726            724              726
           32        849            848              837
           16        900            970              971
            8        927            940              929
            4        948            940              955
      
      Some notes:
        * In sequential + optimal, as the bs size gets smaller, the FIO thread
      becomes CPU bound.
        * In sequential + degraded, there is a big increase when bs is 64K and
      32K; I don't have an explanation.
        * In sequential + degraded-with-patch, the MD thread mostly becomes CPU
      bound.
      
      If you want, we can discuss specific data points in those results. But in
      general it seems that with this patch we have more predictable and, in most
      cases, significantly better sequential read performance when the array is
      degraded, and almost no noticeable impact on random reads.
      
      Performance is a complicated thing. The patch works well for this
      particular configuration, but may not be universal. For example, I
      imagine testing on an all-SSD array may give very different results. But I
      personally think that in most cases IO bandwidth is a scarcer resource than
      CPU.
      Signed-off-by: Eric Mei <eric.mei@seagate.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: allow the stripe_cache to grow and shrink. · edbe83ab
      NeilBrown committed
      The default setting of 256 stripe_heads is probably
      much too small for many configurations.  So it is best to make it
      auto-configure.
      
      Shrinking the cache under memory pressure is easy.  The only
      interesting part here is that we put a fairly high cost
      ('seeks') on shrinking the cache, as the cost is greater than
      just having to read more data: it also reduces parallelism.
      
      Growing the cache on demand needs to be done carefully.  If we allow
      fast growth, that can upset memory balance as lots of dirty memory can
      quickly turn into lots of memory queued in the stripe_cache.
      It is important for the raid5 block device to appear congested to
      allow write-throttling to work.
      
      So we only add stripes slowly. We set a flag when an allocation
      fails because all stripes are in use, allocate at a convenient
      time when that flag is set, and don't allow it to be set again
      until at least one stripe_head has been released for re-use.
      
      This means that a spurt of requests will only cause one stripe_head
      to be allocated, but a steady stream of requests will slowly
      increase the cache size - until memory pressure puts it back again.
      
      It could take hours to reach a steady state.
      
      The value written to, and displayed in, stripe_cache_size is
      used as a minimum.  The cache can grow above this and shrink back
      down to it.  The actual size is not directly visible, though it can
      be deduced to some extent by watching stripe_cache_active.
      Signed-off-by: NeilBrown <neilb@suse.de>
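The slow-growth policy described in the commit above (allow the stripe_cache to grow and shrink) can be modelled as a tiny state machine. This is a sketch under stated assumptions, not the driver's implementation: `struct demo_cache`, its field names and the three event functions are hypothetical, and the model ignores locking and the shrinker entirely.

```c
#include <stdbool.h>

/* Hypothetical model of the stripe cache growth state. */
struct demo_cache {
    int nr_stripes;     /* current number of stripe_heads */
    bool grow_needed;   /* set when a request found every stripe in use */
    bool grow_armed;    /* may grow_needed be set again? */
};

/* An allocation failed because all stripes were in use. */
void demo_alloc_failed(struct demo_cache *c)
{
    if (c->grow_armed) {
        c->grow_needed = true;
        c->grow_armed = false;  /* no more requests until a release */
    }
}

/* A stripe_head was released for re-use. */
void demo_stripe_released(struct demo_cache *c)
{
    c->grow_armed = true;
}

/* Called at a convenient time: grow by at most one stripe. */
void demo_convenient_time(struct demo_cache *c)
{
    if (c->grow_needed) {
        c->nr_stripes++;
        c->grow_needed = false;
    }
}
```

In this model a burst of failed allocations grows the cache by only one stripe_head, while a steady stream of failures interleaved with releases grows it one step at a time, which matches the behaviour the commit message describes.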
    • md/raid5: change ->inactive_blocked to a bit-flag. · 5423399a
      NeilBrown committed
      This allows us to easily add more (atomic) flags.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe · 486f0644
      NeilBrown committed
      Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe()
      succeeds, do it inside the functions.
      
      Also choose the correct hash to handle next inside the functions.
      
      This removes duplication and will help with future new uses of
      {grow,drop}_one_stripe.
      
      This also fixes a minor bug where the "md/raid:%md: allocate XXkB"
      message always said "0kB".
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: pass gfp_t arg to grow_one_stripe() · a9683a79
      NeilBrown committed
      This is needed for future improvement to stripe cache management.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: introduce configuration option rmw_level · d06f191f
      Markus Stockhausen committed
      Depending on the available coding we allow optimized rmw logic for write
      operations. To support easier testing this patch allows manual control
      of the rmw/rcw decision through the interface /sys/block/mdX/md/rmw_level.
      
      The configuration can handle three levels of control.
      
      rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q
      calculation has no implementation path yet to factor in/out chunks of
      a syndrome. Enforcing this level can be beneficial for slow CPUs with
      hardware syndrome support and fast SSDs.
      
      rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will
      save IOs. This equals the "old" unpatched behaviour and will be the
      default.
      
      rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are
      equal. We might have higher CPU consumption because of calculating the
      parity twice but it can be beneficial otherwise. E.g. RAID4 with fast
      dedicated parity disk/SSD. The option is implemented just to be
      forward-looking and will ONLY work with this patch!
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
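The three rmw_level settings described in the commit above (introduce configuration option rmw_level) amount to a small decision rule. The sketch below follows only the wording of the commit message; `demo_prefer_rmw()` and its parameters are hypothetical names, and the driver's real condition may include further checks that the message does not mention.

```c
#include <stdbool.h>

/*
 * Choose between read-modify-write (rmw) and reconstruct-write (rcw),
 * given the estimated number of IOs each would need.
 *   level 0: never use rmw
 *   level 1: use rmw only if it saves IOs
 *   level 2: use rmw even when the IO counts are equal
 */
bool demo_prefer_rmw(int rmw_level, int rmw_ios, int rcw_ios)
{
    switch (rmw_level) {
    case 0:
        return false;
    case 1:
        return rmw_ios < rcw_ios;
    case 2:
        return rmw_ios <= rcw_ios;
    default:
        return false;   /* unknown level: fall back to rcw in this sketch */
    }
}
```

With equal estimates, say 3 IOs each, level 1 picks rcw and level 2 picks rmw, which is the only case where the two levels differ.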
    • md/raid5: activate raid6 rmw feature · 584acdd4
      Markus Stockhausen committed
      Glue it all together. The raid6 rmw path should work the same as the
      already existing raid5 logic. So emulate the prexor handling/flags
      and split functions as needed.
      
      1) Enable xor_syndrome() in the async layer.
      
      2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
      at the start of a rmw run as we did it before for the single parity.
      
      3) Take care of rmw run in ops_run_reconstruct6(). Again process only
      the changed pages to get syndrome back into sync.
      
      4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
      run. The lower layers will calculate start & end pages from that and
      call the xor_syndrome() correspondingly.
      
      5) Adapt the several places where we ignored Q handling up to now.
      
      Performance numbers for a single E5630 system with a mix of 10 7200k
      desktop/server disks. 300 seconds of random write with 8 threads onto a
      3.2 TB (10*400GB) RAID6 with 64K chunk and no spare (group_thread_cnt=4):
      
      bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
              skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
         4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
         8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
        16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
        32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
        64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
       128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
       256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
       512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 KB/s
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: handle expansion/resync case with stripe batching · dabc4ec6
      shli@kernel.org committed
      Expansion/resync can grab a stripe while the stripe is in a batch list. Since all
      stripes in a batch list must be in the same state, we can't allow some stripes
      to run into expansion/resync. So we delay expansion/resync for a stripe in a batch
      list.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: handle io error of batch list · 72ac7330
      shli@kernel.org committed
      If an io error happens in any stripe of a batch list, the batch list will be
      split, then normal processing will run for the stripes in the list.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • RAID5: batch adjacent full stripe write · 59fc630b
      shli@kernel.org committed
      The stripe cache works in 4k units. Even adjacent full stripe writes are handled in 4k
      units. Ideally we should use a bigger size for adjacent full stripe writes. A bigger
      stripe cache size means fewer stripes running in the state machine, which can reduce
      cpu overhead. A bigger size can also cause bigger IOs to be dispatched to the
      underlying disks.
      
      With the patch below, we automatically batch adjacent full stripe writes
      together. Such stripes will be added to the batch list. Only the first stripe
      of the list will be put on handle_list and so run handle_stripe(). Some steps
      of handle_stripe() are extended to cover all stripes of the list, including
      ops_run_io, ops_run_biodrain and so on. With this patch, we have fewer stripes
      running in handle_stripe() and we send the IO of the whole stripe list together to
      increase IO size.
      
      Stripes added to a batch list have some limitations. A batch list can only
      include full stripe writes and can't cross a chunk boundary, to make sure the stripes
      have the same parity disks. Stripes in a batch list must be in the same state
      (no written, toread and so on). If a stripe is in a batch list, any new
      read/write passed to add_stripe_bio will be blocked as an overlap conflict until the batch
      list is handled. These limitations make sure stripes in a batch list are in
      exactly the same state throughout their life cycle.
      
      I tested running a 160k randwrite in a RAID5 array with 32k chunk size and 6
      PCIe SSDs. This patch improves performance by around 30%, and the IO size to the
      underlying disks is exactly 32k. I also ran a 4k randwrite test on the same array to make
      sure the performance isn't changed by the patch.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
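Two of the batching limitations named in the commit above (only full stripe writes, and no crossing of a chunk boundary) can be expressed as a simple predicate. This is a hedged sketch rather than the driver's logic: `demo_can_join_previous()` is a hypothetical helper, sectors are assumed to be 512 bytes, and the real code checks more state than this.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Can the stripe starting at per-device 'sector' be batched with the
 * stripe immediately before it?  Both must be full stripe writes, and
 * the stripe must not be the first one in its chunk, because then the
 * previous stripe belongs to a different chunk.
 */
bool demo_can_join_previous(unsigned long long sector,
                            unsigned long long chunk_sectors,
                            bool this_is_full_write,
                            bool prev_is_full_write)
{
    assert(chunk_sectors != 0);

    if (!this_is_full_write || !prev_is_full_write)
        return false;
    if (sector % chunk_sectors == 0)
        return false;   /* first stripe of a chunk: boundary would be crossed */
    return true;
}
```

With the 32k chunk from the test description, which is 64 sectors of 512 bytes, a stripe at sector 8 may join the one before it, while a stripe at sector 64 starts a new chunk and may not.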
    • raid5: track overwrite disk count · 7a87f434
      shli@kernel.org committed
      Track the overwrite disk count, so we can know if a stripe is a full stripe write.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: add a new flag to track if a stripe can be batched · da41ba65
      shli@kernel.org committed
      A fresh new stripe with a write request can be batched. Any time the stripe is
      handled or a new read is queued, the flag will be cleared.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: use flex_array for scribble data · 46d5b785
      shli@kernel.org committed
      Use flex_array for scribble data. The next patch will batch several stripes
      together, so scribble data should be able to cover several stripes, so this
      patch also allocates scribble data for stripes across a chunk.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: remove 'go_faster' option from ->sync_request() · 09314799
      NeilBrown committed
      This option is not well justified and testing suggests that
      it hardly ever makes any difference.
      
      The comment suggests there might be a need to wait for non-resync
      activity indicated by ->nr_waiting, however raise_barrier()
      already waits for all of that.
      
      So just remove it to simplify reasoning about speed limiting.
      
      This allows us to remove a 'FIXME' comment from raid5.c as that
      never used the flag.
      Signed-off-by: NeilBrown <neilb@suse.de>
  6. 25 Feb, 2015 1 commit
    • raid5: check faulty flag for array status during recovery. · 16d9cfab
      Eric Mei committed
      When more than one drive has failed, it is possible to start
      rebuilding one drive while leaving another faulty drive in the array.
      To determine whether the array will be optimal after the rebuild, the current
      code only checks whether a drive is missing, which could potentially
      lead to data corruption. This patch adds a check of the Faulty flag.
      Signed-off-by: NeilBrown <neilb@suse.de>
  7. 18 Feb, 2015 1 commit
  8. 06 Feb, 2015 2 commits
    • md: make reconfig_mutex optional for writes to md sysfs files. · 6791875e
      NeilBrown committed
      Rather than using mddev_lock() to take the reconfig_mutex
      when writing to any md sysfs file, we only take mddev_lock()
      in the particular _store() functions that require it.
      Admittedly this is most, but it isn't all.
      
      This also allows us to remove special-case handling for new_dev_store
      (in md_attr_store).
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: use ->lock to protect accessing raid5 sysfs attributes. · 7b1485ba
      NeilBrown committed
      It is important that mddev->private isn't freed while
      a sysfs attribute function is accessing it.
      
      So use mddev->lock to protect the setting of ->private to NULL, and
      take that lock when checking ->private for NULL and de-referencing it
      in the sysfs access functions.
      
      This only applies to the read ('show') side of access.  Write
      access will be handled separately.
      Signed-off-by: NeilBrown <neilb@suse.de>
  9. 04 Feb, 2015 5 commits
    • md: rename ->stop to ->free · afa0f557
      NeilBrown committed
      Now that the ->stop function only frees the private data,
      rename it accordingly.
      
      Also pass in the private pointer as an arg rather than using
      mddev->private.  This flexibility will be useful in level_store().
      
      Finally, don't clear ->private.  It doesn't make sense to clear
      it seeing that isn't what we free, and it is no longer necessary
      to clear ->private (it was some time ago before  ->to_remove was
      introduced).
      
      Setting ->to_remove in ->free() is a bit of a wart, but not a
      big problem at the moment.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: split detach operation out from ->stop. · 5aa61f42
      NeilBrown committed
      Each md personality has a 'stop' operation which does two
      things:
       1/ it finalizes some aspects of the array to ensure nothing
          is accessing the ->private data
       2/ it frees the ->private data.
      
      All the steps in '1' can apply to all arrays and so can be
      performed in common code.
      
      This is useful as in the case where we change the personality which
      manages an array (in level_store()), it would be helpful to do
      step 1 early, and step 2 later.
      
      So split the 'step 1' functionality out into a new mddev_detach().
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make merge_bvec_fn more robust in face of personality changes. · 64590f45
      NeilBrown committed
      There is no locking around calls to merge_bvec_fn(), so
      it is possible that calls which coincide with a level (or personality)
      change could go wrong.
      
      So create a central dispatch point for these functions and use
      rcu_read_lock().
      If the array is suspended, reject any merge that can be rejected.
      If not, we know it is safe to call the function.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make ->congested robust against personality changes. · 5c675f83
      NeilBrown committed
      There is currently no locking around calls to the 'congested'
      bdi function.  If called at an awkward time while an array is
      being converted from one level (or personality) to another, there
      is a tiny chance of running code in an unreferenced module etc.
      
      So add a 'congested' function to the md_personality operations
      structure, and call it with appropriate locking from a central
      'mddev_congested'.
      
      When the array personality is changing the array will be 'suspended'
      so no IO is processed.
      If mddev_congested detects this, it simply reports that the
      array is congested, which is a safe guess.
      As mddev_suspend calls synchronize_rcu(), mddev_congested can
      avoid races by including the whole call inside an rcu_read_lock()
      region.
      This requires that the congested functions for all subordinate devices
      can be run under rcu_lock.  Fortunately this is the case.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: need_this_block: tidy/fix last condition. · ea664c82
      NeilBrown committed
      That last condition is unclear and over cautious.
      
      There are two related issues here.
      
      If a partial write is destined for a missing device, then
      either RMW or RCW can work.  We must read all the available
      blocks.  Only then can the missing blocks be calculated, and
      then the parity update performed.
      
      If RMW is not an option, then there is a complication even
      without partial writes.  If we would need to read a missing
      device to perform the reconstruction, then we must first read every
      block so the missing device data can be computed.
      This is the case for RAID6 (Which currently does not support
      RMW) and for times when we don't trust the parity (after a crash)
      and so are in the process of resyncing it.
      
      So make these two cases more clear and separate, and perform
      the relevant tests more thoroughly.
      Signed-off-by: NeilBrown <neilb@suse.de>