1. 22 Apr, 2015 12 commits
    • md/raid5: change ->inactive_blocked to a bit-flag. · 5423399a
      NeilBrown committed
      This allows us to easily add more (atomic) flags.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: move max_nr_stripes management into grow_one_stripe and drop_one_stripe · 486f0644
      NeilBrown committed
      Rather than adjusting max_nr_stripes whenever {grow,drop}_one_stripe()
      succeeds, do it inside the functions.
      
      Also choose the correct hash to handle next inside the functions.
      
      This removes duplication and will help with future new uses of
      {grow,drop}_one_stripe.
      
      This also fixes a minor bug where the "md/raid:%md: allocate XXkB"
      message always said "0kB".
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: pass gfp_t arg to grow_one_stripe() · a9683a79
      NeilBrown committed
      This is needed for future improvement to stripe cache management.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: introduce configuration option rmw_level · d06f191f
      Markus Stockhausen committed
      Depending on the available coding we allow optimized rmw logic for write
      operations. To support easier testing this patch allows manual control
      of the rmw/rcw decision through the interface /sys/block/mdX/md/rmw_level.
      
      The configuration can handle three levels of control.
      
      rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q
      calculation has no implementation path yet to factor in/out chunks of
      a syndrome. Enforcing this level can be beneficial for slow CPUs with
      hardware syndrome support and fast SSDs.
      
      rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will
      save IOs. This equals the "old" unpatched behaviour and will be the
      default.
      
      rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are
      equal. We might have higher CPU consumption because of calculating the
      parity twice but it can be beneficial otherwise. E.g. RAID4 with fast
      dedicated parity disk/SSD. The option is implemented just to be
      forward-looking and will ONLY work with this patch!
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
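The three rmw_level settings described in this commit can be modelled as a small decision function. The following is an illustrative Python sketch based only on the commit message, not the kernel implementation. The names `estimate_reads` and `choose_write_method` are hypothetical, and the read-count model assumes that nothing is cached in the stripe and that every written block is fully overwritten.

```python
def estimate_reads(data_disks: int, parity_disks: int, blocks_written: int):
    """Estimate pre-read IOs for one stripe (assumed model, not kernel code).

    rmw: read the old contents of the blocks being written plus all parity.
    rcw: read every data block that is NOT being written.
    """
    rmw_reads = blocks_written + parity_disks
    rcw_reads = data_disks - blocks_written
    return rmw_reads, rcw_reads


def choose_write_method(rmw_level: int, rmw_reads: int, rcw_reads: int) -> str:
    """Return "rmw" or "rcw" following the three levels in the commit text."""
    if rmw_level not in (0, 1, 2):
        raise ValueError("rmw_level must be 0, 1 or 2")
    if rmw_level == 0:
        return "rcw"   # level 0: rmw disabled for all RAID types
    if rmw_reads < rcw_reads:
        return "rmw"   # levels 1 and 2: rmw when it saves IOs
    if rmw_level == 2 and rmw_reads == rcw_reads:
        return "rmw"   # level 2 only: a tie also goes to rmw
    return "rcw"


# Single-block write on a RAID6 with 8 data disks and P+Q parity.
rmw, rcw = estimate_reads(data_disks=8, parity_disks=2, blocks_written=1)
print(rmw, rcw, choose_write_method(1, rmw, rcw))
```

Under this model a single-block write on a wide RAID6 favours rmw, while the only difference between levels 1 and 2 is how a tie is resolved.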
    • md/raid5: activate raid6 rmw feature · 584acdd4
      Markus Stockhausen committed
      Glue it all together. The raid6 rmw path should work the same as the
      already existing raid5 logic. So emulate the prexor handling/flags
      and split functions as needed.
      
      1) Enable xor_syndrome() in the async layer.
      
      2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
      at the start of a rmw run as we did it before for the single parity.
      
      3) Take care of rmw run in ops_run_reconstruct6(). Again process only
      the changed pages to get syndrome back into sync.
      
      4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
      run. The lower layers will calculate start & end pages from that and
      call the xor_syndrome() correspondingly.
      
      5) Adapt the several places where we ignored Q handling up to now.
      
      Performance numbers for a single E5630 system with a mix of 10 7200k
      desktop/server disks. 300 seconds random write with 8 threads onto a
      3.2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
      
      bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
              skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
         4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
         8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
        16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
        32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
        64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
       128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
       256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
        512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 KB/s
      Signed-off-by: Markus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
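A RAID6 rmw run can touch only the changed pages because both P and Q are linear over XOR: the old data is XORed out of the syndrome and the new data is XORed in. The following is a minimal Python sketch on single bytes, assuming the GF(2^8) field with polynomial 0x11d and generator {02} used by the Linux raid6 library. All function names are hypothetical; the kernel works on whole pages through xor_syndrome().

```python
def gf_mul2(x: int) -> int:
    """Multiply a byte by the generator {02} in GF(2^8) modulo 0x11d."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x & 0xFF


def gf_mul_pow2(x: int, n: int) -> int:
    """Multiply x by {02}**n, the Q coefficient of data disk n."""
    for _ in range(n):
        x = gf_mul2(x)
    return x


def compute_pq(data):
    """Reconstruct-write: derive P and Q from every data byte."""
    p = q = 0
    for index, value in enumerate(data):
        p ^= value
        q ^= gf_mul_pow2(value, index)
    return p, q


def rmw_update(p: int, q: int, index: int, old: int, new: int):
    """Read-modify-write: fold only the changed byte into P and Q."""
    delta = old ^ new
    return p ^ delta, q ^ gf_mul_pow2(delta, index)


data = [0x11, 0x22, 0x33, 0x44]
p, q = compute_pq(data)
updated = list(data)
updated[2] = 0xAB
# Both paths must agree on the new syndrome.
print(rmw_update(p, q, 2, data[2], updated[2]) == compute_pq(updated))
```

The rmw path needs the old data byte plus the old P and Q, whereas the rcw path needs every other data byte, which is the IO trade-off that rmw_level controls.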
    • raid5: handle expansion/resync case with stripe batching · dabc4ec6
      shli@kernel.org committed
      Expansion/resync can grab a stripe while the stripe is in a batch list. Since
      all stripes in a batch list must be in the same state, we can't allow some of
      them to run into expansion/resync. So we delay expansion/resync for stripes
      in a batch list.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: handle io error of batch list · 72ac7330
      shli@kernel.org committed
      If an io error happens in any stripe of a batch list, the batch list is
      split, and normal processing then runs for the stripes in the list.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • RAID5: batch adjacent full stripe write · 59fc630b
      shli@kernel.org committed
      The stripe cache uses a 4k unit size. Even adjacent full stripe writes are
      handled in 4k units. Ideally we should use a bigger size for adjacent full
      stripe writes. A bigger stripe cache size means fewer stripes running in the
      state machine, which reduces cpu overhead. A bigger size also lets bigger IOs
      be dispatched to the underlying disks.
      
      With this patch, we automatically batch adjacent full stripe writes
      together. Such stripes are added to a batch list. Only the first stripe
      of the list is put on handle_list and so runs handle_stripe(). Some steps
      of handle_stripe() are extended to cover all stripes of the list, including
      ops_run_io, ops_run_biodrain and so on. With this patch, we have fewer stripes
      running in handle_stripe() and we send the IO of the whole stripe list together
      to increase IO size.
      
      Stripes added to a batch list have some limitations. A batch list can only
      include full stripe writes and can't cross a chunk boundary, to make sure
      the stripes have the same parity disks. Stripes in a batch list must be in
      the same state (no written, toread and so on). If a stripe is in a batch
      list, all new reads/writes passed to add_stripe_bio will be blocked as an
      overlap conflict until the batch list is handled. These limitations make
      sure stripes in a batch list stay in exactly the same state throughout
      their life cycle.
      
      I tested 160k randwrite on a RAID5 array with 32k chunk size and 6 PCIe
      SSDs. This patch improves performance by around 30%, and the IO size to the
      underlying disks is exactly 32k. I also ran a 4k randwrite test on the same
      array to make sure performance isn't changed by the patch.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
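The batching limits in this commit (full stripe writes only, no crossing of a chunk boundary) and the sizes from its test setup can be checked with a little arithmetic. The following is an illustrative Python model, not kernel code; all function names are hypothetical. Reading the 160k write size as exactly one full stripe, (6 - 1) x 32k, is an inference from the numbers, not something the commit states.

```python
STRIPE_SIZE = 4 * 1024  # the stripe cache handles 4k per device (from the commit)


def is_full_stripe_write(overwrite_disks: int, data_disks: int) -> bool:
    """A stripe is a full stripe write when every data disk is overwritten."""
    return overwrite_disks == data_disks


def can_share_batch(prev_offset: int, offset: int, chunk_size: int) -> bool:
    """True if the 4k stripe at `offset` (bytes, per device) directly follows
    the one at `prev_offset` and both lie in the same chunk."""
    adjacent = offset == prev_offset + STRIPE_SIZE
    same_chunk = prev_offset // chunk_size == offset // chunk_size
    return adjacent and same_chunk


def stripes_per_chunk(chunk_size: int) -> int:
    """Upper bound on batch length, since a batch cannot cross a chunk."""
    return chunk_size // STRIPE_SIZE


def full_stripe_bytes(raid_disks: int, parity_disks: int, chunk_size: int) -> int:
    """User data covered by one full stripe across all data disks."""
    return (raid_disks - parity_disks) * chunk_size


chunk = 32 * 1024
print(full_stripe_bytes(6, 1, chunk) // 1024)       # KiB of data per full stripe
print(stripes_per_chunk(chunk) * STRIPE_SIZE // 1024)  # KiB sent to each disk per batch
```

With a 32k chunk, at most eight 4k stripes fit into one batch, which matches the 32k IO size to the underlying disks reported in the commit.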
    • raid5: track overwrite disk count · 7a87f434
      shli@kernel.org committed
      Track overwrite disk count, so we can know if a stripe is a full stripe write.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: add a new flag to track if a stripe can be batched · da41ba65
      shli@kernel.org committed
      A fresh stripe with a write request can be batched. Any time the stripe is
      handled or a new read is queued, the flag will be cleared.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: use flex_array for scribble data · 46d5b785
      shli@kernel.org committed
      Use flex_array for scribble data. Next patch will batch several stripes
      together, so scribble data should be able to cover several stripes, so this
      patch also allocates scribble data for stripes across a chunk.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: remove 'go_faster' option from ->sync_request() · 09314799
      NeilBrown committed
      This option is not well justified and testing suggests that
      it hardly ever makes any difference.
      
      The comment suggests there might be a need to wait for non-resync
      activity indicated by ->nr_waiting, however raise_barrier()
      already waits for all of that.
      
      So just remove it to simplify reasoning about speed limiting.
      
      This allows us to remove a 'FIXME' comment from raid5.c as that
      never used the flag.
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 25 Feb, 2015 1 commit
    • raid5: check faulty flag for array status during recovery. · 16d9cfab
      Eric Mei committed
      When we have more than 1 drive failure, it's possible we start
      rebuilding one drive while leaving another faulty drive in the array.
      To determine whether the array will be optimal after rebuilding, the current
      code only checks whether a drive is missing, which could potentially
      lead to data corruption. This patch adds a check of the Faulty flag.
      Signed-off-by: NeilBrown <neilb@suse.de>
  3. 18 Feb, 2015 1 commit
  4. 06 Feb, 2015 2 commits
    • md: make reconfig_mutex optional for writes to md sysfs files. · 6791875e
      NeilBrown committed
      Rather than using mddev_lock() to take the reconfig_mutex
      when writing to any md sysfs file, we only take mddev_lock()
      in the particular _store() functions that require it.
      Admittedly this is most, but it isn't all.
      
      This also allows us to remove special-case handling for new_dev_store
      (in md_attr_store).
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: use ->lock to protect accessing raid5 sysfs attributes. · 7b1485ba
      NeilBrown committed
      It is important that mddev->private isn't freed while
      a sysfs attribute function is accessing it.
      
      So use mddev->lock to protect the setting of ->private to NULL, and
      take that lock when checking ->private for NULL and de-referencing it
      in the sysfs access functions.
      
      This only applies to the read ('show') side of access.  Write
      access will be handled separately.
      Signed-off-by: NeilBrown <neilb@suse.de>
  5. 04 Feb, 2015 9 commits
    • md: rename ->stop to ->free · afa0f557
      NeilBrown committed
      Now that the ->stop function only frees the private data,
      rename it accordingly.
      
      Also pass in the private pointer as an arg rather than using
      mddev->private.  This flexibility will be useful in level_store().
      
      Finally, don't clear ->private.  It doesn't make sense to clear
      it seeing that isn't what we free, and it is no longer necessary
      to clear ->private (it was some time ago before  ->to_remove was
      introduced).
      
      Setting ->to_remove in ->free() is a bit of a wart, but not a
      big problem at the moment.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: split detach operation out from ->stop. · 5aa61f42
      NeilBrown committed
      Each md personality has a 'stop' operation which does two
      things:
       1/ it finalizes some aspects of the array to ensure nothing
          is accessing the ->private data
       2/ it frees the ->private data.
      
      All the steps in '1' can apply to all arrays and so can be
      performed in common code.
      
      This is useful as in the case where we change the personality which
      manages an array (in level_store()), it would be helpful to do
      step 1 early, and step 2 later.
      
      So split the 'step 1' functionality out into a new mddev_detach().
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make merge_bvec_fn more robust in face of personality changes. · 64590f45
      NeilBrown committed
      There is no locking around calls to merge_bvec_fn(), so
      it is possible that calls which coincide with a level (or personality)
      change could go wrong.
      
      So create a central dispatch point for these functions and use
      rcu_read_lock().
      If the array is suspended, reject any merge that can be rejected.
      If not, we know it is safe to call the function.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make ->congested robust against personality changes. · 5c675f83
      NeilBrown committed
      There is currently no locking around calls to the 'congested'
      bdi function.  If called at an awkward time while an array is
      being converted from one level (or personality) to another, there
      is a tiny chance of running code in an unreferenced module etc.
      
      So add a 'congested' function to the md_personality operations
      structure, and call it with appropriate locking from a central
      'mddev_congested'.
      
      When the array personality is changing the array will be 'suspended'
      so no IO is processed.
      If mddev_congested detects this, it simply reports that the
      array is congested, which is a safe guess.
      As mddev_suspend calls synchronize_rcu(), mddev_congested can
      avoid races by including the whole call inside an rcu_read_lock()
      region.
      This requires that the congested functions for all subordinate devices
      can be run under rcu_lock.  Fortunately this is the case.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: need_this_block: tidy/fix last condition. · ea664c82
      NeilBrown committed
      That last condition is unclear and over cautious.
      
      There are two related issues here.
      
      If a partial write is destined for a missing device, then
      either RMW or RCW can work.  We must read all the available
      blocks.  Only then can the missing blocks be calculated, and
      then the parity update performed.
      
      If RMW is not an option, then there is a complication even
      without partial writes.  If we would need to read a missing
      device to perform the reconstruction, then we must first read every
      block so the missing device data can be computed.
      This is the case for RAID6 (which currently does not support
      RMW) and for times when we don't trust the parity (after a crash)
      and so are in the process of resyncing it.
      
      So make these two cases more clear and separate, and perform
      the relevant tests more thoroughly.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: need_this_block: start simplifying the last two conditions. · a9d56950
      NeilBrown committed
      Both the last two cases are only relevant if something has failed and
      something needs to be written (but not over-written), and if it is OK
      to pre-read blocks at this point.  So factor out those tests and
      explain them.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: separate out the easy conditions in need_this_block. · a79cfe12
      NeilBrown committed
      Some of the conditions in need_this_block have very straightforward
      motivation.  Separate those out and document them.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: separate large if clause out of fetch_block(). · 2c58f06e
      NeilBrown committed
      fetch_block() has a very large and hard to read 'if' condition.
      
      Separate it into its own function so that it can be
      made more readable.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: do_release_stripe(): No need to call md_wakeup_thread() twice · ad3ab8b6
      Jes Sorensen committed
      67f45548 introduced a call to
      md_wakeup_thread() when adding to the delayed_list. However the md
      thread is woken up unconditionally just below.
      
      Remove the unnecessary wakeup call.
      Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  6. 02 Feb, 2015 1 commit
    • md/raid5: fix another livelock caused by non-aligned writes. · b1b02fe9
      NeilBrown committed
      If a non-page-aligned write is destined for a device which
      is missing/faulty, we can deadlock.
      
      As the target device is missing, a read-modify-write cycle
      is not possible.
      As the write is not for a full-page, a reconstruct-write cycle
      is not possible.
      
      This should be handled by logic in fetch_block() which notices
      there is a non-R5_OVERWRITE write to a missing device, and so
      loads all blocks.
      
      However since commit 67f45548, that code requires
      STRIPE_PREREAD_ACTIVE before it will activate, and those circumstances
      never set STRIPE_PREREAD_ACTIVE.
      
      So: in handle_stripe_dirtying, if neither rmw nor rcw was possible,
      set STRIPE_DELAYED, which will cause STRIPE_PREREAD_ACTIVE to be set
      after a suitable delay.
      
      Fixes: 67f45548
      Cc: stable@vger.kernel.org (v3.16+)
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  7. 03 Dec, 2014 1 commit
    • md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants. · 108cef3a
      NeilBrown committed
      It is critical that fetch_block() and handle_stripe_dirtying()
      are consistent in their analysis of what needs to be loaded.
      Otherwise raid5 can wait forever for a block that won't be loaded.
      
      Currently when writing to a RAID5 that is resyncing, to a location
      beyond the resync offset, handle_stripe_dirtying chooses a
      reconstruct-write cycle, but fetch_block() assumes a
      read-modify-write, and a lockup can happen.
      
      So treat that case just like RAID6, just as we do in
      handle_stripe_dirtying.  RAID6 always does reconstruct-write.
      
      This bug was introduced when the behaviour of handle_stripe_dirtying
      was changed in 3.7, so the patch is suitable for any kernel since,
      though it will need careful merging for some versions.
      
      Cc: stable@vger.kernel.org (v3.7+)
      Fixes: a7854487
      Reported-by: Henry Cai <henryplusplus@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  8. 14 Oct, 2014 2 commits
  9. 09 Oct, 2014 1 commit
  10. 02 Oct, 2014 1 commit
    • md/raid5: disable 'DISCARD' by default due to safety concerns. · 8e0e99ba
      NeilBrown committed
      It has come to my attention (thanks Martin) that 'discard_zeroes_data'
      is only a hint.  Some devices in some cases don't do what it
      says on the label.
      
      The use of DISCARD in RAID5 depends on reads from discarded regions
      being predictably zero.  If a write to a previously discarded region
      performs a read-modify-write cycle it assumes that the parity block
      was consistent with the data blocks.  If all were zero, this would
      be the case.  If some are and some aren't this would not be the case.
      This could lead to data corruption after a device failure when
      data needs to be reconstructed from the parity.
      
      As we cannot trust 'discard_zeroes_data', ignore it by default
      and so disallow DISCARD on all raid4/5/6 arrays.
      
      As many devices are trustworthy, and as there are benefits to using
      DISCARD, add a module parameter to over-ride this caution and cause
      DISCARD to work if discard_zeroes_data is set.
      
      If a site wants to enable DISCARD on some arrays but not on others they
      should select DISCARD support at the filesystem level, and set the
      raid456 module parameter.
          raid456.devices_handle_discard_safely=Y
      
      As this is a data-safety issue, I believe this patch is suitable for
      -stable.
      DISCARD support for RAID456 was added in 3.7
      
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Heinz Mauelshagen <heinzm@redhat.com>
      Cc: stable@vger.kernel.org (3.7+)
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Fixes: 620125f2
      Signed-off-by: NeilBrown <neilb@suse.de>
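The corruption scenario described in this commit can be reproduced with plain XOR parity on single bytes. The following is an illustrative Python sketch, not kernel code, with hypothetical helper names: after a DISCARD, one device returns stale non-zero data while the others (and the parity) read as zero, a read-modify-write then carries the inconsistency forward, and a later rebuild returns the wrong value.

```python
from functools import reduce
from operator import xor


def rmw_parity(old_parity: int, old_data: int, new_data: int) -> int:
    """Read-modify-write parity update: XOR out the old data, XOR in the new."""
    return old_parity ^ old_data ^ new_data


def reconstruct(surviving_data, parity: int) -> int:
    """Rebuild a lost data block from the surviving data blocks and parity."""
    return reduce(xor, surviving_data, parity)


def write_then_lose_last_device(reads_after_discard, parity_after_discard, new_value):
    """Write new_value to device 0 via rmw, then rebuild the last device.

    Returns (what the last device held, what the rebuild produces).
    """
    data = list(reads_after_discard)
    parity = rmw_parity(parity_after_discard, data[0], new_value)
    data[0] = new_value
    return data[-1], reconstruct(data[:-1], parity)


# Every device really returns zeroes after the discard: rebuild is correct.
print(write_then_lose_last_device([0x00, 0x00, 0x00], 0x00, 0x0F))
# Device 1 returns stale data instead of zeroes: rebuild yields garbage.
print(write_then_lose_last_device([0x00, 0x5A, 0x00], 0x00, 0x0F))
```

In the second case the rmw update never reads device 1, so the stale byte stays invisible until a device fails and its contents have to be recomputed from parity.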
  11. 18 Aug, 2014 2 commits
    • md/raid6: avoid data corruption during recovery of double-degraded RAID6 · 9c4bdf69
      NeilBrown committed
      During recovery of a double-degraded RAID6 it is possible for
      some blocks not to be recovered properly, leading to corruption.
      
      If a write happens to one block in a stripe that would be written to a
      missing device, and at the same time that stripe is recovering data
      to the other missing device, then that recovered data may not be written.
      
      This patch skips, in the double-degraded case, an optimisation that is
      only safe for single-degraded arrays.
      
      Bug was introduced in 2.6.32 and fix is suitable for any kernel since
      then.  In an older kernel with separate handle_stripe5() and
      handle_stripe6() functions the patch must change handle_stripe6().
      
      Cc: stable@vger.kernel.org (2.6.32+)
      Fixes: 6c0069c0
      Cc: Yuri Tikhonov <yur@emcraft.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in>
      Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in>
      Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
    • md/raid5: avoid livelock caused by non-aligned writes. · a40687ff
      NeilBrown committed
      If a stripe in a raid6 array received a write to each data block while
      the array is degraded, and if any of these writes to a missing device
      are not page-aligned, then a live-lock happens.
      
      In this case the P and Q blocks need to be read so that the part of
      the missing block which is *not* being updated by the write can be
      constructed.  Due to a logic error, these blocks are not loaded, so
      the update cannot proceed and the stripe is 'handled' repeatedly in an
      infinite loop.
      
      This bug is unlikely as most writes are page aligned.  However as it
      can lead to a livelock it is suitable for -stable.  It was introduced
      in 3.16.
      
      Cc: stable@vger.kernel.org (v3.16)
      Fixed: 67f45548
      Signed-off-by: NeilBrown <neilb@suse.de>
  12. 10 Jun, 2014 1 commit
    • raid5: speedup sync_request processing · 053f5b65
      Eivind Sarto committed
      The raid5 sync_request() processing calls handle_stripe() within the context of
      the resync-thread.  The resync-thread issues the first set of read requests
      and this adds execution latency and slows down the scheduling of the next
      sync_request().
      The current rebuild/resync speed of raid5 is not much faster than what
      rotational HDDs can sustain.
      Testing the following patch on a 6-drive array, I can increase the rebuild
      speed from 100 MB/s to 175 MB/s.
      The sync_request() now just sets STRIPE_HANDLE and releases the stripe.  This
      creates some more parallelism between the resync-thread and raid5 kernel daemon.
      Signed-off-by: Eivind Sarto <esarto@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  13. 05 Jun, 2014 1 commit
    • md/raid5: deadlock between retry_aligned_read with barrier io · 2844dc32
      hui jiao committed
      A chunk aligned read increases the counter active_aligned_reads and
      decreases it after the sub-device handles it successfully. But when a read
      error occurs, the read is redispatched by raid5d, and
      active_aligned_reads will not be decreased until we can grab a stripe
      head in retry_aligned_read. Now suppose a barrier io comes, sets
      conf->quiesce to 2, and waits until both active_stripes and
      active_aligned_reads are zero. The retried chunk aligned read gets
      stuck at get_active_stripe waiting until conf->quiesce becomes 0.
      retry_aligned_read and the barrier io are now waiting for each other.
      One possible solution is that we ignore conf->quiesce and let the retried
      aligned read finish. I reproduced this deadlock and tested this patch on
      centos6.0.
      Signed-off-by: NeilBrown <neilb@suse.de>
  14. 29 May, 2014 3 commits
    • raid5: add an option to avoid copy data from bio to stripe cache · d592a996
      Shaohua Li committed
      The stripe cache has two goals:
      1. cache data, so next time if data can be found in stripe cache, disk access
      can be avoided.
      2. stable data. Data is copied from the bio to the stripe cache and parity is
      calculated from the copy. Data written to disk comes from the stripe cache,
      so if the upper layer changes the bio data, data written to disk isn't impacted.
      
      In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
      can guarantee 2 too. For 1, it's not common either. The block plug mechanism
      will dispatch a bunch of sequential small requests together. And since I'm
      using SSDs, I'm using a small chunk size. It's a rare case that the stripe
      cache is really useful.
      
      So I'd like to avoid the copy from bio to stripe cache, and it's very helpful
      for performance. In my 1M randwrite tests, avoiding the copy can increase
      performance by more than 30%.
      
      Of course, this shouldn't be enabled by default. It's reported enabling
      BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
      control it.
      
      Neilb:
        changed BUG_ON to WARN_ON
        Removed some assignments from raid5_build_block which are now not needed.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: avoid release list until last reference of the stripe · cf170f3f
      Eivind Sarto committed
      The (lockless) release_list reduces lock contention, but there is excessive
      queueing and dequeuing of stripes on this list.  A stripe will currently be
      queued on the release_list with a stripe reference count > 1.  This can cause
      the raid5 kernel thread(s) to dequeue the stripe and decrement the refcount
      without doing any other useful processing of the stripe.  There are two cases
      when the stripe can be put on the release_list multiple times before it is
      actually handled by the kernel thread(s).
      1) make_request() activates the stripe processing in 4k increments.  When a
         write request is large enough to span multiple chunks of a stripe_head, the
         first 4k chunk adds the stripe to the plug list.  The next 4k chunk that is
         processed for the same stripe puts the stripe on the release_list with a
         refcount=2.  This can cause the kernel thread to process and decrement the
         stripe before the stripe is unplugged, which again will put it back on the
         release_list.
      2) Whenever IO is scheduled on a stripe (pre-read and/or write), the stripe
         refcount is set to the number of active IO (for each chunk).  The stripe is
         released as each IO completes, and can be queued and dequeued multiple times
         on the release_list, until its refcount finally reaches zero.
      
      This simple patch will ensure a stripe is only queued on the release_list when
      its refcount=1 and is ready to be handled by the kernel thread(s).  I added some
      instrumentation to raid5 and counted the number of times stripes were queued on
      the release_list for a variety of write IO sizes.  Without this patch the number
      of times stripes got queued on the release_list was 100-500% higher than with
      the patch.  The excess queuing will increase with the IO size.  The patch also
      improved throughput by 5-10%.
      Signed-off-by: Eivind Sarto <esarto@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
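The effect of queueing a stripe only on its last reference can be illustrated with a toy counter. The following is a deliberately simplified Python model, not the kernel logic. It assumes references are dropped one at a time and that, without the patch, the raid5 thread dequeues the stripe between every two releases (the worst case). The function name is hypothetical.

```python
def count_release_list_queueings(refcount: int, only_queue_last_ref: bool) -> int:
    """Count how often one stripe lands on the release_list while `refcount`
    references are dropped one after another (toy model, see assumptions)."""
    queued = 0
    while refcount > 0:
        if only_queue_last_ref and refcount > 1:
            # Patched behaviour: drop the reference directly, no queueing.
            refcount -= 1
            continue
        # Stripe goes onto the release_list; the raid5 thread dequeues it
        # and drops this reference on the caller's behalf.
        queued += 1
        refcount -= 1
    return queued


for refs in (2, 4, 8):
    before = count_release_list_queueings(refs, only_queue_last_ref=False)
    after = count_release_list_queueings(refs, only_queue_last_ref=True)
    print(refs, before, after)
```

In this model the unpatched count grows with the number of references per stripe while the patched count stays at one, which is in line with the commit's remark that the excess queueing increases with the IO size.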
    • md/raid56: Don't perform reads to support writes until stripe is ready. · 67f45548
      NeilBrown committed
      If it is found that we need to pre-read some blocks before a write
      can succeed, we normally set STRIPE_DELAYED and don't actually perform
      the read until STRIPE_PREREAD_ACTIVE subsequently gets set.
      
      However for a degraded RAID6 we currently perform the reads as soon
      as we see that a write is pending.  This significantly hurts
      throughput.
      
      So:
       - when handle_stripe_dirtying finds a block that it wants on a device
         that is failed, set STRIPE_DELAYED, instead of doing nothing, and
       - when fetch_block detects that a read might be required to satisfy a
         write, only perform the read if STRIPE_PREREAD_ACTIVE is set,
         and if we would actually need to read something to complete the write.
      
      This also helps RAID5, though less often as RAID5 supports a
      read-modify-write cycle.  For RAID5 the read is performed too early
      only if the write is not a full 4K aligned write (i.e. not an
      R5_OVERWRITE).
      
      Also clean up a couple of horrible bits of formatting.
      Reported-by: Patrik Horník <patrik@dsl.sk>
      Signed-off-by: NeilBrown <neilb@suse.de>
  15. 18 Apr, 2014 1 commit
  16. 17 Apr, 2014 1 commit