1. 11 Oct 2012, 1 commit
    • MD: raid5 trim support · 620125f2
      Shaohua Li authored
      
      Discard for raid4/5/6 has limitations. If the discard request size is
      small, we discard one disk, but we still need to calculate parity and
      write the parity disk.  To correctly calculate parity, zero_after_discard
      must be guaranteed. Even if it's true, we would discard one disk
      but write the other disks, which makes the parity disks wear out
      fast. This doesn't make sense. So an efficient discard for raid4/5/6
      should discard all data disks and parity disks, which requires the
      write pattern to be (A, A+chunk_size, A+chunk_size*2...). If A's size
      is smaller than chunk_size, such a pattern is almost impossible in
      practice. So in this patch, I only handle the case where A's size
      equals chunk_size. That is, a discard request should be aligned to
      the stripe size and its size should be a multiple of the stripe size.
      
      Since we can only handle requests with a specific alignment and size
      (or the part of a request that fits whole stripes), we can't guarantee
      zero_after_discard even if zero_after_discard is true in the low-level
      drives.
      
      The block layer doesn't send down correctly aligned requests even when
      the correct discard alignment is set, so I must filter out misaligned
      requests.
      
      For the raid4/5/6 parity calculation, if the data is 0, the parity is
      0. So if zero_after_discard is true for all disks, the data is
      consistent after a discard.  Otherwise, data might be lost. Consider
      this scenario: discard a stripe, then write data to one disk and write
      the parity disk. Until then the stripe could still be inconsistent,
      depending on whether data from the other data disks or the parity
      disks is used to calculate the new parity. If a disk is broken, we
      can't restore it. So in this patch, we only enable discard support if
      all disks have zero_after_discard.
      
      If discard fails on one disk, we face the same inconsistency issue as
      above. The patch makes discard follow the same path as a normal write
      request: if a discard fails, a resync is scheduled to make the data
      consistent. Extra writes aren't ideal, but data consistency is
      important.
      
      If a subsequent read/write request hits the raid5 cache of a discarded
      stripe, the discarded dev pages should be zero-filled so the data stays
      consistent. This patch always zeroes the dev pages of a discarded
      stripe. This isn't optimal because a discard request doesn't need such
      a payload; the next patch will avoid it.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      620125f2
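
      As an editorial illustration of the alignment rule described above,
      here is a minimal userspace sketch (function and parameter names are
      illustrative, not the kernel's): a discard can be handled efficiently
      only if it starts on a stripe boundary and covers whole stripes.

      #include <stdint.h>
      #include <stdbool.h>

      /* stripe_sectors = chunk_sectors * data_disks; only requests that
       * start on a stripe boundary and cover whole stripes can be
       * discarded on every data disk plus the parity disk(s). */
      static bool discard_covers_whole_stripes(uint64_t start_sector,
                                               uint64_t nr_sectors,
                                               unsigned int chunk_sectors,
                                               unsigned int data_disks)
      {
          uint64_t stripe_sectors = (uint64_t)chunk_sectors * data_disks;

          if (start_sector % stripe_sectors)          /* misaligned start */
              return false;
          return (nr_sectors % stripe_sectors) == 0;  /* whole stripes only */
      }
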
  2. 02 Aug 2012, 1 commit
    • raid5: make_request use batch stripe release · 8811b596
      Shaohua Li authored
      make_request() does a stripe release for every stripe, and the stripe
      usually has count 1, which defeats the previous release_stripe()
      optimization. In my test, this release_stripe() becomes the heaviest
      place to take conf->device_lock after the previous patches are applied.
      
      The patch below batches stripe releases: all the stripes are released
      at unplug time. The STRIPE_ON_UNPLUG_LIST bit protects concurrent
      access to the stripe lru.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      8811b596
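
      A simplified model of the batching described above (not the kernel
      code; names are illustrative): make_request() queues stripes on a
      per-plug list instead of releasing them one by one, and unplug
      releases the whole batch under a single conf->device_lock
      acquisition, with an ON_UNPLUG_LIST bit preventing double-queueing.

      #include <stddef.h>

      #define ON_UNPLUG_LIST 0x1

      struct stripe {
          unsigned int state;
          int count;                 /* reference count */
          struct stripe *plug_next;  /* per-plug list linkage */
      };

      static void release_stripe_batched(struct stripe **plug_list,
                                         struct stripe *sh)
      {
          if (sh->state & ON_UNPLUG_LIST)
              return;                        /* already queued for this plug */
          sh->state |= ON_UNPLUG_LIST;
          sh->plug_next = *plug_list;
          *plug_list = sh;
      }

      static void unplug_release_all(struct stripe **plug_list)
      {
          /* the real code takes conf->device_lock once for the whole batch */
          while (*plug_list) {
              struct stripe *sh = *plug_list;
              *plug_list = sh->plug_next;
              sh->state &= ~ON_UNPLUG_LIST;
              sh->count--;                   /* stands in for __release_stripe() */
          }
      }
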
  3. 31 Jul 2012, 1 commit
  4. 19 Jul 2012, 1 commit
    • raid5: add a per-stripe lock · b17459c0
      Shaohua Li authored
      Add a per-stripe lock to protect stripe-specific data. The purpose is to
      reduce lock contention on conf->device_lock.
      
      The stripe's ->toread and ->towrite are protected by the per-stripe
      lock.  Access to the stripe's bio lists is always serialized by this
      lock, so adding a bio to the lists (add_stripe_bio()) and removing a
      bio from the lists (like ops_run_biofill()) do not race.
      
      If the bios in the ->read, ->written ... lists are not shared by
      multiple stripes, we don't need any lock to protect ->read and
      ->written, because STRIPE_ACTIVE will protect them. If the bios are
      shared, there are two protections:
      1. bi_phys_segments acts as a reference count
      2. list traversal uses r5_next_bio, so a traversal never touches a bio
      that doesn't belong to the stripe
      
      Let's have an example:
      |  stripe1 |  stripe2    |  stripe3  |
      ...bio1......|bio2|bio3|....bio4.....
      
      stripe2 has 4 bios; when it finishes, it will decrement bi_phys_segments
      for all of them, but only end_bio for bio2 and bio3. bio1->bi_next still
      points to bio2, but this doesn't matter. When stripe1 finishes, it will
      not touch bio2 because of the r5_next_bio check; stripe1 will end_bio
      for bio1 and stripe3 will end_bio for bio4.
      
      Before add_stripe_bio() adds a bio to a stripe, we have already
      incremented the bio's bi_phys_segments, so we don't need to worry about
      other stripes releasing the bio.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      b17459c0
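
      A toy userspace model of the shared-bio rules above (field and helper
      names are illustrative): bi_phys_segments is modelled as a plain
      reference count, and the r5_next_bio-style check stops a stripe from
      ever touching a bio that doesn't overlap it.

      #include <stdio.h>

      struct toy_bio {
          unsigned long sector;     /* start sector */
          int refs;                 /* stands in for bi_phys_segments */
          struct toy_bio *next;
      };

      /* like r5_next_bio(): follow ->next only while the next bio still
       * starts inside this stripe's [base, base + stripe_sectors) range */
      static struct toy_bio *next_bio_in_stripe(struct toy_bio *bio,
                                                unsigned long base,
                                                unsigned long stripe_sectors)
      {
          if (bio->next && bio->next->sector < base + stripe_sectors)
              return bio->next;
          return NULL;
      }

      /* when a stripe finishes, drop one reference on every overlapping
       * bio and end only those whose count reaches zero */
      static void stripe_done(struct toy_bio *first, unsigned long base,
                              unsigned long stripe_sectors)
      {
          struct toy_bio *b;

          for (b = first; b; b = next_bio_in_stripe(b, base, stripe_sectors))
              if (--b->refs == 0)
                  printf("end_bio for bio at sector %lu\n", b->sector);
      }
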
  5. 22 May 2012, 1 commit
  6. 21 May 2012, 1 commit
  7. 23 Dec 2011, 4 commits
    • md/raid5: detect and handle replacements during recovery. · 9a3e1101
      NeilBrown authored
      During recovery we want to write to the replacement but not
      the original.  So we have two new flags
       - R5_NeedReplace if this stripe has a replacement that needs to
         be written at some stage
       - R5_WantReplace if NeedReplace, and the data is available, and
         a 'sync' has been requested on this stripe.
      
      We also distinguish between 'sync and replace' which need to read
      all other devices, and 'replace' which only needs to read the
      devices being replaced.
      
      Note that during resync we always write to any replacement device.
      It might not need to be written to, but as we don't read to compare,
      we have to write to be sure.
      Signed-off-by: NeilBrown <neilb@suse.de>
      9a3e1101
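
      A rough sketch of the decision described above (simplified; the flag
      values and the helper are illustrative, not the actual handle_stripe()
      code): R5_NeedReplace marks a slot whose replacement still has to be
      written, and R5_WantReplace is set once the data is available and a
      sync has been requested.

      enum r5dev_flags_example {
          R5_NeedReplace_e = 1 << 0,
          R5_WantReplace_e = 1 << 1,
      };

      static unsigned int maybe_want_replace(unsigned int flags,
                                             int data_uptodate,
                                             int sync_requested)
      {
          if ((flags & R5_NeedReplace_e) && data_uptodate && sync_requested)
              flags |= R5_WantReplace_e;     /* write the replacement now */
          return flags;
      }
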
    • md/raid5: writes should get directed to replacement as well as original. · 977df362
      NeilBrown authored
      When writing, we need to submit two writes, one to the original, and
      one to the replacement - if there is a replacement.
      
      If the write to the replacement results in a write error, we just fail
      the device.  We only try to record write errors to the original.
      
      When writing for recovery, we shouldn't write to the original.  This
      will be addressed in a subsequent patch that generally addresses
      recovery.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      977df362
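
      A minimal sketch of the "two writes per slot" idea (all names here are
      illustrative): if a replacement device is present, the same data is
      submitted to both the original and the replacement; only errors on the
      original are recorded.

      #include <stdio.h>

      struct device { const char *name; };

      struct slot {
          struct device *rdev;          /* original */
          struct device *replacement;   /* may be NULL */
      };

      /* stand-in for handing a bio to the member device */
      static void submit_write(struct device *dev, unsigned long sector)
      {
          printf("write sector %lu to %s\n", sector, dev->name);
      }

      static void write_slot(struct slot *s, unsigned long sector)
      {
          submit_write(s->rdev, sector);
          if (s->replacement)
              submit_write(s->replacement, sector);
      }
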
    • md/raid5: raid5.h cleanup · ede7ee8b
      NeilBrown authored
      Remove some #defines that are no longer used, and replace some
      others with an enum.  Also remove an unused field.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      ede7ee8b
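
      An illustrative example (not the actual raid5.h diff) of the kind of
      change this describes: a block of related #defines becomes an enum,
      which gives the constants a type that debuggers and sparse can see.

      /* before: untyped preprocessor constants
       *   #define CHECK_STATE_IDLE    0
       *   #define CHECK_STATE_RUN     1
       *   #define CHECK_STATE_RESULT  2
       */

      /* after: one typed enum */
      enum check_state_example {
          CHECK_STATE_IDLE = 0,
          CHECK_STATE_RUN,
          CHECK_STATE_RESULT,
      };
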
    • md/raid5: allow each slot to have an extra replacement device · 671488cc
      NeilBrown authored
      Just enhance data structures to record a second device per slot to be
      used as a 'replacement' device, replacing the original.
      We also have a second bio in each slot in each stripe_head.  This will
      only be used when writing to the array: we need to write to both the
      original and the replacement at the same time, so we need two bios.
      
      For now, only try using the replacement drive for aligned-reads.
      In this case, we prefer the replacement if it has been recovered far
      enough, otherwise use the original.
      
      This includes a small enhancement.  Previously we would only do
      aligned reads if the target device was fully recovered.  Now we also
      do them if it has recovered far enough.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      671488cc
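
      A sketch of the data-structure change and the aligned-read preference
      described above (structure and field names are illustrative): each
      slot gains a second device pointer, each stripe slot a second bio,
      and aligned reads use the replacement once it has recovered past the
      sectors being read.

      struct md_dev { int dummy; };
      struct bio_stub { unsigned long long sector; };

      struct disk_slot {
          struct md_dev *rdev;           /* original device */
          struct md_dev *replacement;    /* new: may be NULL */
      };

      struct stripe_dev {
          struct bio_stub req;           /* write to the original */
          struct bio_stub rreq;          /* new: write to the replacement */
      };

      /* prefer the replacement for an aligned read if it has recovered
       * at least up to the end of the region we want to read */
      static struct md_dev *aligned_read_target(struct disk_slot *s,
                                                unsigned long long end_sector,
                                                unsigned long long repl_recovery_offset)
      {
          if (s->replacement && repl_recovery_offset >= end_sector)
              return s->replacement;
          return s->rdev;
      }
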
  8. 11 Oct 2011, 4 commits
  9. 28 Jul 2011, 3 commits
  10. 26 Jul 2011, 4 commits
    • md/raid5: add some more fields to stripe_head_state · c5709ef6
      NeilBrown authored
      Adding these three fields will allow more common code to be moved
      to handle_stripe().
      
      struct field rearrangement by Namhyung Kim.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
      c5709ef6
    • md/raid5: unify stripe_head_state and r6_state · f2b3b44d
      NeilBrown authored
      'struct stripe_head_state' stores state about the 'current' stripe
      that is passed around while handling the stripe.
      For RAID6 there is an extension structure: r6_state, which is also
      passed around.
      There is no value in keeping these separate, so move the fields from
      the latter into the former.
      
      This means that all code now needs to treat s->failed_num as a small
      array, but this is a small cost.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
      f2b3b44d
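
      A sketch of the unified structure (a subset of fields, purely
      illustrative): keeping failed-device numbers as a two-entry array lets
      the same code handle one failure (RAID5) or two (RAID6).

      struct stripe_head_state_example {
          int syncing, expanding, expanded;
          int locked, uptodate, to_read, to_write, written;
          int failed;
          int failed_num[2];   /* absorbed from the old r6_state */
          int p_failed, q_failed;
      };
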
    • md/raid5: replace sh->lock with an 'active' flag. · c4c1663b
      NeilBrown authored
      sh->lock is now mainly used to ensure that two threads aren't running
      in the locked part of handle_stripe[56] at the same time.
      
      That can more neatly be achieved with an 'active' flag which we set
      while running handle_stripe.  If we find the flag is set, we simply
      requeue the stripe for later by setting STRIPE_HANDLE.
      
      For safety we take ->device_lock while examining the state of the
      stripe and creating a summary in 'stripe_head_state / r6_state'.
      This possibly isn't needed, but since shared fields like ->toread and
      ->towrite are checked, it is safer for now at least.
      
      We leave the label after the old 'unlock' called "unlock" because it
      will disappear in a few patches, so renaming seems pointless.
      
      This leaves the stripe 'locked' for longer as we clear STRIPE_ACTIVE
      later, but that is not a problem.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
      c4c1663b
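
      A simplified picture of the new scheme (userspace sketch using C11
      atomics in place of the kernel's bitops; constants are illustrative):
      handle_stripe() claims the stripe with an atomic test-and-set of
      STRIPE_ACTIVE, and if someone else already holds it, the stripe is
      just re-queued via STRIPE_HANDLE.

      #include <stdatomic.h>

      #define STRIPE_ACTIVE_BIT (1UL << 0)
      #define STRIPE_HANDLE_BIT (1UL << 1)

      struct stripe { atomic_ulong state; };

      static void handle_stripe_example(struct stripe *sh)
      {
          if (atomic_fetch_or(&sh->state, STRIPE_ACTIVE_BIT) & STRIPE_ACTIVE_BIT) {
              /* another thread is handling it: ask for another pass later */
              atomic_fetch_or(&sh->state, STRIPE_HANDLE_BIT);
              return;
          }

          /* ... analyse the stripe and start any needed operations ... */

          atomic_fetch_and(&sh->state, ~STRIPE_ACTIVE_BIT);  /* clear STRIPE_ACTIVE */
      }
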
    • md/raid5: Remove use of sh->lock in sync_request · 83206d66
      NeilBrown authored
      This is the start of a series of patches to remove sh->lock.
      
      sync_request takes sh->lock before setting STRIPE_SYNCING to ensure
      there is no race with testing it in handle_stripe[56].
      
      Instead, use a new flag STRIPE_SYNC_REQUESTED and test it early
      in handle_stripe[56] (after getting the same lock) and perform the
      same set/clear operations if it was set.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
      83206d66
  11. 18 Apr 2011, 1 commit
    • md - remove old plugging code. · 482c0834
      NeilBrown authored
      md has some plugging infrastructure for RAID5 to use because the
      normal plugging infrastructure required a 'request_queue', and when
      called from dm, RAID5 doesn't have one of those available.
      
      This relied on the ->unplug_fn callback which doesn't exist any more.
      
      So remove all of that code, both in md and raid5.  Subsequent patches
      will restore the plugging functionality.
      Signed-off-by: NeilBrown <neilb@suse.de>
      482c0834
  12. 10 Mar 2011, 1 commit
  13. 10 Sep 2010, 1 commit
    • md: implement REQ_FLUSH/FUA support · e9c7469b
      Tejun Heo authored
      This patch converts md to support REQ_FLUSH/FUA instead of the
      now-deprecated REQ_HARDBARRIER.  In the core part (md.c), the
      following changes are notable.
      
      * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
        processing of other requests and thus there is no reason to mark the
        queue congested while FLUSH/FUA is in progress.
      
      * REQ_FLUSH/FUA failures are final and their users don't need retry
        logic.  Retry logic is removed.
      
      * Preflush needs to be issued to all member devices but FUA writes can
        be handled the same way as other writes - their processing can be
        deferred to request_queue of member devices.  md_barrier_request()
        is renamed to md_flush_request() and simplified accordingly.
      
      For linear, raid0 and multipath, the core changes are enough.  raid1,
      5 and 10 need the following conversions.
      
      * raid1: Handling of FLUSH/FUA bio's can simply be deferred to
        request_queues of member devices.  Barrier related logic removed.
      
      * raid5: Queue draining logic dropped.  The FUA bit is propagated
        through biodrain and stripe reconstruction such that all the updated
        parts of the stripe are written out with FUA writes if any of the
        dirtying writes was FUA.  preread_active_stripes handling in
        make_request() is updated as suggested by Neil Brown.
      
      * raid10: FUA bit needs to be propagated to write clones.
      
      linear, raid0, 1, 5 and 10 tested.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      e9c7469b
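
      A toy illustration of the raid5 part of this (flag value and names are
      illustrative, not the block-layer constants): if any bio that dirtied
      a stripe carried FUA, every write issued for that stripe's updated
      blocks carries FUA as well.

      #define TOY_REQ_FUA (1u << 0)

      struct toy_stripe { unsigned int rw_flags; };

      /* called as each incoming write bio is drained into the stripe cache */
      static void biodrain_example(struct toy_stripe *sh, unsigned int bio_flags)
      {
          if (bio_flags & TOY_REQ_FUA)
              sh->rw_flags |= TOY_REQ_FUA;   /* remember the FUA request */
      }

      /* flags applied to every member-device write of the updated stripe */
      static unsigned int outgoing_write_flags(const struct toy_stripe *sh)
      {
          return sh->rw_flags & TOY_REQ_FUA;
      }
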
  14. 26 Jul 2010, 4 commits
  15. 21 Jul 2010, 1 commit
  16. 17 Feb 2010, 1 commit
    • percpu: add __percpu sparse annotations to what's left · a29d8b8e
      Tejun Heo authored
      Add __percpu sparse annotations to places which didn't make it in one
      of the previous patches.  All conversions are trivial.
      
      These annotations are to make sparse consider percpu variables to be
      in a different address space and warn if accessed without going
      through percpu accessors.  This patch doesn't affect normal builds.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Neil Brown <neilb@suse.de>
      a29d8b8e
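
      An example of what the annotation looks like (kernel-style sketch,
      needs <linux/percpu.h>; the structs here are illustrative): the
      __percpu marker tells sparse that the pointer lives in the per-cpu
      address space, so dereferencing it without the per-cpu accessors
      draws a warning.

      #include <linux/percpu.h>

      struct raid5_percpu_example {
          void *scribble;
      };

      struct conf_example {
          struct raid5_percpu_example __percpu *percpu;   /* annotated pointer */
      };

      /*
       *   conf->percpu = alloc_percpu(struct raid5_percpu_example);
       *   p = per_cpu_ptr(conf->percpu, cpu);   // fine: goes through the accessor
       *   p = conf->percpu;                     // sparse now warns about this
       */
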
  17. 16 Oct 2009, 2 commits
    • md: fix problems with RAID6 calculations for DDF. · e4424fee
      NeilBrown authored
      Signed-off-by: NeilBrown <neilb@suse.de>
      e4424fee
    • md/raid456: downlevel multicore operations to raid_run_ops · 417b8d4a
      Dan Williams authored
      The percpu conversion allowed a straightforward handoff of stripe
      processing to the async subsystem that initially showed some modest
      gains (+4%).  However, this model is too simplistic and leads to
      stripes bouncing between raid5d and the async thread pool for every
      invocation of handle_stripe().  As reported by Holger, this can fall
      into a pathological situation severely impacting throughput (6x
      performance loss).
      
      By downleveling the parallelism to raid_run_ops the pathological
      stripe_head bouncing is eliminated.  This version still exhibits an
      average 11% throughput loss for:
      
      	mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
      	echo 1024 > /sys/block/md0/md/stripe_cache_size
      	dd if=/dev/zero of=/dev/md0 bs=1024k count=2048
      
      ...but the results are at least stable and can be used as a base for
      further multicore experimentation.
      Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      417b8d4a
  18. 30 Aug 2009, 4 commits
    • md/raid6: asynchronous raid6 operations · ac6b53b6
      Dan Williams authored
      [ Based on an original patch by Yuri Tikhonov ]
      
      The raid_run_ops routine uses the asynchronous offload api and
      the stripe_operations member of a stripe_head to carry out xor+pq+copy
      operations asynchronously, outside the lock.
      
      The operations performed by RAID-6 are the same as in the RAID-5 case,
      except that STRIPE_OP_PREXOR operations are not supported. All the
      others are supported:
      STRIPE_OP_BIOFILL
       - copy data into request buffers to satisfy a read request
      STRIPE_OP_COMPUTE_BLK
       - generate missing blocks (1 or 2) in the cache from the other blocks
      STRIPE_OP_BIODRAIN
       - copy data out of request buffers to satisfy a write request
      STRIPE_OP_RECONSTRUCT
       - recalculate parity for new data that has entered the cache
      STRIPE_OP_CHECK
       - verify that the parity is correct
      
      The flow is the same as in the RAID-5 case, and reuses some routines, namely:
      1/ ops_complete_postxor (renamed to ops_complete_reconstruct)
      2/ ops_complete_compute (updated to set up to 2 targets uptodate)
      3/ ops_run_check (renamed to ops_run_check_p for xor parity checks)
      
      [neilb@suse.de: fixes to get it to pass mdadm regression suite]
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
      Signed-off-by: Ilya Yanok <yanok@emcraft.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      
      
      
      ac6b53b6
    • async_tx: add sum check flags · ad283ea4
      Dan Williams authored
      Replace the flat zero_sum_result with a collection of flags to contain
      the P (xor) zero-sum result and the soon-to-be-utilized Q (raid6
      Reed-Solomon syndrome) zero-sum result.  Use the SUM_CHECK_ namespace
      instead of DMA_ since these flags will be used on platforms without
      dma-based zero-sum support.
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      ad283ea4
    • md/raid5,6: add percpu scribble region for buffer lists · d6f38f31
      Dan Williams authored
      Use percpu memory rather than stack for storing the buffer lists used in
      parity calculations.  Include space for dma address conversions and pass
      that to async_tx via the async_submit_ctl.scribble pointer.
      
      [ Impact: move memory pressure from stack to heap ]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      
      
      d6f38f31
    • md/raid6: move the spare page to a percpu allocation · 36d1c647
      Dan Williams authored
      In preparation for asynchronous handling of raid6 operations move the
      spare page to a percpu allocation to allow multiple simultaneous
      synchronous raid6 recovery operations.
      
      Make this allocation cpu hotplug aware to maximize allocation
      efficiency.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      
      
      36d1c647
  19. 18 Jun 2009, 1 commit
  20. 16 Jun 2009, 1 commit
    • md: remove mddev_to_conf "helper" macro · 070ec55d
      NeilBrown authored
      Having a macro just to cast a void* isn't really helpful.
      I would much rather see that we are simply dereferencing ->private
      than have to know what the macro does.
      
      So open-code the macro everywhere and remove the pointless cast.
      Signed-off-by: NeilBrown <neilb@suse.de>
      070ec55d
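
      The change in miniature (types abbreviated and renamed for
      illustration): instead of a macro hiding a void* cast, just
      dereference ->private.

      /* before:
       *   #define mddev_to_conf(mddev) ((raid5_conf_t *) (mddev)->private)
       *   raid5_conf_t *conf = mddev_to_conf(mddev);
       */

      /* after: open-coded dereference */
      struct mddev_example { void *private; };
      typedef struct { int raid_disks; } raid5_conf_example_t;

      static int raid_disks_of(struct mddev_example *mddev)
      {
          raid5_conf_example_t *conf = mddev->private;
          return conf->raid_disks;
      }
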
  21. 31 Mar 2009, 2 commits
    • md/raid5: revise rules for when to update metadata during reshape · c8f517c4
      NeilBrown authored
      We currently update the metadata:
       1/ every 3Megabytes
       2/ When the place we will write new-layout data to is recorded in
          the metadata as still containing old-layout data.
      
      Rule one exists to avoid having to re-do too much reshaping in the
      face of a crash/restart.  So it should really be time based rather
      than size based.  So change it to "every 10 seconds".
      
      Rule two turns out to be too harsh when restriping an array
      'in-place', as in that case the metadata must be updated for every
      stripe.
      For the in-place update, it can only possibly be safe from a crash if
      some user-space program takes a backup of, say, every few hundred
      stripes before allowing them to be reshaped.  In that case, the
      constant metadata updates are pointless.
      So only update the metadata if the new metadata will report that the
      end of the 'old-layout' data is beyond where we are currently
      writing 'new-layout' data.
      Signed-off-by: NeilBrown <neilb@suse.de>
      c8f517c4
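
      A sketch of rule one only, since rule two depends on the reshape
      direction (names are illustrative, not the real reshape code): the
      checkpoint becomes time-based, roughly one metadata update per 10
      seconds instead of one every 3 megabytes.

      #include <stdbool.h>

      static bool checkpoint_due(unsigned long now_jiffies,
                                 unsigned long last_checkpoint_jiffies,
                                 unsigned long hz)
      {
          /* true once at least 10 seconds have passed since the last update */
          return now_jiffies - last_checkpoint_jiffies >= 10 * hz;
      }
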
    • md/raid5: prepare for allowing reshape to change layout · e183eaed
      NeilBrown authored
      Add prev_algo to raid5_conf_t along the same lines as prev_chunk
      and previous_raid_disks.
      Signed-off-by: NeilBrown <neilb@suse.de>
      e183eaed