1. 31 3月, 2009 4 次提交
  2. 25 2月, 2009 3 次提交
    • N
      md: avoid races when stopping resync. · 73d5c38a
      NeilBrown 提交于
      There has been a race in raid10 and raid1 for a long time
      which has only recently started showing up due to a scheduler changed.
      
      When a sync_read request finishes, as soon as reschedule_retry
      is called, another thread can mark the resync request as having
      completed, so md_do_sync can finish, ->stop can be called, and
      ->conf can be freed.  So using conf after reschedule_retry is not
      safe.
      
      Similarly, when finishing a sync_write, calling md_done_sync must be
      the last thing we do, as it allows a chain of events which will free
      conf and other data structures.
      
      The first of these requires action in raid10.c
      The second requires action in raid1.c and raid10.c
      
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      73d5c38a
    • N
      md/raid10: Don't call bitmap_cond_end_sync when we are doing recovery. · 78200d45
      NeilBrown 提交于
      For raid1/4/5/6, resync (fixing inconsistencies between devices) is
      very similar to recovery (rebuilding a failed device onto a spare).
      The both walk through the device addresses in order.
      
      For raid10 it can be quite different.  resync follows the 'array'
      address, and makes sure all copies are the same.  Recover walks
      through 'device' addresses and recreates each missing block.
      
      The 'bitmap_cond_end_sync' function allows the write-intent-bitmap
      (When present) to be updated to reflect a partially completed resync.
      It makes assumptions which mean that it does not work correctly for
      raid10 recovery at all.
      
      In particularly, it can cause bitmap-directed recovery of a raid10 to
      not recovery some of the blocks that need to be recovered.
      
      So move the call to bitmap_cond_end_sync into the resync path, rather
      than being in the common "resync or recovery" path.
      
      
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      78200d45
    • N
      md/raid10: Don't skip more than 1 bitmap-chunk at a time during recovery. · 09b4068a
      NeilBrown 提交于
      When doing recovery on a raid10 with a write-intent bitmap, we only
      need to recovery chunks that are flagged in the bitmap.
      
      However if we choose to skip a chunk as it isn't flag, the code
      currently skips the whole raid10-chunk, thus it might not recovery
      some blocks that need recovering.
      
      This patch fixes it.
      
      In case that is confusing, it might help to understand that there
      is a 'raid10 chunk size' which guides how data is distributed across
      the devices, and a 'bitmap chunk size' which says how much data
      corresponds to a single bit in the bitmap.
      
      This bug only affects cases where the bitmap chunk size is smaller
      than the raid10 chunk size.
      
      
      
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      09b4068a
  3. 09 1月, 2009 1 次提交
    • C
      md: use list_for_each_entry macro directly · 159ec1fc
      Cheng Renquan 提交于
      The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
      list_for_each_entry_safe, from <linux/list.h>, it should be defined to
      use list_for_each_entry_safe, instead of reinventing the wheel.
      
      But some calls to each_entry_safe don't really need a safe version,
      just a direct list_for_each_entry is enough, this could save a temp
      variable (tmp) in every function that used rdev_for_each.
      
      In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
      totally save many tmp vars; and only in the other situations that will call
      list_del to delete an entry, the safe version is used.
      Signed-off-by: NCheng Renquan <crquan@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      159ec1fc
  4. 06 11月, 2008 1 次提交
    • N
      md: fix bug in raid10 recovery. · a53a6c85
      NeilBrown 提交于
      Adding a spare to a raid10 doesn't cause recovery to start.
      This is due to an silly type in
        commit 6c2fce2e
      and so is a bug in 2.6.27 and .28-rc.
      
      Thanks to Thomas Backlund for bisecting to find this.
      
      Cc: Thomas Backlund <tmb@mandriva.org>
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a53a6c85
  5. 15 10月, 2008 1 次提交
    • S
      md: build failure due to missing delay.h · 25570727
      Stephen Rothwell 提交于
      Today's linux-next build (powerpc ppc64_defconfig) failed like this:
      
      drivers/md/raid1.c: In function 'sync_request':
      drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible'
      make[3]: *** [drivers/md/raid1.o] Error 1
      make[3]: *** Waiting for unfinished jobs....
      drivers/md/raid10.c: In function 'sync_request':
      drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible'
      make[3]: *** [drivers/md/raid10.o] Error 1
      drivers/md/md.c: In function 'md_do_sync':
      drivers/md/md.c:5915: error: implicit declaration of function 'msleep'
      
      Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove
      unnecessary #includes, #defines, and function declarations").  I added
      the following patch.
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      25570727
  6. 13 10月, 2008 1 次提交
    • N
      md: Relax minimum size restrictions on chunk_size. · 4bbf3771
      NeilBrown 提交于
      Currently, the 'chunk_size' of an array must be at-least PAGE_SIZE.
      
      This makes moving an array to a machine with a larger PAGE_SIZE, or
      changing the kernel to use a larger PAGE_SIZE, can stop an array from
      working.
      
      For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync
      process works on whole pages at a time, and assumes them to be wholly
      within a stripe.  For other raid personalities, this restriction is
      not needed at all and can be dropped.
      
      So remove the test on chunk_size from common can, and add it in just
      the places where it is needed: raid10 and raid4/5/6.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      4bbf3771
  7. 09 10月, 2008 5 次提交
    • D
      block: mark bio_split_pool static · 6feef531
      Denis ChengRq 提交于
      Since all bio_split calls refer the same single bio_split_pool, the bio_split
      function can use bio_split_pool directly instead of the mempool_t parameter;
      
      then the mempool_t parameter can be removed from bio_split param list, and
      bio_split_pool is only referred in fs/bio.c file, can be marked static.
      Signed-off-by: NDenis ChengRq <crquan@gmail.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      6feef531
    • T
      block: move stats from disk to part0 · 074a7aca
      Tejun Heo 提交于
      Move stats related fields - stamp, in_flight, dkstats - from disk to
      part0 and unify stat handling such that...
      
      * part_stat_*() now updates part0 together if the specified partition
        is not part0.  ie. part_stat_*() are now essentially all_stat_*().
      
      * {disk|all}_stat_*() are gone.
      
      * part_round_stats() is updated similary.  It handles part0 stats
        automatically and disk_round_stats() is killed.
      
      * part_{inc|dec}_in_fligh() is implemented which automatically updates
        part0 stats for parts other than part0.
      
      * disk_map_sector_rcu() is updated to return part0 if no part matches.
        Combined with the above changes, this makes NULL special case
        handling in callers unnecessary.
      
      * Separate stats show code paths for disk are collapsed into part
        stats show code paths.
      
      * Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
      
      While at it, reposition stat handling macros a bit and add missing
      parentheses around macro parameters.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      074a7aca
    • T
      block: fix diskstats access · c9959059
      Tejun Heo 提交于
      There are two variants of stat functions - ones prefixed with double
      underbars which don't care about preemption and ones without which
      disable preemption before manipulating per-cpu counters.  It's unclear
      whether the underbarred ones assume that preemtion is disabled on
      entry as some callers don't do that.
      
      This patch unifies diskstats access by implementing disk_stat_lock()
      and disk_stat_unlock() which take care of both RCU (for partition
      access) and preemption (for per-cpu counter access).  diskstats access
      should always be enclosed between the two functions.  As such, there's
      no need for the versions which disables preemption.  They're removed
      and double underbars ones are renamed to drop the underbars.  As an
      extra argument is added, there's no danger of using the old version
      unconverted.
      
      disk_stat_lock() uses get_cpu() and returns the cpu index and all
      diskstat functions which access per-cpu counters now has @cpu
      argument to help RT.
      
      This change adds RCU or preemption operations at some places but also
      collapses several preemption ops into one at others.  Overall, the
      performance difference should be negligible as all involved ops are
      very lightweight per-cpu ones.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      c9959059
    • J
      block: raid fixups for removal of bi_hw_segments · 960e739d
      Jens Axboe 提交于
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      960e739d
    • M
      drop vmerge accounting · 5df97b91
      Mikulas Patocka 提交于
      Remove hw_segments field from struct bio and struct request. Without virtual
      merge accounting they have no purpose.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      5df97b91
  8. 05 8月, 2008 1 次提交
    • N
      Allow raid10 resync to happening in larger chunks. · 0310fa21
      NeilBrown 提交于
      The raid10 resync/recovery code currently limits the amount of
      in-flight resync IO to 2Meg.  This was copied from raid1 where
      it seems quite adequate.  However for raid10, some layouts require
      a bit of seeking to perform a resync, and allowing a larger buffer
      size means that the seeking can be significantly reduced.
      
      There is probably no real need to limit the amount of in-flight
      IO at all.  Any shortage of memory will naturally reduce the
      amount of buffer space available down to a set minimum, and any
      concurrent normal IO will quickly cause resync IO to back off.
      
      The only problem would be that normal IO has to wait for all resync IO
      to finish, so a very large amount of resync IO could cause unpleasant
      latency when normal IO starts up.
      
      So: increase RESYNC_DEPTH to allow 32Meg of buffer (if memory is
      available) which seems to be a good amount.  Also reduce the amount
      of memory reserved as there is no need to keep 2Meg just for resync if
      memory is tight.
      
      Thanks to Keld for the suggestion.
      
      Cc: Keld Jørn Simonsen <keld@dkuug.dk>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      0310fa21
  9. 01 8月, 2008 1 次提交
  10. 21 7月, 2008 1 次提交
  11. 03 7月, 2008 1 次提交
    • A
      Add bvec_merge_data to handle stacked devices and ->merge_bvec() · cc371e66
      Alasdair G Kergon 提交于
      When devices are stacked, one device's merge_bvec_fn may need to perform
      the mapping and then call one or more functions for its underlying devices.
      
      The following bio fields are used:
        bio->bi_sector
        bio->bi_bdev
        bio->bi_size
        bio->bi_rw  using bio_data_dir()
      
      This patch creates a new struct bvec_merge_data holding a copy of those
      fields to avoid having to change them directly in the struct bio when
      going down the stack only to have to change them back again on the way
      back up.  (And then when the bio gets mapped for real, the whole
      exercise gets repeated, but that's a problem for another day...)
      Signed-off-by: NAlasdair G Kergon <agk@redhat.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Milan Broz <mbroz@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      cc371e66
  12. 28 6月, 2008 3 次提交
    • N
      rationalise return value for ->hot_add_disk method. · 199050ea
      Neil Brown 提交于
      For all array types but linear, ->hot_add_disk returns 1 on
      success, 0 on failure.
      For linear, it returns 0 on success and -errno on failure.
      
      This doesn't cause a functional problem because the ->hot_add_disk
      function of linear is used quite differently to the others.
      However it is confusing.
      
      So convert all to return 0 for success or -errno on failure
      and fix call sites to match.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      199050ea
    • N
      Support adding a spare to a live md array with external metadata. · 6c2fce2e
      Neil Brown 提交于
      i.e. extend the 'md/dev-XXX/slot' attribute so that you can
      tell a device to fill an vacant slot in an and md array.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      6c2fce2e
    • N
      Ensure interrupted recovery completed properly (v1 metadata plus bitmap) · 8c2e870a
      Neil Brown 提交于
      If, while assembling an array, we find a device which is not fully
      in-sync with the array, it is important to set the "fullsync" flags.
      This is an exact analog to the setting of this flag in hot_add_disk
      methods.
      
      Currently, only v1.x metadata supports having devices in an array
      which are not fully in-sync (it keep track of how in sync they are).
      The 'fullsync' flag only makes a difference when a write-intent bitmap
      is being used.  In this case it tells recovery to ignore the bitmap
      and recovery all blocks.
      
      This fix is already in place for raid1, but not raid5/6 or raid10.
      
      So without this fix, a raid1 ir raid4/5/6 array with version 1.x
      metadata and a write intent bitmaps, that is stopped in the middle
      of a recovery, will appear to complete the recovery instantly
      after it is reassembled, but the recovery will not be correct.
      
      If you might have an array like that, issueing
         echo repair > /sys/block/mdXX/md/sync_action
      
      will make sure recovery completes properly.
      
      Cc: <stable@kernel.org>
      Signed-off-by: NNeil Brown <neilb@suse.de>
      8c2e870a
  13. 25 5月, 2008 1 次提交
    • N
      md: restart recovery cleanly after device failure. · dfc70645
      NeilBrown 提交于
      When we get any IO error during a recovery (rebuilding a spare), we abort
      the recovery and restart it.
      
      For RAID6 (and multi-drive RAID1) it may not be best to restart at the
      beginning: when multiple failures can be tolerated, the recovery may be
      able to continue and re-doing all that has already been done doesn't make
      sense.
      
      We already have the infrastructure to record where a recovery is up to
      and restart from there, but it is not being used properly.
      This is because:
        - We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
          which causes the recovery not be be checkpointed.
        - We remove spares and then re-added them which loses important state
          information.
      
      The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
      needed.  If there is an error, the relevant drive will be marked as
      Faulty, and that is enough to ensure correct handling of the error.  So we
      first remove MD_RECOVERY_ERR, changing some of the uses of it to
      MD_RECOVERY_INTR.
      
      Then we cause the attempt to remove a non-faulty device from an array to
      fail (unless recovery is impossible as the array is too degraded).  Then
      when remove_and_add_spares attempts to remove the devices on which
      recovery can continue, it will fail, they will remain in place, and
      recovery will continue on them as desired.
      
      Issue:  If we are halfway through rebuilding a spare and another drive
      fails, and a new spare is immediately available,  do we want to:
       1/ complete the current rebuild, then go back and rebuild the new spare or
       2/ restart the rebuild from the start and rebuild both devices in
          parallel.
      
      Both options can be argued for.  The code currently takes option 2 as
        a/ this requires least code change
        b/ this results in a minimally-degraded array in minimal time.
      
      Cc: "Eivind Sarto" <ivan@kasenna.com>
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dfc70645
  14. 15 5月, 2008 1 次提交
    • N
      Remove blkdev warning triggered by using md · e7e72bf6
      Neil Brown 提交于
      As setting and clearing queue flags now requires that we hold a spinlock
      on the queue, and as blk_queue_stack_limits is called without that lock,
      get the lock inside blk_queue_stack_limits.
      
      For blk_queue_stack_limits to be able to find the right lock, each md
      personality needs to set q->queue_lock to point to the appropriate lock.
      Those personalities which didn't previously use a spin_lock, us
      q->__queue_lock.  So always initialise that lock when allocated.
      
      With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
      longer cause warnings as it will be clear that the proper lock is held.
      
      Thanks to Dan Williams for review and fixing the silly bugs.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Alistair John Strachan <alistair@devzero.co.uk>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Jacek Luczak <difrost.kernel@gmail.com>
      Cc: Prakash Punnoor <prakash@punnoor.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7e72bf6
  15. 09 5月, 2008 1 次提交
  16. 30 4月, 2008 1 次提交
  17. 28 4月, 2008 1 次提交
  18. 05 3月, 2008 4 次提交
    • K
      md: the md RAID10 resync thread could cause a md RAID10 array deadlock · a07e6ab4
      K.Tanaka 提交于
      This message describes another issue about md RAID10 found by testing the
      2.6.24 md RAID10 using new scsi fault injection framework.
      
      Abstract:
      
      When a scsi error results in disabling a disk during RAID10 recovery, the
      resync threads of md RAID10 could stall.
      
      This case, the raid array has already been broken and it may not matter.  But
      I think stall is not preferable.  If it occurs, even shutdown or reboot will
      fail because of resource busy.
      
      The deadlock mechanism:
      
      The r10bio_s structure has a "remaining" member to keep track of BIOs yet to
      be handled when recovering.  The "remaining" counter is incremented when
      building a BIO in sync_request() and is decremented when finish a BIO in
      end_sync_write().
      
      If building a BIO fails for some reasons in sync_request(), the "remaining"
      should be decremented if it has already been incremented.  I found a case
      where this decrement is forgotten.  This causes a md_do_sync() deadlock
      because md_do_sync() waits for md_done_sync() called by end_sync_write(), but
      end_sync_write() never calls md_done_sync() because of the "remaining" counter
      mismatch.
      
      For example, this problem would be reproduced in the following case:
      
      Personalities : [raid10]
      md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F)
            3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_]
            [>....................]  recovery =  2.2% (45376/1959808) finish=0.7min speed=45376K/sec
      
      This case, sdf1 is recovering, sdb1 and sde1 are disabled.
      An additional error with detaching sdd will cause a deadlock.
      
      md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F)
            3919616 blocks 64K chunks 2 near-copies [4/1] [_U__]
            [=>...................]  recovery =  5.0% (99520/1959808) finish=5.9min speed=5237K/sec
      
       2739 ?        S<     0:17 [md0_raid10]
      28608 ?        D<     0:00 [md0_resync]
      28629 pts/1    Ss     0:00 bash
      28830 pts/1    R+     0:00 ps ax
      31819 ?        D<     0:00 [kjournald]
      
      The resync thread keeps working, but actually it is deadlocked.
      
      Patch:
      By this patch, the remaining counter will be decremented if needed.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a07e6ab4
    • N
      md: fix possible raid1/raid10 deadlock on read error during resync · 1c830532
      NeilBrown 提交于
      Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for
      another possible deadlock in raid1/raid10 error handing.
      
      If a read request returns an error while a resync is happening and a resync
      request is pending, the attempt to fix the error will block until the resync
      progresses, and the resync will block until the read request completes.  Thus
      a deadlock.
      
      This patch fixes the problem.
      
      Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1c830532
    • K
      md: don't attempt read-balancing for raid10 'far' layouts · 8ed3a195
      Keld Simonsen 提交于
      This patch changes the disk to be read for layout "far > 1" to always be the
      disk with the lowest block address.
      
      Thus the chunks to be read will always be (for a fully functioning array) from
      the first band of stripes, and the raid will then work as a raid0 consisting
      of the first band of stripes.
      
      Some advantages:
      
      The fastest part which is the outer sectors of the disks involved will be
      used.  The outer blocks of a disk may be as much as 100 % faster than the
      inner blocks.
      
      Average seek time will be smaller, as seeks will always be confined to the
      first part of the disks.
      
      Mixed disks with different performance characteristics will work better, as
      they will work as raid0, the sequential read rate will be number of disks
      involved times the IO rate of the slowest disk.
      
      If a disk is malfunctioning, the first disk which is working, and has the
      lowest block address for the logical block will be used.
      Signed-off-by: NKeld Simonsen <keld@dkuug.dk>
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ed3a195
    • N
      md: fix deadlock in md/raid1 and md/raid10 when handling a read error · a35e63ef
      NeilBrown 提交于
      When handling a read error, we freeze the array to stop any other IO while
      attempting to over-write with correct data.
      
      This is done in the raid1d(raid10d) thread and must wait for all submitted IO
      to complete (except for requests that failed and are sitting in the retry
      queue - these are counted in ->nr_queue and will stay there during a freeze).
      
      However write requests need attention from raid1d as bitmap updates might be
      required.  This can cause a deadlock as raid1 is waiting for requests to
      finish that themselves need attention from raid1d.
      
      So we create a new function 'flush_pending_writes' to give that attention, and
      call it in freeze_array to be sure that we aren't waiting on raid1d.
      
      Thanks to "K.Tanaka" <k-tanaka@ce.jp.nec.com> for finding and reporting this
      problem.
      
      Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a35e63ef
  19. 07 2月, 2008 3 次提交
  20. 09 11月, 2007 1 次提交
  21. 16 10月, 2007 1 次提交
  22. 10 10月, 2007 1 次提交
  23. 01 8月, 2007 2 次提交