1. 14 Dec 2009 (1 commit)
    • md: support barrier requests on all personalities. · a2826aa9
      Committed by NeilBrown
      Previously barriers were only supported on RAID1.  This is because
      other levels require synchronisation across all devices and so needed
      a different approach.
      Here is that approach.
      
      When a barrier arrives, we send a zero-length barrier to every active
      device.  When that completes - and if the original request was not
      empty - we submit the barrier request itself (with the barrier flag
      cleared) and then submit a fresh load of zero-length barriers.
      
      The barrier request itself is asynchronous, but any subsequent
      request will block until the barrier completes.
      
      The reason for clearing the barrier flag is that a barrier request is
      allowed to fail.  If we pass a non-empty barrier through a striping
      raid level it is conceivable that part of it could succeed and part
      could fail.  That would be way too hard to deal with.
      So if the first run of zero-length barriers succeeds, we assume all is
      sufficiently well that we send the request and ignore errors in the
      second run of barriers.
      
      RAID5 needs extra care as write requests may not have been submitted
      to the underlying devices yet.  So we flush the stripe cache before
      proceeding with the barrier.
      
      Note that the second set of zero-length barriers is submitted
      immediately after the original request is submitted.  Thus when
      a personality finds mddev->barrier to be set during make_request,
      it should not return from make_request until the corresponding
      per-device request(s) have been queued.
      
      That will be done in later patches.
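
      As a rough sketch of the flow described above (the helper and
      callback names here are illustrative, not the actual md code):

        /* Sketch: submit a zero-length barrier to every active device. */
        static void submit_barriers(mddev_t *mddev)
        {
                mdk_rdev_t *rdev;

                list_for_each_entry(rdev, &mddev->disks, same_set) {
                        struct bio *bio = bio_alloc(GFP_NOIO, 0); /* zero length */

                        bio->bi_bdev = rdev->bdev;
                        bio->bi_end_io = md_end_barrier; /* hypothetical callback */
                        bio->bi_private = mddev;
                        submit_bio(WRITE_BARRIER, bio);
                }
        }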
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Andre Noll <maan@systemlinux.org>
  2. 16 Oct 2009 (2 commits)
    • md: raid1/raid10: handle allocation errors during array setup. · ed9bfdf1
      Committed by NeilBrown
      Both raid1 and raid10 create a mempool during startup.
      If the 'alloc' function for this mempool fails, unplug_slaves
      is called.
      If that happens when the pool is being initialised, unplug_slaves
      will try to use the 'conf' structure that isn't filled in yet, and
      badness will happen.
      
      So ensure that unplug_slaves doesn't get called unless we know
      that the conf structure is fully initialised.
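
      As a sketch (not the literal patch), the allocation callback should
      simply fail cleanly rather than reach for conf while it is half-built:

        /* Sketch: mempool alloc callback without the unplug_slaves() call. */
        static void *r1bio_pool_alloc(gfp_t gfp_flags, void *data)
        {
                struct pool_info *pi = data;
                int size = offsetof(r1bio_t, bios[pi->raid_disks]);

                /* on failure just return NULL; conf may not be set up yet */
                return kzalloc(size, gfp_flags);
        }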
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1/raid10: add a cond_resched · 1d9d5241
      Committed by NeilBrown
      During 'check' of a raid1 or raid10 it is possible for the management
      thread to spend a lot of time running 'memcmp' on blocks from
      different devices, so make sure the thread has a chance to schedule.
      raid5d already has a cond_resched (in process_stripe).
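
      The change amounts to a cond_resched() inside the comparison loop,
      roughly:

        /* Sketch: yield between page comparisons during a 'check' pass. */
        for (j = 0; j < vcnt; j++) {
                if (memcmp(page_address(sbio->bi_io_vec[j].bv_page),
                           page_address(pbio->bi_io_vec[j].bv_page),
                           PAGE_SIZE))
                        break;
                cond_resched();
        }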
      Reported-by: Lee Howard <faxguy@howardsilvan.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  3. 23 Sep 2009 (4 commits)
  4. 11 Sep 2009 (1 commit)
  5. 03 Aug 2009 (1 commit)
    • md: Push down data integrity code to personalities. · ac5e7113
      Committed by Andre Noll
      This patch replaces md_integrity_check() by two new public functions:
      md_integrity_register() and md_integrity_add_rdev() which are both
      personality-independent.
      
      md_integrity_register() is called from the ->run and ->hot_remove
      methods of all personalities that support data integrity.  The
      function iterates over the component devices of the array and
      determines if all active devices are integrity capable and if their
      profiles match. If this is the case, the common profile is registered
      for the mddev via blk_integrity_register().
      
      The second new function, md_integrity_add_rdev() is called from the
      ->hot_add_disk methods, i.e. whenever a new device is being added
      to a raid array. If the new device does not support data integrity,
      or has a profile different from the one already registered, data
      integrity for the mddev is disabled.
      
      For raid0 and linear, only the call to md_integrity_register() from
      the ->run method is necessary.
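
      In outline, the two call sites look like this (a sketch; error
      handling omitted):

        /* In a personality's ->run method, e.g. raid0: */
        static int raid0_run(mddev_t *mddev)
        {
                /* ... normal setup ... */
                md_integrity_register(mddev); /* register the common profile */
                return 0;
        }

        /* In a personality's ->hot_add_disk method: */
        md_integrity_add_rdev(rdev, mddev); /* may disable integrity on mismatch */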
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
  6. 01 Jul 2009 (1 commit)
  7. 18 Jun 2009 (2 commits)
  8. 16 Jun 2009 (2 commits)
  9. 23 May 2009 (1 commit)
  10. 07 May 2009 (1 commit)
    • md/raid10: don't clear bitmap during recovery if array will still be degraded. · 18055569
      Committed by NeilBrown
      If we have a raid10 with multiple missing devices, and we recover just
      one of these to a spare, then we risk (depending on the bitmap and
      array chunk size) clearing bits of the bitmap for which recovery isn't
      complete (because a device is still missing).
      
      This can lead to a subsequent "re-add" being recovered without
      any IO happening, which would result in loss of data.
      
      This patch takes the safe approach of not clearing bitmap bits
      if the array will still be degraded.
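
      The check can be sketched as follows (field names as in raid10 of
      that era): before accounting a region as in-sync, see whether any
      mirror covering it is still missing.

        /* Sketch: don't let the bitmap record this region as fully recovered. */
        int still_degraded = 0;
        int i;

        for (i = 0; i < conf->raid_disks; i++) {
                mdk_rdev_t *rdev = conf->mirrors[i].rdev;

                if (rdev == NULL || !test_bit(In_sync, &rdev->flags))
                        still_degraded = 1;
        }
        bitmap_start_sync(mddev->bitmap, sect, &sync_blocks, still_degraded);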
      
      This patch is suitable for all active -stable kernels.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
  11. 15 Apr 2009 (1 commit)
  12. 31 Mar 2009 (8 commits)
    • md: 'array_size' sysfs attribute · b522adcd
      Committed by Dan Williams
      Allow userspace to set the size of the array according to the following
      semantics:
      
      1/ size must be <= the size returned by mddev->pers->size(mddev, 0, 0)
         a) If size is set before the array is running, do_md_run will fail
            if size is greater than the default size
         b) A reshape attempt that reduces the default size to less than the set
            array size should be blocked
      2/ once userspace sets the size the kernel will not change it
      3/ writing 'default' to this attribute returns control of the size to the
         kernel and reverts to the size reported by the personality
      
      Also, convert locations that need to know the default size from directly
      reading ->array_sectors to <pers>_size.  Resync/reshape operations
      always follow the default size.
      
      Finally, fixup other locations that read a number of 1k-blocks from
      userspace to use strict_blocks_to_sectors() which checks for unsigned
      long long to sector_t overflow and blocks to sectors overflow.
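
      The helper is small; as a sketch of what the commit describes:

        /* Sketch: convert a count of 1K blocks from userspace to sectors,
         * rejecting anything that overflows along the way. */
        static int strict_blocks_to_sectors(const char *buf, sector_t *sectors)
        {
                unsigned long long blocks;
                sector_t new;

                if (strict_strtoull(buf, 10, &blocks) < 0)
                        return -EINVAL;
                if (blocks & 1ULL << (8 * sizeof(blocks) - 1))
                        return -EINVAL; /* doubling would overflow */
                new = blocks * 2; /* 1K blocks -> 512 byte sectors */
                if ((unsigned long long)new != blocks * 2)
                        return -EINVAL; /* sector_t overflow */
                *sectors = new;
                return 0;
        }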
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • md: centralize ->array_sectors modifications · 1f403624
      Committed by Dan Williams
      Get personalities out of the business of directly modifying
      ->array_sectors.  This lays the groundwork for introducing policy on
      when ->array_sectors can be modified.
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • md: add 'size' as a personality method · 80c3a6ce
      Committed by Dan Williams
      In preparation for giving userspace control over ->array_sectors we need
      to be able to retrieve the 'default' size, and the 'anticipated' size
      when a reshape is requested.  For personalities that do not reshape,
      emit a warning if anything but the default size is requested.
      
      In the raid5 case we need to update ->previous_raid_disks to make the
      new 'default' size available.
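
      The new method slots into struct mdk_personality; its signature is
      roughly:

        /* sectors == 0 and raid_disks == 0 mean "use the current
         * geometry", i.e. report the default size. */
        sector_t (*size)(mddev_t *mddev, sector_t sectors, int raid_disks);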
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • md: enable suspend/resume of md devices. · 409c57f3
      Committed by NeilBrown
      To be able to change the 'level' of an md/raid array, we need to
      suspend the device so that no requests are active - then move some
      pointers around etc.
      
      The code already keeps counts of active requests and the ->quiesce
      function can be used to wait until those counts hit zero.
      However the quiesce function blocks new requests once they are all
      ready 'inside' the personality module, and that is too late if we want
      to replace the personality modules.
      
      So make all md requests come in through a common md_make_request
      function that keeps track of how many requests have entered the
      modules but may not yet be on the internal reference counts.
      Allow md_make_request to be blocked when we want to suspend the
      device, and make it possible to wait for all those in-transit requests
      to be added to internal lists so that ->quiesce can wait for them.
      
      There is still a problem that when a request completes, we drop the
      ref count inside the personality code so there is a short time between
      when the refcount hits zero, and when the personality code is no
      longer being used.
      The personality code never blocks (schedule or spinlock) between
      dropping the refcount and exiting the routine, so this should be safe
      (as put_module calls synchronize_sched() before unmapping the module
      code).
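
      Put together, md_make_request looks roughly like this (illustrative
      field names on mddev_t):

        /* Sketch: gate every request so a suspend can drain in-transit IO. */
        static int md_make_request(struct request_queue *q, struct bio *bio)
        {
                mddev_t *mddev = q->queuedata;
                int rv;

                /* block new requests while the device is suspended */
                wait_event(mddev->sb_wait, mddev->suspended == 0);
                atomic_inc(&mddev->active_io); /* count in-transit requests */
                rv = mddev->pers->make_request(q, bio);
                if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
                        wake_up(&mddev->sb_wait); /* let suspend proceed */
                return rv;
        }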
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Make mddev->size sector-based. · 58c0fed4
      Committed by Andre Noll
      This patch renames the "size" field of struct mddev_s to "dev_sectors"
      and stores the number of 512-byte sectors instead of the number of
      1K-blocks in it.
      
      All users of that field, including raid levels 1,4-6,10, are adjusted
      accordingly. This simplifies the code a bit because it allows us to
      get rid of a couple of divisions/multiplications by two.
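
      A typical simplification, as a before/after sketch:

        /* before: size counted 1K blocks, so sectors needed a doubling */
        rdev->sectors = mddev->size * 2;
        /* after: dev_sectors is already in 512-byte sectors */
        rdev->sectors = mddev->dev_sectors;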
      
      In order to make checkpatch happy, some minor coding style issues
      have also been addressed. In particular, size_store() now uses
      strict_strtoull() instead of simple_strtoull().
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: move md_k.h from include/linux/raid/ to drivers/md/ · 43b2e5d8
      Committed by NeilBrown
      It really is nicer to keep related code together.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: move lots of #include lines out of .h files and into .c · bff61975
      Committed by NeilBrown
      This makes the includes more explicit, and is preparation for moving
      md_k.h to drivers/md/md.h
      
      Remove include/linux/raid/md.h as its only remaining use was to #include
      other files.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: move headers out of include/linux/raid/ · ef740c37
      Committed by Christoph Hellwig
      Move the headers with the local structures for the disciplines and
      bitmap.h into drivers/md/ so that they are more easily grepable for
      hacking and not far away.  md.h is left where it is for now as there
      are some uses from the outside.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
  13. 25 Feb 2009 (3 commits)
    • md: avoid races when stopping resync. · 73d5c38a
      Committed by NeilBrown
      There has been a race in raid10 and raid1 for a long time
      which has only recently started showing up due to a scheduler change.
      
      When a sync_read request finishes, as soon as reschedule_retry
      is called, another thread can mark the resync request as having
      completed, so md_do_sync can finish, ->stop can be called, and
      ->conf can be freed.  So using conf after reschedule_retry is not
      safe.
      
      Similarly, when finishing a sync_write, calling md_done_sync must be
      the last thing we do, as it allows a chain of events which will free
      conf and other data structures.
      
      The first of these requires action in raid10.c
      The second requires action in raid1.c and raid10.c
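
      The reordering can be sketched like this (raid1's sync-write
      completion; illustrative, not the full patch):

        /* Sketch: cache what we need, drop our reference, then signal
         * completion last -- after md_done_sync() conf may be freed. */
        sector_t s = r1_bio->sectors;
        int uptodate = test_bit(R1BIO_Uptodate, &r1_bio->state);

        put_buf(r1_bio);
        md_done_sync(mddev, s, uptodate); /* must be the very last step */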
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: Don't call bitmap_cond_end_sync when we are doing recovery. · 78200d45
      Committed by NeilBrown
      For raid1/4/5/6, resync (fixing inconsistencies between devices) is
      very similar to recovery (rebuilding a failed device onto a spare).
      They both walk through the device addresses in order.
      
      For raid10 it can be quite different.  resync follows the 'array'
      address, and makes sure all copies are the same.  Recovery walks
      through 'device' addresses and recreates each missing block.
      
      The 'bitmap_cond_end_sync' function allows the write-intent-bitmap
      (when present) to be updated to reflect a partially completed resync.
      It makes assumptions which mean that it does not work correctly for
      raid10 recovery at all.
      
      In particular, it can cause bitmap-directed recovery of a raid10 to
      not recover some of the blocks that need to be recovered.
      
      So move the call to bitmap_cond_end_sync into the resync path, rather
      than being in the common "resync or recovery" path.
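
      In raid10's sync_request this amounts to calling it only on the
      resync branch, roughly:

        if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
                /* resync walks 'array' addresses in order, so the
                 * partial-completion checkpoint is meaningful here */
                bitmap_cond_end_sync(mddev->bitmap, sector_nr);
                /* ... build resync requests ... */
        } else {
                /* recovery walks 'device' addresses; no checkpoint update */
        }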
      
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: Don't skip more than 1 bitmap-chunk at a time during recovery. · 09b4068a
      Committed by NeilBrown
      When doing recovery on a raid10 with a write-intent bitmap, we only
      need to recover chunks that are flagged in the bitmap.
      
      However, if we choose to skip a chunk because it isn't flagged, the
      code currently skips the whole raid10 chunk, and thus might not
      recover some blocks that need recovering.
      
      This patch fixes it.
      
      In case that is confusing, it might help to understand that there
      is a 'raid10 chunk size' which guides how data is distributed across
      the devices, and a 'bitmap chunk size' which says how much data
      corresponds to a single bit in the bitmap.
      
      This bug only affects cases where the bitmap chunk size is smaller
      than the raid10 chunk size.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
  14. 09 Jan 2009 (1 commit)
    • md: use list_for_each_entry macro directly · 159ec1fc
      Committed by Cheng Renquan
      The rdev_for_each macro defined in <linux/raid/md_k.h> is identical
      to list_for_each_entry_safe from <linux/list.h>, so it should be
      defined in terms of list_for_each_entry_safe instead of reinventing
      the wheel.
      
      But some of these loops don't really need the safe version; a direct
      list_for_each_entry is enough, which saves a temporary variable
      (tmp) in every function that used rdev_for_each.
      
      In this patch, most rdev_for_each loops are replaced by
      list_for_each_entry, saving many tmp variables; the safe version is
      kept only where list_del is called to delete an entry.
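
      After the change the macro is defined in terms of the generic
      helper, roughly:

        /* the safe form keeps a tmp for loops that may call list_del */
        #define rdev_for_each(rdev, tmp, mddev) \
                list_for_each_entry_safe(rdev, tmp, &((mddev)->disks), same_set)

        /* read-only traversal needs no tmp variable: */
        list_for_each_entry(rdev, &mddev->disks, same_set)
                /* inspect rdev */;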
      Signed-off-by: Cheng Renquan <crquan@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  15. 06 Nov 2008 (1 commit)
    • md: fix bug in raid10 recovery. · a53a6c85
      Committed by NeilBrown
      Adding a spare to a raid10 doesn't cause recovery to start.
      This is due to a silly typo in
        commit 6c2fce2e
      and so is a bug in 2.6.27 and 2.6.28-rc.
      
      Thanks to Thomas Backlund for bisecting to find this.
      
      Cc: Thomas Backlund <tmb@mandriva.org>
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
  16. 15 Oct 2008 (1 commit)
    • md: build failure due to missing delay.h · 25570727
      Committed by Stephen Rothwell
      Today's linux-next build (powerpc ppc64_defconfig) failed like this:
      
      drivers/md/raid1.c: In function 'sync_request':
      drivers/md/raid1.c:1759: error: implicit declaration of function 'msleep_interruptible'
      make[3]: *** [drivers/md/raid1.o] Error 1
      make[3]: *** Waiting for unfinished jobs....
      drivers/md/raid10.c: In function 'sync_request':
      drivers/md/raid10.c:1749: error: implicit declaration of function 'msleep_interruptible'
      make[3]: *** [drivers/md/raid10.o] Error 1
      drivers/md/md.c: In function 'md_do_sync':
      drivers/md/md.c:5915: error: implicit declaration of function 'msleep'
      
      Caused by commit 6caa3b0bbdb474647f6bdd8a958ffc46f78d8d58 ("md: Remove
      unnecessary #includes, #defines, and function declarations").  I added
      the following patch.
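
      The fix adds the missing include to each affected file:

        #include <linux/delay.h> /* declares msleep and msleep_interruptible */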
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NeilBrown <neilb@suse.de>
  17. 13 Oct 2008 (1 commit)
    • md: Relax minimum size restrictions on chunk_size. · 4bbf3771
      Committed by NeilBrown
      Currently, the 'chunk_size' of an array must be at least PAGE_SIZE.
      
      This means that moving an array to a machine with a larger PAGE_SIZE,
      or changing the kernel to use a larger PAGE_SIZE, can stop an array
      from working.
      
      For RAID10 and RAID4/5/6, this is non-trivial to fix as the resync
      process works on whole pages at a time, and assumes them to be wholly
      within a stripe.  For other raid personalities, this restriction is
      not needed at all and can be dropped.
      
      So remove the test on chunk_size from common code, and add it in just
      the places where it is needed: raid10 and raid4/5/6.
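
      The per-personality test is then a few lines in the relevant ->run
      methods, along these lines:

        /* Sketch: raid10 and raid4/5/6 still need a whole page per chunk. */
        if (mddev->chunk_size < PAGE_SIZE) {
                printk(KERN_ERR "md: chunk size must be at least PAGE_SIZE.\n");
                return -EINVAL;
        }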
      Signed-off-by: NeilBrown <neilb@suse.de>
  18. 09 Oct 2008 (5 commits)
    • block: mark bio_split_pool static · 6feef531
      Committed by Denis ChengRq
      Since all bio_split calls refer to the same single bio_split_pool,
      the bio_split function can use bio_split_pool directly instead of
      taking a mempool_t parameter; the mempool_t parameter can then be
      removed from the bio_split parameter list, and since bio_split_pool
      is only referenced in fs/bio.c, it can be marked static.
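
      Sketched against fs/bio.c, the result is:

        /* now file-local: callers no longer pass a pool in */
        static mempool_t *bio_split_pool __read_mostly;

        /* old: struct bio_pair *bio_split(struct bio *bi, mempool_t *pool,
         *                                 int first_sectors);
         * new: */
        struct bio_pair *bio_split(struct bio *bi, int first_sectors);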
      Signed-off-by: Denis ChengRq <crquan@gmail.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: move stats from disk to part0 · 074a7aca
      Committed by Tejun Heo
      Move stats related fields - stamp, in_flight, dkstats - from disk to
      part0 and unify stat handling such that...
      
      * part_stat_*() now updates part0 together if the specified partition
        is not part0.  ie. part_stat_*() are now essentially all_stat_*().
      
      * {disk|all}_stat_*() are gone.
      
      * part_round_stats() is updated similarly.  It handles part0 stats
        automatically and disk_round_stats() is killed.
      
      * part_{inc|dec}_in_flight() is implemented which automatically updates
        part0 stats for parts other than part0.
      
      * disk_map_sector_rcu() is updated to return part0 if no part matches.
        Combined with the above changes, this makes NULL special case
        handling in callers unnecessary.
      
      * Separate stats show code paths for disk are collapsed into part
        stats show code paths.
      
      * Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
      
      While at it, reposition stat handling macros a bit and add missing
      parentheses around macro parameters.
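
      Usage after the change looks roughly like:

        /* part_stat_*() folds the update into part0 automatically when
         * the partition is not part0 itself. */
        cpu = part_stat_lock(); /* was disk_stat_lock() */
        part_stat_inc(cpu, part, ios[rw]);
        part_stat_add(cpu, part, sectors[rw], nr_sectors);
        part_stat_unlock();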
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: fix diskstats access · c9959059
      Committed by Tejun Heo
      There are two variants of stat functions - ones prefixed with double
      underbars which don't care about preemption and ones without which
      disable preemption before manipulating per-cpu counters.  It's unclear
      whether the underbarred ones assume that preemption is disabled on
      entry as some callers don't do that.
      
      This patch unifies diskstats access by implementing disk_stat_lock()
      and disk_stat_unlock() which take care of both RCU (for partition
      access) and preemption (for per-cpu counter access).  diskstats access
      should always be enclosed between the two functions.  As such, there's
      no need for the versions which disables preemption.  They're removed
      and the double-underbar ones are renamed to drop the underbars.  As an
      extra argument is added, there's no danger of using the old version
      unconverted.
      
      disk_stat_lock() uses get_cpu() and returns the cpu index, and all
      diskstat functions which access per-cpu counters now have a @cpu
      argument to help RT.
      
      This change adds RCU or preemption operations at some places but also
      collapses several preemption ops into one at others.  Overall, the
      performance difference should be negligible as all involved ops are
      very lightweight per-cpu ones.
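
      The access pattern the commit establishes, roughly:

        int cpu;

        cpu = disk_stat_lock(); /* handles RCU and preemption */
        disk_stat_inc(cpu, disk, ios[rw]); /* the @cpu argument is new */
        disk_stat_add(cpu, disk, sectors[rw], nr_sectors);
        disk_stat_unlock();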
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: raid fixups for removal of bi_hw_segments · 960e739d
      Committed by Jens Axboe
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • drop vmerge accounting · 5df97b91
      Committed by Mikulas Patocka
      Remove the hw_segments field from struct bio and struct request.
      Without virtual merge accounting these fields have no purpose.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  19. 05 Aug 2008 (1 commit)
    • Allow raid10 resync to happen in larger chunks. · 0310fa21
      Committed by NeilBrown
      The raid10 resync/recovery code currently limits the amount of
      in-flight resync IO to 2Meg.  This was copied from raid1 where
      it seems quite adequate.  However for raid10, some layouts require
      a bit of seeking to perform a resync, and allowing a larger buffer
      size means that the seeking can be significantly reduced.
      
      There is probably no real need to limit the amount of in-flight
      IO at all.  Any shortage of memory will naturally reduce the
      amount of buffer space available down to a set minimum, and any
      concurrent normal IO will quickly cause resync IO to back off.
      
      The only problem would be that normal IO has to wait for all resync IO
      to finish, so a very large amount of resync IO could cause unpleasant
      latency when normal IO starts up.
      
      So: increase RESYNC_DEPTH to allow 32Meg of buffer (if memory is
      available) which seems to be a good amount.  Also reduce the amount
      of memory reserved as there is no need to keep 2Meg just for resync if
      memory is tight.
      
      Thanks to Keld for the suggestion.
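
      As an illustration of the throttle being tuned (the constants follow
      the text above; the surrounding code is only sketched):

        /* Sketch: cap the number of resync barriers in flight; the larger
         * depth allows roughly 32Meg of resync buffer to be outstanding. */
        #define RESYNC_DEPTH 32

        spin_lock_irq(&conf->resync_lock);
        wait_event_lock_irq(conf->wait_barrier,
                            conf->barrier < RESYNC_DEPTH,
                            conf->resync_lock,
                            unplug_slaves(conf->mddev));
        conf->barrier++;
        spin_unlock_irq(&conf->resync_lock);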
      
      Cc: Keld Jørn Simonsen <keld@dkuug.dk>
      Signed-off-by: NeilBrown <neilb@suse.de>
  20. 01 Aug 2008 (1 commit)
  21. 21 Jul 2008 (1 commit)