1. 31 Mar, 2009 27 commits
    • md: allow number of drives in raid5 to be reduced · ec32a2bd
      NeilBrown committed
      When reshaping a raid5 to have fewer devices, we work from the end of
      the array to the beginning.
      md_do_sync gives addresses to sync_request that go from the beginning
      to the end.  So largely ignore them and use the internal state variable
      "reshape_progress" to keep track of what to do next.
      
      Never allow the size to be reduced below the minimum (4 for raid6,
      3 otherwise).
      
      We require that the size of the array has already been reduced before
      the array is reshaped to a smaller size.  This is because simply
      reducing the size is an easily reversible operation, while the reshape
      is immediately destructive and so is not reversible for the blocks at
      the ends of the devices.
      Thus to reshape an array to have fewer devices, you must first write
      an appropriately small size to md/array_size.
      
      When the reshape finishes, we remove any drives that are no longer
      needed and fix up ->degraded.
      Signed-off-by: NeilBrown <neilb@suse.de>
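The preconditions described in this commit can be modelled in a few lines. This is a minimal userspace sketch, not the kernel code: the function names are hypothetical, capacity is simplified to "data disks times per-device sectors", and chunk rounding is ignored.

```python
def min_raid_disks(level: int) -> int:
    """Minimum device count from the commit message: 4 for raid6, 3 otherwise."""
    return 4 if level == 6 else 3


def can_reduce(level: int, new_disks: int, dev_sectors: int, array_sectors: int) -> bool:
    """Return True if a shrink to new_disks would be accepted in this model."""
    if new_disks < min_raid_disks(level):
        return False
    # Assumption: raid6 spends two devices on parity, raid4/5 spend one.
    parity = 2 if level == 6 else 1
    new_capacity = (new_disks - parity) * dev_sectors
    # The array size must already have been reduced (by writing to
    # md/array_size) so that the data fits on the smaller device set.
    return array_sectors <= new_capacity
```

In this model a 4-device raid5 with 1000-sector members can drop to 3 devices only once its array size is at most 2000 sectors.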
    • md/raid5: change reshape-progress measurement to cope with reshaping backwards. · fef9c61f
      NeilBrown committed
      When reducing the number of devices in a raid4/5/6, the reshape
      process has to start at the end of the array and work down to the
      beginning.  So we need to handle expand_progress and expand_lo
      differently.
      
      This patch renames "expand_progress" and "expand_lo" to avoid the
      implication that anything is getting bigger (expand->reshape) and
      every place they are used, we make sure that they are used the right
      way depending on whether delta_disks is positive or negative.
      Signed-off-by: NeilBrown <neilb@suse.de>
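The sign-dependent use of the progress marker can be illustrated as follows. This is a sketch under the assumption that a single marker separates restriped from not-yet-restriped sectors; the function name is hypothetical and the real driver tracks more state than this.

```python
def uses_previous_geometry(sector: int, reshape_progress: int, delta_disks: int) -> bool:
    """Return True if `sector` has not been restriped yet in this model."""
    if delta_disks < 0:
        # Shrinking: the reshape runs from the end of the array downwards,
        # so sectors below the progress mark are still in the old layout.
        return sector < reshape_progress
    # Growing: the reshape runs from the start upwards, so sectors at or
    # above the progress mark are still in the old layout.
    return sector >= reshape_progress
```

The same sector can therefore fall on opposite sides of the marker depending only on the sign of delta_disks, which is why every use of the renamed variables had to be checked.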
    • md: add explicit method to signal the end of a reshape. · cea9c228
      NeilBrown committed
      Currently raid5 (the only module that supports restriping)
      notices that the reshape has finished by sync_request being
      given a large value, and handles any cleanup then.
      
      This patch changes it so md_check_recovery calls into an
      explicit finish_reshape method as well.
      
      The clean-up from sync_request can do things that need to be
      done promptly, typically things local to the raid5_conf_t
      structure.
      
      The "finish_reshape" method is called under the mddev_lock
      so it can do things involving reconfiguring the device.
      
      This allows us to get rid of md_set_array_sectors_locked, which
      would have caused a deadlock if you tried to stop an array
      while a reshape was happening.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: enhance raid5_size to work correctly with negative delta_disks · 7ec05478
      NeilBrown committed
      This is the first of four patches which combine to allow md/raid5 to
      reduce the number of devices in the array by restriping the data over
      a subset of the devices.
      
      If the number of disks in a raid4/5/6 is being reduced, then the
      default size must be based on the new number, not the old number
      of devices.
      In general, it should be based on the smaller of new and old.
      Signed-off-by: NeilBrown <neilb@suse.de>
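The "smaller of new and old" rule amounts to a short calculation. The sketch below is an assumption-laden model, not the kernel's raid5_size: the parameter names are illustrative, and rounding each member down to a whole number of chunks is assumed.

```python
def default_array_sectors(dev_sectors: int, chunk_sectors: int,
                          raid_disks: int, previous_raid_disks: int,
                          max_degraded: int) -> int:
    """Default array size in sectors, based on the smaller device count."""
    disks = min(raid_disks, previous_raid_disks)
    # Assumption: only whole chunks on each member device are usable.
    usable = dev_sectors - (dev_sectors % chunk_sectors)
    # max_degraded devices' worth of space holds parity, not data.
    return usable * (disks - max_degraded)
```

Taking the minimum means the reported size is one that is valid both before and after the reshape, whichever direction it runs in.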
    • md/raid5: drop qd_idx from r6_state · 34e04e87
      NeilBrown committed
      We now have this value in stripe_head so we don't need to duplicate
      it.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid6: move raid6 data processing to raid6_pq.ko · f701d589
      Dan Williams committed
      Move the raid6 data processing routines into a standalone module
      (raid6_pq) to prepare them to be called from async_tx wrappers and other
      non-md drivers/modules.  This precludes a circular dependency of raid456
      needing the async modules for data processing while those modules in
      turn depend on raid456 for the base level synchronous raid6 routines.
      
      To support this move:
      1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
      2/ The raid6_call, recovery calls, and table symbols are exported
      3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to
         compile
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: raid5 run(): Fix max_degraded for raid level 4. · 18b00334
      Andre Noll committed
      raid4 allows only one failed disk.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: 'array_size' sysfs attribute · b522adcd
      Dan Williams committed
      Allow userspace to set the size of the array according to the following
      semantics:
      
      1/ size must be <= the size returned by mddev->pers->size(mddev, 0, 0)
         a) If size is set before the array is running, do_md_run will fail
            if size is greater than the default size
         b) A reshape attempt that reduces the default size to less than the set
            array size should be blocked
      2/ once userspace sets the size the kernel will not change it
      3/ writing 'default' to this attribute returns control of the size to the
         kernel and reverts to the size reported by the personality
      
      Also, convert locations that need to know the default size from directly
      reading ->array_sectors to <pers>_size.  Resync/reshape operations
      always follow the default size.
      
      Finally, fixup other locations that read a number of 1k-blocks from
      userspace to use strict_blocks_to_sectors() which checks for unsigned
      long long to sector_t overflow and blocks to sectors overflow.
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
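The semantics listed in this commit can be sketched as a small userspace model. Everything here is illustrative: the function names are hypothetical (only strict_blocks_to_sectors is named in the commit, and this is not its implementation), the attribute is assumed to take 1K blocks, and the 64-bit widths are assumptions.

```python
def blocks_to_sectors(text: str, sector_bits: int = 64) -> int:
    """Parse a count of 1K blocks and convert it to 512-byte sectors."""
    blocks = int(text, 10)  # raises ValueError on non-numeric input
    if blocks < 0 or blocks >= 1 << 64:
        raise OverflowError("does not fit in unsigned long long")
    sectors = blocks * 2
    if sectors >= 1 << sector_bits:
        raise OverflowError("does not fit in sector_t")
    return sectors


def store_array_size(text: str, default_sectors: int):
    """Return (externally_set, array_sectors) after a write to the attribute."""
    if text.strip() == "default":
        # Rule 3: hand control back to the kernel.
        return False, default_sectors
    sectors = blocks_to_sectors(text)
    if sectors > default_sectors:
        # Rule 1: never larger than the personality's default size.
        raise ValueError("larger than the default size")
    # Rule 2: once set by userspace, the kernel leaves it alone.
    return True, sectors
```

Passing sector_bits=32 models a build with a narrower sector_t, where the multiplication by two can overflow even though the block count itself parsed cleanly.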
    • md: centralize ->array_sectors modifications · 1f403624
      Dan Williams committed
      Get personalities out of the business of directly modifying
      ->array_sectors.  Lays groundwork to introduce policy on when
      ->array_sectors can be modified.
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • md: add 'size' as a personality method · 80c3a6ce
      Dan Williams committed
      In preparation for giving userspace control over ->array_sectors we need
      to be able to retrieve the 'default' size, and the 'anticipated' size
      when a reshape is requested.  For personalities that do not reshape, emit
      a warning if anything but the default size is requested.
      
      In the raid5 case we need to update ->previous_raid_disks to make the
      new 'default' size available.
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • md: add takeover support for converting raid6 back into raid5 · fc9739c6
      NeilBrown committed
      If a raid6 is still in the layout that comes from converting raid5
      into a raid6, this will allow us to convert it back again.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: add takeover support for raid4 -> raid5 conversion. · e9d4758f
      NeilBrown committed
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: allow layout/chunksize to be changed on an active 2-drive raid5. · b3546035
      NeilBrown committed
      2-drive raid5's aren't very interesting.  But if you are converting
      a raid1 into a raid5, you will at least temporarily have one.  And
      that is a good time to set the layout/chunksize for the new RAID5
      if you aren't happy with the defaults.
      
      layout and chunksize don't actually affect the placement of data
      on a 2-drive raid5, so we just do some internal book-keeping.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: add ->takeover method for raid5 to be able to take over raid1 · d562b0c4
      NeilBrown committed
      The RAID1 must have two drives, and its size must be a multiple
      of a chunksize that isn't too small.
      Signed-off-by: NeilBrown <neilb@suse.de>
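One way to read the size condition is as a search for the largest power-of-two chunk that divides the array size. The sketch below works under that reading; the 128-sector (64 KiB) starting point and the 8-sector (4 KiB) floor are assumptions for illustration, and the function name is hypothetical.

```python
from typing import Optional


def pick_chunk_sectors(array_sectors: int,
                       max_chunk_sectors: int = 128,
                       min_chunk_sectors: int = 8) -> Optional[int]:
    """Return a chunk size (in sectors) dividing the array, or None if too small."""
    chunk = max_chunk_sectors
    # Halve the candidate until it divides the array size exactly.
    while chunk and array_sectors % chunk:
        chunk //= 2
    if chunk < min_chunk_sectors:
        # No acceptable chunk size: the takeover would be refused.
        return None
    return chunk
```

An array of 1024 sectors gets the full 128-sector chunk, while one of 1004 sectors is only divisible by 4 and is rejected in this model.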
    • md: add ->takeover method to support changing the personality managing an array · 245f46c2
      NeilBrown committed
      Implement this for RAID6 to be able to 'takeover' a RAID5 array.  The
      new RAID6 will use a layout which places Q on the last device, and
      that device will be missing.
      If there are any available spares, one will immediately have Q
      recovered onto it.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: md_unregister_thread should cope with being passed NULL · e0cf8f04
      NeilBrown committed
      Mostly md_unregister_thread is only called when we know that the
      thread is not NULL, but sometimes we need to check first.  It is safer
      to put the check inside md_unregister_thread itself.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: refactor raid5 "run" · 91adb564
      NeilBrown committed
      .. so that the code to create the private data structures is separate.
      This will help with future code to change the level of an active
      array.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: finish support for DDF/raid6 · 67cc2b81
      NeilBrown committed
      DDF requires RAID6 calculations over different devices in a different
      order.
      For md/raid6, we calculate over just the data devices, starting
      immediately after the 'Q' block.
      For ddf/raid6 we calculate over all devices, using zeros in place of
      the P and Q blocks.
      
      This requires unfortunately complex loops...
      Signed-off-by: NeilBrown <neilb@suse.de>
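The two orderings can be compared with a small model. This is a sketch of the description in the commit message only: the function name is hypothetical, "zero" stands for an all-zero block, and starting the DDF walk at device 0 is an assumption.

```python
def syndrome_sources(disks: int, pd_idx: int, qd_idx: int, ddf_layout: bool) -> list:
    """List the blocks fed into the Q calculation, in order."""
    if ddf_layout:
        # DDF: every device takes part, with zero blocks substituted
        # for the P and Q positions.
        return ["zero" if i in (pd_idx, qd_idx) else i for i in range(disks)]
    # md: only the data devices, starting immediately after Q and wrapping.
    order = []
    for step in range(1, disks + 1):
        i = (qd_idx + step) % disks
        if i not in (pd_idx, qd_idx):
            order.append(i)
    return order
```

For five devices with P on device 1 and Q on device 2, the md order is 3, 4, 0 while the DDF order covers all five slots with zeros at positions 1 and 2. Because each source's position determines its coefficient in the Q calculation, the two orders do not produce interchangeable Q blocks.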
    • md/raid5: Add support for new layouts for raid5 and raid6. · 99c0fb5f
      NeilBrown committed
      DDF uses different layouts for P and Q blocks than current md/raid6
      so add those that are missing.
      Also add support for RAID6 layouts that are identical to various
      raid5 layouts with the simple addition of one device to hold all of
      the 'Q' blocks.
      Finally add 'raid5' layouts to match raid4.
      These last two will allow online level conversion.
      
      Note that this does not provide correct support for DDF/raid6 yet
      as the order in which data blocks are summed to produce the Q block
      is significant and different between current md code and DDF
      requirements.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: simplify raid5_compute_sector interface · 911d4ee8
      NeilBrown committed
      Rather than passing 'pd_idx' and 'qd_idx' to be filled in, pass
      a 'struct stripe_head *' and fill in the relevant fields.  This is
      more extensible.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid6: remove expectation that Q device is immediately after P device. · d0dabf7e
      NeilBrown committed
      Code currently assumes that the devices in a raid6 stripe are
        0 1 ... N-1 P Q
      in some rotated order.  We will shortly add new layouts in which
      this strict pattern is broken.
      So remove this expectation.  We still assume that the data disks
      are roughly in-order.  However P and Q can be inserted anywhere within
      that order.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: change raid5_compute_sector and stripe_to_pdidx to take a 'previous' argument · 112bf897
      NeilBrown committed
      This is similar to the recent change to get_active_stripe.
      There is no functional change, just some rearrangement to make
      future patches cleaner.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: simplify interface for init_stripe and get_active_stripe · b5663ba4
      NeilBrown committed
      Rather than passing 'pd_idx' and 'disks' to these functions, just pass
      'previous' which tells whether to use the 'previous' or 'current'
      geometry during a reshape, and let init_stripe calculate
      disks and pd_idx and anything else it might need.
      
      This is not a substantial simplification and even adds a division.
      However we will shortly be adding more complexity to init_stripe
      to handle more interesting 'reshape' activities, and without this
      change, the interface to these functions would get very complex.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Make mddev->size sector-based. · 58c0fed4
      Andre Noll committed
      This patch renames the "size" field of struct mddev_s to "dev_sectors"
      and stores the number of 512-byte sectors instead of the number of
      1K-blocks in it.
      
      All users of that field, including raid levels 1,4-6,10, are adjusted
      accordingly. This simplifies the code a bit because it allows us to get
      rid of a couple of divisions/multiplications by two.
      
      In order to make checkpatch happy, some minor coding style issues
      have also been addressed. In particular, size_store() now uses
      strict_strtoull() instead of simple_strtoull().
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: move md_k.h from include/linux/raid/ to drivers/md/ · 43b2e5d8
      NeilBrown committed
      It really is nicer to keep related code together..
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: move lots of #include lines out of .h files and into .c · bff61975
      NeilBrown committed
      This makes the includes more explicit, and is preparation for moving
      md_k.h to drivers/md/md.h
      
      Remove include/raid/md.h as its only remaining use was to #include
      other files.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: move headers out of include/linux/raid/ · ef740c37
      Christoph Hellwig committed
      Move the headers with the local structures for the disciplines and
      bitmap.h into drivers/md/ so that they are more easily grepable for
      hacking and not far away.  md.h is left where it is for now as there
      are some uses from the outside.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 09 Jan, 2009 1 commit
    • md: use list_for_each_entry macro directly · 159ec1fc
      Cheng Renquan committed
      The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
      list_for_each_entry_safe from <linux/list.h>; it should be defined in
      terms of list_for_each_entry_safe instead of reinventing the wheel.

      But some callers of the safe variant don't really need it; a plain
      list_for_each_entry is enough, which saves a temporary variable (tmp)
      in every function that used rdev_for_each.

      In this patch, most rdev_for_each loops are replaced by
      list_for_each_entry, saving many tmp variables; the safe version is
      kept only where list_del may be called to delete an entry.
      Signed-off-by: Cheng Renquan <crquan@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
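The difference between the plain and the "safe" traversal can be shown with a toy linked list. This is an analogy in Python rather than the kernel macros: the class and function names are hypothetical, and clearing node.next stands in for what happens to an entry's links when it is deleted.

```python
class Node:
    def __init__(self, name, nxt=None):
        self.name = name
        self.next = nxt


def build(names):
    """Build a singly linked list from a sequence of names."""
    head = None
    for name in reversed(names):
        head = Node(name, head)
    return head


def iterate(head):
    """Plain traversal: reads node.next after the loop body has run."""
    node = head
    while node is not None:
        yield node
        node = node.next


def iterate_safe(head):
    """Safe traversal: caches the successor before the loop body runs."""
    node = head
    while node is not None:
        nxt = node.next  # the extra temporary the plain version avoids
        yield node
        node = nxt
```

If the loop body unlinks the current node, the plain traversal loses its place while the safe one continues, which is why the safe form only needs to be kept where entries are deleted.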
  3. 13 Oct, 2008 3 commits
  4. 09 Oct, 2008 4 commits
    • block: move stats from disk to part0 · 074a7aca
      Tejun Heo committed
      Move stats related fields - stamp, in_flight, dkstats - from disk to
      part0 and unify stat handling such that...
      
      * part_stat_*() now updates part0 together if the specified partition
        is not part0.  ie. part_stat_*() are now essentially all_stat_*().
      
      * {disk|all}_stat_*() are gone.
      
      * part_round_stats() is updated similarly.  It handles part0 stats
        automatically and disk_round_stats() is killed.
      
      * part_{inc|dec}_in_flight() is implemented which automatically updates
        part0 stats for parts other than part0.
      
      * disk_map_sector_rcu() is updated to return part0 if no part matches.
        Combined with the above changes, this makes NULL special case
        handling in callers unnecessary.
      
      * Separate stats show code paths for disk are collapsed into part
        stats show code paths.
      
      * Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
      
      While at it, reposition stat handling macros a bit and add missing
      parentheses around macro parameters.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: fix diskstats access · c9959059
      Tejun Heo committed
      There are two variants of stat functions - ones prefixed with double
      underbars which don't care about preemption and ones without which
      disable preemption before manipulating per-cpu counters.  It's unclear
      whether the underbarred ones assume that preemption is disabled on
      entry as some callers don't do that.
      
      This patch unifies diskstats access by implementing disk_stat_lock()
      and disk_stat_unlock() which take care of both RCU (for partition
      access) and preemption (for per-cpu counter access).  diskstats access
      should always be enclosed between the two functions.  As such, there's
      no need for the versions which disable preemption.  They're removed
      and the double-underbar ones are renamed to drop the underbars.  As an
      extra argument is added, there's no danger of using the old version
      unconverted.
      
      disk_stat_lock() uses get_cpu() and returns the cpu index, and all
      diskstat functions which access per-cpu counters now have an @cpu
      argument to help RT.
      
      This change adds RCU or preemption operations at some places but also
      collapses several preemption ops into one at others.  Overall, the
      performance difference should be negligible as all involved ops are
      very lightweight per-cpu ones.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: make bi_phys_segments an unsigned int instead of short · 5b99c2ff
      Jens Axboe committed
      raid5 can overflow with more than 255 stripes, and we can increase it
      to an int for free on both 32 and 64-bit archs due to the padding.
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
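The 255 figure is consistent with a 16-bit field being used as two 8-bit counters, although the commit message does not say so; that packing is an assumption of the sketch below, and the function names are hypothetical. Under that assumption, widening the field to 32 bits gives each counter 16 bits.

```python
def add_high_counter(field: int, bits: int) -> int:
    """Increment the upper half of a `bits`-wide field holding two counters."""
    half = bits // 2
    mask = (1 << half) - 1
    low = field & mask
    # The upper counter wraps silently once it exceeds its half of the field.
    high = ((field >> half) + 1) & mask
    return (high << half) | low


def high_counter(field: int, bits: int) -> int:
    """Read back the upper counter."""
    half = bits // 2
    return (field >> half) & ((1 << half) - 1)
```

Counting 256 stripes in a 16-bit field wraps the upper counter back to zero in this model, while the same count fits comfortably in a 32-bit field.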
    • block: raid fixups for removal of bi_hw_segments · 960e739d
      Jens Axboe committed
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  5. 05 Aug, 2008 2 commits
    • Don't let a blocked_rdev interfere with read request in raid5/6 · ac4090d2
      NeilBrown committed
      When we have externally managed metadata, we need to mark a failed
      device as 'Blocked' and not allow any writes until that device
      has been marked as faulty in the metadata and the Blocked flag has
      been removed.
      
      However it is perfectly OK to allow read requests when there is a
      Blocked device, and with a readonly array, there may not be any
      metadata-handler watching for blocked devices.
      
      So in raid5/raid6 only allow a Blocked device to interfere with
      write requests or resync.  Read requests go through untouched.
      
      raid1 and raid10 already differentiate between read and write
      properly.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Fail safely when trying to grow an array with a write-intent bitmap. · dba034ee
      NeilBrown committed
      We cannot currently change the size of a write-intent bitmap.
      So if we change the size of an array which has such a bitmap, it
      tries to set bits beyond the end of the bitmap.
      
      For now, simply reject any request to change the size of an array
      which has a bitmap.  mdadm can remove the bitmap and add a new one
      after the array has changed size.
      Signed-off-by: NeilBrown <neilb@suse.de>
  6. 29 Jul, 2008 1 commit
  7. 24 Jul, 2008 2 commits
    • md: fix merge error · 23397883
      Dan Williams committed
      The original STRIPE_OP_IO removal patch had the following hunk:
      
      -               for (i = conf->raid_disks; i--; ) {
      +               for (i = conf->raid_disks; i--; )
                              set_bit(R5_Wantwrite, &sh->dev[i].flags);
      -                       if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
      -                               sh->ops.count++;
      -               }
      
      However it appears the hunk became broken after merging:
      -               for (i = conf->raid_disks; i--; ) {
      +               for (i = conf->raid_disks; i--; )
                              set_bit(R5_Wantwrite, &sh->dev[i].flags);
                              set_bit(R5_LOCKED, &dev->flags);
                              s.locked++;
      -                       if (!test_and_set_bit(STRIPE_OP_IO, &sh->ops.pending))
      -                               sh->ops.count++;
      -               }
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • md: move async_tx_issue_pending_all outside spin_lock_irq · c9f21aaf
      Dan Williams committed
      Some dma drivers need to call spin_lock_bh in their device_issue_pending
      routines.  This change avoids:
      
      WARNING: at kernel/softirq.c:136 local_bh_enable_ip+0x3a/0x85()
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>