1. 16 6月, 2009 2 次提交
  2. 09 6月, 2009 3 次提交
    • N
      md/raid5: fix bug in reshape code when chunk_size decreases. · 0e6e0271
      NeilBrown 提交于
      Now that we support changing the chunksize, we calculate
      "reshape_sectors" to be the max of number of sectors in old
      and new chunk size.
      However there is one please where we still use 'chunksize'
      rather than 'reshape_sectors'.
      This causes a reshape that reduces the size of chunks to freeze.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      0e6e0271
    • N
      md/raid5 - avoid deadlocks in get_active_stripe during reshape · a8c906ca
      NeilBrown 提交于
      md has functionality to 'quiesce' and array so that all pending
      IO completed and no new IO starts.  This is used to achieve a
      stable state before making internal changes.
      
      Currently this quiescing applies equally to normal IO, resync
      IO, and reshape IO.
      However there is a problem with applying it to reshape IO.
      Reshape can have multiple 'stripe_heads' that must be active together.
      If the quiesce come between allocating the first and the last of
      such a collection, then we deadlock, as the last will not be allocated
      until the quiesce is lifted, the quiesce will not be lifted until the
      first (which has been allocated) gets used, and that first cannot be
      used until the last is allocated.
      
      It is not necessary to inhibit reshape IO when a quiesce is
      requested.  Those places in the code that require a full quiesce will
      ensure the reshape thread is not running at all.
      
      So allow reshape requests to get access to new stripe_heads without
      being blocked by a 'quiesce'.
      
      This only affects in-place reshapes (i.e. where the array does not
      grow or shrink) and these are only newly supported.  So this patch is
      not needed in earlier kernels.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a8c906ca
    • N
      md/raid5: use conf->raid_disks in preference to mddev->raid_disk · f001a70c
      NeilBrown 提交于
      mddev->raid_disks can be changed and any time by a request from
      user-space.  It is a suggestion as to what number of raid_disks is
      desired.
      
      conf->raid_disks can only be changed by the raid5 module with suitable
      locks in place.  It is a statement as to the current number of
      raid_disks.
      
      There are two places where the latter should be used, but the former
      is used.  This can lead to a crash when reshaping an array.
      
      This patch changes to mddev-> to conf->
      Signed-off-by: NNeilBrown <neilb@suse.de>
      f001a70c
  3. 27 5月, 2009 1 次提交
  4. 26 5月, 2009 1 次提交
  5. 23 5月, 2009 1 次提交
  6. 17 4月, 2009 1 次提交
    • N
      md: update sync_completed and reshape_position even more often. · c03f6a19
      NeilBrown 提交于
      There are circumstances when a user-space process might need to
      "oversee" a resync/reshape process.  For example when doing an
      in-place reshape of a raid5, it is prudent to take a backup of each
      section before reshaping it as this is the only way to provide
      safety against an unplanned shutdown (i.e. crash/power failure).
      
      The sync_max sysfs value can be used to stop the resync from
      advancing beyond a particular point.
      So user-space can:
        suspend IO to the first section and back it up
        set 'sync_max' to the end of the section
        wait for 'sync_completed' to reach that point
        resume IO on the first section and move on to the next section.
      
      However this process requires the kernel and user-space to run in
      lock-step which could introduce unnecessary delays.
      
      It would be better if a 'double buffered' approach could be used with
      userspace and kernel space working on different sections with the
      'next' section always ready when the 'current' section is finished.
      
      One problem with implementing this is that sync_completed is only
      guaranteed to be updated when the sync process reaches sync_max.
      (it is updated on a time basis at other times, but it is hard to rely
      on that).  This defeats some of the double buffering.
      
      With this patch, sync_completed (and reshape_position) get updated as
      the current position approaches sync_max, so there is room for
      userspace to advance sync_max early without losing updates.
      
      To be precise, sync_completed is updated when the current sync
      position reaches half way between the current value of sync_completed
      and the value of sync_max.  This will usually be a good time for user
      space to update sync_max.
      
      If sync_max does not get updated, the updates to sync_completed
      (together with associated metadata updates) will occur at an
      exponentially increasing frequency which will get unreasonably fast
      (one update every page) immediately before the process hits sync_max
      and stops.  So the update rate will be unreasonably fast only for an
      insignificant period of time.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c03f6a19
  7. 14 4月, 2009 1 次提交
    • N
      md: improve usefulness and accuracy of sysfs file md/sync_completed. · acb180b0
      NeilBrown 提交于
      The sync_completed file reports how much of a resync (or recovery or
      reshape) has been completed.
      However due to the possibility of out-of-order completion of writes,
      it is not certain to be accurate.
      
      We have an internal value - mddev->curr_resync_completed - which is an
      accurate value (though it might not always be quite so uptodate).
      
      So:
       - make curr_resync_completed be uptodate a little more often,
         particularly when raid5 reshape updates status in the metadata
       - report curr_resync_completed in the sysfs file
       - allow poll/select to report all updates to md/sync_completed.
      
      This makes sync_completed completed usable by any external metadata
      handler that wants to record this status information in its metadata.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      acb180b0
  8. 31 3月, 2009 30 次提交
    • N
      md/raid5 revise rules for when to update metadata during reshape · c8f517c4
      NeilBrown 提交于
      We currently update the metadata :
       1/ every 3Megabytes
       2/ When the place we will write new-layout data to is recorded in
          the metadata as still containing old-layout data.
      
      Rule one exists to avoid having to re-do too much reshaping in the
      face of a crash/restart.  So it should really be time based rather
      than size based.  So change it to "every 10 seconds".
      
      Rule two turns out to be too harsh when restriping an array
      'in-place', as in that case the metadata much be updates for every
      stripe.
      For the in-place update, it can only possibly be safe from a crash if
      some user-space program data a backup of every e.g. few hundred
      stripes before allowing them to be reshaped.  In that case, the
      constant metadata update is pointless.
      So only update the metadata if the new metadata will report that the
      end of the 'old-layout' data is beyond where we are currently
      writing 'new-layout' data.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c8f517c4
    • N
      md/raid5: minor code cleanups in make_request. · b0f9ec04
      NeilBrown 提交于
      ... and to be certain the that make_request doesn't wait forever,
      add a 'wake_up' when ->reshape_progress has been set to MaxSector
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b0f9ec04
    • N
      md: remove CONFIG_MD_RAID_RESHAPE config option. · 2cffc4a0
      NeilBrown 提交于
      This was only needed when the code was experimental.  Most of it
      is well tested now, so the option is no longer useful.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2cffc4a0
    • N
      md/raid5: be more careful about write ordering when reshaping. · ab69ae12
      NeilBrown 提交于
      When we are reshaping an array, it is very important that we read
      the data from a particular sector offset before writing new data
      at that offset.
      
      In most cases when growing or shrinking an array we read long before
      we even consider writing.  But when restriping an array without
      changing it size, there is a small possibility that we might have
      some data to available write before the read has happened at the same
      location.  This would require some stripes to be in cache already.
      
      To guard against this small possibility, we check, before writing,
      that the 'old' stripe at the same location is not in the process of
      being read.  And we ensure that we mark all 'source' stripes as such
      before allowing new 'destination' stripes to proceed.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      ab69ae12
    • N
      md/raid5: allow layout and chunksize to be changed on active array. · 88ce4930
      NeilBrown 提交于
      If an array has 3 or more devices, we allow the chunksize or layout
      to be changed and when a reshape starts, we use these as the 'new'
      values.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      88ce4930
    • N
      md/raid5: reshape using largest of old and new chunk size · 7a661381
      NeilBrown 提交于
      This ensures that even when old and new stripes are overlapping,
      we will try to read all of the old before having to write any
      of the new.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      7a661381
    • N
      md/raid5: prepare for allowing reshape to change layout · e183eaed
      NeilBrown 提交于
      Add prev_algo to raid5_conf_t along the same lines as prev_chunk
      and previous_raid_disks.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      e183eaed
    • N
      md/raid5: prepare for allowing reshape to change chunksize. · 784052ec
      NeilBrown 提交于
      Add "prev_chunk" to raid5_conf_t, similar to "previous_raid_disks", to
      remember what the chunk size was before the reshape that is currently
      underway.
      
      This seems like duplication with "chunk_size" and "new_chunk" in
      mddev_t, and to some extent it is, but there are differences.
      The values in mddev_t are always defined and often the same.
      The prev* values are only defined if a reshape is underway.
      
      Also (and more significantly) the raid5_conf_t values will be changed
      at the same time (inside an appropriate lock) that the reshape is
      started by setting reshape_position.  In contrast, the new_chunk value
      is set when the sysfs file is written which could be well before the
      reshape starts.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      784052ec
    • N
      md/raid5: clearly differentiate 'before' and 'after' stripes during reshape. · 86b42c71
      NeilBrown 提交于
      During a raid5 reshape, we have some stripes in the cache that are
      'before' the reshape (and are still to be processed) and some that are
      'after'.  They are currently differentiated by having different
      ->disks values as the only reshape current supported involves changing
      the number of disks.
      
      However we will soon support reshapes that do not change the number
      of disks (chunk parity or chunk size).  So make the difference more
      explicit with a 'generation' number.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      86b42c71
    • N
      md: allow number of drives in raid5 to be reduced · ec32a2bd
      NeilBrown 提交于
      When reshaping a raid5 to have fewer devices, we work from the end of
      the array to the beginning.
      md_do_sync gives addresses to sync_request that go from the beginning
      to the end.  So largely ignore them use the internal state variable
      "reshape_progress" to keep track of what to do next.
      
      Never allow the size to be reduced below the minimum (4 for raid6,
      3 otherwise).
      
      We require that the size of the array has already been reduced before
      the array is reshaped to a smaller size.  This is because simply
      reducing the size is an easily reversible operation, while the reshape
      is immediately destructive and so is not reversible for the blocks at
      the ends of the devices.
      Thus to reshape an array to have fewer devices, you must first write
      an appropriately small size to md/array_size.
      
      When reshape finished, we remove any drives that are no longer
      needed and fix up ->degraded.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      ec32a2bd
    • N
      md/raid5: change reshape-progress measurement to cope with reshaping backwards. · fef9c61f
      NeilBrown 提交于
      When reducing the number of devices in a raid4/5/6, the reshape
      process has to start at the end of the array and work down to the
      beginning.  So we need to handle expand_progress and expand_lo
      differently.
      
      This patch renames "expand_progress" and "expand_lo" to avoid the
      implication that anything is getting bigger (expand->reshape) and
      every place they are used, we make sure that they are used the right
      way depending on whether delta_disks is positive or negative.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      fef9c61f
    • N
      md: add explicit method to signal the end of a reshape. · cea9c228
      NeilBrown 提交于
      Currently raid5 (the only module that supports restriping)
      notices that the reshape has finished be sync_request being
      given a large value, and handles any cleanup them.
      
      This patch changes it so md_check_recovery calls into an
      explicit finish_reshape method as well.
      
      The clean-up from sync_request can do things that need to be
      done promptly, typically things local to the raid5_conf_t
      structure.
      
      The "finish_reshape" method is called under the mddev_lock
      so it can do things involving reconfiguring the device.
      
      This allows us to get rid of md_set_array_sectors_locked, which
      would have caused a deadlock if you tried to stop and array
      while a reshape was happening.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      cea9c228
    • N
      md/raid5: enhance raid5_size to work correctly with negative delta_disks · 7ec05478
      NeilBrown 提交于
      This is the first of four patches which combine to allow md/raid5 to
      reduce the number of devices in the array by restriping the data over
      a subset of the devices.
      
      If the number of disks in a raid4/5/6 is being reduced, then the
      default size must be based on the new number, not the old number
      of devices.
      In general, it should be based on the smaller of new and old.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      7ec05478
    • N
      md/raid5: drop qd_idx from r6_state · 34e04e87
      NeilBrown 提交于
      We now have this value in stripe_head so we don't need to duplicate
      it.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      34e04e87
    • D
      md/raid6: move raid6 data processing to raid6_pq.ko · f701d589
      Dan Williams 提交于
      Move the raid6 data processing routines into a standalone module
      (raid6_pq) to prepare them to be called from async_tx wrappers and other
      non-md drivers/modules.  This precludes a circular dependency of raid456
      needing the async modules for data processing while those modules in
      turn depend on raid456 for the base level synchronous raid6 routines.
      
      To support this move:
      1/ The exportable definitions in raid6.h move to include/linux/raid/pq.h
      2/ The raid6_call, recovery calls, and table symbols are exported
      3/ Extra #ifdef __KERNEL__ statements to enable the userspace raid6test to
         compile
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      f701d589
    • A
      md: raid5 run(): Fix max_degraded for raid level 4. · 18b00334
      Andre Noll 提交于
      raid4 allows only one failed disk.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      18b00334
    • D
      md: 'array_size' sysfs attribute · b522adcd
      Dan Williams 提交于
      Allow userspace to set the size of the array according to the following
      semantics:
      
      1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0)
         a) If size is set before the array is running, do_md_run will fail
            if size is greater than the default size
         b) A reshape attempt that reduces the default size to less than the set
            array size should be blocked
      2/ once userspace sets the size the kernel will not change it
      3/ writing 'default' to this attribute returns control of the size to the
         kernel and reverts to the size reported by the personality
      
      Also, convert locations that need to know the default size from directly
      reading ->array_sectors to <pers>_size.  Resync/reshape operations
      always follow the default size.
      
      Finally, fixup other locations that read a number of 1k-blocks from
      userspace to use strict_blocks_to_sectors() which checks for unsigned
      long long to sector_t overflow and blocks to sectors overflow.
      Reviewed-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      b522adcd
    • D
      md: centralize ->array_sectors modifications · 1f403624
      Dan Williams 提交于
      Get personalities out of the business of directly modifying
      ->array_sectors.  Lays groundwork to introduce policy on when
      ->array_sectors can be modified.
      Reviewed-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      1f403624
    • D
      md: add 'size' as a personality method · 80c3a6ce
      Dan Williams 提交于
      In preparation for giving userspace control over ->array_sectors we need
      to be able to retrieve the 'default' size, and the 'anticipated' size
      when a reshape is requested.  For personalities that do not reshape emit
      a warning if anything but the default size is requested.
      
      In the raid5 case we need to update ->previous_raid_disks to make the
      new 'default' size available.
      Reviewed-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      80c3a6ce
    • N
      md: add takeover support for converting raid6 back into raid5 · fc9739c6
      NeilBrown 提交于
      If a raid6 is still in the layout that comes from converting raid5
      into a raid6. this will allow us to convert it back again.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      fc9739c6
    • N
      md: add takeover support for raid4 -> raid5 conversion. · e9d4758f
      NeilBrown 提交于
      Signed-off-by: NNeilBrown <neilb@suse.de>
      e9d4758f
    • N
      md/raid5: allow layout/chunksize to be changed on an active 2-drive raid5. · b3546035
      NeilBrown 提交于
      2-drive raid5's aren't very interesting.  But if you are converting
      a raid1 into a raid5, you will at least temporarily have one.  And
      that it a good time to set the layout/chunksize for the new RAID5
      if you aren't happy with the defaults.
      
      layout and chunksize don't actually affect the placement of data
      on a 2-drive raid5, so we just do some internal book-keeping.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b3546035
    • N
      md: add ->takeover method for raid5 to be able to take over raid1 · d562b0c4
      NeilBrown 提交于
      The RAID1 must have two drives and be a suitable size to
      be a multiple of a chunksize that isn't too small.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d562b0c4
    • N
      md: add ->takeover method to support changing the personality managing an array · 245f46c2
      NeilBrown 提交于
      Implement this for RAID6 to be able to 'takeover' a RAID5 array.  The
      new RAID6 will use a layout which places Q on the last device, and
      that device will be missing.
      If there are any available spares, one will immediately have Q
      recovered onto it.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      245f46c2
    • N
      md: md_unregister_thread should cope with being passed NULL · e0cf8f04
      NeilBrown 提交于
      Mostly md_unregister_thread is only called when we know that the
      thread is NULL, but sometimes we need to check first.  It is safer
      to put the check inside md_unregister_thread itself.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      e0cf8f04
    • N
      md/raid5: refactor raid5 "run" · 91adb564
      NeilBrown 提交于
      .. so that the code to create the private data structures is separate.
      This will help with future code to change the level of an active
      array.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      91adb564
    • N
      md/raid5: finish support for DDF/raid6 · 67cc2b81
      NeilBrown 提交于
      DDF requires RAID6 calculations over different devices in a different
      order.
      For md/raid6, we calculate over just the data devices, starting
      immediately after the 'Q' block.
      For ddf/raid6 we calculate over all devices, using zeros in place of
      the P and Q blocks.
      
      This requires unfortunately complex loops...
      Signed-off-by: NNeilBrown <neilb@suse.de>
      67cc2b81
    • N
      md/raid5: Add support for new layouts for raid5 and raid6. · 99c0fb5f
      NeilBrown 提交于
      DDF uses different layouts for P and Q blocks than current md/raid6
      so add those that are missing.
      Also add support for RAID6 layouts that are identical to various
      raid5 layouts with the simple addition of one device to hold all of
      the 'Q' blocks.
      Finally add 'raid5' layouts to match raid4.
      These last to will allow online level conversion.
      
      Note that this does not provide correct support for DDF/raid6 yet
      as the order in which data blocks are summed to produce the Q block
      is significant and different between current md code and DDF
      requirements.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      99c0fb5f
    • N
      md/raid5: simplify raid5_compute_sector interface · 911d4ee8
      NeilBrown 提交于
      Rather than passing 'pd_idx' and 'qd_idx' to be filled in, pass
      a 'struct stripe_head *' and fill in the relevant fields.  This is
      more extensible.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      911d4ee8
    • N
      md/raid6: remove expectation that Q device is immediately after P device. · d0dabf7e
      NeilBrown 提交于
      
      Code currently assumes that the devices in a raid6 stripe are
        0 1 ... N-1 P Q
      in some rotated order.  We will shortly add new layouts in which
      this strict pattern is broken.
      So remove this expectation.  We still assume that the data disks
      are roughly in-order.  However P and Q can be inserted anywhere within
      that order.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d0dabf7e