1. 14 Dec, 2009 9 commits
    • R
      raid: improve MD/raid10 handling of correctable read errors. · 1e50915f
      Robert Becker committed
      We've noticed severe lasting performance degradation of our raid
      arrays when we have drives that yield large amounts of media errors.
      The raid10 module will queue each failed read for retry, and also
      will attempt to call fix_read_error() to perform the read recovery.
      Read recovery is performed while the array is frozen, so repeated
      recovery attempts can degrade the performance of the array for
      extended periods of time.
      
      With this patch I propose adding a per md device max number of
      corrected read attempts.  Each rdev will maintain a count of
      read correction attempts in the rdev->read_errors field (not
      used currently for raid10). When we enter fix_read_error()
      we'll check to see when the last read error occurred, and
      divide the read error count by 2 for every hour since the
      last read error. If at that point our read error count
      exceeds the read error threshold, we'll fail the raid device.
      
      In addition in this patch I add sysfs nodes (get/set) for
      the per md max_read_errors attribute, the rdev->read_errors
      attribute, and added some printk's to indicate when
      fix_read_error fails to repair an rdev.
      
      For testing I used debugfs->fail_make_request to inject
      IO errors to the rdev while doing IO to the raid array.
      Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      1e50915f
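The decay-and-threshold rule described in this commit can be sketched as follows. This is a minimal Python model, not the kernel code: the function names are hypothetical, and counting the current error before comparing against the threshold is an assumption.

```python
def decayed_read_errors(read_errors: int, hours_since_last_error: int) -> int:
    """Halve the error count once for every whole hour since the last read error."""
    if hours_since_last_error <= 0:
        return read_errors
    # Once the shift covers every bit, the count has decayed to zero.
    if hours_since_last_error >= read_errors.bit_length():
        return 0
    return read_errors >> hours_since_last_error


def record_read_error(read_errors: int, hours_since_last_error: int,
                      max_read_errors: int):
    """Return (new_count, fail_device) after one more correctable read error.

    Assumption: the current error is counted before the threshold check.
    """
    count = decayed_read_errors(read_errors, hours_since_last_error) + 1
    return count, count > max_read_errors
```

For example, a device with 40 recorded errors whose last error was one hour ago decays to 20, and the new error pushes it to 21, which exceeds a threshold of 20.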
    • N
      md: support updating bitmap parameters via sysfs. · 43a70507
      NeilBrown committed
      A new attribute directory 'bitmap' in 'md' is created which
      contains files for configuring the bitmap.
      'location' identifies where the bitmap is, either 'none',
      or 'file' or 'sector offset from metadata'.
      Writing 'location' can create or remove a bitmap.
      Adding a 'file' bitmap this way is not yet supported.
      'chunksize' and 'time_base' must be set before 'location'
      can be set.
      
      'chunksize' can be set before creating a bitmap, but is
      currently always over-ridden by the bitmap superblock.
      
      'time_base' and 'backlog' can be updated at any time.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      43a70507
    • N
      md: factor out parsing of fixed-point numbers · 72e02075
      NeilBrown committed
      safe_delay_store can parse fixed point numbers (for fractions
      of a second).  We will want to do that for another sysfs
      file soon, so factor out the code.
      Signed-off-by: NeilBrown <neilb@suse.de>
      72e02075
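As an illustration of the kind of fixed-point parsing being factored out, here is a small Python sketch that turns a decimal string into an integer scaled by a power of ten (for example seconds into milliseconds). The name `parse_scaled` is hypothetical, and silently truncating extra fractional digits is an assumption, not something the commit message states.

```python
def parse_scaled(text: str, scale: int) -> int:
    """Parse a non-negative decimal like "1.25" into an int scaled by 10**scale."""
    whole, _, frac = text.strip().partition(".")
    if not whole and not frac:
        raise ValueError("no digits")
    for part in (whole, frac):
        if part and not (part.isascii() and part.isdigit()):
            raise ValueError(f"bad digits: {part!r}")
    # Assumption: digits beyond the requested precision are dropped.
    frac = frac[:scale].ljust(scale, "0")
    return int(whole or "0") * 10 ** scale + int(frac or "0")
```

With `scale=3`, the string "0.2" becomes 200, so a sysfs value given in seconds can be stored as whole milliseconds without floating point.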
    • N
      md: move offset, daemon_sleep and chunksize out of bitmap structure · 42a04b50
      NeilBrown committed
      ... and into bitmap_info.  These are all configuration parameters
      that need to be set before the bitmap is created.
      Signed-off-by: NeilBrown <neilb@suse.de>
      42a04b50
    • N
      md: collect bitmap-specific fields into one structure. · c3d9714e
      NeilBrown committed
      In preparation for making bitmap fields configurable via sysfs,
      start tidying up by making a single structure to contain the
      configuration fields.
      Signed-off-by: NeilBrown <neilb@suse.de>
      c3d9714e
    • N
      md: support barrier requests on all personalities. · a2826aa9
      NeilBrown committed
      Previously barriers were only supported on RAID1.  This is because
      other levels require synchronisation across all devices and so needed
      a different approach.
      Here is that approach.
      
      When a barrier arrives, we send a zero-length barrier to every active
      device.  When that completes - and if the original request was not
      empty -  we submit the barrier request itself (with the barrier flag
      cleared) and then submit a fresh load of zero length barriers.
      
      The barrier request itself is asynchronous, but any subsequent
      request will block until the barrier completes.
      
      The reason for clearing the barrier flag is that a barrier request is
      allowed to fail.  If we pass a non-empty barrier through a striping
      raid level it is conceivable that part of it could succeed and part
      could fail.  That would be way too hard to deal with.
      So if the first run of zero length barriers succeed, we assume all is
      sufficiently well that we send the request and ignore errors in the
      second run of barriers.
      
      RAID5 needs extra care as write requests may not have been submitted
      to the underlying devices yet.  So we flush the stripe cache before
      proceeding with the barrier.
      
      Note that the second set of zero-length barriers are submitted
      immediately after the original request is submitted.  Thus when
      a personality finds mddev->barrier to be set during make_request,
      it should not return from make_request until the corresponding
      per-device request(s) have been queued.
      
      That will be done in later patches.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      a2826aa9
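The ordering described above can be modelled as a toy Python function that only records steps and performs no I/O. The function name and step strings are hypothetical, and reporting failure when the first run of zero-length barriers fails is an assumption inferred from the statement that a barrier request is allowed to fail.

```python
def handle_barrier(devices, has_payload, first_flush_ok=True):
    """Return (steps, succeeded) for one barrier request in this toy model."""
    # First run: a zero-length barrier to every active device.
    steps = [f"zero-length barrier -> {dev}" for dev in devices]
    if not first_flush_ok:
        # Assumption: fail the request here, before any data is written.
        return steps, False
    if has_payload:
        # The payload goes out with the barrier flag cleared, so a striped
        # write cannot partially fail as a barrier.
        steps.append("submit original request (barrier flag cleared)")
        # Second run; per the commit message, errors from it are ignored.
        steps += [f"zero-length barrier -> {dev}" for dev in devices]
    return steps, True
```

An empty barrier therefore costs one round of zero-length barriers, while a barrier carrying data costs two rounds plus the write itself.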
    • N
      md: don't reset curr_resync_completed after an interrupted resync · efa59339
      NeilBrown committed
      If a resync/recovery/check/repair is interrupted for some reason, it
      can be useful to know exactly where it got up to.
      So in that case, do not clear curr_resync_completed.
      Initialise it when starting a resync/recovery/... instead.
      Signed-off-by: NeilBrown <neilb@suse.de>
      efa59339
    • N
      md: adjust resync_min usefully when resync aborts. · c07b70ad
      NeilBrown committed
      When a 'check' or 'repair' finishes we should clear resync_min
      so that a future check/repair will cover the whole array (by default).
      However if it is interrupted, we should update resync_min to
      where we got up to, so that when the check/repair continues it
      just does the remainder of the array.
      Signed-off-by: NeilBrown <neilb@suse.de>
      c07b70ad
    • N
      md/bitmap: protect against bitmap removal while being updated. · aa5cbd10
      NeilBrown committed
      A write intent bitmap can be removed from an array while the
      array is active.
      When this happens, all IO is suspended and flushed before the
      bitmap is removed.
      However it is possible that bitmap_daemon_work is still running to
      clear old bits from the bitmap.  If it is, it can dereference the
      bitmap after it has been freed.
      
      So introduce a new mutex to protect bitmap_daemon_work and get it
      before destroying a bitmap.
      
      This is suitable for any current -stable kernel.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
      aa5cbd10
  2. 19 Nov, 2009 1 commit
  3. 13 Nov, 2009 1 commit
    • N
      md: allow v0.91 metadata to record devices as being active but not in-sync. · 0261cd9f
      NeilBrown committed
      This is a combination that didn't really make sense before.
      However when a reshape is converting e.g. raid5 -> raid6, the extra
      device is not fully in-sync, but is certainly active and contains
      important data.
      So allow that state to be meaningful and in particular get
      the 'recovery_offset' value (which is needed for any non-in-sync
      active device) from the reshape_position.
      Signed-off-by: NeilBrown <neilb@suse.de>
      0261cd9f
  4. 12 Nov, 2009 2 commits
    • E
      sysctl drivers: Remove dead binary sysctl support · 894d2491
      Eric W. Biederman committed
      Now that sys_sysctl is a wrapper around /proc/sys all of
      the binary sysctl support elsewhere in the tree is
      dead code.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Corey Minyard <minyard@acm.org>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Neil Brown <neilb@suse.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@suse.de>
      Acked-by: Clemens Ladisch <clemens@ladisch.de> for drivers/char/hpet.c
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      894d2491
    • N
      md: factor out updating of 'recovery_offset'. · 5e865106
      NeilBrown committed
      Each device has its own 'recovery_offset' showing how far
      recovery has progressed on the device.
      As the only real significance of this is the fact that it can
      be stored in the metadata and recovered at restart, and as
      only 1.x metadata can do this, we were only updating
      'recovery_offset' to 'curr_resync_completed' when updating
      v1.x metadata.
      But this is wrong, and we will shortly make limited use of this
      field in v0.90 metadata.
      
      So move the update into common code.
      Signed-off-by: NeilBrown <neilb@suse.de>
      5e865106
  5. 06 Nov, 2009 1 commit
    • N
      md: don't clear endpoint for resync when resync is interrupted. · 24395a85
      NeilBrown committed
      If a 'sync_max' has been set (via sysfs), it is wrong to clear it
      until a resync (or reshape or recovery ...) actually reached that
      point.
      So if a resync is interrupted (e.g. by device failure),
      leave 'resync_max' unchanged.
      
      This is particularly important for 'reshape' operations that do not
      change the size of the array.  For such operations mdadm needs to
      monitor the reshape taking rolling backups of the section being
      reshaped.  If resync_max gets cleared, the reshape can get ahead of
      mdadm and then the backups that mdadm creates are useless.
      
      This is suitable for 2.6.31.y stable kernels.
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      24395a85
  6. 16 Oct, 2009 1 commit
    • N
      md: Fix handling of raid5 array which is being reshaped to fewer devices. · 5e5e3e78
      NeilBrown committed
      When a raid5 (or raid6) array is being reshaped to have fewer devices,
      conf->raid_disks is the new, and hence smaller, number of devices.
      However sometimes we want to use a number which is the total number of
      currently required devices - the larger of the 'old' and 'new' sizes.
      Before we implemented reducing the number of devices, this was always
      'new' i.e. ->raid_disks.
      Now we need max(raid_disks, previous_raid_disks) in those places.
      
      This particularly affects assembling an array that was shutdown while
      in the middle of a reshape to fewer devices.
      
      md.c needs a similar fix when interpreting the md metadata.
      Signed-off-by: NeilBrown <neilb@suse.de>
      5e5e3e78
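The rule in this commit reduces to taking the larger of the old and new device counts while a reshape is in flight. A minimal Python sketch, with a hypothetical function name:

```python
def devices_currently_required(raid_disks: int, previous_raid_disks: int) -> int:
    """Number of devices that must be present while a reshape is in progress.

    raid_disks is the 'new' count and previous_raid_disks the 'old' one.
    When growing, the new count is larger; when shrinking, the old one is.
    """
    return max(raid_disks, previous_raid_disks)
```

For a reshape from 5 devices down to 4, all 5 are still required until the reshape completes, which is what matters when assembling an array that was shut down mid-reshape.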
  7. 23 Sep, 2009 3 commits
  8. 22 Sep, 2009 1 commit
  9. 18 Aug, 2009 1 commit
  10. 13 Aug, 2009 2 commits
    • N
      md: allow upper limit for resync/reshape to be set when array is read-only · 4d484a4a
      NeilBrown committed
      Normally we only allow the upper limit for a reshape to be decreased
      when the array is not performing a sync/recovery/reshape, otherwise there
      could be races.  But if an array is part-way through a reshape when it
      is assembled the reshape is started immediately leaving no window
      to set an upper bound.
      
      If the array is started read-only, the reshape will be suspended until
      the array becomes writable, so that provides a window during which it
      is perfectly safe to reduce the upper limit of a reshape.
      
      So: allow the upper limit (sync_max) to be reduced even if the reshape
      thread is running, as long as the array is still read-only.
      Signed-off-by: NeilBrown <neilb@suse.de>
      4d484a4a
    • N
      md: never advance 'events' counter by more than 1. · 51d5668c
      NeilBrown committed
      When assembling arrays, md allows two devices to have different event
      counts as long as the difference is only '1'.  This is to cope with
      a system failure between updating the metadata on two different
      devices.
      
      However there are currently times when we update the event count by
      2.  This was done to keep the event count even when the array is clean
      and odd when it is dirty, which allows us to avoid writing common
      updates to spare devices and so allow those spares to go to sleep.
      
      This is bad for the above reason.  So change it to never increase by
      two.  This means that the alignment between 'odd/even' and
      'clean/dirty' might take a little longer to attain, but that is only a
      small cost.  The spares will get a few more updates but they will
      still be spared (;-) most updates and can still go to sleep.
      
      Prior to this patch there was a small chance that after a crash an
      array would fail to assemble due to the overly large event count
      mismatch.
      Signed-off-by: NeilBrown <neilb@suse.de>
      51d5668c
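The problem can be seen in a simplified Python model. The function names are hypothetical, and `next_events_old` is only an assumed reconstruction of "sometimes advance by 2 to keep even/clean, odd/dirty parity"; the real kernel logic has more cases.

```python
def can_assemble_together(events_a: int, events_b: int) -> bool:
    """md tolerates an event-count difference of at most 1 between devices."""
    return abs(events_a - events_b) <= 1


def next_events_old(events: int, clean: bool) -> int:
    """Assumed old behaviour: bump, then bump again to force the parity."""
    events += 1
    if (events % 2 == 0) != clean:
        events += 1
    return events


def next_events_new(events: int) -> int:
    """New behaviour: never advance by more than 1."""
    return events + 1
```

If a crash leaves one device at 10 while another was already written with `next_events_old(10, clean=True)`, the counts differ by 2 and `can_assemble_together` returns False; with `next_events_new` the difference stays within the tolerated 1.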
  11. 10 Aug, 2009 1 commit
    • N
      Remove deadlock potential in md_open · c8c00a69
      NeilBrown committed
      A recent commit:
        commit 449aad3e
      
      introduced the possibility of an A-B/B-A deadlock between
      bd_mutex and reconfig_mutex.
      
      __blkdev_get holds bd_mutex while calling md_open which takes
         reconfig_mutex,
      do_md_run is always called with reconfig_mutex held, and it now
         takes bd_mutex in the call to revalidate_disk.
      
      This potential deadlock was not caught by lockdep due to the
      use of mutex_lock_interruptible_nested which was introduced
      by
         commit d63a5a74
      to avoid a warning of an impossible deadlock.
      
      It is quite possible to split reconfig_mutex into two locks.
      One protects the array data structures while it is being
      reconfigured, the other ensures that an array is never even partially
      open while it is being deactivated.
      In particular, the second lock prevents an open from completing
      between the time when do_md_stop checks if there are any active opens,
      and the time when the array is either set read-only, or when ->pers is
      set to NULL.  So we can be certain that no IO is in flight as the
      array is being destroyed.
      
      So create a new lock, open_mutex, just to ensure exclusion between
      'open' and 'stop'.
      
      This avoids the deadlock and also avoids the lockdep warning mentioned
      in commit d63a5a74
      Reported-by: "Mike Snitzer" <snitzer@gmail.com>
      Reported-by: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      c8c00a69
  12. 03 Aug, 2009 5 commits
    • N
      md: Use revalidate_disk to effect changes in size of device. · 449aad3e
      NeilBrown committed
      As revalidate_disk calls check_disk_size_change, it will cause
      any capacity change of a gendisk to be propagated to the blockdev
      inode.  So use that instead of mucking about with locks and
      i_size_write.
      
      Also add a call to revalidate_disk in do_md_run and a few other places
      where the gendisk capacity is changed.
      Signed-off-by: NeilBrown <neilb@suse.de>
      449aad3e
    • N
      md: Handle growth of v1.x metadata correctly. · 70471daf
      NeilBrown committed
      The v1.x metadata does not have a fixed size and can grow
      when devices are added.
      If it grows enough to require an extra sector of storage,
      we need to update the 'sb_size' to match.
      
      Without this, md can write out an incomplete superblock with a
      bad checksum, which will be rejected when trying to re-assemble
      the array.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      70471daf
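To see why adding devices can push the superblock into an extra sector, consider this Python sketch. It assumes a v1.x layout of a 256-byte fixed header followed by one 2-byte role entry per device slot, rounded up to the device's logical block size; those constants and the function name are assumptions for illustration, not taken from the commit message.

```python
def sb_size_bytes(max_dev: int, logical_block_size: int = 512) -> int:
    """Bytes needed to write a v1.x superblock with max_dev role slots."""
    size = 256 + 2 * max_dev          # assumed header + per-device roles
    mask = logical_block_size - 1     # block size is a power of two
    if size & mask:
        size = (size | mask) + 1      # round up to a whole block
    return size
```

Under these assumptions 128 slots fit exactly in one 512-byte sector, while a 129th slot requires 1024 bytes; writing only the old, smaller size would leave the superblock truncated and its checksum wrong.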
    • N
      md: avoid array overflow with bad v1.x metadata · 3673f305
      NeilBrown committed
      We trust the 'desc_nr' field in v1.x metadata enough to use it
      as an index in an array.  This isn't really safe.
      So range-check the value first.
      Signed-off-by: NeilBrown <neilb@suse.de>
      3673f305
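The range check amounts to validating an untrusted index before using it. A minimal Python sketch with hypothetical names; returning `None` for an out-of-range value is an illustrative choice, not the kernel's behaviour:

```python
def lookup_role(dev_roles, desc_nr):
    """Return the role stored for desc_nr, or None if desc_nr is out of range.

    desc_nr comes from on-disk metadata, so it must not be trusted.
    """
    if desc_nr < 0 or desc_nr >= len(dev_roles):
        return None
    return dev_roles[desc_nr]
```

The explicit lower bound matters even in Python, where a negative index would otherwise silently wrap around to the end of the list instead of failing.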
    • N
      md: when a level change reduces the number of devices, remove the excess. · 3a981b03
      NeilBrown committed
      When an array is changed from RAID6 to RAID5, fewer drives are
      needed.  So any device that is made superfluous by the level
      conversion must be marked as not-active.
      For the RAID6->RAID5 conversion, this will be a drive which only
      has 'Q' blocks on it.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      3a981b03
    • A
      md: Push down data integrity code to personalities. · ac5e7113
      Andre Noll committed
      This patch replaces md_integrity_check() by two new public functions:
      md_integrity_register() and md_integrity_add_rdev() which are both
      personality-independent.
      
      md_integrity_register() is called from the ->run and ->hot_remove
      methods of all personalities that support data integrity.  The
      function iterates over the component devices of the array and
      determines if all active devices are integrity capable and if their
      profiles match. If this is the case, the common profile is registered
      for the mddev via blk_integrity_register().
      
      The second new function, md_integrity_add_rdev() is called from the
      ->hot_add_disk methods, i.e. whenever a new device is being added
      to a raid array. If the new device does not support data integrity,
      or has a profile different from the one already registered, data
      integrity for the mddev is disabled.
      
      For raid0 and linear, only the call to md_integrity_register() from
      the ->run method is necessary.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      ac5e7113
  13. 09 Jul, 2009 1 commit
  14. 01 Jul, 2009 4 commits
  15. 18 Jun, 2009 7 commits
    • A
      md: Move check for bitmap presence to personality code. · 0894cc30
      Andre Noll committed
      If the superblock of a component device indicates the presence of a
      bitmap but the corresponding raid personality does not support bitmaps
      (raid0, linear, multipath, faulty), then something is seriously wrong
      and we'd better refuse to run such an array.
      
      Currently, this check is performed while the superblocks are examined,
      i.e. before entering personality code. Therefore the generic md layer
      must know which raid levels support bitmaps and which do not.
      
      This patch avoids this layer violation without adding identical code
      to various personalities. This is accomplished by introducing a new
      public function to md.c, md_check_no_bitmap(), which replaces the
      hard-coded checks in the superblock loading functions.
      
      A call to md_check_no_bitmap() is added to the ->run method of each
      personality which does not support bitmaps and assembly is aborted
      if at least one component device contains a bitmap.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      0894cc30
    • N
      md: remove chunksize rounding from common code. · 8190e754
      NeilBrown committed
      It is easiest to round sizes to multiples of chunk size in
      the personality code for those personalities which care.
      Those personalities now do the rounding, so we can
      remove that function from common code.
      
      Also remove the upper bound on the size of a chunk, and the lower
      bound on the size of a device (1 chunk), neither of which really buy
      us anything.
      Signed-off-by: NeilBrown <neilb@suse.de>
      8190e754
    • N
      md: move assignment of ->utime so that it never gets skipped. · 1b57f132
      NeilBrown committed
      Currently the assignment to utime gets skipped for 'external'
      metadata.  So move it to the top of the function so that it
      is always performed.
      This is of largely cosmetic interest.  Nothing actually depends
      on ->utime being right for external arrays.
      "mdadm --monitor" does use it for 0.90 and 1.x arrays, but with
      mdadm-3.0, this is not important for external metadata.
      Signed-off-by: NeilBrown <neilb@suse.de>
      1b57f132
    • A
      md: Push down reconstruction log message to personality code. · 8c6ac868
      Andre Noll committed
      Currently, the md layer checks in analyze_sbs() if the raid level
      supports reconstruction (mddev->level >= 1) and if reconstruction is
      in progress (mddev->recovery_cp != MaxSector).
      
      Move that printk into the personality code of those raid levels that
      care (levels 1, 4, 5, 6, 10).
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      8c6ac868
    • N
      md: merge reconfig and check_reshape methods. · 50ac168a
      NeilBrown committed
      The difference between these two methods is artificial.
      Both check that a pending reshape is valid, and perform any
      aspect of it that can be done immediately.
      'reconfig' handles chunk size and layout.
      'check_reshape' handles raid_disks.
      
      So make them just one method.
      Signed-off-by: NeilBrown <neilb@suse.de>
      50ac168a
    • N
      md: remove unnecessary arguments from ->reconfig method. · 597a711b
      NeilBrown committed
      Passing the new layout and chunksize as args is not necessary as
      the mddev has fields for new_chunk and new_layout.
      
      This is preparation for combining the check_reshape and reconfig
      methods.
      Signed-off-by: NeilBrown <neilb@suse.de>
      597a711b
    • A
      md: Convert mddev->new_chunk to sectors. · 664e7c41
      Andre Noll committed
      A straight-forward conversion which gets rid of some
      multiplications/divisions/shifts. The patch also introduces a couple
      of new ones, most of which are due to conf->chunk_size still being
      represented in bytes. This will be cleaned up in subsequent patches.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      664e7c41
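The unit conversion involved can be sketched in Python, assuming the usual 512-byte sector (a shift of 9 bits). The helper names are hypothetical; they show the kind of shift that remains wherever one value is still kept in bytes and the other in sectors.

```python
SECTOR_SHIFT = 9  # assumption: 512-byte sectors


def chunk_bytes_to_sectors(chunk_bytes: int) -> int:
    """Convert a chunk size in bytes to 512-byte sectors."""
    if chunk_bytes % (1 << SECTOR_SHIFT):
        raise ValueError("chunk size is not a whole number of sectors")
    return chunk_bytes >> SECTOR_SHIFT


def chunk_sectors_to_bytes(chunk_sectors: int) -> int:
    """Convert a chunk size in sectors back to bytes."""
    return chunk_sectors << SECTOR_SHIFT
```

A 64 KiB chunk, for instance, is 128 sectors.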