1. 23 December 2011, 6 commits
    • md/raid5: preferentially read from replacement device if possible. · 14a75d3e
      NeilBrown committed
      If a replacement device is present and has been recovered far enough,
      then use it for reading into the stripe cache.
      
      If we get an error we don't try to repair it, we just fail the device.
      A replacement device that gives errors does not sound sensible.
      
      This requires removing the setting of R5_ReadError when we get
      a read error during a read that bypasses the cache.  It was probably
      a bad idea anyway as we don't know that every block in the read
      caused an error, and it could cause ReadError to be set for the
      replacement device, which is bad.
      Signed-off-by: NeilBrown <neilb@suse.de>
      14a75d3e
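      A minimal self-contained sketch of the selection described above; the types
      and names (including STRIPE_SECTORS and recovery_offset) are simplified
      stand-ins modelled on md conventions, not the kernel code:

          #include <stdbool.h>

          #define STRIPE_SECTORS 8ULL   /* one 4KiB stripe page, in 512-byte sectors */

          struct dev_model {
              bool present;
              bool faulty;
              unsigned long long recovery_offset;   /* recovered up to this sector */
          };

          struct slot_model {
              struct dev_model rdev;          /* original device in this slot */
              struct dev_model replacement;   /* optional replacement device */
          };

          /* Prefer the replacement for a stripe-cache read once its recovery
           * has passed the end of this stripe; otherwise read the original. */
          static const struct dev_model *read_source(const struct slot_model *slot,
                                                     unsigned long long sector)
          {
              const struct dev_model *rep = &slot->replacement;

              if (rep->present && !rep->faulty &&
                  rep->recovery_offset >= sector + STRIPE_SECTORS)
                  return rep;
              return &slot->rdev;
          }
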
    • md/raid5: remove redundant bio initialisations. · 995c4275
      NeilBrown committed
      We currently initialise some fields of a bio when preparing a
      stripe_head, and again just before submitting the request.
      
      Remove the duplication by only setting the fields that lower level
      devices don't touch in raid5_build_block, and only setting the changeable
      fields in ops_run_io.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      995c4275
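      Roughly, the split looks like the sketch below; the field names echo the bio
      fields of that era (bi_io_vec, bi_vcnt, bi_sector, bi_bdev, bi_rw, bi_end_io)
      but the struct is a stand-in, and exactly which fields land in which helper
      differs in detail in the real driver:

          /* Stand-in bio: just enough fields to show the two groups. */
          struct bio_model {
              /* set once, when the stripe_head is built */
              void *bi_io_vec;
              int   bi_vcnt;
              /* reset on every submission, because the request differs and
               * lower layers may have modified them */
              unsigned long long bi_sector;
              void *bi_bdev;
              unsigned long bi_rw;
              void (*bi_end_io)(struct bio_model *, int);
          };

          /* raid5_build_block-style setup: only the invariant fields. */
          static void build_block(struct bio_model *b, void *vec)
          {
              b->bi_io_vec = vec;
              b->bi_vcnt = 1;
          }

          /* ops_run_io-style setup: only the per-request fields. */
          static void prep_submit(struct bio_model *b, unsigned long long sector,
                                  void *bdev, unsigned long rw,
                                  void (*end_io)(struct bio_model *, int))
          {
              b->bi_sector = sector;
              b->bi_bdev = bdev;
              b->bi_rw = rw;
              b->bi_end_io = end_io;
          }
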
    • md/raid5: allow each slot to have an extra replacement device · 671488cc
      NeilBrown committed
      Just enhance data structures to record a second device per slot to be
      used as a 'replacement' device, replacing the original.
      We also have a second bio in each slot in each stripe_head.  This will
      only be used when writing to the array - we need to write to both the
      original and the replacement at the same time, so will need two bios.
      
      For now, only try using the replacement drive for aligned-reads.
      In this case, we prefer the replacement if it has been recovered far
      enough, otherwise use the original.
      
      This includes a small enhancement.  Previously we would only do
      aligned reads if the target device was fully recovered.  Now we also
      do them if it has recovered far enough.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      671488cc
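      The shape of the change, as a sketch with stand-in declarations rather than
      the actual drivers/md/raid5.h definitions:

          struct md_rdev;   /* a member device (stand-in declaration) */
          struct bio;

          /* Each slot can now hold two devices ... */
          struct disk_info_sketch {
              struct md_rdev *rdev;          /* the original device */
              struct md_rdev *replacement;   /* device being rebuilt to replace it */
          };

          /* ... and each per-device part of a stripe_head carries two bios,
           * so a write can go to the original and the replacement at once. */
          struct r5dev_sketch {
              struct bio *req;    /* I/O to the original device */
              struct bio *rreq;   /* I/O to the replacement (writes only) */
              /* pages, flags, etc. omitted */
          };
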
    • md: change hot_remove_disk to take an rdev rather than a number. · b8321b68
      NeilBrown committed
      Soon an array will be able to have multiple devices with the
      same raid_disk number (an original and a replacement).  So removing
      a device based on the number won't work.  So pass the actual device
      handle instead.
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      b8321b68
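      A sketch of the interface change (stand-in types; the kernel prototypes may
      differ in detail):

          struct mddev;
          struct md_rdev;

          /* Before: remove by slot number -- ambiguous once a slot can hold
           * both an original and a replacement device. */
          int hot_remove_disk_old(struct mddev *mddev, int number);

          /* After: pass the device itself, so exactly that rdev is removed. */
          int hot_remove_disk_new(struct mddev *mddev, struct md_rdev *rdev);
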
    • md/raid5: be more thorough in calculating 'degraded' value. · 908f4fbd
      NeilBrown committed
      When an array is being reshaped to change the number of devices,
      the two halves can be differently degraded.  e.g. one could be
      missing a device and the other not.
      
      So we need to be more careful about calculating the 'degraded'
      attribute.
      
      Instead of just inc/dec at appropriate times, perform a full
      re-calculation examining both possible cases.  This doesn't happen
      often so it is not a big cost, and we already have most of the code to
      do it.
      Signed-off-by: NeilBrown <neilb@suse.de>
      908f4fbd
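      A self-contained model of the "recalculate both halves" idea; the real
      calculation also folds in recovery progress relative to the reshape
      position, which is omitted here:

          #include <stdbool.h>

          struct dev_state {
              bool present;
              bool in_sync;
          };

          static int count_degraded(const struct dev_state *devs, int raid_disks)
          {
              int i, degraded = 0;

              for (i = 0; i < raid_disks; i++)
                  if (!devs[i].present || !devs[i].in_sync)
                      degraded++;
              return degraded;
          }

          /* During a reshape the array has two geometries at once.  Each can
           * be degraded differently, so recompute both and report the worse
           * of the two instead of keeping a running inc/dec counter. */
          static int calc_degraded_model(const struct dev_state *devs,
                                         int prev_raid_disks, int raid_disks)
          {
              int before = count_degraded(devs, prev_raid_disks);
              int after  = count_degraded(devs, raid_disks);

              return before > after ? before : after;
          }
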
    • md/raid5: ensure correct assessment of drives during degraded reshape. · 30d7a483
      NeilBrown committed
      While reshaping a degraded array (as when reshaping a RAID0 by first
      converting it to a degraded RAID4) we currently get confused about
      which devices are in_sync.  In most cases we get it right, but in the
      region that is being reshaped we need to treat non-failed devices as
      in-sync when we have the data but haven't actually written it out yet.
      Reported-by: Adam Kwolek <adam.kwolek@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      30d7a483
  2. 09 December 2011, 1 commit
  3. 08 December 2011, 1 commit
    • md/raid5: never wait for bad-block acks on failed device. · 9283d8c5
      NeilBrown committed
      Once a device is failed we really want to completely ignore it.
      It should go away soon anyway.
      
      In particular the presence of bad blocks on it should not cause us to
      block as we won't be trying to write there anyway.
      
      So as soon as we can check if a device is Faulty, do so and pretend
      that it is already gone if it is Faulty.
      Signed-off-by: NeilBrown <neilb@suse.de>
      9283d8c5
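      In sketch form the change is just an early Faulty check (stand-in types and
      names, not the kernel code):

          #include <stdbool.h>

          struct dev_bb_state {
              bool faulty;
              bool unacked_badblocks;   /* bad blocks not yet recorded in metadata */
          };

          /* Should a write wait for this device's bad-block list to be
           * acknowledged?  Not if the device is already Faulty: we will not
           * write to it, and it is about to be removed anyway. */
          static bool must_wait_for_bb_ack(const struct dev_bb_state *d)
          {
              if (d->faulty)
                  return false;
              return d->unacked_badblocks;
          }
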
  4. 08 November 2011, 2 commits
  5. 01 November 2011, 1 commit
  6. 26 October 2011, 2 commits
    • md: Fix some bugs in recovery_disabled handling. · d890fa2b
      NeilBrown committed
      In 3.0 we changed the way recovery_disabled was handled so that instead
      of testing against zero, we test an mddev-> value against a conf->
      value.
      Two problems:
        1/ one place in raid1 was missed and still sets it to '1'.
        2/ We didn't explicitly set the conf-> value at array creation
           time.
           It defaulted to '0' just like the mddev value does so they
           could appear equal and thus disable recovery.
           This did not affect normal 'md' as it calls bind_rdev_to_array
           which changes the mddev value.  However the dmraid interface
           doesn't call this and so doesn't change ->recovery_disabled; so at
           array start all recovery is incorrectly disabled.
      
      So initialise the 'conf' value to one less than the mddev value, so
      they will only be the same when explicitly set that way.
      Reported-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      d890fa2b
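      The fix itself amounts to one assignment at conf setup time; sketched here
      with stand-in types:

          struct mddev_sketch  { int recovery_disabled; };
          struct r5conf_sketch { int recovery_disabled; };

          /* Start the conf-> value one below the mddev-> value, so the two
           * only compare equal after recovery is explicitly disabled -- even
           * on paths (such as dm-raid) that never call bind_rdev_to_array. */
          static void init_recovery_disabled(struct r5conf_sketch *conf,
                                             const struct mddev_sketch *mddev)
          {
              conf->recovery_disabled = mddev->recovery_disabled - 1;
          }
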
    • md/raid5: fix bug that could result in reads from a failed device. · 355840e7
      NeilBrown committed
      This bug was introduced in 415e72d0
      which was in 2.6.36.
      
      There is a small window of time between when a device fails and when
      it is removed from the array.  During this time we might still read
      from it, but we won't write to it - so it is possible that we could
      read stale data.
      
      We didn't need the test of 'Faulty' before because the test on
      In_sync was sufficient.  Since we started allowing reads from the early
      part of non-In_sync devices we need a test on Faulty too.
      
      This is suitable for any kernel from 2.6.36 onwards, though the patch
      might need a bit of tweaking in 3.0 and earlier.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      355840e7
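      A sketch of the resulting read-eligibility test (stand-in struct; the kernel
      checks rdev flags and recovery_offset):

          #include <stdbool.h>

          struct read_candidate {
              bool present;
              bool faulty;
              bool in_sync;
              unsigned long long recovery_offset;   /* recovered up to this sector */
          };

          /* A fully In_sync device is readable, and a recovering device is
           * readable for stripes its recovery has already passed -- but only
           * if it has not been marked Faulty in the meantime, because a
           * failed device may hold stale data in the window before removal. */
          static bool can_read_stripe_from(const struct read_candidate *d,
                                           unsigned long long sector,
                                           unsigned long long stripe_sectors)
          {
              if (!d->present || d->faulty)
                  return false;
              if (d->in_sync)
                  return true;
              return d->recovery_offset >= sector + stripe_sectors;
          }
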
  7. 11 October 2011, 5 commits
  8. 07 October 2011, 3 commits
  9. 21 September 2011, 1 commit
    • md: Avoid waking up a thread after it has been freed. · 01f96c0a
      NeilBrown committed
      Two related problems:
      
      1/ some error paths call "md_unregister_thread(mddev->thread)"
         without subsequently clearing ->thread.  A subsequent call
         to mddev_unlock will try to wake the thread, and crash.
      
      2/ Most calls to md_wakeup_thread are protected against the thread
         disappearing, either by:
            - holding the ->mutex
            - having an active request, so something else must be keeping
              the array active.
         However mddev_unlock calls md_wakeup_thread after dropping the
         mutex and without any certainty of an active request, so the
         ->thread could theoretically disappear.
         So we need a spinlock to provide some protection.
      
      So change md_unregister_thread to take a pointer to the thread
      pointer, and ensure that it always does the required locking, and
      clears the pointer properly.
      Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      cc: stable@kernel.org
      01f96c0a
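      A sketch of the new calling convention (stand-in types; the real function
      does the pointer swap under a spinlock):

          struct md_thread;                           /* stand-in declaration */
          void stop_and_free(struct md_thread *t);    /* stand-in for kthread_stop + kfree */

          /* The caller passes the address of its pointer; the pointer is
           * cleared before the thread is torn down, so a later wakeup sees
           * NULL rather than a freed thread. */
          static void md_unregister_thread_sketch(struct md_thread **threadp)
          {
              struct md_thread *thread = *threadp;

              *threadp = NULL;                        /* locking elided here */
              if (thread)
                  stop_and_free(thread);
          }
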
  10. 12 September 2011, 1 commit
  11. 31 August 2011, 1 commit
    • md/raid5: fix a hang on device failure. · 43220aa0
      NeilBrown committed
      Waiting for a 'blocked' rdev to become unblocked in the raid5d thread
      cannot work with internal metadata as it is the raid5d thread which
      will clear the blocked flag.
      This wasn't a problem in 3.0 and earlier, as we only set the blocked
      flag when external metadata was used.
      However we now set it always, so we need to be more careful.
      Signed-off-by: NeilBrown <neilb@suse.de>
      43220aa0
  12. 28 July 2011, 7 commits
    • md/raid5: Clear bad blocks on successful write. · b84db560
      NeilBrown committed
      On a successful write to a known bad block, flag the sh
      so that raid5d can remove the known bad block from the list.
      Signed-off-by: NeilBrown <neilb@suse.de>
      b84db560
    • md/raid5. Don't write to known bad block on doubtful devices. · 73e92e51
      NeilBrown committed
      If a device has seen write errors, don't write to any known
      bad blocks on that device.
      Signed-off-by: NeilBrown <neilb@suse.de>
      73e92e51
    • md/raid5: write errors should be recorded as bad blocks if possible. · bc2607f3
      NeilBrown committed
      When a write error is detected, don't mark the device as failed
      immediately but rather record the fact for handle_stripe to deal with.
      
      Handle_stripe then attempts to record a bad block.  Only if that fails
      does the device get marked as faulty.
      Signed-off-by: NeilBrown <neilb@suse.de>
      bc2607f3
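      The decision in sketch form (stand-in helpers; the kernel records the error
      in a per-device flag that handle_stripe later acts on):

          #include <stdbool.h>

          struct md_rdev_sketch;
          /* returns true if the range was recorded in the bad-block log */
          bool record_badblock(struct md_rdev_sketch *rdev,
                               unsigned long long sector,
                               unsigned long long sectors);
          void fail_device(struct md_rdev_sketch *rdev);

          /* A write error no longer fails the device immediately: first try
           * to remember the failed range as a bad block, and only fail the
           * device if that is not possible. */
          static void handle_write_error(struct md_rdev_sketch *rdev,
                                         unsigned long long sector,
                                         unsigned long long sectors)
          {
              if (!record_badblock(rdev, sector, sectors))
                  fail_device(rdev);
          }
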
    • md/raid5: use bad-block log to improve handling of uncorrectable read errors. · 7f0da59b
      NeilBrown committed
      If we get an uncorrectable read error - record a bad block rather than
      failing the device.
      And if these errors (which may be due to known bad blocks) cause
      recovery to be impossible, record a bad block on the recovering
      devices, or abort the recovery.
      
      As we might abort a recovery without failing a device we need to teach
      RAID5 about recovery_disabled handling.
      Signed-off-by: NeilBrown <neilb@suse.de>
      7f0da59b
    • md/raid5: avoid reading from known bad blocks. · 31c176ec
      NeilBrown committed
      There are two times that we might read in raid5:
      1/ when a read request fits within a chunk on a single
         working device.
         In this case, if there is any bad block in the range of
         the read, we simply fail the cache-bypass read and
         perform the read through the stripe cache.
      
      2/ when reading into the stripe cache.  In this case we
         mark as failed any device which has a bad block in that
         strip (1 page wide).
         Note that we will both avoid reading and avoid writing.
         This is correct (as we will never read from the block, there
         is no point writing), but not optimal (as writing could 'fix'
         the error) - that will be addressed later.
      
      If we have not seen any write errors on the device yet, we treat a bad
      block like a recent read error.  This will encourage an attempt to fix
      the read error which will either generate a write error, or will
      ensure good data is stored there.  We don't yet forget the bad block
      in that case.  That comes later.
      
      Now that we honour bad blocks when reading we can allow devices with
      bad blocks into the array.
      Signed-off-by: NeilBrown <neilb@suse.de>
      31c176ec
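      Both read paths reduce to an overlap check against the bad-block list; a
      sketch with stand-in helpers:

          #include <stdbool.h>

          struct md_rdev_sk;
          /* stand-in: does [sector, sector+sectors) overlap a recorded bad block? */
          bool range_has_badblock(struct md_rdev_sk *rdev,
                                  unsigned long long sector,
                                  unsigned long long sectors);

          /* Path 1: a read that would bypass the stripe cache.  Any bad block
           * in the range means falling back to the stripe cache, where the
           * missing data can be reconstructed from the other devices. */
          static bool can_bypass_cache(struct md_rdev_sk *rdev,
                                       unsigned long long sector,
                                       unsigned long long sectors)
          {
              return !range_has_badblock(rdev, sector, sectors);
          }

          /* Path 2: reading into the stripe cache.  A device with a bad block
           * in this one-page strip is treated as failed for this stripe, so
           * it is neither read from nor written to. */
          static bool usable_for_stripe(struct md_rdev_sk *rdev,
                                        unsigned long long stripe_sector,
                                        unsigned long long stripe_sectors)
          {
              return !range_has_badblock(rdev, stripe_sector, stripe_sectors);
          }
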
    • md: make it easier to wait for bad blocks to be acknowledged. · de393cde
      NeilBrown committed
      It is only safe to choose not to write to a bad block if that bad
      block is safely recorded in metadata - i.e. if it has been
      'acknowledged'.
      
      If it hasn't we need to wait for the acknowledgement.
      
      We support that using rdev->blocked wait and
      md_wait_for_blocked_rdev by introducing a new device flag
      'BlockedBadBlocks'.
      
      This flag is only advisory.
      It is cleared whenever we acknowledge a bad block, so that a waiter
      can re-check the particular bad blocks that it is interested in.
      
      It should be set by a caller when they find they need to wait.
      This (set after test) is inherently racy, but as
      md_wait_for_blocked_rdev already has a timeout, losing the race will
      have minimal impact.
      
      When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
      was set incorrectly (see above race).
      
      We also modify the way we manage 'Blocked' to fit better with the new
      handling of 'BlockedBadBlocks' and to make it consistent between
      externally managed and internally managed metadata.   This requires
      that each raidXd loop checks if the metadata needs to be written and
      triggers a write (md_check_recovery) if needed.  Otherwise a queued
      write request might cause raidXd to wait for the metadata to write,
      and only that thread can write it.
      
      Before writing metadata, we set FaultRecorded for all devices that
      are Faulty, then after writing the metadata we clear Blocked for any
      device for which the Fault was certainly Recorded.
      
      The 'faulty' device flag now appears in sysfs if the device is faulty
      *or* it has unacknowledged bad blocks.  So user-space which does not
      understand bad blocks can continue to function correctly.
      User space which does, should not assume a device is faulty until it
      sees the 'faulty' flag, and then sees the list of unacknowledged bad
      blocks is empty.
      Signed-off-by: NeilBrown <neilb@suse.de>
      de393cde
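      The set-after-test wait pattern described above, in sketch form (stand-in
      types; the real code uses rdev flag bits and md_wait_for_blocked_rdev, which
      already has a timeout):

          #include <stdbool.h>

          struct rdev_sketch {
              bool blocked_badblocks;   /* advisory "waiting for an ack" flag */
          };
          /* stand-ins for the bad-block lookup and the (timed) wait */
          bool badblock_unacked(struct rdev_sketch *rdev,
                                unsigned long long sector,
                                unsigned long long sectors);
          void wait_until_unblocked(struct rdev_sketch *rdev);

          /* A writer that would hit an unacknowledged bad block sets the
           * advisory flag and waits.  The flag is cleared whenever a bad
           * block is acknowledged, so after waking we re-check the specific
           * range; losing the set-after-test race only costs a short delay
           * because the wait times out. */
          static void wait_for_badblock_ack(struct rdev_sketch *rdev,
                                            unsigned long long sector,
                                            unsigned long long sectors)
          {
              while (badblock_unacked(rdev, sector, sectors)) {
                  rdev->blocked_badblocks = true;
                  wait_until_unblocked(rdev);
              }
          }
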
    • md: don't allow arrays to contain devices with bad blocks. · 34b343cf
      NeilBrown committed
      As no personality understands bad block lists yet, we must
      reject any device that is known to contain bad blocks.
      As the personalities get taught, these tests can be removed.
      
      This only applies to raid1/raid5/raid10.
      For linear/raid0/multipath/faulty the whole concept of bad blocks
      doesn't mean anything so there is no point adding the checks.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
      34b343cf
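      The check is as simple as it sounds; a sketch with a stand-in bad-block
      list summary:

          struct badblocks_sketch { int count; };   /* stand-in list summary */

          /* A personality that does not yet understand bad blocks refuses
           * any incoming device whose list is non-empty. */
          static int validate_new_device(const struct badblocks_sketch *bb)
          {
              return bb->count ? -1 /* would be -EINVAL in the kernel */ : 0;
          }
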
  13. 27 July 2011, 9 commits