1. 25 Apr 2008 (7 commits)
  2. 22 Apr 2008 (1 commit)
  3. 11 Apr 2008 (1 commit)
  4. 29 Mar 2008 (2 commits)
  5. 20 Mar 2008 (2 commits)
  6. 11 Mar 2008 (2 commits)
  7. 05 Mar 2008 (9 commits)
    • md: the md RAID10 resync thread could cause a md RAID10 array deadlock · a07e6ab4
      Committed by K.Tanaka
      This message describes another issue with md RAID10, found by testing the
      2.6.24 md RAID10 using the new scsi fault injection framework.
      
      Abstract:
      
      When a scsi error results in disabling a disk during RAID10 recovery, the
      resync threads of md RAID10 could stall.
      
      In this case, the raid array has already been broken and it may not matter.  But
      I think a stall is not preferable.  If it occurs, even a shutdown or reboot will
      fail because the resources are busy.
      
      The deadlock mechanism:
      
      The r10bio_s structure has a "remaining" member to keep track of BIOs yet to
      be handled when recovering.  The "remaining" counter is incremented when
      building a BIO in sync_request() and is decremented when a BIO is finished in
      end_sync_write().
      
      If building a BIO fails for some reason in sync_request(), the "remaining"
      should be decremented if it has already been incremented.  I found a case
      where this decrement is forgotten.  This causes a md_do_sync() deadlock
      because md_do_sync() waits for md_done_sync() called by end_sync_write(), but
      end_sync_write() never calls md_done_sync() because of the "remaining" counter
      mismatch.
      
      For example, this problem would be reproduced in the following case:
      
      Personalities : [raid10]
      md0 : active raid10 sdf1[4] sde1[5](F) sdd1[2] sdc1[1] sdb1[6](F)
            3919616 blocks 64K chunks 2 near-copies [4/2] [_UU_]
            [>....................]  recovery =  2.2% (45376/1959808) finish=0.7min speed=45376K/sec
      
      In this case, sdf1 is recovering, and sdb1 and sde1 are disabled.
      An additional error that detaches sdd will cause a deadlock.
      
      md0 : active raid10 sdf1[4] sde1[5](F) sdd1[6](F) sdc1[1] sdb1[7](F)
            3919616 blocks 64K chunks 2 near-copies [4/1] [_U__]
            [=>...................]  recovery =  5.0% (99520/1959808) finish=5.9min speed=5237K/sec
      
       2739 ?        S<     0:17 [md0_raid10]
      28608 ?        D<     0:00 [md0_resync]
      28629 pts/1    Ss     0:00 bash
      28830 pts/1    R+     0:00 ps ax
      31819 ?        D<     0:00 [kjournald]
      
      The resync thread appears to keep running, but it is actually deadlocked.
      
      Patch:
      With this patch, the "remaining" counter is decremented when needed.
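      A minimal sketch of the idea, assuming the field and helper names of
      drivers/md/raid10.c (an illustration, not the literal patch):

        /* Sketch: if a resync BIO cannot be built after "remaining" was
         * already incremented, undo the increment so the counter can
         * still reach zero and md_done_sync() gets called. */
        if (atomic_dec_and_test(&r10_bio->remaining)) {
                md_done_sync(mddev, r10_bio->sectors, 0);
                put_buf(r10_bio);
        }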
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a07e6ab4
    • md: fix possible raid1/raid10 deadlock on read error during resync · 1c830532
      Committed by NeilBrown
      Thanks to K.Tanaka and the scsi fault injection framework, here is a fix for
      another possible deadlock in raid1/raid10 error handling.
      
      If a read request returns an error while a resync is happening and a resync
      request is pending, the attempt to fix the error will block until the resync
      progresses, and the resync will block until the read request completes.  Thus
      a deadlock.
      
      This patch fixes the problem.
      
      Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c830532
    • md: don't attempt read-balancing for raid10 'far' layouts · 8ed3a195
      Committed by Keld Simonsen
      This patch changes the disk to be read for layout "far > 1" to always be the
      disk with the lowest block address.
      
      Thus the chunks to be read will always be (for a fully functioning array) from
      the first band of stripes, and the raid will then work as a raid0 consisting
      of the first band of stripes.
      
      Some advantages:
      
      The fastest part of the disks involved, the outer sectors, will be used.  The
      outer blocks of a disk may be as much as 100% faster than the inner blocks.
      
      Average seek time will be smaller, as seeks will always be confined to the
      first part of the disks.
      
      Mixed disks with different performance characteristics will work better, as
      they will work as raid0; the sequential read rate will be the number of disks
      involved times the IO rate of the slowest disk.
      
      If a disk is malfunctioning, the first working disk with the lowest block
      address for the logical block will be used.
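      A rough sketch of the selection rule, assuming the data structures of
      drivers/md/raid10.c (the real read_balance() also handles missing devices
      and the near/offset layouts):

        /* Sketch: for 'far' layouts, pick the working copy with the
         * lowest device address so reads stay in the outer band. */
        int best_slot = -1;
        sector_t best_addr = MaxSector;
        int slot;

        for (slot = 0; slot < conf->copies; slot++) {
                int disk = r10_bio->devs[slot].devnum;
                mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[disk].rdev);

                if (!rdev || test_bit(Faulty, &rdev->flags))
                        continue;
                if (r10_bio->devs[slot].addr < best_addr) {
                        best_addr = r10_bio->devs[slot].addr;
                        best_slot = slot;
                }
        }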
      Signed-off-by: Keld Simonsen <keld@dkuug.dk>
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ed3a195
    • md: lock access to rdev attributes properly · 27c529bb
      Committed by NeilBrown
      When we access attributes of an rdev (component device on an md array) through
      sysfs, we really need to lock the array against concurrent changes.  We
      currently do that when we change an attribute, but not when we read an
      attribute.  We need to lock when reading as well, otherwise rdev->mddev could become
      NULL while we are accessing it.
      
      So add appropriate locking (mddev_lock) to rdev_attr_show.
      
      rdev_size_store requires some extra care as well, since it needs to unlock the
      mddev while scanning other mddevs for overlapping regions.  We currently
      assume that rdev->mddev will still be unchanged after the scan, but that
      cannot be certain.  So take a copy of rdev->mddev for use at the end of the
      function.
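      A simplified sketch of the show path (condensed from what the patch does in
      drivers/md/md.c; error handling trimmed):

        static ssize_t
        rdev_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
        {
                struct rdev_sysfs_entry *entry =
                        container_of(attr, struct rdev_sysfs_entry, attr);
                mdk_rdev_t *rdev = container_of(kobj, mdk_rdev_t, kobj);
                mddev_t *mddev = rdev->mddev;
                ssize_t rv;

                if (!entry->show)
                        return -EIO;
                /* lock the array so rdev->mddev cannot become NULL under us */
                rv = mddev ? mddev_lock(mddev) : -EBUSY;
                if (!rv) {
                        rv = entry->show(rdev, page);
                        mddev_unlock(mddev);
                }
                return rv;
        }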
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27c529bb
    • md: make sure a reshape is started when device switches to read-write · 25156198
      Committed by NeilBrown
      A resync/reshape/recovery thread will refuse to progress when the array is
      marked read-only.  So whenever we mark the array not read-only, it is important
      to wake up the resync thread.  There is one place where we didn't do this.
      
      The problem manifests if the start_ro module parameter is set, and a raid5
      array that is in the middle of a reshape (restripe) is started.  The array
      will initially be semi-read-only (meaning it acts like it is readonly until
      the first write).  So the reshape will not proceed.
      
      On the first write, the array will become read-write, but the reshape will not
      be started, and there is no event which will ever restart that thread.
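      The missing piece amounts to something like this at the point where the
      semi-read-only state is cleared (a sketch, not the exact hunk):

        if (mddev->ro == 2) {
                mddev->ro = 0;   /* first write: leave semi-read-only mode */
                /* kick the recovery thread so a pending reshape resumes */
                set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
                md_wakeup_thread(mddev->thread);
        }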
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25156198
    • md: clean up irregularity with raid autodetect · d0fae18f
      Committed by NeilBrown
      When a raid1 array is stopped, all components currently get added to the list
      for auto-detection.  However we should really only add components that were
      found by autodetection in the first place.  So add a flag to record that
      information, and use it.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0fae18f
    • md: guard against possible bad array geometry in v1 metadata · a1801f85
      Committed by NeilBrown
      Make sure the data doesn't start before the end of the superblock when the
      superblock is at the start of the device.
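      In super_1_load() terms the guard is roughly the following (a sketch only;
      'sb_start' and 'sb_sectors' are illustrative names for the superblock's
      location and the space it occupies):

        /* Sketch: for v1.1/v1.2 layouts the superblock sits at the start
         * of the device, so the data area must begin after it. */
        if (le64_to_cpu(sb->data_offset) < sb_start + sb_sectors)
                return -EINVAL;   /* data would overlap the superblock */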
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1801f85
    • md: reduce CPU wastage on idle md array with a write-intent bitmap · 8311c29d
      Committed by NeilBrown
      On an md array with a write-intent bitmap, a thread wakes up every few seconds
      and scans the bitmap looking for work to do.  If the array is idle, there will
      be no work to do, but a lot of scanning is done to discover this.
      
      So cache the fact that the bitmap is completely clean, and avoid scanning the
      whole bitmap when the cache is known to be clean.
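      Conceptually the change looks like this (a sketch; the patch adds an
      'allclean' style flag to struct bitmap in drivers/md/bitmap.c):

        void bitmap_daemon_work(struct bitmap *bitmap)
        {
                /* Sketch: nothing was dirtied since the last pass, so
                 * skip the page-by-page scan entirely. */
                if (bitmap->allclean)
                        return;

                bitmap->allclean = 1;   /* assume clean; any dirty bit found
                                         * during the scan clears this again */
                /* ... existing scan of the bitmap pages ... */
        }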
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8311c29d
    • md: fix deadlock in md/raid1 and md/raid10 when handling a read error · a35e63ef
      Committed by NeilBrown
      When handling a read error, we freeze the array to stop any other IO while
      attempting to over-write with correct data.
      
      This is done in the raid1d(raid10d) thread and must wait for all submitted IO
      to complete (except for requests that failed and are sitting in the retry
      queue - these are counted in ->nr_queued and will stay there during a freeze).
      
      However write requests need attention from raid1d as bitmap updates might be
      required.  This can cause a deadlock as raid1 is waiting for requests to
      finish that themselves need attention from raid1d.
      
      So we create a new function 'flush_pending_writes' to give that attention, and
      call it in freeze_array to be sure that we aren't waiting on raid1d.
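      A simplified sketch of the raid1.c side (raid10.c gets the same treatment;
      field names follow conf_t in that era of the driver):

        static void freeze_array(conf_t *conf)
        {
                spin_lock_irq(&conf->resync_lock);
                conf->barrier++;
                conf->nr_waiting++;
                /* While waiting for pending IO to drain, keep pushing out
                 * queued bitmap-dependent writes so raid1d isn't needed. */
                wait_event_lock_irq(conf->wait_barrier,
                                    conf->nr_pending == conf->nr_queued + 1,
                                    conf->resync_lock,
                                    flush_pending_writes(conf));
                spin_unlock_irq(&conf->resync_lock);
        }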
      
      Thanks to "K.Tanaka" <k-tanaka@ce.jp.nec.com> for finding and reporting this
      problem.
      
      Cc: "K.Tanaka" <k-tanaka@ce.jp.nec.com>
      Signed-off-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a35e63ef
  8. 20 Feb 2008 (1 commit)
  9. 15 Feb 2008 (4 commits)
  10. 14 Feb 2008 (1 commit)
  11. 08 Feb 2008 (10 commits)
    • dm raid1: report fault status · af195ac8
      Committed by Jonathan Brassow
      This patch adds extra information to the mirror status output, so that
      it can be determined which device(s) have failed.  For each mirror device,
      a character is printed indicating the most severe error encountered.  The
      characters are:
       *    A => Alive - No failures
       *    D => Dead - A write failure occurred leaving mirror out-of-sync
       *    S => Sync - A synchronization failure occurred, mirror out-of-sync
       *    R => Read - A read failure occurred, mirror data unaffected
      This allows userspace to properly reconfigure the mirror set.
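      For example, a two-leg mirror whose second device saw a write failure would
      report something like the following (hypothetical devices, same format as
      the mirror status sample later in this log):

        isw_eeaaabgfg_mirror: 0 488390920 mirror 2 8:16 8:32 3727/3727 1 AD 1 core

      Here 'A' marks 8:16 as alive and 'D' marks 8:32 as having had a write
      failure that left the mirror out-of-sync.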
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      af195ac8
    • dm raid1: handle read failures · 06386bbf
      Committed by Jonathan Brassow
      This patch gives the ability to respond-to/record device failures
      that happen during read operations.  It also adds the ability to
      read from mirror devices that are not the primary if they are
      in-sync.
      
      There are essentially two read paths in mirroring; the direct path
      and the queued path.  When a read request is mapped, if the region
      is 'in-sync' the direct path is taken; otherwise the queued path
      is taken.
      
      If the direct path is taken, we must record bio information so that
      if the read fails we can retry it.  We then discover the status of
      a direct read through mirror_end_io.  If the read has failed, we will
      mark the device from which the read was attempted as failed (so we
      don't try to read from it again), restore the bio and try again.
      
      If the queued path is taken, we discover the results of the read
      from 'read_callback'.  If the device failed, we will mark the device
      as failed and attempt the read again if there is another device
      where this region is known to be 'in-sync'.
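      A stripped-down sketch of the direct-path handling described above (names
      approximate dm-raid1.c; the real code also restores the bio's sector and
      size before retrying):

        static int mirror_end_io(struct dm_target *ti, struct bio *bio,
                                 int error, union map_info *map_context)
        {
                struct mirror_set *ms = ti->private;
                struct mirror *m;

                if (bio_rw(bio) == READ && error == -EIO) {
                        /* remember which leg the direct read used */
                        m = bio_get_m(bio);
                        fail_mirror(m, DM_RAID1_READ_ERROR);

                        /* hand the bio back to kmirrord to retry it on
                         * another in-sync mirror */
                        queue_bio(ms, bio, bio_rw(bio));
                        return DM_ENDIO_INCOMPLETE;
                }

                return error;
        }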
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      06386bbf
    • dm raid1: fix EIO after log failure · b80aa7a0
      Committed by Jonathan Brassow
      This patch adds the ability to requeue write I/O to
      core device-mapper when there is a log device failure.
      
      If a write to the log produces an error, the pending writes are
      put on the "failures" list.  Since the log is marked as failed,
      they will stay on the failures list until a suspend happens.
      
      Suspends come in two phases, presuspend and postsuspend.  We must
      make sure that all the writes on the failures list are requeued
      in the presuspend phase (a requirement of dm core).  This means
      that recovery must be complete (because writes may be delayed
      behind it) and the failures list must be requeued before we
      return from presuspend.
      
      The mechanisms to ensure recovery is complete (or stopped) were
      already in place, but needed to be moved from postsuspend to
      presuspend.  We rely on 'flush_workqueue' to ensure that the
      mirror thread is complete and therefore, has requeued all writes
      in the failures list.
      
      Because we are using flush_workqueue, we must ensure that no
      additional 'queue_work' calls will produce additional I/O
      that we need to requeue (because once we return from
      presuspend, we are unable to do anything about it).  'queue_work'
      is called in response to the following functions:
      - complete_resync_work = NA, recovery is stopped
      - rh_dec (mirror_end_io) = NA, only calls 'queue_work' if it
                                 is ready to recover the region
                                 (recovery is stopped) or it needs
                                 to clear the region in the log*
                                 **this doesn't get called while
                                 suspending**
      - rh_recovery_end = NA, recovery is stopped
      - rh_recovery_start = NA, recovery is stopped
      - write_callback = 1) Writes w/o failures simply call
                         bio_endio -> mirror_end_io -> rh_dec
                         (see rh_dec above)
                         2) Writes with failures are put on
                         the failures list and queue_work is
                         called**
                         ** write_callbacks don't happen
                         during suspend **
      - do_failures = NA, 'queue_work' not called if suspending
      - add_mirror (initialization) = NA, only done on mirror creation
      - queue_bio = NA, 1) delayed I/O scheduled before flush_workqueue
                    is called.  2) No more I/Os are being issued.
                    3) Re-attempted READs can still be handled.
                    (Write completions are handled through rh_dec/
                    write_callback - mentioned above - and do not
                    use queue_bio.)
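      The resulting presuspend ordering looks roughly like this (a sketch; the
      real mirror_presuspend() also waits for recovery to quiesce before
      returning):

        static void mirror_presuspend(struct dm_target *ti)
        {
                struct mirror_set *ms = ti->private;

                atomic_set(&ms->suspend, 1);

                /* stop recovery first: failed writes may be delayed behind it */
                rh_stop_recovery(&ms->rh);

                /* drain the mirror thread; this requeues every bio on the
                 * 'failures' list back to dm core before presuspend returns */
                flush_workqueue(ms->kmirrord_wq);
        }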
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      b80aa7a0
    • dm raid1: handle recovery failures · 8f0205b7
      Committed by Jonathan Brassow
      This patch adds the calls to 'fail_mirror' if an error occurs during
      mirror recovery (aka resynchronization).  'fail_mirror' is responsible
      for recording the type of error by mirror device and ensuring an event
      gets raised for the purpose of notifying userspace.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      8f0205b7
    • dm raid1: handle write failures · 72f4b314
      Committed by Jonathan Brassow
      This patch gives mirror the ability to handle device failures
      during normal write operations.
      
      The 'write_callback' function is called when a write completes.
      If all the writes failed or succeeded, we report failure or
      success respectively.  If some of the writes failed, we call
      fail_mirror, which increments the error count for the device, notes
      the type of error encountered (DM_RAID1_WRITE_ERROR),  and
      selects a new primary (if necessary).  Note that the primary
      device can never change while the mirror is not in-sync (IOW,
      while recovery is happening.)  This means that the scenario
      where a failed write changes the primary and gives
      recovery_complete a chance to misread the primary never happens.
      The fact that the primary can change has necessitated the change
      to the default_mirror field.  We need to protect against reading
      garbage while the primary changes.  We then add the bio to a new
      list in the mirror set, 'failures'.  For every bio in the 'failures'
      list, we call a new function, '__bio_mark_nosync', where we mark
      the region 'not-in-sync' in the log and properly set the region
      state to RH_NOSYNC.  Userspace must also be notified of the
      failure.  This is done by 'raising an event' (dm_table_event()).
      If fail_mirror is called in process context the event can be raised
      right away.  If in interrupt context, the event is deferred to the
      kmirrord thread - which raises the event if 'event_waiting' is set.
      
      Backwards compatibility is maintained by ignoring errors if
      the DM_FEATURES_HANDLE_ERRORS flag is not present.
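      A condensed sketch of that write completion path (helper names are
      approximate; 'wake_kmirrord' stands in for however the mirror thread is
      actually woken, and the all-legs-failed case is elided):

        static void write_callback(unsigned long error, void *context)
        {
                struct bio *bio = (struct bio *) context;
                struct mirror_set *ms = bio_get_ms(bio);
                unsigned int i;

                if (!error) {                   /* every leg succeeded */
                        bio_endio(bio, 0);
                        return;
                }

                /* note the error on each failed leg; this may also choose
                 * a new primary when the mirror is in-sync */
                for (i = 0; i < ms->nr_mirrors; i++)
                        if (test_bit(i, &error))
                                fail_mirror(ms->mirror + i, DM_RAID1_WRITE_ERROR);

                /* at least one leg succeeded: defer to kmirrord so the
                 * region can be marked not-in-sync and an event raised */
                spin_lock_irq(&ms->lock);
                bio_list_add(&ms->failures, bio);
                spin_unlock_irq(&ms->lock);
                wake_kmirrord(ms);              /* hypothetical wrapper */
        }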
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      72f4b314
    • dm snapshot: combine consecutive exceptions in memory · d74f81f8
      Committed by Milan Broz
      Provided sector_t is 64 bits, reduce the in-memory footprint of the
      snapshot exception table by the simple method of using unused bits of
      the chunk number to combine consecutive entries.
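      The trick relies on sector_t (and hence the chunk number) being 64 bits
      wide, so the top bits can carry a run length. A sketch of the encoding,
      with illustrative constants (see dm-snap.h for the real definitions):

        #define DM_CHUNK_CONSECUTIVE_BITS 8
        #define DM_CHUNK_NUMBER_BITS      56

        /* low 56 bits: chunk number; high 8 bits: how many consecutive
         * chunks this exception also covers */
        static inline chunk_t dm_chunk_number(chunk_t chunk)
        {
                return chunk & (chunk_t)((1ULL << DM_CHUNK_NUMBER_BITS) - 1);
        }

        static inline unsigned dm_consecutive_chunk_count(struct dm_snap_exception *e)
        {
                return e->new_chunk >> DM_CHUNK_NUMBER_BITS;
        }

        static inline void dm_consecutive_chunk_count_inc(struct dm_snap_exception *e)
        {
                e->new_chunk += (1ULL << DM_CHUNK_NUMBER_BITS);
        }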
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      d74f81f8
    • dm: stripe enhanced status return · 4f7f5c67
      Committed by Brian Wood
      This patch adds additional information to the status line. It is added at the
      end of the returned text so it will not interfere with existing
      implementations using this data. The addition of this information will allow
      for a common return interface to match that returned with the dm-raid1.c
      status line (with Jonathan Brassow's patches).
      
      Here is a sample of what is returned with a mirror "status" call:
      isw_eeaaabgfg_mirror: 0 488390920 mirror 2 8:16 8:32 3727/3727 1 AA 1 core
      
      Here's what's returned with this patch for a stripe "status" call:
      isw_dheeijjdej_stripe: 0 976783872 striped 2 8:16 8:32 1 AA
      Signed-off-by: Brian Wood <brian.j.wood@intel.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      4f7f5c67
    • dm: stripe trigger event on failure · a25eb944
      Committed by Brian Wood
      This patch adds the stripe_end_io function to process errors that might
      occur after an IO operation. As part of this there are a number of
      enhancements made to record and trigger events:
      
      - New atomic variable in struct stripe to record the number of
      errors each stripe volume device has experienced (could be used
      later with uevents to report back directly to userspace)
      
      - New workqueue/work struct setup to process the trigger_event function
      
      - New end_io function. It is here that testing for BIO error conditions
      takes place. It determines the exact stripe that caused the error,
      records this in the new atomic variable, and calls the queue_work() function
      
      - New trigger_event function to process failure events. This
      calls dm_table_event()
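      A rough sketch of how those pieces fit together (simplified; 'which_stripe'
      is a hypothetical helper standing in for the device lookup the real code
      performs, and the workqueue details are reduced to schedule_work):

        static void trigger_event(struct work_struct *work)
        {
                struct stripe_c *sc = container_of(work, struct stripe_c, trigger_event);

                dm_table_event(sc->ti->table);          /* notify userspace */
        }

        static int stripe_end_io(struct dm_target *ti, struct bio *bio,
                                 int error, union map_info *map_context)
        {
                struct stripe_c *sc = ti->private;
                unsigned int i;

                if (!error)
                        return 0;

                i = which_stripe(sc, bio);              /* hypothetical helper */
                atomic_inc(&sc->stripe[i].error_count); /* per-device error count */
                schedule_work(&sc->trigger_event);
                return error;
        }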
      Signed-off-by: Brian Wood <brian.j.wood@intel.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      a25eb944
    • dm log: auto load modules · fb8b2848
      Committed by Jonathan Brassow
      If the log type is not recognised, attempt to load the module
      'dm-log-<type>.ko'.
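      The mechanism is the usual request_module() pattern, roughly (a sketch;
      '_get_type' stands for the existing lookup of registered log types):

        static struct dirty_log_type *get_type(const char *type_name)
        {
                struct dirty_log_type *type;

                type = _get_type(type_name);    /* already registered? */
                if (type)
                        return type;

                /* not registered: try to load dm-log-<type>.ko, then retry */
                if (request_module("dm-log-%s", type_name))
                        return NULL;

                return _get_type(type_name);
        }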
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      fb8b2848
    • dm: move deferred bio flushing to workqueue · 304f3f6a
      Committed by Milan Broz
      Add a single-thread workqueue for each mapped device
      and move flushing of the lists of pushback and deferred bios
      to this new workqueue.
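      Schematically, per mapped device (a sketch; names such as 'dm_wq_work',
      'flush_deferred_io' and the queue name are illustrative rather than the
      exact ones in dm.c):

        /* work function: flush pushback and deferred bios for this device */
        static void dm_wq_work(struct work_struct *work)
        {
                struct mapped_device *md =
                        container_of(work, struct mapped_device, work);

                flush_deferred_io(md);          /* illustrative name */
        }

        static int dm_create_wq(struct mapped_device *md)
        {
                INIT_WORK(&md->work, dm_wq_work);
                md->wq = create_singlethread_workqueue("kdmflush");
                return md->wq ? 0 : -ENOMEM;
        }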
      Signed-off-by: Milan Broz <mbroz@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      304f3f6a