1. 21 7月, 2008 6 次提交
    • N
      md: Protect access to mddev->disks list using RCU · 4b80991c
      NeilBrown 提交于
      All modifications and most access to the mddev->disks list are made
      under the reconfig_mutex lock.  However there are three places where
      the list is walked without any locking.  If a reconfig happens at this
      time, havoc (and oops) can ensue.
      
      So use RCU to protect these accesses:
        - wrap them in rcu_read_{,un}lock()
        - use list_for_each_entry_rcu
        - add to the list with list_add_rcu
        - delete from the list with list_del_rcu
        - delay the 'free' with call_rcu rather than schedule_work
      
      Note that export_rdev did a list_del_init on this list.  In almost all
      cases the entry was not in the list anymore so it was a no-op and so
      safe.  It is no longer safe as after list_del_rcu we may not touch
      the list_head.
      An audit shows that export_rdev is called:
        - after unbind_rdev_from_array, in which case the delete has
           already been done,
        - after bind_rdev_to_array fails, in which case the delete isn't needed.
        - before the device has been put on a list at all (e.g. in
            add_new_disk where reading the superblock fails).
        - and in autorun devices after a failure when the device is on a
            different list.
      
      So remove the list_del_init call from export_rdev, and add it back
      immediately before the called to export_rdev for that last case.
      
      Note also that ->same_set is sometimes used for lists other than
      mddev->list (e.g. candidates).  In these cases rcu is not needed.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      4b80991c
    • N
      md: only count actual openers as access which prevent a 'stop' · f2ea68cf
      NeilBrown 提交于
      Open isn't the only thing that increments ->active.  e.g. reading
      /proc/mdstat will increment it briefly.  So to avoid false positives
      in testing for concurrent access, introduce a new counter that counts
      just the number of times the md device it open.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      f2ea68cf
    • A
      md: Make mddev->array_size sector-based. · f233ea5c
      Andre Noll 提交于
      This patch renames the array_size field of struct mddev_s to array_sectors
      and converts all instances to use units of 512 byte sectors instead of 1k
      blocks.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      f233ea5c
    • A
      md: Make super_type->rdev_size_change() take sector-based sizes. · 15f4a5fd
      Andre Noll 提交于
      Also, change the type of the size parameter from unsigned long long to
      sector_t and rename it to num_sectors.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      15f4a5fd
    • A
      md: Fix check for overlapping devices. · d07bd3bc
      Andre Noll 提交于
      The checks in overlaps() expect all parameters either in block-based
      or sector-based quantities. However, its single caller passes two
      rdev->data_offset arguments as well as two rdev->size arguments, the
      former being sector counts while the latter are measured in 1K blocks.
      
      This could cause rdev_size_store() to accept an invalid size from user
      space. Fix it by passing only sector-based quantities to overlaps().
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d07bd3bc
    • N
      md: Tidy up rdev_size_store a bit: · d7027458
      Neil Brown 提交于
       - used strict_strtoull in place of simple_strtoull
       - use my_mddev in place of rdev->mddev (they have the same value)
      and more significantly,
       - don't adjust mddev->size to fit, rather reject changes which make
         rdev->size smaller than mddev->size
      
      Adjusting mddev->size is a hangover from bind_rdev_to_array which
      does a similar thing.  But it really is a better design to insist that
      mddev->size is set as required, then the rdev->sizes are set to allow
      for that.  The previous way invites confusion.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d7027458
  2. 11 7月, 2008 10 次提交
  3. 08 7月, 2008 7 次提交
  4. 01 7月, 2008 1 次提交
    • D
      md: resolve external metadata handling deadlock in md_allow_write · b5470dc5
      Dan Williams 提交于
      md_allow_write() marks the metadata dirty while holding mddev->lock and then
      waits for the write to complete.  For externally managed metadata this causes a
      deadlock as userspace needs to take the lock to communicate that the metadata
      update has completed.
      
      Change md_allow_write() in the 'external' case to start the 'mark active'
      operation and then return -EAGAIN.  The expected side effects while waiting for
      userspace to write 'active' to 'array_state' are holding off reshape (code
      currently handles -ENOMEM), cause some 'stripe_cache_size' change requests to
      fail, cause some GET_BITMAP_FILE ioctl requests to fall back to GFP_NOIO, and
      cause updates to 'raid_disks' to fail.  Except for 'stripe_cache_size' changes
      these failures can be mitigated by coordinating with mdmon.
      
      md_write_start() still prevents writes from occurring until the metadata
      handler has had a chance to take action as it unconditionally waits for
      MD_CHANGE_CLEAN to be cleared.
      
      [neilb@suse.de: return -EAGAIN, try GFP_NOIO]
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      b5470dc5
  5. 28 6月, 2008 13 次提交
    • C
      Support changing rdev size on running arrays. · 0cd17fec
      Chris Webb 提交于
      From: Chris Webb <chris@arachsys.com>
      
      Allow /sys/block/mdX/md/rdY/size to change on running arrays, moving the
      superblock if necessary for this metadata version. We prevent the available
      space from shrinking to less than the used size, and allow it to be set to zero
      to fill all the available space on the underlying device.
      Signed-off-by: NChris Webb <chris@arachsys.com>
      Signed-off-by: NNeil Brown <neilb@suse.de>
      0cd17fec
    • N
      Make sure all changes to md/dev-XX/state are notified · 52664732
      Neil Brown 提交于
      The important state change happens during an interrupt
      in md_error.  So just set a flag there and call sysfs_notify
      later in process context.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      52664732
    • N
      Make sure all changes to md/degraded are notified. · a99ac971
      Neil Brown 提交于
      When a device fails, when a spare is activated, when
      an array is reshaped, or when an array is started,
      the extent to which the array is degraded can change.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      a99ac971
    • N
      Make sure all changes to md/sync_action are notified. · 72a23c21
      Neil Brown 提交于
      When the 'resync' thread starts or stops, when we explicitly
      set sync_action, or when we determine that there is definitely nothing
      to do, we notify sync_action.
      
      To stop "sync_action" from occasionally showing the wrong value,
      we introduce a new flags - MD_RECOVERY_RECOVER - to say that a
      recovery is probably needed or happening, and we make sure
      that we set MD_RECOVERY_RUNNING before clearing MD_RECOVERY_NEEDED.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      72a23c21
    • N
      Make sure all changes to md/array_state are notified. · 0fd62b86
      Neil Brown 提交于
      Changes in md/array_state could be of interest to a monitoring
      program.  So make sure all changes trigger a notification.
      
      Exceptions:
         changing active_idle to active is not reported because it
            is frequent and not interesting.
         changing active to active_idle is only reported on arrays
            with externally managed metadata, as it is not interesting
            otherwise.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      0fd62b86
    • N
      Don't reject HOT_REMOVE_DISK request for an array that is not yet started. · c7d0c941
      Neil Brown 提交于
      There is really no need for this test here, and there are valid
      cases for selectively removing devices from an array that
      it not actually active.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      c7d0c941
    • N
      rationalise return value for ->hot_add_disk method. · 199050ea
      Neil Brown 提交于
      For all array types but linear, ->hot_add_disk returns 1 on
      success, 0 on failure.
      For linear, it returns 0 on success and -errno on failure.
      
      This doesn't cause a functional problem because the ->hot_add_disk
      function of linear is used quite differently to the others.
      However it is confusing.
      
      So convert all to return 0 for success or -errno on failure
      and fix call sites to match.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      199050ea
    • N
      Support adding a spare to a live md array with external metadata. · 6c2fce2e
      Neil Brown 提交于
      i.e. extend the 'md/dev-XXX/slot' attribute so that you can
      tell a device to fill an vacant slot in an and md array.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      6c2fce2e
    • N
      Enable setting of 'offset' and 'size' of a hot-added spare. · 8ed0a521
      Neil Brown 提交于
      offset_store and rdev_size_store allow control of the region of a
      device which is to be using in an md/raid array.
      They only allow these values to be set when an array is being assembled,
      as changing them on an active array could be dangerous.
      However when adding a spare device to an array, we might need to
      set the offset and size before starting recovery.  So allow
      these values to be set also if "->raid_disk < 0" which indicates that
      the device is still a spare.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      8ed0a521
    • N
      Don't try to make md arrays dirty if that is not meaningful. · 1a0fd497
      Neil Brown 提交于
      Arrays personalities such as 'raid0' and 'linear' have no redundancy,
      and so marking them as 'clean' or 'dirty' is not meaningful.
      So always allow write requests without requiring a superblock update.
      
      Such arrays types are detected by ->sync_request being NULL.  If it is
      not possible to send a sync request we don't need a 'dirty' flag because
      all a dirty flag does is trigger some sync_requests.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      1a0fd497
    • N
      Close race in md_probe · f48ed538
      Neil Brown 提交于
      There is a possible race in md_probe.  If two threads call md_probe
      for the same device, then one could exit (having checked that
      ->gendisk exists) before the other has called kobject_init_and_add,
      thus returning an incomplete kobj which will cause problems when
      we try to add children to it.
      
      So extend the range of protection of disks_mutex slightly to
      avoid this possibility.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      f48ed538
    • N
      Allow setting start point for requested check/repair · 5e96ee65
      Neil Brown 提交于
      This makes it possible to just resync a small part of an array.
      e.g. if a drive reports that it has questionable sectors,
      a 'repair' of just the region covering those sectors will
      cause them to be read and, if there is an error, re-written
      with correct data.
      Signed-off-by: NNeil Brown <neilb@suse.de>
      5e96ee65
    • N
      Fix error paths if md_probe fails. · 9bbbca3a
      Neil Brown 提交于
      md_probe can fail (e.g. alloc_disk could fail) without
      returning an error (as it alway returns NULL).
      So when we call mddev_find immediately afterwards, we need
      to check that md_probe actually succeeded.  This means checking
      that mdev->gendisk is non-NULL.
      
      cc: <stable@kernel.org>
      Cc: Dave Jones <davej@redhat.com>
      Signed-off-by: NNeil Brown <neilb@suse.de>
      9bbbca3a
  6. 07 6月, 2008 1 次提交
  7. 25 5月, 2008 2 次提交
    • N
      md: restart recovery cleanly after device failure. · dfc70645
      NeilBrown 提交于
      When we get any IO error during a recovery (rebuilding a spare), we abort
      the recovery and restart it.
      
      For RAID6 (and multi-drive RAID1) it may not be best to restart at the
      beginning: when multiple failures can be tolerated, the recovery may be
      able to continue and re-doing all that has already been done doesn't make
      sense.
      
      We already have the infrastructure to record where a recovery is up to
      and restart from there, but it is not being used properly.
      This is because:
        - We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
          which causes the recovery not be be checkpointed.
        - We remove spares and then re-added them which loses important state
          information.
      
      The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
      needed.  If there is an error, the relevant drive will be marked as
      Faulty, and that is enough to ensure correct handling of the error.  So we
      first remove MD_RECOVERY_ERR, changing some of the uses of it to
      MD_RECOVERY_INTR.
      
      Then we cause the attempt to remove a non-faulty device from an array to
      fail (unless recovery is impossible as the array is too degraded).  Then
      when remove_and_add_spares attempts to remove the devices on which
      recovery can continue, it will fail, they will remain in place, and
      recovery will continue on them as desired.
      
      Issue:  If we are halfway through rebuilding a spare and another drive
      fails, and a new spare is immediately available,  do we want to:
       1/ complete the current rebuild, then go back and rebuild the new spare or
       2/ restart the rebuild from the start and rebuild both devices in
          parallel.
      
      Both options can be argued for.  The code currently takes option 2 as
        a/ this requires least code change
        b/ this results in a minimally-degraded array in minimal time.
      
      Cc: "Eivind Sarto" <ivan@kasenna.com>
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dfc70645
    • B
      md: allow parallel resync of md-devices. · 90b08710
      Bernd Schubert 提交于
      In some configurations, a raid6 resync can be limited by CPU speed
      (Calculating P and Q and moving data) rather than by device speed.  In
      these cases there is nothing to be gained byt serialising resync of arrays
      that share a device, and doing the resync in parallel can provide benefit.
       So add a sysfs tunable to flag an array as being allowed to resync in
      parallel with other arrays that use (a different part of) the same device.
      Signed-off-by: NBernd Schubert <bs@q-leap.de>
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90b08710