1. 31 3月, 2009 12 次提交
    • D
      md: centralize ->array_sectors modifications · 1f403624
      Dan Williams 提交于
      Get personalities out of the business of directly modifying
      ->array_sectors.  Lays groundwork to introduce policy on when
      ->array_sectors can be modified.
      Reviewed-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      1f403624
    • D
      md: add 'size' as a personality method · 80c3a6ce
      Dan Williams 提交于
      In preparation for giving userspace control over ->array_sectors we need
      to be able to retrieve the 'default' size, and the 'anticipated' size
      when a reshape is requested.  For personalities that do not reshape emit
      a warning if anything but the default size is requested.
      
      In the raid5 case we need to update ->previous_raid_disks to make the
      new 'default' size available.
      Reviewed-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      80c3a6ce
    • N
      md: add ->takeover method to support changing the personality managing an array · 245f46c2
      NeilBrown 提交于
      Implement this for RAID6 to be able to 'takeover' a RAID5 array.  The
      new RAID6 will use a layout which places Q on the last device, and
      that device will be missing.
      If there are any available spares, one will immediately have Q
      recovered onto it.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      245f46c2
    • N
      md: enable suspend/resume of md devices. · 409c57f3
      NeilBrown 提交于
      To be able to change the 'level' of an md/raid array, we need to
      suspend the device so that no requests are active - then move some
      pointers around etc.
      
      The code already keeps counts of active requests and the ->quiesce
      function can be used to wait until those counts hit zero.
      However the quiesce function blocks new requests once they are all
      ready 'inside' the personality module, and that is too late if we want
      to replace the personality modules.
      
      So make all md requests come in through a common md_make_request
      function that keeps track of how many requests have entered the
      modules but may not yet be on the internal reference counts.
      Allow md_make_request to be blocked when we want to suspend the
      device, and make it possible to wait for all those in-transit requests
      to be added to internal lists so that ->quiesce can wait for them.
      
      There is still a problem that when a request completes, we drop the
      ref count inside the personality code so there is a short time between
      when the refcount hits zero, and when the personality code is no
      longer being used.
      The personality code never blocks (schedule or spinlock) between
      dropping the refcount and exiting the routine, so this should be safe
      (as put_module calls synchronize_sched() before unmapping the module
      code).
      Signed-off-by: NNeilBrown <neilb@suse.de>
      409c57f3
    • A
      md: Represent raid device size in sectors. · dd8ac336
      Andre Noll 提交于
      This patch renames the "size" field of struct mdk_rdev_s to
      "sectors" and changes this field to store sectors instead of
      blocks.
      
      All users of this field, linear.c, raid0.c and md.c, are fixed up
      accordingly which gets rid of many multiplications and divisions.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      dd8ac336
    • A
      md: Make mddev->size sector-based. · 58c0fed4
      Andre Noll 提交于
      This patch renames the "size" field of struct mddev_s to "dev_sectors"
      and stores the number of 512-byte sectors instead of the number of
      1K-blocks in it.
      
      All users of that field, including raid levels 1,4-6,10, are adjusted
      accordingly. This simplifies the code a bit because it allows to get
      rid of a couple of divisions/multiplications by two.
      
      In order to make checkpatch happy, some minor coding style issues
      have also been addressed. In particular, size_store() now uses
      strict_strtoull() instead of simple_strtoull().
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      58c0fed4
    • N
      md: occasionally checkpoint drive recovery to reduce duplicate effort after a crash · 97e4f42d
      NeilBrown 提交于
      Version 1.x metadata has the ability to record the status of a
      partially completed drive recovery.
      However we only update that record on a clean shutdown.
      It would be nice to update it on unclean shutdowns too, particularly
      when using a bitmap that removes much to the 'sync' effort after an
      unclean shutdown.
      
      One complication with checkpointing recovery is that we only know
      where we are up to in terms of IO requests started, not which ones
      have completed.  And we need to know what has completed to record
      how much is recovered.  So occasionally pause the recovery until all
      submitted requests are completed, then update the record of where
      we are up to.
      
      When we have a bitmap, we already do that pause occasionally to keep
      the bitmap up-to-date.  So enhance that code to record the recovery
      offset and schedule a superblock update.
      And when there is no bitmap, just pause 16 times during the resync to
      do a checkpoint.
      '16' is a fairly arbitrary number.  But we don't really have any good
      way to judge how often is acceptable, and it seems like a reasonable
      number for now.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      97e4f42d
    • N
      md: move md_k.h from include/linux/raid/ to drivers/md/ · 43b2e5d8
      NeilBrown 提交于
      It really is nicer to keep related code together..
      Signed-off-by: NNeilBrown <neilb@suse.de>
      43b2e5d8
    • N
      md: move lots of #include lines out of .h files and into .c · bff61975
      NeilBrown 提交于
      This makes the includes more explicit, and is preparation for moving
      md_k.h to drivers/md/md.h
      
      Remove include/raid/md.h as its only remaining use was to #include
      other files.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      bff61975
    • N
      md: move most content from md.h to md_k.h · 92022950
      NeilBrown 提交于
      The extern function definitions are kernel-internal definitions, so
      they belong in md_k.h
      
      The MD_*_VERSION values could reasonably go in a number of places,
      but md_u.h seems most reasonable.
      
      This leaves almost nothing in md.h.  It will go soon.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      92022950
    • N
      md: move LEVEL_* definition from md_k.h to md_u.h · 8b2b5c21
      NeilBrown 提交于
      .. as they are part of the user-space interface.
      Also move MdpMinorShift into there so we can remove duplication.
      
      Lastly move mdp_major in.  It is less obviously part of the user-space
      interface, but do_mounts_md.c uses it, and it is acting a bit like
      user-space.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      8b2b5c21
    • N
      md: Fix is_mddev_idle test (again). · eea1bf38
      NeilBrown 提交于
      There are two problems with is_mddev_idle.
      
      1/ sync_io is 'atomic_t' and hence 'int'.  curr_events and all the
         rest are 'long'.
         So if sync_io were to wrap on a 64bit host, the value of
         curr_events would go very negative suddenly, and take a very
         long time to return to positive.
      
         So do all calculations as 'int'.  That gives us plenty of precision
         for what we need.
      
      2/ To initialise rdev->last_events we simply call is_mddev_idle, on
         the assumption that it will make sure that last_events is in a
         suitable range.  It used to do this, but now it does not.
         So now we need to be more explicit about initialisation.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      eea1bf38
  2. 09 1月, 2009 5 次提交
    • N
      md: don't retry recovery of raid1 that fails due to error on source drive. · 4044ba58
      NeilBrown 提交于
      If a raid1 has only one working drive and it has a sector which
      gives an error on read, then an attempt to recover onto a spare will
      fail, but as the single remaining drive is not removed from the
      array, the recovery will be immediately re-attempted, resulting
      in an infinite recovery loop.
      
      So detect this situation and don't retry recovery once an error
      on the lone remaining drive is detected.
      
      Allow recovery to be retried once every time a spare is added
      in case the problem wasn't actually a media error.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      4044ba58
    • N
      md: Allow md devices to be created by name. · efeb53c0
      NeilBrown 提交于
      Using sequential numbers to identify md devices is somewhat artificial.
      Using names can be a lot more user-friendly.
      
      Also, creating md devices by opening the device special file is a bit
      awkward.
      
      So this patch provides a new option for creating and naming devices.
      
      Writing a name such as "md_home" to
          /sys/modules/md_mod/parameters/new_array
      will cause an array with that name to be created.  It will appear in
      /sys/block/ /proc/partitions and /proc/mdstat as 'md_home'.
      It will have an arbitrary minor number allocated.
      
      md devices that a created by an open are destroyed on the last
      close when the device is inactive.
      For named md devices, they will not be destroyed until the array
      is explicitly stopped, either with the STOP_ARRAY ioctl or by
      writing 'clear' to /sys/block/md_XXXX/md/array_state.
      
      The name of the array must start 'md_' to avoid conflict with
      other devices.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      efeb53c0
    • N
      md: make devices disappear when they are no longer needed. · d3374825
      NeilBrown 提交于
      Currently md devices, once created, never disappear until the module
      is unloaded.  This is essentially because the gendisk holds a
      reference to the mddev, and the mddev holds a reference to the
      gendisk, this a circular reference.
      
      If we drop the reference from mddev to gendisk, then we need to ensure
      that the mddev is destroyed when the gendisk is destroyed.  However it
      is not possible to hook into the gendisk destruction process to enable
      this.
      
      So we drop the reference from the gendisk to the mddev and destroy the
      gendisk when the mddev gets destroyed.  However this has a
      complication.
      Between the call
         __blkdev_get->get_gendisk->kobj_lookup->md_probe
      and the call
         __blkdev_get->md_open
      
      there is no obvious way to hold a reference on the mddev any more, so
      unless something is done, it will disappear and gendisk will be
      destroyed prematurely.
      
      Also, once we decide to destroy the mddev, there will be an unlockable
      moment before the gendisk is unlinked (blk_unregister_region) during
      which a new reference to the gendisk can be created.  We need to
      ensure that this reference can not be used.  i.e. the ->open must
      fail.
      
      So:
       1/  in md_probe we set a flag in the mddev (hold_active) which
           indicates that the array should be treated as active, even
           though there are no references, and no appearance of activity.
           This is cleared by md_release when the device is closed if it
           is no longer needed.
           This ensures that the gendisk will survive between md_probe and
           md_open.
      
       2/  In md_open we check if the mddev we expect to open matches
           the gendisk that we did open.
           If there is a mismatch we return -ERESTARTSYS and modify
           __blkdev_get to retry from the top in that case.
           In the -ERESTARTSYS sys case we make sure to wait until
           the old gendisk (that we succeeded in opening) is really gone so
           we loop at most once.
      
      Some udev configurations will always open an md device when it first
      appears.   If we allow an md device that was just created by an open
      to disappear on an immediate close, then this can race with such udev
      configurations and result in an infinite loop the device being opened
      and closed, then re-open due to the 'ADD' even from the first open,
      and then close and so on.
      So we make sure an md device, once created by an open, remains active
      at least until some md 'ioctl' has been made on it.  This means that
      all normal usage of md devices will allow them to disappear promptly
      when not needed, but the worst that an incorrect usage will do it
      cause an inactive md device to be left in existence (it can easily be
      removed).
      
      As an array can be stopped by writing to a sysfs attribute
        echo clear > /sys/block/mdXXX/md/array_state
      we need to use scheduled work for deleting the gendisk and other
      kobjects.  This allows us to wait for any pending gendisk deletion to
      complete by simply calling flush_scheduled_work().
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d3374825
    • C
      md: use list_for_each_entry macro directly · 159ec1fc
      Cheng Renquan 提交于
      The rdev_for_each macro defined in <linux/raid/md_k.h> is identical to
      list_for_each_entry_safe, from <linux/list.h>, it should be defined to
      use list_for_each_entry_safe, instead of reinventing the wheel.
      
      But some calls to each_entry_safe don't really need a safe version,
      just a direct list_for_each_entry is enough, this could save a temp
      variable (tmp) in every function that used rdev_for_each.
      
      In this patch, most rdev_for_each loops are replaced by list_for_each_entry,
      totally save many tmp vars; and only in the other situations that will call
      list_del to delete an entry, the safe version is used.
      Signed-off-by: NCheng Renquan <crquan@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      159ec1fc
    • N
      md: use sysfs_notify_dirent to notify changes to md/sync_action. · 0c3573f1
      NeilBrown 提交于
      There is no compelling need for this, but sysfs_notify_dirent is a
      nicer interface and the change is good for consistency.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      0c3573f1
  3. 21 10月, 2008 2 次提交
  4. 24 7月, 2008 1 次提交
  5. 21 7月, 2008 3 次提交
    • N
      md: Protect access to mddev->disks list using RCU · 4b80991c
      NeilBrown 提交于
      All modifications and most access to the mddev->disks list are made
      under the reconfig_mutex lock.  However there are three places where
      the list is walked without any locking.  If a reconfig happens at this
      time, havoc (and oops) can ensue.
      
      So use RCU to protect these accesses:
        - wrap them in rcu_read_{,un}lock()
        - use list_for_each_entry_rcu
        - add to the list with list_add_rcu
        - delete from the list with list_del_rcu
        - delay the 'free' with call_rcu rather than schedule_work
      
      Note that export_rdev did a list_del_init on this list.  In almost all
      cases the entry was not in the list anymore so it was a no-op and so
      safe.  It is no longer safe as after list_del_rcu we may not touch
      the list_head.
      An audit shows that export_rdev is called:
        - after unbind_rdev_from_array, in which case the delete has
           already been done,
        - after bind_rdev_to_array fails, in which case the delete isn't needed.
        - before the device has been put on a list at all (e.g. in
            add_new_disk where reading the superblock fails).
        - and in autorun devices after a failure when the device is on a
            different list.
      
      So remove the list_del_init call from export_rdev, and add it back
      immediately before the called to export_rdev for that last case.
      
      Note also that ->same_set is sometimes used for lists other than
      mddev->list (e.g. candidates).  In these cases rcu is not needed.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      4b80991c
    • N
      md: only count actual openers as access which prevent a 'stop' · f2ea68cf
      NeilBrown 提交于
      Open isn't the only thing that increments ->active.  e.g. reading
      /proc/mdstat will increment it briefly.  So to avoid false positives
      in testing for concurrent access, introduce a new counter that counts
      just the number of times the md device it open.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      f2ea68cf
    • A
      md: Make mddev->array_size sector-based. · f233ea5c
      Andre Noll 提交于
      This patch renames the array_size field of struct mddev_s to array_sectors
      and converts all instances to use units of 512 byte sectors instead of 1k
      blocks.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      f233ea5c
  6. 11 7月, 2008 1 次提交
  7. 28 6月, 2008 3 次提交
  8. 25 5月, 2008 2 次提交
    • N
      md: restart recovery cleanly after device failure. · dfc70645
      NeilBrown 提交于
      When we get any IO error during a recovery (rebuilding a spare), we abort
      the recovery and restart it.
      
      For RAID6 (and multi-drive RAID1) it may not be best to restart at the
      beginning: when multiple failures can be tolerated, the recovery may be
      able to continue and re-doing all that has already been done doesn't make
      sense.
      
      We already have the infrastructure to record where a recovery is up to
      and restart from there, but it is not being used properly.
      This is because:
        - We sometimes abort with MD_RECOVERY_ERR rather than just MD_RECOVERY_INTR,
          which causes the recovery not be be checkpointed.
        - We remove spares and then re-added them which loses important state
          information.
      
      The distinction between MD_RECOVERY_ERR and MD_RECOVERY_INTR really isn't
      needed.  If there is an error, the relevant drive will be marked as
      Faulty, and that is enough to ensure correct handling of the error.  So we
      first remove MD_RECOVERY_ERR, changing some of the uses of it to
      MD_RECOVERY_INTR.
      
      Then we cause the attempt to remove a non-faulty device from an array to
      fail (unless recovery is impossible as the array is too degraded).  Then
      when remove_and_add_spares attempts to remove the devices on which
      recovery can continue, it will fail, they will remain in place, and
      recovery will continue on them as desired.
      
      Issue:  If we are halfway through rebuilding a spare and another drive
      fails, and a new spare is immediately available,  do we want to:
       1/ complete the current rebuild, then go back and rebuild the new spare or
       2/ restart the rebuild from the start and rebuild both devices in
          parallel.
      
      Both options can be argued for.  The code currently takes option 2 as
        a/ this requires least code change
        b/ this results in a minimally-degraded array in minimal time.
      
      Cc: "Eivind Sarto" <ivan@kasenna.com>
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dfc70645
    • B
      md: allow parallel resync of md-devices. · 90b08710
      Bernd Schubert 提交于
      In some configurations, a raid6 resync can be limited by CPU speed
      (Calculating P and Q and moving data) rather than by device speed.  In
      these cases there is nothing to be gained byt serialising resync of arrays
      that share a device, and doing the resync in parallel can provide benefit.
       So add a sysfs tunable to flag an array as being allowed to resync in
      parallel with other arrays that use (a different part of) the same device.
      Signed-off-by: NBernd Schubert <bs@q-leap.de>
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90b08710
  9. 30 4月, 2008 1 次提交
  10. 05 3月, 2008 1 次提交
  11. 07 2月, 2008 5 次提交
  12. 24 7月, 2007 1 次提交
  13. 18 7月, 2007 1 次提交
  14. 10 5月, 2007 2 次提交