1. 23 12月, 2011 6 次提交
    • N
      md/raid5: If there is a spare and a want_replacement device, start replacement. · 7bfec5f3
      NeilBrown 提交于
      When attempting to add a spare to a RAID[456] array, also consider
      adding it as a replacement for a want_replacement device.
      
      This requires that common md code attempt hot_add even when the array
      is not formally degraded.
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      7bfec5f3
    • N
      md: create externally visible flags for supporting hot-replace. · 2d78f8c4
      NeilBrown 提交于
      hot-replace is a feature being added to md which will allow a
      device to be replaced without removing it from the array first.
      
      With hot-replace a spare can be activated and recovery can start while
      the original device is still in place, thus allowing a transition from
      an unreliable device to a reliable device without leaving the array
      degraded during the transition.  It can also be use when the original
      device is still reliable but it not wanted for some reason.
      
      This will eventually be supported in RAID4/5/6 and RAID10.
      
      This patch adds a super-block flag to distinguish the replacement
      device.  If an old kernel sees this flag it will reject the device.
      
      It also adds two per-device flags which are viewable and settable via
      sysfs.
         "want_replacement" can be set to request that a device be replaced.
         "replacement" is set to show that this device is replacing another
         device.
      
      The "rd%d" links in /sys/block/mdXx/md only apply to the original
      device, not the replacement.  We currently don't make links for the
      replacement - there doesn't seem to be a need.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2d78f8c4
    • N
      md: change hot_remove_disk to take an rdev rather than a number. · b8321b68
      NeilBrown 提交于
      Soon an array will be able to have multiple devices with the
      same raid_disk number (an original and a replacement).  So removing
      a device based on the number won't work.  So pass the actual device
      handle instead.
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b8321b68
    • N
      md: remove test for duplicate device when setting slot number. · 476a7abb
      NeilBrown 提交于
      When setting the slot number on a device in an active array we
      currently check that the number is not already in use.
      We then call into the personality's hot_add_disk function
      which performs the same test and returns the same error.
      
      Thus the common test is not needed.
      
      As we will shortly be changing some personalities to allow duplicates
      in some cases (to support hot-replace), the common test will become
      inconvenient.
      
      So remove the common test.
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      476a7abb
    • N
      md: allow non-privileged uses to GET_*_INFO about raid arrays. · 506c9e44
      NeilBrown 提交于
      The info is already available in /proc/mdstat and /sys/block in
      an accessible form so there is no point in putting a road-block in
      the ioctl for information gathering.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      506c9e44
    • N
      md: don't give up looking for spares on first failure-to-add · 60fc1370
      NeilBrown 提交于
      Before performing a recovery we try to remove any spares that
      might not be working, then add any that might have become relevant.
      
      Currently we abort on the first spare that cannot be added.
      This is a false optimisation.
      It is conceivable that - depending on rules in the personality - a
      subsequent spare might be accepted.
      Also the loop does other things like count the available spares and
      reset the 'recovery_offset' value.
      
      If we abort early these might not happen properly.
      
      So remove the early abort.
      
      In particular if you have an array what is undergoing recovery and
      which has extra spares, then the recovery may not restart after as
      reboot as the could of 'spares' might end up as zero.
      Reported-by: NAnssi Hannula <anssi.hannula@iki.fi>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      60fc1370
  2. 08 12月, 2011 4 次提交
    • N
      md: ensure new badblocks are handled promptly. · 8bd2f0a0
      NeilBrown 提交于
      When we mark blocks as bad we need them to be acknowledged by the
      metadata handler promptly.
      
      For an in-kernel metadata handler that was already being done.  But
      for an external metadata handler we need to alert it of the change by
      sending a notification through the sysfs file.  This adds that
      notification.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      8bd2f0a0
    • N
      md: bad blocks shouldn't cause a Blocked status on a Faulty device. · 52c64152
      NeilBrown 提交于
      Once a device is marked Faulty the badblocks - whether acknowledged or
      not - become irrelevant.  So they shouldn't cause the device to be
      marked as Blocked.
      
      Without this patch, a process might write "-blocked" to clear the
      Blocked status, but while that will correctly fail the device, it
      won't remove the apparent 'blocked' status.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      52c64152
    • N
      md: take a reference to mddev during sysfs access. · af8a2434
      NeilBrown 提交于
      
      When we are accessing an mddev via sysfs we know that the
      mddev cannot disappear because it has an embedded kobj which
      is refcounted by sysfs.
      And we also take the mddev_lock.
      However this is not enough.
      
      The final mddev_put could have been called and the
      mddev_delayed_delete is waiting for sysfs to let go so it can destroy
      the kobj and mddev.
      In this state there are a lot of changes that should not be attempted.
      
      To to guard against this we:
       - initialise mddev->all_mddevs in on last put so the state can be
         easily detected.
       - in md_attr_show and md_attr_store, check ->all_mddevs under
         all_mddevs_lock and mddev_get the mddev if it still appears to
         be active.
      
      This means that if we get to sysfs as the mddev is being deleted we
      will get -EBUSY.
      
      rdev_attr_store and rdev_attr_show are similar but already have
      sufficient protection.  They check that rdev->mddev still points to
      mddev after taking mddev_lock.  As this is cleared  before delayed
      removal which can only be requested under the mddev_lock, this
      ensure the rdev and mddev are still alive.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      af8a2434
    • N
      md: refine interpretation of "hold_active == UNTIL_IOCTL". · 1d23f178
      NeilBrown 提交于
      We like md devices to disappear when they really are not needed.
      However it is not possible to tell from the current state whether it
      is needed or not.  We can only tell from recent history of changes.
      
      In particular immediately after we create an md device it looks very
      similar to immediately after we have finished with it.
      
      So we always preserve a newly created md device until something
      significant happens.  This state is stored in 'hold_active'.
      
      The normal case is to keep it until an ioctl happens, as that will
      normally either activate it, or explicitly de-activate it.  If it
      doesn't then it was probably created by mistake and it is now time to
      get rid of it.
      
      We can also modify an array via sysfs (instead of via ioctl) and we
      currently treat any change via sysfs like an ioctl as a sign that if
      it now isn't more active, it should be destroyed.
      However this is not appropriate as changes made via sysfs are more
      gradual so we should look for a more definitive change.
      
      So this patch only clears 'hold_active' from UNTIL_IOCTL to clear when
      the array_state is changed via sysfs.  Other changes via sysfs
      are ignored.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      1d23f178
  3. 01 11月, 2011 1 次提交
  4. 19 10月, 2011 1 次提交
  5. 18 10月, 2011 2 次提交
  6. 11 10月, 2011 4 次提交
  7. 07 10月, 2011 1 次提交
  8. 23 9月, 2011 1 次提交
    • D
      md: don't delay reboot by 1 second if no MD devices exist · 2dba6a91
      Daniel P. Berrange 提交于
      The md_notify_reboot() method includes a call to mdelay(1000),
      to deal with "exotic SCSI devices" which are too volatile on
      reboot. The delay is unconditional. Even if the machine does
      not have any block devices, let alone MD devices, the kernel
      shutdown sequence is slowed down.
      
      1 second does not matter much with physical hardware, but with
      certain virtualization use cases any wasted time in the bootup
      & shutdown sequence counts for alot.
      
      * drivers/md/md.c: md_notify_reboot() - only impose a delay if
        there was at least one MD device to be stopped during reboot
      Signed-off-by: NDaniel P. Berrange <berrange@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2dba6a91
  9. 21 9月, 2011 1 次提交
    • N
      md: Avoid waking up a thread after it has been freed. · 01f96c0a
      NeilBrown 提交于
      Two related problems:
      
      1/ some error paths call "md_unregister_thread(mddev->thread)"
         without subsequently clearing ->thread.  A subsequent call
         to mddev_unlock will try to wake the thread, and crash.
      
      2/ Most calls to md_wakeup_thread are protected against the thread
         disappeared either by:
            - holding the ->mutex
            - having an active request, so something else must be keeping
              the array active.
         However mddev_unlock calls md_wakeup_thread after dropping the
         mutex and without any certainty of an active request, so the
         ->thread could theoretically disappear.
         So we need a spinlock to provide some protections.
      
      So change md_unregister_thread to take a pointer to the thread
      pointer, and ensure that it always does the required locking, and
      clears the pointer properly.
      Reported-by: N"Moshe Melnikov" <moshe@zadarastorage.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      cc: stable@kernel.org
      01f96c0a
  10. 12 9月, 2011 1 次提交
  11. 10 9月, 2011 1 次提交
    • N
      md: Fix handling for devices from 2TB to 4TB in 0.90 metadata. · 27a7b260
      NeilBrown 提交于
      0.90 metadata uses an unsigned 32bit number to count the number of
      kilobytes used from each device.
      This should allow up to 4TB per device.
      However we multiply this by 2 (to get sectors) before casting to a
      larger type, so sizes above 2TB get truncated.
      
      Also we allow rdev->sectors to be larger than 4TB, so it is possible
      for the array to be resized larger than the metadata can handle.
      So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in
      used.
      
      Also the sanity check at the end of super_90_load should include level
      1 as it used ->size too. (RAID0 and Linear don't use ->size at all).
      Reported-by: NPim Zandbergen <P.Zandbergen@macroscoop.nl>
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      27a7b260
  12. 30 8月, 2011 1 次提交
  13. 25 8月, 2011 3 次提交
  14. 28 7月, 2011 9 次提交
    • N
      md/raid10 record bad blocks as needed during recovery. · e875ecea
      NeilBrown 提交于
      When recovering one or more devices, if all the good devices have
      bad blocks we should record a bad block on the device being rebuilt.
      
      If this fails, we need to abort the recovery.
      
      To ensure we don't think that we aborted later than we actually did,
      we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync,
      in particular before mddev->curr_resync is updated.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      e875ecea
    • N
      md: make it easier to wait for bad blocks to be acknowledged. · de393cde
      NeilBrown 提交于
      It is only safe to choose not to write to a bad block if that bad
      block is safely recorded in metadata - i.e. if it has been
      'acknowledged'.
      
      If it hasn't we need to wait for the acknowledgement.
      
      We support that using rdev->blocked wait and
      md_wait_for_blocked_rdev by introducing a new device flag
      'BlockedBadBlock'.
      
      This flag is only advisory.
      It is cleared whenever we acknowledge a bad block, so that a waiter
      can re-check the particular bad blocks that it is interested it.
      
      It should be set by a caller when they find they need to wait.
      This (set after test) is inherently racy, but as
      md_wait_for_blocked_rdev already has a timeout, losing the race will
      have minimal impact.
      
      When we clear "Blocked" was also clear "BlockedBadBlocks" incase it
      was set incorrectly (see above race).
      
      We also modify the way we manage 'Blocked' to fit better with the new
      handling of 'BlockedBadBlocks' and to make it consistent between
      externally managed and internally managed metadata.   This requires
      that each raidXd loop checks if the metadata needs to be written and
      triggers a write (md_check_recovery) if needed.  Otherwise a queued
      write request might cause raidXd to wait for the metadata to write,
      and only that thread can write it.
      
      Before writing metadata, we set FaultRecorded for all devices that
      are Faulty, then after writing the metadata we clear Blocked for any
      device for which the Fault was certainly Recorded.
      
      The 'faulty' device flag now appears in sysfs if the device is faulty
      *or* it has unacknowledged bad blocks.  So user-space which does not
      understand bad blocks can continue to function correctly.
      User space which does, should not assume a device is faulty until it
      sees the 'faulty' flag, and then sees the list of unacknowledged bad
      blocks is empty.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      de393cde
    • N
      md: add 'write_error' flag to component devices. · d7a9d443
      NeilBrown 提交于
      If a device has ever seen a write error, we will want to handle
      known-bad-blocks differently.
      So create an appropriate state flag and export it via sysfs.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
      d7a9d443
    • N
      md/raid1: avoid reading from known bad blocks. · d2eb35ac
      NeilBrown 提交于
      Now that we have a bad block list, we should not read from those
      blocks.
      There are several main parts to this:
        1/ read_balance needs to check for bad blocks, and return not only
           the chosen device, but also how many good blocks are available
           there.
        2/ fix_read_error needs to avoid trying to read from bad blocks.
        3/ read submission must be ready to issue multiple reads to
           different devices as different bad blocks on different devices
           could mean that a single large read cannot be served by any one
           device, but can still be served by the array.
           This requires keeping count of the number of outstanding requests
           per bio.  This count is stored in 'bi_phys_segments'
        4/ retrying a read needs to also be ready to submit a smaller read
           and queue another request for the rest.
      
      This does not yet handle bad blocks when reading to perform resync,
      recovery, or check.
      
      'md_trim_bio' will also be used for RAID10, so put it in md.c and
      export it.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d2eb35ac
    • N
      md: Disable bad blocks and v0.90 metadata. · 9f2f3830
      NeilBrown 提交于
      v0.90 metadata cannot record bad blocks, so when loading metadata
      for such a device, set shift to -1.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      9f2f3830
    • N
      md: load/store badblock list from v1.x metadata · 2699b672
      NeilBrown 提交于
      Space must have been allocated when array was created.
      A feature flag is set when the badblock list is non-empty, to
      ensure old kernels don't load and trust the whole device.
      
      We only update the on-disk badblocklist when it has changed.
      If the badblocklist (or other metadata) is stored on a bad block, we
      don't cope very well.
      
      If metadata has no room for bad block, flag bad-blocks as disabled,
      and do the same for 0.90 metadata.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2699b672
    • N
      md/bad-block-log: add sysfs interface for accessing bad-block-log. · 16c791a5
      NeilBrown 提交于
      This can show the log (providing it fits in one page) and
      allows bad blocks to be 'acknowledged' meaning that they
      have safely been recorded in metadata.
      
      Clearing bad blocks is not allowed via sysfs (except for
      code testing).  A bad block can only be cleared when
      a write to the block succeeds.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
      16c791a5
    • N
      md: beginnings of bad block management. · 2230dfe4
      NeilBrown 提交于
      This the first step in allowing md to track bad-blocks per-device so
      that we can fail individual blocks rather than the whole device.
      
      This patch just adds a data structure for recording bad blocks, with
      routines to add, remove, search the list.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
      2230dfe4
    • N
      md: remove suspicious size_of() · a519b26d
      NeilBrown 提交于
      When calling bioset_create we pass the size of the front_pad as
         sizeof(mddev)
      which looks suspicious as mddev is a pointer and so it looks like a
      common mistake where
         sizeof(*mddev)
      was intended.
      The size is actually correct as we want to store a pointer in the
      front padding of the bios created by the bioset, so make the intent
      more explicit by using
         sizeof(mddev_t *)
      Reported-by: NZdenek Kabelac <zdenek.kabelac@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a519b26d
  15. 27 7月, 2011 4 次提交