1. 23 Dec, 2011 1 commit
    • md: don't give up looking for spares on first failure-to-add · 60fc1370
      Authored by NeilBrown
      Before performing a recovery we try to remove any spares that
      might not be working, then add any that might have become relevant.
      
      Currently we abort on the first spare that cannot be added.
      This is a false optimisation.
      It is conceivable that - depending on rules in the personality - a
      subsequent spare might be accepted.
      Also the loop does other things like count the available spares and
      reset the 'recovery_offset' value.
      
      If we abort early these might not happen properly.
      
      So remove the early abort.
      
      In particular if you have an array that is undergoing recovery and
      which has extra spares, then the recovery may not restart after a
      reboot as the count of 'spares' might end up as zero.
      Reported-by: Anssi Hannula <anssi.hannula@iki.fi>
      Signed-off-by: NeilBrown <neilb@suse.de>
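A minimal C sketch of the loop shape that 60fc1370 describes. `struct spare`, `try_add()` and `add_spares()` are hypothetical stand-ins for illustration, not the kernel's rdev handling or the personality's hot-add hook. The point is only that a failed add does not end the loop, so later candidates are still tried and still counted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a component device; not the kernel's struct. */
struct spare {
    bool acceptable;   /* would the personality accept this device? */
    bool in_array;     /* set once the device has been added */
};

/* Hypothetical stand-in for the personality's hot-add hook: 0 on success. */
static int try_add(struct spare *s)
{
    if (!s->acceptable)
        return -1;
    s->in_array = true;
    return 0;
}

/*
 * Visit every candidate, even after a failed add, so the returned
 * count covers the whole list rather than only the entries before
 * the first failure.
 */
static int add_spares(struct spare *list, size_t n)
{
    int spares = 0;

    for (size_t i = 0; i < n; i++) {
        if (try_add(&list[i]) == 0)
            spares++;
        /* no early "break" on failure: a later device may be accepted */
    }
    return spares;
}
```

With an early abort, a list whose first entry is rejected would report zero spares even when usable devices follow it.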
  2. 08 Dec, 2011 4 commits
    • md: ensure new badblocks are handled promptly. · 8bd2f0a0
      Authored by NeilBrown
      When we mark blocks as bad we need them to be acknowledged by the
      metadata handler promptly.
      
      For an in-kernel metadata handler that was already being done.  But
      for an external metadata handler we need to alert it of the change by
      sending a notification through the sysfs file.  This adds that
      notification.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: bad blocks shouldn't cause a Blocked status on a Faulty device. · 52c64152
      Authored by NeilBrown
      Once a device is marked Faulty the badblocks - whether acknowledged or
      not - become irrelevant.  So they shouldn't cause the device to be
      marked as Blocked.
      
      Without this patch, a process might write "-blocked" to clear the
      Blocked status, but while that will correctly fail the device, it
      won't remove the apparent 'blocked' status.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: take a reference to mddev during sysfs access. · af8a2434
      Authored by NeilBrown
      
      When we are accessing an mddev via sysfs we know that the
      mddev cannot disappear because it has an embedded kobj which
      is refcounted by sysfs.
      And we also take the mddev_lock.
      However this is not enough.
      
      The final mddev_put could have been called and the
      mddev_delayed_delete is waiting for sysfs to let go so it can destroy
      the kobj and mddev.
      In this state there are a lot of changes that should not be attempted.
      
      To guard against this we:
       - initialise mddev->all_mddevs on the last put so the state can be
         easily detected.
       - in md_attr_show and md_attr_store, check ->all_mddevs under
         all_mddevs_lock and mddev_get the mddev if it still appears to
         be active.
      
      This means that if we get to sysfs as the mddev is being deleted we
      will get -EBUSY.
      
      rdev_attr_store and rdev_attr_show are similar but already have
      sufficient protection.  They check that rdev->mddev still points to
      mddev after taking mddev_lock.  As this is cleared before delayed
      removal, which can only be requested under the mddev_lock, this
      ensures the rdev and mddev are still alive.
      Signed-off-by: NeilBrown <neilb@suse.de>
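A single-threaded C model of the guard described in af8a2434, offered as a sketch only. `struct array_model` and the `model_*` functions are hypothetical. The `on_list` flag stands in for "->all_mddevs is still linked", and the locking the commit relies on (all_mddevs_lock) is reduced to a comment.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of an array object; not the kernel's struct mddev. */
struct array_model {
    int  active;    /* reference count */
    bool on_list;   /* false once the last put has unlinked the array */
};

/* Take a reference only if the array is still linked; otherwise -EBUSY. */
static int model_get_for_sysfs(struct array_model *a)
{
    /* in the kernel this check would run under all_mddevs_lock */
    if (!a->on_list)
        return -EBUSY;
    a->active++;
    return 0;
}

/* Drop a reference; the last put unlinks the array so later gets fail. */
static void model_put(struct array_model *a)
{
    if (--a->active == 0)
        a->on_list = false;
}

/* Shape of a show/store handler: pin the array, do the work, unpin. */
static int model_attr_show(struct array_model *a, int *out)
{
    int err = model_get_for_sysfs(a);

    if (err)
        return err;
    *out = a->active;   /* placeholder for the real attribute read */
    model_put(a);
    return 0;
}
```

Once the last reference has been dropped, a handler that arrives late gets -EBUSY instead of operating on an object that is waiting to be destroyed.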
    • md: refine interpretation of "hold_active == UNTIL_IOCTL". · 1d23f178
      Authored by NeilBrown
      We like md devices to disappear when they really are not needed.
      However it is not possible to tell from the current state whether it
      is needed or not.  We can only tell from recent history of changes.
      
      In particular immediately after we create an md device it looks very
      similar to immediately after we have finished with it.
      
      So we always preserve a newly created md device until something
      significant happens.  This state is stored in 'hold_active'.
      
      The normal case is to keep it until an ioctl happens, as that will
      normally either activate it, or explicitly de-activate it.  If it
      doesn't then it was probably created by mistake and it is now time to
      get rid of it.
      
      We can also modify an array via sysfs (instead of via ioctl), and we
      currently treat any change via sysfs like an ioctl: as a sign that if
      the array is not now more active, it should be destroyed.
      However this is not appropriate, as changes made via sysfs are more
      gradual, so we should look for a more definitive change.
      
      So this patch only changes 'hold_active' from UNTIL_IOCTL to clear
      when the array_state is changed via sysfs.  Other changes via sysfs
      are ignored.
      Signed-off-by: NeilBrown <neilb@suse.de>
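The rule in 1d23f178 can be sketched as a tiny state machine in C. The enum names, `struct md_model` and both functions are assumptions made for illustration. Only the idea is taken from the commit text: an ioctl, or a sysfs write to array_state, ends the UNTIL_IOCTL hold, while other sysfs writes leave it alone.

```c
/* Illustrative states; only UNTIL_IOCTL is named in the commit text. */
enum hold_state { HOLD_CLEAR, HOLD_UNTIL_IOCTL };

/* Illustrative attribute classes: array_state versus everything else. */
enum sysfs_attr { ATTR_ARRAY_STATE, ATTR_OTHER };

struct md_model {
    enum hold_state hold_active;
};

/* Any ioctl counts as the "significant" event that ends the hold. */
static void model_ioctl(struct md_model *m)
{
    if (m->hold_active == HOLD_UNTIL_IOCTL)
        m->hold_active = HOLD_CLEAR;
}

/* A sysfs store ends the hold only when it targets array_state. */
static void model_sysfs_store(struct md_model *m, enum sysfs_attr attr)
{
    if (attr == ATTR_ARRAY_STATE && m->hold_active == HOLD_UNTIL_IOCTL)
        m->hold_active = HOLD_CLEAR;
}
```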
  3. 01 Nov, 2011 1 commit
  4. 19 Oct, 2011 1 commit
  5. 18 Oct, 2011 2 commits
  6. 11 Oct, 2011 4 commits
  7. 07 Oct, 2011 1 commit
  8. 23 Sep, 2011 1 commit
    • md: don't delay reboot by 1 second if no MD devices exist · 2dba6a91
      Authored by Daniel P. Berrange
      The md_notify_reboot() method includes a call to mdelay(1000),
      to deal with "exotic SCSI devices" which are too volatile on
      reboot. The delay is unconditional. Even if the machine does
      not have any block devices, let alone MD devices, the kernel
      shutdown sequence is slowed down.
      
      1 second does not matter much with physical hardware, but with
      certain virtualization use cases any wasted time in the bootup
      & shutdown sequence counts for a lot.
      
      * drivers/md/md.c: md_notify_reboot() - only impose a delay if
        there was at least one MD device to be stopped during reboot
      Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
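A rough C sketch of the conditional delay in 2dba6a91. `struct md_dev` and `notify_reboot_model()` are hypothetical. Instead of calling mdelay(), the function returns the delay it would impose, so it can run outside the kernel.

```c
#include <stddef.h>

/* Hypothetical stand-in for an MD device that can be stopped. */
struct md_dev {
    int stopped;
};

/*
 * Stop every device and report how long the caller should wait:
 * 1000 ms if at least one device was stopped, otherwise 0.
 */
static unsigned notify_reboot_model(struct md_dev *devs, size_t n)
{
    int need_delay = 0;

    for (size_t i = 0; i < n; i++) {
        devs[i].stopped = 1;
        need_delay = 1;
    }
    return need_delay ? 1000u : 0u;
}
```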
  9. 21 Sep, 2011 1 commit
    • md: Avoid waking up a thread after it has been freed. · 01f96c0a
      Authored by NeilBrown
      Two related problems:
      
      1/ some error paths call "md_unregister_thread(mddev->thread)"
         without subsequently clearing ->thread.  A subsequent call
         to mddev_unlock will try to wake the thread, and crash.
      
      2/ Most calls to md_wakeup_thread are protected against the thread
         disappearing either by:
            - holding the ->mutex
            - having an active request, so something else must be keeping
              the array active.
         However mddev_unlock calls md_wakeup_thread after dropping the
         mutex and without any certainty of an active request, so the
         ->thread could theoretically disappear.
         So we need a spinlock to provide some protection.
      
      So change md_unregister_thread to take a pointer to the thread
      pointer, and ensure that it always does the required locking, and
      clears the pointer properly.
      Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      cc: stable@kernel.org
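The interface change in 01f96c0a (pass a pointer to the thread pointer so the callee can clear it) can be sketched in userspace C. `struct worker`, `struct owner`, `worker_unregister()` and `worker_wakeup()` are hypothetical, and the spinlock from the commit is reduced to a comment because this model is single-threaded.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the md thread object. */
struct worker {
    int wakeups;
};

/* Hypothetical stand-in for the object that owns ->thread. */
struct owner {
    struct worker *thread;
};

/*
 * Taking a pointer to the owner's pointer lets this function clear it,
 * so no caller can forget to do so on an error path.
 */
static void worker_unregister(struct worker **threadp)
{
    /* in the kernel a spinlock would be held while the pointer is cleared */
    struct worker *w = *threadp;

    if (w == NULL)
        return;
    *threadp = NULL;   /* clear first so nobody can find the dying worker */
    free(w);
}

/* Waking a cleared pointer is a harmless no-op; returns 1 if woken. */
static int worker_wakeup(struct worker *w)
{
    if (w == NULL)
        return 0;
    w->wakeups++;
    return 1;
}
```

Calling `worker_unregister()` twice on the same owner is safe in this model, because the second call sees a NULL pointer.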
  10. 12 Sep, 2011 1 commit
  11. 10 Sep, 2011 1 commit
    • md: Fix handling for devices from 2TB to 4TB in 0.90 metadata. · 27a7b260
      Authored by NeilBrown
      0.90 metadata uses an unsigned 32bit number to count the number of
      kilobytes used from each device.
      This should allow up to 4TB per device.
      However we multiply this by 2 (to get sectors) before casting to a
      larger type, so sizes above 2TB get truncated.
      
      Also we allow rdev->sectors to be larger than 4TB, so it is possible
      for the array to be resized larger than the metadata can handle.
      So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in
      use.
      
      Also the sanity check at the end of super_90_load should include level
      1 as it uses ->size too. (RAID0 and Linear don't use ->size at all).
      Reported-by: Pim Zandbergen <P.Zandbergen@macroscoop.nl>
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
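The truncation described in 27a7b260 is ordinary 32-bit arithmetic and can be reproduced in a few lines of C. The function names are hypothetical. The sketch only contrasts multiplying before widening with widening before multiplying, and does not reproduce the kernel code.

```c
#include <stdint.h>

/*
 * Truncating order: the multiply by 2 (KiB to 512-byte sectors) is done
 * in 32 bits, so results above 2^32 - 1 sectors wrap before widening.
 */
static uint64_t kib_to_sectors_truncating(uint32_t size_kib)
{
    return (uint64_t)(uint32_t)(size_kib * 2u);
}

/* Widened order: convert to 64 bits first, then multiply. */
static uint64_t kib_to_sectors(uint32_t size_kib)
{
    return (uint64_t)size_kib * 2u;
}
```

For a 3 TiB device (3221225472 KiB) the truncating order yields 2147483648 sectors, which is only 1 TiB, while the widened order yields 6442450944 sectors. Below 2 TiB the two orders agree.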
  12. 30 Aug, 2011 1 commit
  13. 25 Aug, 2011 3 commits
  14. 28 Jul, 2011 9 commits
    • md/raid10 record bad blocks as needed during recovery. · e875ecea
      Authored by NeilBrown
      When recovering one or more devices, if all the good devices have
      bad blocks we should record a bad block on the device being rebuilt.
      
      If this fails, we need to abort the recovery.
      
      To ensure we don't think that we aborted later than we actually did,
      we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync,
      in particular before mddev->curr_resync is updated.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make it easier to wait for bad blocks to be acknowledged. · de393cde
      Authored by NeilBrown
      It is only safe to choose not to write to a bad block if that bad
      block is safely recorded in metadata - i.e. if it has been
      'acknowledged'.
      
      If it hasn't we need to wait for the acknowledgement.
      
      We support that using rdev->blocked_wait and
      md_wait_for_blocked_rdev by introducing a new device flag
      'BlockedBadBlocks'.
      
      This flag is only advisory.
      It is cleared whenever we acknowledge a bad block, so that a waiter
      can re-check the particular bad blocks that it is interested in.
      
      It should be set by a caller when they find they need to wait.
      This (set after test) is inherently racy, but as
      md_wait_for_blocked_rdev already has a timeout, losing the race will
      have minimal impact.
      
      When we clear "Blocked" we also clear "BlockedBadBlocks" in case it
      was set incorrectly (see above race).
      
      We also modify the way we manage 'Blocked' to fit better with the new
      handling of 'BlockedBadBlocks' and to make it consistent between
      externally managed and internally managed metadata.   This requires
      that each raidXd loop checks if the metadata needs to be written and
      triggers a write (md_check_recovery) if needed.  Otherwise a queued
      write request might cause raidXd to wait for the metadata to write,
      and only that thread can write it.
      
      Before writing metadata, we set FaultRecorded for all devices that
      are Faulty, then after writing the metadata we clear Blocked for any
      device for which the Fault was certainly Recorded.
      
      The 'faulty' device flag now appears in sysfs if the device is faulty
      *or* it has unacknowledged bad blocks.  So user-space which does not
      understand bad blocks can continue to function correctly.
      User space which does, should not assume a device is faulty until it
      sees the 'faulty' flag, and then sees the list of unacknowledged bad
      blocks is empty.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: add 'write_error' flag to component devices. · d7a9d443
      Authored by NeilBrown
      If a device has ever seen a write error, we will want to handle
      known-bad-blocks differently.
      So create an appropriate state flag and export it via sysfs.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
    • md/raid1: avoid reading from known bad blocks. · d2eb35ac
      Authored by NeilBrown
      Now that we have a bad block list, we should not read from those
      blocks.
      There are several main parts to this:
        1/ read_balance needs to check for bad blocks, and return not only
           the chosen device, but also how many good blocks are available
           there.
        2/ fix_read_error needs to avoid trying to read from bad blocks.
        3/ read submission must be ready to issue multiple reads to
           different devices as different bad blocks on different devices
           could mean that a single large read cannot be served by any one
           device, but can still be served by the array.
           This requires keeping count of the number of outstanding requests
           per bio.  This count is stored in 'bi_phys_segments'
        4/ retrying a read needs to also be ready to submit a smaller read
           and queue another request for the rest.
      
      This does not yet handle bad blocks when reading to perform resync,
      recovery, or check.
      
      'md_trim_bio' will also be used for RAID10, so put it in md.c and
      export it.
      Signed-off-by: NeilBrown <neilb@suse.de>
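Points 1/ and 3/ of d2eb35ac can be illustrated with a simplified C model. Everything here is an assumption made for the sketch: `struct mirror` holds a single bad range per device, `read_balance_model()` simply prefers the mirror with the longest readable run, and `split_read()` counts the pieces a large read would be cut into. The real read_balance weighs other factors that are not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one bad range per mirror keeps the sketch short. */
struct mirror {
    uint64_t bad_start;   /* first bad sector */
    uint64_t bad_len;     /* number of bad sectors; 0 means none */
};

/* Readable sectors on 'm' starting at 'sector', capped at 'want'. */
static uint64_t good_run(const struct mirror *m, uint64_t sector, uint64_t want)
{
    if (m->bad_len == 0 || sector >= m->bad_start + m->bad_len)
        return want;                 /* no bad range at or after 'sector' */
    if (sector >= m->bad_start)
        return 0;                    /* the read would start on a bad block */
    uint64_t room = m->bad_start - sector;
    return room < want ? room : want;
}

/* Return the chosen mirror and, via max_sectors, how much it can serve. */
static int read_balance_model(const struct mirror *m, size_t n, uint64_t sector,
                              uint64_t want, uint64_t *max_sectors)
{
    int best = -1;
    uint64_t best_len = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t len = good_run(&m[i], sector, want);
        if (len > best_len) {
            best = (int)i;
            best_len = len;
        }
    }
    *max_sectors = best_len;
    return best;
}

/* Cut one read into per-mirror pieces; -1 if some sector is bad everywhere. */
static int split_read(const struct mirror *m, size_t n, uint64_t sector, uint64_t len)
{
    int pieces = 0;

    while (len > 0) {
        uint64_t got;

        if (read_balance_model(m, n, sector, len, &got) < 0)
            return -1;
        sector += got;
        len -= got;
        pieces++;
    }
    return pieces;
}
```

With bad ranges at different offsets on two mirrors, a read that no single mirror can serve is still completed in two pieces. If both mirrors are bad at the same sector, the model reports failure.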
    • md: Disable bad blocks and v0.90 metadata. · 9f2f3830
      Authored by NeilBrown
      v0.90 metadata cannot record bad blocks, so when loading metadata
      for such a device, set shift to -1.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: load/store badblock list from v1.x metadata · 2699b672
      Authored by NeilBrown
      Space must have been allocated when array was created.
      A feature flag is set when the badblock list is non-empty, to
      ensure old kernels don't load and trust the whole device.
      
      We only update the on-disk badblocklist when it has changed.
      If the badblocklist (or other metadata) is stored on a bad block, we
      don't cope very well.
      
      If the metadata has no room for bad blocks, flag bad-blocks as
      disabled, and do the same for 0.90 metadata.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/bad-block-log: add sysfs interface for accessing bad-block-log. · 16c791a5
      Authored by NeilBrown
      This can show the log (providing it fits in one page) and
      allows bad blocks to be 'acknowledged' meaning that they
      have safely been recorded in metadata.
      
      Clearing bad blocks is not allowed via sysfs (except for
      code testing).  A bad block can only be cleared when
      a write to the block succeeds.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
    • md: beginnings of bad block management. · 2230dfe4
      Authored by NeilBrown
      This is the first step in allowing md to track bad-blocks per-device so
      that we can fail individual blocks rather than the whole device.
      
      This patch just adds a data structure for recording bad blocks, with
      routines to add to, remove from, and search the list.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
    • md: remove suspicious size_of() · a519b26d
      Authored by NeilBrown
      When calling bioset_create we pass the size of the front_pad as
         sizeof(mddev)
      which looks suspicious as mddev is a pointer and so it looks like a
      common mistake where
         sizeof(*mddev)
      was intended.
      The size is actually correct as we want to store a pointer in the
      front padding of the bios created by the bioset, so make the intent
      more explicit by using
         sizeof(mddev_t *)
      Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
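The three spellings discussed in a519b26d can be compared directly in C. `struct big` is a hypothetical stand-in for the large array structure and the function names are illustrative. The sketch shows that `sizeof(p)` and `sizeof(struct big *)` give the same value, while `sizeof(*p)` gives the size of the whole structure.

```c
#include <stddef.h>

/* Hypothetical stand-in for a large structure such as the array object. */
struct big {
    char payload[256];
};

/* sizeof applied to the pointer variable: correct here, but easy to misread. */
static size_t pad_ambiguous(void)
{
    struct big *p = NULL;

    (void)p;
    return sizeof(p);
}

/* sizeof applied to the pointer type: same value, intent spelled out. */
static size_t pad_explicit(void)
{
    return sizeof(struct big *);
}

/* sizeof applied to the pointed-to object: what a reader might assume. */
static size_t pad_whole_struct(void)
{
    struct big *p = NULL;

    (void)p;
    return sizeof(*p);   /* operand is not evaluated, so NULL is fine */
}
```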
  15. 27 Jul, 2011 5 commits
  16. 21 Jul, 2011 1 commit
  17. 28 Jun, 2011 1 commit
    • md: avoid endless recovery loop when waiting for fail device to complete. · 4274215d
      Authored by NeilBrown
      If a device fails in a way that causes pending requests to take a while
      to complete, md will not be able to immediately remove it from the
      array in remove_and_add_spares.
      It will then incorrectly look like a spare device and md will try to
      recover it even though it is failed.
      This leads to a recovery process starting and instantly aborting over
      and over again.
      
      We should check if the device is faulty before considering it to be a
      spare.  This will avoid trying to start a recovery that cannot
      proceed.
      
      This bug was introduced in 2.6.26 so this patch is suitable for any
      kernel since then.
      
      Cc: stable@kernel.org
      Reported-by: Jim Paradis <james.paradis@stratus.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
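The check added by 4274215d can be sketched as a predicate in C. `struct dev_model` and its fields are hypothetical simplifications of the per-device state. The only point taken from the commit is that a device which still holds a slot and is not in sync must also be non-faulty before it counts as a spare.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified per-device state. */
struct dev_model {
    int  raid_disk;   /* slot index, or -1 when the device holds no slot */
    bool in_sync;
    bool faulty;      /* failed, but possibly not yet removed from its slot */
};

/* A device is a recovery candidate only if it has not failed. */
static bool counts_as_spare(const struct dev_model *d)
{
    return d->raid_disk >= 0 && !d->in_sync && !d->faulty;
}

/* Count recovery candidates; zero means there is no recovery to start. */
static int count_spares(const struct dev_model *devs, size_t n)
{
    int spares = 0;

    for (size_t i = 0; i < n; i++)
        if (counts_as_spare(&devs[i]))
            spares++;
    return spares;
}
```

Without the `faulty` test, a failed device that is still waiting for its pending requests would be counted, and a recovery would be started that can only abort.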
  18. 09 Jun, 2011 2 commits
    • md: check ->hot_remove_disk when removing disk · 01393f3d
      Authored by Namhyung Kim
      Check pers->hot_remove_disk instead of pers->hot_add_disk in slot_store()
      during disk removal. The linear personality only has ->hot_add_disk and
      no ->hot_remove_disk, so removing a disk from the array resulted in the
      following kernel bug:
      
      $ sudo mdadm --create /dev/md0 --level=linear --raid-devices=4 /dev/loop[0-3]
      $ echo none | sudo tee /sys/block/md0/md/dev-loop2/slot
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: [<          (null)>]           (null)
       PGD c9f5d067 PUD 8575a067 PMD 0
       Oops: 0010 [#1] SMP
       CPU 2
       Modules linked in: linear loop bridge stp llc kvm_intel kvm asus_atk0110 sr_mod cdrom sg
      
       Pid: 10450, comm: tee Not tainted 3.0.0-rc1-leonard+ #173 System manufacturer System Product Name/P5G41TD-M PRO
       RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
       RSP: 0018:ffff880085757df0  EFLAGS: 00010282
       RAX: ffffffffa00168e0 RBX: ffff8800d1431800 RCX: 000000000000006e
       RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff88008543c000
       RBP: ffff880085757e48 R08: 0000000000000002 R09: 000000000000000a
       R10: 0000000000000000 R11: ffff88008543c2e0 R12: 00000000ffffffff
       R13: ffff8800b4641000 R14: 0000000000000005 R15: 0000000000000000
       FS:  00007fe8c9e05700(0000) GS:ffff88011fa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000000 CR3: 00000000b4502000 CR4: 00000000000406e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       Process tee (pid: 10450, threadinfo ffff880085756000, task ffff8800c9f08000)
       Stack:
        ffffffff8138496a ffff8800b4641000 ffff88008543c268 0000000000000000
        ffff8800b4641000 ffff88008543c000 ffff8800d1431868 ffffffff81a78a90
        ffff8800b4641000 ffff88008543c000 ffff8800d1431800 ffff880085757e98
       Call Trace:
        [<ffffffff8138496a>] ? slot_store+0xaa/0x265
        [<ffffffff81384bae>] rdev_attr_store+0x89/0xa8
        [<ffffffff8115a96a>] sysfs_write_file+0x108/0x144
        [<ffffffff81106b87>] vfs_write+0xb1/0x10d
        [<ffffffff8106e6c0>] ? trace_hardirqs_on_caller+0x111/0x135
        [<ffffffff81106cac>] sys_write+0x4d/0x77
        [<ffffffff814fe702>] system_call_fastpath+0x16/0x1b
       Code:  Bad RIP value.
       RIP  [<          (null)>]           (null)
        RSP <ffff880085757df0>
       CR2: 0000000000000000
       ---[ end trace ba5fc64319a826fb ]---
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
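The NULL call in the oops above comes from testing one optional hook and then calling another. A small C sketch of the corrected pattern follows. `struct personality_model`, the two example personalities and the -EINVAL return value are assumptions made for illustration, not a copy of slot_store().

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical table of optional hooks, loosely modelled on a personality. */
struct personality_model {
    int (*hot_add_disk)(int slot);
    int (*hot_remove_disk)(int slot);   /* may be NULL, as for linear */
};

static int model_add(int slot)    { (void)slot; return 0; }
static int model_remove(int slot) { (void)slot; return 0; }

/* A linear-like personality: it can add disks but not remove them. */
static const struct personality_model linear_like = {
    .hot_add_disk    = model_add,
    .hot_remove_disk = NULL,
};

/* A personality that supports both operations. */
static const struct personality_model raid_like = {
    .hot_add_disk    = model_add,
    .hot_remove_disk = model_remove,
};

/* Test the hook that is about to be called, not its sibling. */
static int model_remove_from_slot(const struct personality_model *p, int slot)
{
    if (p->hot_remove_disk == NULL)
        return -EINVAL;   /* error code chosen for this sketch */
    return p->hot_remove_disk(slot);
}
```

Testing `hot_add_disk` here instead would pass for the linear-like table and then jump through a NULL pointer, which matches the "RIP: (null)" in the trace.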