1. 18 Oct 2011 — 1 commit
    • md: clear In_sync bit on devices added to an active array. · d30519fc
      Committed by NeilBrown
      When we add a device to an active array it can be meaningful to set
      the 'In_sync' flag.  This indicates that the device is in-sync with the
      array except for locations recorded in the bitmap.
      A bitmap-based recovery can then bring it completely in-sync.
      
      Internally we move that flag to 'saved_raid_disk' but forgot to clear
      In_sync like we do in add_new_disk.
      
      So clear In_sync after moving its value to saved_raid_disk.
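      A minimal, self-contained model of the fix (the struct and helper below
      are illustrative stand-ins, not the kernel's types):

        #include <stdbool.h>

        /* Model of the corrected hot-add path: remember the in-sync hint in
         * saved_raid_disk, then clear the live flag so a bitmap-based
         * recovery actually runs.  Names mirror md's, but this is a sketch. */
        struct rdev_model {
            int  raid_disk;        /* slot in the array, -1 if none */
            int  saved_raid_disk;  /* hint for bitmap-based recovery */
            bool in_sync;          /* models the In_sync flag */
        };

        static void hot_add_device(struct rdev_model *rdev, int slot)
        {
            rdev->raid_disk = slot;
            /* preserve the in-sync hint for the recovery code ... */
            rdev->saved_raid_disk = rdev->in_sync ? slot : -1;
            /* ... and clear the live flag: the step the old code forgot */
            rdev->in_sync = false;
        }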
      Reported-by: Andrei Warkentin <andreiw@vmware.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 11 Oct 2011 — 4 commits
  3. 07 Oct 2011 — 1 commit
  4. 23 Sep 2011 — 1 commit
    • md: don't delay reboot by 1 second if no MD devices exist · 2dba6a91
      Committed by Daniel P. Berrange
      The md_notify_reboot() method includes a call to mdelay(1000),
      to deal with "exotic SCSI devices" which are too volatile on
      reboot. The delay is unconditional. Even if the machine does
      not have any block devices, let alone MD devices, the kernel
      shutdown sequence is slowed down.
      
      1 second does not matter much with physical hardware, but with
      certain virtualization use cases any wasted time in the bootup
      & shutdown sequence counts for a lot.
      
      * drivers/md/md.c: md_notify_reboot() - only impose a delay if
        there was at least one MD device to be stopped during reboot
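      A sketch of the resulting logic (a self-contained model, not the kernel
      source; the list walk stands in for md's device iteration):

        #include <stdbool.h>
        #include <stddef.h>

        struct md_dev_model { bool active; struct md_dev_model *next; };

        /* Stop every active array and report whether anything was stopped. */
        static bool stop_all_arrays(struct md_dev_model *devs)
        {
            bool stopped_any = false;

            for (struct md_dev_model *d = devs; d != NULL; d = d->next)
                if (d->active) {
                    d->active = false;      /* "stop" the array */
                    stopped_any = true;
                }
            return stopped_any;
        }

        /* In md_notify_reboot() the grace period for "exotic SCSI devices"
         * then becomes conditional: if (stop_all_arrays(...)) mdelay(1000); */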
      Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  5. 21 Sep 2011 — 1 commit
    • md: Avoid waking up a thread after it has been freed. · 01f96c0a
      Committed by NeilBrown
      Two related problems:
      
      1/ some error paths call "md_unregister_thread(mddev->thread)"
         without subsequently clearing ->thread.  A subsequent call
         to mddev_unlock will try to wake the thread, and crash.
      
      2/ Most calls to md_wakeup_thread are protected against the thread
         disappearing either by:
            - holding the ->mutex
            - having an active request, so something else must be keeping
              the array active.
         However mddev_unlock calls md_wakeup_thread after dropping the
         mutex and without any certainty of an active request, so the
         ->thread could theoretically disappear.
         So we need a spinlock to provide some protection.
      
      So change md_unregister_thread to take a pointer to the thread
      pointer, and ensure that it always does the required locking and
      clears the pointer properly.
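      A user-space model of the new convention (a sketch, not md's code; a
      pthread mutex stands in for the kernel spinlock):

        #include <pthread.h>
        #include <stdlib.h>

        struct md_thread_model { int wakeups; }; /* stand-in for md_thread */

        static pthread_mutex_t thread_lock = PTHREAD_MUTEX_INITIALIZER;

        /* Takes the *address* of the caller's thread pointer so it can be
         * cleared under the lock before the thread is freed. */
        static void unregister_thread(struct md_thread_model **threadp)
        {
            struct md_thread_model *t;

            pthread_mutex_lock(&thread_lock);
            t = *threadp;
            *threadp = NULL;                /* wakers now see NULL ... */
            pthread_mutex_unlock(&thread_lock);
            free(t);                        /* ... so freeing is safe */
        }

        static void wakeup_thread(struct md_thread_model **threadp)
        {
            pthread_mutex_lock(&thread_lock);
            if (*threadp)
                (*threadp)->wakeups++;  /* can't vanish while lock is held */
            pthread_mutex_unlock(&thread_lock);
        }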
      Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      cc: stable@kernel.org
  6. 10 Sep 2011 — 1 commit
    • md: Fix handling for devices from 2TB to 4TB in 0.90 metadata. · 27a7b260
      Committed by NeilBrown
      0.90 metadata uses an unsigned 32bit number to count the number of
      kilobytes used from each device.
      This should allow up to 4TB per device.
      However we multiply this by 2 (to get sectors) before casting to a
      larger type, so sizes above 2TB get truncated.
      
      Also we allow rdev->sectors to be larger than 4TB, so it is possible
      for the array to be resized larger than the metadata can handle.
      So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in
      use.
      
      Also the sanity check at the end of super_90_load should include level
      1, as it uses ->size too.  (RAID0 and Linear don't use ->size at all.)
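      The truncation is easy to reproduce in isolation (illustrative only;
      'kb' models the 0.90 per-device kilobyte count):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint32_t kb = UINT32_C(3) << 30;      /* 3TB counted in 1K blocks */

            uint64_t wrong = (uint64_t)(kb * 2);  /* *2 wraps in 32 bits: 1TB */
            uint64_t right = (uint64_t)kb * 2;    /* widen first: 3TB */

            printf("wrong: %llu sectors\n", (unsigned long long)wrong);
            printf("right: %llu sectors\n", (unsigned long long)right);
            return 0;
        }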
      Reported-by: Pim Zandbergen <P.Zandbergen@macroscoop.nl>
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
  7. 30 Aug 2011 — 1 commit
  8. 25 Aug 2011 — 3 commits
  9. 28 Jul 2011 — 9 commits
    • md/raid10: record bad blocks as needed during recovery. · e875ecea
      Committed by NeilBrown
      When recovering one or more devices, if all the good devices have
      bad blocks we should record a bad block on the device being rebuilt.
      
      If this fails, we need to abort the recovery.
      
      To ensure we don't think that we aborted later than we actually did,
      we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync,
      in particular before mddev->curr_resync is updated.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make it easier to wait for bad blocks to be acknowledged. · de393cde
      Committed by NeilBrown
      It is only safe to choose not to write to a bad block if that bad
      block is safely recorded in metadata - i.e. if it has been
      'acknowledged'.
      
      If it hasn't we need to wait for the acknowledgement.
      
      We support that using the rdev->blocked wait machinery and
      md_wait_for_blocked_rdev, by introducing a new device flag
      'BlockedBadBlocks'.
      
      This flag is only advisory.
      It is cleared whenever we acknowledge a bad block, so that a waiter
      can re-check the particular bad blocks that it is interested in.
      
      It should be set by a caller when they find they need to wait.
      This (set after test) is inherently racy, but as
      md_wait_for_blocked_rdev already has a timeout, losing the race will
      have minimal impact.
      
      When we clear "Blocked" we also clear "BlockedBadBlocks", in case it
      was set incorrectly (see the race above).
      
      We also modify the way we manage 'Blocked' to fit better with the new
      handling of 'BlockedBadBlocks' and to make it consistent between
      externally managed and internally managed metadata.  This requires
      that each raidXd loop checks whether the metadata needs to be written
      and triggers a write (md_check_recovery) if needed.  Otherwise a
      queued write request might cause raidXd to wait for the metadata to
      be written, and only that thread can write it.
      
      Before writing metadata, we set FaultRecorded for all devices that
      are Faulty, then after writing the metadata we clear Blocked for any
      device for which the Fault was certainly Recorded.
      
      The 'faulty' device flag now appears in sysfs if the device is faulty
      *or* it has unacknowledged bad blocks.  So user-space which does not
      understand bad blocks can continue to function correctly.
      User-space which does should not assume a device is faulty until it
      sees the 'faulty' flag and then sees that the list of unacknowledged
      bad blocks is empty.
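      A user-space model of the advisory-flag protocol described above (a
      sketch; a pthread condition variable stands in for md's wait queue,
      and the 5-second timeout is arbitrary):

        #include <pthread.h>
        #include <stdbool.h>
        #include <time.h>

        static pthread_mutex_t bb_lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  bb_cond = PTHREAD_COND_INITIALIZER;
        static bool blocked_badblocks;         /* models the advisory flag */

        /* Waiter: decides it must wait, then sets the flag (set after
         * test).  Losing the race against an ack costs one timeout. */
        static void wait_for_acknowledgement(bool (*still_unacked)(void))
        {
            struct timespec to;

            pthread_mutex_lock(&bb_lock);
            while (still_unacked()) {
                blocked_badblocks = true;
                clock_gettime(CLOCK_REALTIME, &to);
                to.tv_sec += 5;                /* timeout bounds the race */
                pthread_cond_timedwait(&bb_cond, &bb_lock, &to);
            }
            pthread_mutex_unlock(&bb_lock);
        }

        /* Acknowledger: clears the flag on every ack so waiters re-check. */
        static void acknowledge_bad_block(void)
        {
            pthread_mutex_lock(&bb_lock);
            blocked_badblocks = false;
            pthread_cond_broadcast(&bb_cond);
            pthread_mutex_unlock(&bb_lock);
        }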
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: add 'write_error' flag to component devices. · d7a9d443
      Committed by NeilBrown
      If a device has ever seen a write error, we will want to handle
      known-bad-blocks differently.
      So create an appropriate state flag and export it via sysfs.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
    • md/raid1: avoid reading from known bad blocks. · d2eb35ac
      Committed by NeilBrown
      Now that we have a bad block list, we should not read from those
      blocks.
      There are several main parts to this:
        1/ read_balance needs to check for bad blocks, and return not only
           the chosen device, but also how many good blocks are available
           there.
        2/ fix_read_error needs to avoid trying to read from bad blocks.
        3/ read submission must be ready to issue multiple reads to
           different devices as different bad blocks on different devices
           could mean that a single large read cannot be served by any one
           device, but can still be served by the array.
           This requires keeping count of the number of outstanding requests
           per bio.  This count is stored in 'bi_phys_segments'.
        4/ retrying a read needs to also be ready to submit a smaller read
           and queue another request for the rest.
      
      This does not yet handle bad blocks when reading to perform resync,
      recovery, or check.
      
      'md_trim_bio' will also be used for RAID10, so put it in md.c and
      export it.
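      The splitting decision in part 3 reduces to computing the usable
      prefix on each device (an illustrative helper, not raid1's actual
      read_balance code):

        #include <stdint.h>

        typedef uint64_t sector_t;

        /* Given a read of 'len' sectors at 'start' and the first bad sector
         * at or after 'start' on this device, return how many sectors this
         * device can serve; the caller queues the remainder elsewhere. */
        static sector_t good_sectors(sector_t start, sector_t len,
                                     sector_t first_bad)
        {
            if (first_bad <= start)
                return 0;              /* begins in a bad range: skip dev */
            if (first_bad >= start + len)
                return len;            /* whole read is clean */
            return first_bad - start;  /* serve prefix, retry the rest */
        }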
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Disable bad blocks for v0.90 metadata. · 9f2f3830
      Committed by NeilBrown
      v0.90 metadata cannot record bad blocks, so when loading metadata
      for such a device, set shift to -1.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: load/store badblock list from v1.x metadata · 2699b672
      Committed by NeilBrown
      Space must have been allocated when the array was created.
      A feature flag is set when the badblock list is non-empty, to
      ensure old kernels don't load and trust the whole device.
      
      We only update the on-disk badblocklist when it has changed.
      If the badblocklist (or other metadata) is stored on a bad block, we
      don't cope very well.
      
      If the metadata has no room for a bad block list, flag bad blocks as
      disabled, and do the same for 0.90 metadata.
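      As a sketch of the on-disk layout (my reading of the v1.x bad-block
      log; treat the exact 54/10 bit split as an assumption): each entry is
      one little-endian 64-bit word, sector in the high bits, length in the
      low 10 bits.

        #include <stdint.h>

        static uint64_t bb_encode(uint64_t sector, unsigned len)
        {
            return (sector << 10) | (len & 0x3ffu); /* len in low 10 bits */
        }

        static void bb_decode(uint64_t bb, uint64_t *sector, unsigned *len)
        {
            *sector = bb >> 10;
            *len    = (unsigned)(bb & 0x3ffu);
        }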
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/bad-block-log: add sysfs interface for accessing bad-block-log. · 16c791a5
      Committed by NeilBrown
      This can show the log (provided it fits in one page) and allows bad
      blocks to be 'acknowledged', meaning that they have safely been
      recorded in metadata.
      
      Clearing bad blocks is not allowed via sysfs (except for
      code testing).  A bad block can only be cleared when
      a write to the block succeeds.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
    • md: beginnings of bad block management. · 2230dfe4
      Committed by NeilBrown
      This is the first step in allowing md to track bad blocks per-device
      so that we can fail individual blocks rather than the whole device.
      
      This patch just adds a data structure for recording bad blocks, with
      routines to add, remove, and search the list.
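      A compact model of such a structure (a sketch: sorted 64-bit entries
      packing sector, length, and an 'acknowledged' bit, found by binary
      search; the exact bit layout here is illustrative):

        #include <stdint.h>

        #define BB_MAKE(sec, len, ack) \
                (((uint64_t)(sec) << 9) | ((uint64_t)((len) - 1)) | \
                 ((uint64_t)!!(ack) << 63))
        #define BB_OFFSET(x)  (((x) >> 9) & ((UINT64_C(1) << 54) - 1))
        #define BB_LEN(x)     ((unsigned)((x) & 0x1ff) + 1)
        #define BB_ACK(x)     (!!((x) >> 63))

        /* Index of the last entry starting at or before 'sector' (the only
         * candidate that can overlap it), or -1 if none does. */
        static int bb_search(const uint64_t *p, int count, uint64_t sector)
        {
            int lo = 0, hi = count;

            while (lo < hi) {
                int mid = (lo + hi) / 2;
                if (BB_OFFSET(p[mid]) <= sector)
                    lo = mid + 1;
                else
                    hi = mid;
            }
            return lo - 1;
        }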
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Namhyung Kim <namhyung@gmail.com>
    • md: remove suspicious size_of() · a519b26d
      Committed by NeilBrown
      When calling bioset_create we pass the size of the front_pad as
         sizeof(mddev)
      which looks suspicious as mddev is a pointer and so it looks like a
      common mistake where
         sizeof(*mddev)
      was intended.
      The size is actually correct as we want to store a pointer in the
      front padding of the bios created by the bioset, so make the intent
      more explicit by using
         sizeof(mddev_t *)
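      The point is easy to demonstrate (illustrative struct; the printed
      sizes assume a 64-bit build):

        #include <stdio.h>

        struct mddev_model { char payload[4096]; }; /* stand-in for mddev_t */

        int main(void)
        {
            struct mddev_model *mddev = NULL;

            printf("sizeof(mddev)  = %zu\n", sizeof(mddev));  /* 8: a pointer */
            printf("sizeof(*mddev) = %zu\n", sizeof(*mddev)); /* 4096: too big */
            printf("sizeof(struct mddev_model *) = %zu\n",
                   sizeof(struct mddev_model *));             /* 8, and clear */
            return 0;
        }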
      Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  10. 27 Jul 2011 — 5 commits
  11. 21 Jul 2011 — 1 commit
  12. 28 Jun 2011 — 1 commit
    • md: avoid endless recovery loop when waiting for a failed device to complete. · 4274215d
      Committed by NeilBrown
      If a device fails in a way that causes pending requests to take a
      while to complete, md will not be able to immediately remove it from
      the array in remove_and_add_spares.
      It will then incorrectly look like a spare device and md will try to
      recover it even though it has failed.
      This leads to a recovery process starting and instantly aborting over
      and over again.
      
      We should check if the device is faulty before considering it to be a
      spare.  This will avoid trying to start a recovery that cannot
      proceed.
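      A sketch of the corrected test (a model of the spare-selection
      condition, not the kernel source):

        #include <stdbool.h>

        struct rdev_model { int raid_disk; bool faulty; bool in_sync; };

        /* A device with no slot is only a recovery candidate if it is not
         * marked Faulty: that extra test is what stops the abort loop. */
        static bool is_spare_candidate(const struct rdev_model *rdev)
        {
            return rdev->raid_disk < 0 && !rdev->faulty;
        }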
      
      This bug was introduced in 2.6.26, so this patch is suitable for any
      kernel since then.
      
      Cc: stable@kernel.org
      Reported-by: Jim Paradis <james.paradis@stratus.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  13. 09 Jun 2011 — 2 commits
    • md: check ->hot_remove_disk when removing disk · 01393f3d
      Committed by Namhyung Kim
      Check pers->hot_remove_disk instead of pers->hot_add_disk in
      slot_store() during disk removal.  The linear personality only has
      ->hot_add_disk and no ->hot_remove_disk, so removing a disk from the
      array resulted in the following kernel bug:
      
      $ sudo mdadm --create /dev/md0 --level=linear --raid-devices=4 /dev/loop[0-3]
      $ echo none | sudo tee /sys/block/md0/md/dev-loop2/slot
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: [<          (null)>]           (null)
       PGD c9f5d067 PUD 8575a067 PMD 0
       Oops: 0010 [#1] SMP
       CPU 2
       Modules linked in: linear loop bridge stp llc kvm_intel kvm asus_atk0110 sr_mod cdrom sg
      
       Pid: 10450, comm: tee Not tainted 3.0.0-rc1-leonard+ #173 System manufacturer System Product Name/P5G41TD-M PRO
       RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
       RSP: 0018:ffff880085757df0  EFLAGS: 00010282
       RAX: ffffffffa00168e0 RBX: ffff8800d1431800 RCX: 000000000000006e
       RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff88008543c000
       RBP: ffff880085757e48 R08: 0000000000000002 R09: 000000000000000a
       R10: 0000000000000000 R11: ffff88008543c2e0 R12: 00000000ffffffff
       R13: ffff8800b4641000 R14: 0000000000000005 R15: 0000000000000000
       FS:  00007fe8c9e05700(0000) GS:ffff88011fa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000000 CR3: 00000000b4502000 CR4: 00000000000406e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       Process tee (pid: 10450, threadinfo ffff880085756000, task ffff8800c9f08000)
       Stack:
        ffffffff8138496a ffff8800b4641000 ffff88008543c268 0000000000000000
        ffff8800b4641000 ffff88008543c000 ffff8800d1431868 ffffffff81a78a90
        ffff8800b4641000 ffff88008543c000 ffff8800d1431800 ffff880085757e98
       Call Trace:
        [<ffffffff8138496a>] ? slot_store+0xaa/0x265
        [<ffffffff81384bae>] rdev_attr_store+0x89/0xa8
        [<ffffffff8115a96a>] sysfs_write_file+0x108/0x144
        [<ffffffff81106b87>] vfs_write+0xb1/0x10d
        [<ffffffff8106e6c0>] ? trace_hardirqs_on_caller+0x111/0x135
        [<ffffffff81106cac>] sys_write+0x4d/0x77
        [<ffffffff814fe702>] system_call_fastpath+0x16/0x1b
       Code:  Bad RIP value.
       RIP  [<          (null)>]           (null)
        RSP <ffff880085757df0>
       CR2: 0000000000000000
       ---[ end trace ba5fc64319a826fb ]---
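      A sketch of the fix (a minimal model of the personality-dispatch bug,
      not the kernel code):

        #include <errno.h>
        #include <stddef.h>

        struct pers_model {
            int (*hot_add_disk)(void);
            int (*hot_remove_disk)(void);   /* NULL for linear */
        };

        static int slot_store_remove(const struct pers_model *pers)
        {
            /* old code tested ->hot_add_disk here, then called NULL */
            if (pers->hot_remove_disk == NULL)
                return -EINVAL;
            return pers->hot_remove_disk();
        }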
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
  14. 08 Jun 2011 — 5 commits
  15. 11 May 2011 — 3 commits
    • md: allow resync_start to be set while an array is active. · b098636c
      Committed by NeilBrown
      The sysfs attribute 'resync_start' (known internally as recovery_cp),
      records where a resync is up to.  A value of 0 means the array is
      not known to be in-sync at all.  A value of MaxSector means the array
      is believed to be fully in-sync.
      
      When the size of the member devices of an array (RAID1, RAID4/5/6)
      is increased, the array can be increased to match.
      resync_start to the old end-of-device offset so that the new part of
      the array gets resynced.
      
      However with RAID1 (and RAID6) a resync is not technically necessary
      and may be undesirable.  So it would be good if the implied resync
      after the array is resized could be avoided.
      
      So: change 'resync_start' so the value can be changed while the array
      is active, and as a precaution only allow it to be changed while
      resync/recovery is 'frozen'.  Changing it once resync has started is
      not going to be useful anyway.
      
      This allows the array to be resized without a resync by:
        write 'frozen' to 'sync_action'
        write new size to 'component_size' (this will set resync_start)
        write 'none' to 'resync_start'
        write 'idle' to 'sync_action'.
      
      Also slightly improve some tests on recovery_cp when resizing
      raid1/raid5.  Now that an arbitrary value can be set, we should be
      more careful in our tests.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: reject a re-add request that cannot be honoured. · bedd86b7
      Committed by NeilBrown
      The 'add_new_disk' ioctl can be used to add a device either as a
      spare, or as an active disk that just needs to be resynced based on
      write-intent-bitmap information (re-add).
      
      Currently if a re-add is requested but fails, we add the device as a
      spare instead.  This makes it impossible for user-space to check for
      failure.
      
      So change to require that a re-add attempt will either succeed or
      completely fail.  User-space can then decide what to do next.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Fix race when creating a new md device. · b0140891
      Committed by NeilBrown
      There is a race when creating an md device by opening /dev/mdXX.
      
      If two processes do this at much the same time they will follow the
      call path
        __blkdev_get -> get_gendisk -> kobj_lookup
      
      The first will call
        -> md_probe -> md_alloc -> add_disk -> blk_register_region
      
      and the race happens when the second gets to kobj_lookup after
      add_disk has called blk_register_region but before it returns to
      md_alloc.
      
      In that case the second will not call md_probe (as the probe is
      already done) but will get a handle on the gendisk and return to
      __blkdev_get, which will then call md_open (via the ->open pointer).
      
      As mddev->gendisk hasn't been set yet, md_open will think something
      is wrong and return -ERESTARTSYS.
      
      This can loop endlessly while the first thread makes no progress
      through add_disk.  Nothing is blocking it, but due to scheduler
      behaviour it doesn't get a turn.
      So this is essentially a live-lock.
      
      We fix this by simply moving the assignment to mddev->gendisk before
      the call to add_disk(), so md_open doesn't get confused.
      Also move blk_queue_flush earlier because add_disk should be as late
      as possible.
      
      To make sure that md_open doesn't complete until md_alloc has done all
      that is needed, we take mddev->open_mutex during the last part of
      md_alloc.  md_open will wait for this.
      
      This can cause a lock-up on boot so Cc:ing for stable.
      For 2.6.36 and earlier a different patch will be needed as the
      'blk_queue_flush' call isn't there.
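      A user-space model of the ordering (a sketch, not the kernel code; a
      pthread mutex stands in for open_mutex):

        #include <pthread.h>
        #include <stddef.h>

        struct mddev_model {
            pthread_mutex_t open_mutex;
            void *gendisk;                 /* NULL until fully set up */
        };

        static void md_alloc_model(struct mddev_model *m, void *disk)
        {
            pthread_mutex_lock(&m->open_mutex);
            m->gendisk = disk;         /* set BEFORE the disk is published */
            /* add_disk() would go here: only now can an open find us,
             * and it will block on open_mutex until setup finishes */
            pthread_mutex_unlock(&m->open_mutex);
        }

        static int md_open_model(struct mddev_model *m)
        {
            int ret;

            pthread_mutex_lock(&m->open_mutex);  /* waits out md_alloc */
            ret = (m->gendisk != NULL) ? 0 : -1; /* no endless retry loop */
            pthread_mutex_unlock(&m->open_mutex);
            return ret;
        }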
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
      Tested-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
      Cc: stable@kernel.org
  16. 20 Apr 2011 — 1 commit