1. 28 7月, 2011 6 次提交
    • N
      md/raid1: avoid reading from known bad blocks. · d2eb35ac
      NeilBrown 提交于
      Now that we have a bad block list, we should not read from those
      blocks.
      There are several main parts to this:
        1/ read_balance needs to check for bad blocks, and return not only
           the chosen device, but also how many good blocks are available
           there.
        2/ fix_read_error needs to avoid trying to read from bad blocks.
        3/ read submission must be ready to issue multiple reads to
           different devices as different bad blocks on different devices
           could mean that a single large read cannot be served by any one
           device, but can still be served by the array.
           This requires keeping count of the number of outstanding requests
           per bio.  This count is stored in 'bi_phys_segments'
        4/ retrying a read needs to also be ready to submit a smaller read
           and queue another request for the rest.
      
      This does not yet handle bad blocks when reading to perform resync,
      recovery, or check.
      
      'md_trim_bio' will also be used for RAID10, so put it in md.c and
      export it.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d2eb35ac
    • N
      md: Disable bad blocks and v0.90 metadata. · 9f2f3830
      NeilBrown 提交于
      v0.90 metadata cannot record bad blocks, so when loading metadata
      for such a device, set shift to -1.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      9f2f3830
    • N
      md: load/store badblock list from v1.x metadata · 2699b672
      NeilBrown 提交于
      Space must have been allocated when array was created.
      A feature flag is set when the badblock list is non-empty, to
      ensure old kernels don't load and trust the whole device.
      
      We only update the on-disk badblocklist when it has changed.
      If the badblocklist (or other metadata) is stored on a bad block, we
      don't cope very well.
      
      If metadata has no room for bad block, flag bad-blocks as disabled,
      and do the same for 0.90 metadata.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2699b672
    • N
      md/bad-block-log: add sysfs interface for accessing bad-block-log. · 16c791a5
      NeilBrown 提交于
      This can show the log (providing it fits in one page) and
      allows bad blocks to be 'acknowledged' meaning that they
      have safely been recorded in metadata.
      
      Clearing bad blocks is not allowed via sysfs (except for
      code testing).  A bad block can only be cleared when
      a write to the block succeeds.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
      16c791a5
    • N
      md: beginnings of bad block management. · 2230dfe4
      NeilBrown 提交于
      This the first step in allowing md to track bad-blocks per-device so
      that we can fail individual blocks rather than the whole device.
      
      This patch just adds a data structure for recording bad blocks, with
      routines to add, remove, search the list.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
      2230dfe4
    • N
      md: remove suspicious size_of() · a519b26d
      NeilBrown 提交于
      When calling bioset_create we pass the size of the front_pad as
         sizeof(mddev)
      which looks suspicious as mddev is a pointer and so it looks like a
      common mistake where
         sizeof(*mddev)
      was intended.
      The size is actually correct as we want to store a pointer in the
      front padding of the bios created by the bioset, so make the intent
      more explicit by using
         sizeof(mddev_t *)
      Reported-by: NZdenek Kabelac <zdenek.kabelac@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a519b26d
  2. 27 7月, 2011 5 次提交
  3. 28 6月, 2011 1 次提交
    • N
      md: avoid endless recovery loop when waiting for fail device to complete. · 4274215d
      NeilBrown 提交于
      If a device fails in a way that causes pending request to take a while
      to complete, md will not be able to immediately remove it from the
      array in remove_and_add_spares.
      It will then incorrectly look like a spare device and md will try to
      recover it even though it is failed.
      This leads to a recovery process starting and instantly aborting over
      and over again.
      
      We should check if the device is faulty before considering it to be a
      spare.  This will avoid trying to start a recovery that cannot
      proceed.
      
      This bug was introduced in 2.6.26 so that patch is suitable for any
      kernel since then.
      
      Cc: stable@kernel.org
      Reported-by: NJim Paradis <james.paradis@stratus.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      4274215d
  4. 09 6月, 2011 2 次提交
    • N
      md: check ->hot_remove_disk when removing disk · 01393f3d
      Namhyung Kim 提交于
      Check pers->hot_remove_disk instead of pers->hot_add_disk in slot_store()
      during disk removal. The linear personality only has ->hot_add_disk and
      no ->hot_remove_disk, so that removing disk in the array resulted to
      following kernel bug:
      
      $ sudo mdadm --create /dev/md0 --level=linear --raid-devices=4 /dev/loop[0-3]
      $ echo none | sudo tee /sys/block/md0/md/dev-loop2/slot
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: [<          (null)>]           (null)
       PGD c9f5d067 PUD 8575a067 PMD 0
       Oops: 0010 [#1] SMP
       CPU 2
       Modules linked in: linear loop bridge stp llc kvm_intel kvm asus_atk0110 sr_mod cdrom sg
      
       Pid: 10450, comm: tee Not tainted 3.0.0-rc1-leonard+ #173 System manufacturer System Product Name/P5G41TD-M PRO
       RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
       RSP: 0018:ffff880085757df0  EFLAGS: 00010282
       RAX: ffffffffa00168e0 RBX: ffff8800d1431800 RCX: 000000000000006e
       RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff88008543c000
       RBP: ffff880085757e48 R08: 0000000000000002 R09: 000000000000000a
       R10: 0000000000000000 R11: ffff88008543c2e0 R12: 00000000ffffffff
       R13: ffff8800b4641000 R14: 0000000000000005 R15: 0000000000000000
       FS:  00007fe8c9e05700(0000) GS:ffff88011fa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000000 CR3: 00000000b4502000 CR4: 00000000000406e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       Process tee (pid: 10450, threadinfo ffff880085756000, task ffff8800c9f08000)
       Stack:
        ffffffff8138496a ffff8800b4641000 ffff88008543c268 0000000000000000
        ffff8800b4641000 ffff88008543c000 ffff8800d1431868 ffffffff81a78a90
        ffff8800b4641000 ffff88008543c000 ffff8800d1431800 ffff880085757e98
       Call Trace:
        [<ffffffff8138496a>] ? slot_store+0xaa/0x265
        [<ffffffff81384bae>] rdev_attr_store+0x89/0xa8
        [<ffffffff8115a96a>] sysfs_write_file+0x108/0x144
        [<ffffffff81106b87>] vfs_write+0xb1/0x10d
        [<ffffffff8106e6c0>] ? trace_hardirqs_on_caller+0x111/0x135
        [<ffffffff81106cac>] sys_write+0x4d/0x77
        [<ffffffff814fe702>] system_call_fastpath+0x16/0x1b
       Code:  Bad RIP value.
       RIP  [<          (null)>]           (null)
        RSP <ffff880085757df0>
       CR2: 0000000000000000
       ---[ end trace ba5fc64319a826fb ]---
      Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      01393f3d
  5. 08 6月, 2011 5 次提交
  6. 11 5月, 2011 3 次提交
    • N
      md: allow resync_start to be set while an array is active. · b098636c
      NeilBrown 提交于
      The sysfs attribute 'resync_start' (known internally as recovery_cp),
      records where a resync is up to.  A value of 0 means the array is
      not known to be in-sync at all.  A value of MaxSector means the array
      is believed to be fully in-sync.
      
      When the size of member devices of an array (RAID1,RAID4/5/6) is
      increased, the array can be increased to match.  This process sets
      resync_start to the old end-of-device offset so that the new part of
      the array gets resynced.
      
      However with RAID1 (and RAID6) a resync is not technically necessary
      and may be undesirable.  So it would be good if the implied resync
      after the array is resized could be avoided.
      
      So: change 'resync_start' so the value can be changed while the array
      is active, and as a precaution only allow it to be changed while
      resync/recovery is 'frozen'.  Changing it once resync has started is
      not going to be useful anyway.
      
      This allows the array to be resized without a resync by:
        write 'frozen' to 'sync_action'
        write new size to 'component_size' (this will set resync_start)
        write 'none' to 'resync_start'
        write 'idle' to 'sync_action'.
      
      Also slightly improve some tests on recovery_cp when resizing
      raid1/raid5.  Now that an arbitrary value could be set we should be
      more careful in our tests.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b098636c
    • N
      md: reject a re-add request that cannot be honoured. · bedd86b7
      NeilBrown 提交于
      The 'add_new_disk' ioctl can be used to add a device either as a
      spare, or as an active disk that just needs to be resynced based on
      write-intent-bitmap information (re-add)
      
      Currently if a re-add is requested but fails we add as a spare
      instead.  This makes it impossible for user-space to check for
      failure.
      
      So change to require that a re-add attempt will either succeed or
      completely fail.  User-space can then decide what to do next.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      bedd86b7
    • N
      md: Fix race when creating a new md device. · b0140891
      NeilBrown 提交于
      There is a race when creating an md device by opening /dev/mdXX.
      
      If two processes do this at much the same time they will follow the
      call path
        __blkdev_get -> get_gendisk -> kobj_lookup
      
      The first will call
        -> md_probe -> md_alloc -> add_disk -> blk_register_region
      
      and the race happens when the second gets to kobj_lookup after
      add_disk has called blk_register_region but before it returns to
      md_alloc.
      
      In the case the second will not call md_probe (as the probe is already
      done) but will get a handle on the gendisk, return to __blkdev_get
      which will then call md_open (via the ->open) pointer.
      
      As mddev->gendisk hasn't been set yet, md_open will think something is
      wrong an return with ERESTARTSYS.
      
      This can loop endlessly while the first thread makes no progress
      through add_disk.  Nothing is blocking it, but due to scheduler
      behaviour it doesn't get a turn.
      So this is essentially a live-lock.
      
      We fix this by simply moving the assignment to mddev->gendisk before
      the call the add_disk() so md_open doesn't get confused.
      Also move blk_queue_flush earlier because add_disk should be as late
      as possible.
      
      To make sure that md_open doesn't complete until md_alloc has done all
      that is needed, we take mddev->open_mutex during the last part of
      md_alloc.  md_open will wait for this.
      
      This can cause a lock-up on boot so Cc:ing for stable.
      For 2.6.36 and earlier a different patch will be needed as the
      'blk_queue_flush' call isn't there.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reported-by: NThomas Jarosch <thomas.jarosch@intra2net.com>
      Tested-by: NThomas Jarosch <thomas.jarosch@intra2net.com>
      Cc: stable@kernel.org
      b0140891
  7. 20 4月, 2011 1 次提交
  8. 18 4月, 2011 2 次提交
    • N
      md: provide generic support for handling unplug callbacks. · 97658cdd
      NeilBrown 提交于
      When an md device adds a request to a queue, it can call
      mddev_check_plugged.
      If this succeeds then we know that the md thread will be woken up
      shortly, and ->plug_cnt will be non-zero until then, so some
      processing can be delayed.
      
      If it fails, then no unplug callback is expected and the make_request
      function needs to do whatever is required to make the request happen.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      97658cdd
    • N
      md - remove old plugging code. · 482c0834
      NeilBrown 提交于
      md has some plugging infrastructure for RAID5 to use because the
      normal plugging infrastructure required a 'request_queue', and when
      called from dm, RAID5 doesn't have one of those available.
      
      This relied on the ->unplug_fn callback which doesn't exist any more.
      
      So remove all of that code, both in md and raid5.  Subsequent patches
      with restore the plugging functionality.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      482c0834
  9. 31 3月, 2011 1 次提交
  10. 29 3月, 2011 1 次提交
  11. 17 3月, 2011 1 次提交
  12. 10 3月, 2011 2 次提交
    • J
      block: kill off REQ_UNPLUG · 721a9602
      Jens Axboe 提交于
      With the plugging now being explicitly controlled by the
      submitter, callers need not pass down unplugging hints
      to the block layer. If they want to unplug, it's because they
      manually plugged on their own - in which case, they should just
      unplug at will.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      721a9602
    • J
      block: remove per-queue plugging · 7eaceacc
      Jens Axboe 提交于
      Code has been converted over to the new explicit on-stack plugging,
      and delay users have been converted to use the new API for that.
      So lets kill off the old plugging along with aops->sync_page().
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      7eaceacc
  13. 24 2月, 2011 1 次提交
    • N
      md: Fix - again - partition detection when array becomes active · f0b4f7e2
      NeilBrown 提交于
      Revert
          b821eaa5
      and
          f3b99be1
      
      When I wrote the first of these I had a wrong idea about the
      lifetime of 'struct block_device'.  It can disappear at any time that
      the block device is not open if it falls out of the inode cache.
      
      So relying on the 'size' recorded with it to detect when the
      device size has changed and so we need to revalidate, is wrong.
      
      Rather, we really do need the 'changed' attribute stored directly in
      the mddev and set/tested as appropriate.
      
      Without this patch, a sequence of:
         mknod / open / close / unlink
      
      (which can cause a block_device to be created and then destroyed)
      will result in a rescan of the partition table and consequence removal
      and addition of partitions.
      Several of these in a row can get udev racing to create and unlink and
      other code can get confused.
      
      With the patch, the rescan is only performed when needed and so there
      are no races.
      
      This is suitable for any stable kernel from 2.6.35.
      Reported-by: N"Wojcik, Krzysztof" <krzysztof.wojcik@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
      f0b4f7e2
  14. 16 2月, 2011 2 次提交
    • N
      md: correctly handle probe of an 'mdp' device. · 8f5f02c4
      NeilBrown 提交于
      'mdp' devices are md devices with preallocated device numbers
      for partitions. As such it is possible to mknod and open a partition
      before opening the whole device.
      
      this causes  md_probe() to be called with a device number of a
      partition, which in-turn calls mddev_find with such a number.
      
      However mddev_find expects the number of a 'whole device' and
      does the wrong thing with partition numbers.
      
      So add code to mddev_find to remove the 'partition' part of
      a device number and just work with the 'whole device'.
      
      This patch addresses https://bugzilla.kernel.org/show_bug.cgi?id=28652
      
      Reported-by: hkmaly@bigfoot.com
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Cc: <stable@kernel.org>
      8f5f02c4
    • N
      md: don't set_capacity before array is active. · cbe6ef1d
      NeilBrown 提交于
      If the desired size of an array is set (via sysfs) before the array is
      active (which is the normal sequence), we currrently call set_capacity
      immediately.
      This means that a subsequent 'open' (as can be caused by some
      udev-triggers program) will notice the new size and try to probe for
      partitions.  However as the array isn't quite ready yet the read will
      fail.  Then when the array is read, as the size doesn't change again
      we don't try to re-probe.
      
      So when setting array size via sysfs, only call set_capacity if the
      array is already active.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      cbe6ef1d
  15. 08 2月, 2011 1 次提交
    • C
      md_make_request: don't touch the bio after calling make_request · e91ece55
      Chris Mason 提交于
      md_make_request was calling bio_sectors() for part_stat_add
      after it was calling the make_request function.  This is
      bad because the make_request function can free the bio and
      because the bi_size field can change around.
      
      The fix here was suggested by Jens Axboe.  It saves the
      sector count before the make_request call.  I hit this
      with CONFIG_DEBUG_PAGEALLOC turned on while trying to break
      his pretty fusionio card.
      
      Cc: <stable@kernel.org>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      e91ece55
  16. 02 2月, 2011 1 次提交
  17. 31 1月, 2011 4 次提交
    • N
      md: don't clear curr_resync_completed at end of resync. · 7281f812
      NeilBrown 提交于
      There is no need to set this to zero at this point.  It will be
      set to zero by remove_and_add_spares or at the start of
      md_do_sync at the latest.
      And setting it to zero before MD_RECOVERY_RUNNING is cleared can
      make a 'zero' appear briefly in the 'sync_completed' sysfs attribute
      just as resync is finishing.
      
      So simply remove this setting to zero.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      7281f812
    • N
      md: Don't use remove_and_add_spares to remove failed devices from a read-only array · a8c42c7f
      NeilBrown 提交于
      remove_and_add_spares is called in two places where the needs really
      are very different.
      remove_and_add_spares should not be called on an array which is about
      to be reshaped as some extra devices might have been manually added
      and that would remove them.  However if the array is 'read-auto',
      that will currently happen, which is bad.
      
      So in the 'ro != 0' case don't call remove_and_add_spares but simply
      remove the failed devices as the comment suggests is needed.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a8c42c7f
    • N
      md: Remove the AllReserved flag for component devices. · f21e9ff7
      NeilBrown 提交于
      This flag is not needed and is used badly.
      
      Devices that are included in a native-metadata array are reserved
      exclusively for that array - and currently have AllReserved set.
      They all are bd_claimed for the rdev and so cannot be shared.
      
      Devices that are included in external-metadata arrays can be shared
      among multiple arrays - providing there is no overlap.
      These are bd_claimed for md in general - not for a particular rdev.
      
      When changing the amount of a device that is used in an array we need
      to check for overlap.  This currently includes a check on AllReserved
      So even without overlap, sharing with an AllReserved device is not
      allowed.
      However the bd_claim usage already precludes sharing with these
      devices, so the test on AllReserved is not needed.  And in fact it is
      wrong.
      
      As this is the only use of AllReserved, simply remove all usage and
      definition of AllReserved.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      f21e9ff7
    • N
      md: revert change to raid_disks on failure. · de171cb9
      NeilBrown 提交于
      If we try to update_raid_disks and it fails, we should put
      'delta_disks' back to zero.  This is important because some code,
      such as slot_store, assumes that delta_disks has been validated.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      de171cb9
  18. 25 1月, 2011 1 次提交