1. 09 Jun, 2011 2 commits
    • md: check ->hot_remove_disk when removing disk · 01393f3d
      Committed by Namhyung Kim
      Check pers->hot_remove_disk instead of pers->hot_add_disk in slot_store()
      during disk removal. The linear personality only has ->hot_add_disk and
      no ->hot_remove_disk, so removing a disk from the array resulted in the
      following kernel bug:
      
      $ sudo mdadm --create /dev/md0 --level=linear --raid-devices=4 /dev/loop[0-3]
      $ echo none | sudo tee /sys/block/md0/md/dev-loop2/slot
       BUG: unable to handle kernel NULL pointer dereference at           (null)
       IP: [<          (null)>]           (null)
       PGD c9f5d067 PUD 8575a067 PMD 0
       Oops: 0010 [#1] SMP
       CPU 2
       Modules linked in: linear loop bridge stp llc kvm_intel kvm asus_atk0110 sr_mod cdrom sg
      
       Pid: 10450, comm: tee Not tainted 3.0.0-rc1-leonard+ #173 System manufacturer System Product Name/P5G41TD-M PRO
       RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
       RSP: 0018:ffff880085757df0  EFLAGS: 00010282
       RAX: ffffffffa00168e0 RBX: ffff8800d1431800 RCX: 000000000000006e
       RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff88008543c000
       RBP: ffff880085757e48 R08: 0000000000000002 R09: 000000000000000a
       R10: 0000000000000000 R11: ffff88008543c2e0 R12: 00000000ffffffff
       R13: ffff8800b4641000 R14: 0000000000000005 R15: 0000000000000000
       FS:  00007fe8c9e05700(0000) GS:ffff88011fa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000000 CR3: 00000000b4502000 CR4: 00000000000406e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       Process tee (pid: 10450, threadinfo ffff880085756000, task ffff8800c9f08000)
       Stack:
        ffffffff8138496a ffff8800b4641000 ffff88008543c268 0000000000000000
        ffff8800b4641000 ffff88008543c000 ffff8800d1431868 ffffffff81a78a90
        ffff8800b4641000 ffff88008543c000 ffff8800d1431800 ffff880085757e98
       Call Trace:
        [<ffffffff8138496a>] ? slot_store+0xaa/0x265
        [<ffffffff81384bae>] rdev_attr_store+0x89/0xa8
        [<ffffffff8115a96a>] sysfs_write_file+0x108/0x144
        [<ffffffff81106b87>] vfs_write+0xb1/0x10d
        [<ffffffff8106e6c0>] ? trace_hardirqs_on_caller+0x111/0x135
        [<ffffffff81106cac>] sys_write+0x4d/0x77
        [<ffffffff814fe702>] system_call_fastpath+0x16/0x1b
       Code:  Bad RIP value.
       RIP  [<          (null)>]           (null)
        RSP <ffff880085757df0>
       CR2: 0000000000000000
       ---[ end trace ba5fc64319a826fb ]---
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
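The check described in this commit can be sketched in user-space C. This is only an illustration under stated assumptions: `struct personality`, `remove_from_slot` and the two sample tables are hypothetical stand-ins, not the kernel's definitions. The point is that the pointer being tested must be the one that is subsequently called.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an md personality's operation table. */
struct personality {
    const char *name;
    int (*hot_add_disk)(int slot);
    int (*hot_remove_disk)(int slot);   /* may be NULL, as for "linear" */
};

static int sample_add(int slot)    { (void)slot; return 0; }
static int sample_remove(int slot) { (void)slot; return 0; }

/* "linear" provides hot_add_disk only; "raid1" provides both. */
const struct personality linear_pers = { "linear", sample_add, NULL };
const struct personality raid1_pers  = { "raid1",  sample_add, sample_remove };

/* Remove the disk in `slot`, testing the pointer that is actually called. */
int remove_from_slot(const struct personality *pers, int slot)
{
    if (pers->hot_remove_disk == NULL)
        return -EINVAL;                 /* removal is not supported */
    return pers->hot_remove_disk(slot);
}
```

With these sample tables, removal on `linear_pers` is rejected with `-EINVAL` instead of jumping through a NULL pointer, while `raid1_pers` proceeds to its handler.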
  2. 08 Jun, 2011 5 commits
  3. 11 May, 2011 3 commits
    • md: allow resync_start to be set while an array is active. · b098636c
      Committed by NeilBrown
      The sysfs attribute 'resync_start' (known internally as recovery_cp)
      records where a resync is up to.  A value of 0 means the array is
      not known to be in-sync at all.  A value of MaxSector means the array
      is believed to be fully in-sync.
      
      When the size of member devices of an array (RAID1,RAID4/5/6) is
      increased, the array can be increased to match.  This process sets
      resync_start to the old end-of-device offset so that the new part of
      the array gets resynced.
      
      However with RAID1 (and RAID6) a resync is not technically necessary
      and may be undesirable.  So it would be good if the implied resync
      after the array is resized could be avoided.
      
      So: change 'resync_start' so the value can be changed while the array
      is active, and as a precaution only allow it to be changed while
      resync/recovery is 'frozen'.  Changing it once resync has started is
      not going to be useful anyway.
      
      This allows the array to be resized without a resync by:
        write 'frozen' to 'sync_action'
        write new size to 'component_size' (this will set resync_start)
        write 'none' to 'resync_start'
        write 'idle' to 'sync_action'.
      
      Also slightly improve some tests on recovery_cp when resizing
      raid1/raid5.  Now that an arbitrary value could be set we should be
      more careful in our tests.
      Signed-off-by: NeilBrown <neilb@suse.de>
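The four-step resize sequence listed in this commit message could be driven from user space roughly as follows. This is a minimal sketch: `write_attr` and `resize_without_resync` are hypothetical helper names, and the caller is assumed to pass the array's sysfs directory (for example `/sys/block/md0/md`) and a size string in whatever form `component_size` expects.

```c
#include <stdio.h>

/* Write `value` to the attribute file <md_dir>/<attr>.
 * Returns 0 on success and -1 on any error. */
int write_attr(const char *md_dir, const char *attr, const char *value)
{
    char path[512];
    FILE *f;
    int n, ok;

    n = snprintf(path, sizeof(path), "%s/%s", md_dir, attr);
    if (n < 0 || n >= (int)sizeof(path))
        return -1;                      /* path did not fit */
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)                 /* write errors may surface on close */
        ok = 0;
    return ok ? 0 : -1;
}

/* Apply the four steps in order, stopping at the first failure. */
int resize_without_resync(const char *md_dir, const char *new_component_size)
{
    if (write_attr(md_dir, "sync_action", "frozen") != 0)
        return -1;
    if (write_attr(md_dir, "component_size", new_component_size) != 0)
        return -1;
    if (write_attr(md_dir, "resync_start", "none") != 0)
        return -1;
    return write_attr(md_dir, "sync_action", "idle");
}
```

Stopping at the first failed write matters here: if freezing fails, the later writes should not be attempted. The sketch does not try to undo a partially applied sequence.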
    • md: reject a re-add request that cannot be honoured. · bedd86b7
      Committed by NeilBrown
      The 'add_new_disk' ioctl can be used to add a device either as a
      spare, or as an active disk that just needs to be resynced based on
      write-intent-bitmap information (re-add).
      
      Currently if a re-add is requested but fails we add as a spare
      instead.  This makes it impossible for user-space to check for
      failure.
      
      So change to require that a re-add attempt will either succeed or
      completely fail.  User-space can then decide what to do next.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Fix race when creating a new md device. · b0140891
      Committed by NeilBrown
      There is a race when creating an md device by opening /dev/mdXX.
      
      If two processes do this at much the same time they will follow the
      call path
        __blkdev_get -> get_gendisk -> kobj_lookup
      
      The first will call
        -> md_probe -> md_alloc -> add_disk -> blk_register_region
      
      and the race happens when the second gets to kobj_lookup after
      add_disk has called blk_register_region but before it returns to
      md_alloc.
      
      In that case the second will not call md_probe (as the probe is already
      done) but will get a handle on the gendisk, return to __blkdev_get
      which will then call md_open (via the ->open pointer).
      
      As mddev->gendisk hasn't been set yet, md_open will think something is
      wrong and return with ERESTARTSYS.
      
      This can loop endlessly while the first thread makes no progress
      through add_disk.  Nothing is blocking it, but due to scheduler
      behaviour it doesn't get a turn.
      So this is essentially a live-lock.
      
      We fix this by simply moving the assignment to mddev->gendisk before
      the call to add_disk() so md_open doesn't get confused.
      Also move blk_queue_flush earlier because add_disk should be as late
      as possible.
      
      To make sure that md_open doesn't complete until md_alloc has done all
      that is needed, we take mddev->open_mutex during the last part of
      md_alloc.  md_open will wait for this.
      
      This can cause a lock-up on boot so Cc:ing for stable.
      For 2.6.36 and earlier a different patch will be needed as the
      'blk_queue_flush' call isn't there.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
      Tested-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
      Cc: stable@kernel.org
  4. 20 Apr, 2011 1 commit
  5. 18 Apr, 2011 2 commits
    • md: provide generic support for handling unplug callbacks. · 97658cdd
      Committed by NeilBrown
      When an md device adds a request to a queue, it can call
      mddev_check_plugged.
      If this succeeds then we know that the md thread will be woken up
      shortly, and ->plug_cnt will be non-zero until then, so some
      processing can be delayed.
      
      If it fails, then no unplug callback is expected and the make_request
      function needs to do whatever is required to make the request happen.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md - remove old plugging code. · 482c0834
      Committed by NeilBrown
      md has some plugging infrastructure for RAID5 to use because the
      normal plugging infrastructure required a 'request_queue', and when
      called from dm, RAID5 doesn't have one of those available.
      
      This relied on the ->unplug_fn callback which doesn't exist any more.
      
      So remove all of that code, both in md and raid5.  Subsequent patches
      will restore the plugging functionality.
      Signed-off-by: NeilBrown <neilb@suse.de>
  6. 31 Mar, 2011 1 commit
  7. 29 Mar, 2011 1 commit
  8. 17 Mar, 2011 1 commit
  9. 10 Mar, 2011 2 commits
    • block: kill off REQ_UNPLUG · 721a9602
      Committed by Jens Axboe
      With the plugging now being explicitly controlled by the
      submitter, callers need not pass down unplugging hints
      to the block layer. If they want to unplug, it's because they
      manually plugged on their own - in which case, they should just
      unplug at will.
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: remove per-queue plugging · 7eaceacc
      Committed by Jens Axboe
      Code has been converted over to the new explicit on-stack plugging,
      and delay users have been converted to use the new API for that.
      So let's kill off the old plugging along with aops->sync_page().
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  10. 24 Feb, 2011 1 commit
    • md: Fix - again - partition detection when array becomes active · f0b4f7e2
      Committed by NeilBrown
      Revert
          b821eaa5
      and
          f3b99be1
      
      When I wrote the first of these I had a wrong idea about the
      lifetime of 'struct block_device'.  It can disappear at any time that
      the block device is not open if it falls out of the inode cache.
      
      So relying on the 'size' recorded with it to detect when the
      device size has changed and so we need to revalidate, is wrong.
      
      Rather, we really do need the 'changed' attribute stored directly in
      the mddev and set/tested as appropriate.
      
      Without this patch, a sequence of:
         mknod / open / close / unlink
      
      (which can cause a block_device to be created and then destroyed)
      will result in a rescan of the partition table and consequent removal
      and addition of partitions.
      Several of these in a row can get udev racing to create and unlink and
      other code can get confused.
      
      With the patch, the rescan is only performed when needed and so there
      are no races.
      
      This is suitable for any stable kernel from 2.6.35.
      Reported-by: "Wojcik, Krzysztof" <krzysztof.wojcik@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
  11. 16 Feb, 2011 2 commits
    • md: correctly handle probe of an 'mdp' device. · 8f5f02c4
      Committed by NeilBrown
      'mdp' devices are md devices with preallocated device numbers
      for partitions. As such it is possible to mknod and open a partition
      before opening the whole device.
      
      This causes md_probe() to be called with the device number of a
      partition, which in turn calls mddev_find with that number.
      
      However mddev_find expects the number of a 'whole device' and
      does the wrong thing with partition numbers.
      
      So add code to mddev_find to remove the 'partition' part of
      a device number and just work with the 'whole device'.
      
      This patch addresses https://bugzilla.kernel.org/show_bug.cgi?id=28652
      
      Reported-by: hkmaly@bigfoot.com
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: <stable@kernel.org>
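Stripping the partition part of a device number, as this commit describes, amounts to masking off the low minor bits. The sketch below is a user-space illustration, not the kernel code: the constants (20 minor bits, major 9 for non-partitionable md devices, 6 partition bits per mdp device) and all helper names are assumptions chosen for the example.

```c
#include <stdint.h>

/* Assumed constants for illustration; the kernel's own definitions may differ. */
#define MINOR_BITS       20u   /* minor-number bits in a device number      */
#define MD_MAJOR_NUM     9u    /* major of non-partitionable md devices     */
#define MDP_MINOR_SHIFT  6u    /* minor bits reserved for mdp partitions    */

static inline uint32_t mk_dev(uint32_t major, uint32_t minor)
{
    return (major << MINOR_BITS) | minor;
}

static inline uint32_t dev_major(uint32_t dev)
{
    return dev >> MINOR_BITS;
}

/* Map any md device number to the number of its whole device. */
uint32_t whole_device(uint32_t dev)
{
    if (dev_major(dev) != MD_MAJOR_NUM)
        dev &= ~((1u << MDP_MINOR_SHIFT) - 1u);   /* clear the partition bits */
    return dev;
}
```

Under these assumptions, a partition with minor 67 on a hypothetical mdp major resolves to the whole device at minor 64, while plain md devices pass through unchanged.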
    • md: don't set_capacity before array is active. · cbe6ef1d
      Committed by NeilBrown
      If the desired size of an array is set (via sysfs) before the array is
      active (which is the normal sequence), we currently call set_capacity
      immediately.
      This means that a subsequent 'open' (as can be caused by some
      udev-triggered program) will notice the new size and try to probe for
      partitions.  However as the array isn't quite ready yet the read will
      fail.  Then when the array is ready, as the size doesn't change again
      we don't try to re-probe.
      
      So when setting array size via sysfs, only call set_capacity if the
      array is already active.
      Signed-off-by: NeilBrown <neilb@suse.de>
  12. 08 Feb, 2011 1 commit
    • md_make_request: don't touch the bio after calling make_request · e91ece55
      Committed by Chris Mason
      md_make_request was calling bio_sectors() for part_stat_add
      after it was calling the make_request function.  This is
      bad because the make_request function can free the bio and
      because the bi_size field can change around.
      
      The fix here was suggested by Jens Axboe.  It saves the
      sector count before the make_request call.  I hit this
      with CONFIG_DEBUG_PAGEALLOC turned on while trying to break
      his pretty fusionio card.
      
      Cc: <stable@kernel.org>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
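The ordering rule in this commit (read what you need from the request before handing it off) can be shown with a small user-space sketch. `struct bio` here is a hypothetical, much-reduced stand-in, and `submit`, `consume_request` and `sectors_accounted` are names invented for the example.

```c
#include <stdlib.h>

/* Hypothetical, much-reduced stand-in for the kernel's struct bio. */
struct bio {
    unsigned int size_bytes;
};

/* Stand-in for the per-partition sector statistic. */
unsigned long sectors_accounted;

/* A handler that takes ownership of the bio and frees it. */
void consume_request(struct bio *bio)
{
    free(bio);
}

void submit(struct bio *bio, void (*make_request)(struct bio *))
{
    /* Save the sector count first (512-byte sectors) ... */
    unsigned int sectors = bio->size_bytes >> 9;

    /* ... because the handler may free or modify the bio. */
    make_request(bio);

    /* Only the saved copy is used from here on. */
    sectors_accounted += sectors;
}
```

Reading `bio->size_bytes` after `make_request()` returns would be a use-after-free with `consume_request`, which is the class of bug the commit message reports hitting under CONFIG_DEBUG_PAGEALLOC.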
  13. 02 Feb, 2011 1 commit
  14. 31 Jan, 2011 4 commits
    • md: don't clear curr_resync_completed at end of resync. · 7281f812
      Committed by NeilBrown
      There is no need to set this to zero at this point.  It will be
      set to zero by remove_and_add_spares or at the start of
      md_do_sync at the latest.
      And setting it to zero before MD_RECOVERY_RUNNING is cleared can
      make a 'zero' appear briefly in the 'sync_completed' sysfs attribute
      just as resync is finishing.
      
      So simply remove this setting to zero.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Don't use remove_and_add_spares to remove failed devices from a read-only array · a8c42c7f
      Committed by NeilBrown
      remove_and_add_spares is called in two places where the needs really
      are very different.
      remove_and_add_spares should not be called on an array which is about
      to be reshaped as some extra devices might have been manually added
      and that would remove them.  However if the array is 'read-auto',
      that will currently happen, which is bad.
      
      So in the 'ro != 0' case don't call remove_and_add_spares but simply
      remove the failed devices as the comment suggests is needed.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Remove the AllReserved flag for component devices. · f21e9ff7
      Committed by NeilBrown
      This flag is not needed and is used badly.
      
      Devices that are included in a native-metadata array are reserved
      exclusively for that array - and currently have AllReserved set.
      They all are bd_claimed for the rdev and so cannot be shared.
      
      Devices that are included in external-metadata arrays can be shared
      among multiple arrays - providing there is no overlap.
      These are bd_claimed for md in general - not for a particular rdev.
      
      When changing the amount of a device that is used in an array we need
      to check for overlap.  This currently includes a check on AllReserved.
      So even without overlap, sharing with an AllReserved device is not
      allowed.
      However the bd_claim usage already precludes sharing with these
      devices, so the test on AllReserved is not needed.  And in fact it is
      wrong.
      
      As this is the only use of AllReserved, simply remove all usage and
      definition of AllReserved.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: revert change to raid_disks on failure. · de171cb9
      Committed by NeilBrown
      If we try to update_raid_disks and it fails, we should put
      'delta_disks' back to zero.  This is important because some code,
      such as slot_store, assumes that delta_disks has been validated.
      Signed-off-by: NeilBrown <neilb@suse.de>
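The "put it back on failure" pattern from this commit looks roughly like the following. It is a sketch only: `struct array_state` and the validation rule in `check_reshape` (at least two disks) are hypothetical, and the sketch does not model what happens after a successful check.

```c
#include <errno.h>

/* Hypothetical reduced array state. */
struct array_state {
    int raid_disks;    /* current number of disks      */
    int delta_disks;   /* pending change, 0 when none  */
};

/* Hypothetical validator: accept only results of at least two disks. */
static int check_reshape(const struct array_state *a)
{
    return (a->raid_disks + a->delta_disks >= 2) ? 0 : -EINVAL;
}

int update_raid_disks(struct array_state *a, int new_raid_disks)
{
    int rv;

    a->delta_disks = new_raid_disks - a->raid_disks;   /* tentative value */
    rv = check_reshape(a);
    if (rv < 0)
        a->delta_disks = 0;    /* undo, so no unvalidated delta is left behind */
    return rv;
}
```

After a rejected request the state is exactly as it was before the call, so code that later reads `delta_disks` never sees a value that failed validation.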
  15. 25 Jan, 2011 1 commit
  16. 15 Jan, 2011 1 commit
    • block: restore multiple bd_link_disk_holder() support · 49731baa
      Committed by Tejun Heo
      Commit e09b457b (block: simplify holder symlink handling) incorrectly
      assumed that there is only one link at maximum.  dm may use multiple
      links and expects block layer to track reference count for each link,
      which is different from and unrelated to the exclusive device holder
      identified by @holder when the device is opened.
      
      Remove the single holder assumption and automatic removal of the link
      and revive the per-link reference count tracking.  The code
      essentially behaves the same as before commit e09b457b sans the
      unnecessary kobject reference count dancing.
      
      While at it, note that this facility should not be used by anyone other
      than its current users.  Sysfs symlinks shouldn't be abused like this
      and the whole thing doesn't belong in the block layer at all.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Milan Broz <mbroz@redhat.com>
      Cc: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: linux-raid@vger.kernel.org
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  17. 14 Jan, 2011 11 commits
    • md: Fix removal of extra drives when converting RAID6 to RAID5 · bf2cb0da
      Committed by NeilBrown
      When a RAID6 is converted to a RAID5, the extra drive should
      be discarded.  However it isn't due to a typo in a comparison.
      
      This bug was introduced in commit e93f68a1 in 2.6.35-rc4
      and the fix is suitable for any -stable since then.
      
      As the extra drive is not removed, the 'degraded' counter is wrong and
      so the RAID5 will not respond correctly to a subsequent failure.
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: range check slot number when manually adding a spare. · ba1b41b6
      Committed by NeilBrown
      When adding a spare to an active array, we should check the slot
      number, but allow it to be larger than raid_disks if a reshape
      is being prepared.
      
      Apply the same test when adding a device to an
      array-under-construction.  It already had most of the test in place,
      but not quite all.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: fix sync_completed reporting for very large drives (>2TB) · 13ae864b
      Committed by Rémi Rérolle
      The values exported in the sync_completed file are unsigned long, which
      overflows with very large drives, resulting in wrong values reported.
      
      Since sync_completed uses sectors as unit, we'll start getting wrong
      values with components larger than 2TB.
      
      This patch simply replaces the use of unsigned long by unsigned long long.
      Signed-off-by: Rémi Rérolle <rrerolle@lacie.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
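The 2TB threshold follows from the arithmetic: where `unsigned long` is 32 bits wide, a count of 512-byte sectors wraps at 2^32 sectors, which is 2 TiB. The sketch below imitates that truncation with an explicit `uint32_t` so that it behaves the same on any platform; both function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Format a sector count through a 32-bit intermediate, as a 32-bit
 * `unsigned long` would hold it; the high bits are silently dropped. */
int format_sectors_32(char *buf, size_t len, uint64_t sectors)
{
    uint32_t truncated = (uint32_t)sectors;
    return snprintf(buf, len, "%lu", (unsigned long)truncated);
}

/* Format the same count at full width, as the fix does by switching
 * to unsigned long long. */
int format_sectors_64(char *buf, size_t len, uint64_t sectors)
{
    return snprintf(buf, len, "%llu", (unsigned long long)sectors);
}
```

For a position eight sectors past the 2 TiB mark, the 32-bit path formats the value as 8, whereas the 64-bit path formats the true count of 4294967304 sectors.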
    • md: allow suspend_lo and suspend_hi to decrease as well as increase. · 23ddff37
      Committed by NeilBrown
      The sysfs attributes 'suspend_lo' and 'suspend_hi' describe a region
      to which reads/writes are suspended so that the underlying data can be
      manipulated without user-space noticing.
      Currently the window they describe can only move forwards along the
      device.  However this is an unnecessary restriction which will cause
      problems with planned developments.
      So relax this restriction and allow these endpoints to move
      arbitrarily.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Don't let implementation detail of curr_resync leak out through sysfs. · 75d3da43
      Committed by NeilBrown
      mddev->curr_resync has artificial values of '1' and '2' which are used
      by the code which ensures only one resync is happening at a time on
      any given device.
      
      These values are internal and should never be exposed to user-space
      (except when translated appropriately as in the 'pending' status in
      /proc/mdstat).
      
      Unfortunately they are exposed, as ->curr_resync is assigned to
      ->curr_resync_completed and that value is directly visible through
      sysfs.
      
      So change the assignments to ->curr_resync_completed to get the same
      value from elsewhere in a form that doesn't have the magic '1' or '2'
      values.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: separate meta and data devs · a6ff7e08
      Committed by Jonathan Brassow
      Allow the metadata to be on a separate device from the
      data.
      
      This doesn't mean the data and metadata will be on separate
      physical devices - it simply gives device-mapper and userspace
      tools more flexibility.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md-new-param-to_sync_page_io · ccebd4c4
      Committed by Jonathan Brassow
      Add new parameter to 'sync_page_io'.
      
      The new parameter allows us to distinguish between metadata and data
      operations.  This becomes important later when we add the ability to
      use separate devices for data and metadata.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
    • md-new-param-to-calc_dev_sboffset · 57b2caa3
      Committed by Jonathan Brassow
      When we allow for separate devices for data and metadata
      in a later patch, we will need to be able to calculate
      the superblock offset based on more than the bdev.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
    • md: Be more careful about clearing flags bit in ->recovery · 7ebc0be7
      Committed by NeilBrown
      Setting ->recovery to 0 is generally not a good idea as it could clear
      bits that shouldn't be cleared.  In particular, MD_RECOVERY_FROZEN
      should only be cleared on explicit request from user-space.
      
      So when we need to clear things, just clear the bits that need
      clearing.
      
      As there are a few different places which reap a resync process - and
      some do an incomplete job - factor out the code for doing that from
      md_check_recovery and call that function instead of open coding part
      of it.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reported-by: Jonathan Brassow <jbrassow@redhat.com>
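The difference between zeroing the whole flags word and clearing selected bits can be sketched as below. The flag names appear in the md code, but the mask values here are illustrative assumptions (the kernel treats them as bit numbers), and both `reap_*` functions are hypothetical.

```c
/* Illustrative mask values; only the names are taken from md. */
#define MD_RECOVERY_RUNNING (1UL << 0)
#define MD_RECOVERY_SYNC    (1UL << 1)
#define MD_RECOVERY_DONE    (1UL << 2)
#define MD_RECOVERY_FROZEN  (1UL << 3)

/* Old style: wipe everything, including bits owned by user-space. */
unsigned long reap_clear_all(unsigned long recovery)
{
    (void)recovery;
    return 0;
}

/* New style: clear only the bits that describe the finished resync. */
unsigned long reap_clear_bits(unsigned long recovery)
{
    return recovery & ~(MD_RECOVERY_RUNNING | MD_RECOVERY_SYNC | MD_RECOVERY_DONE);
}
```

Starting from a word with RUNNING, SYNC and FROZEN set, `reap_clear_bits` leaves FROZEN in place, whereas `reap_clear_all` would drop a freeze that user-space had requested explicitly.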
    • md: md_stop_writes requires mddev_lock. · defad61a
      Committed by NeilBrown
      As md_stop_writes manipulates the sync_thread and calls md_update_sb,
      it needs to be called with mddev_lock held.
      
      In all internal cases it is, but the symbol is exported for dm-raid to
      call and in that case the lock won't be held.
      So make an exported version which takes the lock, and an internal
      version which does not.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Ensure no IO request to get md device before it is properly initialised. · 0ca69886
      Committed by NeilBrown
      When an md device is in the process of coming on line it is possible
      for an IO request (typically a partition table probe) to get through
      before the array is fully initialised, which can cause unexpected
      behaviour (e.g. a crash).
      
      So explicitly record when the array is ready for IO and don't allow IO
      through until then.
      
      There is no possibility for a similar problem when the array is going
      off-line as there must only be one 'open' at that time, and it is busy
      off-lining the array and so cannot send IO requests.  So no memory
      barrier is needed in md_stop()
      
      This has been a bug since commit 409c57f3 in 2.6.30 which
      introduced md_make_request.  Before then, each personality would
      register its own make_request_fn when it was ready.
      This is suitable for any stable kernel from 2.6.30.y onwards.
      
      Cc: <stable@kernel.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
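Recording readiness explicitly and refusing IO until then can be sketched as a simple gate. This is a single-threaded illustration with hypothetical names (`struct array`, `submit_io`, `array_start`, `array_stop`); it deliberately ignores the memory-ordering concerns that the real multi-CPU code has to handle.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduced array object. */
struct array {
    bool ready;      /* set only once initialisation has fully completed */
    int  requests;   /* number of requests passed on to the array        */
};

/* Fail early rather than touch a half-initialised array. */
int submit_io(struct array *a)
{
    if (a == NULL || !a->ready)
        return -EIO;
    a->requests++;
    return 0;
}

/* Mark the array usable as the very last step of bringing it up. */
void array_start(struct array *a)
{
    a->requests = 0;
    a->ready = true;
}

void array_stop(struct array *a)
{
    a->ready = false;
}
```

A request that arrives before `array_start()` has finished, such as an early partition table probe, gets an IO error instead of reaching uninitialised state.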