1. 09 Oct, 2008 3 commits
    • block: move stats from disk to part0 · 074a7aca
      Tejun Heo authored
      Move stats related fields - stamp, in_flight, dkstats - from disk to
      part0 and unify stat handling such that...
      
      * part_stat_*() now updates part0 together if the specified partition
        is not part0.  ie. part_stat_*() are now essentially all_stat_*().
      
      * {disk|all}_stat_*() are gone.
      
      * part_round_stats() is updated similarly.  It handles part0 stats
        automatically and disk_round_stats() is killed.
      
      * part_{inc|dec}_in_flight() is implemented which automatically updates
        part0 stats for parts other than part0.
      
      * disk_map_sector_rcu() is updated to return part0 if no part matches.
        Combined with the above changes, this makes NULL special case
        handling in callers unnecessary.
      
      * Separate stats show code paths for disk are collapsed into part
        stats show code paths.
      
      * Rename disk_stat_lock/unlock() to part_stat_lock/unlock()
      
      While at it, reposition stat handling macros a bit and add missing
      parentheses around macro parameters.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
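      The part0 mirroring described in this commit can be illustrated with a
      minimal userspace sketch. It is not the kernel implementation: the
      struct part_model type and the model_* helpers are hypothetical names
      made up for illustration, and per-CPU counters and locking are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of a partition; not the kernel's hd_struct. */
struct part_model {
    int partno;                /* 0 is part0, the whole-disk partition */
    unsigned long ios;         /* example stat counter */
    int in_flight;             /* requests currently in flight */
    struct part_model *part0;  /* part0 of the owning disk; part0 points at itself */
};

/* Add n to the counter of `part`, and to part0 too unless `part` is part0. */
static void model_stat_add_ios(struct part_model *part, unsigned long n)
{
    assert(part != NULL && part->part0 != NULL);
    part->ios += n;
    if (part->partno != 0)
        part->part0->ios += n;
}

/* Adjust in_flight on `part` and mirror the change onto part0 for other parts. */
static void model_add_in_flight(struct part_model *part, int delta)
{
    assert(part != NULL && part->part0 != NULL);
    part->in_flight += delta;
    if (part->partno != 0)
        part->part0->in_flight += delta;
}
```

      In this model a caller only ever updates the partition it was handed,
      and the whole-disk totals follow without a separate disk-level call.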
    • block: always set bdev->bd_part · 0762b8bd
      Tejun Heo authored
      Until now, bdev->bd_part was set only if the bdev was for parts other
      than part0.  This patch makes bdev->bd_part always set so that code
      paths don't have to differentiate in common handling.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: implement and use {disk|part}_to_dev() · ed9e1982
      Tejun Heo authored
      Implement {disk|part}_to_dev() and use them to access generic device
      instead of directly dereferencing {disk|part}->dev.  To make sure no
      user is left behind, rename the generic device fields to __dev.
      
      This is in preparation of unifying partition 0 handling with other
      partitions.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  2. 19 Sep, 2008 1 commit
    • md: Don't wait UNINTERRUPTIBLE for other resync to finish · 9744197c
      NeilBrown authored
      When two md arrays share some block device (e.g. each uses different
      partitions on the one device), a resync of one array will wait for
      the resync on the other to finish.
      
      This can be a long time and as it currently waits TASK_UNINTERRUPTIBLE,
      the softlockup code notices and complains.
      
      So use TASK_INTERRUPTIBLE instead and make sure to flush signals
      before calling schedule.
      Signed-off-by: NeilBrown <neilb@suse.de>
  3. 01 Sep, 2008 1 commit
    • Remove invalidate_partition call from do_md_stop. · 271f5a9b
      NeilBrown authored
      When stopping an md array, or just switching to read-only, we
      currently call invalidate_partition while holding the mddev lock.
      The main reason for this is probably to ensure all dirty buffers
      are flushed (invalidate_partition calls fsync_bdev).
      
      However if any dirty buffers are found, it will almost certainly cause
      a deadlock as starting writeout will require an update to the
      superblock, and performing that update requires taking the mddev
      lock - which is already held.
      
      This deadlock can be demonstrated by running "reboot -f -n" with
      a root filesystem on md/raid, and some dirty buffers in memory.
      
      All other calls to stop an array should already happen after a flush.
      The normal sequence is to stop using the array (e.g. umount) which
      will cause __blkdev_put to call sync_blockdev.  Then open the
      array and issue the STOP_ARRAY ioctl while the buffers are all still
      clean.
      
      So this invalidate_partition is normally a no-op, except for one case
      where it will cause a deadlock.
      
      So remove it.
      
      This patch possibly addresses the regression recorded in
         http://bugzilla.kernel.org/show_bug.cgi?id=11460
      and
         http://bugzilla.kernel.org/show_bug.cgi?id=11452
      
      though it isn't yet clear how it ever worked.
      Signed-off-by: NeilBrown <neilb@suse.de>
  4. 08 Aug, 2008 1 commit
  5. 05 Aug, 2008 4 commits
    • Allow faulty devices to be removed from a readonly array. · c89a8eee
      NeilBrown authored
      Removing faulty devices from an array is a two stage process.
      First the device is moved from being a part of the active array
      to being similar to a spare device.  Then it can be removed
      by a request from user space.
      
      The first step is currently not performed for read-only arrays,
      so the second step can never succeed.
      
      So allow readonly arrays to remove failed devices (which aren't
      blocked).
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Fail safely when trying to grow an array with a write-intent bitmap. · dba034ee
      NeilBrown authored
      We cannot currently change the size of a write-intent bitmap.
      So if we change the size of an array which has such a bitmap, it
      tries to set bits beyond the end of the bitmap.
      
      For now, simply reject any request to change the size of an array
      which has a bitmap.  mdadm can remove the bitmap and add a new one
      after the array has changed size.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Restore force switch of md array to readonly at reboot time. · 2b25000b
      NeilBrown authored
      A recent patch allowed do_md_stop to know whether it was being called
      via an ioctl or not, and thus whether to allow for an extra open file
      descriptor when checking if it is in use.
      This broke the switch to readonly performed by the shutdown notifier,
      which needs to work even when the array is still (apparently) active
      (as md doesn't get told when the filesystem becomes readonly).
      
      So restore this feature by pretending that there can be lots of
      file descriptors open, but we still want do_md_stop to switch to
      readonly.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • Make writes to md/safe_mode_delay immediately effective. · 19052c0e
      NeilBrown authored
      If we reduce the 'safe_mode_delay', it could still wait for the old
      delay to completely expire before doing anything about safe_mode.
      Thus the effect of the change is delayed.
      
      To make the effect more immediate, run the timeout function
      immediately if the delay was reduced.  This may cause it to run
      slightly earlier than required, but that is the safer option.
      Signed-off-by: NeilBrown <neilb@suse.de>
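      The rule in this commit (run the timeout function right away when the
      delay is reduced) can be sketched in a small userspace model. The
      struct safemode_model type and the model_* functions are hypothetical
      and only count how often the timeout handler runs; no real timer or md
      state is involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the array state relevant to safe mode. */
struct safemode_model {
    unsigned long delay;   /* current safe mode delay, in arbitrary ticks */
    int timeout_runs;      /* how many times the timeout handler has run */
};

/* Stand-in for the safe mode timeout function; it only records the call. */
static void model_safemode_timeout(struct safemode_model *m)
{
    m->timeout_runs++;
}

/* Store a new delay; if it is shorter than the old one, fire the handler now
 * instead of waiting for the old, longer delay to expire. */
static void model_set_delay(struct safemode_model *m, unsigned long new_delay)
{
    unsigned long old_delay;

    assert(m != NULL);
    old_delay = m->delay;
    m->delay = new_delay;
    if (new_delay < old_delay)
        model_safemode_timeout(m);
}
```

      Raising the delay leaves the pending timer alone in this model; only a
      reduction triggers the early run.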
  6. 29 Jul, 2008 1 commit
  7. 24 Jul, 2008 1 commit
  8. 21 Jul, 2008 6 commits
    • md: Protect access to mddev->disks list using RCU · 4b80991c
      NeilBrown authored
      All modifications and most accesses to the mddev->disks list are made
      under the reconfig_mutex lock.  However there are three places where
      the list is walked without any locking.  If a reconfig happens at this
      time, havoc (and oops) can ensue.
      
      So use RCU to protect these accesses:
        - wrap them in rcu_read_{,un}lock()
        - use list_for_each_entry_rcu
        - add to the list with list_add_rcu
        - delete from the list with list_del_rcu
        - delay the 'free' with call_rcu rather than schedule_work
      
      Note that export_rdev did a list_del_init on this list.  In almost all
      cases the entry was not in the list anymore so it was a no-op and so
      safe.  It is no longer safe as after list_del_rcu we may not touch
      the list_head.
      An audit shows that export_rdev is called:
        - after unbind_rdev_from_array, in which case the delete has
           already been done,
        - after bind_rdev_to_array fails, in which case the delete isn't needed.
        - before the device has been put on a list at all (e.g. in
            add_new_disk where reading the superblock fails).
        - and in autorun devices after a failure when the device is on a
            different list.
      
      So remove the list_del_init call from export_rdev, and add it back
      immediately before the call to export_rdev for that last case.
      
      Note also that ->same_set is sometimes used for lists other than
      mddev->disks (e.g. candidates).  In these cases RCU is not needed.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: only count actual openers as access which prevent a 'stop' · f2ea68cf
      NeilBrown authored
      Open isn't the only thing that increments ->active.  e.g. reading
      /proc/mdstat will increment it briefly.  So to avoid false positives
      in testing for concurrent access, introduce a new counter that counts
      just the number of times the md device is open.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Make mddev->array_size sector-based. · f233ea5c
      Andre Noll authored
      This patch renames the array_size field of struct mddev_s to array_sectors
      and converts all instances to use units of 512 byte sectors instead of 1k
      blocks.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Make super_type->rdev_size_change() take sector-based sizes. · 15f4a5fd
      Andre Noll authored
      Also, change the type of the size parameter from unsigned long long to
      sector_t and rename it to num_sectors.
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Fix check for overlapping devices. · d07bd3bc
      Andre Noll authored
      The checks in overlaps() expect all parameters either in block-based
      or sector-based quantities. However, its single caller passes two
      rdev->data_offset arguments as well as two rdev->size arguments, the
      former being sector counts while the latter are measured in 1K blocks.
      
      This could cause rdev_size_store() to accept an invalid size from user
      space. Fix it by passing only sector-based quantities to overlaps().
      Signed-off-by: Andre Noll <maan@systemlinux.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
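      The unit mismatch described in this commit can be reproduced with a
      short userspace sketch. The names ranges_overlap() and
      blocks_to_sectors() are hypothetical, and the only assumption beyond
      the commit text is the arithmetic that one 1K block equals two
      512-byte sectors.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a size in 1K blocks to 512-byte sectors (1 block = 2 sectors). */
static uint64_t blocks_to_sectors(uint64_t kblocks)
{
    assert(kblocks <= UINT64_MAX / 2);  /* guard against overflow */
    return kblocks * 2;
}

/* Return 1 if [s1, s1 + l1) and [s2, s2 + l2) intersect, else 0.
 * All four arguments must be in the same unit (sectors here). */
static int ranges_overlap(uint64_t s1, uint64_t l1, uint64_t s2, uint64_t l2)
{
    if (s1 + l1 <= s2)
        return 0;
    if (s2 + l2 <= s1)
        return 0;
    return 1;
}
```

      For example, take two devices that start at sectors 0 and 150 and are
      each 100 blocks long. Passing the block counts unconverted reports no
      overlap, while converting them to 200 sectors each shows that the
      ranges do collide.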
    • md: Tidy up rdev_size_store a bit: · d7027458
      Neil Brown authored
       - use strict_strtoull in place of simple_strtoull
       - use my_mddev in place of rdev->mddev (they have the same value)
      and more significantly,
       - don't adjust mddev->size to fit, rather reject changes which make
         rdev->size smaller than mddev->size
      
      Adjusting mddev->size is a hangover from bind_rdev_to_array which
      does a similar thing.  But it really is a better design to insist that
      mddev->size is set as required, then the rdev->sizes are set to allow
      for that.  The previous way invites confusion.
      Signed-off-by: NeilBrown <neilb@suse.de>
  9. 11 Jul, 2008 10 commits
  10. 08 Jul, 2008 7 commits
  11. 01 Jul, 2008 1 commit
    • md: resolve external metadata handling deadlock in md_allow_write · b5470dc5
      Dan Williams authored
      md_allow_write() marks the metadata dirty while holding mddev->lock and then
      waits for the write to complete.  For externally managed metadata this causes a
      deadlock as userspace needs to take the lock to communicate that the metadata
      update has completed.
      
      Change md_allow_write() in the 'external' case to start the 'mark active'
      operation and then return -EAGAIN.  The expected side effects while waiting for
      userspace to write 'active' to 'array_state' are: reshape is held off (the code
      currently handles -ENOMEM), some 'stripe_cache_size' change requests fail, some
      GET_BITMAP_FILE ioctl requests fall back to GFP_NOIO, and updates to
      'raid_disks' fail.  Except for 'stripe_cache_size' changes,
      these failures can be mitigated by coordinating with mdmon.
      
      md_write_start() still prevents writes from occurring until the metadata
      handler has had a chance to take action as it unconditionally waits for
      MD_CHANGE_CLEAN to be cleared.
      
      [neilb@suse.de: return -EAGAIN, try GFP_NOIO]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  12. 28 Jun, 2008 4 commits