1. 30 11月, 2012 1 次提交
    • L
      wait: add wait_event_lock_irq() interface · eed8c02e
      Lukas Czerner 提交于
      New wait_event{_interruptible}_lock_irq{_cmd} macros added. This commit
      moves the private wait_event_lock_irq() macro from MD to regular wait
      includes, introduces new macro wait_event_lock_irq_cmd() instead of using
      the old method with omitting cmd parameter which is ugly and makes a use
      of new macros in the MD. It also introduces the _interruptible_ variant.
      
      The use of new interface is when one have a special lock to protect data
      structures used in the condition, or one also needs to invoke "cmd"
      before putting it to sleep.
      
      All new macros are expected to be called with the lock taken. The lock
      is released before sleep and is reacquired afterwards. We will leave the
      macro with the lock held.
      
      Note to DM: IMO this should also fix theoretical race on waitqueue while
      using simultaneously wait_event_lock_irq() and wait_event() because of
      lack of locking around current state setting and wait queue removal.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      eed8c02e
  2. 11 10月, 2012 2 次提交
  3. 31 7月, 2012 3 次提交
    • N
      blk: pass from_schedule to non-request unplug functions. · 74018dc3
      NeilBrown 提交于
      This will allow md/raid to know why the unplug was called,
      and will be able to act according - if !from_schedule it
      is safe to perform tasks which could themselves schedule.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74018dc3
    • N
      blk: centralize non-request unplug handling. · 9cbb1750
      NeilBrown 提交于
      Both md and umem has similar code for getting notified on an
      blk_finish_plug event.
      Centralize this code in block/ and allow each driver to
      provide its distinctive difference.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9cbb1750
    • N
      md: remove plug_cnt feature of plugging. · 0021b7bc
      NeilBrown 提交于
      This seemed like a good idea at the time, but after further thought I
      cannot see it making a difference other than very occasionally and
      testing to try to exercise the case it is most likely to help did not
      show any performance difference by removing it.
      
      So remove the counting of active plugs and allow 'pending writes' to
      be activated at any time, not just when no plugs are active.
      
      This is only relevant when there is a write-intent bitmap, and the
      updating of the bitmap will likely introduce enough delay that
      the single-threading of bitmap updates will be enough to collect large
      numbers of updates together.
      
      Removing this will make it easier to centralise the unplug code, and
      will clear the other for other unplug enhancements which have a
      measurable effect.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0021b7bc
  4. 22 5月, 2012 2 次提交
  5. 21 5月, 2012 2 次提交
    • N
      md: add possibility to change data-offset for devices. · c6563a8c
      NeilBrown 提交于
      When reshaping we can avoid costly intermediate backup by
      changing the 'start' address of the array on the device
      (if there is enough room).
      
      So as a first step, allow such a change to be requested
      through sysfs, and recorded in v1.x metadata.
      
      (As we didn't previous check that all 'pad' fields were zero,
       we need a new FEATURE flag for this.
       A (belatedly) check that all remaining 'pad' fields are
       zero to avoid a repeat of this)
      
      The new data offset must be requested separately for each device.
      This allows each to have a different change in the data offset.
      This is not likely to be used often but as data_offset can be
      set per-device, new_data_offset should be too.
      
      This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
      it is never used and never will be.  At the same time we add a new
      arg ('in_new') which is currently always zero but will be used more
      soon.
      
      When a reshape finishes we will need to update the data_offset
      and rdev->sectors.  So provide an exported function to do that.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c6563a8c
    • N
      md: allow a reshape operation to be reversed. · 2c810cdd
      NeilBrown 提交于
      Currently a reshape operation always progresses from the start
      of the array to the end unless the number of devices is being
      reduced, in which case it progressed in the opposite direction.
      
      To reverse a partial reshape which changes the number of devices
      you can stop the array and re-assemble with the raid-disks numbers
      reversed and it will undo.
      
      However for a reshape that does not change the number of devices
      it is not possible to reverse the reshape in the middle - you have to
      wait until it completes.
      
      So add a 'reshape_direction' attribute with is either 'forwards' or
      'backwards' and can be explicitly set when delta_disks is zero.
      
      This will become more important when we allow the data_offset to
      change in a reshape.  Then the explicit statement of what direction is
      being used will be more useful.
      
      This can be enabled in raid5 trivially as it already supports
      reverse reshape and just needs to use a different trigger to request it.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2c810cdd
  6. 19 3月, 2012 2 次提交
    • N
      md/raid10: handle merge_bvec_fn in member devices. · 050b6615
      NeilBrown 提交于
      Currently we don't honour merge_bvec_fn in member devices so if there
      is one, we force all requests to be single-page at most.
      This is not ideal.
      
      So enhance the raid10 merge_bvec_fn to check that function in children
      as well.
      
      This introduces a small problem.  There is no locking around calls
      the ->merge_bvec_fn and subsequent calls to ->make_request.  So a
      device added between these could end up getting a request which
      violates its merge_bvec_fn.
      
      Currently the best we can do is synchronize_sched().  This will work
      providing no preemption happens.  If there is preemption, we just
      have to hope that new devices are largely consistent with old devices.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      050b6615
    • N
      md: tidy up rdev_for_each usage. · dafb20fa
      NeilBrown 提交于
      md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
      mddev.  However it uses the 'safe' version of list_for_each_entry,
      and so requires the extra variable, but doesn't include 'safe' in the
      name, which is useful documentation.
      
      Consequently some places use this safe version without needing it, and
      many use an explicity list_for_each entry.
      
      So:
       - rename rdev_for_each to rdev_for_each_safe
       - create a new rdev_for_each which uses the plain
         list_for_each_entry,
       - use the 'safe' version only where needed, and convert all other
         list_for_each_entry calls to use rdev_for_each.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      dafb20fa
  7. 23 12月, 2011 2 次提交
    • N
      md: create externally visible flags for supporting hot-replace. · 2d78f8c4
      NeilBrown 提交于
      hot-replace is a feature being added to md which will allow a
      device to be replaced without removing it from the array first.
      
      With hot-replace a spare can be activated and recovery can start while
      the original device is still in place, thus allowing a transition from
      an unreliable device to a reliable device without leaving the array
      degraded during the transition.  It can also be use when the original
      device is still reliable but it not wanted for some reason.
      
      This will eventually be supported in RAID4/5/6 and RAID10.
      
      This patch adds a super-block flag to distinguish the replacement
      device.  If an old kernel sees this flag it will reject the device.
      
      It also adds two per-device flags which are viewable and settable via
      sysfs.
         "want_replacement" can be set to request that a device be replaced.
         "replacement" is set to show that this device is replacing another
         device.
      
      The "rd%d" links in /sys/block/mdXx/md only apply to the original
      device, not the replacement.  We currently don't make links for the
      replacement - there doesn't seem to be a need.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2d78f8c4
    • N
      md: change hot_remove_disk to take an rdev rather than a number. · b8321b68
      NeilBrown 提交于
      Soon an array will be able to have multiple devices with the
      same raid_disk number (an original and a replacement).  So removing
      a device based on the number won't work.  So pass the actual device
      handle instead.
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b8321b68
  8. 11 10月, 2011 4 次提交
  9. 21 9月, 2011 2 次提交
  10. 12 9月, 2011 1 次提交
  11. 28 7月, 2011 5 次提交
    • N
      md: make it easier to wait for bad blocks to be acknowledged. · de393cde
      NeilBrown 提交于
      It is only safe to choose not to write to a bad block if that bad
      block is safely recorded in metadata - i.e. if it has been
      'acknowledged'.
      
      If it hasn't we need to wait for the acknowledgement.
      
      We support that using rdev->blocked wait and
      md_wait_for_blocked_rdev by introducing a new device flag
      'BlockedBadBlock'.
      
      This flag is only advisory.
      It is cleared whenever we acknowledge a bad block, so that a waiter
      can re-check the particular bad blocks that it is interested it.
      
      It should be set by a caller when they find they need to wait.
      This (set after test) is inherently racy, but as
      md_wait_for_blocked_rdev already has a timeout, losing the race will
      have minimal impact.
      
      When we clear "Blocked" was also clear "BlockedBadBlocks" incase it
      was set incorrectly (see above race).
      
      We also modify the way we manage 'Blocked' to fit better with the new
      handling of 'BlockedBadBlocks' and to make it consistent between
      externally managed and internally managed metadata.   This requires
      that each raidXd loop checks if the metadata needs to be written and
      triggers a write (md_check_recovery) if needed.  Otherwise a queued
      write request might cause raidXd to wait for the metadata to write,
      and only that thread can write it.
      
      Before writing metadata, we set FaultRecorded for all devices that
      are Faulty, then after writing the metadata we clear Blocked for any
      device for which the Fault was certainly Recorded.
      
      The 'faulty' device flag now appears in sysfs if the device is faulty
      *or* it has unacknowledged bad blocks.  So user-space which does not
      understand bad blocks can continue to function correctly.
      User space which does, should not assume a device is faulty until it
      sees the 'faulty' flag, and then sees the list of unacknowledged bad
      blocks is empty.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      de393cde
    • N
      md: add 'write_error' flag to component devices. · d7a9d443
      NeilBrown 提交于
      If a device has ever seen a write error, we will want to handle
      known-bad-blocks differently.
      So create an appropriate state flag and export it via sysfs.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
      d7a9d443
    • N
      md/raid1: avoid reading from known bad blocks. · d2eb35ac
      NeilBrown 提交于
      Now that we have a bad block list, we should not read from those
      blocks.
      There are several main parts to this:
        1/ read_balance needs to check for bad blocks, and return not only
           the chosen device, but also how many good blocks are available
           there.
        2/ fix_read_error needs to avoid trying to read from bad blocks.
        3/ read submission must be ready to issue multiple reads to
           different devices as different bad blocks on different devices
           could mean that a single large read cannot be served by any one
           device, but can still be served by the array.
           This requires keeping count of the number of outstanding requests
           per bio.  This count is stored in 'bi_phys_segments'
        4/ retrying a read needs to also be ready to submit a smaller read
           and queue another request for the rest.
      
      This does not yet handle bad blocks when reading to perform resync,
      recovery, or check.
      
      'md_trim_bio' will also be used for RAID10, so put it in md.c and
      export it.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d2eb35ac
    • N
      md: load/store badblock list from v1.x metadata · 2699b672
      NeilBrown 提交于
      Space must have been allocated when array was created.
      A feature flag is set when the badblock list is non-empty, to
      ensure old kernels don't load and trust the whole device.
      
      We only update the on-disk badblocklist when it has changed.
      If the badblocklist (or other metadata) is stored on a bad block, we
      don't cope very well.
      
      If metadata has no room for bad block, flag bad-blocks as disabled,
      and do the same for 0.90 metadata.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2699b672
    • N
      md: beginnings of bad block management. · 2230dfe4
      NeilBrown 提交于
      This the first step in allowing md to track bad-blocks per-device so
      that we can fail individual blocks rather than the whole device.
      
      This patch just adds a data structure for recording bad blocks, with
      routines to add, remove, search the list.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
      2230dfe4
  12. 27 7月, 2011 3 次提交
    • J
      MD bitmap: Revert DM dirty log hooks · 3520fa4d
      Jonathan Brassow 提交于
      
      Revert most of commit e384e585
        md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.
      
      MD should not need to use DM's dirty log - we decided to use md's
      bitmaps instead.
      
      Keeping the DIV_ROUND_UP clean-ups that were part of commit
      e384e585, however.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      3520fa4d
    • N
      md: change managed of recovery_disabled. · 5389042f
      NeilBrown 提交于
      If we hit a read error while recovering a mirror, we want to abort the
      recovery without necessarily failing the disk - as having a disk this
      a read error is better than not having an array at all.
      
      Currently this is managed with a per-array flag "recovery_disabled"
      and is only implemented for RAID1.  For RAID10 we will need finer
      grained control as we might want to disable recovery for individual
      devices separately.
      
      So push more of the decision making into the personality.
      'recovery_disabled' is now a 'cookie' which is copied when the
      personality want to disable recovery and is changed when a device is
      added to the array as this is used as a trigger to 'try recovery
      again'.
      
      This will allow RAID10 to get the control that it needs.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5389042f
    • N
      md: introduce link/unlink_rdev() helpers · 36fad858
      Namhyung Kim 提交于
      There are places where sysfs links to rdev are handled
      in a same way. Add the helper functions to consolidate
      them.
      Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      36fad858
  13. 09 6月, 2011 1 次提交
  14. 08 6月, 2011 1 次提交
    • J
      MD: add sync_super to mddev_t struct · 076f968b
      Jonathan Brassow 提交于
      Add the 'sync_super' function pointer to MD array structure (struct mddev_s)
      
      If device-mapper (dm-raid.c) is to define its own on-disk superblock and be
      able to load it, there must still be a way for MD to initiate superblock
      updates.  The simplest way to make this happen is to provide a pointer in
      the MD array structure that can be set by device-mapper (or other module)
      with a function to do this.  If the function has been set, it will be used;
      otherwise, the method with be looked up via 'super_types' as usual.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      076f968b
  15. 18 4月, 2011 2 次提交
    • N
      md: provide generic support for handling unplug callbacks. · 97658cdd
      NeilBrown 提交于
      When an md device adds a request to a queue, it can call
      mddev_check_plugged.
      If this succeeds then we know that the md thread will be woken up
      shortly, and ->plug_cnt will be non-zero until then, so some
      processing can be delayed.
      
      If it fails, then no unplug callback is expected and the make_request
      function needs to do whatever is required to make the request happen.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      97658cdd
    • N
      md - remove old plugging code. · 482c0834
      NeilBrown 提交于
      md has some plugging infrastructure for RAID5 to use because the
      normal plugging infrastructure required a 'request_queue', and when
      called from dm, RAID5 doesn't have one of those available.
      
      This relied on the ->unplug_fn callback which doesn't exist any more.
      
      So remove all of that code, both in md and raid5.  Subsequent patches
      with restore the plugging functionality.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      482c0834
  16. 31 3月, 2011 1 次提交
  17. 24 2月, 2011 1 次提交
    • N
      md: Fix - again - partition detection when array becomes active · f0b4f7e2
      NeilBrown 提交于
      Revert
          b821eaa5
      and
          f3b99be1
      
      When I wrote the first of these I had a wrong idea about the
      lifetime of 'struct block_device'.  It can disappear at any time that
      the block device is not open if it falls out of the inode cache.
      
      So relying on the 'size' recorded with it to detect when the
      device size has changed and so we need to revalidate, is wrong.
      
      Rather, we really do need the 'changed' attribute stored directly in
      the mddev and set/tested as appropriate.
      
      Without this patch, a sequence of:
         mknod / open / close / unlink
      
      (which can cause a block_device to be created and then destroyed)
      will result in a rescan of the partition table and consequence removal
      and addition of partitions.
      Several of these in a row can get udev racing to create and unlink and
      other code can get confused.
      
      With the patch, the rescan is only performed when needed and so there
      are no races.
      
      This is suitable for any stable kernel from 2.6.35.
      Reported-by: N"Wojcik, Krzysztof" <krzysztof.wojcik@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
      f0b4f7e2
  18. 31 1月, 2011 1 次提交
    • N
      md: Remove the AllReserved flag for component devices. · f21e9ff7
      NeilBrown 提交于
      This flag is not needed and is used badly.
      
      Devices that are included in a native-metadata array are reserved
      exclusively for that array - and currently have AllReserved set.
      They all are bd_claimed for the rdev and so cannot be shared.
      
      Devices that are included in external-metadata arrays can be shared
      among multiple arrays - providing there is no overlap.
      These are bd_claimed for md in general - not for a particular rdev.
      
      When changing the amount of a device that is used in an array we need
      to check for overlap.  This currently includes a check on AllReserved
      So even without overlap, sharing with an AllReserved device is not
      allowed.
      However the bd_claim usage already precludes sharing with these
      devices, so the test on AllReserved is not needed.  And in fact it is
      wrong.
      
      As this is the only use of AllReserved, simply remove all usage and
      definition of AllReserved.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      f21e9ff7
  19. 14 1月, 2011 3 次提交
    • J
      md: separate meta and data devs · a6ff7e08
      Jonathan Brassow 提交于
      Allow the metadata to be on a separate device from the
      data.
      
      This doesn't mean the data and metadata will by on separate
      physical devices - it simply gives device-mapper and userspace
      tools more flexibility.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a6ff7e08
    • J
      md-new-param-to_sync_page_io · ccebd4c4
      Jonathan Brassow 提交于
      Add new parameter to 'sync_page_io'.
      
      The new parameter allows us to distinguish between metadata and data
      operations.  This becomes important later when we add the ability to
      use separate devices for data and metadata.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      ccebd4c4
    • N
      md: Ensure no IO request to get md device before it is properly initialised. · 0ca69886
      NeilBrown 提交于
      When an md device is in the process of coming on line it is possible
      for an IO request (typically a partition table probe) to get through
      before the array is fully initialised, which can cause unexpected
      behaviour (e.g. a crash).
      
      So explicitly record when the array is ready for IO and don't allow IO
      through until then.
      
      There is no possibility for a similar problem when the array is going
      off-line as there must only be one 'open' at that time, and it is busy
      off-lining the array and so cannot send IO requests.  So no memory
      barrier is needed in md_stop()
      
      This has been a bug since commit 409c57f3 in 2.6.30 which
      introduced md_make_request.  Before then, each personality would
      register its own make_request_fn when it was ready.
      This is suitable for any stable kernel from 2.6.30.y onwards.
      
      Cc: <stable@kernel.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reported-by: N"Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
      0ca69886
  20. 28 10月, 2010 1 次提交
    • N
      md: use separate bio pool for each md device. · a167f663
      NeilBrown 提交于
      bio_clone and bio_alloc allocate from a common bio pool.
      If an md device is stacked with other devices that use this pool, or under
      something like swap which uses the pool, then the multiple calls on
      the pool can cause deadlocks.
      
      So allocate a local bio pool for each md array and use that rather
      than the common pool.
      
      This pool is used both for regular IO and metadata updates.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a167f663