1. 19 3月, 2012 2 次提交
    • N
      md/raid10: handle merge_bvec_fn in member devices. · 050b6615
      NeilBrown 提交于
      Currently we don't honour merge_bvec_fn in member devices so if there
      is one, we force all requests to be single-page at most.
      This is not ideal.
      
      So enhance the raid10 merge_bvec_fn to check that function in children
      as well.
      
      This introduces a small problem.  There is no locking around calls
      the ->merge_bvec_fn and subsequent calls to ->make_request.  So a
      device added between these could end up getting a request which
      violates its merge_bvec_fn.
      
      Currently the best we can do is synchronize_sched().  This will work
      providing no preemption happens.  If there is preemption, we just
      have to hope that new devices are largely consistent with old devices.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      050b6615
    • N
      md: tidy up rdev_for_each usage. · dafb20fa
      NeilBrown 提交于
      md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
      mddev.  However it uses the 'safe' version of list_for_each_entry,
      and so requires the extra variable, but doesn't include 'safe' in the
      name, which is useful documentation.
      
      Consequently some places use this safe version without needing it, and
      many use an explicity list_for_each entry.
      
      So:
       - rename rdev_for_each to rdev_for_each_safe
       - create a new rdev_for_each which uses the plain
         list_for_each_entry,
       - use the 'safe' version only where needed, and convert all other
         list_for_each_entry calls to use rdev_for_each.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      dafb20fa
  2. 23 12月, 2011 2 次提交
    • N
      md: create externally visible flags for supporting hot-replace. · 2d78f8c4
      NeilBrown 提交于
      hot-replace is a feature being added to md which will allow a
      device to be replaced without removing it from the array first.
      
      With hot-replace a spare can be activated and recovery can start while
      the original device is still in place, thus allowing a transition from
      an unreliable device to a reliable device without leaving the array
      degraded during the transition.  It can also be use when the original
      device is still reliable but it not wanted for some reason.
      
      This will eventually be supported in RAID4/5/6 and RAID10.
      
      This patch adds a super-block flag to distinguish the replacement
      device.  If an old kernel sees this flag it will reject the device.
      
      It also adds two per-device flags which are viewable and settable via
      sysfs.
         "want_replacement" can be set to request that a device be replaced.
         "replacement" is set to show that this device is replacing another
         device.
      
      The "rd%d" links in /sys/block/mdXx/md only apply to the original
      device, not the replacement.  We currently don't make links for the
      replacement - there doesn't seem to be a need.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2d78f8c4
    • N
      md: change hot_remove_disk to take an rdev rather than a number. · b8321b68
      NeilBrown 提交于
      Soon an array will be able to have multiple devices with the
      same raid_disk number (an original and a replacement).  So removing
      a device based on the number won't work.  So pass the actual device
      handle instead.
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b8321b68
  3. 11 10月, 2011 4 次提交
  4. 21 9月, 2011 2 次提交
  5. 12 9月, 2011 1 次提交
  6. 28 7月, 2011 5 次提交
    • N
      md: make it easier to wait for bad blocks to be acknowledged. · de393cde
      NeilBrown 提交于
      It is only safe to choose not to write to a bad block if that bad
      block is safely recorded in metadata - i.e. if it has been
      'acknowledged'.
      
      If it hasn't we need to wait for the acknowledgement.
      
      We support that using rdev->blocked wait and
      md_wait_for_blocked_rdev by introducing a new device flag
      'BlockedBadBlock'.
      
      This flag is only advisory.
      It is cleared whenever we acknowledge a bad block, so that a waiter
      can re-check the particular bad blocks that it is interested it.
      
      It should be set by a caller when they find they need to wait.
      This (set after test) is inherently racy, but as
      md_wait_for_blocked_rdev already has a timeout, losing the race will
      have minimal impact.
      
      When we clear "Blocked" was also clear "BlockedBadBlocks" incase it
      was set incorrectly (see above race).
      
      We also modify the way we manage 'Blocked' to fit better with the new
      handling of 'BlockedBadBlocks' and to make it consistent between
      externally managed and internally managed metadata.   This requires
      that each raidXd loop checks if the metadata needs to be written and
      triggers a write (md_check_recovery) if needed.  Otherwise a queued
      write request might cause raidXd to wait for the metadata to write,
      and only that thread can write it.
      
      Before writing metadata, we set FaultRecorded for all devices that
      are Faulty, then after writing the metadata we clear Blocked for any
      device for which the Fault was certainly Recorded.
      
      The 'faulty' device flag now appears in sysfs if the device is faulty
      *or* it has unacknowledged bad blocks.  So user-space which does not
      understand bad blocks can continue to function correctly.
      User space which does, should not assume a device is faulty until it
      sees the 'faulty' flag, and then sees the list of unacknowledged bad
      blocks is empty.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      de393cde
    • N
      md: add 'write_error' flag to component devices. · d7a9d443
      NeilBrown 提交于
      If a device has ever seen a write error, we will want to handle
      known-bad-blocks differently.
      So create an appropriate state flag and export it via sysfs.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
      d7a9d443
    • N
      md/raid1: avoid reading from known bad blocks. · d2eb35ac
      NeilBrown 提交于
      Now that we have a bad block list, we should not read from those
      blocks.
      There are several main parts to this:
        1/ read_balance needs to check for bad blocks, and return not only
           the chosen device, but also how many good blocks are available
           there.
        2/ fix_read_error needs to avoid trying to read from bad blocks.
        3/ read submission must be ready to issue multiple reads to
           different devices as different bad blocks on different devices
           could mean that a single large read cannot be served by any one
           device, but can still be served by the array.
           This requires keeping count of the number of outstanding requests
           per bio.  This count is stored in 'bi_phys_segments'
        4/ retrying a read needs to also be ready to submit a smaller read
           and queue another request for the rest.
      
      This does not yet handle bad blocks when reading to perform resync,
      recovery, or check.
      
      'md_trim_bio' will also be used for RAID10, so put it in md.c and
      export it.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d2eb35ac
    • N
      md: load/store badblock list from v1.x metadata · 2699b672
      NeilBrown 提交于
      Space must have been allocated when array was created.
      A feature flag is set when the badblock list is non-empty, to
      ensure old kernels don't load and trust the whole device.
      
      We only update the on-disk badblocklist when it has changed.
      If the badblocklist (or other metadata) is stored on a bad block, we
      don't cope very well.
      
      If metadata has no room for bad block, flag bad-blocks as disabled,
      and do the same for 0.90 metadata.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2699b672
    • N
      md: beginnings of bad block management. · 2230dfe4
      NeilBrown 提交于
      This the first step in allowing md to track bad-blocks per-device so
      that we can fail individual blocks rather than the whole device.
      
      This patch just adds a data structure for recording bad blocks, with
      routines to add, remove, search the list.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reviewed-by: NNamhyung Kim <namhyung@gmail.com>
      2230dfe4
  7. 27 7月, 2011 3 次提交
    • J
      MD bitmap: Revert DM dirty log hooks · 3520fa4d
      Jonathan Brassow 提交于
      
      Revert most of commit e384e585
        md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.
      
      MD should not need to use DM's dirty log - we decided to use md's
      bitmaps instead.
      
      Keeping the DIV_ROUND_UP clean-ups that were part of commit
      e384e585, however.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      3520fa4d
    • N
      md: change managed of recovery_disabled. · 5389042f
      NeilBrown 提交于
      If we hit a read error while recovering a mirror, we want to abort the
      recovery without necessarily failing the disk - as having a disk this
      a read error is better than not having an array at all.
      
      Currently this is managed with a per-array flag "recovery_disabled"
      and is only implemented for RAID1.  For RAID10 we will need finer
      grained control as we might want to disable recovery for individual
      devices separately.
      
      So push more of the decision making into the personality.
      'recovery_disabled' is now a 'cookie' which is copied when the
      personality want to disable recovery and is changed when a device is
      added to the array as this is used as a trigger to 'try recovery
      again'.
      
      This will allow RAID10 to get the control that it needs.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5389042f
    • N
      md: introduce link/unlink_rdev() helpers · 36fad858
      Namhyung Kim 提交于
      There are places where sysfs links to rdev are handled
      in a same way. Add the helper functions to consolidate
      them.
      Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      36fad858
  8. 09 6月, 2011 1 次提交
  9. 08 6月, 2011 1 次提交
    • J
      MD: add sync_super to mddev_t struct · 076f968b
      Jonathan Brassow 提交于
      Add the 'sync_super' function pointer to MD array structure (struct mddev_s)
      
      If device-mapper (dm-raid.c) is to define its own on-disk superblock and be
      able to load it, there must still be a way for MD to initiate superblock
      updates.  The simplest way to make this happen is to provide a pointer in
      the MD array structure that can be set by device-mapper (or other module)
      with a function to do this.  If the function has been set, it will be used;
      otherwise, the method with be looked up via 'super_types' as usual.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      076f968b
  10. 18 4月, 2011 2 次提交
    • N
      md: provide generic support for handling unplug callbacks. · 97658cdd
      NeilBrown 提交于
      When an md device adds a request to a queue, it can call
      mddev_check_plugged.
      If this succeeds then we know that the md thread will be woken up
      shortly, and ->plug_cnt will be non-zero until then, so some
      processing can be delayed.
      
      If it fails, then no unplug callback is expected and the make_request
      function needs to do whatever is required to make the request happen.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      97658cdd
    • N
      md - remove old plugging code. · 482c0834
      NeilBrown 提交于
      md has some plugging infrastructure for RAID5 to use because the
      normal plugging infrastructure required a 'request_queue', and when
      called from dm, RAID5 doesn't have one of those available.
      
      This relied on the ->unplug_fn callback which doesn't exist any more.
      
      So remove all of that code, both in md and raid5.  Subsequent patches
      with restore the plugging functionality.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      482c0834
  11. 31 3月, 2011 1 次提交
  12. 24 2月, 2011 1 次提交
    • N
      md: Fix - again - partition detection when array becomes active · f0b4f7e2
      NeilBrown 提交于
      Revert
          b821eaa5
      and
          f3b99be1
      
      When I wrote the first of these I had a wrong idea about the
      lifetime of 'struct block_device'.  It can disappear at any time that
      the block device is not open if it falls out of the inode cache.
      
      So relying on the 'size' recorded with it to detect when the
      device size has changed and so we need to revalidate, is wrong.
      
      Rather, we really do need the 'changed' attribute stored directly in
      the mddev and set/tested as appropriate.
      
      Without this patch, a sequence of:
         mknod / open / close / unlink
      
      (which can cause a block_device to be created and then destroyed)
      will result in a rescan of the partition table and consequence removal
      and addition of partitions.
      Several of these in a row can get udev racing to create and unlink and
      other code can get confused.
      
      With the patch, the rescan is only performed when needed and so there
      are no races.
      
      This is suitable for any stable kernel from 2.6.35.
      Reported-by: N"Wojcik, Krzysztof" <krzysztof.wojcik@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
      f0b4f7e2
  13. 31 1月, 2011 1 次提交
    • N
      md: Remove the AllReserved flag for component devices. · f21e9ff7
      NeilBrown 提交于
      This flag is not needed and is used badly.
      
      Devices that are included in a native-metadata array are reserved
      exclusively for that array - and currently have AllReserved set.
      They all are bd_claimed for the rdev and so cannot be shared.
      
      Devices that are included in external-metadata arrays can be shared
      among multiple arrays - providing there is no overlap.
      These are bd_claimed for md in general - not for a particular rdev.
      
      When changing the amount of a device that is used in an array we need
      to check for overlap.  This currently includes a check on AllReserved
      So even without overlap, sharing with an AllReserved device is not
      allowed.
      However the bd_claim usage already precludes sharing with these
      devices, so the test on AllReserved is not needed.  And in fact it is
      wrong.
      
      As this is the only use of AllReserved, simply remove all usage and
      definition of AllReserved.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      f21e9ff7
  14. 14 1月, 2011 3 次提交
    • J
      md: separate meta and data devs · a6ff7e08
      Jonathan Brassow 提交于
      Allow the metadata to be on a separate device from the
      data.
      
      This doesn't mean the data and metadata will by on separate
      physical devices - it simply gives device-mapper and userspace
      tools more flexibility.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a6ff7e08
    • J
      md-new-param-to_sync_page_io · ccebd4c4
      Jonathan Brassow 提交于
      Add new parameter to 'sync_page_io'.
      
      The new parameter allows us to distinguish between metadata and data
      operations.  This becomes important later when we add the ability to
      use separate devices for data and metadata.
      Signed-off-by: NJonathan Brassow <jbrassow@redhat.com>
      ccebd4c4
    • N
      md: Ensure no IO request to get md device before it is properly initialised. · 0ca69886
      NeilBrown 提交于
      When an md device is in the process of coming on line it is possible
      for an IO request (typically a partition table probe) to get through
      before the array is fully initialised, which can cause unexpected
      behaviour (e.g. a crash).
      
      So explicitly record when the array is ready for IO and don't allow IO
      through until then.
      
      There is no possibility for a similar problem when the array is going
      off-line as there must only be one 'open' at that time, and it is busy
      off-lining the array and so cannot send IO requests.  So no memory
      barrier is needed in md_stop()
      
      This has been a bug since commit 409c57f3 in 2.6.30 which
      introduced md_make_request.  Before then, each personality would
      register its own make_request_fn when it was ready.
      This is suitable for any stable kernel from 2.6.30.y onwards.
      
      Cc: <stable@kernel.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reported-by: N"Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
      0ca69886
  15. 28 10月, 2010 2 次提交
    • N
      md: use separate bio pool for each md device. · a167f663
      NeilBrown 提交于
      bio_clone and bio_alloc allocate from a common bio pool.
      If an md device is stacked with other devices that use this pool, or under
      something like swap which uses the pool, then the multiple calls on
      the pool can cause deadlocks.
      
      So allocate a local bio pool for each md array and use that rather
      than the common pool.
      
      This pool is used both for regular IO and metadata updates.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a167f663
    • N
      md: change type of first arg to sync_page_io. · 2b193363
      NeilBrown 提交于
      Currently sync_page_io takes a 'bdev'.
      Every caller passes 'rdev->bdev'.
      We will soon want another field out of the rdev in sync_page_io,
      So just pass the rdev instead of the bdev out of it.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2b193363
  16. 10 9月, 2010 1 次提交
    • T
      md: implment REQ_FLUSH/FUA support · e9c7469b
      Tejun Heo 提交于
      This patch converts md to support REQ_FLUSH/FUA instead of now
      deprecated REQ_HARDBARRIER.  In the core part (md.c), the following
      changes are notable.
      
      * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
        processing of other requests and thus there is no reason to mark the
        queue congested while FLUSH/FUA is in progress.
      
      * REQ_FLUSH/FUA failures are final and its users don't need retry
        logic.  Retry logic is removed.
      
      * Preflush needs to be issued to all member devices but FUA writes can
        be handled the same way as other writes - their processing can be
        deferred to request_queue of member devices.  md_barrier_request()
        is renamed to md_flush_request() and simplified accordingly.
      
      For linear, raid0 and multipath, the core changes are enough.  raid1,
      5 and 10 need the following conversions.
      
      * raid1: Handling of FLUSH/FUA bio's can simply be deferred to
        request_queues of member devices.  Barrier related logic removed.
      
      * raid5: Queue draining logic dropped.  FUA bit is propagated through
        biodrain and stripe resconstruction such that all the updated parts
        of the stripe are written out with FUA writes if any of the dirtying
        writes was FUA.  preread_active_stripes handling in make_request()
        is updated as suggested by Neil Brown.
      
      * raid10: FUA bit needs to be propagated to write clones.
      
      linear, raid0, 1, 5 and 10 tested.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      e9c7469b
  17. 30 8月, 2010 1 次提交
    • N
      md: resolve confusion of MD_CHANGE_CLEAN · 070dc6dd
      NeilBrown 提交于
      MD_CHANGE_CLEAN is used for two different purposes and this leads to
      confusion.
      One of the purposes is largely mirrored by MD_CHANGE_PENDING which is
      not used for anything else, so have MD_CHANGE_PENDING take over that
      purpose fully.
      
      The two purposes are:
       1/ tell md_update_sb that an update is needed and that it is just a
         clean/dirty transition.
       2/ tell user-space that an transition from clean to dirty is pending
          (something wants to write), and tell te kernel (by clearin the
          flag) that the transition is OK.
      
      The first purpose remains wit MD_CHANGE_CLEAN, the second is moved
      fully to MD_CHANGE_PENDING.
      
      This means that various places which conditionally set or cleared
      MD_CHANGE_CLEAN no longer need to be conditional.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      070dc6dd
  18. 08 8月, 2010 2 次提交
    • N
      md: fix another deadlock with removing sysfs attributes. · bb4f1e9d
      NeilBrown 提交于
      Move the deletion of sysfs attributes from reconfig_mutex to
      open_mutex didn't really help as a process can try to take
      open_mutex while holding reconfig_mutex, so the same deadlock can
      happen, just requiring one more process to be involved in the chain.
      
      I looks like I cannot easily use locking to wait for the sysfs
      deletion to complete, so don't.
      
      The only things that we cannot do while the deletions are still
      pending is other things which can change the sysfs namespace: run,
      takeover, stop.  Each of these can fail with -EBUSY.
      So set a flag while doing a sysfs deletion, and fail run, takeover,
      stop if that flag is set.
      
      This is suitable for 2.6.35.x
      
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      bb4f1e9d
    • C
      block: unify flags for struct bio and struct request · 7b6d91da
      Christoph Hellwig 提交于
      Remove the current bio flags and reuse the request flags for the bio, too.
      This allows to more easily trace the type of I/O from the filesystem
      down to the block driver.  There were two flags in the bio that were
      missing in the requests:  BIO_RW_UNPLUG and BIO_RW_AHEAD.  Also I've
      renamed two request flags that had a superflous RW in them.
      
      Note that the flags are in bio.h despite having the REQ_ name - as
      blkdev.h includes bio.h that is the only way to go for now.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      7b6d91da
  19. 26 7月, 2010 5 次提交
    • N
      md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log. · e384e585
      NeilBrown 提交于
      This allows md/raid5 to fully work as a dm target.
      
      Normally md uses a 'filemap' which contains a list of pages of bits
      each of which may be written separately.
      dm-log uses and all-or-nothing approach to writing the log, so
      when using a dm-log, ->filemap is NULL and the flags normally stored
      in filemap_attr are stored in ->logattrs instead.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      e384e585
    • N
      md/bitmap: clean up plugging calls. · b63d7c2e
      NeilBrown 提交于
      1/ use md_unplug in bitmap.c as we will soon be using bitmaps under
        arrays with no queue attached.
      
      2/ Don't bother plugging the queue when we set a bit in the bitmap.
         The reason for this was to encourage as many bits as possible to
         get set before we unplug and write stuff out.
         However every personality already plugs the queue after
         bitmap_startwrite either directly (raid1/raid10) or be setting
         STRIPE_BIT_DELAY which causes the queue to be plugged later
         (raid5).
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b63d7c2e
    • N
      md/bitmap: white space clean up and similar. · ac2f40be
      NeilBrown 提交于
      Fixes some whitespace problems
      Fixed some checkpatch.pl complaints.
      Replaced kmalloc ... memset(0), with kzalloc
      Fixed an unlikely memory leak on an error path.
      Reformatted a number of 'if/else' sets, sometimes
      replacing goto with an else clause.
      Removed some old comments and commented-out code.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      ac2f40be
    • N
      md/plug: optionally use plugger to unplug an array during resync/recovery. · 252ac522
      NeilBrown 提交于
      If an array doesn't have a 'queue' then md_do_sync cannot
      unplug it.
      In that case it will have a 'plugger', so make that available
      to the mddev, and use it to unplug the array if needed.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      252ac522
    • N
      md/raid5: add simple plugging infrastructure. · 2ac87401
      NeilBrown 提交于
      md/raid5 uses the plugging infrastructure provided by the block layer
      and 'struct request_queue'.  However when we plug raid5 under dm there
      is no request queue so we cannot use that.
      
      So create a similar infrastructure that is much lighter weight and use
      it for raid5.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2ac87401