1. April 18, 2011 (2 commits)
    • md: provide generic support for handling unplug callbacks. · 97658cdd
      Committed by NeilBrown
      When an md device adds a request to a queue, it can call
      mddev_check_plugged.
      If this succeeds then we know that the md thread will be woken up
      shortly, and ->plug_cnt will be non-zero until then, so some
      processing can be delayed.
      
      If it fails, then no unplug callback is expected and the make_request
      function needs to do whatever is required to make the request happen.
      Signed-off-by: NeilBrown <neilb@suse.de>
      97658cdd
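      A minimal userspace sketch of the pattern described in the entry above: the
      submitter checks whether the device is currently plugged and, if so, defers
      work to the md thread that the unplug callback will wake.  The struct and
      function names are illustrative stand-ins, not the kernel API.

        #include <stdbool.h>
        #include <stdio.h>

        /* Illustrative stand-in for the md device; only what the sketch needs. */
        struct mddev_model {
            int plug_cnt;       /* non-zero while an unplug callback is pending */
            int deferred_work;  /* work left for the md thread to pick up */
        };

        /* True if the device is plugged, i.e. the md thread is guaranteed to be
         * woken shortly, so some processing can safely be delayed until then. */
        static bool mddev_check_plugged_model(struct mddev_model *mddev)
        {
            return mddev->plug_cnt > 0;
        }

        static void make_request_model(struct mddev_model *mddev)
        {
            if (mddev_check_plugged_model(mddev))
                mddev->deferred_work++;   /* the unplug callback will handle it */
            else
                printf("no unplug callback expected: handling request now\n");
        }

        int main(void)
        {
            struct mddev_model m = { .plug_cnt = 1, .deferred_work = 0 };
            make_request_model(&m);   /* deferred */
            m.plug_cnt = 0;
            make_request_model(&m);   /* handled immediately */
            printf("deferred_work = %d\n", m.deferred_work);
            return 0;
        }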
    • md - remove old plugging code. · 482c0834
      Committed by NeilBrown
      md has some plugging infrastructure for RAID5 to use because the
      normal plugging infrastructure required a 'request_queue', and when
      called from dm, RAID5 doesn't have one of those available.
      
      This relied on the ->unplug_fn callback which doesn't exist any more.
      
      So remove all of that code, both in md and raid5.  Subsequent patches
      will restore the plugging functionality.
      Signed-off-by: NeilBrown <neilb@suse.de>
      482c0834
  2. March 31, 2011 (1 commit)
  3. February 24, 2011 (1 commit)
    • md: Fix - again - partition detection when array becomes active · f0b4f7e2
      Committed by NeilBrown
      Revert
          b821eaa5
      and
          f3b99be1
      
      When I wrote the first of these I had a wrong idea about the
      lifetime of 'struct block_device'.  It can disappear at any time that
      the block device is not open if it falls out of the inode cache.
      
      So relying on the 'size' recorded with it to detect when the
      device size has changed, and hence when we need to revalidate, is wrong.
      
      Rather, we really do need the 'changed' attribute stored directly in
      the mddev and set/tested as appropriate.
      
      Without this patch, a sequence of:
         mknod / open / close / unlink
      
      (which can cause a block_device to be created and then destroyed)
      will result in a rescan of the partition table and consequent removal
      and addition of partitions.
      Several of these in a row can get udev racing to create and unlink and
      other code can get confused.
      
      With the patch, the rescan is only performed when needed and so there
      are no races.
      
      This is suitable for any stable kernel from 2.6.35.
      Reported-by: "Wojcik, Krzysztof" <krzysztof.wojcik@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
      f0b4f7e2
  4. January 31, 2011 (1 commit)
    • md: Remove the AllReserved flag for component devices. · f21e9ff7
      Committed by NeilBrown
      This flag is not needed and is used badly.
      
      Devices that are included in a native-metadata array are reserved
      exclusively for that array - and currently have AllReserved set.
      They all are bd_claimed for the rdev and so cannot be shared.
      
      Devices that are included in external-metadata arrays can be shared
      among multiple arrays - providing there is no overlap.
      These are bd_claimed for md in general - not for a particular rdev.
      
      When changing the amount of a device that is used in an array we need
      to check for overlap.  This currently includes a check on AllReserved,
      so even without overlap, sharing with an AllReserved device is not
      allowed.
      However the bd_claim usage already precludes sharing with these
      devices, so the test on AllReserved is not needed.  And in fact it is
      wrong.
      
      As this is the only use of AllReserved, simply remove all usage and
      definition of AllReserved.
      Signed-off-by: NeilBrown <neilb@suse.de>
      f21e9ff7
  5. January 14, 2011 (3 commits)
    • md: separate meta and data devs · a6ff7e08
      Committed by Jonathan Brassow
      Allow the metadata to be on a separate device from the
      data.
      
      This doesn't mean the data and metadata will be on separate
      physical devices - it simply gives device-mapper and userspace
      tools more flexibility.
      Signed-off-by: NeilBrown <neilb@suse.de>
      a6ff7e08
    • md-new-param-to_sync_page_io · ccebd4c4
      Committed by Jonathan Brassow
      Add new parameter to 'sync_page_io'.
      
      The new parameter allows us to distinguish between metadata and data
      operations.  This becomes important later when we add the ability to
      use separate devices for data and metadata.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      ccebd4c4
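      A rough sketch of the idea behind the new parameter: a metadata/data flag on
      a sync_page_io-style helper lets later code route the two kinds of operation
      to different devices.  The types and the helper below are simplified
      stand-ins, not the actual md code.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        typedef uint64_t sector_t;

        /* Simplified component-device descriptor: data and metadata may live on
         * different block devices (meta_bdev == NULL means they share one). */
        struct rdev_model {
            const char *data_bdev;
            const char *meta_bdev;
        };

        /* Model of a sync_page_io-style helper that takes the rdev rather than a
         * bare bdev, plus a flag saying whether this is a metadata operation. */
        static bool sync_page_io_model(struct rdev_model *rdev, sector_t sector,
                                       int size, bool metadata_op)
        {
            const char *target = (metadata_op && rdev->meta_bdev)
                                     ? rdev->meta_bdev
                                     : rdev->data_bdev;
            printf("%s op: %d bytes at sector %llu on %s\n",
                   metadata_op ? "metadata" : "data",
                   size, (unsigned long long)sector, target);
            return true;
        }

        int main(void)
        {
            struct rdev_model rdev = { "/dev/sda", "/dev/sdb" };
            sync_page_io_model(&rdev, 8, 4096, true);     /* superblock update */
            sync_page_io_model(&rdev, 2048, 4096, false); /* ordinary data I/O */
            return 0;
        }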
    • md: Ensure no IO request to get md device before it is properly initialised. · 0ca69886
      Committed by NeilBrown
      When an md device is in the process of coming on line it is possible
      for an IO request (typically a partition table probe) to get through
      before the array is fully initialised, which can cause unexpected
      behaviour (e.g. a crash).
      
      So explicitly record when the array is ready for IO and don't allow IO
      through until then.
      
      There is no possibility for a similar problem when the array is going
      off-line as there must only be one 'open' at that time, and it is busy
      off-lining the array and so cannot send IO requests.  So no memory
      barrier is needed in md_stop().
      
      This has been a bug since commit 409c57f3 in 2.6.30 which
      introduced md_make_request.  Before then, each personality would
      register its own make_request_fn when it was ready.
      This is suitable for any stable kernel from 2.6.30.y onwards.
      
      Cc: <stable@kernel.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reported-by: "Hawrylewicz Czarnowski, Przemyslaw" <przemyslaw.hawrylewicz.czarnowski@intel.com>
      0ca69886
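      A small model of the gating described above: requests are refused until the
      array has been marked ready, with a release/acquire pair standing in for the
      memory ordering the kernel code relies on.  Names are illustrative.

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct mddev_model {
            atomic_bool ready;   /* set only once initialisation has finished */
        };

        /* Everything a request needs must be published before 'ready' can be
         * observed, hence the release store here and the acquire load below. */
        static void md_run_model(struct mddev_model *mddev)
        {
            /* ... set up the personality, queues, etc. ... */
            atomic_store_explicit(&mddev->ready, true, memory_order_release);
        }

        static int md_make_request_model(struct mddev_model *mddev)
        {
            if (!atomic_load_explicit(&mddev->ready, memory_order_acquire))
                return -1;   /* too early, e.g. a partition-table probe */
            return 0;        /* safe to hand off to the personality */
        }

        int main(void)
        {
            struct mddev_model m;

            atomic_init(&m.ready, false);
            printf("before run: %d\n", md_make_request_model(&m));
            md_run_model(&m);
            printf("after run:  %d\n", md_make_request_model(&m));
            return 0;
        }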
  6. October 28, 2010 (2 commits)
    • md: use separate bio pool for each md device. · a167f663
      Committed by NeilBrown
      bio_clone and bio_alloc allocate from a common bio pool.
      If an md device is stacked with other devices that use this pool, or under
      something like swap which uses the pool, then the multiple calls on
      the pool can cause deadlocks.
      
      So allocate a local bio pool for each md array and use that rather
      than the common pool.
      
      This pool is used both for regular IO and metadata updates.
      Signed-off-by: NeilBrown <neilb@suse.de>
      a167f663
    • md: change type of first arg to sync_page_io. · 2b193363
      Committed by NeilBrown
      Currently sync_page_io takes a 'bdev'.
      Every caller passes 'rdev->bdev'.
      We will soon want another field out of the rdev in sync_page_io,
      so just pass the rdev itself rather than the bdev taken out of it.
      Signed-off-by: NeilBrown <neilb@suse.de>
      2b193363
  7. September 10, 2010 (1 commit)
    • md: implement REQ_FLUSH/FUA support · e9c7469b
      Committed by Tejun Heo
      This patch converts md to support REQ_FLUSH/FUA instead of the now
      deprecated REQ_HARDBARRIER.  In the core part (md.c), the following
      changes are notable.
      
      * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with
        processing of other requests and thus there is no reason to mark the
        queue congested while FLUSH/FUA is in progress.
      
      * REQ_FLUSH/FUA failures are final and its users don't need retry
        logic.  Retry logic is removed.
      
      * Preflush needs to be issued to all member devices but FUA writes can
        be handled the same way as other writes - their processing can be
        deferred to request_queue of member devices.  md_barrier_request()
        is renamed to md_flush_request() and simplified accordingly.
      
      For linear, raid0 and multipath, the core changes are enough.  raid1,
      5 and 10 need the following conversions.
      
      * raid1: Handling of FLUSH/FUA bio's can simply be deferred to
        request_queues of member devices.  Barrier related logic removed.
      
      * raid5: Queue draining logic dropped.  FUA bit is propagated through
        biodrain and stripe reconstruction such that all the updated parts
        of the stripe are written out with FUA writes if any of the dirtying
        writes was FUA.  preread_active_stripes handling in make_request()
        is updated as suggested by Neil Brown.
      
      * raid10: FUA bit needs to be propagated to write clones.
      
      linear, raid0, 1, 5 and 10 tested.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Neil Brown <neilb@suse.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      e9c7469b
  8. August 30, 2010 (1 commit)
    • md: resolve confusion of MD_CHANGE_CLEAN · 070dc6dd
      Committed by NeilBrown
      MD_CHANGE_CLEAN is used for two different purposes and this leads to
      confusion.
      One of the purposes is largely mirrored by MD_CHANGE_PENDING which is
      not used for anything else, so have MD_CHANGE_PENDING take over that
      purpose fully.
      
      The two purposes are:
       1/ tell md_update_sb that an update is needed and that it is just a
         clean/dirty transition.
       2/ tell user-space that a transition from clean to dirty is pending
          (something wants to write), and tell the kernel (by clearing the
          flag) that the transition is OK.

      The first purpose remains with MD_CHANGE_CLEAN, the second is moved
      fully to MD_CHANGE_PENDING.

      This means that various places which conditionally set or cleared
      MD_CHANGE_CLEAN no longer need to be conditional.
      Signed-off-by: NeilBrown <neilb@suse.de>
      070dc6dd
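      A toy illustration of splitting one overloaded flag into two single-purpose
      bits, in the style of the kernel's bit operations; the bit numbers and
      helper names are invented for the sketch.

        #include <stdbool.h>
        #include <stdio.h>

        /* Two separate bits, each with exactly one meaning. */
        enum {
            CHANGE_CLEAN   = 0,  /* superblock update needed: clean/dirty only */
            CHANGE_PENDING = 1,  /* clean->dirty transition awaiting an ack */
        };

        static void set_bit_model(int nr, unsigned long *flags)
        {
            *flags |= 1UL << nr;
        }

        static void clear_bit_model(int nr, unsigned long *flags)
        {
            *flags &= ~(1UL << nr);
        }

        static bool test_bit_model(int nr, const unsigned long *flags)
        {
            return *flags & (1UL << nr);
        }

        int main(void)
        {
            unsigned long sb_flags = 0;

            /* something wants to write: announce the pending transition ... */
            set_bit_model(CHANGE_PENDING, &sb_flags);
            /* ... and note that a clean/dirty superblock update is needed */
            set_bit_model(CHANGE_CLEAN, &sb_flags);

            /* the transition is acknowledged, so the pending bit is cleared */
            if (test_bit_model(CHANGE_PENDING, &sb_flags))
                clear_bit_model(CHANGE_PENDING, &sb_flags);

            printf("flags now: %#lx\n", sb_flags);
            return 0;
        }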
  9. August 8, 2010 (2 commits)
    • md: fix another deadlock with removing sysfs attributes. · bb4f1e9d
      Committed by NeilBrown
      Moving the deletion of sysfs attributes from reconfig_mutex to
      open_mutex didn't really help, as a process can try to take
      open_mutex while holding reconfig_mutex, so the same deadlock can
      happen, just requiring one more process to be involved in the chain.

      It looks like I cannot easily use locking to wait for the sysfs
      deletion to complete, so don't.

      The only things that we cannot do while the deletions are still
      pending are other things which can change the sysfs namespace: run,
      takeover, stop.  Each of these can fail with -EBUSY.
      So set a flag while doing a sysfs deletion, and fail run, takeover,
      stop if that flag is set.

      This is suitable for 2.6.35.x

      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      bb4f1e9d
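      A compact model of the approach above: a flag is set for the duration of the
      sysfs deletion, and the operations that would change the sysfs namespace
      bail out with -EBUSY while it is set.  The names and the error constant are
      illustrative.

        #include <stdbool.h>
        #include <stdio.h>

        #define EBUSY_MODEL 16   /* stand-in for the kernel's EBUSY */

        struct mddev_model {
            bool sysfs_active;   /* a sysfs deletion is scheduled but unfinished */
        };

        static int do_md_run_model(struct mddev_model *mddev)
        {
            if (mddev->sysfs_active)
                return -EBUSY_MODEL;   /* would change the sysfs namespace */
            /* ... start the array ... */
            return 0;
        }

        int main(void)
        {
            struct mddev_model m = { .sysfs_active = true };

            printf("run while deletion pending: %d\n", do_md_run_model(&m));
            m.sysfs_active = false;   /* the deletion has completed */
            printf("run afterwards:             %d\n", do_md_run_model(&m));
            return 0;
        }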
    • block: unify flags for struct bio and struct request · 7b6d91da
      Committed by Christoph Hellwig
      Remove the current bio flags and reuse the request flags for the bio, too.
      This makes it easier to trace the type of I/O from the filesystem
      down to the block driver.  There were two flags in the bio that were
      missing in the requests:  BIO_RW_UNPLUG and BIO_RW_AHEAD.  Also I've
      renamed two request flags that had a superfluous RW in them.

      Note that the flags are in bio.h despite having the REQ_ name - as
      blkdev.h includes bio.h that is the only way to go for now.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      7b6d91da
  10. July 26, 2010 (8 commits)
    • md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log. · e384e585
      Committed by NeilBrown
      This allows md/raid5 to fully work as a dm target.
      
      Normally md uses a 'filemap' which contains a list of pages of bits
      each of which may be written separately.
      dm-log uses an all-or-nothing approach to writing the log, so
      when using a dm-log, ->filemap is NULL and the flags normally stored
      in filemap_attr are stored in ->logattrs instead.
      Signed-off-by: NeilBrown <neilb@suse.de>
      e384e585
    • md/bitmap: clean up plugging calls. · b63d7c2e
      Committed by NeilBrown
      1/ use md_unplug in bitmap.c as we will soon be using bitmaps under
        arrays with no queue attached.
      
      2/ Don't bother plugging the queue when we set a bit in the bitmap.
         The reason for this was to encourage as many bits as possible to
         get set before we unplug and write stuff out.
         However every personality already plugs the queue after
         bitmap_startwrite either directly (raid1/raid10) or by setting
         STRIPE_BIT_DELAY which causes the queue to be plugged later
         (raid5).
      Signed-off-by: NeilBrown <neilb@suse.de>
      b63d7c2e
    • md/bitmap: white space clean up and similar. · ac2f40be
      Committed by NeilBrown
      Fixed some whitespace problems.
      Fixed some checkpatch.pl complaints.
      Replaced kmalloc ... memset(0) with kzalloc.
      Fixed an unlikely memory leak on an error path.
      Reformatted a number of 'if/else' sets, sometimes
      replacing goto with an else clause.
      Removed some old comments and commented-out code.
      Signed-off-by: NeilBrown <neilb@suse.de>
      ac2f40be
    • md/plug: optionally use plugger to unplug an array during resync/recovery. · 252ac522
      Committed by NeilBrown
      If an array doesn't have a 'queue' then md_do_sync cannot
      unplug it.
      In that case it will have a 'plugger', so make that available
      to the mddev, and use it to unplug the array if needed.
      Signed-off-by: NeilBrown <neilb@suse.de>
      252ac522
    • md/raid5: add simple plugging infrastructure. · 2ac87401
      Committed by NeilBrown
      md/raid5 uses the plugging infrastructure provided by the block layer
      and 'struct request_queue'.  However when we plug raid5 under dm there
      is no request queue so we cannot use that.
      
      So create a similar infrastructure that is much lighter weight and use
      it for raid5.
      Signed-off-by: NeilBrown <neilb@suse.de>
      2ac87401
    • md: add support for raising dm events. · 768a418d
      Committed by NeilBrown
      dm uses scheduled work to raise events to user-space.
      So allow an md device to have work_structs and schedule them on an error.
      Signed-off-by: NeilBrown <neilb@suse.de>
      768a418d
    • md: export various start/stop interfaces · 390ee602
      Committed by NeilBrown
      Export entry points for starting and stopping md arrays.
      This will be used by a module to make md/raid5 work under
      dm.
      Also stop calling md_stop_writes from md_stop, as that won't
      work well with dm - it will want to call the two separately.
      Signed-off-by: NeilBrown <neilb@suse.de>
      390ee602
    • md: split out md_rdev_init · e8bb9a83
      Committed by NeilBrown
      This functionality will be needed separately in a subsequent patch, so
      split it into its own exported function.
      Signed-off-by: NeilBrown <neilb@suse.de>
      e8bb9a83
  11. July 21, 2010 (1 commit)
  12. June 24, 2010 (1 commit)
    • md: fix handling of array level takeover that re-arranges devices. · e93f68a1
      Committed by NeilBrown
      Most array level changes leave the list of devices largely unchanged,
      possibly causing one at the end to become redundant.
      However conversions between RAID0 and RAID10 need to renumber
      all devices (except 0).
      
      This renumbering is currently being done in the ->run method when the
      new personality takes over.  However this is too late as the common
      code in md.c might already have invalidated some of the devices if
      they had a ->raid_disk number that appeared too high.
      
      Moving it into the ->takeover method is too early as the array is
      still active at that time and wrong ->raid_disk numbers could cause
      confusion.
      
      So add a ->new_raid_disk field to mdk_rdev_s and use it to communicate
      the new raid_disk number.
      Now the common code knows exactly which devices need to be renumbered,
      and which can be invalidated, and can do it all at a convenient time
      when the array is suspended.
      It can also update some symlinks in sysfs which previously were not being
      updated correctly.
      Reported-by: Maciej Trela <maciej.trela@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      e93f68a1
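      An illustrative sketch of carrying the future slot in a separate field and
      applying it in one pass at a safe moment; the structures are simplified
      stand-ins for the real rdev handling.

        #include <stdio.h>

        struct rdev_model {
            int raid_disk;      /* current slot in the array */
            int new_raid_disk;  /* slot the new personality wants, -1 to drop */
        };

        /* Done by the common code in one pass, at a safe moment while the array
         * is suspended, so every device is renumbered (or dropped) consistently. */
        static void apply_takeover_model(struct rdev_model *rdevs, int n)
        {
            for (int i = 0; i < n; i++) {
                if (rdevs[i].new_raid_disk < 0)
                    printf("slot %d: device removed\n", rdevs[i].raid_disk);
                else if (rdevs[i].new_raid_disk != rdevs[i].raid_disk)
                    printf("slot %d: renumbered to %d\n",
                           rdevs[i].raid_disk, rdevs[i].new_raid_disk);
                rdevs[i].raid_disk = rdevs[i].new_raid_disk;
            }
        }

        int main(void)
        {
            /* e.g. a RAID0 -> RAID10 conversion renumbering all but slot 0 */
            struct rdev_model rdevs[] = {
                { .raid_disk = 0, .new_raid_disk = 0 },
                { .raid_disk = 1, .new_raid_disk = 2 },
                { .raid_disk = 2, .new_raid_disk = 4 },
            };

            apply_takeover_model(rdevs, 3);
            return 0;
        }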
  13. May 18, 2010 (5 commits)
    • md: simplify updating of event count to sometimes avoid updating spares. · a8707c08
      Committed by NeilBrown
      When updating the event count for a simple clean <-> dirty transition,
      we try to avoid updating the spares so they can safely spin-down.
      As the event_counts across an array must be +/- 1, this means
      decrementing the event_count on a dirty->clean transition.
      This is not always safe, so we have to avoid the times when it is unsafe.
      We currently do this with a misguided idea about it being safe or
      not depending on whether the event_count is odd or even.  This
      approach only works reliably in a few common instances, but easily
      falls down.
      
      So instead, simply keep internal state concerning whether it is safe
      or not, and always assume it is not safe when an array is first
      assembled.
      Signed-off-by: NeilBrown <neilb@suse.de>
      a8707c08
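      A small model of the replacement logic: rather than inferring safety from
      whether the event count is odd or even, an explicit flag records whether a
      decrement is currently safe, and it starts out pessimistic when the array is
      assembled.  Field names are illustrative.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        struct mddev_model {
            uint64_t events;
            bool can_decrease_events;  /* false when freshly assembled */
        };

        /* Clean <-> dirty superblock update: decrement rather than increment
         * when that is known to be safe, so spun-down spares stay untouched. */
        static void update_events_model(struct mddev_model *m, bool dirty_to_clean)
        {
            if (dirty_to_clean && m->can_decrease_events) {
                m->events--;
                m->can_decrease_events = false;  /* one decrement per increment */
            } else {
                m->events++;
                m->can_decrease_events = true;
            }
        }

        int main(void)
        {
            struct mddev_model m = { .events = 100, .can_decrease_events = false };

            update_events_model(&m, true);   /* not yet safe: goes up to 101 */
            update_events_model(&m, false);  /* dirty: 102, decrement now allowed */
            update_events_model(&m, true);   /* clean: back to 101, spares skipped */
            printf("events = %llu\n", (unsigned long long)m.events);
            return 0;
        }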
    • md: pass mddev to make_request functions rather than request_queue · 21a52c6d
      Committed by NeilBrown
      We used to pass the personality make_request function directly
      to the block layer, so the first argument had to be a queue.
      But now we have the intermediary md_make_request, so it makes
      a lot more sense to pass a struct mddev_s.
      It makes it possible to have an mddev without its own queue too.
      Signed-off-by: NeilBrown <neilb@suse.de>
      21a52c6d
    • md: remove ->changed and related code. · b821eaa5
      Committed by NeilBrown
      We set ->changed to 1 and call check_disk_change at the end
      of md_open so that bd_invalidated would be set and thus
      partition rescan would happen appropriately.
      
      Now that we call revalidate_disk directly, which sets bd_invalidated,
      that indirection is no longer needed and can be removed.
      Signed-off-by: NeilBrown <neilb@suse.de>
      b821eaa5
    • md: discard StateChanged device flag. · c0cc75f8
      Committed by NeilBrown
      This was needed when sysfs files could only be 'notified'
      from process context.  Now that we have sysfs_notify_dirent,
      we can call it directly from an interrupt.
      Signed-off-by: NeilBrown <neilb@suse.de>
      c0cc75f8
    • md: remove some dead fields from mddev_s · ee8b81b0
      Committed by NeilBrown
      These fields have never been used.
      commit 4b6d287f
      added them, but also added identical fields to bitmap_super_s,
      and only used the latter.

      So remove these unused fields.
      Signed-off-by: NeilBrown <neilb@suse.de>
      ee8b81b0
  14. May 17, 2010 (1 commit)
    • md: manage redundancy group in sysfs when changing level. · a64c876f
      Committed by NeilBrown
      Some levels expect the 'redundancy group' to be present,
      others don't.
      So when we change level of an array we might need to
      add or remove this group.
      
      This requires fixing up the current practice of overloading ->private
      to indicate (when ->pers == NULL) that something needs to be removed.
      So create a new ->to_remove to fill that role.
      
      When changing levels, we may need to add or remove attributes.  When
      changing RAID5 -> RAID6, we both add and remove the same thing.  It is
      important to catch this and optimise it out as the removal is delayed
      until a lock is released, so trying to add immediately would cause
      problems.
      
      
      Cc: stable@kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
      a64c876f
  15. December 14, 2009 (9 commits)
    • raid: improve MD/raid10 handling of correctable read errors. · 1e50915f
      Committed by Robert Becker
      We've noticed severe lasting performance degradation of our raid
      arrays when we have drives that yield large amounts of media errors.
      The raid10 module will queue each failed read for retry, and also
      will attempt to call fix_read_error() to perform the read recovery.
      Read recovery is performed while the array is frozen, so repeated
      recovery attempts can degrade the performance of the array for
      extended periods of time.
      
      With this patch I propose adding a per md device max number of
      corrected read attempts.  Each rdev will maintain a count of
      read correction attempts in the rdev->read_errors field (not
      used currently for raid10). When we enter fix_read_error()
      we'll check to see when the last read error occurred, and
      divide the read error count by 2 for every hour since the
      last read error. If at that point our read error count
      exceeds the read error threshold, we'll fail the raid device.
      
      In addition, this patch adds sysfs nodes (get/set) for
      the per-md max_read_errors attribute and the rdev->read_errors
      attribute, and adds some printk's to indicate when
      fix_read_error fails to repair an rdev.

      For testing I used debugfs->fail_make_request to inject
      IO errors to the rdev while doing IO to the raid array.
      Signed-off-by: Robert Becker <Rob.Becker@riverbed.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      1e50915f
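      A standalone sketch of the decay rule described above: the remembered error
      count is halved for every full hour since the last read error, the new error
      is added, and the result is compared against the per-array limit.  Names are
      illustrative, not the raid10 code.

        #include <stdio.h>
        #include <time.h>

        struct rdev_model {
            unsigned int read_errors;
            time_t last_read_error;
        };

        /* Returns nonzero if the device should now be failed. */
        static int note_read_error_model(struct rdev_model *rdev,
                                         unsigned int max_read_errors, time_t now)
        {
            unsigned int hours = 0;

            if (rdev->last_read_error)
                hours = (unsigned int)((now - rdev->last_read_error) / 3600);

            /* halve the remembered count once per hour of error-free operation */
            if (hours >= 8 * sizeof(rdev->read_errors))
                rdev->read_errors = 0;
            else
                rdev->read_errors >>= hours;

            rdev->last_read_error = now;
            rdev->read_errors++;

            return rdev->read_errors > max_read_errors;
        }

        int main(void)
        {
            struct rdev_model rdev = { .read_errors = 16, .last_read_error = 0 };
            time_t now = time(NULL);

            rdev.last_read_error = now - 2 * 3600;   /* last error two hours ago */
            if (note_read_error_model(&rdev, 20, now))
                printf("failing device\n");
            else
                printf("device kept, read_errors now %u\n", rdev.read_errors);
            return 0;
        }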
    • md: Support write-intent bitmaps with externally managed metadata. · ece5cff0
      Committed by NeilBrown
      In this case, the metadata must not be in the same
      sector as the bitmap.
      md will not read/write any bitmap metadata.  Configuration must be
      done via sysfs, and when a recovery makes the array non-degraded
      again, writing 'true' to 'bitmap/can_clear' will allow bits in
      the bitmap to be cleared again.
      Signed-off-by: NeilBrown <neilb@suse.de>
      ece5cff0
    • md: support updating bitmap parameters via sysfs. · 43a70507
      Committed by NeilBrown
      A new attribute directory 'bitmap' in 'md' is created which
      contains files for configuring the bitmap.
      'location' identifies where the bitmap is: either 'none',
      'file', or a sector offset from the metadata.
      Writing 'location' can create or remove a bitmap.
      Adding a 'file' bitmap this way is not yet supported.
      'chunksize' and 'time_base' must be set before 'location'
      can be set.
      
      'chunksize' can be set before creating a bitmap, but is
      currently always over-ridden by the bitmap superblock.
      
      'time_base' and 'backlog' can be updated at any time.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      43a70507
    • md: factor out parsing of fixed-point numbers · 72e02075
      Committed by NeilBrown
      safe_delay_store can parse fixed point numbers (for fractions
      of a second).  We will want to do that for another sysfs
      file soon, so factor out the code.
      Signed-off-by: NeilBrown <neilb@suse.de>
      72e02075
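      A self-contained sketch of the kind of fixed-point parsing being factored
      out: a string such as "1.375" becomes an integer number of milliseconds.
      The function name and the scale of three decimal places are illustrative.

        #include <ctype.h>
        #include <stdio.h>

        /* Parse "<int>[.<frac>]" into milliseconds; 0 on success, -1 on error. */
        static int parse_fixed_point_model(const char *buf, unsigned long *msec)
        {
            unsigned long value = 0;
            int decimals = -1;   /* stays -1 until a '.' has been seen */

            for (; *buf && !isspace((unsigned char)*buf); buf++) {
                if (*buf == '.' && decimals < 0) {
                    decimals = 0;
                    continue;
                }
                if (!isdigit((unsigned char)*buf))
                    return -1;
                value = value * 10 + (unsigned long)(*buf - '0');
                if (decimals >= 0)
                    decimals++;
            }
            if (decimals < 0)
                decimals = 0;
            while (decimals < 3) {   /* scale to thousandths of a second */
                value *= 10;
                decimals++;
            }
            while (decimals > 3) {
                value /= 10;
                decimals--;
            }
            *msec = value;
            return 0;
        }

        int main(void)
        {
            unsigned long msec;

            if (parse_fixed_point_model("1.375", &msec) == 0)
                printf("1.375 s -> %lu ms\n", msec);   /* prints 1375 */
            return 0;
        }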
    • md: support bitmap offset appropriate for external-metadata arrays. · f6af949c
      Committed by NeilBrown
      For md arrays where metadata is managed externally, the kernel does not
      know about a superblock, so the superblock offset is 0.
      If we want to have a write-intent-bitmap near the end of the
      devices of such an array, we should support a sector_t sized offset.
      We need the offset to be possibly negative, for when the bitmap is before
      the metadata, so use loff_t instead.

      Also add a sanity check that the bitmap does not overlap with the data.
      Signed-off-by: NeilBrown <neilb@suse.de>
      f6af949c
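      A short sketch of the overlap sanity check implied above, using a signed
      offset so the bitmap may sit before the metadata; the field names and units
      are illustrative.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Signed, like loff_t, so the bitmap may sit before the metadata. */
        typedef int64_t offset_model_t;

        /* All values in 512-byte sectors; 'offset' is relative to the superblock. */
        static bool bitmap_overlaps_data_model(offset_model_t sb_start,
                                               offset_model_t offset,
                                               offset_model_t bitmap_sectors,
                                               offset_model_t data_start,
                                               offset_model_t data_sectors)
        {
            offset_model_t bm_start = sb_start + offset;
            offset_model_t bm_end = bm_start + bitmap_sectors;

            return bm_start < data_start + data_sectors && bm_end > data_start;
        }

        int main(void)
        {
            /* a bitmap placed 8 sectors past a superblock near the device's end */
            if (bitmap_overlaps_data_model(1953525008, 8, 64, 0, 1953525000))
                printf("rejected: bitmap overlaps data\n");
            else
                printf("layout accepted\n");
            return 0;
        }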
    • md: move offset, daemon_sleep and chunksize out of bitmap structure · 42a04b50
      Committed by NeilBrown
      ... and into bitmap_info.  These are all configuration parameters
      that need to be set before the bitmap is created.
      Signed-off-by: NeilBrown <neilb@suse.de>
      42a04b50
    • md: collect bitmap-specific fields into one structure. · c3d9714e
      Committed by NeilBrown
      In preparation for making bitmap fields configurable via sysfs,
      start tidying up by making a single structure to contain the
      configuration fields.
      Signed-off-by: NeilBrown <neilb@suse.de>
      c3d9714e
    • md: support barrier requests on all personalities. · a2826aa9
      Committed by NeilBrown
      Previously barriers were only supported on RAID1.  This is because
      other levels require synchronisation across all devices and so needed
      a different approach.
      Here is that approach.
      
      When a barrier arrives, we send a zero-length barrier to every active
      device.  When that completes - and if the original request was not
      empty -  we submit the barrier request itself (with the barrier flag
      cleared) and then submit a fresh load of zero length barriers.
      
      The barrier request itself is asynchronous, but any subsequent
      request will block until the barrier completes.
      
      The reason for clearing the barrier flag is that a barrier request is
      allowed to fail.  If we pass a non-empty barrier through a striping
      raid level it is conceivable that part of it could succeed and part
      could fail.  That would be way too hard to deal with.
      So if the first run of zero length barriers succeed, we assume all is
      sufficiently well that we send the request and ignore errors in the
      second run of barriers.
      
      RAID5 needs extra care as write requests may not have been submitted
      to the underlying devices yet.  So we flush the stripe cache before
      proceeding with the barrier.
      
      Note that the second set of zero-length barriers are submitted
      immediately after the original request is submitted.  Thus when
      a personality finds mddev->barrier to be set during make_request,
      it should not return from make_request until the corresponding
      per-device request(s) have been queued.
      
      That will be done in later patches.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Andre Noll <maan@systemlinux.org>
      a2826aa9
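      A condensed model of the sequencing described above: one round of
      zero-length barriers to every active device, then (if the original request
      carried data) the request itself with the barrier flag cleared, then a
      second round of zero-length barriers whose errors are ignored.  The helpers
      are stubs standing in for real device submission.

        #include <stdbool.h>
        #include <stdio.h>

        struct bio_model {
            int size;      /* 0 means a pure barrier carrying no data */
            bool barrier;
        };

        static void submit_zero_length_barrier_model(int dev)
        {
            printf("dev %d: zero-length barrier\n", dev);
        }

        static void submit_to_devices_model(struct bio_model *bio, int ndevs)
        {
            printf("submitting %d-byte request (barrier=%d) to %d devices\n",
                   bio->size, bio->barrier, ndevs);
        }

        static void handle_barrier_request_model(struct bio_model *bio, int ndevs)
        {
            /* 1. a zero-length barrier to every active member device */
            for (int d = 0; d < ndevs; d++)
                submit_zero_length_barrier_model(d);

            if (bio->size) {
                /* 2. the request itself, barrier flag cleared because a barrier
                 *    that partly succeeds across a striped layout would be too
                 *    hard to handle */
                bio->barrier = false;
                submit_to_devices_model(bio, ndevs);

                /* 3. a second round of zero-length barriers fences the request;
                 *    errors in this round are ignored */
                for (int d = 0; d < ndevs; d++)
                    submit_zero_length_barrier_model(d);
            }
        }

        int main(void)
        {
            struct bio_model bio = { .size = 4096, .barrier = true };

            handle_barrier_request_model(&bio, 3);
            return 0;
        }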
    • md/bitmap: protect against bitmap removal while being updated. · aa5cbd10
      Committed by NeilBrown
      A write intent bitmap can be removed from an array while the
      array is active.
      When this happens, all IO is suspended and flushed before the
      bitmap is removed.
      However it is possible that bitmap_daemon_work is still running to
      clear old bits from the bitmap.  If it is, it can dereference the
      bitmap after it has been freed.
      
      So introduce a new mutex to protect bitmap_daemon_work and get it
      before destroying a bitmap.
      
      This is suitable for any current -stable kernel.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: stable@kernel.org
      aa5cbd10
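      A minimal pthread-based illustration of the fix: the periodic worker and the
      destroy path take the same mutex, and the worker re-checks the pointer under
      the lock so it can never touch a freed bitmap.  All names are illustrative.

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct bitmap_model {
            int pending_bits;
        };

        struct mddev_model {
            pthread_mutex_t bitmap_info_mutex;
            struct bitmap_model *bitmap;
        };

        static void bitmap_daemon_work_model(struct mddev_model *mddev)
        {
            pthread_mutex_lock(&mddev->bitmap_info_mutex);
            if (mddev->bitmap)                    /* may have been removed */
                mddev->bitmap->pending_bits = 0;  /* clear old bits */
            pthread_mutex_unlock(&mddev->bitmap_info_mutex);
        }

        static void bitmap_destroy_model(struct mddev_model *mddev)
        {
            pthread_mutex_lock(&mddev->bitmap_info_mutex);
            free(mddev->bitmap);                  /* worker cannot be mid-access */
            mddev->bitmap = NULL;
            pthread_mutex_unlock(&mddev->bitmap_info_mutex);
        }

        int main(void)
        {
            struct mddev_model m = { .bitmap = NULL };

            pthread_mutex_init(&m.bitmap_info_mutex, NULL);
            m.bitmap = calloc(1, sizeof(*m.bitmap));
            bitmap_daemon_work_model(&m);
            bitmap_destroy_model(&m);
            bitmap_daemon_work_model(&m);   /* safe: sees NULL, does nothing */
            printf("done\n");
            return 0;
        }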
  16. September 23, 2009 (1 commit)