1. 09 9月, 2009 1 次提交
    • D
      dmaengine: add fence support · 0403e382
      Dan Williams 提交于
      Some engines optimize operation by reading ahead in the descriptor chain
      such that descriptor2 may start execution before descriptor1 completes.
      If descriptor2 depends on the result from descriptor1 then a fence is
      required (on descriptor2) to disable this optimization.  The async_tx
      api could implicitly identify dependencies via the 'depend_tx'
      parameter, but that would constrain cases where the dependency chain
      only specifies a completion order rather than a data dependency.  So,
      provide an ASYNC_TX_FENCE to explicitly identify data dependencies.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      0403e382
  2. 30 8月, 2009 13 次提交
  3. 15 7月, 2009 1 次提交
  4. 09 6月, 2009 3 次提交
    • N
      md/raid5: fix bug in reshape code when chunk_size decreases. · 0e6e0271
      NeilBrown 提交于
      Now that we support changing the chunksize, we calculate
      "reshape_sectors" to be the max of number of sectors in old
      and new chunk size.
      However there is one please where we still use 'chunksize'
      rather than 'reshape_sectors'.
      This causes a reshape that reduces the size of chunks to freeze.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      0e6e0271
    • N
      md/raid5 - avoid deadlocks in get_active_stripe during reshape · a8c906ca
      NeilBrown 提交于
      md has functionality to 'quiesce' and array so that all pending
      IO completed and no new IO starts.  This is used to achieve a
      stable state before making internal changes.
      
      Currently this quiescing applies equally to normal IO, resync
      IO, and reshape IO.
      However there is a problem with applying it to reshape IO.
      Reshape can have multiple 'stripe_heads' that must be active together.
      If the quiesce come between allocating the first and the last of
      such a collection, then we deadlock, as the last will not be allocated
      until the quiesce is lifted, the quiesce will not be lifted until the
      first (which has been allocated) gets used, and that first cannot be
      used until the last is allocated.
      
      It is not necessary to inhibit reshape IO when a quiesce is
      requested.  Those places in the code that require a full quiesce will
      ensure the reshape thread is not running at all.
      
      So allow reshape requests to get access to new stripe_heads without
      being blocked by a 'quiesce'.
      
      This only affects in-place reshapes (i.e. where the array does not
      grow or shrink) and these are only newly supported.  So this patch is
      not needed in earlier kernels.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      a8c906ca
    • N
      md/raid5: use conf->raid_disks in preference to mddev->raid_disk · f001a70c
      NeilBrown 提交于
      mddev->raid_disks can be changed and any time by a request from
      user-space.  It is a suggestion as to what number of raid_disks is
      desired.
      
      conf->raid_disks can only be changed by the raid5 module with suitable
      locks in place.  It is a statement as to the current number of
      raid_disks.
      
      There are two places where the latter should be used, but the former
      is used.  This can lead to a crash when reshaping an array.
      
      This patch changes to mddev-> to conf->
      Signed-off-by: NNeilBrown <neilb@suse.de>
      f001a70c
  5. 04 6月, 2009 2 次提交
    • D
      async_tx: structify submission arguments, add scribble · a08abd8c
      Dan Williams 提交于
      Prepare the api for the arrival of a new parameter, 'scribble'.  This
      will allow callers to identify scratchpad memory for dma address or page
      address conversions.  As this adds yet another parameter, take this
      opportunity to convert the common submission parameters (flags,
      dependency, callback, and callback argument) into an object that is
      passed by reference.
      
      Also, take this opportunity to fix up the kerneldoc and add notes about
      the relevant ASYNC_TX_* flags for each routine.
      
      [ Impact: moves api pass-by-value parameters to a pass-by-reference struct ]
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Acked-by: NMaciej Sosnowski <maciej.sosnowski@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      a08abd8c
    • D
      async_tx: kill ASYNC_TX_DEP_ACK flag · 88ba2aa5
      Dan Williams 提交于
      In support of inter-channel chaining async_tx utilizes an ack flag to
      gate whether a dependent operation can be chained to another.  While the
      flag is not set the chain can be considered open for appending.  Setting
      the ack flag closes the chain and flags the descriptor for garbage
      collection.  The ASYNC_TX_DEP_ACK flag essentially means "close the
      chain after adding this dependency".  Since each operation can only have
      one child the api now implicitly sets the ack flag at dependency
      submission time.  This removes an unnecessary management burden from
      clients of the api.
      
      [ Impact: clean up and enforce one dependency per operation ]
      Reviewed-by: NAndre Noll <maan@systemlinux.org>
      Acked-by: NMaciej Sosnowski <maciej.sosnowski@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      88ba2aa5
  6. 27 5月, 2009 1 次提交
  7. 26 5月, 2009 7 次提交
  8. 07 5月, 2009 7 次提交
    • N
      md: remove rd%d links immediately after stopping an array. · c4647292
      NeilBrown 提交于
      md maintains link in sys/mdXX/md/ to identify which device has
      which role in the array. e.g.
         rd2 -> dev-sda
      
      indicates that the device with role '2' in the array is sda.
      
      These links are only present when the array is active.  They are
      created immediately after ->run is called, and so should be removed
      immediately after ->stop is called.
      However they are currently removed a little bit later, and it is
      possible for ->run to be called again, thus adding these links, before
      they are removed.
      
      So move the removal earlier so they are consistently only present when
      the array is active.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c4647292
    • N
      md: remove ability to explicit set an inactive array to 'clean'. · 5bf29597
      NeilBrown 提交于
      Being able to write 'clean' to an 'array_state' of an inactive array
      to activate it in 'clean' mode is both unnecessary and inconvenient.
      
      It is unnecessary because the same can be achieved by writing
      'active'.  This activates and array, but it still remains 'clean'
      until the first write.
      
      It is inconvenient because writing 'clean' is more often used to
      cause an 'active' array to revert to 'clean' mode (thus blocking
      any writes until a 'write-pending' is promoted to 'active').
      
      Allowing 'clean' to both activate an array and mark an active array as
      clean can lead to races:  One program writes 'clean' to mark the
      active array as clean at the same time as another program writes
      'inactive' to deactivate (stop) and active array.  Depending on which
      writes first, the array could be deactivated and immediately
      reactivated which isn't what was desired.
      
      So just disable the use of 'clean' to activate an array.
      
      This avoids a race that can be triggered with mdadm-3.0 and external
      metadata, so it suitable for -stable.
      Reported-by: NRafal Marszewski <rafal.marszewski@intel.com>
      Acked-by: NDan Williams <dan.j.williams@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5bf29597
    • J
      md: constify VFTs · 110518bc
      Jan Engelhardt 提交于
      Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      110518bc
    • N
      md: tidy up status_resync to handle large arrays. · dd71cf6b
      NeilBrown 提交于
      Two problems in status_resync.
      1/ It still used Kilobytes as the basic block unit, while most code
         now uses sectors uniformly.
      2/ It doesn't allow for the possibility that max_sectors exceeds
         the range of "unsigned long".
      
      So
       - change "max_blocks" to "max_sectors", and store sector numbers
         in there and in 'resync'
       - Make 'rt' a 'sector_t' so it can temporarily hold the number of
         remaining sectors.
       - use sector_div rather than normal division.
       - change the magic '100' used to preserve precision to '32'.
         + making it a power of 2 makes division easier
         + it doesn't need to be as large as it was chosen when we averaged
           speed over the entire run.  Now we average speed over the last 30
           seconds or so.
      Reported-by: N"Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      dd71cf6b
    • N
      md: fix some (more) errors with bitmaps on devices larger than 2TB. · db305e50
      NeilBrown 提交于
      If a write intent bitmap covers more than 2TB, we sometimes work with
      values beyond 32bit, so these need to be sector_t.  This patches
      add the required casts to some unsigned longs that are being shifted
      up.
      
      This will affect any raid10 larger than 2TB, or any raid1/4/5/6 with
      member devices that are larger than 2TB.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Reported-by: N"Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
      Cc: stable@kernel.org
      db305e50
    • N
      md/raid10: don't clear bitmap during recovery if array will still be degraded. · 18055569
      NeilBrown 提交于
      If we have a raid10 with multiple missing devices, and we recover just
      one of these to a spare, then we risk (depending on the bitmap and
      array chunk size) clearing bits of the bitmap for which recovery isn't
      complete (because a device is still missing).
      
      This can lead to a subsequent "re-add" being recovered without
      any IO happening, which would result in loss of data.
      
      This patch takes the safe approach of not clearing bitmap bits
      if the array will still be degraded.
      
      This patch is suitable for all active -stable kernels.
      
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      18055569
    • N
      md: fix loading of out-of-date bitmap. · b74fd282
      NeilBrown 提交于
      When md is loading a bitmap which it knows is out of date, it fills
      each page with 1s and writes it back out again.  However the
      write_page call makes used of bitmap->file_pages and
      bitmap->last_page_size which haven't been set correctly yet.  So this
      can sometimes fail.
      
      Move the setting of file_pages and last_page_size to before the call
      to write_page.
      
      This bug can cause the assembly on an array to fail, thus making the
      data inaccessible.  Hence I think it is a suitable candidate for
      -stable.
      
      Cc: stable@kernel.org
      Reported-by: NVojtech Pavlik <vojtech@suse.cz>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b74fd282
  9. 20 4月, 2009 1 次提交
  10. 17 4月, 2009 1 次提交
    • N
      md: update sync_completed and reshape_position even more often. · c03f6a19
      NeilBrown 提交于
      There are circumstances when a user-space process might need to
      "oversee" a resync/reshape process.  For example when doing an
      in-place reshape of a raid5, it is prudent to take a backup of each
      section before reshaping it as this is the only way to provide
      safety against an unplanned shutdown (i.e. crash/power failure).
      
      The sync_max sysfs value can be used to stop the resync from
      advancing beyond a particular point.
      So user-space can:
        suspend IO to the first section and back it up
        set 'sync_max' to the end of the section
        wait for 'sync_completed' to reach that point
        resume IO on the first section and move on to the next section.
      
      However this process requires the kernel and user-space to run in
      lock-step which could introduce unnecessary delays.
      
      It would be better if a 'double buffered' approach could be used with
      userspace and kernel space working on different sections with the
      'next' section always ready when the 'current' section is finished.
      
      One problem with implementing this is that sync_completed is only
      guaranteed to be updated when the sync process reaches sync_max.
      (it is updated on a time basis at other times, but it is hard to rely
      on that).  This defeats some of the double buffering.
      
      With this patch, sync_completed (and reshape_position) get updated as
      the current position approaches sync_max, so there is room for
      userspace to advance sync_max early without losing updates.
      
      To be precise, sync_completed is updated when the current sync
      position reaches half way between the current value of sync_completed
      and the value of sync_max.  This will usually be a good time for user
      space to update sync_max.
      
      If sync_max does not get updated, the updates to sync_completed
      (together with associated metadata updates) will occur at an
      exponentially increasing frequency which will get unreasonably fast
      (one update every page) immediately before the process hits sync_max
      and stops.  So the update rate will be unreasonably fast only for an
      insignificant period of time.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c03f6a19
  11. 15 4月, 2009 1 次提交
  12. 14 4月, 2009 2 次提交
    • N
      md: improve usefulness and accuracy of sysfs file md/sync_completed. · acb180b0
      NeilBrown 提交于
      The sync_completed file reports how much of a resync (or recovery or
      reshape) has been completed.
      However due to the possibility of out-of-order completion of writes,
      it is not certain to be accurate.
      
      We have an internal value - mddev->curr_resync_completed - which is an
      accurate value (though it might not always be quite so uptodate).
      
      So:
       - make curr_resync_completed be uptodate a little more often,
         particularly when raid5 reshape updates status in the metadata
       - report curr_resync_completed in the sysfs file
       - allow poll/select to report all updates to md/sync_completed.
      
      This makes sync_completed completed usable by any external metadata
      handler that wants to record this status information in its metadata.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      acb180b0
    • N
      md: allow setting newly added device to 'in_sync' via sysfs. · 6d56e278
      NeilBrown 提交于
      When adding devices to an active array via sysfs, there is currently
      no way to mark a device as 'in-sync' which is useful when
      incrementally assembling an array.
      
      So add that option.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      6d56e278