1. 03 8月, 2009 5 次提交
    • N
      md: Use revalidate_disk to effect changes in size of device. · 449aad3e
      NeilBrown 提交于
      As revalidate_disk calls check_disk_size_change, it will cause
      any capacity change of a gendisk to be propagated to the blockdev
      inode.  So use that instead of mucking about with locks and
      i_size_write.
      
      Also add a call to revalidate_disk in do_md_run and a few other places
      where the gendisk capacity is changed.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      449aad3e
    • N
      md: Handle growth of v1.x metadata correctly. · 70471daf
      NeilBrown 提交于
      The v1.x metadata does not have a fixed size and can grow
      when devices are added.
      If it grows enough to require an extra sector of storage,
      we need to update the 'sb_size' to match.
      
      Without this, md can write out an incomplete superblock with a
      bad checksum, which will be rejected when trying to re-assemble
      the array.
      
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      70471daf
    • N
      md: avoid array overflow with bad v1.x metadata · 3673f305
      NeilBrown 提交于
      We trust the 'desc_nr' field in v1.x metadata enough to use it
      as an index in an array.  This isn't really safe.
      So range-check the value first.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      3673f305
    • N
      md: when a level change reduces the number of devices, remove the excess. · 3a981b03
      NeilBrown 提交于
      When an array is changed from RAID6 to RAID5, fewer drives are
      needed.  So any device that is made superfluous by the level
      conversion must be marked as not-active.
      For the RAID6->RAID5 conversion, this will be a drive which only
      has 'Q' blocks on it.
      
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      3a981b03
    • A
      md: Push down data integrity code to personalities. · ac5e7113
      Andre Noll 提交于
      This patch replaces md_integrity_check() by two new public functions:
      md_integrity_register() and md_integrity_add_rdev() which are both
      personality-independent.
      
      md_integrity_register() is called from the ->run and ->hot_remove
      methods of all personalities that support data integrity.  The
      function iterates over the component devices of the array and
      determines if all active devices are integrity capable and if their
      profiles match. If this is the case, the common profile is registered
      for the mddev via blk_integrity_register().
      
      The second new function, md_integrity_add_rdev() is called from the
      ->hot_add_disk methods, i.e. whenever a new device is being added
      to a raid array. If the new device does not support data integrity,
      or has a profile different from the one already registered, data
      integrity for the mddev is disabled.
      
      For raid0 and linear, only the call to md_integrity_register() from
      the ->run method is necessary.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      ac5e7113
  2. 09 7月, 2009 1 次提交
  3. 01 7月, 2009 4 次提交
  4. 18 6月, 2009 8 次提交
    • A
      md: Move check for bitmap presence to personality code. · 0894cc30
      Andre Noll 提交于
      If the superblock of a component device indicates the presence of a
      bitmap but the corresponding raid personality does not support bitmaps
      (raid0, linear, multipath, faulty), then something is seriously wrong
      and we'd better refuse to run such an array.
      
      Currently, this check is performed while the superblocks are examined,
      i.e. before entering personality code. Therefore the generic md layer
      must know which raid levels support bitmaps and which do not.
      
      This patch avoids this layer violation without adding identical code
      to various personalities. This is accomplished by introducing a new
      public function to md.c, md_check_no_bitmap(), which replaces the
      hard-coded checks in the superblock loading functions.
      
      A call to md_check_no_bitmap() is added to the ->run method of each
      personality which does not support bitmaps and assembly is aborted
      if at least one component device contains a bitmap.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      0894cc30
    • N
      md: remove chunksize rounding from common code. · 8190e754
      NeilBrown 提交于
      It is easiest to round sizes to multiples of chunk size in
      the personality code for those personalities which care.
      Those personalities now do the rounding, so we can
      remove that function from common code.
      
      Also remove the upper bound on the size of a chunk, and the lower
      bound on the size of a device (1 chunk), neither of which really buy
      us anything.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      8190e754
    • N
      md: move assignment of ->utime so that it never gets skipped. · 1b57f132
      NeilBrown 提交于
      Currently the assignment to utime gets skipped for 'external'
      metadata.  So move it to the top of the function so that it
      always gets effected.
      This is of largely cosmetic interest.  Nothing actually depends
      on ->utime being right for external arrays.
      "mdadm --monitor" does use it for 0.90 and 1.x arrays, but with
      mdadm-3.0, this is not important for external metadata.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      1b57f132
    • A
      md: Push down reconstruction log message to personality code. · 8c6ac868
      Andre Noll 提交于
      Currently, the md layer checks in analyze_sbs() if the raid level
      supports reconstruction (mddev->level >= 1) and if reconstruction is
      in progress (mddev->recovery_cp != MaxSector).
      
      Move that printk into the personality code of those raid levels that
      care (levels 1, 4, 5, 6, 10).
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      8c6ac868
    • N
      md: merge reconfig and check_reshape methods. · 50ac168a
      NeilBrown 提交于
      The difference between these two methods is artificial.
      Both check that a pending reshape is valid, and perform any
      aspect of it that can be done immediately.
      'reconfig' handles chunk size and layout.
      'check_reshape' handles raid_disks.
      
      So make them just one method.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      50ac168a
    • N
      md: remove unnecessary arguments from ->reconfig method. · 597a711b
      NeilBrown 提交于
      Passing the new layout and chunksize as args is not necessary as
      the mddev has fields for new_check and new_layout.
      
      This is preparation for combining the check_reshape and reconfig
      methods
      Signed-off-by: NNeilBrown <neilb@suse.de>
      597a711b
    • A
      md: Convert mddev->new_chunk to sectors. · 664e7c41
      Andre Noll 提交于
      A straight-forward conversion which gets rid of some
      multiplications/divisions/shifts. The patch also introduces a couple
      of new ones, most of which are due to conf->chunk_size still being
      represented in bytes. This will be cleaned up in subsequent patches.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      664e7c41
    • A
      md: Make mddev->chunk_size sector-based. · 9d8f0363
      Andre Noll 提交于
      This patch renames the chunk_size field to chunk_sectors with the
      implied change of semantics.  Since
      
      	is_power_of_2(chunk_size) = is_power_of_2(chunk_sectors << 9)
      				  = is_power_of_2(chunk_sectors)
      
      these bits don't need an adjustment for the shift.
      Signed-off-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      9d8f0363
  5. 16 6月, 2009 1 次提交
  6. 26 5月, 2009 5 次提交
  7. 23 5月, 2009 1 次提交
  8. 07 5月, 2009 4 次提交
    • N
      md: remove rd%d links immediately after stopping an array. · c4647292
      NeilBrown 提交于
      md maintains link in sys/mdXX/md/ to identify which device has
      which role in the array. e.g.
         rd2 -> dev-sda
      
      indicates that the device with role '2' in the array is sda.
      
      These links are only present when the array is active.  They are
      created immediately after ->run is called, and so should be removed
      immediately after ->stop is called.
      However they are currently removed a little bit later, and it is
      possible for ->run to be called again, thus adding these links, before
      they are removed.
      
      So move the removal earlier so they are consistently only present when
      the array is active.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c4647292
    • N
      md: remove ability to explicit set an inactive array to 'clean'. · 5bf29597
      NeilBrown 提交于
      Being able to write 'clean' to an 'array_state' of an inactive array
      to activate it in 'clean' mode is both unnecessary and inconvenient.
      
      It is unnecessary because the same can be achieved by writing
      'active'.  This activates and array, but it still remains 'clean'
      until the first write.
      
      It is inconvenient because writing 'clean' is more often used to
      cause an 'active' array to revert to 'clean' mode (thus blocking
      any writes until a 'write-pending' is promoted to 'active').
      
      Allowing 'clean' to both activate an array and mark an active array as
      clean can lead to races:  One program writes 'clean' to mark the
      active array as clean at the same time as another program writes
      'inactive' to deactivate (stop) and active array.  Depending on which
      writes first, the array could be deactivated and immediately
      reactivated which isn't what was desired.
      
      So just disable the use of 'clean' to activate an array.
      
      This avoids a race that can be triggered with mdadm-3.0 and external
      metadata, so it suitable for -stable.
      Reported-by: NRafal Marszewski <rafal.marszewski@intel.com>
      Acked-by: NDan Williams <dan.j.williams@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5bf29597
    • J
      md: constify VFTs · 110518bc
      Jan Engelhardt 提交于
      Signed-off-by: NJan Engelhardt <jengelh@medozas.de>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      110518bc
    • N
      md: tidy up status_resync to handle large arrays. · dd71cf6b
      NeilBrown 提交于
      Two problems in status_resync.
      1/ It still used Kilobytes as the basic block unit, while most code
         now uses sectors uniformly.
      2/ It doesn't allow for the possibility that max_sectors exceeds
         the range of "unsigned long".
      
      So
       - change "max_blocks" to "max_sectors", and store sector numbers
         in there and in 'resync'
       - Make 'rt' a 'sector_t' so it can temporarily hold the number of
         remaining sectors.
       - use sector_div rather than normal division.
       - change the magic '100' used to preserve precision to '32'.
         + making it a power of 2 makes division easier
         + it doesn't need to be as large as it was chosen when we averaged
           speed over the entire run.  Now we average speed over the last 30
           seconds or so.
      Reported-by: N"Mario 'BitKoenig' Holbe" <Mario.Holbe@TU-Ilmenau.DE>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      dd71cf6b
  9. 17 4月, 2009 1 次提交
    • N
      md: update sync_completed and reshape_position even more often. · c03f6a19
      NeilBrown 提交于
      There are circumstances when a user-space process might need to
      "oversee" a resync/reshape process.  For example when doing an
      in-place reshape of a raid5, it is prudent to take a backup of each
      section before reshaping it as this is the only way to provide
      safety against an unplanned shutdown (i.e. crash/power failure).
      
      The sync_max sysfs value can be used to stop the resync from
      advancing beyond a particular point.
      So user-space can:
        suspend IO to the first section and back it up
        set 'sync_max' to the end of the section
        wait for 'sync_completed' to reach that point
        resume IO on the first section and move on to the next section.
      
      However this process requires the kernel and user-space to run in
      lock-step which could introduce unnecessary delays.
      
      It would be better if a 'double buffered' approach could be used with
      userspace and kernel space working on different sections with the
      'next' section always ready when the 'current' section is finished.
      
      One problem with implementing this is that sync_completed is only
      guaranteed to be updated when the sync process reaches sync_max.
      (it is updated on a time basis at other times, but it is hard to rely
      on that).  This defeats some of the double buffering.
      
      With this patch, sync_completed (and reshape_position) get updated as
      the current position approaches sync_max, so there is room for
      userspace to advance sync_max early without losing updates.
      
      To be precise, sync_completed is updated when the current sync
      position reaches half way between the current value of sync_completed
      and the value of sync_max.  This will usually be a good time for user
      space to update sync_max.
      
      If sync_max does not get updated, the updates to sync_completed
      (together with associated metadata updates) will occur at an
      exponentially increasing frequency which will get unreasonably fast
      (one update every page) immediately before the process hits sync_max
      and stops.  So the update rate will be unreasonably fast only for an
      insignificant period of time.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c03f6a19
  10. 14 4月, 2009 2 次提交
    • N
      md: improve usefulness and accuracy of sysfs file md/sync_completed. · acb180b0
      NeilBrown 提交于
      The sync_completed file reports how much of a resync (or recovery or
      reshape) has been completed.
      However due to the possibility of out-of-order completion of writes,
      it is not certain to be accurate.
      
      We have an internal value - mddev->curr_resync_completed - which is an
      accurate value (though it might not always be quite so uptodate).
      
      So:
       - make curr_resync_completed be uptodate a little more often,
         particularly when raid5 reshape updates status in the metadata
       - report curr_resync_completed in the sysfs file
       - allow poll/select to report all updates to md/sync_completed.
      
      This makes sync_completed completed usable by any external metadata
      handler that wants to record this status information in its metadata.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      acb180b0
    • N
      md: allow setting newly added device to 'in_sync' via sysfs. · 6d56e278
      NeilBrown 提交于
      When adding devices to an active array via sysfs, there is currently
      no way to mark a device as 'in-sync' which is useful when
      incrementally assembling an array.
      
      So add that option.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      6d56e278
  11. 31 3月, 2009 8 次提交
    • N
      md: don't display meaningless values in sysfs files resync_start and sync_speed · d1a7c503
      NeilBrown 提交于
      When no resync if happening, both of these files currently have
      meaningless values (is slightly different ways).
      Change them to "none" in that case.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d1a7c503
    • N
      md: add explicit method to signal the end of a reshape. · cea9c228
      NeilBrown 提交于
      Currently raid5 (the only module that supports restriping)
      notices that the reshape has finished be sync_request being
      given a large value, and handles any cleanup them.
      
      This patch changes it so md_check_recovery calls into an
      explicit finish_reshape method as well.
      
      The clean-up from sync_request can do things that need to be
      done promptly, typically things local to the raid5_conf_t
      structure.
      
      The "finish_reshape" method is called under the mddev_lock
      so it can do things involving reconfiguring the device.
      
      This allows us to get rid of md_set_array_sectors_locked, which
      would have caused a deadlock if you tried to stop and array
      while a reshape was happening.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      cea9c228
    • D
      md: 'array_size' sysfs attribute · b522adcd
      Dan Williams 提交于
      Allow userspace to set the size of the array according to the following
      semantics:
      
      1/ size must be <= to the size returned by mddev->pers->size(mddev, 0, 0)
         a) If size is set before the array is running, do_md_run will fail
            if size is greater than the default size
         b) A reshape attempt that reduces the default size to less than the set
            array size should be blocked
      2/ once userspace sets the size the kernel will not change it
      3/ writing 'default' to this attribute returns control of the size to the
         kernel and reverts to the size reported by the personality
      
      Also, convert locations that need to know the default size from directly
      reading ->array_sectors to <pers>_size.  Resync/reshape operations
      always follow the default size.
      
      Finally, fixup other locations that read a number of 1k-blocks from
      userspace to use strict_blocks_to_sectors() which checks for unsigned
      long long to sector_t overflow and blocks to sectors overflow.
      Reviewed-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      b522adcd
    • D
      md: centralize ->array_sectors modifications · 1f403624
      Dan Williams 提交于
      Get personalities out of the business of directly modifying
      ->array_sectors.  Lays groundwork to introduce policy on when
      ->array_sectors can be modified.
      Reviewed-by: NAndre Noll <maan@systemlinux.org>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      1f403624
    • N
      md/raid5: allow layout/chunksize to be changed on an active 2-drive raid5. · b3546035
      NeilBrown 提交于
      2-drive raid5's aren't very interesting.  But if you are converting
      a raid1 into a raid5, you will at least temporarily have one.  And
      that it a good time to set the layout/chunksize for the new RAID5
      if you aren't happy with the defaults.
      
      layout and chunksize don't actually affect the placement of data
      on a 2-drive raid5, so we just do some internal book-keeping.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b3546035
    • N
      md: add ->takeover method to support changing the personality managing an array · 245f46c2
      NeilBrown 提交于
      Implement this for RAID6 to be able to 'takeover' a RAID5 array.  The
      new RAID6 will use a layout which places Q on the last device, and
      that device will be missing.
      If there are any available spares, one will immediately have Q
      recovered onto it.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      245f46c2
    • N
      md: enable suspend/resume of md devices. · 409c57f3
      NeilBrown 提交于
      To be able to change the 'level' of an md/raid array, we need to
      suspend the device so that no requests are active - then move some
      pointers around etc.
      
      The code already keeps counts of active requests and the ->quiesce
      function can be used to wait until those counts hit zero.
      However the quiesce function blocks new requests once they are all
      ready 'inside' the personality module, and that is too late if we want
      to replace the personality modules.
      
      So make all md requests come in through a common md_make_request
      function that keeps track of how many requests have entered the
      modules but may not yet be on the internal reference counts.
      Allow md_make_request to be blocked when we want to suspend the
      device, and make it possible to wait for all those in-transit requests
      to be added to internal lists so that ->quiesce can wait for them.
      
      There is still a problem that when a request completes, we drop the
      ref count inside the personality code so there is a short time between
      when the refcount hits zero, and when the personality code is no
      longer being used.
      The personality code never blocks (schedule or spinlock) between
      dropping the refcount and exiting the routine, so this should be safe
      (as put_module calls synchronize_sched() before unmapping the module
      code).
      Signed-off-by: NNeilBrown <neilb@suse.de>
      409c57f3
    • N
      md: md_unregister_thread should cope with being passed NULL · e0cf8f04
      NeilBrown 提交于
      Mostly md_unregister_thread is only called when we know that the
      thread is NULL, but sometimes we need to check first.  It is safer
      to put the check inside md_unregister_thread itself.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      e0cf8f04