1. 21 5月, 2012 4 次提交
    • N
      md: teach sync_page_io about new_data_offset. · 1fdd6fc9
      NeilBrown 提交于
      Some code in raid1 and raid10 use sync_page_io to
      read/write pages when responding to read errors.
      As we will shortly support changing data_offset for
      raid10, this function must understand new_data_offset.
      
      So add that understanding.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      1fdd6fc9
    • N
      md: add possibility to change data-offset for devices. · c6563a8c
      NeilBrown 提交于
      When reshaping we can avoid costly intermediate backup by
      changing the 'start' address of the array on the device
      (if there is enough room).
      
      So as a first step, allow such a change to be requested
      through sysfs, and recorded in v1.x metadata.
      
      (As we didn't previous check that all 'pad' fields were zero,
       we need a new FEATURE flag for this.
       A (belatedly) check that all remaining 'pad' fields are
       zero to avoid a repeat of this)
      
      The new data offset must be requested separately for each device.
      This allows each to have a different change in the data offset.
      This is not likely to be used often but as data_offset can be
      set per-device, new_data_offset should be too.
      
      This patch also removes the 'acknowledged' arg to rdev_set_badblocks as
      it is never used and never will be.  At the same time we add a new
      arg ('in_new') which is currently always zero but will be used more
      soon.
      
      When a reshape finishes we will need to update the data_offset
      and rdev->sectors.  So provide an exported function to do that.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c6563a8c
    • N
      md: allow a reshape operation to be reversed. · 2c810cdd
      NeilBrown 提交于
      Currently a reshape operation always progresses from the start
      of the array to the end unless the number of devices is being
      reduced, in which case it progressed in the opposite direction.
      
      To reverse a partial reshape which changes the number of devices
      you can stop the array and re-assemble with the raid-disks numbers
      reversed and it will undo.
      
      However for a reshape that does not change the number of devices
      it is not possible to reverse the reshape in the middle - you have to
      wait until it completes.
      
      So add a 'reshape_direction' attribute with is either 'forwards' or
      'backwards' and can be explicitly set when delta_disks is zero.
      
      This will become more important when we allow the data_offset to
      change in a reshape.  Then the explicit statement of what direction is
      being used will be more useful.
      
      This can be enabled in raid5 trivially as it already supports
      reverse reshape and just needs to use a different trigger to request it.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2c810cdd
    • S
      md: using GFP_NOIO to allocate bio for flush request · b5e1b8ce
      Shaohua Li 提交于
      A flush request is usually issued in transaction commit code path, so
      using GFP_KERNEL to allocate memory for flush request bio falls into
      the classic deadlock issue.
      
      This is suitable for any -stable kernel to which it applies as it
      avoids a possible deadlock.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b5e1b8ce
  2. 17 5月, 2012 1 次提交
    • J
      MD: Add del_timer_sync to mddev_suspend (fix nasty panic) · 0d9f4f13
      Jonathan Brassow 提交于
      Use del_timer_sync to remove timer before mddev_suspend finishes.
      
      We don't want a timer going off after an mddev_suspend is called.  This is
      especially true with device-mapper, since it can call the destructor function
      immediately following a suspend.  This results in the removal (kfree) of the
      structures upon which the timer depends - resulting in a very ugly panic.
      Therefore, we add a del_timer_sync to mddev_suspend to prevent this.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      0d9f4f13
  3. 24 4月, 2012 2 次提交
    • N
      md: fix possible corruption of array metadata on shutdown. · 30b8aa91
      NeilBrown 提交于
      commit c744a65c
        md: don't set md arrays to readonly on shutdown.
      
      removed the possibility of a 'BUG' when data is written to an array
      that has just been switched to read-only, but also introduced the
      possibility that the array metadata could be corrupted.
      
      If, when md_notify_reboot gets the mddev lock, the array is
      in a state where it is assembled but hasn't been started (as can
      happen if the personality module is not available, or in other unusual
      situations), then incorrect metadata will be written out making it
      impossible to re-assemble the array.
      
      So only call __md_stop_writes() if the array has actually been
      activated.
      
      This patch is needed for any stable kernel which has had the above
      commit applied.
      
      Cc: stable@vger.kernel.org
      Reported-by: NChristoph Nelles <evilazrael@evilazrael.de>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      30b8aa91
    • N
      md: don't call ->add_disk unless there is good reason. · ed209584
      NeilBrown 提交于
      Commit 7bfec5f3
      
         md/raid5: If there is a spare and a want_replacement device, start replacement.
      
      cause md_check_recovery to call ->add_disk much more often.
      Instead of only when the array is degraded, it is now called whenever
      md_check_recovery finds anything useful to do, which includes
      updating the metadata for clean<->dirty transition.
      This causes unnecessary work, and causes info messages from ->add_disk
      to be reported much too often.
      
      So refine md_check_recovery to only do any actual recovery checking
      (including ->add_disk) if MD_RECOVERY_NEEDED is set.
      
      This fix is suitable for 3.3.y:
      
      Cc: stable@vger.kernel.org
      Reported-by: NJan Ceuleers <jan.ceuleers@computer.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      ed209584
  4. 19 3月, 2012 6 次提交
    • M
      md: Add judgement bb->unacked_exist in function md_ack_all_badblocks(). · ecb178bb
      majianpeng 提交于
      If there are no unacked bad blocks, then there is no point searching
      for them to acknowledge them.
      Signed-off-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      ecb178bb
    • N
      md: fix clearing of the 'changed' flags for the bad blocks list. · d0962936
      NeilBrown 提交于
      In super_1_sync (the first hunk) we need to clear 'changed' before
      checking read_seqretry(), otherwise we might race with other code
      adding a bad block and so won't retry later.
      
      In md_update_sb (the second hunk), in the case where there is no
      metadata (neither persistent nor external), we treat any bad blocks as
      an error.  However we need to clear the 'changed' flag before calling
      md_ack_all_badblocks, else it won't do anything.
      
      This patch is suitable for -stable release 3.0 and later.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d0962936
    • N
      md/bitmap: move printing of bitmap status to bitmap.c · 57148964
      NeilBrown 提交于
      The part of /proc/mdstat which describes the bitmap should really
      be generated by code in bitmap.c.  So move it there.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      57148964
    • N
      md/raid10: handle merge_bvec_fn in member devices. · 050b6615
      NeilBrown 提交于
      Currently we don't honour merge_bvec_fn in member devices so if there
      is one, we force all requests to be single-page at most.
      This is not ideal.
      
      So enhance the raid10 merge_bvec_fn to check that function in children
      as well.
      
      This introduces a small problem.  There is no locking around calls
      the ->merge_bvec_fn and subsequent calls to ->make_request.  So a
      device added between these could end up getting a request which
      violates its merge_bvec_fn.
      
      Currently the best we can do is synchronize_sched().  This will work
      providing no preemption happens.  If there is preemption, we just
      have to hope that new devices are largely consistent with old devices.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      050b6615
    • N
      md: tidy up rdev_for_each usage. · dafb20fa
      NeilBrown 提交于
      md.h has an 'rdev_for_each()' macro for iterating the rdevs in an
      mddev.  However it uses the 'safe' version of list_for_each_entry,
      and so requires the extra variable, but doesn't include 'safe' in the
      name, which is useful documentation.
      
      Consequently some places use this safe version without needing it, and
      many use an explicity list_for_each entry.
      
      So:
       - rename rdev_for_each to rdev_for_each_safe
       - create a new rdev_for_each which uses the plain
         list_for_each_entry,
       - use the 'safe' version only where needed, and convert all other
         list_for_each_entry calls to use rdev_for_each.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      dafb20fa
    • N
      md: don't set md arrays to readonly on shutdown. · c744a65c
      NeilBrown 提交于
      It seems that with recent kernel, writeback can still be happening
      while shutdown is happening, and consequently data can be written
      after the md reboot notifier switches all arrays to read-only.
      This causes a BUG.
      
      So don't switch them to read-only - just mark them clean and
      set 'safemode' to '2' which mean that immediately after any
      write the array will be switch back to 'clean'.
      
      This could result in the shutdown happening when array is marked
      dirty, thus forcing a resync on reboot.  However if you reboot
      without performing a "sync" first, you get to keep both halves.
      
      This is suitable for any stable kernel (though there might be some
      conflicts with obvious fixes in earlier kernels).
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c744a65c
  5. 07 2月, 2012 1 次提交
    • N
      md: two small fixes to handling interrupt resync. · db91ff55
      NeilBrown 提交于
      1/ If a resync is aborted we should record how far we got
       (recovery_cp) the last request that we know has completed
       (->curr_resync_completed) rather than the last request that was
       submitted (->curr_resync).
      
      2/ When a resync aborts we still want to update the metadata with
       any changes, so set MD_CHANGE_DEVS even if we 'skip'.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      db91ff55
  6. 11 1月, 2012 2 次提交
  7. 04 1月, 2012 1 次提交
  8. 23 12月, 2011 6 次提交
    • N
      md/raid5: If there is a spare and a want_replacement device, start replacement. · 7bfec5f3
      NeilBrown 提交于
      When attempting to add a spare to a RAID[456] array, also consider
      adding it as a replacement for a want_replacement device.
      
      This requires that common md code attempt hot_add even when the array
      is not formally degraded.
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      7bfec5f3
    • N
      md: create externally visible flags for supporting hot-replace. · 2d78f8c4
      NeilBrown 提交于
      hot-replace is a feature being added to md which will allow a
      device to be replaced without removing it from the array first.
      
      With hot-replace a spare can be activated and recovery can start while
      the original device is still in place, thus allowing a transition from
      an unreliable device to a reliable device without leaving the array
      degraded during the transition.  It can also be use when the original
      device is still reliable but it not wanted for some reason.
      
      This will eventually be supported in RAID4/5/6 and RAID10.
      
      This patch adds a super-block flag to distinguish the replacement
      device.  If an old kernel sees this flag it will reject the device.
      
      It also adds two per-device flags which are viewable and settable via
      sysfs.
         "want_replacement" can be set to request that a device be replaced.
         "replacement" is set to show that this device is replacing another
         device.
      
      The "rd%d" links in /sys/block/mdXx/md only apply to the original
      device, not the replacement.  We currently don't make links for the
      replacement - there doesn't seem to be a need.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2d78f8c4
    • N
      md: change hot_remove_disk to take an rdev rather than a number. · b8321b68
      NeilBrown 提交于
      Soon an array will be able to have multiple devices with the
      same raid_disk number (an original and a replacement).  So removing
      a device based on the number won't work.  So pass the actual device
      handle instead.
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b8321b68
    • N
      md: remove test for duplicate device when setting slot number. · 476a7abb
      NeilBrown 提交于
      When setting the slot number on a device in an active array we
      currently check that the number is not already in use.
      We then call into the personality's hot_add_disk function
      which performs the same test and returns the same error.
      
      Thus the common test is not needed.
      
      As we will shortly be changing some personalities to allow duplicates
      in some cases (to support hot-replace), the common test will become
      inconvenient.
      
      So remove the common test.
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      476a7abb
    • N
      md: allow non-privileged uses to GET_*_INFO about raid arrays. · 506c9e44
      NeilBrown 提交于
      The info is already available in /proc/mdstat and /sys/block in
      an accessible form so there is no point in putting a road-block in
      the ioctl for information gathering.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      506c9e44
    • N
      md: don't give up looking for spares on first failure-to-add · 60fc1370
      NeilBrown 提交于
      Before performing a recovery we try to remove any spares that
      might not be working, then add any that might have become relevant.
      
      Currently we abort on the first spare that cannot be added.
      This is a false optimisation.
      It is conceivable that - depending on rules in the personality - a
      subsequent spare might be accepted.
      Also the loop does other things like count the available spares and
      reset the 'recovery_offset' value.
      
      If we abort early these might not happen properly.
      
      So remove the early abort.
      
      In particular if you have an array what is undergoing recovery and
      which has extra spares, then the recovery may not restart after as
      reboot as the could of 'spares' might end up as zero.
      Reported-by: NAnssi Hannula <anssi.hannula@iki.fi>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      60fc1370
  9. 08 12月, 2011 4 次提交
    • N
      md: ensure new badblocks are handled promptly. · 8bd2f0a0
      NeilBrown 提交于
      When we mark blocks as bad we need them to be acknowledged by the
      metadata handler promptly.
      
      For an in-kernel metadata handler that was already being done.  But
      for an external metadata handler we need to alert it of the change by
      sending a notification through the sysfs file.  This adds that
      notification.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      8bd2f0a0
    • N
      md: bad blocks shouldn't cause a Blocked status on a Faulty device. · 52c64152
      NeilBrown 提交于
      Once a device is marked Faulty the badblocks - whether acknowledged or
      not - become irrelevant.  So they shouldn't cause the device to be
      marked as Blocked.
      
      Without this patch, a process might write "-blocked" to clear the
      Blocked status, but while that will correctly fail the device, it
      won't remove the apparent 'blocked' status.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      52c64152
    • N
      md: take a reference to mddev during sysfs access. · af8a2434
      NeilBrown 提交于
      
      When we are accessing an mddev via sysfs we know that the
      mddev cannot disappear because it has an embedded kobj which
      is refcounted by sysfs.
      And we also take the mddev_lock.
      However this is not enough.
      
      The final mddev_put could have been called and the
      mddev_delayed_delete is waiting for sysfs to let go so it can destroy
      the kobj and mddev.
      In this state there are a lot of changes that should not be attempted.
      
      To to guard against this we:
       - initialise mddev->all_mddevs in on last put so the state can be
         easily detected.
       - in md_attr_show and md_attr_store, check ->all_mddevs under
         all_mddevs_lock and mddev_get the mddev if it still appears to
         be active.
      
      This means that if we get to sysfs as the mddev is being deleted we
      will get -EBUSY.
      
      rdev_attr_store and rdev_attr_show are similar but already have
      sufficient protection.  They check that rdev->mddev still points to
      mddev after taking mddev_lock.  As this is cleared  before delayed
      removal which can only be requested under the mddev_lock, this
      ensure the rdev and mddev are still alive.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      af8a2434
    • N
      md: refine interpretation of "hold_active == UNTIL_IOCTL". · 1d23f178
      NeilBrown 提交于
      We like md devices to disappear when they really are not needed.
      However it is not possible to tell from the current state whether it
      is needed or not.  We can only tell from recent history of changes.
      
      In particular immediately after we create an md device it looks very
      similar to immediately after we have finished with it.
      
      So we always preserve a newly created md device until something
      significant happens.  This state is stored in 'hold_active'.
      
      The normal case is to keep it until an ioctl happens, as that will
      normally either activate it, or explicitly de-activate it.  If it
      doesn't then it was probably created by mistake and it is now time to
      get rid of it.
      
      We can also modify an array via sysfs (instead of via ioctl) and we
      currently treat any change via sysfs like an ioctl as a sign that if
      it now isn't more active, it should be destroyed.
      However this is not appropriate as changes made via sysfs are more
      gradual so we should look for a more definitive change.
      
      So this patch only clears 'hold_active' from UNTIL_IOCTL to clear when
      the array_state is changed via sysfs.  Other changes via sysfs
      are ignored.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      1d23f178
  10. 01 11月, 2011 1 次提交
  11. 19 10月, 2011 1 次提交
  12. 18 10月, 2011 2 次提交
  13. 11 10月, 2011 4 次提交
  14. 07 10月, 2011 1 次提交
  15. 23 9月, 2011 1 次提交
    • D
      md: don't delay reboot by 1 second if no MD devices exist · 2dba6a91
      Daniel P. Berrange 提交于
      The md_notify_reboot() method includes a call to mdelay(1000),
      to deal with "exotic SCSI devices" which are too volatile on
      reboot. The delay is unconditional. Even if the machine does
      not have any block devices, let alone MD devices, the kernel
      shutdown sequence is slowed down.
      
      1 second does not matter much with physical hardware, but with
      certain virtualization use cases any wasted time in the bootup
      & shutdown sequence counts for alot.
      
      * drivers/md/md.c: md_notify_reboot() - only impose a delay if
        there was at least one MD device to be stopped during reboot
      Signed-off-by: NDaniel P. Berrange <berrange@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      2dba6a91
  16. 21 9月, 2011 1 次提交
    • N
      md: Avoid waking up a thread after it has been freed. · 01f96c0a
      NeilBrown 提交于
      Two related problems:
      
      1/ some error paths call "md_unregister_thread(mddev->thread)"
         without subsequently clearing ->thread.  A subsequent call
         to mddev_unlock will try to wake the thread, and crash.
      
      2/ Most calls to md_wakeup_thread are protected against the thread
         disappeared either by:
            - holding the ->mutex
            - having an active request, so something else must be keeping
              the array active.
         However mddev_unlock calls md_wakeup_thread after dropping the
         mutex and without any certainty of an active request, so the
         ->thread could theoretically disappear.
         So we need a spinlock to provide some protections.
      
      So change md_unregister_thread to take a pointer to the thread
      pointer, and ensure that it always does the required locking, and
      clears the pointer properly.
      Reported-by: N"Moshe Melnikov" <moshe@zadarastorage.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      cc: stable@kernel.org
      01f96c0a
  17. 12 9月, 2011 1 次提交
  18. 10 9月, 2011 1 次提交
    • N
      md: Fix handling for devices from 2TB to 4TB in 0.90 metadata. · 27a7b260
      NeilBrown 提交于
      0.90 metadata uses an unsigned 32bit number to count the number of
      kilobytes used from each device.
      This should allow up to 4TB per device.
      However we multiply this by 2 (to get sectors) before casting to a
      larger type, so sizes above 2TB get truncated.
      
      Also we allow rdev->sectors to be larger than 4TB, so it is possible
      for the array to be resized larger than the metadata can handle.
      So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in
      used.
      
      Also the sanity check at the end of super_90_load should include level
      1 as it used ->size too. (RAID0 and Linear don't use ->size at all).
      Reported-by: NPim Zandbergen <P.Zandbergen@macroscoop.nl>
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      27a7b260