1. 25 July 2013, 2 commits
    • md/raid5: fix interaction of 'replace' and 'recovery'. · f94c0b66
      Committed by NeilBrown
      If a device in a RAID4/5/6 is being replaced while another is being
      recovered, then the writes to the replacement device currently don't
      happen, resulting in corruption when the replacement completes and the
      new drive takes over.
      
      This is because the replacement writes are only triggered when
      's.replacing' is set and not when the similar 's.sync' is set (which
      is the case during resync and recovery - it means all devices need to
      be read).
      
      So schedule those writes when s.replacing is set as well.
      
      In this case we cannot use "STRIPE_INSYNC" to record that the
      replacement has happened as that is needed for recording that any
      parity calculation is complete.  So introduce STRIPE_REPLACED to
      record if the replacement has happened.
      
      For safety we should also check that STRIPE_COMPUTE_RUN is not set.
      This has a similar effect to the "s.locked == 0" test.  The latter
      ensures that no IO has been flagged but not started.  The former
      checks whether any parity calculation has been flagged but not started.
      We must wait for both of these to complete before triggering the
      'replace'.
      
      Add a similar test to the subsequent check for "are we finished yet".
      This possibly isn't needed (is subsumed in the STRIPE_INSYNC test),
      but it makes it more obvious that the REPLACE will happen before we
      think we are finished.
      
      Finally if a NeedReplace device is not UPTODATE then that is an
      error.  We really must trigger a warning.
      
      This bug was introduced in commit 9a3e1101
      (md/raid5:  detect and handle replacements during recovery.)
      which introduced replacement for raid5.
      That was in 3.3-rc3, so any stable kernel since then would benefit
      from this fix.
      
      Cc: stable@vger.kernel.org (3.3+)
      Reported-by: qindehua <13691222965@163.com>
      Tested-by: qindehua <qindehua@163.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      f94c0b66
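      A minimal, compilable sketch of the gate described above, using stand-in
      types and the flag semantics named in the commit message (an
      illustration only, not the upstream raid5.c code):

        #include <stdbool.h>
        #include <stdio.h>

        /* Stand-in for the relevant stripe state. */
        struct stripe_state {
                bool replacing;    /* a 'replace' wants this stripe          */
                bool sync;         /* resync/recovery reads the whole stripe */
                int  locked;       /* IOs flagged but not yet completed      */
                bool compute_run;  /* a parity calculation has been flagged  */
                bool replaced;     /* replacement writes already scheduled   */
        };

        /* Replacement writes may be scheduled only once no IO and no parity
         * calculation is outstanding, and only once per stripe; the fix is to
         * trigger on 'sync' (resync/recovery) as well as 'replacing'. */
        static bool can_schedule_replacement_writes(const struct stripe_state *s)
        {
                return (s->replacing || s->sync) &&
                       s->locked == 0 &&
                       !s->compute_run &&
                       !s->replaced;
        }

        int main(void)
        {
                struct stripe_state s = { .sync = true, .replacing = true };
                printf("schedule replacement writes: %s\n",
                       can_schedule_replacement_writes(&s) ? "yes" : "no");
                return 0;
        }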
    • md/raid10: remove use-after-free bug. · 0eb25bb0
      Committed by NeilBrown
      We always need to be careful when calling generic_make_request, as it
      can start a chain of events which might free something that we are
      using.
      
      Here is one place I wasn't careful enough.  If the wbio2 is not in
      use, then it might get freed at the first generic_make_request call.
      So perform all necessary tests first.
      
      This bug was introduced in 3.3-rc3 (24afd80d) and can cause an
      oops, so the fix is suitable for any -stable kernel since then.
      
      Cc: stable@vger.kernel.org (3.3+)
      Signed-off-by: NeilBrown <neilb@suse.de>
      0eb25bb0
  2. 18 July 2013, 3 commits
    • md/raid1: fix bio handling problems in process_checks() · 30bc9b53
      Committed by NeilBrown
      The recent change to use bio_copy_data() in raid1 when repairing
      an array is faulty.
      
      The underlying layer may have changed the bio in various ways using
      bio_advance and these need to be undone not just for the 'sbio' which
      is being copied to, but also the 'pbio' (primary) which is being
      copied from.
      
      So perform the reset on all bios that were read from and do it early.
      
      This also ensures that the sbio->bi_io_vec[j].bv_len passed to
      memcmp is correct.
      
      This fixes a crash during a 'check' of a RAID1 array.  The crash was
      introduced in 3.10 so this is suitable for 3.10-stable.
      
      Cc: stable@vger.kernel.org (3.10)
      Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      30bc9b53
    • md: Remove recent change which allows devices to skip recovery. · 5024c298
      Committed by NeilBrown
      commit 7ceb17e8
          md: Allow devices to be re-added to a read-only array.
      
      allowed a bit more than just that.  It also allows devices to be added
      to a read-write array and to end up skipping recovery.
      
      This patch removes the offending piece of code pending a rewrite for a
      subsequent release.
      
      More specifically:
       If the array has a bitmap, then the device will still need a bitmap
       based resync ('saved_raid_disk' is set under different conditions
       if a bitmap is present).
       If the array doesn't have a bitmap, then this is correct as long as
       nothing has been written to the array since the metadata was checked
       by ->validate_super.  However there is no locking to ensure that there
       was no write.
      
      Bug was introduced in 3.10 and causes data corruption so
      patch is suitable for 3.10-stable.
      
      Cc: stable@vger.kernel.org (3.10)
      Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      5024c298
    • md/raid10: fix two problems with RAID10 resync. · 7bb23c49
      Committed by NeilBrown
      1/ When a difference between blocks is found, data is copied from
         one bio to the other.  However bv_len is used as the length to
         copy and this could be zero.  So use r10_bio->sectors to calculate
         length instead.
         Using bv_len was probably always a bit dubious, but the introduction
         of bio_advance made it much more likely to be a problem.
      
      2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
         except on bios that we schedule for a read.  This ensures that
         missing/failed devices don't confuse the loop at the top of
         sync_request_write().
         Commit 8be185f2 "raid10: Use bio_reset()"
         removed a loop which set BIO_UPTODATE on all appropriate bios.
         So we need to re-add that flag.
      
      These bugs were introduced in 3.10, so this patch is suitable for
      3.10-stable, and removes a potential source of data corruption.
      
      Cc: stable@vger.kernel.org (3.10)
      Reported-by: Brassow Jonathan <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      7bb23c49
  3. 11 July 2013, 12 commits
  4. 04 July 2013, 2 commits
    • md/raid10: fix bug which causes all RAID10 reshapes to move no data. · 13765120
      Committed by NeilBrown
      The recent commit:
      commit 7e83ccbe
          md/raid10: Allow skipping recovery when clean arrays are assembled
      
      causes raid10 to skip a recovery in certain cases where it is safe to
      do so.  Unfortunately it also causes a reshape to be skipped which is
      never safe.  The result is that an attempt to reshape a RAID10 will
      appear to complete instantly, but no data will have been moved so the
      array will now contain garbage.
      (If nothing is written, you can recover by simply performing the
      reverse reshape which will also complete instantly).
      
      Bug was introduced in 3.10, so this is suitable for 3.10-stable.
      
      Cc: stable@vger.kernel.org (3.10)
      Cc: Martin Wilck <mwilck@arcor.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
      13765120
    • md/raid5: allow 5-device RAID6 to be reshaped to 4-device. · fdcfbbb6
      Committed by NeilBrown
      There is a bug in 'check_reshape' for raid5.c.  It checks
      that the new minimum number of devices is large enough (which is
      good), but it does so also after the reshape has started (bad).
      
      This is bad because
       - the calculation is now wrong as mddev->raid_disks has changed
         already, and
       - it is pointless because it is now too late to stop.
      
      So only perform that test when reshape has not been committed to.
      Signed-off-by: NeilBrown <neilb@suse.de>
      fdcfbbb6
  5. 03 July 2013, 1 commit
    • md/raid10: fix two bugs affecting RAID10 reshape. · 78eaa0d4
      Committed by NeilBrown
      1/ If a RAID10 is being reshaped to a smaller number of devices
       and is stopped while this is ongoing, then when the array is
       reassembled the 'mirrors' array will be allocated too small.
       This will lead to an access error or memory corruption.
      
      2/ A sanity test performed when a reshaping RAID10 array is restarted
       is slightly incorrect.
      
      Due to the first bug, this is suitable for any -stable
      kernel since 3.5 where this code was introduced.
      
      Cc: stable@vger.kernel.org (v3.5+)
      Signed-off-by: NeilBrown <neilb@suse.de>
      78eaa0d4
  6. 26 June 2013, 2 commits
    • MD: Remember the last sync operation that was performed · c4a39551
      Committed by Jonathan Brassow
      MD:  Remember the last sync operation that was performed
      
      This patch adds a field to the mddev structure to track the last
      sync operation that was performed.  This is especially useful when
      it comes to what is recorded in mismatch_cnt in sysfs.  If the
      last operation was "data-check", then it reports the number of
      discrepancies found by the user-initiated check.  If it was a
      "repair" operation, then it is reporting the number of
      discrepancies repaired, and so on.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      c4a39551
    • md: fix buglet in RAID5 -> RAID0 conversion. · eea136d6
      Committed by NeilBrown
      RAID5 uses a 'per-array' value for the 'size' of each device.
      RAID0 uses a 'per-device' value - it can be different for each device.
      
      When converting a RAID5 to a RAID0 we must ensure that the per-device
      size of each device matches the per-array size for the RAID5, else
      the array will change size.
      
      If the metadata cannot record a changed per-device size (as is the
      case with v0.90 metadata) the array could get bigger on restart.  This
      does not cause data corruption, so it is not a big issue and is mainly
      yet another reason not to use 0.90.
      Signed-off-by: NeilBrown <neilb@suse.de>
      eea136d6
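      A kernel-style fragment illustrating the idea (a sketch based on the
      description above, not the literal patch): during the takeover, the
      array-wide per-device size is copied into each member device's own size
      field so the resulting RAID0 geometry matches the old RAID5.

        rdev_for_each(rdev, mddev)              /* iterate over member devices */
                rdev->sectors = mddev->dev_sectors;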
  7. 18 June 2013, 1 commit
  8. 14 June 2013, 8 commits
    • md/raid10: check In_sync flag in 'enough()'. · 725d6e57
      Committed by NeilBrown
      It isn't really enough to check that the rdev is present; we also need
      to be sure that the device is still In_sync.
      
      Doing this requires using rcu_dereference to access the rdev, and
      holding the rcu_read_lock() to ensure the rdev doesn't disappear while
      we look at it.
      Signed-off-by: NeilBrown <neilb@suse.de>
      725d6e57
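      A kernel-style fragment showing the access pattern described above (a
      sketch based on the commit message, not the exact upstream diff):

        int has_enough = 0;
        struct md_rdev *rdev;

        rcu_read_lock();
        rdev = rcu_dereference(conf->mirrors[disk].rdev);
        if (rdev && test_bit(In_sync, &rdev->flags))
                has_enough = 1;     /* device is present *and* still In_sync */
        rcu_read_unlock();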
    • md/raid10: locking changes for 'enough()'. · 635f6416
      Committed by NeilBrown
      As 'enough' accesses conf->prev and conf->geo, which can change
      spontaneously, it should guard against changes.
      This can be done with device_lock as start_reshape holds device_lock
      while updating 'geo' and end_reshape holds it while updating 'prev'.
      
      So 'error' needs to hold 'device_lock'.
      
      On the other hand, raid10_end_read_request knows which of the two it
      really wants to access, and as it is an active request on that one,
      the value cannot change underneath it.
      
      So change _enough to take a flag rather than a pointer, pass the
      appropriate flag from raid10_end_read_request(), and remove the locking.
      
      All other calls to 'enough' are made with reconfig_mutex held, so
      neither 'prev' nor 'geo' can change.
      Signed-off-by: NeilBrown <neilb@suse.de>
      635f6416
    • md: replace strict_strto*() with kstrto*() · b29bebd6
      Committed by Jingoo Han
      The usage of strict_strtoul() is not preferred, because
      strict_strtoul() is obsolete. Thus, kstrtoul() should be
      used.
      Signed-off-by: Jingoo Han <jg1.han@samsung.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      b29bebd6
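      A sketch of the mechanical conversion; the store handler below is a
      hypothetical example rather than a function from md.c, but kstrtoul()
      keeps the same calling convention as the old strict_strtoul():

        static ssize_t example_store(struct mddev *mddev, const char *buf, size_t len)
        {
                unsigned long val;
                int rv = kstrtoul(buf, 10, &val);  /* was: strict_strtoul(buf, 10, &val) */

                if (rv)
                        return rv;                 /* -EINVAL or -ERANGE */
                /* ... apply val ... */
                return len;
        }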
    • md: Wait for md_check_recovery before attempting device removal. · 90f5f7ad
      Committed by Hannes Reinecke
      When a device has failed, it needs to be removed from the personality
      module before it can be removed from the array as a whole.
      The first step is performed by md_check_recovery() which is called
      from the raid management thread.
      
      So when a HOT_REMOVE ioctl arrives, wait briefly for md_check_recovery
      to have run.  This increases the chance that the ioctl will succeed.
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Neil Brown <nfbrown@suse.de>
      90f5f7ad
    • dm-raid: silence compiler warning on rebuilds_per_group. · 3f6bbd3f
      Committed by NeilBrown
      This doesn't really need to be initialised, but it doesn't hurt,
      silences the compiler, and as it is a counter it makes sense for it to
      start at zero.
      Signed-off-by: NeilBrown <neilb@suse.de>
      3f6bbd3f
    • DM RAID: Fix raid_resume not reviving failed devices in all cases · a4dc163a
      Committed by Jonathan Brassow
      DM RAID:  Fix raid_resume not reviving failed devices in all cases
      
      When a device fails in a RAID array, it is marked as Faulty.  Later,
      md_check_recovery is called which (through the call chain) calls
      'hot_remove_disk' in order to have the personalities remove the device
      from use in the array.
      
      Sometimes, it is possible for the array to be suspended before the
      personalities get their chance to perform 'hot_remove_disk'.  This is
      normally not an issue.  If the array is deactivated, then the failed
      device will be noticed when the array is reinstantiated.  If the
      array is resumed and the disk is still missing, md_check_recovery will
      be called upon resume and 'hot_remove_disk' will be called at that
      time.  However, (for dm-raid) if the device has been restored,
      a resume on the array would cause it to attempt to revive the device
      by calling 'hot_add_disk'.  If 'hot_remove_disk' had not been called,
      a situation is then created where the device is thought to concurrently
      be the replacement and the device to be replaced.  Thus, the device
      is first sync'ed with the rest of the array (because it is the replacement
      device) and then marked Faulty and removed from the array (because
      it is also the device being replaced).
      
      The solution is to check and see if the device had properly been removed
      before the array was suspended.  This is done by seeing whether the
      device's 'raid_disk' field is -1 - a condition that implies that
      'md_check_recovery -> remove_and_add_spares (where raid_disk is set to -1)
      -> hot_remove_disk' has been called.  If 'raid_disk' is not -1, then
      'hot_remove_disk' must be called to complete the removal of the previously
      faulty device before it can be revived via 'hot_add_disk'.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      a4dc163a
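      A kernel-style sketch of the check described above (reconstructed from
      the description; it is not the upstream dm-raid diff):

        if (test_bit(Faulty, &rdev->flags) && rdev->raid_disk >= 0) {
                /* hot_remove_disk never ran, so complete the removal first;
                 * only then can the device be revived via hot_add_disk. */
                if (mddev->pers->hot_remove_disk(mddev, rdev) == 0)
                        rdev->raid_disk = -1;
        }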
    • DM RAID: Break-up untidy function · f381e71b
      Committed by Jonathan Brassow
      DM RAID:  Break-up untidy function
      
      Clean-up excessive indentation by moving some code in raid_resume()
      into its own function.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      f381e71b
    • DM RAID: Add ability to restore transiently failed devices on resume · 9092c02d
      Committed by Jonathan Brassow
      DM RAID: Add ability to restore transiently failed devices on resume
      
      This patch adds code to the resume function to check over the devices
      in the RAID array.  If any are found to be marked as failed and their
      superblocks can be read, an attempt is made to reintegrate them into
      the array.  This allows the user to refresh the array with a simple
      suspend and resume of the array - rather than having to load a
      completely new table, allocate and initialize all the structures and
      throw away the old instantiation.
      Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      9092c02d
  9. 13 June 2013, 4 commits
    • md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place · 5026d7a9
      Committed by H. Peter Anvin
      There are cases where the kernel will believe that the WRITE SAME
      command is supported by a block device which does not, in fact,
      support WRITE SAME.  This currently happens for SATA drives behind a
      SAS controller, but there are probably a hundred other ways that can
      happen, including drive firmware bugs.
      
      After receiving an error for WRITE SAME the block layer will retry the
      request as a plain write of zeroes, but mdraid will consider the
      failure as fatal and consider the drive failed.  This has the effect
      that all the mirrors containing a specific set of data are each
      offlined in very rapid succession resulting in data loss.
      
      However, just bouncing the request back up to the block layer isn't
      ideal either, because the whole initial request-retry sequence should
      be inside the write bitmap fence, which probably means that md needs
      to do its own conversion of WRITE SAME to write zero.
      
      Until the failure scenario has been sorted out, disable WRITE SAME for
      raid1, raid5, and raid10.
      
      [neilb: added raid5]
      
      This patch is appropriate for any -stable since 3.7 when write_same
      support was added.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      5026d7a9
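      A kernel-style sketch of the mitigation (illustrative, not the literal
      patch): advertising a WRITE SAME limit of zero on the md request queue
      keeps the block layer from issuing WRITE SAME to the array at all:

        blk_queue_max_write_same_sectors(mddev->queue, 0);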
    • md/raid1,raid10: use freeze_array in place of raise_barrier in various places. · e2d59925
      Committed by NeilBrown
      Various places in raid1 and raid10 are calling raise_barrier when they
      really should call freeze_array.
      The former is only intended to be called from "make_request".
      The latter has extra checks for 'nr_queued' and makes a call to
      flush_pending_writes(), so it is safe to call it from within the
      management thread.
      
      Using raise_barrier will sometimes deadlock.  Using freeze_array
      should not.
      
      As 'freeze_array' currently expects one request to be pending (in
      handle_read_error - the only previous caller), we need to pass
      it the number of pending requests (extra) to ignore.
      
      The deadlock was made particularly noticeable by commits
      050b6615 (raid10) and 6b740b8d (raid1) which
      appeared in 3.4, so the fix is appropriate for any -stable
      kernel since then.
      
      This patch probably won't apply directly to some early kernels and
      will need to be applied by hand.
      
      Cc: stable@vger.kernel.org
      Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      e2d59925
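      A kernel-style sketch of the resulting interface (based on the
      description above; field and helper names follow raid1.c conventions,
      but this is not the exact upstream code):

        /* 'extra' is the number of the caller's own requests still pending,
         * which the wait condition must tolerate. */
        static void freeze_array(struct r1conf *conf, int extra)
        {
                spin_lock_irq(&conf->resync_lock);
                conf->barrier++;
                conf->nr_waiting++;
                wait_event_lock_irq_cmd(conf->wait_barrier,
                                        conf->nr_pending == conf->nr_queued + extra,
                                        conf->resync_lock,
                                        flush_pending_writes(conf));
                spin_unlock_irq(&conf->resync_lock);
        }

      handle_read_error(), the previous sole caller, keeps its behaviour by
      passing 1; callers on the management-thread path with no request of
      their own pass 0.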
    • md/raid1: consider WRITE as successful only if at least one non-Faulty and... · 3056e3ae
      Committed by Alex Lyakas
      md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
      
      Without that fix, the following scenario could happen:
      
      - RAID1 with drives A and B; drive B was freshly-added and is rebuilding
      - Drive A fails
      - WRITE request arrives to the array. It is failed by drive A, so
      r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
      succeeds in writing it, so the same r1_bio is marked as
      R1BIO_Uptodate.
      - r1_bio arrives to handle_write_finished, badblocks are disabled,
      md_error()->error() does nothing because we don't fail the last drive
      of raid1
      - raid_end_bio_io()  calls call_bio_endio()
      - As a result, in call_bio_endio():
              if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
                      clear_bit(BIO_UPTODATE, &bio->bi_flags);
      this code doesn't clear the BIO_UPTODATE flag, and the whole master
      WRITE succeeds, back to the upper layer.
      
      So we returned success to the upper layer, even though we had written
      the data onto the rebuilding drive only. But when we want to read the
      data back, we would not read from the rebuilding drive, so this data
      is lost.
      
      [neilb - applied identical change to raid10 as well]
      
      This bug can result in lost data, so it is suitable for any
      -stable kernel.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      3056e3ae
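      A kernel-style sketch of the kind of check this implies in the
      write-completion path (reconstructed from the description, not the
      literal diff): only completion on a device that is In_sync and not
      Faulty may mark the master bio up to date:

        if (test_bit(In_sync, &rdev->flags) && !test_bit(Faulty, &rdev->flags))
                set_bit(R1BIO_Uptodate, &r1_bio->state);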
    • md: md_stop_writes() should always freeze recovery. · 6b6204ee
      Committed by NeilBrown
      __md_stop_writes() will currently sometimes freeze recovery.
      So any caller must be ready for that to happen, and indeed they are.
      
      However if __md_stop_writes() doesn't freeze recovery, then
      a recovery could start before mddev_suspend() is called, which
      could be awkward.  This can particularly cause problems for dm-raid.
      
      So change __md_stop_writes() to always freeze recovery.  This is safe
      and more predictable.
      Reported-by: Brassow Jonathan <jbrassow@redhat.com>
      Tested-by: Brassow Jonathan <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      6b6204ee
  10. 30 May 2013, 1 commit
  11. 20 May 2013, 1 commit
    • dm thin: fix metadata dev resize detection · 610bba8b
      Committed by Alasdair G Kergon
      Fix detection of the need to resize the dm thin metadata device.
      
      The code incorrectly tried to extend the metadata device when it
      didn't need to due to a merging error with patch 24347e95 ("dm thin:
      detect metadata device resizing").
      
        device-mapper: transaction manager: couldn't open metadata space map
        device-mapper: thin metadata: tm_open_with_sm failed
        device-mapper: thin: aborting transaction failed
        device-mapper: thin: switching pool to failure mode
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
      610bba8b
  12. 15 May 2013, 3 commits
    • bcache: Fix error handling in init code · f59fce84
      Committed by Kent Overstreet
      This code appears to have rotted... fix various bugs and do some
      refactoring.
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      f59fce84
    • bcache: drop "select CLOSURES" · bbb1c3b5
      Committed by Paul Bolle
      The Kconfig entry for BCACHE selects CLOSURES. But there's no Kconfig
      symbol CLOSURES. That symbol was used in development versions of bcache,
      but was removed when the closures code was no longer provided as a
      kernel library. It can safely be dropped.
      Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
      bbb1c3b5
    • bcache: Fix incompatible pointer type warning · 867e1162
      Committed by Emil Goode
      The 'release' function pointer in struct block_device_operations
      should point to a function that returns void.
      
      Sparse warnings:
      
      drivers/md/bcache/super.c:656:27: warning:
      	incorrect type in initializer (different base types)
      	drivers/md/bcache/super.c:656:27:
      	expected void ( *release )( ... )
      	drivers/md/bcache/super.c:656:27:
      	got int ( static [toplevel] *<noident> )( ... )
      
      drivers/md/bcache/super.c:656:2: warning:
      	initialization from incompatible pointer type [enabled by default]
      
      drivers/md/bcache/super.c:656:2: warning:
      	(near initialization for ‘bcache_ops.release’) [enabled by default]
      Signed-off-by: Emil Goode <emilgoode@gmail.com>
      Signed-off-by: Kent Overstreet <koverstreet@google.com>
      867e1162
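      For reference, a sketch of the corrected shape (the function name is
      illustrative; the point is that .release is declared void and takes a
      struct gendisk * and an fmode_t):

        static void bcache_release(struct gendisk *disk, fmode_t mode)
        {
                /* drop the reference taken by the open method */
        }

        static const struct block_device_operations bcache_ops = {
                .release = bcache_release,
                .owner   = THIS_MODULE,
        };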