1. 28 Aug, 2013 1 commit
    • raid5: make release_stripe lockless · 773ca82f
      Shaohua Li committed
      release_stripe still has big lock contention. We just add the stripe to a llist
      without taking device_lock. We let the raid5d thread do the real stripe
      release, which must hold device_lock anyway. In this way, release_stripe
      doesn't hold any locks.

      The side effect is that the order of released stripes changes. That does not
      look like a big deal: stripes are never handled in order, and I believe the
      block layer already merges requests well, so order isn't that important.

      I kept the unplug release batch. It is unnecessary with this patch from a
      lock-contention point of view, and if we deleted it, the stripe_head
      release_list and lru could share storage. But the unplug release batch also
      helps request merging. We could probably delay waking raid5d until unplug, but
      I'm still worried about the case where raid5d is already running.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
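The queueing scheme this commit describes can be sketched in user-space C11. This is only an illustration under stated assumptions: `struct stripe`, `struct release_list` and the `release_list_*` functions are hypothetical names, and C11 atomics stand in for the kernel's llist primitives. Producers push with a compare-and-swap loop and take no lock; a single consumer (the role raid5d plays) detaches the whole list in one exchange and can then process it under its own lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for stripe_head; only the list link matters here. */
struct stripe {
    int id;
    struct stripe *next;            /* link used while queued for release */
};

/* Lock-free list of stripes waiting to be released. */
struct release_list {
    _Atomic(struct stripe *) first;
};

void release_list_init(struct release_list *l)
{
    atomic_init(&l->first, (struct stripe *)NULL);
}

/* Producer side: push one stripe without taking any lock. */
void release_list_add(struct release_list *l, struct stripe *s)
{
    struct stripe *old = atomic_load_explicit(&l->first, memory_order_relaxed);

    do {
        s->next = old;              /* 'old' is refreshed when the CAS fails */
    } while (!atomic_compare_exchange_weak_explicit(&l->first, &old, s,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Consumer side: detach the whole list with a single atomic exchange. */
struct stripe *release_list_take_all(struct release_list *l)
{
    return atomic_exchange_explicit(&l->first, (struct stripe *)NULL,
                                    memory_order_acquire);
}

/* Consumer side: record the ids of all queued stripes in the order they
 * come off the list. The caller must size ids[] for every queued stripe. */
int release_list_drain(struct release_list *l, int *ids, int max)
{
    int n = 0;

    assert(ids != NULL);
    for (struct stripe *s = release_list_take_all(l); s && n < max; s = s->next)
        ids[n++] = s->id;
    return n;
}
```

Stripes come off this list in the reverse of the order they were pushed, which is one way to see the reordering the commit message mentions.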
  2. 27 Aug, 2013 5 commits
    • md: avoid deadlock when dirty buffers during md_stop. · 260fa034
      NeilBrown committed
      When the last process closes /dev/mdX sync_blockdev will be called so
      that all buffers get flushed.
      So if it is then opened for the STOP_ARRAY ioctl to be sent there will
      be nothing to flush.
      
      However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just
      moments before some other process which was writing closes their file
      descriptor, then there won't be a 'last close' and the buffers might
      not get flushed.
      
      So do_md_stop() calls sync_blockdev().  However at this point it is
      holding ->reconfig_mutex.  So if the array is currently 'clean' then
      the writes from sync_blockdev() will not complete until the array
      can be marked dirty and that won't happen until some other thread
      can get ->reconfig_mutex.  So we deadlock.
      
      We need to move the sync_blockdev() call to before we take
      ->reconfig_mutex.
      However then some other thread could open /dev/mdX and write to it
      after we call sync_blockdev() and before we actually stop the array.
      This can leave dirty data in the page cache which is awkward.
      
      So introduce a new flag, MD_STILL_CLOSED.  Set it before calling
      sync_blockdev(), clear it if anyone does open the file, and abort the
      STOP_ARRAY attempt if it has been cleared by the time we lock against
      further opens.
      
      It is still possible to get problems if you open /dev/mdX, write to
      it, then issue the STOP_ARRAY ioctl.  Just don't do that.
      Signed-off-by: NeilBrown <neilb@suse.de>
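The flag protocol described in this commit can be modelled in a few lines of C. This is a reduced sketch, not the md code: `struct mddev_sketch` and the `sketch_*` functions are hypothetical, the flush itself is left out, and returning `-EBUSY` on abort is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced model of an md device. */
struct mddev_sketch {
    atomic_bool still_closed;       /* models the MD_STILL_CLOSED flag */
};

/* Any open of the device clears the flag. */
void sketch_open(struct mddev_sketch *m)
{
    atomic_store(&m->still_closed, false);
}

/* Step 1 of STOP_ARRAY: set the flag, then flush, all before the
 * reconfiguration lock is taken. The flush is omitted in this model. */
void sketch_stop_prepare(struct mddev_sketch *m)
{
    atomic_store(&m->still_closed, true);
    /* sync_blockdev() would run here, without ->reconfig_mutex held */
}

/* Step 2, with the lock held: abort if someone opened the device after
 * step 1, because new dirty data may have appeared since the flush. */
int sketch_stop_commit(struct mddev_sketch *m)
{
    if (!atomic_load(&m->still_closed))
        return -EBUSY;              /* assumed error code */
    return 0;
}
```

If nothing opens the device between the two steps, the stop proceeds; an intervening `sketch_open()` makes the commit step fail instead of stopping an array with unflushed data.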
    • md: Don't test all of mddev->flags at once. · 7a0a5355
      NeilBrown committed
      mddev->flags is mostly used to record if an update of the
      metadata is needed.  Sometimes the whole field is tested
      instead of just the important bits.  This makes it difficult
      to introduce more state bits.
      
      So replace all bare tests of mddev->flags with tests for the bits
      that actually need testing.
      Signed-off-by: NeilBrown <neilb@suse.de>
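The difference between a bare test of the flags word and a test of specific bits can be shown with a small C sketch. The flag names, bit numbers and mask below are assumptions chosen for illustration, not the definitions in md.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit numbers are assumptions made for this sketch. */
enum {
    FLAG_CHANGE_DEVS    = 0,    /* metadata update needed: device change */
    FLAG_CHANGE_CLEAN   = 1,    /* metadata update needed: clean/dirty   */
    FLAG_CHANGE_PENDING = 2,    /* metadata update needed: pending write */
    FLAG_OTHER_STATE    = 4,    /* unrelated state bit added later       */
};

#define FLAG_BIT(n)     (1UL << (n))
#define UPDATE_SB_MASK  (FLAG_BIT(FLAG_CHANGE_DEVS) | \
                         FLAG_BIT(FLAG_CHANGE_CLEAN) | \
                         FLAG_BIT(FLAG_CHANGE_PENDING))

/* Bare test: any bit at all looks like "metadata update needed". */
bool needs_update_bare(unsigned long flags)
{
    return flags != 0;
}

/* Masked test: only the bits that really mean "update needed" count. */
bool needs_update_masked(unsigned long flags)
{
    return (flags & UPDATE_SB_MASK) != 0;
}
```

With only `FLAG_OTHER_STATE` set, the bare test reports a pending update while the masked test does not, which is why bare tests get in the way of adding new state bits.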
    • md: Fix apparent cut-and-paste error in super_90_validate · c9ad020f
      Dave Jones committed
      Setting a variable to itself probably wasn't the intention here.
      Signed-off-by: Dave Jones <davej@fedoraproject.org>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: fix safe_mode buglet. · 275c51c4
      NeilBrown committed
      When we set the safe_mode_timeout to a smaller value we trigger a timeout
      immediately - otherwise the small value might not be honoured.
      However if the previous timeout was 0, meaning "no timeout", we didn't.
      This would mean that no timeout happens until the next write completes,
      which could be a long time.
      Signed-off-by: NeilBrown <neilb@suse.de>
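The condition at issue can be written as a pair of small predicates. This is a sketch under assumptions: the function names are hypothetical, and a delay of 0 is taken to mean "no timeout" as the commit message states.

```c
#include <assert.h>
#include <stdbool.h>

/* A delay of 0 means "no timeout". Each function answers: after changing
 * the delay from old_delay to new_delay, should a timeout fire right away? */

/* Behaviour described as buggy: a previous delay of 0 never triggers. */
bool fire_now_before_fix(unsigned long old_delay, unsigned long new_delay)
{
    return new_delay != 0 && new_delay < old_delay;
}

/* Behaviour after the fix: "no timeout" counts as larger than any delay. */
bool fire_now_after_fix(unsigned long old_delay, unsigned long new_delay)
{
    return new_delay != 0 && (old_delay == 0 || new_delay < old_delay);
}
```

Moving from "no timeout" to any finite delay now fires immediately, rather than waiting for the next write to complete.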
    • md: don't call md_allow_write in get_bitmap_file. · 60559da4
      NeilBrown committed
      There is no real need, as GFP_NOIO is very likely sufficient,
      and failure is not catastrophic.
      
      Calling md_allow_write here will convert a read-auto array to
      read/write which could be confusing when you are just performing
      a read operation.
      Signed-off-by: NeilBrown <neilb@suse.de>
  3. 17 Aug, 2013 1 commit
  4. 25 Jul, 2013 2 commits
    • md/raid5: fix interaction of 'replace' and 'recovery'. · f94c0b66
      NeilBrown committed
      If a device in a RAID4/5/6 is being replaced while another is being
      recovered, then the writes to the replacement device currently don't
      happen, resulting in corruption when the replacement completes and the
      new drive takes over.
      
      This is because the replacement writes are only triggered when
      's.replacing' is set and not when the similar 's.sync' is set (which
      is the case during resync and recovery - it means all devices need to
      be read).
      
      So schedule those writes when s.sync is set as well.
      
      In this case we cannot use "STRIPE_INSYNC" to record that the
      replacement has happened as that is needed for recording that any
      parity calculation is complete.  So introduce STRIPE_REPLACED to
      record if the replacement has happened.
      
      For safety we should also check that STRIPE_COMPUTE_RUN is not set.
      This has a similar effect to the "s.locked == 0" test.  The latter
      ensures that no IO has been flagged but not started.  The former
      checks whether any parity calculation has been flagged but not started.
      We must wait for both of these to complete before triggering the
      'replace'.
      
      Add a similar test to the subsequent check for "are we finished yet".
      This possibly isn't needed (is subsumed in the STRIPE_INSYNC test),
      but it makes it more obvious that the REPLACE will happen before we
      think we are finished.
      
      Finally if a NeedReplace device is not UPTODATE then that is an
      error.  We really must trigger a warning.
      
      This bug was introduced in commit 9a3e1101
      (md/raid5:  detect and handle replacements during recovery.)
      which introduced replacement for raid5.
      That was in 3.3-rc3, so any stable kernel since then would benefit
      from this fix.
      
      Cc: stable@vger.kernel.org (3.3+)
      Reported-by: qindehua <13691222965@163.com>
      Tested-by: qindehua <qindehua@163.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: remove use-after-free bug. · 0eb25bb0
      NeilBrown committed
      We always need to be careful when calling generic_make_request, as it
      can start a chain of events which might free something that we are
      using.
      
      Here is one place I wasn't careful enough.  If the wbio2 is not in
      use, then it might get freed at the first generic_make_request call.
      So perform all necessary tests first.
      
      This bug was introduced in 3.3-rc3 (24afd80d) and can cause an
      oops, so fix is suitable for any -stable since then.
      
      Cc: stable@vger.kernel.org (3.3+)
      Signed-off-by: NeilBrown <neilb@suse.de>
  5. 18 Jul, 2013 3 commits
    • md/raid1: fix bio handling problems in process_checks() · 30bc9b53
      NeilBrown committed
      The recent change to use bio_copy_data() in raid1 when repairing
      an array is faulty.

      The underlying device may have changed the bio in various ways using
      bio_advance, and these changes need to be undone not just for the 'sbio'
      which is being copied to, but also for the 'pbio' (primary) which is being
      copied from.

      So perform the reset on all bios that were read from and do it early.

      This also ensures that the sbio->bi_io_vec[j].bv_len passed to
      memcmp is correct.
      
      This fixes a crash during a 'check' of a RAID1 array.  The crash was
      introduced in 3.10 so this is suitable for 3.10-stable.
      
      Cc: stable@vger.kernel.org (3.10)
      Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: Remove recent change which allows devices to skip recovery. · 5024c298
      NeilBrown committed
      commit 7ceb17e8
          md: Allow devices to be re-added to a read-only array.
      
      allowed a bit more than just that.  It also allows devices to be added
      to a read-write array and to end up skipping recovery.
      
      This patch removes the offending piece of code pending a rewrite for a
      subsequent release.
      
      More specifically:
       If the array has a bitmap, then the device will still need a bitmap
       based resync ('saved_raid_disk' is set under different conditions
       if a bitmap is present).
       If the array doesn't have a bitmap, then this is correct as long as
       nothing has been written to the array since the metadata was checked
       by ->validate_super.  However there is no locking to ensure that there
       was no write.
      
      Bug was introduced in 3.10 and causes data corruption so
      patch is suitable for 3.10-stable.
      
      Cc: stable@vger.kernel.org (3.10)
      Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid10: fix two problems with RAID10 resync. · 7bb23c49
      NeilBrown committed
      1/ When a difference between blocks is found, data is copied from
         one bio to the other.  However bv_len is used as the length to
         copy and this could be zero.  So use r10_bio->sectors to calculate
         the length instead.
         Using bv_len was probably always a bit dubious, but the introduction
         of bio_advance made it much more likely to be a problem.

      2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
         except on bios that we schedule for a read.  This ensures that
         missing/failed devices don't confuse the loop at the top of
         sync_request_write.
         Commit 8be185f2 "raid10: Use bio_reset()"
         removed a loop which set BIO_UPTODATE on all appropriate bios.
         So we need to re-add that flag.
      
      These bugs were introduced in 3.10, so this patch is suitable for
      3.10-stable, and can remove a potential for data corruption.
      
      Cc: stable@vger.kernel.org (3.10)
      Reported-by: Brassow Jonathan <jbrassow@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  6. 12 Jul, 2013 8 commits
    • bcache: Allocation kthread fixes · 79826c35
      Kent Overstreet committed
      The alloc kthread should've been using try_to_freeze() - and also there
      was the potential for the alloc kthread to get woken up after it had
      shut down, which would have been bad.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
    • bcache: Fix GC_SECTORS_USED() calculation · 29ebf465
      Kent Overstreet committed
      Part of the job of garbage collection is to add up however many sectors
      of live data it finds in each bucket, but that doesn't work very well if
      it doesn't reset GC_SECTORS_USED() when it starts. Whoops.
      
      This wouldn't have broken anything horribly, but allocation tries to
      preferentially reclaim buckets that are mostly empty and that's not
      gonna work with an incorrect GC_SECTORS_USED() value.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
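The effect of a missing reset can be reproduced with a toy model. This is only a sketch: `struct bucket_sketch` and `gc_pass()` are hypothetical, and a plain counter stands in for the per-bucket value that GC_SECTORS_USED() reads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bucket: one counter standing in for GC_SECTORS_USED(). */
struct bucket_sketch {
    unsigned sectors_used;
};

/* One garbage-collection pass over n buckets. live[i] is the number of
 * live sectors this pass finds in bucket i. With reset_first == 0 the
 * counter keeps whatever the previous pass left behind. */
void gc_pass(struct bucket_sketch *b, const unsigned *live, size_t n,
             int reset_first)
{
    for (size_t i = 0; i < n; i++) {
        if (reset_first)
            b[i].sectors_used = 0;
        b[i].sectors_used += live[i];
    }
}
```

Running two passes without the reset doubles the count for a bucket whose contents did not change, so a mostly empty bucket stops looking like a good candidate for reclaim.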
    • bcache: Journal replay fix · faa56736
      Kent Overstreet committed
      The journal replay code starts by finding something that looks like a
      valid journal entry, then it does a binary search over the unchecked
      region of the journal for the journal entries with the highest sequence
      numbers.
      
      Trouble is, the logic was wrong - journal_read_bucket() returns true if
      it found journal entries we need, but if the range of journal entries
      we're looking for loops around the end of the journal - in that case
      journal_read_bucket() could return true when it hadn't found the highest
      sequence number we'd seen yet, and in that case the binary search did
      the wrong thing. Whoops.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
    • bcache: Shutdown fix · 5caa52af
      Kent Overstreet committed
      Stopping a cache set is supposed to make it stop attached backing
      devices, but somewhere along the way that code got lost. Fixing this
      mainly has the effect of fixing our reboot notifier.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
    • bcache: Fix a sysfs splat on shutdown · c9502ea4
      Kent Overstreet committed
      If we stopped a bcache device when we were already detaching (or
      something like that), bcache_device_unlink() would try to remove a
      symlink from sysfs that was already gone because the bcache dev kobject
      had already been removed from sysfs.
      
      So keep track of whether we've removed stuff from sysfs.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
    • bcache: Advertise that flushes are supported · 54d12f2b
      Kent Overstreet committed
      Whoops - bcache's flush/FUA was mostly correct, but flushes get filtered
      out unless we say we support them...
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
    • bcache: check for allocation failures · d2a65ce2
      Dan Carpenter committed
      There is a missing NULL check after the kzalloc().
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
    • bcache: Fix a dumb race · 6aa8f1a6
      Kent Overstreet committed
      In the far-too-complicated closure code - closures can have destructors,
      for probably dubious reasons; they get run after the closure is no
      longer waiting on anything but before dropping the parent ref, intended
      just for freeing whatever memory the closure is embedded in.
      
      Trouble is, when remaining goes to 0 and we've got nothing more to run -
      we also have to unlock the closure, setting remaining to -1. If there's
      a destructor, that unlock isn't doing anything - nobody could be trying
      to lock it if we're about to free it - but if the unlock _is_ needed...
      that check for a destructor was racy. Argh.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
  7. 11 Jul, 2013 12 commits
  8. 04 Jul, 2013 2 commits
    • md/raid10: fix bug which causes all RAID10 reshapes to move no data. · 13765120
      NeilBrown committed
      The recent commit:
      commit 7e83ccbe
          md/raid10: Allow skipping recovery when clean arrays are assembled

      causes raid10 to skip a recovery in certain cases where it is safe to
      do so.  Unfortunately it also causes a reshape to be skipped, which is
      never safe.  The result is that an attempt to reshape a RAID10 will
      appear to complete instantly, but no data will have been moved so the
      array will now contain garbage.
      (If nothing is written, you can recover by simply performing the
      reverse reshape, which will also complete instantly).
      
      Bug was introduced in 3.10, so this is suitable for 3.10-stable.
      
      Cc: stable@vger.kernel.org (3.10)
      Cc: Martin Wilck <mwilck@arcor.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: allow 5-device RAID6 to be reshaped to 4-device. · fdcfbbb6
      NeilBrown committed
      There is a bug in 'check_reshape' for raid5.c.  It checks
      that the new minimum number of devices is large enough (which is
      good), but it does so also after the reshape has started (bad).
      
      This is bad because
       - the calculation is now wrong as mddev->raid_disks has changed
         already, and
       - it is pointless because it is now too late to stop.
      
      So only perform that test when reshape has not been committed to.
      Signed-off-by: NeilBrown <neilb@suse.de>
  9. 03 Jul, 2013 1 commit
    • md/raid10: fix two bugs affecting RAID10 reshape. · 78eaa0d4
      NeilBrown committed
      1/ If a RAID10 is being reshaped to fewer devices
       and is stopped while this is ongoing, then when the array is
       reassembled the 'mirrors' array will be allocated too small.
       This will lead to an access error or memory corruption.

      2/ A sanity test performed when a reshaping RAID10 array is restarted
       is slightly incorrect.
      
      Due to the first bug, this is suitable for any -stable
      kernel since 3.5 where this code was introduced.
      
      Cc: stable@vger.kernel.org (v3.5+)
      Signed-off-by: NeilBrown <neilb@suse.de>
  10. 02 Jul, 2013 4 commits
  11. 27 Jun, 2013 1 commit