1. 28 8月, 2013 3 次提交
    • S
      raid5: offload stripe handle to workqueue · 851c30c9
      Shaohua Li 提交于
      This is another attempt to create multiple threads to handle raid5 stripes.
      This time I use workqueue.
      
      raid5 handles request (especially write) in stripe unit. A stripe is page size
      aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
      state machine for the corresponding stripe, which includes reading some disks
      of the stripe, calculating parity, and writing some disks of the stripe. The
      state machine is running in raid5d thread currently. Since there is only one
      thread, it doesn't scale well for high speed storage. An obvious solution is
      multi-threading.
      
      To get better performance, we have some requirements:
      a. locality. stripe corresponding to request submitted from one cpu is better
      handled in thread in local cpu or local node. local cpu is preferred but some
      times could be a bottleneck, for example, parity calculation is too heavy.
      local node running has wide adaptability.
      b. configurablity. Different setup of raid5 array might need diffent
      configuration. Especially the thread number. More threads don't always mean
      better performance because of lock contentions.
      
      My original implementation is creating some kernel threads. There are
      interfaces to control which cpu's stripe each thread should handle. And
      userspace can set affinity of the threads. This provides biggest flexibility
      and configurability. But it's hard to use and apparently a new thread pool
      implementation is disfavor.
      
      Recent workqueue improvement is quite promising. unbound workqueue will be
      bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
      do affinity setting. For example, we can only include one HT sibling in
      affinity. Since work is non-reentrant by default, and we can control running
      thread number by limiting dispatched work_struct number.
      
      In this patch, I created several stripe worker group. A group is a numa node.
      stripes from cpus of one node will be added to a group list. Workqueue thread
      of one node will only handle stripes of worker group of the node. In this way,
      stripe handling has numa node locality. And as I said, we can control thread
      number by limiting dispatched work_struct number.
      
      The work_struct callback function handles several stripes in one run. A typical
      work queue usage is to run one unit in each work_struct. In raid5 case, the
      unit is a stripe. But we can't do that:
      a. Though handling a stripe doesn't need lock because of reference accounting
      and stripe isn't in any list, queuing a work_struct for each stripe will make
      workqueue lock contended very heavily.
      b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
      might dispatch request. If each work_struct only handles one stripe, such block
      plug is meaningless.
      
      This implementation can't do very fine grained configuration. But the numa
      binding is most popular usage model, should be enough for most workloads.
      
      Note: since we have only one stripe queue, switching to multi-thread might
      decrease request size dispatching down to low level layer. The impact depends
      on thread number, raid configuration and workload. So multi-thread raid5 might
      not be proper for all setups.
      
      Changes V1 -> V2:
      1. remove WQ_NON_REENTRANT
      2. disabling multi-threading by default
      3. Add more descriptions in changelog
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      851c30c9
    • S
      raid5: fix stripe release order · d265d9dc
      Shaohua Li 提交于
      patch "make release_stripe lockless" changes the order stripes are released.
      Originally I thought block layer can take care of request merge, but it appears
      there are still some requests not merged. It's easy to fix the order.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d265d9dc
    • S
      raid5: make release_stripe lockless · 773ca82f
      Shaohua Li 提交于
      release_stripe still has big lock contention. We just add the stripe to a llist
      without taking device_lock. We let the raid5d thread to do the real stripe
      release, which must hold device_lock anyway. In this way, release_stripe
      doesn't hold any locks.
      
      The side effect is the released stripes order is changed. But sounds not a big
      deal, stripes are never handled in order. And I thought block layer can already
      do nice request merge, which means order isn't that important.
      
      I kept the unplug release batch, which is unnecessary with this patch from lock
      contention avoid point of view, and actually if we delete it, the stripe_head
      release_list and lru can share storage. But the unplug release batch is also
      helpful for request merge. We probably can delay wakeup raid5d till unplug, but
      I'm still afraid of the case which raid5d is running.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      773ca82f
  2. 27 8月, 2013 5 次提交
    • N
      md: avoid deadlock when dirty buffers during md_stop. · 260fa034
      NeilBrown 提交于
      When the last process closes /dev/mdX sync_blockdev will be called so
      that all buffers get flushed.
      So if it is then opened for the STOP_ARRAY ioctl to be sent there will
      be nothing to flush.
      
      However if we open /dev/mdX in order to send the STOP_ARRAY ioctl just
      moments before some other process which was writing closes their file
      descriptor, then there won't be a 'last close' and the buffers might
      not get flushed.
      
      So do_md_stop() calls sync_blockdev().  However at this point it is
      holding ->reconfig_mutex.  So if the array is currently 'clean' then
      the writes from sync_blockdev() will not complete until the array
      can be marked dirty and that won't happen until some other thread
      can get ->reconfig_mutex.  So we deadlock.
      
      We need to move the sync_blockdev() call to before we take
      ->reconfig_mutex.
      However then some other thread could open /dev/mdX and write to it
      after we call sync_blockdev() and before we actually stop the array.
      This can leave dirty data in the page cache which is awkward.
      
      So introduce new flag MD_STILL_CLOSED.  Set it before calling
      sync_blockdev(), clear it if anyone does open the file, and abort the
      STOP_ARRAY attempt if it gets set before we lock against further
      opens.
      
      It is still possible to get problems if you open /dev/mdX, write to
      it, then issue the STOP_ARRAY ioctl.  Just don't do that.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      260fa034
    • N
      md: Don't test all of mddev->flags at once. · 7a0a5355
      NeilBrown 提交于
      mddev->flags is mostly used to record if an update of the
      metadata is needed.  Sometimes the whole field is tested
      instead of just the important bits.  This makes it difficult
      to introduce more state bits.
      
      So replace all bare tests of mddev->flags with tests for the bits
      that actually need testing.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      7a0a5355
    • D
      md: Fix apparent cut-and-paste error in super_90_validate · c9ad020f
      Dave Jones 提交于
      Setting a variable to itself probably wasn't the intention here.
      Signed-off-by: NDave Jones <davej@fedoraproject.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c9ad020f
    • N
      md: fix safe_mode buglet. · 275c51c4
      NeilBrown 提交于
      Whe we set the safe_mode_timeout to a smaller value we trigger a timeout
      immediately - otherwise the small value might not be honoured.
      However if the previous timeout was 0 meaning "no timeout", we didn't.
      This would mean that no timeout happens until the next write completes,
      which could be a long time.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      275c51c4
    • N
      md: don't call md_allow_write in get_bitmap_file. · 60559da4
      NeilBrown 提交于
      There is no really need as GFP_NOIO is very likely sufficient,
      and failure is not catastrophic.
      
      Calling md_allow_write here will convert a read-auto array to
      read/write which could be confusing when you are just performing
      a read operation.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      60559da4
  3. 17 8月, 2013 1 次提交
  4. 25 7月, 2013 2 次提交
    • N
      md/raid5: fix interaction of 'replace' and 'recovery'. · f94c0b66
      NeilBrown 提交于
      If a device in a RAID4/5/6 is being replaced while another is being
      recovered, then the writes to the replacement device currently don't
      happen, resulting in corruption when the replacement completes and the
      new drive takes over.
      
      This is because the replacement writes are only triggered when
      's.replacing' is set and not when the similar 's.sync' is set (which
      is the case during resync and recovery - it means all devices need to
      be read).
      
      So schedule those writes when s.replacing is set as well.
      
      In this case we cannot use "STRIPE_INSYNC" to record that the
      replacement has happened as that is needed for recording that any
      parity calculation is complete.  So introduce STRIPE_REPLACED to
      record if the replacement has happened.
      
      For safety we should also check that STRIPE_COMPUTE_RUN is not set.
      This has a similar effect to the "s.locked == 0" test.  The latter
      ensure that now IO has been flagged but not started.  The former
      checks if any parity calculation has been flagged by not started.
      We must wait for both of these to complete before triggering the
      'replace'.
      
      Add a similar test to the subsequent check for "are we finished yet".
      This possibly isn't needed (is subsumed in the STRIPE_INSYNC test),
      but it makes it more obvious that the REPLACE will happen before we
      think we are finished.
      
      Finally if a NeedReplace device is not UPTODATE then that is an
      error.  We really must trigger a warning.
      
      This bug was introduced in commit 9a3e1101
      (md/raid5:  detect and handle replacements during recovery.)
      which introduced replacement for raid5.
      That was in 3.3-rc3, so any stable kernel since then would benefit
      from this fix.
      
      Cc: stable@vger.kernel.org (3.3+)
      Reported-by: Nqindehua <13691222965@163.com>
      Tested-by: Nqindehua <qindehua@163.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      f94c0b66
    • N
      md/raid10: remove use-after-free bug. · 0eb25bb0
      NeilBrown 提交于
      We always need to be careful when calling generic_make_request, as it
      can start a chain of events which might free something that we are
      using.
      
      Here is one place I wasn't careful enough.  If the wbio2 is not in
      use, then it might get freed at the first generic_make_request call.
      So perform all necessary tests first.
      
      This bug was introduced in 3.3-rc3 (24afd80d) and can cause an
      oops, so fix is suitable for any -stable since then.
      
      Cc: stable@vger.kernel.org (3.3+)
      Signed-off-by: NNeilBrown <neilb@suse.de>
      0eb25bb0
  5. 18 7月, 2013 3 次提交
    • N
      md/raid1: fix bio handling problems in process_checks() · 30bc9b53
      NeilBrown 提交于
      Recent change to use bio_copy_data() in raid1 when repairing
      an array is faulty.
      
      The underlying may have changed the bio in various ways using
      bio_advance and these need to be undone not just for the 'sbio' which
      is being copied to, but also the 'pbio' (primary) which is being
      copied from.
      
      So perform the reset on all bios that were read from and do it early.
      
      This also ensure that the sbio->bi_io_vec[j].bv_len passed to
      memcmp is correct.
      
      This fixes a crash during a 'check' of a RAID1 array.  The crash was
      introduced in 3.10 so this is suitable for 3.10-stable.
      
      Cc: stable@vger.kernel.org (3.10)
      Reported-by: NJoe Lawrence <joe.lawrence@stratus.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      30bc9b53
    • N
      md: Remove recent change which allows devices to skip recovery. · 5024c298
      NeilBrown 提交于
      commit 7ceb17e8
          md: Allow devices to be re-added to a read-only array.
      
      allowed a bit more than just that.  It also allows devices to be added
      to a read-write array and to end up skipping recovery.
      
      This patch removes the offending piece of code pending a rewrite for a
      subsequent release.
      
      More specifically:
       If the array has a bitmap, then the device will still need a bitmap
       based resync ('saved_raid_disk' is set under different conditions
       is a bitmap is present).
       If the array doesn't have a bitmap, then this is correct as long as
       nothing has been written to the array since the metadata was checked
       by ->validate_super.  However there is no locking to ensure that there
       was no write.
      
      Bug was introduced in 3.10 and causes data corruption so
      patch is suitable for 3.10-stable.
      
      Cc: stable@vger.kernel.org (3.10)
      Reported-by: NJoe Lawrence <joe.lawrence@stratus.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5024c298
    • N
      md/raid10: fix two problems with RAID10 resync. · 7bb23c49
      NeilBrown 提交于
      1/ When an different between blocks is found, data is copied from
         one bio to the other.  However bv_len is used as the length to
         copy and this could be zero.  So use r10_bio->sectors to calculate
         length instead.
         Using bv_len was probably always a bit dubious, but the introduction
         of bio_advance made it much more likely to be a problem.
      
      2/ When preparing some blocks for sync, we don't set BIO_UPTODATE
         except on bios that we schedule for a read.  This ensures that
         missing/failed devices don't confuse the loop at the top of
         sync_request write.
         Commit 8be185f2 "raid10: Use bio_reset()"
         removed a loop which set BIO_UPTDATE on all appropriate bios.
         So we need to re-add that flag.
      
      These bugs were introduced in 3.10, so this patch is suitable for
      3.10-stable, and can remove a potential for data corruption.
      
      Cc: stable@vger.kernel.org (3.10)
      Reported-by: NBrassow Jonathan <jbrassow@redhat.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      7bb23c49
  6. 12 7月, 2013 8 次提交
    • K
      bcache: Allocation kthread fixes · 79826c35
      Kent Overstreet 提交于
      The alloc kthread should've been using try_to_freeze() - and also there
      was the potential for the alloc kthread to get woken up after it had
      shut down, which would have been bad.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      79826c35
    • K
      bcache: Fix GC_SECTORS_USED() calculation · 29ebf465
      Kent Overstreet 提交于
      Part of the job of garbage collection is to add up however many sectors
      of live data it finds in each bucket, but that doesn't work very well if
      it doesn't reset GC_SECTORS_USED() when it starts. Whoops.
      
      This wouldn't have broken anything horribly, but allocation tries to
      preferentially reclaim buckets that are mostly empty and that's not
      gonna work with an incorrect GC_SECTORS_USED() value.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
      29ebf465
    • K
      bcache: Journal replay fix · faa56736
      Kent Overstreet 提交于
      The journal replay code starts by finding something that looks like a
      valid journal entry, then it does a binary search over the unchecked
      region of the journal for the journal entries with the highest sequence
      numbers.
      
      Trouble is, the logic was wrong - journal_read_bucket() returns true if
      it found journal entries we need, but if the range of journal entries
      we're looking for loops around the end of the journal - in that case
      journal_read_bucket() could return true when it hadn't found the highest
      sequence number we'd seen yet, and in that case the binary search did
      the wrong thing. Whoops.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
      faa56736
    • K
      bcache: Shutdown fix · 5caa52af
      Kent Overstreet 提交于
      Stopping a cache set is supposed to make it stop attached backing
      devices, but somewhere along the way that code got lost. Fixing this
      mainly has the effect of fixing our reboot notifier.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
      5caa52af
    • K
      bcache: Fix a sysfs splat on shutdown · c9502ea4
      Kent Overstreet 提交于
      If we stopped a bcache device when we were already detaching (or
      something like that), bcache_device_unlink() would try to remove a
      symlink from sysfs that was already gone because the bcache dev kobject
      had already been removed from sysfs.
      
      So keep track of whether we've removed stuff from sysfs.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
      c9502ea4
    • K
      bcache: Advertise that flushes are supported · 54d12f2b
      Kent Overstreet 提交于
      Whoops - bcache's flush/FUA was mostly correct, but flushes get filtered
      out unless we say we support them...
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
      54d12f2b
    • D
      bcache: check for allocation failures · d2a65ce2
      Dan Carpenter 提交于
      There is a missing NULL check after the kzalloc().
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      d2a65ce2
    • K
      bcache: Fix a dumb race · 6aa8f1a6
      Kent Overstreet 提交于
      In the far-too-complicated closure code - closures can have destructors,
      for probably dubious reasons; they get run after the closure is no
      longer waiting on anything but before dropping the parent ref, intended
      just for freeing whatever memory the closure is embedded in.
      
      Trouble is, when remaining goes to 0 and we've got nothing more to run -
      we also have to unlock the closure, setting remaining to -1. If there's
      a destructor, that unlock isn't doing anything - nobody could be trying
      to lock it if we're about to free it - but if the unlock _is needed...
      that check for a destructor was racy. Argh.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: linux-stable <stable@vger.kernel.org> # >= v3.10
      6aa8f1a6
  7. 11 7月, 2013 12 次提交
  8. 04 7月, 2013 2 次提交
    • N
      md/raid10: fix bug which causes all RAID10 reshapes to move no data. · 13765120
      NeilBrown 提交于
      The recent comment:
      commit 7e83ccbe
          md/raid10: Allow skipping recovery when clean arrays are assembled
      
      Causes raid10 to skip a recovery in certain cases where it is safe to
      do so.  Unfortunately it also causes a reshape to be skipped which is
      never safe.  The result is that an attempt to reshape a RAID10 will
      appear to complete instantly, but no data will have been moves so the
      array will now contain garbage.
      (If nothing is written, you can recovery by simple performing the
      reverse reshape which will also complete instantly).
      
      Bug was introduced in 3.10, so this is suitable for 3.10-stable.
      
      Cc: stable@vger.kernel.org (3.10)
      Cc: Martin Wilck <mwilck@arcor.de>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      13765120
    • N
      md/raid5: allow 5-device RAID6 to be reshaped to 4-device. · fdcfbbb6
      NeilBrown 提交于
      There is a bug in 'check_reshape' for raid5.c  To checks
      that the new minimum number of devices is large enough (which is
      good), but it does so also after the reshape has started (bad).
      
      This is bad because
       - the calculation is now wrong as mddev->raid_disks has changed
         already, and
       - it is pointless because it is now too late to stop.
      
      So only perform that test when reshape has not been committed to.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      fdcfbbb6
  9. 03 7月, 2013 1 次提交
    • N
      md/raid10: fix two bugs affecting RAID10 reshape. · 78eaa0d4
      NeilBrown 提交于
      1/ If a RAID10 is being reshaped to a fewer number of devices
       and is stopped while this is ongoing, then when the array is
       reassembled the 'mirrors' array will be allocated too small.
       This will lead to an access error or memory corruption.
      
      2/ A sanity test for a reshaping RAID10 array is restarted
       is slightly incorrect.
      
      Due to the first bug, this is suitable for any -stable
      kernel since 3.5 where this code was introduced.
      
      Cc: stable@vger.kernel.org (v3.5+)
      Signed-off-by: NNeilBrown <neilb@suse.de>
      78eaa0d4
  10. 02 7月, 2013 3 次提交