1. 02 Oct 2014, 1 commit
    • md/raid5: disable 'DISCARD' by default due to safety concerns. · 8e0e99ba
      Committed by NeilBrown
      It has come to my attention (thanks Martin) that 'discard_zeroes_data'
      is only a hint.  Some devices in some cases don't do what it
      says on the label.
      
      The use of DISCARD in RAID5 depends on reads from discarded regions
      being predictably zero.  If a write to a previously discarded region
      performs a read-modify-write cycle it assumes that the parity block
      was consistent with the data blocks.  If all were zero, this would
      be the case.  If some are and some aren't, this would not be the case.
      This could lead to data corruption after a device failure when
      data needs to be reconstructed from the parity.
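
      As a toy illustration of that arithmetic (a user-space sketch, not the
      raid5 code): parity is the XOR of the data blocks, so if every discarded
      block really reads back as zero, the parity is zero and consistent; if
      one block returns stale data, the parity assumed by a read-modify-write
      is wrong.

      	#include <stdint.h>
      	#include <stdio.h>

      	int main(void)
      	{
      		/* discard really zeroed everything: parity is consistent */
      		uint8_t d0 = 0x00, d1 = 0x00, d2 = 0x00;
      		uint8_t parity = d0 ^ d1 ^ d2;     /* == 0x00 */

      		/* discard was only a hint: one block returns stale data */
      		uint8_t stale = 0x5a;
      		uint8_t actual = d0 ^ stale ^ d2;  /* != parity assumed by RMW */

      		printf("assumed %#04x, actual %#04x\n", parity, actual);
      		return 0;
      	}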
      
      As we cannot trust 'discard_zeroes_data', ignore it by default
      and so disallow DISCARD on all raid4/5/6 arrays.
      
      As many devices are trustworthy, and as there are benefits to using
      DISCARD, add a module parameter to override this caution and cause
      DISCARD to work if discard_zeroes_data is set.
      
      If a site wants to enable DISCARD on some arrays but not others, it
      should select DISCARD support at the filesystem level and set the
      raid456 module parameter:
          raid456.devices_handle_discard_safely=Y
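
      A minimal sketch of how such a module parameter is typically declared
      (the parameter name comes from the message above; the exact declaration
      in raid5.c is an assumption):

      	#include <linux/module.h>

      	/* off by default; writable via
      	 * /sys/module/raid456/parameters/devices_handle_discard_safely */
      	static bool devices_handle_discard_safely = false;
      	module_param(devices_handle_discard_safely, bool, 0644);
      	MODULE_PARM_DESC(devices_handle_discard_safely,
      		"Set to Y if all devices in each array reliably return zeroes on reads from discarded regions");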
      
      As this is a data-safety issue, I believe this patch is suitable for
      -stable.
      DISCARD support for RAID456 was added in 3.7.
      
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Heinz Mauelshagen <heinzm@redhat.com>
      Cc: stable@vger.kernel.org (3.7+)
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Fixes: 620125f2
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 18 Aug 2014, 2 commits
    • md/raid6: avoid data corruption during recovery of double-degraded RAID6 · 9c4bdf69
      Committed by NeilBrown
      During recovery of a double-degraded RAID6 it is possible for
      some blocks not to be recovered properly, leading to corruption.
      
      If a write happens to one block in a stripe that would be written to a
      missing device, and at the same time that stripe is recovering data
      to the other missing device, then that recovered data may not be written.
      
      This patch skips, in the double-degraded case, an optimisation that is
      only safe for single-degraded arrays.
      
      The bug was introduced in 2.6.32 and the fix is suitable for any kernel
      since then.  In an older kernel with separate handle_stripe5() and
      handle_stripe6() functions the patch must change handle_stripe6().
      
      Cc: stable@vger.kernel.org (2.6.32+)
      Fixes: 6c0069c0
      Cc: Yuri Tikhonov <yur@emcraft.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in>
      Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in>
      Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
    • md/raid5: avoid livelock caused by non-aligned writes. · a40687ff
      Committed by NeilBrown
      If a stripe in a raid6 array receives a write to each data block while
      the array is degraded, and if any of these writes to a missing device
      are not page-aligned, then a livelock happens.
      
      In this case the P and Q blocks need to be read so that the part of
      the missing block which is *not* being updated by the write can be
      constructed.  Due to a logic error, these blocks are not loaded, so
      the update cannot proceed and the stripe is 'handled' repeatedly in an
      infinite loop.
      
      This bug is unlikely to be hit, as most writes are page aligned.
      However, as it can lead to a livelock, it is suitable for -stable.  It
      was introduced in 3.16.
      
      Cc: stable@vger.kernel.org (v3.16)
      Fixes: 67f45548
      Signed-off-by: NeilBrown <neilb@suse.de>
  3. 10 Jun 2014, 1 commit
    • raid5: speedup sync_request processing · 053f5b65
      Committed by Eivind Sarto
      The raid5 sync_request() processing calls handle_stripe() within the context of
      the resync-thread.  The resync-thread issues the first set of read requests
      and this adds execution latency and slows down the scheduling of the next
      sync_request().
      The current rebuild/resync speed of raid5 is not much faster than what
      rotational HDDs can sustain.
      Testing the following patch on a 6-drive array, I can increase the rebuild
      speed from 100 MB/s to 175 MB/s.
      The sync_request() now just sets STRIPE_HANDLE and releases the stripe.  This
      creates some more parallelism between the resync-thread and raid5 kernel daemon.
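
      A sketch of the simplified tail of sync_request() as described (the flag
      and helper are the ones named above; the exact code is an assumption):

      	/* hand the stripe to the raid5 daemon instead of calling
      	 * handle_stripe() in the resync thread */
      	set_bit(STRIPE_HANDLE, &sh->state);
      	release_stripe(sh);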
      Signed-off-by: Eivind Sarto <esarto@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  4. 05 Jun 2014, 1 commit
    • md/raid5: deadlock between retry_aligned_read with barrier io · 2844dc32
      Committed by hui jiao
      A chunk aligned read increases the counter active_aligned_reads and
      decreases it after the sub-device handles it successfully.  But when a
      read error occurs, the read is redispatched by raid5d, and
      active_aligned_reads will not be decreased until we can grab a stripe
      head in retry_aligned_read.  Now suppose a barrier io comes, sets
      conf->quiesce to 2, and waits until both active_stripes and
      active_aligned_reads are zero.  The retried chunk aligned read gets
      stuck at get_active_stripe, waiting until conf->quiesce becomes 0.
      retry_aligned_read and the barrier io are now waiting for each other.
      One possible solution is to ignore conf->quiesce and let the retried
      aligned read finish.  I reproduced this deadlock and tested this patch
      on CentOS 6.0.
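
      A sketch of that solution (the trailing 'noquiesce' argument to
      get_active_stripe() is an assumption about the exact interface):

      	/* in retry_aligned_read(): bypass the quiesce wait so the retried
      	 * read can finish and drop active_aligned_reads */
      	sh = get_active_stripe(conf, sector, 0, 1, /* noquiesce */ 1);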
      Signed-off-by: NeilBrown <neilb@suse.de>
  5. 29 May 2014, 3 commits
    • raid5: add an option to avoid copy data from bio to stripe cache · d592a996
      Committed by Shaohua Li
      The stripe cache has two goals:
      1. cache data, so next time if data can be found in stripe cache, disk access
      can be avoided.
      2. stable data.  Data is copied from the bio to the stripe cache before
      parity is calculated.  Data written to disk comes from the stripe cache,
      so if the upper layer changes the bio data, the data written to disk
      isn't affected.
      
      In my environment, I can guarantee 2 will not happen, and
      BDI_CAP_STABLE_WRITES can guarantee it too.  Case 1 is not common
      either: the block plug mechanism dispatches a bunch of sequential small
      requests together, and since I'm using SSDs I'm using a small chunk
      size.  It is rare for the stripe cache to be really useful.
      
      So I'd like to avoid the copy from bio to stripe cache, which is very
      helpful for performance.  In my 1M randwrite tests, avoiding the copy
      increases performance by more than 30%.
      
      Of course, this shouldn't be enabled by default.  It has been reported
      before that enabling BDI_CAP_STABLE_WRITES can harm some workloads, so
      I added an option to control it.
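
      A sketch of what such a per-array md sysfs entry could look like (the
      name 'skip_copy' and the handler names are assumptions, following the
      usual md_sysfs_entry pattern):

      	static struct md_sysfs_entry raid5_skip_copy =
      	__ATTR(skip_copy, S_IRUGO | S_IWUSR,
      	       raid5_show_skip_copy, raid5_store_skip_copy);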
      
      Neilb:
        changed BUG_ON to WARN_ON
        Removed some assignments from raid5_build_block which are now not needed.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: avoid release list until last reference of the stripe · cf170f3f
      Committed by Eivind Sarto
      The (lockless) release_list reduces lock contention, but there is excessive
      queueing and dequeuing of stripes on this list.  A stripe will currently be
      queued on the release_list with a stripe reference count > 1.  This can cause
      the raid5 kernel thread(s) to dequeue the stripe and decrement the refcount
      without doing any other useful processing of the stripe.  There are two cases
      when the stripe can be put on the release_list multiple times before it is
      actually handled by the kernel thread(s).
      1) make_request() activates the stripe processing in 4k increments.  When a
         write request is large enough to span multiple chunks of a stripe_head, the
         first 4k chunk adds the stripe to the plug list.  The next 4k chunk that is
         processed for the same stripe puts the stripe on the release_list with a
         refcount=2.  This can cause the kernel thread to process and decrement the
         stripe before the stripe is unplugged, which again will put it back on the
         release_list.
      2) Whenever IO is scheduled on a stripe (pre-read and/or write), the stripe
         refcount is set to the number of active IO (for each chunk).  The stripe is
         released as each IO completes, and can be queued and dequeued multiple times
         on the release_list, until its refcount finally reaches zero.
      
      This simple patch will ensure a stripe is only queued on the release_list when
      its refcount=1 and is ready to be handled by the kernel thread(s).  I added some
      instrumentation to raid5 and counted the number of times stripes were queued on
      the release_list for a variety of write IO sizes.  Without this patch the number
      of times stripes got queued on the release_list was 100-500% higher than with
      the patch.  The excess queuing will increase with the IO size.  The patch also
      improved throughput by 5-10%.
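
      A sketch of the fast path this implies (assumed shape, not the exact
      patch): the refcount is dropped locklessly unless it is the last
      reference, so only a stripe about to reach refcount zero is queued.

      	static void release_stripe(struct stripe_head *sh)
      	{
      		/* fast path: drop a reference unless it is the last one */
      		if (atomic_add_unless(&sh->count, -1, 1))
      			return;
      		/* last reference: queue on the release_list (or take the
      		 * locked slow path) so the kernel thread sees the stripe
      		 * exactly once */
      	}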
      Signed-off-by: Eivind Sarto <esarto@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid56: Don't perform reads to support writes until stripe is ready. · 67f45548
      Committed by NeilBrown
      If it is found that we need to pre-read some blocks before a write
      can succeed, we normally set STRIPE_DELAYED and don't actually perform
      the read until STRIPE_PREREAD_ACTIVE subsequently gets set.
      
      However for a degraded RAID6 we currently perform the reads as soon
      as we see that a write is pending.  This significantly hurts
      throughput.
      
      So:
       - when handle_stripe_dirtying finds a block that it wants on a device
         that is failed, set STRIPE_DELAYED, instead of doing nothing, and
       - when fetch_block detects that a read might be required to satisfy a
         write, only perform the read if STRIPE_PREREAD_ACTIVE is set,
         and if we would actually need to read something to complete the write.
      
      This also helps RAID5, though less often as RAID5 supports a
      read-modify-write cycle.  For RAID5 the read is performed too early
      only if the write is not a full 4K aligned write (i.e. not an
      R5_OVERWRITE).
      
      Also clean up a couple of horrible bits of formatting.
      Reported-by: Patrik Horník <patrik@dsl.sk>
      Signed-off-by: NeilBrown <neilb@suse.de>
  6. 18 Apr 2014, 1 commit
  7. 17 Apr 2014, 1 commit
  8. 09 Apr 2014, 2 commits
    • raid5: get_active_stripe avoids device_lock · e240c183
      Committed by Shaohua Li
      For sequential workloads (or workloads with big request sizes),
      get_active_stripe() can find a cached stripe.  In this case we always
      hold device_lock, which exposes a lot of lock contention for such
      workloads.  If the stripe count isn't 0, we don't actually need to hold
      the lock, since we just increase the count.  And this is the hot code
      path for such workloads.  Unfortunately we must delete the BUG_ON.
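
      A sketch of the resulting hot path (assumed form): take an extra
      reference without device_lock whenever the count is already non-zero,
      falling back to the locked path otherwise.

      	sh = __find_stripe(conf, sector, conf->generation);
      	if (sh && !atomic_inc_not_zero(&sh->count)) {
      		/* count was zero: fall back to the locked slow path */
      		spin_lock(&conf->device_lock);
      		/* ... re-check sh->count, fix up ->lru, increment ... */
      		spin_unlock(&conf->device_lock);
      	}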
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: make_request does less prepare wait · 27c0f68f
      Committed by Shaohua Li
      On a NUMA machine, prepare_to_wait/finish_wait in make_request expose a
      lot of contention for sequential workloads (or workloads with big
      request sizes).  For such workloads each bio includes several stripes,
      so we can just do prepare_to_wait/finish_wait once for the whole bio
      instead of for every stripe.  This completely removes the lock
      contention for such workloads.  Random workloads might have similar
      lock contention too, but I haven't seen it yet, maybe because my
      storage is still not fast enough.
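
      A sketch of the idea (the wait queue name follows raid5; the re-arm flag
      is an assumption): arm the wait entry once for the bio and only re-arm
      it after actually sleeping.

      	DEFINE_WAIT(w);
      	bool do_prepare = true;

      	while (logical_sector < last_sector) {	/* one pass per stripe */
      		if (do_prepare)
      			prepare_to_wait(&conf->wait_for_overlap, &w,
      					TASK_UNINTERRUPTIBLE);
      		do_prepare = false;
      		/* on contention: schedule(); do_prepare = true; continue; */
      		logical_sector += STRIPE_SECTORS;
      	}
      	finish_wait(&conf->wait_for_overlap, &w);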
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  9. 13 Feb 2014, 1 commit
    • md/raid5: Fix CPU hotplug callback registration · 789b5e03
      Committed by Oleg Nesterov
      Subsystems that want to register CPU hotplug callbacks, as well as perform
      initialization for the CPUs that are already online, often do it as shown
      below:
      
      	get_online_cpus();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	register_cpu_notifier(&foobar_cpu_notifier);
      
      	put_online_cpus();
      
      This is wrong, since it is prone to ABBA deadlocks involving the
      cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
      with CPU hotplug operations).
      
      Interestingly, the raid5 code can actually prevent double initialization and
      hence can use the following simplified form of callback registration:
      
      	register_cpu_notifier(&foobar_cpu_notifier);
      
      	get_online_cpus();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	put_online_cpus();
      
      A hotplug operation that occurs between registering the notifier and calling
      get_online_cpus() won't disrupt anything, because the code takes care to
      perform the memory allocations only once.
      
      So reorganize the code in raid5 this way to fix the deadlock with callback
      registration.
      
      Cc: linux-raid@vger.kernel.org
      Cc: stable@vger.kernel.org (v2.6.32+)
      Fixes: 36d1c647
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      [Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
      free_scratch_buffer() helper to condense code further and wrote the changelog.]
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  10. 22 Jan 2014, 1 commit
    • md/raid5: close recently introduced race in stripe_head management. · 7da9d450
      Committed by NeilBrown
      As release_stripe and __release_stripe decrement ->count and then
      manipulate ->lru both under ->device_lock, it is important that
      get_active_stripe() increments ->count and clears ->lru also under
      ->device_lock.
      
      However we currently list_del_init ->lru under the lock, but increment
      the ->count outside the lock.  This can lead to races and list
      corruption.
      
      So move the atomic_inc(&sh->count) up inside the ->device_lock
      protected region.
      
      Note that we still increment ->count without device lock in the case
      where get_free_stripe() was called, and in fact don't take
      ->device_lock at all in that path.
      This is safe because if the stripe_head can be found by
      get_free_stripe, then the hash lock assures us that no-one else could
      possibly be calling release_stripe() at the same time.
      
      Fixes: 566c09c5
      Cc: stable@vger.kernel.org (3.13)
      Reported-and-tested-by: Ian Kumlien <ian.kumlien@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  11. 16 Jan 2014, 1 commit
    • md/raid5: fix long-standing problem with bitmap handling on write failure. · 9f97e4b1
      Committed by NeilBrown
      Before a write starts we set a bit in the write-intent bitmap.
      When the write completes we clear that bit if the write was successful
      to all devices.  However if the write wasn't fully successful we
      should not clear the bit.  If the faulty drive is subsequently
      re-added, the fact that the bit is still set ensures that we will
      re-write the data that is missing.
      
      This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
      bitmap bit when this flag is not set.
      Currently we correctly set the flag if a write starts when some
      devices are failed or missing.  But we do *not* set the flag if some
      device failed during the write attempt.
      This is wrong and can result in clearing the bit inappropriately.
      
      So: set the flag when a write fails.
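
      A sketch of the one-line idea (its placement in the write-completion
      path, raid5_end_write_request, is an assumption):

      	/* a device failed during the write: keep the bitmap bit set by
      	 * marking the stripe degraded */
      	if (!uptodate)
      		set_bit(STRIPE_DEGRADED, &sh->state);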
      
      This bug has been present since bitmaps were introduced, so the fix is
      suitable for any -stable kernel.
      Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
  12. 14 Jan 2014, 2 commits
    • md/raid5: fix a recently broken BUG_ON(). · 5af9bef7
      Committed by NeilBrown
      commit 6d183de4
          md/raid5: fix newly-broken locking in get_active_stripe.
      
      simplified a BUG_ON, but removed too much so now it sometimes fires
      when it shouldn't.
      
      When the STRIPE_EXPANDING flag is set, the stripe_head might be on a
      special list while multiple stripe_heads are collected, or it might
      not be on any list, even a 'free' list when the refcount is zero.  As
      long as STRIPE_EXPANDING is set, it will be found and added back to a
      list eventually.
      
      So both of the BUG_ONs which test for the ->lru being empty or not
      need to avoid the case where STRIPE_EXPANDING is set.
      
      The patch which broke this was marked for -stable, so this patch needs
      to be applied to any branch that received 6d183de4.
      
      Fixes: 6d183de4
      Cc: stable@vger.kernel.org (any release to which above was applied)
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: Fix possible confusion when multiple write errors occur. · 1cc03eb9
      Committed by NeilBrown
      commit 5d8c71f9
          md: raid5 crash during degradation
      
      Fixed a crash in an overly simplistic way which could leave
      R5_WriteError or R5_MadeGood set in the stripe cache for devices
      for which they are no longer relevant.
      When those devices are removed and spares added the flags are still
      set and can cause incorrect behaviour.
      
      commit 14a75d3e
          md/raid5: preferentially read from replacement device if possible.
      
      Fixed the same bug in a more effective way, so we can now revert
      the original commit.
      Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Cc: stable@vger.kernel.org (3.2+ - 3.2 will need a different fix though)
      Fixes: 5d8c71f9
      Signed-off-by: NeilBrown <neilb@suse.de>
  13. 09 Jan 2014, 1 commit
    • bcache/md: Use raid stripe size · c78afc62
      Committed by Kent Overstreet
      Now that we've got code for raid5/6 stripe awareness, bcache just needs
      to know about the stripes and when writing partial stripes is expensive
      - we probably don't want to enable this optimization for raid1 or 10,
      even though they have stripes. So add a flag to queue_limits.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
  14. 28 Nov 2013, 2 commits
    • md/raid5: fix newly-broken locking in get_active_stripe. · 6d183de4
      Committed by NeilBrown
      commit 566c09c5 raid5: relieve lock contention in get_active_stripe()
      
      modified the locking in get_active_stripe(), reducing the range
      protected by the (highly contended) device_lock.
      Unfortunately it reduced the range too much, opening up some races.
      
      One race can occur if get_priority_stripe runs between the
      test on sh->count and device_lock being taken.
      This will mean that sh->lru is not empty while get_active_stripe
      thinks ->count is zero, resulting in a 'BUG' firing.
      
      Another race happens if __release_stripe is called immediately
      after sh->count is tested and found to be non-zero.  If STRIPE_HANDLE
      is not set, get_active_stripe should increment ->active_stripes
      when it increments ->count from 0, but as it didn't think it was 0,
      it doesn't.
      
      Extending device_lock to cover the test on sh->count closes these
      races.
      
      While we are here, fix the two BUG tests:
       -If count is zero, then lru really must not be empty, or we've
        lost the stripe_head somehow - no other tests are relevant.
       -STRIPE_ON_RELEASE_LIST is completely independent of ->lru so
        testing it is pointless.
      Reported-and-tested-by: Brassow Jonathan <jbrassow@redhat.com>
      Reviewed-by: Shaohua Li <shli@kernel.org>
      Fixes: 566c09c5
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: fix new memory-reference bug in alloc_thread_groups. · 0c775d52
      Committed by NeilBrown
      In alloc_thread_groups, worker_groups is a pointer to an array,
      not an array of pointers.
      So
         worker_groups[i]
      is wrong.  It should be
         &(*worker_groups)[i]
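
      A toy stand-alone illustration of why the two expressions differ (plain
      C, not the md code):

      	#include <stdio.h>

      	int main(void)
      	{
      		int groups[4] = { 10, 11, 12, 13 };
      		int (*worker_groups)[4] = &groups;  /* pointer to an array */

      		/* worker_groups[1] would be a second whole array past the
      		 * pointer - out of bounds here.  The element we want is
      		 * &(*worker_groups)[1]. */
      		printf("%d\n", (*worker_groups)[1]);  /* prints 11 */
      		return 0;
      	}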
      
      Found-by: coverity
      Fixes: 60aaf933
      Reported-by: Ben Hutchings <bhutchings@solarflare.com>
      Cc: majianpeng <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  15. 24 Nov 2013, 2 commits
    • block: Convert bio_for_each_segment() to bvec_iter · 7988613b
      Committed by Kent Overstreet
      More prep work for immutable biovecs - with immutable bvecs drivers
      won't be able to use the biovec directly, they'll need to use helpers
      that take into account bio->bi_iter.bi_bvec_done.
      
      This updates callers for the new usage without changing the
      implementation yet.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Paul Clements <Paul.Clements@steeleye.com>
      Cc: Jim Paris <jim@jtan.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
      Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
      Cc: support@lsi.com
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: drbd-user@lists.linbit.com
      Cc: nbd-general@lists.sourceforge.net
      Cc: cbe-oss-dev@lists.ozlabs.org
      Cc: xen-devel@lists.xensource.com
      Cc: virtualization@lists.linux-foundation.org
      Cc: linux-raid@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Cc: DL-MPTFusionLinux@lsi.com
      Cc: linux-scsi@vger.kernel.org
      Cc: devel@driverdev.osuosl.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: cluster-devel@redhat.com
      Cc: linux-mm@kvack.org
      Acked-by: Geoff Levand <geoff@infradead.org>
    • block: Abstract out bvec iterator · 4f024f37
      Committed by Kent Overstreet
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>
  16. 19 Nov 2013, 6 commits
    • md/raid5: Use conf->device_lock protect changing of multi-thread resources. · 60aaf933
      Committed by majianpeng
      When we change group_thread_cnt via the sysfs entry, the kernel can oops.
      
      The kernel messages are:
      [  135.299021] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [  135.299073] IP: [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.299107] PGD 0
      [  135.299122] Oops: 0000 [#1] SMP
      [  135.299144] Modules linked in: netconsole e1000e ptp pps_core
      [  135.299188] CPU: 3 PID: 2225 Comm: md0_raid5 Not tainted 3.12.0+ #24
      [  135.299214] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
      [  135.299255] task: ffff8800b9638f80 ti: ffff8800b77a4000 task.ti: ffff8800b77a4000
      [  135.299283] RIP: 0010:[<ffffffff815188ab>]  [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.299323] RSP: 0018:ffff8800b77a5c48  EFLAGS: 00010002
      [  135.299344] RAX: ffff880037bb5c70 RBX: 0000000000000000 RCX: 0000000000000008
      [  135.299371] RDX: ffff880037bb5cb8 RSI: 0000000000000001 RDI: ffff880037bb5c00
      [  135.299398] RBP: ffff8800b77a5d08 R08: 0000000000000001 R09: 0000000000000000
      [  135.299425] R10: ffff8800b77a5c98 R11: 00000000ffffffff R12: ffff880037bb5c00
      [  135.299452] R13: 0000000000000000 R14: 0000000000000000 R15: ffff880037bb5c70
      [  135.299479] FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
      [  135.299510] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  135.299532] CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 00000000000407e0
      [  135.299559] Stack:
      [  135.299570]  ffff8800b77a5c88 ffffffff8107383e ffff8800b77a5c88 ffff880037a64300
      [  135.299611]  000000000000ec08 ffff880037bb5cb8 ffff8800b77a5c98 ffffffffffffffd8
      [  135.299654]  000000000000ec08 ffff880037bb5c60 ffff8800b77a5c98 ffff8800b77a5c98
      [  135.299696] Call Trace:
      [  135.299711]  [<ffffffff8107383e>] ? __wake_up+0x4e/0x70
      [  135.299733]  [<ffffffff81518f88>] raid5d+0x4c8/0x680
      [  135.299756]  [<ffffffff817174ed>] ? schedule_timeout+0x15d/0x1f0
      [  135.299781]  [<ffffffff81524c9f>] md_thread+0x11f/0x170
      [  135.299804]  [<ffffffff81069cd0>] ? wake_up_bit+0x40/0x40
      [  135.299826]  [<ffffffff81524b80>] ? md_rdev_init+0x110/0x110
      [  135.299850]  [<ffffffff81069656>] kthread+0xc6/0xd0
      [  135.299871]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  135.299899]  [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
      [  135.299923]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  135.299951] Code: ff ff ff 0f 84 d7 fe ff ff e9 5c fe ff ff 66 90 41 8b b4 24 d8 01 00 00 45 31 ed 85 f6 0f 8e 7b fd ff ff 49 8b 9c 24 d0 01 00 00 <48> 3b 1b 49 89 dd 0f 85 67 fd ff ff 48 8d 43 28 31 d2 eb 17 90
      [  135.300005] RIP  [<ffffffff815188ab>] handle_active_stripes+0x32b/0x440
      [  135.300005]  RSP <ffff8800b77a5c48>
      [  135.300005] CR2: 0000000000000000
      [  135.300005] ---[ end trace 504854e5bb7562ed ]---
      [  135.300005] Kernel panic - not syncing: Fatal exception
      
      This is because raid5d() can be running when the multi-thread
      resources are changed via sysfs.  So we need to provide locking.
      
      mddev->device_lock is suitable, but we cannot simply call
      alloc_thread_groups under this lock as we cannot allocate memory
      while holding a spinlock.
      So change alloc_thread_groups() to allocate and return the data
      structures, then raid5_store_group_thread_cnt() can take the lock
      while updating the pointers to the data structures.
      
      This fixes a bug introduced in 3.12 and so is suitable for the 3.12.x
      stable series.
      
      Fixes: b721420e
      Cc: stable@vger.kernel.org (3.12)
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Shaohua Li <shli@kernel.org>
    • md/raid5: Before freeing old multi-thread worker, it should flush them. · d206dcfa
      Committed by majianpeng
      When changing group_thread_cnt via the sysfs entry, the kernel can oops.
      
      The kernel messages are:
      [  740.961389] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [  740.961444] IP: [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.961476] PGD b9013067 PUD b651e067 PMD 0
      [  740.961503] Oops: 0000 [#1] SMP
      [  740.961525] Modules linked in: netconsole e1000e ptp pps_core
      [  740.961577] CPU: 0 PID: 3683 Comm: kworker/u8:5 Not tainted 3.12.0+ #23
      [  740.961602] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
      [  740.961646] task: ffff88013abe0000 ti: ffff88013a246000 task.ti: ffff88013a246000
      [  740.961673] RIP: 0010:[<ffffffff81062570>]  [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.961708] RSP: 0018:ffff88013a247e08  EFLAGS: 00010086
      [  740.961730] RAX: ffff8800b912b400 RBX: ffff88013a61e680 RCX: ffff8800b912b400
      [  740.961757] RDX: ffff8800b912b600 RSI: ffff8800b912b600 RDI: ffff88013a61e680
      [  740.961782] RBP: ffff88013a247e48 R08: ffff88013a246000 R09: 000000000002c09d
      [  740.961808] R10: 000000000000010f R11: 0000000000000000 R12: ffff88013b00cc00
      [  740.961833] R13: 0000000000000000 R14: ffff88013b00cf80 R15: ffff88013a61e6b0
      [  740.961861] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
      [  740.961893] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  740.962001] CR2: 00000000000000b8 CR3: 00000000b24fe000 CR4: 00000000000407f0
      [  740.962001] Stack:
      [  740.962001]  0000000000000008 ffff8800b912b600 ffff88013b00cc00 ffff88013a61e680
      [  740.962001]  ffff88013b00cc00 ffff88013b00cc18 ffff88013b00cf80 ffff88013a61e6b0
      [  740.962001]  ffff88013a247eb8 ffffffff810639c6 0000000000012a80 ffff88013a247fd8
      [  740.962001] Call Trace:
      [  740.962001]  [<ffffffff810639c6>] worker_thread+0x206/0x3f0
      [  740.962001]  [<ffffffff810637c0>] ? manage_workers+0x2c0/0x2c0
      [  740.962001]  [<ffffffff81069656>] kthread+0xc6/0xd0
      [  740.962001]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  740.962001]  [<ffffffff81722ffc>] ret_from_fork+0x7c/0xb0
      [  740.962001]  [<ffffffff81069590>] ? kthread_freezable_should_stop+0x70/0x70
      [  740.962001] Code: 89 e5 41 57 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 18 48 8b 06 4c 8b 67 48 48 89 c1 30 c9 a8 04 4c 0f 45 e9 80 7f 58 00 <49> 8b 45 08 44 8b b0 00 01 00 00 78 0c 41 f6 44 24 10 04 0f 84
      [  740.962001] RIP  [<ffffffff81062570>] process_one_work+0x30/0x500
      [  740.962001]  RSP <ffff88013a247e08>
      [  740.962001] CR2: 0000000000000008
      [  740.962001] ---[ end trace 39181460000748de ]---
      [  740.962001] Kernel panic - not syncing: Fatal exception
      
      This can happen if there are some stripes left, fewer than MAX_STRIPE_BATCH.
      A worker is queued to handle them.
      But before raid5_do_work() is called, raid5d handles those
      stripes, making conf->active_stripes reach 0,
      so mddev_suspend() can return.
      We might then free the old worker resources before the queued
      raid5_do_work() has handled them.  When it runs, it crashes.
      
      	raid5d()		raid5_store_group_thread_cnt()
      	queue_work		mddev_suspend()
      				handle_strips
      				active_stripe=0
      				free(old worker resources)
      	process_one_work
      	raid5_do_work
      
      To avoid this, we should flush the workqueue before freeing the old
      worker resources.
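
      A sketch of the shape of the fix (assuming the raid5 workqueue that runs
      raid5_do_work() is the 'raid5_wq' referenced here):

      	mddev_suspend(mddev);
      	/* drain any queued raid5_do_work() before freeing the old
      	 * worker groups */
      	flush_workqueue(raid5_wq);
      	/* ... free the old worker resources, install the new ones ... */
      	mddev_resume(mddev);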
      
      This fixes a bug introduced in 3.12 so is suitable for the 3.12.x
      stable series.
      
      Cc: stable@vger.kernel.org (3.12)
      Fixes: b721420e
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Reviewed-by: Shaohua Li <shli@kernel.org>
    • md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE. · e59aa23f
      Committed by majianpeng
      For R5_ReadNoMerge, it means this bio can't be merged with other bios
      or requests.  It used REQ_FLUSH to achieve this, but REQ_NOMERGE can do
      the same job.
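
      A sketch of the substitution (the bio field name for kernels of that era
      is an assumption):

      	if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags))
      		bi->bi_rw |= REQ_NOMERGE;	/* was: REQ_FLUSH */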
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread. · c91abf5a
      Committed by NeilBrown
      We currently use kthread_should_stop() in various places in the
      sync/reshape code to abort early.
      However some places set MD_RECOVERY_INTR but don't immediately call
      md_reap_sync_thread() (and we will shortly get another one).
      When this happens we are relying on md_check_recovery() to reap the
      thread, and that only happens when the thread finishes normally.
      So MD_RECOVERY_INTR must lead to a normal finish without the
      kthread_should_stop() test.
      
      So replace all relevant tests, and be more careful when the thread is
      interrupted, not to acknowledge the latest step in a reshape, as it may
      not be fully committed yet.
      
      Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop
      so we don't have to wait for the speed to drop before we can abort.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: Retry R5_ReadNoMerge flag when hit a read error. · edfa1f65
      Committed by Bian Yu
      Because of block-layer merging, if one bio fails, other bios that
      belong to the same request fail too, so raid5_end_read_request will
      record all these bios as bad blocks.
      If we retry the request with the R5_ReadNoMerge flag to avoid bio
      merging, the bad-block log records only the sectors that are actually
      bad.
      
      test:
      hdparm --yes-i-know-what-i-am-doing --make-bad-sector 300000 /dev/sdb
      mdadm -C /dev/md0 -l5 -n3 /dev/sd[bcd] --assume-clean
      mdadm /dev/md0 -f /dev/sdd
      mdadm /dev/md0 -r /dev/sdd
      mdadm --zero-superblock /dev/sdd
      mdadm /dev/md0 -a /dev/sdd
      
      1. Without this patch:
      cat /sys/block/md0/md/rd*/bad_blocks
      299776 256
      299776 256
      
      2. With this patch:
      cat /sys/block/md0/md/rd*/bad_blocks
      300000 8
      300000 8
      Signed-off-by: Bian Yu <bianyu@kedacom.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: relieve lock contention in get_active_stripe() · 4bda556a
      Committed by Shaohua Li
      Track the count of empty inactive lists, so md_raid5_congested() can
      use it to make decisions.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  17. 15 Nov 2013, 1 commit
  18. 14 Nov 2013, 3 commits
    • raid5: relieve lock contention in get_active_stripe() · 566c09c5
      Committed by Shaohua Li
      get_active_stripe() is the last place we have lock contention. It has two
      paths.  One is when the stripe isn't found and a new stripe is
      allocated; the other is when the stripe is found.
      
      The first path basically calls __find_stripe and init_stripe. It accesses
      conf->generation, conf->previous_raid_disks, conf->raid_disks,
      conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
      conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list.
      Except for stripe_hashtbl and inactive_list, the other fields change
      very rarely.
      
      With this patch, we split the inactive_list and add new hash locks.
      Each free stripe belongs to a specific inactive list, determined by the
      stripe's lock_hash.  Note that even a stripe without an assigned sector
      has a lock_hash assigned.  A stripe's inactive list is protected by a
      hash lock, which is determined by its lock_hash too.  The lock_hash is
      derived from the current stripe_hashtbl hash, which guarantees any
      stripe_hashtbl list will be assigned to a specific lock_hash, so we can
      use the new hash locks to protect the stripe_hashtbl lists too.  The
      goal of the new hash locks is that the first path of
      get_active_stripe() needs only these locks.  Since we have several hash
      locks, lock contention is relieved significantly.
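
      A sketch of how a lock_hash can be derived from the existing
      stripe_hashtbl hash (the constants and the exact expression are
      assumptions):

      	#define NR_STRIPE_HASH_LOCKS	8
      	#define STRIPE_HASH_LOCKS_MASK	(NR_STRIPE_HASH_LOCKS - 1)

      	/* reuse the low bits of the stripe hash, so one hash lock covers
      	 * both a slice of stripe_hashtbl and the matching inactive list */
      	int hash = (int)(sector >> STRIPE_SHIFT) & STRIPE_HASH_LOCKS_MASK;
      	spin_lock_irq(conf->hash_locks + hash);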
      
      The first path of get_active_stripe() also accesses the other fields;
      since they change rarely, changing them now requires taking
      conf->device_lock and all the hash locks.  For a slow path, this isn't
      a problem.
      
      If we need to take both device_lock and a hash lock, we always take the
      hash lock first.  The tricky part is release_stripe and friends, which
      need to take device_lock first.  Neil's suggestion is to put inactive
      stripes on a temporary list and re-add them to the inactive_list after
      device_lock is released.  In this way we add stripes to the temporary
      list with device_lock held and remove stripes from that list with the
      hash lock held.  So we don't allow concurrent access to the temporary
      list, which means we need to allocate a temporary list for every
      participant of release_stripe.
      
      One downside is that free stripes are maintained in their own inactive
      list and can't move between the lists.  By default we have 256 stripes
      in total and 8 lists, so each list will have 32 stripes.  It's possible
      one list has a free stripe while another hasn't.  The chance should be
      rare because stripe allocation is evenly distributed.  And we can
      always allocate more stripes for the cache; several megabytes of memory
      isn't a big deal.
      
      This completely removes the lock contention of the first path of
      get_active_stripe().  It slows down the second code path a little bit
      though, because we now need to take two locks, but since the hash lock
      isn't contended, the overhead should be quite small (several atomic
      instructions).  The second path of get_active_stripe() (basically
      sequential writes or big-request-size random writes) still has lock
      contention.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5.c: add proper locking to error path of raid5_start_reshape. · ba8805b9
      Committed by NeilBrown
      If raid5_start_reshape errors out, we need to reset all the fields
      that were updated (not just some), and need to use the seq_counter
      to ensure make_request() doesn't use an inconsistent state.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: Use slow_path to release stripe when mddev->thread is null · ad4068de
      Committed by majianpeng
      When release_stripe() is called from grow_one_stripe(), mddev->thread
      is NULL, so the wakeup of the thread that would release the stripe is
      omitted.  In this case, use the slow path to release the stripe.
      
      Bug was introduced in 3.12
      
      Cc: stable@vger.kernel.org (3.12+)
      Fixes: 773ca82f
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  19. 24 Oct 2013, 2 commits
  20. 02 Sep 2013, 1 commit
    • raid5: only wakeup necessary threads · bfc90cb0
      Committed by Shaohua Li
      If there are not enough stripes to handle, we'd better not always
      queue all available work_structs.  If one worker can only handle a few
      or even no stripes, it will hurt request merging and create lock
      contention.
      
      With this patch, the number of running work_structs will depend on the
      number of pending stripes.  Note: some statistics used in the patch are
      accessed without locking protection.  This shouldn't matter; we just
      try our best to avoid queuing unnecessary work_structs.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  21. 28 Aug 2013, 5 commits
    • md/raid5: flush out all pending requests before proceeding with reshape. · 4d77e3ba
      Committed by NeilBrown
      Some requests - particularly 'discard' and 'read' - are handled
      differently depending on whether a reshape is active or not.

      It is harmless to assume reshape is active if it isn't, but wrong
      to act as though reshape is not active when it is.
      
      So when we start reshape - after making clear to all requests that
      reshape has started - use mddev_suspend/mddev_resume to flush out all
      requests.  This will ensure that no requests will be assuming the
      absence of reshape once it really starts.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: use seqcount to protect access to shape in make_request. · c46501b2
      Committed by NeilBrown
      make_request() accesses various shape parameters (raid_disks,
      chunk_size, etc.) which might be changed by raid5_start_reshape().

      If the latter is called at an awkward time during the former, the wrong
      stripe_head might be used.
      
      So introduce a 'seqcount' and, after finding a stripe_head, make sure
      there is no reason to expect that we got the wrong one.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: sysfs entry to control worker thread number · b721420e
      Committed by Shaohua Li
      Add a sysfs entry to control the number of running workqueue threads.
      If group_thread_cnt is set to 0, workqueue offloading of stripe
      handling is disabled.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: offload stripe handle to workqueue · 851c30c9
      Committed by Shaohua Li
      This is another attempt to create multiple threads to handle raid5 stripes.
      This time I use a workqueue.
      
      raid5 handles requests (especially writes) in stripe units.  A stripe is
      page-size aligned and spans all disks.  On a write to any disk sector,
      raid5 runs a state machine for the corresponding stripe, which includes
      reading some disks of the stripe, calculating parity, and writing some
      disks of the stripe.  The state machine currently runs in the raid5d
      thread.  Since there is only one thread, it doesn't scale well for
      high-speed storage.  An obvious solution is multi-threading.
      
      To get better performance, we have some requirements:
      a. Locality.  A stripe corresponding to a request submitted from one cpu
      is better handled by a thread on the local cpu or local node.  The local
      cpu is preferred, but can sometimes be a bottleneck, for example when
      parity calculation is too heavy; running on the local node is more
      widely adaptable.
      b. Configurability.  Different raid5 array setups might need different
      configurations, especially the thread number.  More threads don't always
      mean better performance, because of lock contention.
      
      My original implementation created some kernel threads, with interfaces
      to control which cpu's stripes each thread should handle, and userspace
      could set the affinity of the threads.  This provides the biggest
      flexibility and configurability, but it's hard to use, and apparently a
      new thread-pool implementation is disfavored.
      
      Recent workqueue improvements are quite promising.  An unbound workqueue
      will be bound to a NUMA node.  If WQ_SYSFS is set on the workqueue,
      there are sysfs options for affinity settings; for example, we can
      include only one HT sibling in the affinity.  Work is non-reentrant by
      default, and we can control the number of running threads by limiting
      the number of dispatched work_structs.
      
      In this patch, I created several stripe worker groups.  A group
      corresponds to a NUMA node: stripes from cpus of one node will be added
      to that group's list, and workqueue threads of one node will only handle
      stripes of that node's worker group (see the sketch below).  In this
      way, stripe handling has NUMA-node locality.  And as I said, we can
      control the thread number by limiting the number of dispatched
      work_structs.
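
      A sketch of such a per-node worker group (field names are assumptions
      consistent with the description):

      	/* one group per NUMA node: stripes queued by cpus of that node,
      	 * handled only by workqueue threads running on that node */
      	struct r5worker_group {
      		struct list_head handle_list;	/* stripes for this node */
      		struct r5conf *conf;
      		struct r5worker *workers;	/* per-group work_structs */
      		int stripes_cnt;
      	};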
      
      The work_struct callback function handles several stripes in one run.
      Typical workqueue usage is to run one unit per work_struct; in the
      raid5 case the unit would be a stripe.  But we can't do that:
      a. Though handling a stripe doesn't need a lock (because of reference
      accounting, and the stripe isn't on any list), queuing a work_struct
      for each stripe would make the workqueue lock very heavily contended.
      b. blk_start_plug()/blk_finish_plug() should surround stripe handling,
      as we might dispatch requests.  If each work_struct only handled one
      stripe, such block plugging would be meaningless.
      
      This implementation can't do very fine-grained configuration, but NUMA
      binding is the most popular usage model and should be enough for most
      workloads.
      
      Note: since we have only one stripe queue, switching to multi-threading
      might decrease the request size dispatched down to the low-level layer.
      The impact depends on the thread number, raid configuration, and
      workload, so multi-threaded raid5 might not be appropriate for all
      setups.
      
      Changes V1 -> V2:
      1. remove WQ_NON_REENTRANT
      2. disabling multi-threading by default
      3. Add more descriptions in changelog
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: fix stripe release order · d265d9dc
      Committed by Shaohua Li
      The patch "make release_stripe lockless" changed the order in which
      stripes are released.  Originally I thought the block layer could take
      care of request merging, but it appears some requests are still not
      merged.  It's easy to fix the order.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>