1. 22 Apr, 2015 5 commits
    • RAID5: batch adjacent full stripe write · 59fc630b
      shli@kernel.org committed
      The stripe cache works in 4k units, so even adjacent full stripe writes are
      handled 4k at a time. Ideally we should use a bigger size for adjacent full
      stripe writes. A bigger stripe cache size means fewer stripes running in the
      state machine, which reduces CPU overhead. A bigger size also lets bigger
      IOs be dispatched to the underlying disks.
      
      With this patch, we automatically batch adjacent full stripe writes
      together. Such stripes are added to a batch list. Only the first stripe
      of the list is put on handle_list and so runs handle_stripe(). Some steps
      of handle_stripe() are extended to cover all stripes of the list, including
      ops_run_io, ops_run_biodrain and so on. With this patch, fewer stripes run
      in handle_stripe() and the IO of the whole stripe list is sent together to
      increase the IO size.
      
      Stripes added to a batch list have some limitations. A batch list can only
      include full stripe writes and can't cross a chunk boundary, which makes sure
      the stripes have the same parity disks. Stripes in a batch list must be in the
      same state (no written, toread and so on). If a stripe is in a batch list, all
      new reads/writes passed to add_stripe_bio will be blocked as an overlap
      conflict until the batch list is handled. These limitations make sure the
      stripes in a batch list stay in exactly the same state throughout their life
      cycle.
      
      I tested 160k randwrite on a RAID5 array with a 32k chunk size and 6
      PCIe SSDs. This patch improves performance by around 30%, and the IO size
      sent to the underlying disks is exactly 32k. I also ran a 4k randwrite test
      on the same array to make sure performance isn't changed by the patch.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
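The batching constraints described in this commit can be sketched as a small predicate. This is a simplified userspace illustration, not the kernel code: `struct stripe`, `can_batch()` and the field names are hypothetical, and sectors are assumed to be 512 bytes so that one 4k stripe unit spans 8 sectors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STRIPE_SECTORS 8u /* assumed: one 4k stripe unit = 8 sectors of 512 bytes */

/* Hypothetical, simplified view of a stripe (not the kernel's struct stripe_head). */
struct stripe {
    uint64_t sector;          /* first sector of the stripe on each member disk */
    unsigned overwrite_disks; /* data disks fully overwritten by pending writes */
    bool has_toread;          /* a read is queued on the stripe */
    bool has_written;         /* a write is already in flight on the stripe */
};

/* A stripe qualifies only if every data disk is fully overwritten and
 * nothing else is pending on it. */
static bool is_full_stripe_write(const struct stripe *sh, unsigned data_disks)
{
    return sh->overwrite_disks == data_disks &&
           !sh->has_toread && !sh->has_written;
}

/* Can `next` join the batch list right behind `prev`? */
bool can_batch(const struct stripe *prev, const struct stripe *next,
               unsigned data_disks, unsigned chunk_sectors)
{
    assert(data_disks > 0 && chunk_sectors >= STRIPE_SECTORS);

    if (!is_full_stripe_write(prev, data_disks) ||
        !is_full_stripe_write(next, data_disks))
        return false;

    /* The two stripes must be adjacent. */
    if (next->sector != prev->sector + STRIPE_SECTORS)
        return false;

    /* Staying inside one chunk keeps the parity disk the same. */
    return prev->sector / chunk_sectors == next->sector / chunk_sectors;
}
```

With 5 data disks and a 32k chunk (64 sectors), stripes at sectors 0 and 8 would be batched in this sketch, while stripes at sectors 56 and 64 would not, because they fall in different chunks.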
    • raid5: track overwrite disk count · 7a87f434
      shli@kernel.org committed
      Track overwrite disk count, so we can know if a stripe is a full stripe write.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: add a new flag to track if a stripe can be batched · da41ba65
      shli@kernel.org committed
      A fresh stripe with a write request can be batched. Any time the stripe is
      handled or a new read is queued, the flag is cleared.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: use flex_array for scribble data · 46d5b785
      shli@kernel.org committed
      Use flex_array for scribble data. The next patch will batch several stripes
      together, so scribble data must be able to cover several stripes. This
      patch therefore also allocates scribble data for stripes across a chunk.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: remove 'go_faster' option from ->sync_request() · 09314799
      NeilBrown committed
      This option is not well justified and testing suggests that
      it hardly ever makes any difference.
      
      The comment suggests there might be a need to wait for non-resync
      activity indicated by ->nr_waiting, however raise_barrier()
      already waits for all of that.
      
      So just remove it to simplify reasoning about speed limiting.
      
      This allows us to remove a 'FIXME' comment from raid5.c as that
      never used the flag.
      Signed-off-by: NeilBrown <neilb@suse.de>
  2. 25 Feb, 2015 1 commit
    • raid5: check faulty flag for array status during recovery. · 16d9cfab
      Eric Mei committed
      When we have more than one drive failure, it's possible that we start
      rebuilding one drive while leaving another faulty drive in the array.
      To determine whether the array will be optimal after rebuilding, the current
      code only checks whether a drive is missing, which could potentially
      lead to data corruption. This patch adds a check of the Faulty flag.
      Signed-off-by: NeilBrown <neilb@suse.de>
  3. 18 Feb, 2015 1 commit
  4. 06 Feb, 2015 2 commits
    • md: make reconfig_mutex optional for writes to md sysfs files. · 6791875e
      NeilBrown committed
      Rather than using mddev_lock() to take the reconfig_mutex
      when writing to any md sysfs file, we only take mddev_lock()
      in the particular _store() functions that require it.
      Admittedly this is most, but it isn't all.
      
      This also allows us to remove special-case handling for new_dev_store
      (in md_attr_store).
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: use ->lock to protect accessing raid5 sysfs attributes. · 7b1485ba
      NeilBrown committed
      It is important that mddev->private isn't freed while
      a sysfs attribute function is accessing it.
      
      So use mddev->lock to protect the setting of ->private to NULL, and
      take that lock when checking ->private for NULL and de-referencing it
      in the sysfs access functions.
      
      This only applies to the read ('show') side of access.  Write
      access will be handled separately.
      Signed-off-by: NeilBrown <neilb@suse.de>
  5. 04 Feb, 2015 9 commits
    • md: rename ->stop to ->free · afa0f557
      NeilBrown committed
      Now that the ->stop function only frees the private data,
      rename it accordingly.
      
      Also pass in the private pointer as an arg rather than using
      mddev->private.  This flexibility will be useful in level_store().
      
      Finally, don't clear ->private.  It doesn't make sense to clear
      it seeing that it isn't what we free, and it is no longer necessary
      to clear ->private (it was some time ago, before ->to_remove was
      introduced).
      
      Setting ->to_remove in ->free() is a bit of a wart, but not a
      big problem at the moment.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: split detach operation out from ->stop. · 5aa61f42
      NeilBrown committed
      Each md personality has a 'stop' operation which does two
      things:
       1/ it finalizes some aspects of the array to ensure nothing
          is accessing the ->private data
       2/ it frees the ->private data.
      
      All the steps in '1' can apply to all arrays and so can be
      performed in common code.
      
      This is useful as in the case where we change the personality which
      manages an array (in level_store()), it would be helpful to do
      step 1 early, and step 2 later.
      
      So split the 'step 1' functionality out into a new mddev_detach().
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make merge_bvec_fn more robust in face of personality changes. · 64590f45
      NeilBrown committed
      There is no locking around calls to merge_bvec_fn(), so
      it is possible that calls which coincide with a level (or personality)
      change could go wrong.
      
      So create a central dispatch point for these functions and use
      rcu_read_lock().
      If the array is suspended, reject any merge that can be rejected.
      If not, we know it is safe to call the function.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: make ->congested robust against personality changes. · 5c675f83
      NeilBrown committed
      There is currently no locking around calls to the 'congested'
      bdi function.  If called at an awkward time while an array is
      being converted from one level (or personality) to another, there
      is a tiny chance of running code in an unreferenced module etc.
      
      So add a 'congested' function to the md_personality operations
      structure, and call it with appropriate locking from a central
      'mddev_congested'.
      
      When the array personality is changing the array will be 'suspended'
      so no IO is processed.
      If mddev_congested detects this, it simply reports that the
      array is congested, which is a safe guess.
      As mddev_suspend calls synchronize_rcu(), mddev_congested can
      avoid races by including the whole call inside an rcu_read_lock()
      region.
      This requires that the congested functions for all subordinate devices
      can be run under rcu_read_lock().  Fortunately this is the case.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: need_this_block: tidy/fix last condition. · ea664c82
      NeilBrown committed
      That last condition is unclear and over cautious.
      
      There are two related issues here.
      
      If a partial write is destined for a missing device, then
      either RMW or RCW can work.  We must read all the available
      blocks.  Only then can the missing blocks be calculated, and
      then the parity update performed.
      
      If RMW is not an option, then there is a complication even
      without partial writes.  If we would need to read a missing
      device to perform the reconstruction, then we must first read every
      block so the missing device data can be computed.
      This is the case for RAID6 (which currently does not support
      RMW) and for times when we don't trust the parity (after a crash)
      and so are in the process of resyncing it.
      
      So make these two cases more clear and separate, and perform
      the relevant tests more thoroughly.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: need_this_block: start simplifying the last two conditions. · a9d56950
      NeilBrown committed
      Both the last two cases are only relevant if something has failed and
      something needs to be written (but not over-written), and if it is OK
      to pre-read blocks at this point.  So factor out those tests and
      explain them.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: separate out the easy conditions in need_this_block. · a79cfe12
      NeilBrown committed
      Some of the conditions in need_this_block have very
      straightforward motivation.  Separate those out and document them.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: separate large if clause out of fetch_block(). · 2c58f06e
      NeilBrown committed
      fetch_block() has a very large and hard to read 'if' condition.
      
      Separate it into its own function so that it can be
      made more readable.
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md: do_release_stripe(): No need to call md_wakeup_thread() twice · ad3ab8b6
      Jes Sorensen committed
      67f45548 introduced a call to
      md_wakeup_thread() when adding to the delayed_list. However the md
      thread is woken up unconditionally just below.
      
      Remove the unnecessary wakeup call.
      Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  6. 02 Feb, 2015 1 commit
    • md/raid5: fix another livelock caused by non-aligned writes. · b1b02fe9
      NeilBrown committed
      If a non-page-aligned write is destined for a device which
      is missing/faulty, we can deadlock.
      
      As the target device is missing, a read-modify-write cycle
      is not possible.
      As the write is not for a full-page, a reconstruct-write cycle
      is not possible.
      
      This should be handled by logic in fetch_block() which notices
      there is a non-R5_OVERWRITE write to a missing device, and so
      loads all blocks.
      
      However since commit 67f45548, that code requires
      STRIPE_PREREAD_ACTIVE before it will activate, and those circumstances
      never set STRIPE_PREREAD_ACTIVE.
      
      So: in handle_stripe_dirtying, if neither rmw nor rcw was possible,
      set STRIPE_DELAYED, which will cause STRIPE_PREREAD_ACTIVE to be set
      after a suitable delay.
      
      Fixes: 67f45548
      Cc: stable@vger.kernel.org (v3.16+)
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Tested-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  7. 03 Dec, 2014 1 commit
    • md/raid5: fetch_block must fetch all the blocks handle_stripe_dirtying wants. · 108cef3a
      NeilBrown committed
      It is critical that fetch_block() and handle_stripe_dirtying()
      are consistent in their analysis of what needs to be loaded.
      Otherwise raid5 can wait forever for a block that won't be loaded.
      
      Currently when writing to a RAID5 that is resyncing, to a location
      beyond the resync offset, handle_stripe_dirtying chooses a
      reconstruct-write cycle, but fetch_block() assumes a
      read-modify-write, and a lockup can happen.
      
      So treat that case just like RAID6, just as we do in
      handle_stripe_dirtying.  RAID6 always does reconstruct-write.
      
      This bug was introduced when the behaviour of handle_stripe_dirtying
      was changed in 3.7, so the patch is suitable for any kernel since,
      though it will need careful merging for some versions.
      
      Cc: stable@vger.kernel.org (v3.7+)
      Fixes: a7854487
      Reported-by: Henry Cai <henryplusplus@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
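One way to keep the two functions consistent is to derive the decision from a single predicate that both consult. The sketch below is a userspace illustration under that assumption; `struct write_ctx` and `must_reconstruct_write()` are hypothetical names, not kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the inputs both code paths would look at. */
struct write_ctx {
    int max_degraded;        /* 1 for RAID4/5, 2 for RAID6 */
    bool resync_in_progress; /* parity is not yet trusted everywhere */
    uint64_t resync_offset;  /* sectors below this are already in sync */
    uint64_t stripe_sector;  /* location of the stripe being written */
};

/* True when a reconstruct-write is the only acceptable strategy, so the
 * block-fetching side must load every block that strategy needs. */
bool must_reconstruct_write(const struct write_ctx *c)
{
    assert(c->max_degraded == 1 || c->max_degraded == 2);

    /* RAID6 always does reconstruct-write, per the commit message. */
    if (c->max_degraded == 2)
        return true;

    /* Beyond the resync offset the parity cannot be trusted, so a
     * read-modify-write would build on stale parity. */
    return c->resync_in_progress && c->stripe_sector >= c->resync_offset;
}
```

If both the "which blocks do I fetch" path and the "which write strategy do I pick" path call the same predicate, they cannot disagree about the case described in this commit.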
  8. 14 Oct, 2014 2 commits
  9. 09 Oct, 2014 1 commit
  10. 02 Oct, 2014 1 commit
    • md/raid5: disable 'DISCARD' by default due to safety concerns. · 8e0e99ba
      NeilBrown committed
      It has come to my attention (thanks Martin) that 'discard_zeroes_data'
      is only a hint.  Some devices in some cases don't do what it
      says on the label.
      
      The use of DISCARD in RAID5 depends on reads from discarded regions
      being predictably zero.  If a write to a previously discarded region
      performs a read-modify-write cycle it assumes that the parity block
      was consistent with the data blocks.  If all were zero, this would
      be the case.  If some are and some aren't this would not be the case.
      This could lead to data corruption after a device failure when
      data needs to be reconstructed from the parity.
      
      As we cannot trust 'discard_zeroes_data', ignore it by default
      and so disallow DISCARD on all raid4/5/6 arrays.
      
      As many devices are trustworthy, and as there are benefits to using
      DISCARD, add a module parameter to over-ride this caution and cause
      DISCARD to work if discard_zeroes_data is set.
      
      If a site wants to enable DISCARD on some arrays but not on others they
      should select DISCARD support at the filesystem level, and set the
      raid456 module parameter.
          raid456.devices_handle_discard_safely=Y
      
      As this is a data-safety issue, I believe this patch is suitable for
      -stable.
      DISCARD support for RAID456 was added in 3.7
      
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Heinz Mauelshagen <heinzm@redhat.com>
      Cc: stable@vger.kernel.org (3.7+)
      Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Fixes: 620125f2
      Signed-off-by: NeilBrown <neilb@suse.de>
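The corruption scenario can be worked through with single-byte blocks and XOR parity. This is an illustrative sketch of a stripe with two data blocks (D0, D1) and one parity block (P); the function names are hypothetical and nothing here is taken from the raid456 code.

```c
#include <stdint.h>

/* Read-modify-write parity update: P' = P ^ D_old ^ D_new. */
uint8_t rmw_parity(uint8_t p_old, uint8_t d_old, uint8_t d_new)
{
    return (uint8_t)(p_old ^ d_old ^ d_new);
}

/* Rebuild the missing data block of a two-data-block stripe. */
uint8_t rebuild(uint8_t parity, uint8_t surviving_data)
{
    return (uint8_t)(parity ^ surviving_data);
}

/* Scenario: the stripe is discarded, D1 and P read back as zero, and D0
 * reads back as `d0_after_discard`. D1 is then written via RMW, the D1
 * device is lost, and D1 is rebuilt from P and D0.
 * Returns the rebuilt value of D1. */
uint8_t rebuilt_d1_after_discard(uint8_t d0_after_discard, uint8_t d1_new)
{
    uint8_t d1_old = 0, p_old = 0;
    uint8_t p_new = rmw_parity(p_old, d1_old, d1_new);

    return rebuild(p_new, d0_after_discard);
}
```

When D0 really reads as zero after the discard, the rebuilt D1 equals what was written. When D0 reads as anything else (for example 0x5A), the rebuilt D1 differs from the written value (0xAB comes back as 0xF1), which is the data corruption the commit message warns about.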
  11. 18 Aug, 2014 2 commits
    • md/raid6: avoid data corruption during recovery of double-degraded RAID6 · 9c4bdf69
      NeilBrown committed
      During recovery of a double-degraded RAID6 it is possible for
      some blocks not to be recovered properly, leading to corruption.
      
      If a write happens to one block in a stripe that would be written to a
      missing device, and at the same time that stripe is recovering data
      to the other missing device, then that recovered data may not be written.
      
      This patch skips, in the double-degraded case, an optimisation that is
      only safe for single-degraded arrays.
      
      Bug was introduced in 2.6.32 and fix is suitable for any kernel since
      then.  In an older kernel with separate handle_stripe5() and
      handle_stripe6() functions the patch must change handle_stripe6().
      
      Cc: stable@vger.kernel.org (2.6.32+)
      Fixes: 6c0069c0
      Cc: Yuri Tikhonov <yur@emcraft.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Reported-by: "Manibalan P" <pmanibalan@amiindia.co.in>
      Tested-by: "Manibalan P" <pmanibalan@amiindia.co.in>
      Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1090423
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
    • md/raid5: avoid livelock caused by non-aligned writes. · a40687ff
      NeilBrown committed
      If a stripe in a raid6 array receives a write to each data block while
      the array is degraded, and if any of these writes to a missing device
      are not page-aligned, then a live-lock happens.
      
      In this case the P and Q blocks need to be read so that the part of
      the missing block which is *not* being updated by the write can be
      constructed.  Due to a logic error, these blocks are not loaded, so
      the update cannot proceed and the stripe is 'handled' repeatedly in an
      infinite loop.
      
      This bug is unlikely as most writes are page aligned.  However as it
      can lead to a livelock it is suitable for -stable.  It was introduced
      in 3.16.
      
      Cc: stable@vger.kernel.org (v3.16)
      Fixed: 67f45548
      Signed-off-by: NeilBrown <neilb@suse.de>
  12. 10 Jun, 2014 1 commit
    • raid5: speedup sync_request processing · 053f5b65
      Eivind Sarto committed
      The raid5 sync_request() processing calls handle_stripe() within the context of
      the resync-thread.  The resync-thread issues the first set of read requests
      and this adds execution latency and slows down the scheduling of the next
      sync_request().
      The current rebuild/resync speed of raid5 is not much faster than what
      rotational HDDs can sustain.
      Testing the following patch on a 6-drive array, I can increase the rebuild
      speed from 100 MB/s to 175 MB/s.
      The sync_request() now just sets STRIPE_HANDLE and releases the stripe.  This
      creates some more parallelism between the resync-thread and raid5 kernel daemon.
      Signed-off-by: Eivind Sarto <esarto@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  13. 05 Jun, 2014 1 commit
    • md/raid5: deadlock between retry_aligned_read with barrier io · 2844dc32
      hui jiao committed
      A chunk aligned read increases the counter active_aligned_reads and
      decreases it after the sub-device handles it successfully. But when a read
      error occurs, the read is redispatched by raid5d, and
      active_aligned_reads will not be decreased until we can grab a stripe
      head in retry_aligned_read. Now suppose a barrier io comes, sets
      conf->quiesce to 2, and waits until both active_stripes and
      active_aligned_reads are zero. The retried chunk aligned read gets
      stuck at get_active_stripe waiting until conf->quiesce becomes 0.
      retry_aligned_read and the barrier io are now waiting for each other.
      One possible solution is to ignore conf->quiesce and let the retried
      aligned read finish. I reproduced this deadlock and tested this patch on
      CentOS 6.0.
      Signed-off-by: NeilBrown <neilb@suse.de>
  14. 29 May, 2014 3 commits
    • raid5: add an option to avoid copy data from bio to stripe cache · d592a996
      Shaohua Li committed
      The stripe cache has two goals:
      1. cache data, so next time if data can be found in stripe cache, disk access
      can be avoided.
      2. stable data. data is copied from the bio to the stripe cache and parity is
      calculated from that copy. data written to disk comes from the stripe cache,
      so if the upper layer changes the bio data, data written to disk isn't impacted.
      
      In my environment, I can guarantee the upper layer will not change bio data
      (the case 2 guards against), and BDI_CAP_STABLE_WRITES can guarantee that too.
      Goal 1 is not common either: the block plug mechanism will dispatch a bunch of
      sequential small requests together, and since I'm using SSDs, I'm using a small
      chunk size. It's rare that the stripe cache is really useful.
      
      So I'd like to avoid the copy from bio to stripe cache, which is very helpful
      for performance. In my 1M randwrite tests, avoiding the copy can increase
      performance by more than 30%.
      
      Of course, this shouldn't be enabled by default. It has been reported that
      enabling BDI_CAP_STABLE_WRITES can harm some workloads, so I added an option
      to control it.
      
      Neilb:
        changed BUG_ON to WARN_ON
        Removed some assignments from raid5_build_block which are now not needed.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
    • raid5: avoid release list until last reference of the stripe · cf170f3f
      Eivind Sarto committed
      The (lockless) release_list reduces lock contention, but there is excessive
      queueing and dequeuing of stripes on this list.  A stripe will currently be
      queued on the release_list with a stripe reference count > 1.  This can cause
      the raid5 kernel thread(s) to dequeue the stripe and decrement the refcount
      without doing any other useful processing of the stripe.  There are two cases
      when the stripe can be put on the release_list multiple times before it is
      actually handled by the kernel thread(s).
      1) make_request() activates the stripe processing in 4k increments.  When a
         write request is large enough to span multiple chunks of a stripe_head, the
         first 4k chunk adds the stripe to the plug list.  The next 4k chunk that is
         processed for the same stripe puts the stripe on the release_list with a
         refcount=2.  This can cause the kernel thread to process and decrement the
         stripe before the stripe is unplugged, which again will put it back on the
         release_list.
      2) Whenever IO is scheduled on a stripe (pre-read and/or write), the stripe
         refcount is set to the number of active IO (for each chunk).  The stripe is
         released as each IO completes, and can be queued and dequeued multiple times
         on the release_list, until its refcount finally reaches zero.
      
      This simple patch will ensure a stripe is only queued on the release_list when
      its refcount=1 and is ready to be handled by the kernel thread(s).  I added some
      instrumentation to raid5 and counted the number of times stripes were queued on
      the release_list for a variety of write IO sizes.  Without this patch the number
      of times stripes got queued on the release_list was 100-500% higher than with
      the patch.  The excess queuing will increase with the IO size.  The patch also
      improved throughput by 5-10%.
      Signed-off-by: Eivind Sarto <esarto@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
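A release path with this property can be sketched in userspace with C11 atomics. The helper below mimics the semantics of the kernel's atomic_add_unless(); whether the patch itself is written this way is an assumption, and `release_needs_queue()` is a hypothetical name.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for atomic_add_unless(): add `a` to *v unless *v == u.
 * Returns true if the addition happened. */
static bool add_unless(atomic_int *v, int a, int u)
{
    int cur = atomic_load(v);

    while (cur != u) {
        /* On failure, `cur` is refreshed with the current value. */
        if (atomic_compare_exchange_weak(v, &cur, cur + a))
            return true;
    }
    return false;
}

/* Drop one reference. Returns true only when the caller holds the last
 * reference, in which case the count is left at 1 and the stripe should be
 * handed to the release list for the worker thread to finish off. */
bool release_needs_queue(atomic_int *count)
{
    /* Fast path: other holders remain, so just drop our reference. */
    if (add_unless(count, -1, 1))
        return false;

    return true;
}
```

With a starting count of 3, the first two calls simply decrement and return false; only the third call, made while the count is 1, reports that the stripe needs to be queued.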
    • md/raid56: Don't perform reads to support writes until stripe is ready. · 67f45548
      NeilBrown committed
      If it is found that we need to pre-read some blocks before a write
      can succeed, we normally set STRIPE_DELAYED and don't actually perform
      the read until STRIPE_PREREAD_ACTIVE subsequently gets set.
      
      However for a degraded RAID6 we currently perform the reads as soon
      as we see that a write is pending.  This significantly hurts
      throughput.
      
      So:
       - when handle_stripe_dirtying finds a block that it wants on a device
         that is failed, set STRIPE_DELAYED, instead of doing nothing, and
       - when fetch_block detects that a read might be required to satisfy a
         write, only perform the read if STRIPE_PREREAD_ACTIVE is set,
         and if we would actually need to read something to complete the write.
      
      This also helps RAID5, though less often as RAID5 supports a
      read-modify-write cycle.  For RAID5 the read is performed too early
      only if the write is not a full 4K aligned write (i.e. not an
      R5_OVERWRITE).
      
      Also clean up a couple of horrible bits of formatting.
      Reported-by: Patrik Horník <patrik@dsl.sk>
      Signed-off-by: NeilBrown <neilb@suse.de>
  15. 18 Apr, 2014 1 commit
  16. 17 Apr, 2014 1 commit
  17. 09 Apr, 2014 2 commits
    • raid5: get_active_stripe avoids device_lock · e240c183
      Shaohua Li committed
      For a sequential workload (or a workload with big request sizes),
      get_active_stripe can find a cached stripe. In this case we always hold
      device_lock, which exposes a lot of lock contention for such workloads. If the
      stripe count isn't 0, we don't actually need to hold the lock, since we just
      increase its count. This is the hot code path for such workloads.
      Unfortunately we must delete the BUG_ON.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
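The lock-free fast path described in this commit amounts to "take a reference only if the count is already non-zero". A minimal userspace sketch with C11 atomics follows; `get_ref_if_active()` is a hypothetical name, and the real code path may differ.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Try to take a reference without any lock.
 * Succeeds only if someone else already holds a reference (count != 0).
 * On failure the caller would fall back to the locked slow path. */
bool get_ref_if_active(atomic_int *count)
{
    int cur = atomic_load(count);

    while (cur != 0) {
        /* On failure, `cur` is refreshed and the zero check is repeated. */
        if (atomic_compare_exchange_weak(count, &cur, cur + 1))
            return true;
    }
    return false;
}
```

The zero check matters because a stripe whose count has dropped to zero may be sitting on a list that is only safe to touch under the lock, so the shortcut is limited to stripes that are already active.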
    • raid5: make_request does less prepare wait · 27c0f68f
      Shaohua Li committed
      On a NUMA machine, prepare_to_wait/finish_wait in make_request exposes a
      lot of contention for a sequential workload (or a workload with big
      request sizes). For such workloads, each bio includes several stripes, so we
      can just do prepare_to_wait/finish_wait once for the whole bio instead
      of for every stripe.  This removes the lock contention completely for such
      workloads. A random workload might have similar lock contention too,
      but I haven't seen it yet, maybe because my storage is still not fast
      enough.
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  18. 13 Feb, 2014 1 commit
    • md/raid5: Fix CPU hotplug callback registration · 789b5e03
      Oleg Nesterov committed
      Subsystems that want to register CPU hotplug callbacks, as well as perform
      initialization for the CPUs that are already online, often do it as shown
      below:
      
      	get_online_cpus();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	register_cpu_notifier(&foobar_cpu_notifier);
      
      	put_online_cpus();
      
      This is wrong, since it is prone to ABBA deadlocks involving the
      cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
      with CPU hotplug operations).
      
      Interestingly, the raid5 code can actually prevent double initialization and
      hence can use the following simplified form of callback registration:
      
      	register_cpu_notifier(&foobar_cpu_notifier);
      
      	get_online_cpus();
      
      	for_each_online_cpu(cpu)
      		init_cpu(cpu);
      
      	put_online_cpus();
      
      A hotplug operation that occurs between registering the notifier and calling
      get_online_cpus(), won't disrupt anything, because the code takes care to
      perform the memory allocations only once.
      
      So reorganize the code in raid5 this way to fix the deadlock with callback
      registration.
      
      Cc: linux-raid@vger.kernel.org
      Cc: stable@vger.kernel.org (v2.6.32+)
      Fixes: 36d1c647
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      [Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
      free_scratch_buffer() helper to condense code further and wrote the changelog.]
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  19. 22 Jan, 2014 1 commit
    • md/raid5: close recently introduced race in stripe_head management. · 7da9d450
      NeilBrown committed
      As release_stripe and __release_stripe decrement ->count and then
      manipulate ->lru both under ->device_lock, it is important that
      get_active_stripe() increments ->count and clears ->lru also under
      ->device_lock.
      
      However we currently list_del_init ->lru under the lock, but increment
      the ->count outside the lock.  This can lead to races and list
      corruption.
      
      So move the atomic_inc(&sh->count) up inside the ->device_lock
      protected region.
      
      Note that we still increment ->count without device lock in the case
      where get_free_stripe() was called, and in fact don't take
      ->device_lock at all in that path.
      This is safe because if the stripe_head can be found by
      get_free_stripe, then the hash lock assures us that no-one else could
      possibly be calling release_stripe() at the same time.
      
      Fixes: 566c09c5
      Cc: stable@vger.kernel.org (3.13)
      Reported-and-tested-by: Ian Kumlien <ian.kumlien@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
  20. 16 Jan, 2014 1 commit
    • md/raid5: fix long-standing problem with bitmap handling on write failure. · 9f97e4b1
      NeilBrown committed
      Before a write starts we set a bit in the write-intent bitmap.
      When the write completes we clear that bit if the write was successful
      to all devices.  However if the write wasn't fully successful we
      should not clear the bit.  If the faulty drive is subsequently
      re-added, the fact that the bit is still set ensures that we will
      re-write the data that is missing.
      
      This logic is mediated by the STRIPE_DEGRADED flag - we only clear the
      bitmap bit when this flag is not set.
      Currently we correctly set the flag if a write starts when some
      devices are failed or missing.  But we do *not* set the flag if some
      device failed during the write attempt.
      This is wrong and can result in clearing the bit inappropriately.
      
      So: set the flag when a write fails.
      
      This bug has been present since bitmaps were introduced, so the fix is
      suitable for any -stable kernel.
      Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.de>
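The rule for the bitmap bit can be modelled with a single flag. In this sketch, `struct stripe_write` and its helpers are hypothetical; the `degraded` field merely stands in for STRIPE_DEGRADED, and the real state handling is more involved.

```c
#include <stdbool.h>

/* Hypothetical per-stripe write state. */
struct stripe_write {
    bool degraded; /* stands in for STRIPE_DEGRADED */
};

/* Existing behaviour: mark the stripe if devices are already failed or
 * missing when the write starts. */
void write_started(struct stripe_write *w, int failed_devices)
{
    if (failed_devices > 0)
        w->degraded = true;
}

/* Behaviour described by the fix: a device failing during the write
 * attempt must mark the stripe as well. */
void device_write_failed(struct stripe_write *w)
{
    w->degraded = true;
}

/* The write-intent bit may only be cleared if every device was written. */
bool may_clear_bitmap_bit(const struct stripe_write *w)
{
    return !w->degraded;
}
```

Without the call in `device_write_failed()`, a write that started on a healthy array and then lost a device would still report that the bit may be cleared, which is the inappropriate clearing the commit describes.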
  21. 14 Jan, 2014 2 commits
    • md/raid5: fix a recently broken BUG_ON(). · 5af9bef7
      NeilBrown committed
      commit 6d183de4
          md/raid5: fix newly-broken locking in get_active_stripe.
      
      simplified a BUG_ON, but removed too much so now it sometimes fires
      when it shouldn't.
      
      When the STRIPE_EXPANDING flag is set, the stripe_head might be on a
      special list while multiple stripe_heads are collected, or it might
      not be on any list, even a 'free' list when the refcount is zero.  As
      long as STRIPE_EXPANDING is set, it will be found and added back to a
      list eventually.
      
      So both of the BUG_ONs which test for the ->lru being empty or not
      need to avoid the case where STRIPE_EXPANDING is set.
      
      The patch which broke this was marked for -stable, so this patch needs
      to be applied to any branch that received 6d183de4
      
      Fixes: 6d183de4
      Cc: stable@vger.kernel.org (any release to which above was applied)
      Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid5: Fix possible confusion when multiple write errors occur. · 1cc03eb9
      NeilBrown committed
      commit 5d8c71f9
          md: raid5 crash during degradation
      
      Fixed a crash in an overly simplistic way which could leave
      R5_WriteError or R5_MadeGood set in the stripe cache for devices
      for which it is no longer relevant.
      When those devices are removed and spares added the flags are still
      set and can cause incorrect behaviour.
      
      commit 14a75d3e
          md/raid5: preferentially read from replacement device if possible.
      
      Fixed the same bug in a more effective way, so we can now revert
      the original commit.
      Reported-and-tested-by: Alexander Lyakas <alex.bolshoy@gmail.com>
      Cc: stable@vger.kernel.org (3.2+ - 3.2 will need a different fix though)
      Fixes: 5d8c71f9
      Signed-off-by: NeilBrown <neilb@suse.de>