1. 19 Nov 2013, 1 commit
  2. 24 Oct 2013, 1 commit
    • md: Fix skipping recovery for read-only arrays. · 61e4947c
      Committed by Lukasz Dorau
      Since:
              commit 7ceb17e8
              md: Allow devices to be re-added to a read-only array.
      
spares are activated on a read-only array. For the raid1 and raid10
personalities this means that not-in-sync devices are marked in-sync
without checking whether recovery has finished.
      
If a read-only array is degraded and one of its devices is not in-sync
(because the array has only been partially recovered), recovery will be skipped.
      
This patch adds a check that recovery has finished before marking a device
in-sync for the raid1 and raid10 personalities. The raid5 personality
already has such a check (at raid5.c:6029).
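      As a minimal sketch (hedged; this mirrors the raid1_spare_active()
      condition as I understand the fix, not a verbatim quote of the patch):
      
              /* only mark a spare In_sync once its recovery has completed */
              if (rdev
                  && rdev->recovery_offset == MaxSector
                  && !test_bit(Faulty, &rdev->flags)
                  && !test_and_set_bit(In_sync, &rdev->flags)) {
                      count++;
                      sysfs_notify_dirent_safe(rdev->sysfs_state);
              }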
      
The bug was introduced in 3.10 and can cause data corruption.
      
      Cc: stable@vger.kernel.org
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
  3. 18 Jul 2013, 1 commit
    • md/raid1: fix bio handling problems in process_checks() · 30bc9b53
      Committed by NeilBrown
The recent change to use bio_copy_data() in raid1 when repairing
an array is faulty.
      
The underlying device may have changed the bio in various ways using
bio_advance(), and these changes need to be undone not just for the
'sbio' which is being copied to, but also for the 'pbio' (primary)
which is being copied from.
      
      So perform the reset on all bios that were read from and do it early.
      
This also ensures that the sbio->bi_io_vec[j].bv_len passed to
memcmp is correct.
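      A hedged sketch of the reset this describes (modelled on the
      process_checks() loop in drivers/md/raid1.c of that era; abridged,
      not the literal patch):
      
              for (i = 0; i < conf->raid_disks * 2; i++) {
                      struct bio *b = r1_bio->bios[i];
                      if (b->bi_end_io != end_sync_read)
                              continue;
                      /* fix up the bio for reuse: undo bio_advance() effects */
                      bio_reset(b);
                      b->bi_vcnt = vcnt;
                      b->bi_size = r1_bio->sectors << 9;
                      b->bi_sector = r1_bio->sector +
                              conf->mirrors[i].rdev->data_offset;
                      b->bi_bdev = conf->mirrors[i].rdev->bdev;
                      b->bi_end_io = end_sync_read;
                      b->bi_private = r1_bio;
              }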
      
      This fixes a crash during a 'check' of a RAID1 array.  The crash was
      introduced in 3.10 so this is suitable for 3.10-stable.
      
      Cc: stable@vger.kernel.org (3.10)
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
  4. 14 Jun 2013, 1 commit
    • DM RAID: Add ability to restore transiently failed devices on resume · 9092c02d
      Committed by Jonathan Brassow
      
      This patch adds code to the resume function to check over the devices
      in the RAID array.  If any are found to be marked as failed and their
      superblocks can be read, an attempt is made to reintegrate them into
      the array.  This allows the user to refresh the array with a simple
      suspend and resume of the array - rather than having to load a
      completely new table, allocate and initialize all the structures and
      throw away the old instantiation.
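      A hedged sketch of the resume-time check (abridged from my reading of
      the helper this patch adds to drivers/md/dm-raid.c; the superblock
      bookkeeping is omitted):
      
              for (i = 0; i < rs->md.raid_disks; i++) {
                      struct md_rdev *r = &rs->dev[i].rdev;
      
                      /* marked failed, but its superblock is still readable? */
                      if (test_bit(Faulty, &r->flags) && r->sb_page &&
                          sync_page_io(r, 0, r->sb_size, r->sb_page, READ, 1)) {
                              /* clear failure state so md can re-add it */
                              clear_bit(Faulty, &r->flags);
                              clear_bit(In_sync, &r->flags);
                              r->recovery_offset = 0;
                      }
              }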
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
  5. 13 Jun 2013, 3 commits
    • md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place · 5026d7a9
      Committed by H. Peter Anvin
      There are cases where the kernel will believe that the WRITE SAME
      command is supported by a block device which does not, in fact,
support WRITE SAME.  This currently happens for SATA drives behind a
      SAS controller, but there are probably a hundred other ways that can
      happen, including drive firmware bugs.
      
After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will treat the
failure as fatal and consider the drive failed.  This has the effect
      that all the mirrors containing a specific set of data are each
      offlined in very rapid succession resulting in data loss.
      
      However, just bouncing the request back up to the block layer isn't
      ideal either, because the whole initial request-retry sequence should
      be inside the write bitmap fence, which probably means that md needs
      to do its own conversion of WRITE SAME to write zero.
      
      Until the failure scenario has been sorted out, disable WRITE SAME for
      raid1, raid5, and raid10.
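      The mechanism, as a sketch (the raid1 case; raid5 and raid10 got the
      equivalent one-liner):
      
              /* advertise a WRITE SAME limit of zero so the block layer
               * never sends WRITE SAME to the md array */
              if (mddev->queue)
                      blk_queue_max_write_same_sectors(mddev->queue, 0);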
      
      [neilb: added raid5]
      
      This patch is appropriate for any -stable since 3.7 when write_same
      support was added.
      
      Cc: stable@vger.kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1,raid10: use freeze_array in place of raise_barrier in various places. · e2d59925
      Committed by NeilBrown
      Various places in raid1 and raid10 are calling raise_barrier when they
      really should call freeze_array.
      The former is only intended to be called from "make_request".
The latter has extra checks for 'nr_queued' and makes a call to
      flush_pending_writes(), so it is safe to call it from within the
      management thread.
      
      Using raise_barrier will sometimes deadlock.  Using freeze_array
      should not.
      
      As 'freeze_array' currently expects one request to be pending (in
      handle_read_error - the only previous caller), we need to pass
      it the number of pending requests (extra) to ignore.
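      The resulting helper, roughly (matching my reading of the post-patch
      freeze_array() in drivers/md/raid1.c):
      
              static void freeze_array(struct r1conf *conf, int extra)
              {
                      /* stop new requests; wait until only 'extra' requests
                       * remain pending that are not yet queued to the thread */
                      spin_lock_irq(&conf->resync_lock);
                      conf->barrier++;
                      conf->nr_waiting++;
                      wait_event_lock_irq_cmd(conf->wait_barrier,
                                      conf->nr_pending == conf->nr_queued + extra,
                                      conf->resync_lock,
                                      flush_pending_writes(conf));
                      spin_unlock_irq(&conf->resync_lock);
              }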
      
      The deadlock was made particularly noticeable by commits
      050b6615 (raid10) and 6b740b8d (raid1) which
      appeared in 3.4, so the fix is appropriate for any -stable
      kernel since then.
      
      This patch probably won't apply directly to some early kernels and
      will need to be applied by hand.
      
      Cc: stable@vger.kernel.org
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it. · 3056e3ae
      Committed by Alex Lyakas
      
      Without that fix, the following scenario could happen:
      
      - RAID1 with drives A and B; drive B was freshly-added and is rebuilding
      - Drive A fails
- A WRITE request arrives at the array. It is failed by drive A, so
      r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
      succeeds in writing it, so the same r1_bio is marked as
      R1BIO_Uptodate.
- r1_bio arrives at handle_write_finished; badblocks are disabled, and
md_error()->error() does nothing because we don't fail the last drive
of raid1
      - raid_end_bio_io()  calls call_bio_endio()
      - As a result, in call_bio_endio():
              if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
                      clear_bit(BIO_UPTODATE, &bio->bi_flags);
this code doesn't clear the BIO_UPTODATE flag, and the whole master
WRITE is reported as successful back to the upper layer.
      
      So we returned success to the upper layer, even though we had written
      the data onto the rebuilding drive only. But when we want to read the
      data back, we would not read from the rebuilding drive, so this data
      is lost.
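      A hedged sketch of the fix in raid1_end_write_request() (the raid10
      change is analogous; this is my paraphrase, not the literal diff):
      
              /* count the write as successful only on a device that is
               * In_sync and not Faulty, i.e. not still rebuilding */
              if (test_bit(In_sync, &conf->mirrors[mirror].rdev->flags) &&
                  !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags))
                      set_bit(R1BIO_Uptodate, &r1_bio->state);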
      
      [neilb - applied identical change to raid10 as well]
      
      This bug can result in lost data, so it is suitable for any
      -stable kernel.
      
      Cc: stable@vger.kernel.org
Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: NeilBrown <neilb@suse.de>
  6. 30 Apr 2013, 2 commits
  7. 24 Mar 2013, 9 commits
    • block: Add bio_alloc_pages() · a0787606
      Committed by Kent Overstreet
      More utility code to replace stuff that's getting open coded.
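      The helper is small enough to quote in full; this matches my reading
      of the fs/bio.c implementation at the time:
      
              int bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
              {
                      int i;
                      struct bio_vec *bv;
      
                      /* allocate one page per biovec, unwinding on failure */
                      bio_for_each_segment_all(bv, bio, i) {
                              bv->bv_page = alloc_page(gfp_mask);
                              if (!bv->bv_page) {
                                      while (--bv >= bio->bi_io_vec)
                                              __free_page(bv->bv_page);
                                      return -ENOMEM;
                              }
                      }
                      return 0;
              }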
Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
    • block: Convert some code to bio_for_each_segment_all() · cb34e057
      Committed by Kent Overstreet
      More prep work for immutable bvecs:
      
      A few places in the code were either open coding or using the wrong
      version - fix.
      
After we introduce the bvec iter, it'll no longer be possible to modify
the biovec through bio_for_each_segment_all() - it doesn't increment a
pointer to the current bvec; instead you pass in a struct bio_vec (not a
pointer), which is updated with what the current biovec would be (taking
into account bi_bvec_done and bi_size).
      
      So because of that it's more worthwhile to be consistent about
      bio_for_each_segment()/bio_for_each_segment_all() usage.
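      An illustrative use (assuming code that owns the bio, e.g. md filling
      its own sync pages; not taken from this patch):
      
              struct bio_vec *bvec;
              int i;
      
              /* legal only because we own the bio and its biovec */
              bio_for_each_segment_all(bvec, bio, i)
                      memset(page_address(bvec->bv_page), 0, bvec->bv_len);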
Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Alexander Viro <viro@zeniv.linux.org.uk>
    • block: Add bio_for_each_segment_all() · d74c6d51
      Committed by Kent Overstreet
__bio_for_each_segment() iterates bvecs from the specified index
instead of bio->bi_idx.  Currently, the only usage is to walk all the
      bvecs after the bio has been advanced by specifying 0 index.
      
      For immutable bvecs, we need to split these apart;
      bio_for_each_segment() is going to have a different implementation.
      This will also help document the intent of code that's using it -
      bio_for_each_segment_all() is only legal to use for code that owns the
      bio.
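      For reference, the new macro is (as I recall it landing in
      include/linux/bio.h):
      
              /* walk every allocated bvec; only for code that owns the bio */
              #define bio_for_each_segment_all(bvl, bio, i)                  \
                      for (i = 0, bvl = (bio)->bi_io_vec;                    \
                           i < (bio)->bi_vcnt; i++, bvl++)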
Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Neil Brown <neilb@suse.de>
      CC: Boaz Harrosh <bharrosh@panasas.com>
    • raid1: use bio_copy_data() · d3b45c2a
      Committed by Kent Overstreet
      This doesn't really delete any code _yet_, but once immutable bvecs are
      done we can just delete the rest of the code in that loop.
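      The conversion in process_checks() boils down to (hedged paraphrase):
      
              /* was: an open-coded loop copying pbio's pages into sbio */
              bio_copy_data(sbio, pbio);      /* copy data from pbio to sbio */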
Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
    • raid1: Refactor narrow_write_error() to not use bi_idx · b783863f
      Committed by Kent Overstreet
      More bi_idx removal. This code was just open coding bio_clone(). This
      could probably be further improved by using bio_advance() instead of
      skipping over null pages, but that'd be a larger rework.
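      Roughly, the refactor replaces the hand-rolled clone with (my
      paraphrase, assuming the bio_clone_mddev() helper md already had):
      
              /* clone the master bio instead of copying bi_io_vec by hand */
              wbio = bio_clone_mddev(r1_bio->master_bio, GFP_NOIO, mddev);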
Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
    • raid1: use bio_reset() · 2aabaa65
      Committed by Kent Overstreet
Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
    • block: Add submit_bio_wait(), remove from md · 9e882242
      Committed by Kent Overstreet
      Random cleanup - this code was duplicated and it's not really specific
      to md.
      
      Also added the ability to return the actual error code.
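      Typical usage, sketched (the 3.10-era signature takes the rw flags
      first and returns the error code):
      
              bio->bi_bdev = bdev;
              bio->bi_sector = sector;
              bio_add_page(bio, page, size, 0);
              /* submit and sleep until the bio completes */
              ret = submit_bio_wait(WRITE, bio);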
Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: NeilBrown <neilb@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
    • block: Use bio_sectors() more consistently · aa8b57aa
      Committed by Kent Overstreet
A bunch of places in the code weren't using it where they could be;
this will reduce the size of the patch that puts bi_sector/bi_size/bi_idx
into a struct bvec_iter.
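      For context, at this point bio_sectors() was still the simple
      pre-bvec_iter macro:
      
              /* size of the bio in 512-byte sectors */
              #define bio_sectors(bio)  ((bio)->bi_size >> 9)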
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: "Ed L. Cashin" <ecashin@coraid.com>
      CC: Nick Piggin <npiggin@kernel.dk>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Jim Paris <jim@jtan.com>
      CC: Geoff Levand <geoff@infradead.org>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Ed Cashin <ecashin@coraid.com>
    • block: Add bio_end_sector() · f73a1c7d
      Committed by Kent Overstreet
Just a little convenience macro - the main reason to add it now is to
prepare for immutable bio vecs; it'll reduce the size of the patch that
puts bi_sector/bi_size/bi_idx into a struct bvec_iter.
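      The macro itself, as added (to the best of my recollection):
      
              /* first sector past the end of the bio */
              #define bio_end_sector(bio)  ((bio)->bi_sector + bio_sectors(bio))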
Signed-off-by: Kent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
      CC: Heiko Carstens <heiko.carstens@de.ibm.com>
      CC: linux-s390@vger.kernel.org
      CC: Chris Mason <chris.mason@fusionio.com>
      CC: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
  8. 26 Feb 2013, 2 commits
  9. 30 Nov 2012, 1 commit
    • wait: add wait_event_lock_irq() interface · eed8c02e
      Committed by Lukas Czerner
New wait_event{_interruptible}_lock_irq{_cmd} macros are added. This commit
moves the private wait_event_lock_irq() macro from MD to the regular wait
includes, introduces the new macro wait_event_lock_irq_cmd() (instead of
the old, ugly approach of an omittable cmd parameter), and makes use of
the new macros in MD. It also introduces the _interruptible_ variant.
      
The new interface is for when one has a special lock protecting the data
structures used in the condition, or also needs to invoke "cmd"
before putting the task to sleep.
      
All new macros are expected to be called with the lock taken. The lock
is released before sleeping and is reacquired afterwards. We leave the
macro with the lock held.
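      A sketch of the calling convention (hypothetical condition; the lock
      must be held on entry and is held again on return):
      
              spin_lock_irq(&conf->device_lock);
              /* lock is dropped while sleeping, reacquired to re-test */
              wait_event_lock_irq(conf->wait_barrier,
                                  conf->nr_pending == 0,
                                  conf->device_lock);
              /* condition now true, lock held */
              spin_unlock_irq(&conf->device_lock);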
      
Note to DM: IMO this should also fix a theoretical race on the waitqueue
when wait_event_lock_irq() and wait_event() are used simultaneously,
because of the lack of locking around the current-state setting and wait
queue removal.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 27 Nov 2012, 1 commit
    • md/raid1{,0}: fix deadlock in bitmap_unplug. · 874807a8
      Committed by NeilBrown
If the raid1 or raid10 unplug function gets called
from a make_request function (which is very possible) when
there are bios on the current->bio_list list, then it will not
be able to successfully call bitmap_unplug(), as that could
need to submit more bios and wait for them to complete.
But they won't complete while current->bio_list is non-empty.
      
So detect that case and hand the unplugging off to another thread,
just like we already do when called from within the scheduler.
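      Sketched, the fix widens the "defer to the md thread" test in the
      raid1 unplug callback (my paraphrase):
      
              if (from_schedule || current->bio_list) {
                      /* unsafe to submit here: queue the pending bios
                       * and let the md thread flush them */
                      spin_lock_irq(&conf->device_lock);
                      bio_list_merge(&conf->pending_bio_list, &plug->pending);
                      conf->pending_count += plug->pending_cnt;
                      spin_unlock_irq(&conf->device_lock);
                      md_wakeup_thread(mddev->thread);
                      kfree(plug);
                      return;
              }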
      
      RAID1 version of bug was introduced in 3.6, so that part of fix is
      suitable for 3.6.y.  RAID10 part won't apply.
      
      Cc: stable@vger.kernel.org
Reported-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reported-by: Peter Maloney <peter.maloney@brockmann-consult.de>
Signed-off-by: NeilBrown <neilb@suse.de>
  11. 31 Oct 2012, 1 commit
    • md/raid1: Fix assembling of arrays containing Replacements. · 02b898f2
      Committed by NeilBrown
      setup_conf in raid1.c uses conf->raid_disks before assigning
      a value.  It is used when including 'Replacement' devices.
      
      The consequence is that assembling an array which contains a
      replacement will misbehave and either not include the replacement, or
      not include the device being replaced.
      
      Though this doesn't lead directly to data corruption, it could lead to
      reduced data safety.
      
      So use mddev->raid_disks, which is initialised, instead.
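      The fix is essentially a one-liner in setup_conf() (hedged; quoted
      from memory, not the literal diff):
      
              /* place a Replacement in the second half of the mirrors array */
              if (test_bit(Replacement, &rdev->flags))
                      disk = conf->mirrors + mddev->raid_disks + disk_idx;
              else
                      disk = conf->mirrors + disk_idx;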
      
The bug was introduced by commit c19d5798
            md/raid1: recognise replacements when assembling arrays.
      
in 3.3, so the fix is suitable for 3.3.y through 3.6.y.
      
      Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
  12. 11 Oct 2012, 4 commits
  13. 02 Aug 2012, 1 commit
    • md/raid1: submit IO from originating thread instead of md thread. · f54a9d0e
      Committed by NeilBrown
Queuing writes to the md thread means that all requests go through the
one processor, which may not be able to keep up with very high request
rates.
      
      So use the plugging infrastructure to submit all requests on unplug.
      If a 'schedule' is needed, we fall back on the old approach of handing
      the requests to the thread for it to handle.
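      The plumbing, sketched (per-plug state plus blk_check_plugged(), as
      in the raid1 code of that era; abridged, locking omitted):
      
              struct raid1_plug_cb {
                      struct blk_plug_cb      cb;
                      struct bio_list         pending;
                      int                     pending_cnt;
              };
      
              /* in make_request: queue the write on the current plug,
               * falling back to the md thread's list if no plug exists */
              cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
              if (cb) {
                      plug = container_of(cb, struct raid1_plug_cb, cb);
                      bio_list_add(&plug->pending, mbio);
                      plug->pending_cnt++;
              } else {
                      bio_list_add(&conf->pending_bio_list, mbio);
                      conf->pending_count++;
                      md_wakeup_thread(mddev->thread);
              }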
Signed-off-by: NeilBrown <neilb@suse.de>
  14. 31 Jul 2012, 8 commits
    • md: remove plug_cnt feature of plugging. · 0021b7bc
      Committed by NeilBrown
This seemed like a good idea at the time, but after further thought I
cannot see it making a difference other than very occasionally, and
testing intended to exercise the case it is most likely to help did not
show any performance difference from removing it.
      
      So remove the counting of active plugs and allow 'pending writes' to
      be activated at any time, not just when no plugs are active.
      
      This is only relevant when there is a write-intent bitmap, and the
      updating of the bitmap will likely introduce enough delay that
      the single-threading of bitmap updates will be enough to collect large
      numbers of updates together.
      
Removing this will make it easier to centralise the unplug code, and
will clear the way for other unplug enhancements which have a
measurable effect.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • md/RAID1: Add missing case for attempting to repair known bad blocks. · d57368af
      Committed by Alexander Lyakas
When doing resync or repair, attempt to correct bad blocks, according
to the WriteErrorSeen policy.
Signed-off-by: Alex Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1: don't abort a resync on the first badblock. · b7219ccb
      Committed by NeilBrown
If a resync of a RAID1 array with 2 devices finds a known bad block
on one device, it will neither read from nor write to that device for
this block offset.
So there will be one read target (the other device) and zero write
targets.
      This condition causes md/raid1 to abort the resync assuming that it
      has finished - without known bad blocks this would be true.
      
      When there are no write targets because of the presence of bad blocks
      we should only skip over the area covered by the bad block.
      RAID10 already gets this right, raid1 doesn't.  Or didn't.
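      A hedged sketch of the idea (paraphrasing my understanding of the fix
      in sync_request(); min_bad is the extent of the bad block on the
      affected device):
      
              if (write_targets == 0 || read_targets == 0) {
                      /* nowhere to write: skip only the bad region,
                       * not the whole remainder of the resync */
                      sector_t rv;
                      if (min_bad > 0)
                              max_sector = sector_nr + min_bad;
                      rv = max_sector - sector_nr;
                      *skipped = 1;
                      put_buf(r1_bio);
                      return rv;
              }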
      
As this can cause a 'sync' to abort early and appear to have succeeded,
it could lead to some data corruption, so it is suitable for -stable.
      
      Cc: stable@vger.kernel.org
Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1: prevent merging too large request · 12cee5a8
      Committed by Shaohua Li
For SSDs, once the request size exceeds a specific value (the optimal IO
size), request size no longer matters for bandwidth. In that situation, if
making a request bigger leaves some disks idle, the total throughput will
actually drop. A good example is doing readahead in a two-disk raid1 setup.
      
So when should we split big requests? We absolutely don't want to split a
big request into very small requests; even on SSDs, big transfers are more
efficient. This patch only considers requests with size above the optimal
IO size.
      
If all disks are busy, is it worth doing a split? Say the optimal IO size
is 16k, and we have two 32k requests and two disks. We can let each disk
run one 32k request, or split the requests into four 16k requests with
each disk running two. It's hard to say which case is better; it depends
on the hardware.
      
So only consider the case where there are idle disks. For readahead, a
split is always better in this case. And in my tests, the patch below
improves throughput by more than 30%. Hmm, not 100%, because the disk
isn't 100% busy.
      
Such a case can happen not just with readahead; for example, with directio.
But I suppose directio usually has a bigger IO depth and makes all disks
busy, so I ignored it.
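      The decision, reduced to an illustrative helper (my own sketch of the
      stated policy, not the kernel code):
      
              /* split a large sequential read only when it exceeds the
               * optimal IO size and another member disk is idle */
              static bool should_stop_extending(unsigned int sectors,
                                                unsigned int opt_iosize,
                                                bool ssd, bool idle_disk)
              {
                      return ssd && opt_iosize > 0 &&
                             sectors > opt_iosize && idle_disk;
              }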
      
Note: if the raid uses any hard disk, we don't prevent merging. That would
make performance worse.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1: read balance chooses idlest disk for SSD · 9dedf603
      Committed by Shaohua Li
An SSD has no spindle, so the distance between requests means nothing. And
the original distance-based algorithm can sometimes cause severe
performance issues for SSD raid.
      
Consider two thread groups: one accesses file A, the other accesses file B.
The first group will access one disk and the second will access the other
disk, because requests within a group are near each other and requests
between groups are far apart. In this case, read balance might keep one
disk very busy while the other stays relatively idle.  For SSD, we should
try our best to distribute requests to as many disks as possible. There is
no spindle-move penalty anyway.
      
With the patch below, I sometimes see more than 50% throughput improvement,
depending on the workload.
      
The only exception is small requests that can be merged into a big request,
which typically drives higher throughput for SSD too. Such small requests
are sequential reads. Unlike with hard disks, sequential reads that can't
be merged (for example direct IO, or reads without readahead) can be
ignored for SSD; again, there is no spindle-move penalty. Readahead
dispatches small requests, and such requests can be merged.
      
The last patch helps detect sequential reads well, at least when the number
of concurrent reads isn't greater than the number of raid disks. In that
case the distance-based algorithm doesn't work well either.
      
V2: For a raid mixing hard disks and SSDs, don't use the distance-based
algorithm for random IO either. This makes the algorithm generic for any
raid containing an SSD.
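      The selection loop, sketched from my reading of the patched
      read_balance() (abridged):
      
              pending = atomic_read(&rdev->nr_pending);
              dist = abs(this_sector - conf->mirrors[disk].head_position);
              if (nonrot) {
                      /* SSD: pick the member with the fewest
                       * in-flight requests */
                      if (min_pending > pending) {
                              min_pending = pending;
                              best_pending_disk = disk;
                      }
              } else if (dist < best_dist) {
                      /* rotational: keep the shortest-seek rule */
                      best_dist = dist;
                      best_dist_disk = disk;
              }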
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1: make sequential read detection per disk based · be4d3280
      Committed by Shaohua Li
Currently the sequential read detection is global. It's natural to make it
per-disk, which improves detection of multiple concurrent sequential reads.
The next patch will make SSD read balance stop using the distance-based
algorithm; this change helps it detect truly sequential reads for SSD.
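      Sketched (hedged; the per-disk field name follows the raid1 code as I
      recall it):
      
              /* a read continuing where this disk's last read ended is
               * treated as sequential and stays on the same disk */
              if (conf->mirrors[disk].next_seq_sect == this_sector)
                      best_disk = disk;
      
              /* after choosing, remember where this read will end */
              conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;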
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
    • MD: Move macros from raid1*.h to raid1*.c · 473e87ce
      Committed by Jonathan Brassow
      
There are three macros (IO_BLOCKED, IO_MADE_GOOD, BIO_SPECIAL) which are defined
in both raid1.h and raid10.h.  They are only used in their respective .c files.
      However, if we wish to make RAID10 accessible to the device-mapper RAID
      target (dm-raid.c), then we need to move these macros into the .c files where
      they are used so that they do not conflict with each other.
      
      The macros from the two files are identical and could be moved into md.h, but
      I chose to leave the duplication and have them remain in the personality
      files.
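      The macros in question, for reference (sentinel bio pointer values):
      
              #define IO_BLOCKED    ((struct bio *)1)
              #define IO_MADE_GOOD  ((struct bio *)2)
              #define BIO_SPECIAL(bio) ((unsigned long)bio <= 2)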
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
    • MD RAID1: rename mirror_info structure · 0eaf822c
      Committed by Jonathan Brassow
      MD RAID1: Rename the structure 'mirror_info' to 'raid1_info'
      
The same structure name ('mirror_info') is used by raid10.  Each of these
structures is defined in its respective header file.  If dm-raid is
      to support both RAID1 and RAID10, the header files will be included and
      the structure names must not collide.  While only one of these structure
      names needs to change, this patch adds consistency to the naming of the
      structure.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
  15. 19 Jul 2012, 1 commit
  16. 09 Jul 2012, 1 commit
    • md/raid1: fix use-after-free bug in RAID1 data-check code. · 2d4f4f33
      Committed by NeilBrown
This bug has been present ever since data-check was introduced
in 2.6.16.  However it would only fire if a data-check were
done on a degraded array, which was only possible if the array
had 3 or more devices.  This is certainly possible, but quite
uncommon.
      
      Since hot-replace was added in 3.3 it can happen more often as
      the same condition can arise if not all possible replacements are
      present.
      
      The problem is that as soon as we submit the last read request, the
      'r1_bio' structure could be freed at any time, so we really should
      stop looking at it.  If the last device is being read from we will
stop looking at it.  However if the last device is not due to be read
from, we will still check the bio pointer in the r1_bio, but the
r1_bio might already have been freed.
      
      So use the read_targets counter to make sure we stop looking for bios
      to submit as soon as we have submitted them all.
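      A hedged sketch of the loop change in sync_request() (my paraphrase):
      
              /* stop scanning once every read target has been submitted;
               * after the last submit, r1_bio may be freed at any time */
              for (i = 0; i < conf->raid_disks * 2 && read_targets; i++) {
                      struct bio *bio = r1_bio->bios[i];
                      if (bio->bi_end_io == end_sync_read) {
                              read_targets--;
                              md_sync_acct(bio->bi_bdev, nr_sectors);
                              generic_make_request(bio);
                      }
              }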
      
      This fix is suitable for any -stable kernel since 2.6.16.
      
      Cc: stable@vger.kernel.org
Reported-by: Arnold Schulz <arnysch@gmx.net>
Signed-off-by: NeilBrown <neilb@suse.de>
  17. 03 Jul 2012, 2 commits
    • md: fix up plugging (again). · b357f04a
      Committed by NeilBrown
      The value returned by "mddev_check_plug" is only valid until the
      next 'schedule' as that will unplug things.  This could happen at any
      call to mempool_alloc.
      So just calling mddev_check_plug at the start doesn't really make
      sense.
      
      So call it just before, or just after, queuing things for the thread.
      As the action that happens at unplug is to wake the thread, this makes
      lots of sense.
If we cannot add a plug (which requires a small GFP_ATOMIC alloc) we
wake the thread immediately.
      
      RAID5 is a bit different.  Requests are queued for the thread and the
      thread is woken by release_stripe.  So we don't need to wake the
      thread on failure.
      However the thread doesn't perform certain actions when there is any
      active plug, so it is important to install a plug before waking the
      thread.  So for RAID5 we install the plug *before* queuing the request
      and waking the thread.
      
      Without this patch it is possible for raid1 or raid10 to queue a
      request without then waking the thread, resulting in the array locking
      up.
      
Also change raid10 to only call flush_pending_writes() when there are
no active plugs, just like raid1.
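      For raid1/raid10 the resulting pattern is roughly (hedged paraphrase,
      locking omitted):
      
              /* queue the request first, then plug-or-wake */
              bio_list_add(&conf->pending_bio_list, mbio);
              conf->pending_count++;
              if (!mddev_check_plug(mddev))
                      md_wakeup_thread(mddev->thread);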
      
      This patch is suitable for 3.0 or later.  I plan to submit it to
-stable, but I'd like to let it spend a few weeks in mainline
first to be sure it is completely safe.
Signed-off-by: NeilBrown <neilb@suse.de>
    • md/raid1: fix bug in read_balance introduced by hot-replace · 32644afd
      Committed by NeilBrown
      When we added hot_replace we doubled the number of devices
      that could be in a RAID1 array.  So we doubled how far read_balance
      would search.  Unfortunately we didn't double the point at which
      it looped back to the beginning - so it effectively loops over
      all non-replacement disks twice.
This doesn't cause bad behaviour, but it is pointless and means we
never read from replacement devices.
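      The fix, as I understand it, doubles the wrap point to match the
      doubled search range (hedged; quoted from memory):
      
              /* wrap over raid_disks * 2 slots, since replacements occupy
               * the second half of the mirrors array */
              int disk = start_disk + i;
              if (disk >= conf->raid_disks * 2)
                      disk -= conf->raid_disks * 2;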
Signed-off-by: NeilBrown <neilb@suse.de>