1. 24 4月, 2017 1 次提交
  2. 12 4月, 2017 4 次提交
    • N
      md/raid1: factor out flush_bio_list() · 673ca68d
      NeilBrown 提交于
      flush_pending_writes() and raid1_unplug() each contain identical
      copies of a fairly large slab of code.  So factor that out into
      new flush_bio_list() to simplify maintenance.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      673ca68d
    • N
      md/raid1: simplify handle_read_error(). · 689389a0
      NeilBrown 提交于
      handle_read_error() duplicates a lot of the work that raid1_read_request()
      does, so it makes sense to just use that function.
      This doesn't quite work as handle_read_error() relies on the same r1bio
      being re-used so that, in the case of a read-only array, setting
      IO_BLOCKED in r1bio->bios[] ensures read_balance() won't re-use
      that device.
      So we need to allow a r1bio to be passed to raid1_read_request(), and to
      have that function mostly initialise the r1bio, but leave the bios[]
      array untouched.
      
      Two parts of handle_read_error() that need to be preserved are the warning
      message it prints, so they are conditionally added to raid1_read_request().
      
      Note that this highlights a minor bug on alloc_r1bio().  It doesn't
      initalise the bios[] array, so it is possible that old content is there,
      which might cause read_balance() to ignore some devices with no good reason.
      
      With this change, we no longer need inc_pending(), or the sectors_handled
      arg to alloc_r1bio().
      
      As handle_read_error() is called from raid1d() and allocates memory,
      there is tiny chance of a deadlock.  All element of various pools
      could be queued waiting for raid1 to handle them, and there may be no
      extra memory free.
      Achieving guaranteed forward progress would probably require a second
      thread and another mempool.  Instead of that complexity, add
      __GFP_HIGH to any allocations when read1_read_request() is called
      from raid1d.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      689389a0
    • N
      md/raid1: simplify alloc_behind_master_bio() · cb83efcf
      NeilBrown 提交于
      Now that we always always pass an offset of 0 and a size
      that matches the bio to alloc_behind_master_bio(),
      we can remove the offset/size args and simplify the code.
      
      We could probably remove bio_copy_data_partial() too.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      cb83efcf
    • N
      md/raid1: simplify the splitting of requests. · c230e7e5
      NeilBrown 提交于
      raid1 currently splits requests in two different ways for
      two different reasons.
      
      First, bio_split() is used to ensure the bio fits within a
      resync accounting region.
      Second, multiple r1bios are allocated for each bio to handle
      the possiblity of known bad blocks on some devices.
      
      This can be simplified to just use bio_split() once, and not
      use multiple r1bios.
      We delay the split until we know a maximum bio size that can
      be handled with a single r1bio, and then split the bio and
      queue the remainder for later handling.
      
      This avoids all loops inside raid1.c request handling.  Just
      a single read, or a single set of writes, is submitted to
      lower-level devices for each bio that comes from
      generic_make_request().
      
      When the bio needs to be split, generic_make_request() will
      do the necessary looping and call md_make_request() multiple
      times.
      
      raid1_make_request() no longer queues request for raid1 to handle,
      so we can remove that branch from the 'if'.
      
      This patch also creates a new private bio_set
      (conf->bio_split) for splitting bios.  Using fs_bio_set
      is wrong, as it is meant to be used by filesystems, not
      block devices.  Using it inside md can lead to deadlocks
      under high memory pressure.
      
      Delete unused variable in raid1_write_request() (Shaohua)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      c230e7e5
  3. 11 4月, 2017 1 次提交
    • N
      md/raid1: avoid reusing a resync bio after error handling. · 0c9d5b12
      NeilBrown 提交于
      fix_sync_read_error() modifies a bio on a newly faulty
      device by setting bi_end_io to end_sync_write.
      This ensure that put_buf() will still call rdev_dec_pending()
      as required, but makes sure that subsequent code in
      fix_sync_read_error() doesn't try to read from the device.
      
      Unfortunately this interacts badly with sync_request_write()
      which assumes that any bio with bi_end_io set to non-NULL
      other than end_sync_read is safe to write to.
      
      As the device is now faulty it doesn't make sense to write.
      As the bio was recently used for a read, it is "dirty"
      and not suitable for immediate submission.
      In particular, ->bi_next might be non-NULL, which will cause
      generic_make_request() to complain.
      
      Break this interaction by refusing to write to devices
      which are marked as Faulty.
      Reported-and-tested-by: NMichael Wang <yun.wang@profitbricks.com>
      Fixes: 2e52d449 ("md/raid1: add failfast handling for reads.")
      Cc: stable@vger.kernel.org (v4.10+)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      0c9d5b12
  4. 28 3月, 2017 1 次提交
    • M
      md: raid1: kill warning on powerpc_pseries · 8fc04e6e
      Ming Lei 提交于
      This patch kills the warning reported on powerpc_pseries,
      and actually we don't need the initialization.
      
      	After merging the md tree, today's linux-next build (powerpc
      	pseries_le_defconfig) produced this warning:
      
      	drivers/md/raid1.c: In function 'raid1d':
      	drivers/md/raid1.c:2172:9: warning: 'page_len$' may be used uninitialized in this function [-Wmaybe-uninitialized]
      	     if (memcmp(page_address(ppages[j]),
      	         ^
      	drivers/md/raid1.c:2160:7: note: 'page_len$' was declared here
      	   int page_len[RESYNC_PAGES];
             ^
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      8fc04e6e
  5. 26 3月, 2017 1 次提交
  6. 25 3月, 2017 8 次提交
  7. 23 3月, 2017 2 次提交
    • N
      md/raid1: stop using bi_phys_segment · 37011e3a
      NeilBrown 提交于
      Change to use bio->__bi_remaining to count number of r1bio attached
      to a bio.
      See precious raid10 patch for more details.
      
      Like the raid10.c patch, this fixes a bug as nr_queued and nr_pending
      used to measure different things, but were being compared.
      
      This patch fixes another bug in that nr_pending previously did not
      could write-behind requests, so behind writes could continue while
      resync was happening.  How that nr_pending counts all r1_bio,
      the resync cannot commence until the behind writes have completed.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      37011e3a
    • N
      md/raid1, raid10: move rXbio accounting closer to allocation. · 6b6c8110
      NeilBrown 提交于
      When raid1 or raid10 find they will need to allocate a new
      r1bio/r10bio, in order to work around a known bad block, they
      account for the allocation well before the allocation is
      made.  This separation makes the correctness less obvious
      and requires comments.
      
      The accounting needs to be a little before: before the first
      rXbio is submitted, but that is all.
      
      So move the accounting down to where it makes more sense.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      6b6c8110
  8. 17 3月, 2017 1 次提交
  9. 15 3月, 2017 1 次提交
  10. 10 3月, 2017 2 次提交
    • S
      md/raid1/10: fix potential deadlock · 61eb2b43
      Shaohua Li 提交于
      Neil Brown pointed out a potential deadlock in raid 10 code with
      bio_split/chain. The raid1 code could have the same issue, but recent
      barrier rework makes it less likely to happen. The deadlock happens in
      below sequence:
      
      1. generic_make_request(bio), this will set current->bio_list
      2. raid10_make_request will split bio to bio1 and bio2
      3. __make_request(bio1), wait_barrer, add underlayer disk bio to
      current->bio_list
      4. __make_request(bio2), wait_barrer
      
      If raise_barrier happens between 3 & 4, since wait_barrier runs at 3,
      raise_barrier waits for IO completion from 3. And since raise_barrier
      sets barrier, 4 waits for raise_barrier. But IO from 3 can't be
      dispatched because raid10_make_request() doesn't finished yet.
      
      The solution is to adjust the IO ordering. Quotes from Neil:
      "
      It is much safer to:
      
          if (need to split) {
              split = bio_split(bio, ...)
              bio_chain(...)
              make_request_fn(split);
              generic_make_request(bio);
         } else
              make_request_fn(mddev, bio);
      
      This way we first process the initial section of the bio (in 'split')
      which will queue some requests to the underlying devices.  These
      requests will be queued in generic_make_request.
      Then we queue the remainder of the bio, which will be added to the end
      of the generic_make_request queue.
      Then we return.
      generic_make_request() will pop the lower-level device requests off the
      queue and handle them first.  Then it will process the remainder
      of the original bio once the first section has been fully processed.
      "
      
      Note, this only happens in read path. In write path, the bio is flushed to
      underlaying disks either by blk flush (from schedule) or offladed to raid1/10d.
      It's queued in current->bio_list.
      
      Cc: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org (v3.14+, only the raid10 part)
      Suggested-by: NNeilBrown <neilb@suse.com>
      Reviewed-by: NJack Wang <jinpu.wang@profitbricks.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      61eb2b43
    • G
      md: move funcs from pers->resize to update_size · c9483634
      Guoqing Jiang 提交于
      raid1_resize and raid5_resize should also check the
      mddev->queue if run underneath dm-raid.
      
      And both set_capacity and revalidate_disk are used in
      pers->resize such as raid1, raid10 and raid5. So
      move them from personality file to common code.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      c9483634
  11. 02 3月, 2017 1 次提交
  12. 24 2月, 2017 2 次提交
  13. 20 2月, 2017 3 次提交
    • S
      md/raid1: fix a use-after-free bug · af5f42a7
      Shaohua Li 提交于
      Commit fd76863e (RAID1: a new I/O barrier implementation to remove resync
      window) introduces a user-after-free bug.
      Signed-off-by: NShaohua Li <shli@fb.com>
      af5f42a7
    • C
      RAID1: avoid unnecessary spin locks in I/O barrier code · 824e47da
      colyli@suse.de 提交于
      When I run a parallel reading performan testing on a md raid1 device with
      two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
      block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
      only 2.7GB/s, this is around 50% of the idea performance number.
      
      The perf reports locking contention happens at allow_barrier() and
      wait_barrier() code,
       - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
         - _raw_spin_lock_irqsave
               + 89.92% allow_barrier
               + 9.34% __wake_up
       - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
         - _raw_spin_lock_irq
               - 100.00% wait_barrier
      
      The reason is, in these I/O barrier related functions,
       - raise_barrier()
       - lower_barrier()
       - wait_barrier()
       - allow_barrier()
      They always hold conf->resync_lock firstly, even there are only regular
      reading I/Os and no resync I/O at all. This is a huge performance penalty.
      
      The solution is a lockless-like algorithm in I/O barrier code, and only
      holding conf->resync_lock when it has to.
      
      The original idea is from Hannes Reinecke, and Neil Brown provides
      comments to improve it. I continue to work on it, and make the patch into
      current form.
      
      In the new simpler raid1 I/O barrier implementation, there are two
      wait barrier functions,
       - wait_barrier()
         Which calls _wait_barrier(), is used for regular write I/O. If there is
         resync I/O happening on the same I/O barrier bucket, or the whole
         array is frozen, task will wait until no barrier on same barrier bucket,
         or the whold array is unfreezed.
       - wait_read_barrier()
         Since regular read I/O won't interfere with resync I/O (read_balance()
         will make sure only uptodate data will be read out), it is unnecessary
         to wait for barrier in regular read I/Os, waiting in only necessary
         when the whole array is frozen.
      
      The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
      barrier[idx] are very carefully designed in raise_barrier(),
      lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
      avoid unnecessary spin locks in these functions. Once conf->
      nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
      has to wait in raise_barrier(). Then in _wait_barrier() if no barrier
      raised in same barrier bucket index and array is not frozen, the regular
      I/O doesn't need to hold conf->resync_lock, it can just increase
      conf->nr_pending[idx], and return to its caller. wait_read_barrier() is
      very similar to _wait_barrier(), the only difference is it only waits when
      array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier
      code almostly gets rid of all spin lock cost.
      
      This patch significantly improves raid1 reading peroformance. From my
      testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
      blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
      increases from 2.7GB/s to 4.6GB/s (+70%).
      
      Changelog
      V4:
      - Change conf->nr_queued[] to atomic_t.
      - Define BARRIER_BUCKETS_NR_BITS by (PAGE_SHIFT - ilog2(sizeof(atomic_t)))
      V3:
      - Add smp_mb__after_atomic() as Shaohua and Neil suggested.
      - Change conf->nr_queued[] from atomic_t to int.
      - Change conf->array_frozen from atomic_t back to int, and use
        READ_ONCE(conf->array_frozen) to check value of conf->array_frozen
        in _wait_barrier() and wait_read_barrier().
      - In _wait_barrier() and wait_read_barrier(), add a call to
        wake_up(&conf->wait_barrier) after atomic_dec(&conf->nr_pending[idx]),
        to fix a deadlock between  _wait_barrier()/wait_read_barrier and
        freeze_array().
      V2:
      - Remove a spin_lock/unlock pair in raid1d().
      - Add more code comments to explain why there is no racy when checking two
        atomic_t variables at same time.
      V1:
      - Original RFC patch for comments.
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Guoqing Jiang <gqjiang@suse.com>
      Reviewed-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NShaohua Li <shli@fb.com>
      824e47da
    • C
      RAID1: a new I/O barrier implementation to remove resync window · fd76863e
      colyli@suse.de 提交于
      'Commit 79ef3a8a ("raid1: Rewrite the implementation of iobarrier.")'
      introduces a sliding resync window for raid1 I/O barrier, this idea limits
      I/O barriers to happen only inside a slidingresync window, for regular
      I/Os out of this resync window they don't need to wait for barrier any
      more. On large raid1 device, it helps a lot to improve parallel writing
      I/O throughput when there are background resync I/Os performing at
      same time.
      
      The idea of sliding resync widow is awesome, but code complexity is a
      challenge. Sliding resync window requires several variables to work
      collectively, this is complexed and very hard to make it work correctly.
      Just grep "Fixes: 79ef3a8a" in kernel git log, there are 8 more patches
      to fix the original resync window patch. This is not the end, any further
      related modification may easily introduce more regreassion.
      
      Therefore I decide to implement a much simpler raid1 I/O barrier, by
      removing resync window code, I believe life will be much easier.
      
      The brief idea of the simpler barrier is,
       - Do not maintain a global unique resync window
       - Use multiple hash buckets to reduce I/O barrier conflicts, regular
         I/O only has to wait for a resync I/O when both them have same barrier
         bucket index, vice versa.
       - I/O barrier can be reduced to an acceptable number if there are enough
         barrier buckets
      
      Here I explain how the barrier buckets are designed,
       - BARRIER_UNIT_SECTOR_SIZE
         The whole LBA address space of a raid1 device is divided into multiple
         barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
         Bio requests won't go across border of barrier unit size, that means
         maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
         For random I/O 64MB is large enough for both read and write requests,
         for sequential I/O considering underlying block layer may merge them
         into larger requests, 64MB is still good enough.
         Neil also points out that for resync operation, "we want the resync to
         move from region to region fairly quickly so that the slowness caused
         by having to synchronize with the resync is averaged out over a fairly
         small time frame". For full speed resync, 64MB should take less then 1
         second. When resync is competing with other I/O, it could take up a few
         minutes. Therefore 64MB size is fairly good range for resync.
      
       - BARRIER_BUCKETS_NR
         There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
              #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
              #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
         this patch makes the bellowed members of struct r1conf from integer
         to array of integers,
              -       int                     nr_pending;
              -       int                     nr_waiting;
              -       int                     nr_queued;
              -       int                     barrier;
              +       int                     *nr_pending;
              +       int                     *nr_waiting;
              +       int                     *nr_queued;
              +       int                     *barrier;
         number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
         kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
         barrier buckets, and each array of integers occupies single memory page.
         1024 means for a request which is smaller than the I/O barrier unit size
         has ~0.1% chance to wait for resync to pause, which is quite a small
         enough fraction. Also requesting single memory page is more friendly to
         kernel page allocator than larger memory size.
      
       - I/O barrier bucket is indexed by bio start sector
         If multiple I/O requests hit different I/O barrier units, they only need
         to compete I/O barrier with other I/Os which hit the same I/O barrier
         bucket index with each other. The index of a barrier bucket which a
         bio should look for is calculated by sector_to_idx() which is defined
         in raid1.h as an inline function,
              static inline int sector_to_idx(sector_t sector)
              {
                      return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
                                      BARRIER_BUCKETS_NR_BITS);
              }
         Here sector_nr is the start sector number of a bio.
      
       - Single bio won't go across boundary of a I/O barrier unit
         If a request goes across boundary of barrier unit, it will be split. A
         bio may be split in raid1_make_request() or raid1_sync_request(), if
         sectors returned by align_to_barrier_unit_end() is smaller than
         original bio size.
      
      Comparing to single sliding resync window,
       - Currently resync I/O grows linearly, therefore regular and resync I/O
         will conflict within a single barrier units. So the I/O behavior is
         similar to single sliding resync window.
       - But a barrier unit bucket is shared by all barrier units with identical
         barrier uinit index, the probability of conflict might be higher
         than single sliding resync window, in condition that writing I/Os
         always hit barrier units which have identical barrier bucket indexs with
         the resync I/Os. This is a very rare condition in real I/O work loads,
         I cannot imagine how it could happen in practice.
       - Therefore we can achieve a good enough low conflict rate with much
         simpler barrier algorithm and implementation.
      
      There are two changes should be noticed,
       - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
         single loop, it looks like this,
              spin_lock_irqsave(&conf->device_lock, flags);
              conf->nr_queued[idx]--;
              spin_unlock_irqrestore(&conf->device_lock, flags);
         This change generates more spin lock operations, but in next patch of
         this patch set, it will be replaced by a single line code,
              atomic_dec(&conf->nr_queueud[idx]);
         So we don't need to worry about spin lock cost here.
       - Mainline raid1 code split original raid1_make_request() into
         raid1_read_request() and raid1_write_request(). If the original bio
         goes across an I/O barrier unit size, this bio will be split before
         calling raid1_read_request() or raid1_write_request(),  this change
         the code logic more simple and clear.
       - In this patch wait_barrier() is moved from raid1_make_request() to
         raid1_write_request(). In raid_read_request(), original wait_barrier()
         is replaced by raid1_read_request().
         The differnece is wait_read_barrier() only waits if array is frozen,
         using different barrier function in different code path makes the code
         more clean and easy to read.
      Changelog
      V4:
      - Add alloc_r1bio() to remove redundant r1bio memory allocation code.
      - Fix many typos in patch comments.
      - Use (PAGE_SHIFT - ilog2(sizeof(int))) to define BARRIER_BUCKETS_NR_BITS.
      V3:
      - Rebase the patch against latest upstream kernel code.
      - Many fixes by review comments from Neil,
        - Back to use pointers to replace arraries in struct r1conf
        - Remove total_barriers from struct r1conf
        - Add more patch comments to explain how/why the values of
          BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
        - Use get_unqueued_pending() to replace get_all_pendings() and
          get_all_queued()
        - Increase bucket number from 512 to 1024
      - Change code comments format by review from Shaohua.
      V2:
      - Use bio_split() to split the orignal bio if it goes across barrier unit
        bounday, to make the code more simple, by suggestion from Shaohua and
        Neil.
      - Use hash_long() to replace original linear hash, to avoid a possible
        confilict between resync I/O and sequential write I/O, by suggestion from
        Shaohua.
      - Add conf->total_barriers to record barrier depth, which is used to
        control number of parallel sync I/O barriers, by suggestion from Shaohua.
      - In V1 patch the bellowed barrier buckets related members in r1conf are
        allocated in memory page. To make the code more simple, V2 patch moves
        the memory space into struct r1conf, like this,
              -       int                     nr_pending;
              -       int                     nr_waiting;
              -       int                     nr_queued;
              -       int                     barrier;
              +       int                     nr_pending[BARRIER_BUCKETS_NR];
              +       int                     nr_waiting[BARRIER_BUCKETS_NR];
              +       int                     nr_queued[BARRIER_BUCKETS_NR];
              +       int                     barrier[BARRIER_BUCKETS_NR];
        This change is by the suggestion from Shaohua.
      - Remove some inrelavent code comments, by suggestion from Guoqing.
      - Add a missing wait_barrier() before jumping to retry_write, in
        raid1_make_write_request().
      V1:
      - Original RFC patch for comments
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Guoqing Jiang <gqjiang@suse.com>
      Reviewed-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NShaohua Li <shli@fb.com>
      fd76863e
  14. 16 2月, 2017 2 次提交
  15. 02 2月, 2017 1 次提交
  16. 28 1月, 2017 1 次提交
  17. 06 1月, 2017 1 次提交
  18. 04 1月, 2017 1 次提交
  19. 09 12月, 2016 2 次提交
  20. 23 11月, 2016 3 次提交
    • N
      md/raid1: add failfast handling for writes. · 212e7eb7
      NeilBrown 提交于
      When writing to a fastfail device we use MD_FASTFAIL unless
      it is the only device being written to.
      
      For resync/recovery, assume there was a working device to
      read from so always use REQ_FASTFAIL_DEV.
      
      If a write for resync/recovery fails, we just fail the
      device - there is not much else to do.
      
      If a normal failfast write fails, but the device cannot be
      failed (must be only one left), we queue for write error
      handling.  This will call narrow_write_error() to retry the
      write synchronously and without any FAILFAST flags.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      212e7eb7
    • N
      md/raid1: add failfast handling for reads. · 2e52d449
      NeilBrown 提交于
      If a device is marked FailFast and it is not the only device
      we can read from, we mark the bio with REQ_FAILFAST_* flags.
      
      If this does fail, we don't try read repair but just allow
      failure.  If it was the last device it doesn't fail of
      course, so the retry happens on the same device - this time
      without FAILFAST.  A subsequent failure will not retry but
      will just pass up the error.
      
      During resync we may use FAILFAST requests and on a failure
      we will simply use the other device(s).
      
      During recovery we will only use FAILFAST in the unusual
      case were there are multiple places to read from - i.e. if
      there are > 2 devices.  If we get a failure we will fail the
      device and complete the resync/recovery with remaining
      devices.
      
      The new R1BIO_FailFast flag is set on read reqest to suggest
      the a FAILFAST request might be acceptable.  The rdev needs
      to have FailFast set as well for the read to actually use
      REQ_FAILFAST_*.
      
      We need to know there are at least two working devices
      before we can set R1BIO_FailFast, so we mustn't stop looking
      at the first device we find.  So the "min_pending == 0"
      handling to not exit early, but too always choose the
      best_pending_disk if min_pending == 0.
      
      The spinlocked region in raid1_error() in enlarged to ensure
      that if two bios, reading from two different devices, fail
      at the same time, then there is no risk that both devices
      will be marked faulty, leaving zero "In_sync" devices.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2e52d449
    • N
      md: Use REQ_FAILFAST_* on metadata writes where appropriate · 46533ff7
      NeilBrown 提交于
      This can only be supported on personalities which ensure
      that md_error() never causes an array to enter the 'failed'
      state.  i.e. if marking a device Faulty would cause some
      data to be inaccessible, the device is status is left as
      non-Faulty.  This is true for RAID1 and RAID10.
      
      If we get a failure writing metadata but the device doesn't
      fail, it must be the last device so we re-write without
      FAILFAST to improve chance of success.  We also flag the
      device as LastDev so that future metadata updates don't
      waste time on failfast writes.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      46533ff7
  21. 19 11月, 2016 1 次提交