1. 12 May 2017, 1 commit
  2. 02 May 2017, 1 commit
    • md/raid10: skip spare disk as 'first' disk · b506335e
      Committed by Shaohua Li
      Commit 6f287ca6 (md/raid10: reset the 'first' at the end of loop) misses a
      case during reshape: the first rdev could be a spare disk, which shouldn't
      be counted as the 'first' disk since it doesn't carry the offset info.
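
      A rough sketch of the kind of check this implies (illustrative only; the
      loop shape shown here is an assumption, not the actual patch):

          rdev_for_each(rdev, mddev) {
                  if (rdev->raid_disk < 0)
                          continue;   /* spare: no offset info, cannot be 'first' */
                  if (first)
                          first = 0;  /* this in-array rdev is the real 'first' disk */
                  /* ... reshape offset checks go here ... */
          }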
      
      Fixes: 6f287ca6 ("md/raid10: reset the 'first' at the end of loop")
      Cc: Guoqing Jiang <gqjiang@suse.com>
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      b506335e
  3. 26 Apr 2017, 1 commit
  4. 24 Apr 2017, 1 commit
  5. 21 Apr 2017, 1 commit
  6. 12 Apr 2017, 2 commits
    • md/raid10: simplify handle_read_error() · 545250f2
      Committed by NeilBrown
      handle_read_error() duplicates a lot of the work that raid10_read_request()
      does, so it makes sense to just use that function.
      
      handle_read_error() relies on the same r10bio being re-used so that,
      in the case of a read-only array, setting IO_BLOCKED in r10bio->devs[].bio
      ensures read_balance() won't re-use that device.
      So when called from raid10_make_request() we clear that array, but not
      when called from handle_read_error().
      
      The warning messages that handle_read_error() prints need to be preserved,
      so they are conditionally added to raid10_read_request(): if the failing
      rdev can be found, the messages are printed; otherwise they aren't.
      
      Note that as rdev_dec_pending() has already been called on the failing
      rdev, we need to use rcu_read_lock() to get a new reference from
      the conf.  We only use this to get the name of the failing block device.
      
      With this change, we no longer need inc_pending().
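
      A minimal sketch of the rcu pattern described above (illustrative; the
      exact message and surrounding context are assumptions):

          char b[BDEVNAME_SIZE];

          rcu_read_lock();
          rdev = rcu_dereference(conf->mirrors[slot].rdev);
          if (rdev)
                  pr_err_ratelimited("md/raid10:%s: %s: redirecting sector %llu\n",
                                     mdname(mddev), bdevname(rdev->bdev, b),
                                     (unsigned long long)r10_bio->sector);
          rcu_read_unlock();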
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      545250f2
    • md/raid10: simplify the splitting of requests. · fc9977dd
      Committed by NeilBrown
      raid10 splits requests in two different ways for two different
      reasons.
      
      First, bio_split() is used to ensure the bio fits within a chunk.
      Second, multiple r10bio structures are allocated to represent the
      different sections that need to go to different devices, to avoid
      known bad blocks.
      
      This can be simplified to just use bio_split() once, and not to use
      multiple r10bios.
      We delay the split until we know a maximum bio size that can
      be handled with a single r10bio, and then split the bio and queue
      the remainder for later handling.
      
      As with raid1, we allocate a new bio_set to help with the splitting.
      It is not correct to use fs_bio_set in a device driver.
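
      The resulting pattern looks roughly like this (a sketch; the bio_set name
      conf->bio_split is an assumption):

          if (sectors < bio_sectors(bio)) {
                  /* Split off the part a single r10bio can handle and push the
                   * remainder back through generic_make_request() for later. */
                  struct bio *split = bio_split(bio, sectors, GFP_NOIO,
                                                conf->bio_split);
                  bio_chain(split, bio);
                  generic_make_request(bio);
                  bio = split;
          }
          /* 'bio' now fits in one r10bio */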
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      fc9977dd
  7. 11 Apr 2017, 1 commit
  8. 09 Apr 2017, 1 commit
  9. 25 Mar 2017, 6 commits
  10. 24 Mar 2017, 1 commit
  11. 23 Mar 2017, 2 commits
    • md/raid10: stop using bi_phys_segments · fd16f2e8
      Committed by NeilBrown
      raid10 currently repurposes bi_phys_segments on each
      incoming bio to count how many r10bios were used to encode the
      request.
      
      We need to know when the number of attached r10bios reaches
      zero in order to:
      1/ call bio_endio() when all IO on the bio is finished
      2/ decrement ->nr_pending so that resync IO can proceed.
      
      Now that the bio has its own __bi_remaining counter, that
      can be used instead. We can call bio_inc_remaining to
      increment the counter and call bio_endio() every time an
      r10bio completes, rather than only when bi_phys_segments
      reaches zero.
      
      This addresses point 1, but not point 2.  bio_endio()
      doesn't (and cannot) report when the last r10bio has
      finished, so a different approach is needed.
      
      So: instead of counting bios in ->nr_pending, count r10bios.
      i.e. every time we attach a bio, increment nr_pending.
      Every time an r10bio completes, decrement nr_pending.
      
      Normally we only increment nr_pending after first checking
      that ->barrier is zero, or some other non-trivial tests and
      possible waiting.  When attaching multiple r10bios to a bio,
      we only need the tests and the waiting once.  After the
      first increment, subsequent increments can happen
      unconditionally as they are really all part of the one
      request.
      
      So introduce inc_pending() which can be used when we know
      that nr_pending is already elevated.
      
      Note that this fixes a bug.  freeze_array() contains the line
      	atomic_read(&conf->nr_pending) == conf->nr_queued+extra,
      which implies that the units for ->nr_pending, ->nr_queued and extra
      are the same.
      ->nr_queued and extra count r10_bios, but prior to this patch,
      ->nr_pending counted bios.  If a bio ever resulted in multiple
      r10_bios (due to bad blocks), freeze_array() would not work correctly.
      Now it does.
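
      A sketch of the two pieces described above (close to, but not necessarily
      identical to, the patch):

          /* Only called when nr_pending is already known to be elevated,
           * i.e. for the second and later r10bios attached to one bio. */
          static void inc_pending(struct r10conf *conf)
          {
                  WARN_ON(!atomic_read(&conf->nr_pending));
                  atomic_inc(&conf->nr_pending);
          }

          /* when attaching another r10bio to a bio: */
          bio_inc_remaining(bio);   /* one more completion before bio_endio() fires */
          inc_pending(conf);        /* count r10bios, not bios, in nr_pending */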
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      fd16f2e8
    • md/raid1, raid10: move rXbio accounting closer to allocation. · 6b6c8110
      Committed by NeilBrown
      When raid1 or raid10 find they will need to allocate a new
      r1bio/r10bio, in order to work around a known bad block, they
      account for the allocation well before the allocation is
      made.  This separation makes the correctness less obvious
      and requires comments.
      
      The accounting does need to happen slightly ahead of time: before the
      first rXbio is submitted, but no earlier than that is required.
      
      So move the accounting down to where it makes more sense.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      6b6c8110
  12. 12 Mar 2017, 1 commit
    • blk: Ensure users for current->bio_list can see the full list. · f5fe1b51
      Committed by NeilBrown
      Commit 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      changed current->bio_list so that it did not contain *all* of the
      queued bios, but only those submitted by the currently running
      make_request_fn.
      
      There are two places which walk the list and requeue selected bios,
      and others that check if the list is empty.  These are no longer
      correct.
      
      So redefine current->bio_list to point to an array of two lists, which
      contain all queued bios, and adjust various code to test or walk both
      lists.
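
      Schematically, generic_make_request() now keeps something like this on its
      stack (a sketch of the idea, not the exact patch):

          struct bio_list bio_list_on_stack[2];

          /* [0]: bios queued by the currently running make_request_fn
           * [1]: bios queued by make_request_fns further up the call chain */
          bio_list_init(&bio_list_on_stack[0]);
          bio_list_init(&bio_list_on_stack[1]);
          current->bio_list = bio_list_on_stack;

          /* any code that walks or tests current->bio_list must now look at
           * both current->bio_list[0] and current->bio_list[1] */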
      Signed-off-by: NeilBrown <neilb@suse.com>
      Fixes: 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f5fe1b51
  13. 10 Mar 2017, 3 commits
    • md/raid1/10: fix potential deadlock · 61eb2b43
      Committed by Shaohua Li
      Neil Brown pointed out a potential deadlock in the raid10 code with
      bio_split/chain. The raid1 code could have the same issue, but the recent
      barrier rework makes it less likely to happen. The deadlock happens in the
      following sequence:
      
      1. generic_make_request(bio); this sets current->bio_list
      2. raid10_make_request splits the bio into bio1 and bio2
      3. __make_request(bio1): wait_barrier, then add the underlying disk bios to
      current->bio_list
      4. __make_request(bio2): wait_barrier

      If raise_barrier happens between 3 & 4, then since wait_barrier already ran
      at 3, raise_barrier waits for the IO from 3 to complete. And since
      raise_barrier sets the barrier, 4 waits for raise_barrier. But the IO from 3
      can't be dispatched because raid10_make_request() hasn't finished yet.
      
      The solution is to adjust the IO ordering. Quoting Neil:
      "
      It is much safer to:
      
           if (need to split) {
               split = bio_split(bio, ...)
               bio_chain(...)
               make_request_fn(split);
               generic_make_request(bio);
           } else
               make_request_fn(mddev, bio);
      
      This way we first process the initial section of the bio (in 'split')
      which will queue some requests to the underlying devices.  These
      requests will be queued in generic_make_request.
      Then we queue the remainder of the bio, which will be added to the end
      of the generic_make_request queue.
      Then we return.
      generic_make_request() will pop the lower-level device requests off the
      queue and handle them first.  Then it will process the remainder
      of the original bio once the first section has been fully processed.
      "
      
      Note, this only happens in the read path. In the write path, the bio is
      flushed to the underlying disks either by blk flush (from schedule()) or
      offloaded to raid1/10d; it is queued in current->bio_list.
      
      Cc: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org (v3.14+, only the raid10 part)
      Suggested-by: NeilBrown <neilb@suse.com>
      Reviewed-by: Jack Wang <jinpu.wang@profitbricks.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      61eb2b43
    • md: move funcs from pers->resize to update_size · c9483634
      Committed by Guoqing Jiang
      raid1_resize and raid5_resize should also check mddev->queue if
      running underneath dm-raid.

      Both set_capacity() and revalidate_disk() are used in pers->resize
      for raid1, raid10 and raid5, so move them from the personality files
      into the common code.
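
      In the common code this amounts to something like (sketch):

          /* in update_size(), after pers->resize() has succeeded */
          if (mddev->queue) {
                  /* mddev->queue is absent when md runs underneath dm-raid */
                  set_capacity(mddev->gendisk, mddev->array_sectors);
                  revalidate_disk(mddev->gendisk);
          }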
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      c9483634
    • md/raid10: submit bio directly to replacement disk · 6d399783
      Committed by Shaohua Li
      Commit 57c67df4 ("md/raid10: submit IO from originating thread instead of
      md thread") submits bios directly for normal disks but not for replacement
      disks. There is no reason not to do the same for replacement disks.
      
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      6d399783
  14. 16 Feb 2017, 1 commit
    • md: fast clone bio in bio_clone_mddev() · d7a10308
      Committed by Ming Lei
      Firstly, bio_clone_mddev() is used in the raid normal I/O path and isn't
      used in the resync I/O path.
      
      Secondly, all direct access to the bvec table in raid happens in resync
      I/O, except for write-behind in raid1, where we still use bio_clone() to
      allocate a new bvec table.
      
      So this patch replaces bio_clone() with bio_clone_fast()
      in bio_clone_mddev().
      
      Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
      suggested by Christoph Hellwig.
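
      The call sites end up looking roughly like this (sketch, assuming the
      personality's existing bio_set is reused):

          /* was: bio = bio_clone_mddev(orig_bio, GFP_NOIO, mddev); */
          bio = bio_clone_fast(orig_bio, GFP_NOIO, mddev->bio_set);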
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <tom.leiming@gmail.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      d7a10308
  15. 02 Feb 2017, 1 commit
  16. 04 Jan 2017, 1 commit
  17. 09 Dec 2016, 1 commit
    • md: separate flags for superblock changes · 2953079c
      Committed by Shaohua Li
      The mddev->flags field is used for different purposes. There are a lot of
      places where we check/change the flags without masking unrelated flags, so
      we could end up checking/changing unrelated flags. Most of these usages are
      for superblock writes, so separate out the superblock-related flags. This
      should make the code clearer and also fixes real bugs.
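
      After the split, superblock state is tracked in its own bitmap, along the
      lines of (sketch; treat the flag names as illustrative):

          /* superblock-change state lives in mddev->sb_flags, not mddev->flags */
          set_bit(MD_SB_CHANGE_DEVS, &mddev->sb_flags);      /* device list changed */
          set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);   /* write-out still pending */
          md_wakeup_thread(mddev->thread);                   /* let md write the superblock */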
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      2953079c
  18. 23 Nov 2016, 3 commits
    • md/raid10: add failfast handling for writes. · 1919cbb2
      Committed by NeilBrown
      When writing to a failfast device, we use MD_FAILFAST unless
      it is the only device being written to.  For
      resync/recovery writes, assume there was a working device to read
      from, so always use MD_FAILFAST.
      
      If a write for resync/recovery fails, we just fail the
      device - there is not much else to do.
      
      If a normal write fails, but the device cannot be marked
      Faulty (it must be the only one left), we queue it for write-error
      handling, which calls narrow_write_error() to write the block
      synchronously without any failfast flags.
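
      The write-side decision is roughly (sketch; 'only_writable_device' is a
      placeholder for the real test, and mbio/wbio stand for the normal-write
      and resync-write bios):

          /* normal write: only fail fast if this isn't the sole copy */
          if (test_bit(FailFast, &rdev->flags) && !only_writable_device)
                  mbio->bi_opf |= MD_FAILFAST;

          /* resync/recovery write: there was a readable source, so fail fast freely */
          if (test_bit(FailFast, &rdev->flags))
                  wbio->bi_opf |= MD_FAILFAST;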
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      1919cbb2
    • md/raid10: add failfast handling for reads. · 8d3ca83d
      Committed by NeilBrown
      If a device is marked FailFast, and it is not the only
      device we can read from, we mark the bio as MD_FAILFAST.
      
      If this does fail-fast, we don't try read repair but just
      allow failure.
      
      If it was the last device, it doesn't get marked Faulty so
      the retry happens on the same device - this time without
      FAILFAST.  A subsequent failure will not retry but will just
      pass up the error.
      
      During resync we may use FAILFAST requests, and on a failure
      we will simply use the other device(s).
      
      During recovery we will only use FAILFAST in the unusual
      case where there are multiple places to read from, i.e. if
      there are more than 2 devices.  If we get a failure we will fail the
      device and complete the resync/recovery with the remaining
      devices.
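
      The read side has a similar shape (sketch; 'more_than_one_readable_copy'
      is a placeholder for the real test):

          if (test_bit(FailFast, &rdev->flags) && more_than_one_readable_copy)
                  read_bio->bi_opf |= MD_FAILFAST;

          /* on a fast failure: no read repair; retry on another copy, or on the
           * same device without MD_FAILFAST if it was the last one */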
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      8d3ca83d
    • md: Use REQ_FAILFAST_* on metadata writes where appropriate · 46533ff7
      Committed by NeilBrown
      This can only be supported on personalities which ensure
      that md_error() never causes an array to enter the 'failed'
      state, i.e. if marking a device Faulty would make some
      data inaccessible, the device's status is left as
      non-Faulty.  This is true for RAID1 and RAID10.
      
      If we get a failure writing metadata but the device doesn't
      fail, it must be the last device, so we re-write without
      FAILFAST to improve the chance of success.  We also flag the
      device as LastDev so that future metadata updates don't
      waste time on failfast writes.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      46533ff7
  19. 19 Nov 2016, 2 commits
    • md/raid1, raid10: add blktrace records when IO is delayed · 578b54ad
      Committed by NeilBrown
      Both raid1 and raid10 will sometimes delay handling an IO request,
      such as when a resync is happening or there are too many requests queued.

      Add some blktrace messages so we can see when that is happening when
      looking for performance artefacts.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      578b54ad
    • md: add block tracing for bio_remapping · 109e3765
      Committed by NeilBrown
      The block tracing infrastructure (accessed with blktrace/blkparse)
      supports the tracing of mapping bios from one device to another.
      This is currently used when a bio in a partition is mapped to the
      whole device, when bios are mapped by dm, and for mapping in md/raid5.
      Other md personalities do not include this tracing yet, so add it.
      
      When a read error is detected we redirect the request to a different device.
      This could justifiably be seen as a new mapping for the original bio,
      or as a secondary mapping for the bio that errored.  This patch uses
      the second option.
      
      When md is used under dm-raid, the mappings are not traced as we do
      not have access to the block device number of the parent.
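
      The remap tracepoint call looks roughly like this (sketch, mirroring what
      raid5 already does):

          if (mddev->gendisk)
                  trace_block_bio_remap(bdev_get_queue(read_bio->bi_bdev),
                                        read_bio, disk_devt(mddev->gendisk),
                                        r10_bio->sector);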
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      109e3765
  20. 08 Nov 2016, 2 commits
  21. 25 Oct 2016, 1 commit
    • RAID10: ignore discard error · 579ed34f
      Committed by Shaohua Li
      This is the RAID10 counterpart of the corresponding RAID1 fix. If a write
      error occurs, raid10 will try to rewrite the bio in small chunks. If the
      rewrite fails, raid10 will record the error as a bad block.
      narrow_write_error() always uses WRITE for the bio, but it could actually
      be a discard. Since a discard bio has no payload, writing the bio would
      cause different issues. But a discard error isn't fatal, so we can safely
      ignore it, which is what this patch does.

      This issue has existed since discard support was added, but it is only
      exposed by the recent arbitrary bio size feature.
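
      In the write-completion path this comes down to something like (sketch;
      bi_error/bio_op usage as of the v4.8-era bio layer):

          /* a failed discard is not treated as a write error */
          int discard_error = bio->bi_error &&
                              bio_op(bio) == REQ_OP_DISCARD;

          if (bio->bi_error && !discard_error) {
                  /* mark the device / record a bad block as before */
                  set_bit(WriteErrorSeen, &rdev->flags);
          }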
      
      Cc: Sitsofe Wheeler <sitsofe@gmail.com>
      Cc: stable@vger.kernel.org (v3.6)
      Signed-off-by: Shaohua Li <shli@fb.com>
      579ed34f
  22. 25 Aug 2016, 1 commit
  23. 08 Aug 2016, 1 commit
    • block: rename bio bi_rw to bi_opf · 1eff9d32
      Committed by Jens Axboe
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokenness linger,
      rename the member to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1eff9d32
  24. 31 Jul 2016, 1 commit
  25. 20 Jul 2016, 2 commits
    • raid10: improve random reads performance · 0e5313e2
      Committed by Tomasz Majchrzak
      RAID10 random read performance is lower than expected due to excessive
      spinlock utilisation, which is required mostly for rebuild/resync. Simplify
      allow_barrier, as it is in the IO path and encounters a lot of unnecessary
      congestion.

      As lower_barrier just takes a lock in order to decrement a counter, convert
      the counter (nr_pending) into an atomic variable and remove the spinlock.
      There is also congestion around wake_up (it takes a lock internally), so
      call it only when it's really needed. As wake_up is no longer called
      constantly, ensure a process waiting to raise a barrier is notified when
      there are no more waiting IOs.
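
      allow_barrier then reduces to roughly this (sketch; the wake-up condition
      in the actual patch is more careful):

          static void allow_barrier(struct r10conf *conf)
          {
                  /* nr_pending is an atomic_t now, so no spinlock is needed;
                   * only wake waiters when the last pending IO drains */
                  if (atomic_dec_and_test(&conf->nr_pending))
                          wake_up(&conf->wait_barrier);
          }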
      Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      0e5313e2
    • md: use seconds granularity for error logging · 0e3ef49e
      Committed by Arnd Bergmann
      The md code stores the exact time of the last error in the
      last_read_error variable using a timespec structure. It only
      ever uses the seconds portion of that though, so we can
      use a scalar for it.
      
      There won't be an overflow in 2038 here, because it already
      used monotonic time and 32-bit is enough for that, but I've
      decided to use time64_t for consistency in the conversion.
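
      The conversion is essentially (sketch; the helper used to read the clock
      is an assumption):

          /* struct md_rdev: was 'struct timespec last_read_error' */
          time64_t last_read_error;

          /* recording an error */
          rdev->last_read_error = ktime_get_seconds();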
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      0e3ef49e
  26. 14 Jun 2016, 1 commit
    • md: reduce the number of synchronize_rcu() calls when multiple devices fail. · d787be40
      Committed by NeilBrown
      Every time a device is removed with ->hot_remove_disk(), a synchronize_rcu()
      call is made, which can take several milliseconds in some cases. If lots of
      devices fail at once, as could happen with a large RAID10 where one set of
      devices is removed all at once, these delays can add up and become very
      inconvenient.

      As failure is not reversible, we can check for that first, setting a
      separate flag if it is found, and then call synchronize_rcu() once for all
      the flagged devices. The ->hot_remove_disk() function can then skip the
      synchronize_rcu() step if the flag is set.

      (Build fix by Shaohua.)
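
      The flow is roughly (sketch; the flag name is illustrative):

          bool remove_some = false;

          rdev_for_each(rdev, mddev)
                  if (rdev->raid_disk >= 0 &&
                      test_bit(Faulty, &rdev->flags) &&
                      atomic_read(&rdev->nr_pending) == 0) {
                          set_bit(RemoveSynchronized, &rdev->flags);
                          remove_some = true;
                  }
          if (remove_some)
                  synchronize_rcu();   /* one grace period covers all flagged rdevs */

          /* ->hot_remove_disk() can then skip its own synchronize_rcu()
           * when the flag is set */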
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      d787be40