1. 20 Feb 2018, 1 commit
    • md raid10: fix NULL dereference in handle_write_completed() · 01a69cab
      Yufen Yu authored
      In the case of 'recover', an r10bio with R10BIO_WriteError &
      R10BIO_IsRecover will be processed by handle_write_completed().
      This function traverses all r10bio->devs[copies].
      If devs[m].repl_bio != NULL, it assumes conf->mirrors[dev].replacement
      is also not NULL. However, this is not always true.
      
      When an rdev of the raid10 array has a replacement, every
      r10bio->devs[m].repl_bio in conf->r10buf_pool is non-NULL. However,
      in 'recover', even if the corresponding replacement is NULL,
      r10bio->devs[m].repl_bio is not cleared, which leads to a NULL
      dereference of the replacement.
      
      This bug was introduced when replacement support for raid10 was
      added in Linux 3.3.
      
      As NeilBrown suggested:
      	Elsewhere the determination of "is this device part of the
      	resync/recovery" is made by resting bio->bi_end_io.
      	If this is end_sync_write, then we tried to write here.
      	If it is NULL, then we didn't try to write.
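
      A minimal sketch of the check this suggests (hypothetical code, not the
      literal patch; identifiers follow the raid10.c names used above):

          /* Only trust devs[m].repl_bio if a resync/recovery write was
           * actually set up for the replacement, i.e. its completion
           * handler was installed. */
          struct md_rdev *repl = conf->mirrors[dev].replacement;
          struct bio *rbio = r10_bio->devs[m].repl_bio;

          if (rbio && rbio->bi_end_io == end_sync_write && repl) {
                  /* we really tried to write to the replacement */
          } else {
                  /* no write was attempted, or there is no replacement:
                   * do not dereference conf->mirrors[dev].replacement */
          }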
      
      Fixes: 9ad1aefc ("md/raid10:  Handle replacement devices during resync.")
      Cc: stable (V3.3+)
      Suggested-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Shaohua Li <sh.li@alibaba-inc.com>
  2. 18 Feb 2018, 1 commit
  3. 12 Dec 2017, 1 commit
    • md/raid1,raid10: silence warning about wait-within-wait · 474beb57
      NeilBrown authored
      If you prepare_to_wait() after a previous prepare_to_wait(),
      but before calling schedule(), you get a warning:
      
        do not call blocking ops when !TASK_RUNNING; state=2
      
      This is appropriate as it is often a bug.  The event that the
      first prepare_to_wait() expects might wake up the schedule following
      the second prepare_to_wait(), which could be confusing.
      
      However, if both prepare_to_wait()s are part of simple wait_event()
      loops, and if the inner one is rarely called, then there is
      no problem.  The inner loop is too simple to get confused by
      a stray wakeup, and the outer loop won't spin unduly because the
      inner one doesn't affect it often.
      
      This pattern occurs in both raid1.c and raid10.c in the use of
      flush_pending_writes().
      
      The warning can be silenced by setting current->state to TASK_RUNNING.
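
      A hedged sketch of where this lands, at the top of the rarely-taken
      blocking path inside flush_pending_writes() (illustrative, not the
      literal patch):

          /* We may be called from inside an outer wait_event() loop, so
           * current->state can be TASK_UNINTERRUPTIBLE here.  This path is
           * rare and its own wait loop is trivial, so mark the task
           * runnable again to silence the !TASK_RUNNING warning. */
          __set_current_state(TASK_RUNNING);

          /* ...then do the blocking work: plug, submit the queued bios... */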
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  4. 02 Dec 2017, 1 commit
  5. 02 Nov 2017, 4 commits
    • md-cluster: Use a small window for raid10 resync · 8db87912
      Guoqing Jiang authored
      Suspending the entire device for resync could take
      too long. Resync in small chunks.
      
      The cluster's resync window is maintained in r10conf as
      cluster_sync_low and cluster_sync_high, and is processed
      in raid10's sync_request(). If the current resync is
      outside the cluster resync window:
      
      1. Set the cluster_sync_low to curr_resync_completed.
      2. Set cluster_sync_high to cluster_sync_low + stripe
         size.
      3. Send a message to all nodes so they may add it to
         their suspension lists.
      
      Note:
      We only support "near" raid10 so far; resyncing a 'far' or
      'offset' raid10 array could run into trouble. So raid10_run
      checks the layout of a clustered raid10 array and refuses
      to run if the layout is not correct.
      
      With the "near" layout we process one stripe at a time
      progressing monotonically through the address space.
      So we can have a sliding window of whole-stripes which
      moves through the array suspending IO on other nodes,
      and both resync which uses array addresses and recovery
      which uses device addresses can stay within this window.
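
      A sketch of the window update described above (r10conf field names as
      given; the window computation and message call are illustrative
      assumptions):

          if (mddev_is_clustered(mddev) &&
              sector_nr + max_sync > conf->cluster_sync_high) {
                  /* assume one stripe worth of sectors for the window */
                  sector_t window = (conf->geo.chunk_mask + 1) *
                                    conf->geo.raid_disks;

                  /* steps 1 and 2: slide the window forward */
                  conf->cluster_sync_low  = mddev->curr_resync_completed;
                  conf->cluster_sync_high = conf->cluster_sync_low + window;
                  /* step 3: tell the other nodes to suspend IO in the range */
                  md_cluster_ops->resync_info_update(mddev,
                                                     conf->cluster_sync_low,
                                                     conf->cluster_sync_high);
          }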
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster: Suspend writes in RAID10 if within range · cb8a7a7e
      Guoqing Jiang authored
      If there is a resync going on, all nodes must suspend
      writes to the range. This is recorded in suspend_info
      and suspend_list.
      
      If there is an I/O within the ranges of any of the
      suspend_info, area_resyncing will return 1.
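
      The check might look roughly like this (a sketch built from the
      structures named above, not the exact md-cluster code):

          static int area_resyncing(struct mddev *mddev, int direction,
                                    sector_t lo, sector_t hi)
          {
                  struct suspend_info *s;
                  int ret = 0;

                  spin_lock_irq(&suspend_lock);
                  list_for_each_entry(s, &suspend_list, list)
                          if (hi > s->lo && lo < s->hi) {
                                  ret = 1;   /* overlaps a resyncing range */
                                  break;
                          }
                  spin_unlock_irq(&suspend_lock);
                  return ret;
          }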
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md-cluster/raid10: set "do_balance = 0" if area is resyncing · d4098c72
      Guoqing Jiang authored
      Just like clustered raid1, clustered raid10 cannot choose the
      best device for read balancing while an area of the array is
      resyncing, because the data cannot be trusted to be the same
      on all devices at that time. So just use the first device,
      i.e. set do_balance to 0.
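
      In read_balance() this amounts to a guard along these lines (a sketch;
      the call through md_cluster_ops mirrors what clustered raid1 does):

          if (mddev_is_clustered(mddev) &&
              md_cluster_ops->area_resyncing(mddev, READ, this_sector,
                                             this_sector + sectors))
                  /* mirrors may differ while resyncing: no balancing,
                   * just take the first usable device */
                  do_balance = 0;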
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md: remove special meaning of ->quiesce(.., 2) · b03e0ccb
      NeilBrown authored
      The '2' argument means "wake up anything that is waiting".
      This is an inelegant part of the design and was added
      to help support management of suspend_lo/suspend_hi setting.
      Now that suspend_lo/hi is managed in mddev_suspend/resume,
      that need is gone.
      There are still a couple of places where we call 'quiesce'
      with an argument of '2', but they can safely be changed to
      call ->quiesce(.., 1); ->quiesce(.., 0), which
      achieves the same result at the small cost of pausing IO
      briefly.
      
      This removes a small "optimization" from suspend_{hi,lo}_store,
      but it isn't clear that the optimization served a useful purpose.
      The code is now a lot clearer.
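
      For illustration, a former ->quiesce(.., 2) call site simply becomes the
      pause/resume pair:

          /* briefly stop IO and then let it resume; the resume also wakes
           * anything that the old ->quiesce(mddev, 2) was meant to wake */
          mddev->pers->quiesce(mddev, 1);
          mddev->pers->quiesce(mddev, 0);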
      Suggested-by: Shaohua Li <shli@kernel.org>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  6. 17 Oct 2017, 3 commits
  7. 26 Aug 2017, 1 commit
  8. 24 Aug 2017, 1 commit
    • block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig authored
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
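
      The shape of the change, sketched from the description above (a
      simplified view, not the full patch):

          struct bio {
                  /* ... */
                  struct gendisk  *bi_disk;    /* replaces bi_bdev */
                  u8               bi_partno;  /* partition index used for
                                                * remapping in
                                                * generic_make_request */
                  /* ... */
          };

          /* drivers associate a bio with a device via a helper such as: */
          bio_set_dev(bio, rdev->bdev);        /* fills bi_disk and bi_partno */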
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 22 Jul 2017, 4 commits
  10. 19 Jun 2017, 1 commit
  11. 17 Jun 2017, 1 commit
    • md/raid10: fix FailFast test for wrong device · 1cdd1257
      Guoqing Jiang authored
      We need to test the FailFast flag of the replacement device here,
      since the setup for writing is for the replacement, so we
      need to fix it like:
      
      - if (test_bit(FailFast, &conf->mirrors[d].rdev->flags))
      + if (test_bit(FailFast, &conf->mirrors[d].replacement->flags))
      
      Since commit f90145f3 ("md/raid10: add rcu protection
      to rdev access in raid10_sync_request.") already added rcu
      protection for this part, let's extend the range protected
      by rcu and use rdev directly.
      
      Fixes: 1919cbb2 ("md/raid10: add failfast handling for writes.")
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  12. 14 Jun 2017, 1 commit
    • md: fix deadlock between mddev_suspend() and md_write_start() · cc27b0c7
      NeilBrown authored
      If mddev_suspend() races with md_write_start() we can deadlock
      with mddev_suspend() waiting for the request that is currently
      in md_write_start() to complete the ->make_request() call,
      and md_write_start() waiting for the metadata to be updated
      to mark the array as 'dirty'.
      As metadata updates done by md_check_recovery() only happen when
      the mddev_lock() can be claimed, and as mddev_suspend() is often
      called with the lock held, these threads wait indefinitely for each
      other.
      
      We fix this by having md_write_start() abort if mddev_suspend()
      is happening, and ->make_request() aborts if md_write_start()
      aborted.
      md_make_request() can detect this abort, decrease the ->active_io
      count, and wait for mddev_suspend().
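
      A simplified sketch of the resulting control flow (illustrative, not the
      literal patch):

          /* in a personality's ->make_request(): */
          if (!md_write_start(mddev, bio))    /* a suspend is in progress */
                  return false;               /* abort instead of deadlocking */

          /* in md_make_request(), when ->make_request() reports the abort: */
          atomic_dec(&mddev->active_io);      /* drop the reference we took */
          wait_event(mddev->sb_wait, !mddev->suspended);
          /* ...then retry the bio once the array has been resumed */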
      Reported-by: Nix <nix@esperi.org.uk>
      Fixes: 68866e42 ("MD: no sync IO while suspended")
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  13. 09 Jun 2017, 1 commit
  14. 06 Jun 2017, 1 commit
  15. 12 May 2017, 1 commit
  16. 02 May 2017, 1 commit
    • md/raid10: skip spare disk as 'first' disk · b506335e
      Shaohua Li authored
      Commit 6f287ca6 ("md/raid10: reset the 'first' at the end of loop")
      ignores a case in reshape: the first rdev could be a spare disk, which
      shouldn't be counted as the 'first' disk since it doesn't carry the
      offset info.
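
      The idea, sketched as a hypothetical form of the check in the
      reshape-offset validation loop:

          rdev_for_each(rdev, mddev) {
                  if (rdev->raid_disk < 0)
                          /* spare disk: it carries no offset info and must
                           * not be treated as the 'first' disk */
                          continue;
                  if (first) {
                          /* remember the offsets of the first real member */
                          first = 0;
                          continue;
                  }
                  /* compare this member's offsets against the first's */
          }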
      
      Fixes: 6f287ca6 ("md/raid10: reset the 'first' at the end of loop")
      Cc: Guoqing Jiang <gqjiang@suse.com>
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  17. 26 Apr 2017, 1 commit
  18. 24 Apr 2017, 1 commit
  19. 21 Apr 2017, 1 commit
  20. 12 Apr 2017, 2 commits
    • md/raid10: simplify handle_read_error() · 545250f2
      NeilBrown authored
      handle_read_error() duplicates a lot of the work that raid10_read_request()
      does, so it makes sense to just use that function.
      
      handle_read_error() relies on the same r10bio being re-used so that,
      in the case of a read-only array, setting IO_BLOCKED in r10bio->devs[].bio
      ensures read_balance() won't re-use that device.
      So when called from raid10_make_request() we clear that array, but not
      when called from handle_read_error().
      
      The part of handle_read_error() that needs to be preserved is the warning
      messages it prints, so they are conditionally added to
      raid10_read_request(): if the failing rdev can be found, the messages
      are printed; otherwise they aren't.
      
      Note that as rdev_dec_pending() has already been called on the failing
      rdev, we need to use rcu_read_lock() to get a new reference from
      the conf.  We only use this to get the name of the failing block device.
      With this change, we no longer need inc_pending().
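
      The conditional message then looks roughly like this (a sketch of the
      rcu lookup described above; the message text is illustrative):

          char b[BDEVNAME_SIZE];
          struct md_rdev *rdev;

          rcu_read_lock();
          rdev = rcu_dereference(conf->mirrors[slot].rdev);
          if (rdev)
                  pr_err_ratelimited("md/raid10:%s: %s: redirecting sector %llu to another mirror\n",
                                     mdname(mddev), bdevname(rdev->bdev, b),
                                     (unsigned long long)r10_bio->sector);
          rcu_read_unlock();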
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid10: simplify the splitting of requests. · fc9977dd
      NeilBrown authored
      raid10 splits requests in two different ways for two different
      reasons.
      
      First, bio_split() is used to ensure the bio fits within a chunk.
      Second, multiple r10bio structures are allocated to represent the
      different sections that need to go to different devices, to avoid
      known bad blocks.
      
      This can be simplified to just use bio_split() once, and not to use
      multiple r10bios.
      We delay the split until we know a maximum bio size that can
      be handled with a single r10bio, and then split the bio and queue
      the remainder for later handling.
      
      As with raid1, we allocate a new bio_set to help with the splitting.
      It is not correct to use fs_bio_set in a device driver.
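
      The resulting pattern, sketched (the bio_split/bio_chain helpers are the
      standard block-layer ones; conf->bio_split is assumed to be a bio_set
      created in setup_conf()):

          if (sectors < bio_sectors(bio)) {
                  /* split off what one r10bio can cover and requeue the rest */
                  struct bio *split = bio_split(bio, sectors, GFP_NOIO,
                                                conf->bio_split);
                  bio_chain(split, bio);
                  generic_make_request(bio);   /* the tail is handled later */
                  bio = split;                 /* carry on with the head */
          }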
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  21. 11 Apr 2017, 1 commit
  22. 09 Apr 2017, 1 commit
  23. 25 Mar 2017, 6 commits
  24. 24 Mar 2017, 1 commit
  25. 23 Mar 2017, 2 commits
    • md/raid10: stop using bi_phys_segments · fd16f2e8
      NeilBrown authored
      raid10 currently repurposes bi_phys_segments on each
      incoming bio to count how many r10bios were used to encode the
      request.
      
      We need to know when the number of attached r10bio reaches
      zero to:
      1/ call bio_endio() when all IO on the bio is finished
      2/ decrement ->nr_pending so that resync IO can proceed.
      
      Now that the bio has its own __bi_remaining counter, that
      can be used instead. We can call bio_inc_remaining to
      increment the counter and call bio_endio() every time an
      r10bio completes, rather than only when bi_phys_segments
      reaches zero.
      
      This addresses point 1, but not point 2.  bio_endio()
      doesn't (and cannot) report when the last r10bio has
      finished, so a different approach is needed.
      
      So: instead of counting bios in ->nr_pending, count r10bios.
      i.e. every time we attach a bio, increment nr_pending.
      Every time an r10bio completes, decrement nr_pending.
      
      Normally we only increment nr_pending after first checking
      that ->barrier is zero, or some other non-trivial tests and
      possible waiting.  When attaching multiple r10bios to a bio,
      we only need the tests and the waiting once.  After the
      first increment, subsequent increments can happen
      unconditionally as they are really all part of the one
      request.
      
      So introduce inc_pending() which can be used when we know
      that nr_pending is already elevated.
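
      A sketch of what such a helper can look like:

          static void inc_pending(struct r10conf *conf)
          {
                  /* The caller already holds a count in nr_pending for this
                   * request, so no barrier check or waiting is needed. */
                  WARN_ON(!atomic_read(&conf->nr_pending));
                  atomic_inc(&conf->nr_pending);
          }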
      
      Note that this fixes a bug.  freeze_array() contains the line
      	atomic_read(&conf->nr_pending) == conf->nr_queued+extra,
      which implies that the units for ->nr_pending, ->nr_queued and extra
      are the same.
      ->nr_queued and extra count r10_bios, but prior to this patch,
      ->nr_pending counted bios.  If a bio ever resulted in multiple
      r10_bios (due to bad blocks), freeze_array() would not work correctly.
      Now it does.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
    • md/raid1, raid10: move rXbio accounting closer to allocation. · 6b6c8110
      NeilBrown authored
      When raid1 or raid10 find they will need to allocate a new
      r1bio/r10bio, in order to work around a known bad block, they
      account for the allocation well before the allocation is
      made.  This separation makes the correctness less obvious
      and requires comments.
      
      The accounting only needs to happen a little earlier: before the first
      rXbio is submitted, but that is all.
      
      So move the accounting down to where it makes more sense.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>