1. 02 11月, 2017 5 次提交
    • G
      raid1: remove obsolete code in raid1_write_request · f81f7302
      Guoqing Jiang 提交于
      There are some lines could be removed due to recent
      change for raid1 such as commit 3956df15d634 ("md:
      move suspend_hi/lo handling into core md code").
      
      Also, seems some comments are put to wrong place,
      move them before wait_barrier.
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      f81f7302
    • N
      raid1: prevent freeze_array/wait_all_barriers deadlock · f6eca2d4
      Nate Dailey 提交于
      If freeze_array is attempted in the middle of close_sync/
      wait_all_barriers, deadlock can occur.
      
      freeze_array will wait for nr_pending and nr_queued to line up.
      wait_all_barriers increments nr_pending for each barrier bucket, one
      at a time, but doesn't actually issue IO that could be counted in
      nr_queued. So freeze_array is blocked until wait_all_barriers
      completes and allow_all_barriers runs. At the same time, when
      _wait_barrier sees array_frozen == 1, it stops and waits for
      freeze_array to complete.
      
      Prevent the deadlock by making close_sync call _wait_barrier and
      _allow_barrier for one bucket at a time, instead of deferring the
      _allow_barrier calls until after all _wait_barriers are complete.
      Signed-off-by: NNate Dailey <nate.dailey@stratus.com>
      Fix: fd76863e(RAID1: a new I/O barrier implementation to remove resync window)
      Reviewed-by: NColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org (v4.11)
      Signed-off-by: NShaohua Li <shli@fb.com>
      f6eca2d4
    • M
      md: use TASK_IDLE instead of blocking signals · ae89fd3d
      Mikulas Patocka 提交于
      Hi - I submit this patch for the next merge window:
      
      Some times ago, I made a patch f9c79bc0 that blocks signals around the
      schedule() calls in MD. The MD subsystem needs to do an uninterruptible
      sleep that is not accounted in load average - so we block signals and use
      interruptible sleep.
      
      The kernel has a special TASK_IDLE state for this purpose, so we can use
      it instead of blocking signals. This patch doesn't fix any bug, it just
      makes the code simpler.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Acked-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ae89fd3d
    • N
      md: remove special meaning of ->quiesce(.., 2) · b03e0ccb
      NeilBrown 提交于
      The '2' argument means "wake up anything that is waiting".
      This is an inelegant part of the design and was added
      to help support management of suspend_lo/suspend_hi setting.
      Now that suspend_lo/hi is managed in mddev_suspend/resume,
      that need is gone.
      These is still a couple of places where we call 'quiesce'
      with an argument of '2', but they can safely be changed to
      call ->quiesce(.., 1); ->quiesce(.., 0) which
      achieve the same result at the small cost of pausing IO
      briefly.
      
      This removes a small "optimization" from suspend_{hi,lo}_store,
      but it isn't clear that optimization served a useful purpose.
      The code now is a lot clearer.
      Suggested-by: NShaohua Li <shli@kernel.org>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      b03e0ccb
    • N
      md: move suspend_hi/lo handling into core md code · b3143b9a
      NeilBrown 提交于
      responding to ->suspend_lo and ->suspend_hi is similar
      to responding to ->suspended.  It is best to wait in
      the common core code without incrementing ->active_io.
      This allows mddev_suspend()/mddev_resume() to work while
      requests are waiting for suspend_lo/hi to change.
      This is will be important after a subsequent patch
      which uses mddev_suspend() to synchronize updating for
      suspend_lo/hi.
      
      So move the code for testing suspend_lo/hi out of raid1.c
      and raid5.c, and place it in md.c
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      b3143b9a
  2. 17 10月, 2017 2 次提交
  3. 28 8月, 2017 1 次提交
  4. 26 8月, 2017 1 次提交
  5. 24 8月, 2017 1 次提交
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig 提交于
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74d46992
  6. 22 7月, 2017 5 次提交
  7. 19 6月, 2017 1 次提交
  8. 17 6月, 2017 1 次提交
  9. 14 6月, 2017 2 次提交
    • M
      md: don't use flush_signals in userspace processes · f9c79bc0
      Mikulas Patocka 提交于
      The function flush_signals clears all pending signals for the process. It
      may be used by kernel threads when we need to prepare a kernel thread for
      responding to signals. However using this function for an userspaces
      processes is incorrect - clearing signals without the program expecting it
      can cause misbehavior.
      
      The raid1 and raid5 code uses flush_signals in its request routine because
      it wants to prepare for an interruptible wait. This patch drops
      flush_signals and uses sigprocmask instead to block all signals (including
      SIGKILL) around the schedule() call. The signals are not lost, but the
      schedule() call won't respond to them.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Acked-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      f9c79bc0
    • N
      md: fix deadlock between mddev_suspend() and md_write_start() · cc27b0c7
      NeilBrown 提交于
      If mddev_suspend() races with md_write_start() we can deadlock
      with mddev_suspend() waiting for the request that is currently
      in md_write_start() to complete the ->make_request() call,
      and md_write_start() waiting for the metadata to be updated
      to mark the array as 'dirty'.
      As metadata updates done by md_check_recovery() only happen then
      the mddev_lock() can be claimed, and as mddev_suspend() is often
      called with the lock held, these threads wait indefinitely for each
      other.
      
      We fix this by having md_write_start() abort if mddev_suspend()
      is happening, and ->make_request() aborts if md_write_start()
      aborted.
      md_make_request() can detect this abort, decrease the ->active_io
      count, and wait for mddev_suspend().
      Reported-by: NNix <nix@esperi.org.uk>
      Fix: 68866e42(MD: no sync IO while suspended)
      Cc: stable@vger.kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      cc27b0c7
  10. 09 6月, 2017 1 次提交
  11. 06 6月, 2017 1 次提交
  12. 13 5月, 2017 1 次提交
    • T
      raid1: prefer disk without bad blocks · d82dd0e3
      Tomasz Majchrzak 提交于
      If an array consists of two drives and the first drive has the bad
      block, the read request to the region overlapping the bad block chooses
      the same disk (with bad block) as device to read from over and over and
      the request gets stuck. If the first disk only partially overlaps with
      bad block, it becomes a candidate ("best disk") for shorter range of
      sectors. The second disk is capable of reading the entire requested
      range and it is updated accordingly, however it is not recorded as a
      best device for the request. In the end the request is sent to the first
      disk to read entire range of sectors. It fails and is re-tried in a
      moment but with the same outcome.
      
      Actually it is quite likely scenario but it had little exposure in my
      test until commit 715d40b93b10 ("md/raid1: add failfast handling for
      reads.") removed preference for idle disk. Such scenario had been
      passing as second disk was always chosen when idle.
      
      Reset a candidate ("best disk") to read from if disk can read entire
      range. Do it only if other disk has already been chosen as a candidate
      for a smaller range. The head position / disk type logic will select
      the best disk to read from - it is fine as disk with bad block won't be
      considered for it.
      Signed-off-by: NTomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d82dd0e3
  13. 12 5月, 2017 1 次提交
  14. 09 5月, 2017 1 次提交
  15. 28 4月, 2017 1 次提交
    • X
      md/raid1: Use a new variable to count flighting sync requests · 43ac9b84
      Xiao Ni 提交于
      In new barrier codes, raise_barrier waits if conf->nr_pending[idx] is not zero.
      After all the conditions are true, the resync request can go on be handled. But
      it adds conf->nr_pending[idx] again. The next resync request hit the same bucket
      idx need to wait the resync request which is submitted before. The performance
      of resync/recovery is degraded.
      So we should use a new variable to count sync requests which are in flight.
      
      I did a simple test:
      1. Without the patch, create a raid1 with two disks. The resync speed:
      Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sdb               0.00     0.00  166.00    0.00    10.38     0.00   128.00     0.03    0.20    0.20    0.00   0.19   3.20
      sdc               0.00     0.00    0.00  166.00     0.00    10.38   128.00     0.96    5.77    0.00    5.77   5.75  95.50
      2. With the patch, the result is:
      sdb            2214.00     0.00  766.00    0.00   185.69     0.00   496.46     2.80    3.66    3.66    0.00   1.03  79.10
      sdc               0.00  2205.00    0.00  769.00     0.00   186.44   496.52     5.25    6.84    0.00    6.84   1.30 100.10
      Suggested-by: NShaohua Li <shli@kernel.org>
      Signed-off-by: NXiao Ni <xni@redhat.com>
      Acked-by: NColy Li <colyli@suse.de>
      Signed-off-by: NShaohua Li <shli@fb.com>
      43ac9b84
  16. 26 4月, 2017 1 次提交
  17. 24 4月, 2017 1 次提交
  18. 12 4月, 2017 4 次提交
    • N
      md/raid1: factor out flush_bio_list() · 673ca68d
      NeilBrown 提交于
      flush_pending_writes() and raid1_unplug() each contain identical
      copies of a fairly large slab of code.  So factor that out into
      new flush_bio_list() to simplify maintenance.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      673ca68d
    • N
      md/raid1: simplify handle_read_error(). · 689389a0
      NeilBrown 提交于
      handle_read_error() duplicates a lot of the work that raid1_read_request()
      does, so it makes sense to just use that function.
      This doesn't quite work as handle_read_error() relies on the same r1bio
      being re-used so that, in the case of a read-only array, setting
      IO_BLOCKED in r1bio->bios[] ensures read_balance() won't re-use
      that device.
      So we need to allow a r1bio to be passed to raid1_read_request(), and to
      have that function mostly initialise the r1bio, but leave the bios[]
      array untouched.
      
      Two parts of handle_read_error() that need to be preserved are the warning
      message it prints, so they are conditionally added to raid1_read_request().
      
      Note that this highlights a minor bug on alloc_r1bio().  It doesn't
      initalise the bios[] array, so it is possible that old content is there,
      which might cause read_balance() to ignore some devices with no good reason.
      
      With this change, we no longer need inc_pending(), or the sectors_handled
      arg to alloc_r1bio().
      
      As handle_read_error() is called from raid1d() and allocates memory,
      there is tiny chance of a deadlock.  All element of various pools
      could be queued waiting for raid1 to handle them, and there may be no
      extra memory free.
      Achieving guaranteed forward progress would probably require a second
      thread and another mempool.  Instead of that complexity, add
      __GFP_HIGH to any allocations when read1_read_request() is called
      from raid1d.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      689389a0
    • N
      md/raid1: simplify alloc_behind_master_bio() · cb83efcf
      NeilBrown 提交于
      Now that we always always pass an offset of 0 and a size
      that matches the bio to alloc_behind_master_bio(),
      we can remove the offset/size args and simplify the code.
      
      We could probably remove bio_copy_data_partial() too.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      cb83efcf
    • N
      md/raid1: simplify the splitting of requests. · c230e7e5
      NeilBrown 提交于
      raid1 currently splits requests in two different ways for
      two different reasons.
      
      First, bio_split() is used to ensure the bio fits within a
      resync accounting region.
      Second, multiple r1bios are allocated for each bio to handle
      the possiblity of known bad blocks on some devices.
      
      This can be simplified to just use bio_split() once, and not
      use multiple r1bios.
      We delay the split until we know a maximum bio size that can
      be handled with a single r1bio, and then split the bio and
      queue the remainder for later handling.
      
      This avoids all loops inside raid1.c request handling.  Just
      a single read, or a single set of writes, is submitted to
      lower-level devices for each bio that comes from
      generic_make_request().
      
      When the bio needs to be split, generic_make_request() will
      do the necessary looping and call md_make_request() multiple
      times.
      
      raid1_make_request() no longer queues request for raid1 to handle,
      so we can remove that branch from the 'if'.
      
      This patch also creates a new private bio_set
      (conf->bio_split) for splitting bios.  Using fs_bio_set
      is wrong, as it is meant to be used by filesystems, not
      block devices.  Using it inside md can lead to deadlocks
      under high memory pressure.
      
      Delete unused variable in raid1_write_request() (Shaohua)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      c230e7e5
  19. 11 4月, 2017 1 次提交
    • N
      md/raid1: avoid reusing a resync bio after error handling. · 0c9d5b12
      NeilBrown 提交于
      fix_sync_read_error() modifies a bio on a newly faulty
      device by setting bi_end_io to end_sync_write.
      This ensure that put_buf() will still call rdev_dec_pending()
      as required, but makes sure that subsequent code in
      fix_sync_read_error() doesn't try to read from the device.
      
      Unfortunately this interacts badly with sync_request_write()
      which assumes that any bio with bi_end_io set to non-NULL
      other than end_sync_read is safe to write to.
      
      As the device is now faulty it doesn't make sense to write.
      As the bio was recently used for a read, it is "dirty"
      and not suitable for immediate submission.
      In particular, ->bi_next might be non-NULL, which will cause
      generic_make_request() to complain.
      
      Break this interaction by refusing to write to devices
      which are marked as Faulty.
      Reported-and-tested-by: NMichael Wang <yun.wang@profitbricks.com>
      Fixes: 2e52d449 ("md/raid1: add failfast handling for reads.")
      Cc: stable@vger.kernel.org (v4.10+)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      0c9d5b12
  20. 09 4月, 2017 1 次提交
  21. 28 3月, 2017 1 次提交
    • M
      md: raid1: kill warning on powerpc_pseries · 8fc04e6e
      Ming Lei 提交于
      This patch kills the warning reported on powerpc_pseries,
      and actually we don't need the initialization.
      
      	After merging the md tree, today's linux-next build (powerpc
      	pseries_le_defconfig) produced this warning:
      
      	drivers/md/raid1.c: In function 'raid1d':
      	drivers/md/raid1.c:2172:9: warning: 'page_len$' may be used uninitialized in this function [-Wmaybe-uninitialized]
      	     if (memcmp(page_address(ppages[j]),
      	         ^
      	drivers/md/raid1.c:2160:7: note: 'page_len$' was declared here
      	   int page_len[RESYNC_PAGES];
             ^
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      8fc04e6e
  22. 26 3月, 2017 1 次提交
  23. 25 3月, 2017 5 次提交