1. 22 7月, 2017 1 次提交
  2. 19 6月, 2017 1 次提交
  3. 17 6月, 2017 1 次提交
  4. 14 6月, 2017 2 次提交
    • M
      md: don't use flush_signals in userspace processes · f9c79bc0
      Mikulas Patocka 提交于
      The function flush_signals clears all pending signals for the process. It
      may be used by kernel threads when we need to prepare a kernel thread for
      responding to signals. However using this function for an userspaces
      processes is incorrect - clearing signals without the program expecting it
      can cause misbehavior.
      
      The raid1 and raid5 code uses flush_signals in its request routine because
      it wants to prepare for an interruptible wait. This patch drops
      flush_signals and uses sigprocmask instead to block all signals (including
      SIGKILL) around the schedule() call. The signals are not lost, but the
      schedule() call won't respond to them.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Acked-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      f9c79bc0
    • N
      md: fix deadlock between mddev_suspend() and md_write_start() · cc27b0c7
      NeilBrown 提交于
      If mddev_suspend() races with md_write_start() we can deadlock
      with mddev_suspend() waiting for the request that is currently
      in md_write_start() to complete the ->make_request() call,
      and md_write_start() waiting for the metadata to be updated
      to mark the array as 'dirty'.
      As metadata updates done by md_check_recovery() only happen then
      the mddev_lock() can be claimed, and as mddev_suspend() is often
      called with the lock held, these threads wait indefinitely for each
      other.
      
      We fix this by having md_write_start() abort if mddev_suspend()
      is happening, and ->make_request() aborts if md_write_start()
      aborted.
      md_make_request() can detect this abort, decrease the ->active_io
      count, and wait for mddev_suspend().
      Reported-by: NNix <nix@esperi.org.uk>
      Fix: 68866e42(MD: no sync IO while suspended)
      Cc: stable@vger.kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      cc27b0c7
  5. 09 6月, 2017 1 次提交
  6. 06 6月, 2017 1 次提交
  7. 13 5月, 2017 1 次提交
    • T
      raid1: prefer disk without bad blocks · d82dd0e3
      Tomasz Majchrzak 提交于
      If an array consists of two drives and the first drive has the bad
      block, the read request to the region overlapping the bad block chooses
      the same disk (with bad block) as device to read from over and over and
      the request gets stuck. If the first disk only partially overlaps with
      bad block, it becomes a candidate ("best disk") for shorter range of
      sectors. The second disk is capable of reading the entire requested
      range and it is updated accordingly, however it is not recorded as a
      best device for the request. In the end the request is sent to the first
      disk to read entire range of sectors. It fails and is re-tried in a
      moment but with the same outcome.
      
      Actually it is quite likely scenario but it had little exposure in my
      test until commit 715d40b93b10 ("md/raid1: add failfast handling for
      reads.") removed preference for idle disk. Such scenario had been
      passing as second disk was always chosen when idle.
      
      Reset a candidate ("best disk") to read from if disk can read entire
      range. Do it only if other disk has already been chosen as a candidate
      for a smaller range. The head position / disk type logic will select
      the best disk to read from - it is fine as disk with bad block won't be
      considered for it.
      Signed-off-by: NTomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d82dd0e3
  8. 12 5月, 2017 1 次提交
  9. 09 5月, 2017 1 次提交
  10. 28 4月, 2017 1 次提交
    • X
      md/raid1: Use a new variable to count flighting sync requests · 43ac9b84
      Xiao Ni 提交于
      In new barrier codes, raise_barrier waits if conf->nr_pending[idx] is not zero.
      After all the conditions are true, the resync request can go on be handled. But
      it adds conf->nr_pending[idx] again. The next resync request hit the same bucket
      idx need to wait the resync request which is submitted before. The performance
      of resync/recovery is degraded.
      So we should use a new variable to count sync requests which are in flight.
      
      I did a simple test:
      1. Without the patch, create a raid1 with two disks. The resync speed:
      Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sdb               0.00     0.00  166.00    0.00    10.38     0.00   128.00     0.03    0.20    0.20    0.00   0.19   3.20
      sdc               0.00     0.00    0.00  166.00     0.00    10.38   128.00     0.96    5.77    0.00    5.77   5.75  95.50
      2. With the patch, the result is:
      sdb            2214.00     0.00  766.00    0.00   185.69     0.00   496.46     2.80    3.66    3.66    0.00   1.03  79.10
      sdc               0.00  2205.00    0.00  769.00     0.00   186.44   496.52     5.25    6.84    0.00    6.84   1.30 100.10
      Suggested-by: NShaohua Li <shli@kernel.org>
      Signed-off-by: NXiao Ni <xni@redhat.com>
      Acked-by: NColy Li <colyli@suse.de>
      Signed-off-by: NShaohua Li <shli@fb.com>
      43ac9b84
  11. 26 4月, 2017 1 次提交
  12. 24 4月, 2017 1 次提交
  13. 12 4月, 2017 4 次提交
    • N
      md/raid1: factor out flush_bio_list() · 673ca68d
      NeilBrown 提交于
      flush_pending_writes() and raid1_unplug() each contain identical
      copies of a fairly large slab of code.  So factor that out into
      new flush_bio_list() to simplify maintenance.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      673ca68d
    • N
      md/raid1: simplify handle_read_error(). · 689389a0
      NeilBrown 提交于
      handle_read_error() duplicates a lot of the work that raid1_read_request()
      does, so it makes sense to just use that function.
      This doesn't quite work as handle_read_error() relies on the same r1bio
      being re-used so that, in the case of a read-only array, setting
      IO_BLOCKED in r1bio->bios[] ensures read_balance() won't re-use
      that device.
      So we need to allow a r1bio to be passed to raid1_read_request(), and to
      have that function mostly initialise the r1bio, but leave the bios[]
      array untouched.
      
      Two parts of handle_read_error() that need to be preserved are the warning
      message it prints, so they are conditionally added to raid1_read_request().
      
      Note that this highlights a minor bug on alloc_r1bio().  It doesn't
      initalise the bios[] array, so it is possible that old content is there,
      which might cause read_balance() to ignore some devices with no good reason.
      
      With this change, we no longer need inc_pending(), or the sectors_handled
      arg to alloc_r1bio().
      
      As handle_read_error() is called from raid1d() and allocates memory,
      there is tiny chance of a deadlock.  All element of various pools
      could be queued waiting for raid1 to handle them, and there may be no
      extra memory free.
      Achieving guaranteed forward progress would probably require a second
      thread and another mempool.  Instead of that complexity, add
      __GFP_HIGH to any allocations when read1_read_request() is called
      from raid1d.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      689389a0
    • N
      md/raid1: simplify alloc_behind_master_bio() · cb83efcf
      NeilBrown 提交于
      Now that we always always pass an offset of 0 and a size
      that matches the bio to alloc_behind_master_bio(),
      we can remove the offset/size args and simplify the code.
      
      We could probably remove bio_copy_data_partial() too.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      cb83efcf
    • N
      md/raid1: simplify the splitting of requests. · c230e7e5
      NeilBrown 提交于
      raid1 currently splits requests in two different ways for
      two different reasons.
      
      First, bio_split() is used to ensure the bio fits within a
      resync accounting region.
      Second, multiple r1bios are allocated for each bio to handle
      the possiblity of known bad blocks on some devices.
      
      This can be simplified to just use bio_split() once, and not
      use multiple r1bios.
      We delay the split until we know a maximum bio size that can
      be handled with a single r1bio, and then split the bio and
      queue the remainder for later handling.
      
      This avoids all loops inside raid1.c request handling.  Just
      a single read, or a single set of writes, is submitted to
      lower-level devices for each bio that comes from
      generic_make_request().
      
      When the bio needs to be split, generic_make_request() will
      do the necessary looping and call md_make_request() multiple
      times.
      
      raid1_make_request() no longer queues request for raid1 to handle,
      so we can remove that branch from the 'if'.
      
      This patch also creates a new private bio_set
      (conf->bio_split) for splitting bios.  Using fs_bio_set
      is wrong, as it is meant to be used by filesystems, not
      block devices.  Using it inside md can lead to deadlocks
      under high memory pressure.
      
      Delete unused variable in raid1_write_request() (Shaohua)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      c230e7e5
  14. 11 4月, 2017 1 次提交
    • N
      md/raid1: avoid reusing a resync bio after error handling. · 0c9d5b12
      NeilBrown 提交于
      fix_sync_read_error() modifies a bio on a newly faulty
      device by setting bi_end_io to end_sync_write.
      This ensure that put_buf() will still call rdev_dec_pending()
      as required, but makes sure that subsequent code in
      fix_sync_read_error() doesn't try to read from the device.
      
      Unfortunately this interacts badly with sync_request_write()
      which assumes that any bio with bi_end_io set to non-NULL
      other than end_sync_read is safe to write to.
      
      As the device is now faulty it doesn't make sense to write.
      As the bio was recently used for a read, it is "dirty"
      and not suitable for immediate submission.
      In particular, ->bi_next might be non-NULL, which will cause
      generic_make_request() to complain.
      
      Break this interaction by refusing to write to devices
      which are marked as Faulty.
      Reported-and-tested-by: NMichael Wang <yun.wang@profitbricks.com>
      Fixes: 2e52d449 ("md/raid1: add failfast handling for reads.")
      Cc: stable@vger.kernel.org (v4.10+)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      0c9d5b12
  15. 09 4月, 2017 1 次提交
  16. 28 3月, 2017 1 次提交
    • M
      md: raid1: kill warning on powerpc_pseries · 8fc04e6e
      Ming Lei 提交于
      This patch kills the warning reported on powerpc_pseries,
      and actually we don't need the initialization.
      
      	After merging the md tree, today's linux-next build (powerpc
      	pseries_le_defconfig) produced this warning:
      
      	drivers/md/raid1.c: In function 'raid1d':
      	drivers/md/raid1.c:2172:9: warning: 'page_len$' may be used uninitialized in this function [-Wmaybe-uninitialized]
      	     if (memcmp(page_address(ppages[j]),
      	         ^
      	drivers/md/raid1.c:2160:7: note: 'page_len$' was declared here
      	   int page_len[RESYNC_PAGES];
             ^
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      8fc04e6e
  17. 26 3月, 2017 1 次提交
  18. 25 3月, 2017 8 次提交
  19. 23 3月, 2017 2 次提交
    • N
      md/raid1: stop using bi_phys_segment · 37011e3a
      NeilBrown 提交于
      Change to use bio->__bi_remaining to count number of r1bio attached
      to a bio.
      See precious raid10 patch for more details.
      
      Like the raid10.c patch, this fixes a bug as nr_queued and nr_pending
      used to measure different things, but were being compared.
      
      This patch fixes another bug in that nr_pending previously did not
      could write-behind requests, so behind writes could continue while
      resync was happening.  How that nr_pending counts all r1_bio,
      the resync cannot commence until the behind writes have completed.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      37011e3a
    • N
      md/raid1, raid10: move rXbio accounting closer to allocation. · 6b6c8110
      NeilBrown 提交于
      When raid1 or raid10 find they will need to allocate a new
      r1bio/r10bio, in order to work around a known bad block, they
      account for the allocation well before the allocation is
      made.  This separation makes the correctness less obvious
      and requires comments.
      
      The accounting needs to be a little before: before the first
      rXbio is submitted, but that is all.
      
      So move the accounting down to where it makes more sense.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      6b6c8110
  20. 17 3月, 2017 1 次提交
  21. 15 3月, 2017 1 次提交
  22. 10 3月, 2017 2 次提交
    • S
      md/raid1/10: fix potential deadlock · 61eb2b43
      Shaohua Li 提交于
      Neil Brown pointed out a potential deadlock in raid 10 code with
      bio_split/chain. The raid1 code could have the same issue, but recent
      barrier rework makes it less likely to happen. The deadlock happens in
      below sequence:
      
      1. generic_make_request(bio), this will set current->bio_list
      2. raid10_make_request will split bio to bio1 and bio2
      3. __make_request(bio1), wait_barrer, add underlayer disk bio to
      current->bio_list
      4. __make_request(bio2), wait_barrer
      
      If raise_barrier happens between 3 & 4, since wait_barrier runs at 3,
      raise_barrier waits for IO completion from 3. And since raise_barrier
      sets barrier, 4 waits for raise_barrier. But IO from 3 can't be
      dispatched because raid10_make_request() doesn't finished yet.
      
      The solution is to adjust the IO ordering. Quotes from Neil:
      "
      It is much safer to:
      
          if (need to split) {
              split = bio_split(bio, ...)
              bio_chain(...)
              make_request_fn(split);
              generic_make_request(bio);
         } else
              make_request_fn(mddev, bio);
      
      This way we first process the initial section of the bio (in 'split')
      which will queue some requests to the underlying devices.  These
      requests will be queued in generic_make_request.
      Then we queue the remainder of the bio, which will be added to the end
      of the generic_make_request queue.
      Then we return.
      generic_make_request() will pop the lower-level device requests off the
      queue and handle them first.  Then it will process the remainder
      of the original bio once the first section has been fully processed.
      "
      
      Note, this only happens in read path. In write path, the bio is flushed to
      underlaying disks either by blk flush (from schedule) or offladed to raid1/10d.
      It's queued in current->bio_list.
      
      Cc: Coly Li <colyli@suse.de>
      Cc: stable@vger.kernel.org (v3.14+, only the raid10 part)
      Suggested-by: NNeilBrown <neilb@suse.com>
      Reviewed-by: NJack Wang <jinpu.wang@profitbricks.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      61eb2b43
    • G
      md: move funcs from pers->resize to update_size · c9483634
      Guoqing Jiang 提交于
      raid1_resize and raid5_resize should also check the
      mddev->queue if run underneath dm-raid.
      
      And both set_capacity and revalidate_disk are used in
      pers->resize such as raid1, raid10 and raid5. So
      move them from personality file to common code.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      c9483634
  23. 02 3月, 2017 1 次提交
  24. 24 2月, 2017 2 次提交
  25. 20 2月, 2017 2 次提交
    • S
      md/raid1: fix a use-after-free bug · af5f42a7
      Shaohua Li 提交于
      Commit fd76863e (RAID1: a new I/O barrier implementation to remove resync
      window) introduces a user-after-free bug.
      Signed-off-by: NShaohua Li <shli@fb.com>
      af5f42a7
    • C
      RAID1: avoid unnecessary spin locks in I/O barrier code · 824e47da
      colyli@suse.de 提交于
      When I run a parallel reading performan testing on a md raid1 device with
      two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
      block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
      only 2.7GB/s, this is around 50% of the idea performance number.
      
      The perf reports locking contention happens at allow_barrier() and
      wait_barrier() code,
       - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
         - _raw_spin_lock_irqsave
               + 89.92% allow_barrier
               + 9.34% __wake_up
       - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
         - _raw_spin_lock_irq
               - 100.00% wait_barrier
      
      The reason is, in these I/O barrier related functions,
       - raise_barrier()
       - lower_barrier()
       - wait_barrier()
       - allow_barrier()
      They always hold conf->resync_lock firstly, even there are only regular
      reading I/Os and no resync I/O at all. This is a huge performance penalty.
      
      The solution is a lockless-like algorithm in I/O barrier code, and only
      holding conf->resync_lock when it has to.
      
      The original idea is from Hannes Reinecke, and Neil Brown provides
      comments to improve it. I continue to work on it, and make the patch into
      current form.
      
      In the new simpler raid1 I/O barrier implementation, there are two
      wait barrier functions,
       - wait_barrier()
         Which calls _wait_barrier(), is used for regular write I/O. If there is
         resync I/O happening on the same I/O barrier bucket, or the whole
         array is frozen, task will wait until no barrier on same barrier bucket,
         or the whold array is unfreezed.
       - wait_read_barrier()
         Since regular read I/O won't interfere with resync I/O (read_balance()
         will make sure only uptodate data will be read out), it is unnecessary
         to wait for barrier in regular read I/Os, waiting in only necessary
         when the whole array is frozen.
      
      The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
      barrier[idx] are very carefully designed in raise_barrier(),
      lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
      avoid unnecessary spin locks in these functions. Once conf->
      nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
      has to wait in raise_barrier(). Then in _wait_barrier() if no barrier
      raised in same barrier bucket index and array is not frozen, the regular
      I/O doesn't need to hold conf->resync_lock, it can just increase
      conf->nr_pending[idx], and return to its caller. wait_read_barrier() is
      very similar to _wait_barrier(), the only difference is it only waits when
      array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier
      code almostly gets rid of all spin lock cost.
      
      This patch significantly improves raid1 reading peroformance. From my
      testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
      blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
      increases from 2.7GB/s to 4.6GB/s (+70%).
      
      Changelog
      V4:
      - Change conf->nr_queued[] to atomic_t.
      - Define BARRIER_BUCKETS_NR_BITS by (PAGE_SHIFT - ilog2(sizeof(atomic_t)))
      V3:
      - Add smp_mb__after_atomic() as Shaohua and Neil suggested.
      - Change conf->nr_queued[] from atomic_t to int.
      - Change conf->array_frozen from atomic_t back to int, and use
        READ_ONCE(conf->array_frozen) to check value of conf->array_frozen
        in _wait_barrier() and wait_read_barrier().
      - In _wait_barrier() and wait_read_barrier(), add a call to
        wake_up(&conf->wait_barrier) after atomic_dec(&conf->nr_pending[idx]),
        to fix a deadlock between  _wait_barrier()/wait_read_barrier and
        freeze_array().
      V2:
      - Remove a spin_lock/unlock pair in raid1d().
      - Add more code comments to explain why there is no racy when checking two
        atomic_t variables at same time.
      V1:
      - Original RFC patch for comments.
      Signed-off-by: NColy Li <colyli@suse.de>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Guoqing Jiang <gqjiang@suse.com>
      Reviewed-by: NNeil Brown <neilb@suse.de>
      Signed-off-by: NShaohua Li <shli@fb.com>
      824e47da