1. 22 June 2017, 1 commit
    • md: use a separate bio_set for synchronous IO. · 5a85071c
      Committed by NeilBrown
      md devices allocate a bio_set and use it for two
      distinct purposes.
      mddev->bio_set is used to clone bios as part of sending
      upper level requests down to lower level devices,
      and it is also used for synchronous IO such as superblock
      and bitmap updates, and for correcting read errors.
      
      This multiple usage can lead to deadlocks.  Cloned bios are likely
      to be queued for write while waiting for a metadata update
      before the write can be permitted.
      If the cloning has exhausted mddev->bio_set, the metadata update
      may not be able to proceed.
      
      This scenario has been seen during heavy testing, with lots of IO and
      lots of memory pressure.
      
      Address this by adding a new bio_set specifically for synchronous IO.
      All synchronous IO goes directly to the underlying device and is not
      queued at the md level, so requests using entries from the new
      mddev->sync_set will complete in a timely fashion.
      Requests that use mddev->bio_set will sometimes need to wait
      for synchronous IO, but will no longer risk deadlocking that IO.
      
      Also: small simplification in mddev_put(): there is no need to
      wait until the spinlock is released before calling bioset_free().
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      5a85071c
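      The idea, sketched below in simplified form (the struct and helper names
      are illustrative, not the actual md code; bio_clone_fast() and
      bio_alloc_bioset() are real block-layer helpers of that era): keep two
      mempool-backed bio_sets so that synchronous metadata IO never competes
      with cloned request bios for the same pool.

          #include <linux/bio.h>
          #include <linux/gfp.h>

          /* Illustrative container mirroring mddev->bio_set / mddev->sync_set. */
          struct md_pools_sketch {
                  struct bio_set *bio_set;   /* clones of upper-level requests */
                  struct bio_set *sync_set;  /* superblock/bitmap IO, read-error fixup */
          };

          /* Cloned request bios may be queued at the md level and end up
           * waiting for a metadata update, so they draw from bio_set. */
          static struct bio *clone_request_bio(struct bio *src, struct md_pools_sketch *p)
          {
                  return bio_clone_fast(src, GFP_NOIO, p->bio_set);
          }

          /* Synchronous IO goes straight to the underlying device and never
           * queues behind cloned writes, so its pool cannot be exhausted by
           * them - this is what breaks the deadlock described above. */
          static struct bio *alloc_sync_bio(struct md_pools_sketch *p)
          {
                  return bio_alloc_bioset(GFP_NOIO, 1, p->sync_set);
          }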
  2. 17 June 2017, 3 commits
  3. 14 June 2017, 2 commits
    • md: don't use flush_signals in userspace processes · f9c79bc0
      Committed by Mikulas Patocka
      The function flush_signals clears all pending signals for the process. It
      may be used by kernel threads when we need to prepare a kernel thread for
      responding to signals. However, using this function for userspace
      processes is incorrect - clearing signals without the program expecting it
      can cause misbehavior.
      
      The raid1 and raid5 code uses flush_signals in its request routine because
      it wants to prepare for an interruptible wait. This patch drops
      flush_signals and uses sigprocmask instead to block all signals (including
      SIGKILL) around the schedule() call. The signals are not lost, but the
      schedule() call won't respond to them.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Acked-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      f9c79bc0
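      A sketch of the pattern described (the function below is illustrative, not
      the literal raid1/raid5 change; sigfillset(), sigprocmask(),
      prepare_to_wait() and schedule() are the real kernel primitives involved):

          #include <linux/sched.h>
          #include <linux/signal.h>
          #include <linux/wait.h>

          /* Block every signal (including SIGKILL) around the schedule() call of
           * an interruptible wait instead of calling flush_signals().  Pending
           * signals are preserved and become deliverable again once the old
           * mask is restored. */
          static void sleep_with_signals_blocked(wait_queue_head_t *wq)
          {
                  DEFINE_WAIT(w);
                  sigset_t full, old;

                  sigfillset(&full);
                  sigprocmask(SIG_BLOCK, &full, &old);    /* defer, don't discard */
                  prepare_to_wait(wq, &w, TASK_INTERRUPTIBLE);
                  schedule();                             /* won't respond to signals here */
                  finish_wait(wq, &w);
                  sigprocmask(SIG_SETMASK, &old, NULL);   /* restore the caller's mask */
          }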
    • md: fix deadlock between mddev_suspend() and md_write_start() · cc27b0c7
      Committed by NeilBrown
      If mddev_suspend() races with md_write_start() we can deadlock
      with mddev_suspend() waiting for the request that is currently
      in md_write_start() to complete the ->make_request() call,
      and md_write_start() waiting for the metadata to be updated
      to mark the array as 'dirty'.
      As metadata updates done by md_check_recovery() only happen when
      the mddev_lock() can be claimed, and as mddev_suspend() is often
      called with the lock held, these threads wait indefinitely for each
      other.
      
      We fix this by having md_write_start() abort if mddev_suspend()
      is happening, and having ->make_request() abort if md_write_start()
      aborted.
      md_make_request() can detect this abort, decrease the ->active_io
      count, and wait for mddev_suspend().
      Reported-by: Nix <nix@esperi.org.uk>
      Fixes: 68866e42 ("MD: no sync IO while suspended")
      Cc: stable@vger.kernel.org
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      cc27b0c7
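      A hedged sketch of the resulting control flow (the structure and helpers
      below are illustrative stand-ins; only the ->active_io accounting and the
      bool return from md_write_start() reflect the description above):

          #include <linux/atomic.h>
          #include <linux/bio.h>
          #include <linux/types.h>

          struct mddev_sketch {
                  bool suspended;
                  atomic_t active_io;
          };

          bool personality_make_request(struct mddev_sketch *mddev, struct bio *bio);
          void wait_until_resumed(struct mddev_sketch *mddev);

          /* md_write_start() gives up instead of blocking when a suspend is in
           * progress, so the metadata update it would wait for can never
           * deadlock against mddev_suspend(). */
          static bool md_write_start_sketch(struct mddev_sketch *mddev)
          {
                  if (mddev->suspended)
                          return false;   /* caller must unwind */
                  /* ... mark the array 'dirty' and wait for the metadata update ... */
                  return true;
          }

          /* The request entry point detects the abort, drops its ->active_io
           * reference so the suspend can complete, waits, and retries. */
          static void md_make_request_sketch(struct mddev_sketch *mddev, struct bio *bio)
          {
                  atomic_inc(&mddev->active_io);
                  while (!personality_make_request(mddev, bio)) {
                          atomic_dec(&mddev->active_io);
                          wait_until_resumed(mddev);
                          atomic_inc(&mddev->active_io);
                  }
                  atomic_dec(&mddev->active_io);
          }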
  4. 06 June 2017, 1 commit
  5. 01 June 2017, 1 commit
    • md: Make flush bios explicitly sync · 5a8948f8
      Committed by Jan Kara
      Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
      definitions.  generic_make_request_checks() however strips REQ_FUA and
      REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
      write cache and thus write effectively becomes asynchronous which can
      lead to performance regressions.
      
      Fix the problem by making sure all bios which are synchronous are
      properly marked with REQ_SYNC.
      
      CC: linux-raid@vger.kernel.org
      CC: Shaohua Li <shli@kernel.org>
      Fixes: b685d3d6 ("block: treat REQ_FUA and REQ_PREFLUSH as synchronous")
      CC: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Shaohua Li <shli@fb.com>
      5a8948f8
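      A minimal sketch of the kind of change described (REQ_OP_WRITE,
      REQ_PREFLUSH and REQ_SYNC are real block-layer flags; the function is an
      illustration, not the actual md flush path):

          #include <linux/bio.h>
          #include <linux/blk_types.h>

          /* Mark a flush bio explicitly synchronous: even if the device reports
           * no volatile write cache and REQ_PREFLUSH/REQ_FUA get stripped by
           * generic_make_request_checks(), the write is still treated as sync. */
          static void prepare_flush_bio(struct bio *bio)
          {
                  bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
          }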
  6. 31 May 2017, 1 commit
    • dm: make flush bios explicitly sync · ff0361b3
      Committed by Jan Kara
      Commit b685d3d6 ("block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous") removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
      definitions.  generic_make_request_checks() however strips REQ_FUA and
      REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
      write cache and thus write effectively becomes asynchronous which can
      lead to performance regressions.
      
      Fix the problem by making sure all bios which are synchronous are
      properly marked with REQ_SYNC.
      
      Fixes: b685d3d6 ("block: treat REQ_FUA and REQ_PREFLUSH as synchronous")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      ff0361b3
  7. 25 May 2017, 2 commits
  8. 23 May 2017, 3 commits
  9. 22 May 2017, 1 commit
  10. 17 May 2017, 2 commits
  11. 16 May 2017, 5 commits
  12. 15 May 2017, 8 commits
  13. 13 May 2017, 1 commit
    • raid1: prefer disk without bad blocks · d82dd0e3
      Committed by Tomasz Majchrzak
      If an array consists of two drives and the first drive has a bad
      block, a read request to the region overlapping the bad block chooses
      the same disk (with the bad block) as the device to read from over and
      over, and the request gets stuck. If the first disk only partially
      overlaps with the bad block, it becomes a candidate ("best disk") for a
      shorter range of sectors. The second disk is capable of reading the
      entire requested range and it is updated accordingly, however it is not
      recorded as the best device for the request. In the end the request is
      sent to the first disk to read the entire range of sectors. It fails and
      is re-tried in a moment but with the same outcome.
      
      This is actually quite a likely scenario, but it had little exposure in
      my testing until commit 715d40b93b10 ("md/raid1: add failfast handling
      for reads.") removed the preference for an idle disk. Such a scenario
      had been passing as the second disk was always chosen when idle.
      
      Reset the candidate ("best disk") to read from if a disk can read the
      entire range. Do it only if another disk has already been chosen as a
      candidate for a smaller range. The head position / disk type logic will
      then select the best disk to read from - this is fine, as the disk with
      the bad block won't be considered for it.
      Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      d82dd0e3
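      A hedged sketch of the selection rule described above (all names are
      illustrative; the real logic lives in raid1's read_balance()):

          #include <linux/types.h>

          /* good_sectors[d] = sectors disk d can serve before hitting a bad block. */
          static int choose_read_disk_sketch(int ndisks, const sector_t *good_sectors,
                                             sector_t request_sectors)
          {
                  sector_t best_good = 0;
                  int best_disk = -1;
                  int d;

                  for (d = 0; d < ndisks; d++) {
                          sector_t sectors = good_sectors[d];

                          if (sectors < request_sectors) {
                                  /* Overlaps a bad block: candidate for a shorter
                                   * range only. */
                                  if (sectors > best_good) {
                                          best_good = sectors;
                                          best_disk = d;
                                  }
                          } else {
                                  /* This disk covers the whole request.  Drop a
                                   * candidate chosen only for a shorter range, so
                                   * the head position / disk type heuristics pick
                                   * among disks that can read everything. */
                                  if (best_disk >= 0 && best_good < request_sectors)
                                          best_disk = -1;
                                  best_good = request_sectors;
                                  if (best_disk < 0)
                                          best_disk = d;
                          }
                  }
                  return best_disk;
          }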
  14. 12 May 2017, 3 commits
    • md/r5cache: handle sync with data in write back cache · 5ddf0440
      Committed by Song Liu
      Currently, sync of a raid456 array cannot make progress when hitting
      data in writeback r5cache.
      
      This patch fixes this issue by flushing cached data of the stripe
      before processing the sync request. This is achieved by:
      
      1. In handle_stripe(), do not set STRIPE_SYNCING if the stripe is
         in write back cache;
      2. In r5c_try_caching_write(), handle the stripe in sync with write
         through;
      3. In do_release_stripe(), make stripe in sync write out and send
         it to the state machine.
      
      Shaohua: explicitly set STRIPE_HANDLE after write out completed
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      5ddf0440
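      A hedged sketch of step 1 above (don't begin a sync while the stripe still
      has data in the write-back cache); the structure and flag bits are
      simplified stand-ins, not the raid5 definitions:

          #include <linux/bitops.h>

          struct stripe_sketch {
                  unsigned long state;    /* stand-in for STRIPE_* bits */
                  int injournal;          /* data blocks still cached in the journal */
                  bool sync_requested;
          };

          enum { SKETCH_STRIPE_SYNCING, SKETCH_STRIPE_HANDLE };

          void write_out_cached_data(struct stripe_sketch *sh);

          static void handle_stripe_sync_sketch(struct stripe_sketch *sh)
          {
                  if (!sh->sync_requested)
                          return;

                  if (sh->injournal > 0) {
                          /* Steps 2-3: switch this stripe to write-through and push
                           * its cached data out, then re-run the state machine. */
                          write_out_cached_data(sh);
                          set_bit(SKETCH_STRIPE_HANDLE, &sh->state);
                          return;
                  }

                  set_bit(SKETCH_STRIPE_SYNCING, &sh->state);
                  /* ... normal resync/recovery handling ... */
          }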
    • md/r5cache: gracefully handle journal device errors for writeback mode · 70d466f7
      Committed by Song Liu
      For raid456 with a writeback cache, when the journal device fails during
      normal operation, it is still possible to persist all data, as all
      pending data is still in the stripe cache. However, it is necessary to
      handle the journal failure gracefully.
      
      During journal failures, the following logic handles the graceful shutdown
      of the journal:
      1. raid5_error() marks the device as Faulty and schedules async work
         log->disable_writeback_work;
      2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is
         suspended, set to write through, and then resumed. mddev_suspend()
         flushes all cached stripes;
      3. All cached stripes need to be flushed carefully to the RAID array.
      
      This patch fixes issues within the process above:
      1. In r5c_update_on_rdev_error(), schedule disable_writeback_work for
         journal failures;
      2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING,
         since raid5_error() updates the superblock;
      3. In handle_stripe(), allow stripes with data in the journal (s.injournal > 0)
         to make progress during log_failed;
      4. In delay_towrite(), if the log failed, only process data in the cache (skip
         new writes in dev->towrite);
      5. In __get_priority_stripe(), process loprio_list during journal device
         failures;
      6. In raid5_remove_disk(), wait until all cached stripes are flushed before
         calling log_exit().
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      70d466f7
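      A hedged sketch of items 1-2 above (schedule_work(), wait_event(),
      container_of() and test_bit() are real kernel primitives; the structures,
      bit number and function names are illustrative):

          #include <linux/wait.h>
          #include <linux/workqueue.h>
          #include <linux/bitops.h>
          #include <linux/kernel.h>

          #define SKETCH_SB_CHANGE_PENDING 0      /* stand-in for MD_SB_CHANGE_PENDING */

          struct mddev_sketch {
                  wait_queue_head_t sb_wait;
                  unsigned long sb_flags;
          };

          struct r5l_log_sketch {
                  struct mddev_sketch *mddev;
                  struct work_struct disable_writeback_work;
          };

          /* Item 1: on a journal device error, defer the heavier write-back
           * disable to a work item. */
          static void on_journal_error_sketch(struct r5l_log_sketch *log)
          {
                  schedule_work(&log->disable_writeback_work);
          }

          /* Item 2: wait for the superblock update triggered by raid5_error()
           * before suspending and switching the log to write-through. */
          static void disable_writeback_work_fn(struct work_struct *work)
          {
                  struct r5l_log_sketch *log =
                          container_of(work, struct r5l_log_sketch, disable_writeback_work);
                  struct mddev_sketch *mddev = log->mddev;

                  wait_event(mddev->sb_wait,
                             !test_bit(SKETCH_SB_CHANGE_PENDING, &mddev->sb_flags));
                  /* ... mddev_suspend(); set write-through; mddev_resume(); ... */
          }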
    • md/raid1/10: avoid unnecessary locking · 23b245c0
      Committed by Shaohua Li
      If we add bios to the block plugging list, locking is unnecessary, since the
      block unplug is guaranteed not to run at that time.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      23b245c0
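      A hedged sketch of the plugging pattern (blk_check_plugged(), bio_list_add()
      and bio_list_pop() are real block-layer helpers; the plug structure and the
      submit/fallback paths are illustrative):

          #include <linux/blkdev.h>
          #include <linux/bio.h>

          struct raid_plug_sketch {
                  struct blk_plug_cb cb;
                  struct bio_list pending;
          };

          void submit_one_bio_sketch(struct bio *bio);
          void queue_locked_fallback_sketch(void *conf, struct bio *bio);

          /* Runs when the submitting task releases its plug, so nothing else can
           * be appending to plug->pending at the same time. */
          static void raid_unplug_sketch(struct blk_plug_cb *cb, bool from_schedule)
          {
                  struct raid_plug_sketch *plug =
                          container_of(cb, struct raid_plug_sketch, cb);
                  struct bio *bio;

                  while ((bio = bio_list_pop(&plug->pending)) != NULL)
                          submit_one_bio_sketch(bio);
          }

          static void queue_write_sketch(void *conf, struct bio *bio)
          {
                  /* blk_check_plugged() allocates the callback zeroed, so the
                   * 'pending' bio_list starts out empty. */
                  struct blk_plug_cb *cb =
                          blk_check_plugged(raid_unplug_sketch, conf,
                                            sizeof(struct raid_plug_sketch));

                  if (cb) {
                          struct raid_plug_sketch *plug =
                                  container_of(cb, struct raid_plug_sketch, cb);
                          /* No spinlock: the unplug callback cannot run while this
                           * task is still adding to its own plug list. */
                          bio_list_add(&plug->pending, bio);
                          return;
                  }
                  /* No plug active: fall back to the locked pending list. */
                  queue_locked_fallback_sketch(conf, bio);
          }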
  15. 11 May 2017, 1 commit
  16. 09 May 2017, 5 commits