1. 25 8月, 2017 2 次提交
    • S
      md/raid5: release/flush io in raid5_do_work() · 9c72a18e
      Song Liu 提交于
      In raid5, there are scenarios where some ios are deferred to a later
      time, and some IO need a flush to complete. To make sure we make
      progress with these IOs, we need to call the following functions:
      
          flush_deferred_bios(conf);
          r5l_flush_stripe_to_raid(conf->log);
      
      Both of these functions are called in raid5d(), but missing in
      raid5_do_work(). As a result, these functions are not called
      when multi-threading (group_thread_cnt > 0) is enabled. This patch
      adds calls to these function to raid5_do_work().
      
      Note for stable branches:
      
        r5l_flush_stripe_to_raid(conf->log) is need for 4.4+
        flush_deferred_bios(conf) is only needed for 4.11+
      
      Cc: stable@vger.kernel.org (4.4+)
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      9c72a18e
    • S
      md/bitmap: copy correct data for bitmap super · 8031c3dd
      Shaohua Li 提交于
      raid5 cache could write bitmap superblock before bitmap superblock is
      initialized. The bitmap superblock is less than 512B. The current code will
      only copy the superblock to a new page and write the whole 512B, which will
      zero the the data after the superblock. Unfortunately the data could include
      bitmap, which we should preserve. The patch will make superblock read do 4k
      chunk and we always copy the 4k data to new page, so the superblock write will
      old data to disk and we don't change the bitmap.
      Reported-by: NSong Liu <songliubraving@fb.com>
      Reviewed-by: NSong Liu <songliubraving@fb.com>
      Cc: stable@vger.kernel.org (4.10+)
      Signed-off-by: NShaohua Li <shli@fb.com>
      8031c3dd
  2. 12 8月, 2017 1 次提交
  3. 08 8月, 2017 4 次提交
    • S
      md/r5cache: fix io_unit handling in r5l_log_endio() · a9501d74
      Song Liu 提交于
      In r5l_log_endio(), once log->io_list_lock is released, the io unit
      may be accessed (or even freed) by other threads. Current code
      doesn't handle the io_unit properly, which leads to potential race
      conditions.
      
      This patch solves this race condition by:
      
      1. Add a pending_stripe count flush_payload. Multiple flush_payloads
         are counted as only one pending_stripe. Flag has_flush_payload is
         added to show whether the io unit has flush_payload;
      2. In r5l_log_endio(), check flags has_null_flush and
         has_flush_payload with log->io_list_lock held. After the lock
         is released, this IO unit is only accessed when we know the
         pending_stripe counter cannot be zeroed by other threads.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      a9501d74
    • S
      md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_set · b44886c5
      Song Liu 提交于
      In r5c_journal_mode_set(), it is necessary to call mddev_lock()
      before accessing conf and conf->log. Otherwise, the conf->log
      may change (and become NULL).
      
      Shaohua: fix unlock in failure cases
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      b44886c5
    • N
      md: fix test in md_write_start() · 81fe48e9
      NeilBrown 提交于
      md_write_start() needs to clear the in_sync flag is it is set, or if
      there might be a race with set_in_sync() such that the later will
      set it very soon.  In the later case it is sufficient to take the
      spinlock to synchronize with set_in_sync(), and then set the flag
      if needed.
      
      The current test is incorrect.
      It should be:
        if "flag is set" or "race is possible"
      
      "flag is set" is trivially "mddev->in_sync".
      "race is possible" should be tested by "mddev->sync_checkers".
      
      If sync_checkers is 0, then there can be no race.  set_in_sync() will
      wait in percpu_ref_switch_to_atomic_sync() for an RCU grace period,
      and as md_write_start() holds the rcu_read_lock(), set_in_sync() will
      be sure ot see the update to writes_pending.
      
      If sync_checkers is > 0, there could be race.  If md_write_start()
      happened entirely between
      		if (!mddev->in_sync &&
      		    percpu_ref_is_zero(&mddev->writes_pending)) {
      and
      			mddev->in_sync = 1;
      in set_in_sync(), then it would not see that is_sync had been set,
      and set_in_sync() would not see that writes_pending had been
      incremented.
      
      This bug means that in_sync is sometimes not set when it should be.
      Consequently there is a small chance that the array will be marked as
      "clean" when in fact it is inconsistent.
      
      Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
      cc: stable@vger.kernel.org (v4.12+)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      81fe48e9
    • N
      md: always clear ->safemode when md_check_recovery gets the mddev lock. · 33182d15
      NeilBrown 提交于
      If ->safemode == 1, md_check_recovery() will try to get the mddev lock
      and perform various other checks.
      If mddev->in_sync is zero, it will call set_in_sync, and clear
      ->safemode.  However if mddev->in_sync is not zero, ->safemode will not
      be cleared.
      
      When md_check_recovery() drops the mddev lock, the thread is woken
      up again.  Normally it would just check if there was anything else to
      do, find nothing, and go to sleep.  However as ->safemode was not
      cleared, it will take the mddev lock again, then wake itself up
      when unlocking.
      
      This results in an infinite loop, repeatedly calling
      md_check_recovery(), which RCU or the soft-lockup detector
      will eventually complain about.
      
      Prior to commit 4ad23a97 ("MD: use per-cpu counter for
      writes_pending"), safemode would only be set to one when the
      writes_pending counter reached zero, and would be cleared again
      when writes_pending is incremented.  Since that patch, safemode
      is set more freely, but is not reliably cleared.
      
      So in md_check_recovery() clear ->safemode before checking ->in_sync.
      
      Fixes: 4ad23a97 ("MD: use per-cpu counter for writes_pending")
      Cc: stable@vger.kernel.org (4.12+)
      Reported-by: NDominik Brodowski <linux@dominikbrodowski.net>
      Reported-by: NDavid R <david@unsolicited.net>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      33182d15
  4. 27 7月, 2017 3 次提交
  5. 26 7月, 2017 6 次提交
  6. 25 7月, 2017 3 次提交
  7. 24 7月, 2017 1 次提交
  8. 22 7月, 2017 5 次提交
  9. 20 7月, 2017 2 次提交
  10. 13 7月, 2017 1 次提交
  11. 11 7月, 2017 2 次提交
    • X
      Raid5 should update rdev->sectors after reshape · b5d27718
      Xiao Ni 提交于
      The raid5 md device is created by the disks which we don't use the total size. For example,
      the size of the device is 5G and it just uses 3G of the devices to create one raid5 device.
      Then change the chunksize and wait reshape to finish. After reshape finishing stop the raid
      and assemble it again. It fails.
      mdadm -CR /dev/md0 -l5 -n3 /dev/loop[0-2] --size=3G --chunk=32 --assume-clean
      mdadm /dev/md0 --grow --chunk=64
      wait reshape to finish
      mdadm -S /dev/md0
      mdadm -As
      The error messages:
      [197519.814302] md: loop1 does not have a valid v1.2 superblock, not importing!
      [197519.821686] md: md_import_device returned -22
      
      After reshape the data offset is changed. It selects backwards direction in this condition.
      In function super_1_load it compares the available space of the underlying device with
      sb->data_size. The new data offset gets bigger after reshape. So super_1_load returns -EINVAL.
      rdev->sectors is updated in md_finish_reshape. Then sb->data_size is set in super_1_sync based
      on rdev->sectors. So add md_finish_reshape in end_reshape.
      Signed-off-by: NXiao Ni <xni@redhat.com>
      Acked-by: NGuoqing Jiang <gqjiang@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NShaohua Li <shli@fb.com>
      b5d27718
    • G
      md/bitmap: don't read page from device with Bitmap_sync · 4aaf7694
      Guoqing Jiang 提交于
      The device owns Bitmap_sync flag needs recovery
      to become in sync, and read page from this type
      device could get stale status.
      
      Also add comments for Bitmap_sync bit per the
      suggestion from Shaohua and Neil.
      
      Previous disscussion can be found here:
      https://marc.info/?t=149760428900004&r=1&w=2Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      4aaf7694
  12. 06 7月, 2017 1 次提交
  13. 04 7月, 2017 2 次提交
  14. 30 6月, 2017 1 次提交
  15. 28 6月, 2017 2 次提交
    • V
      dm thin: do not queue freed thin mapping for next stage processing · 00a0ea33
      Vallish Vaidyeshwara 提交于
      process_prepared_discard_passdown_pt1() should cleanup
      dm_thin_new_mapping in cases of error.
      
      dm_pool_inc_data_range() can fail trying to get a block reference:
      
      metadata operation 'dm_pool_inc_data_range' failed: error = -61
      
      When dm_pool_inc_data_range() fails, dm thin aborts current metadata
      transaction and marks pool as PM_READ_ONLY. Memory for thin mapping
      is released as well. However, current thin mapping will be queued
      onto next stage as part of queue_passdown_pt2() or passdown_endio().
      This dangling thin mapping memory when processed and accessed in
      next stage will lead to device mapper crashing.
      
      Code flow without fix:
      -> process_prepared_discard_passdown_pt1(m)
         -> dm_thin_remove_range()
         -> discard passdown
            --> passdown_endio(m) queues m onto next stage
         -> dm_pool_inc_data_range() fails, frees memory m
                  but does not remove it from next stage queue
      
      -> process_prepared_discard_passdown_pt2(m)
         -> processes freed memory m and crashes
      
      One such stack:
      
      Call Trace:
      [<ffffffffa037a46f>] dm_cell_release_no_holder+0x2f/0x70 [dm_bio_prison]
      [<ffffffffa039b6dc>] cell_defer_no_holder+0x3c/0x80 [dm_thin_pool]
      [<ffffffffa039b88b>] process_prepared_discard_passdown_pt2+0x4b/0x90 [dm_thin_pool]
      [<ffffffffa0399611>] process_prepared+0x81/0xa0 [dm_thin_pool]
      [<ffffffffa039e735>] do_worker+0xc5/0x820 [dm_thin_pool]
      [<ffffffff8152bf54>] ? __schedule+0x244/0x680
      [<ffffffff81087e72>] ? pwq_activate_delayed_work+0x42/0xb0
      [<ffffffff81089f53>] process_one_work+0x153/0x3f0
      [<ffffffff8108a71b>] worker_thread+0x12b/0x4b0
      [<ffffffff8108a5f0>] ? rescuer_thread+0x350/0x350
      [<ffffffff8108fd6a>] kthread+0xca/0xe0
      [<ffffffff8108fca0>] ? kthread_park+0x60/0x60
      [<ffffffff81530b45>] ret_from_fork+0x25/0x30
      
      The fix is to first take the block ref count for discarded block and
      then do a passdown discard of this block. If block ref count fails,
      then bail out aborting current metadata transaction, mark pool as
      PM_READ_ONLY and also free current thin mapping memory (existing error
      handling code) without queueing this thin mapping onto next stage of
      processing. If block ref count succeeds, then passdown discard of this
      block. Discard callback of passdown_endio() will queue this thin mapping
      onto next stage of processing.
      
      Code flow with fix:
      -> process_prepared_discard_passdown_pt1(m)
         -> dm_thin_remove_range()
         -> dm_pool_inc_data_range()
            --> if fails, free memory m and bail out
         -> discard passdown
            --> passdown_endio(m) queues m onto next stage
      
      Cc: stable <stable@vger.kernel.org> # v4.9+
      Reviewed-by: NEduardo Valentin <eduval@amazon.com>
      Reviewed-by: NCristian Gafton <gafton@amazon.com>
      Reviewed-by: NAnchal Agarwal <anchalag@amazon.com>
      Signed-off-by: NVallish Vaidyeshwara <vallish@amazon.com>
      Reviewed-by: NJoe Thornber <ejt@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      00a0ea33
    • C
      dm: don't set bounce limit · 41341afa
      Christoph Hellwig 提交于
      Now all queues allocators come without abounce limit by default,
      dm doesn't have to override this anymore.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      41341afa
  16. 24 6月, 2017 2 次提交
  17. 22 6月, 2017 2 次提交
    • N
      md: use a separate bio_set for synchronous IO. · 5a85071c
      NeilBrown 提交于
      md devices allocate a bio_set and use it for two
      distinct purposes.
      mddev->bio_set is used to clone bios as part of sending
      upper level requests down to lower level devices,
      and it is also use for synchronous IO such as superblock
      and bitmap updates, and for correcting read errors.
      
      This multiple usage can lead to deadlocks.  It is likely
      that cloned bios might be queued for write and to be
      waiting for a metadata update before the write can be permitted.
      If the cloning exhausted mddev->bio_set, the metadata update
      may not be able to proceed.
      
      This scenario has been seen during heavy testing, with lots of IO and
      lots of memory pressure.
      
      Address this by adding a new bio_set specifically for synchronous IO.
      All synchronous IO goes directly to the underlying device and is not
      queued at the md level, so request using entries from the new
      mddev->sync_set will complete in a timely fashion.
      Requests that use mddev->bio_set will sometimes need to wait
      for synchronous IO, but will no longer risk deadlocking that iO.
      
      Also: small simplification in mddev_put(): there is no need to
      wait until the spinlock is released before calling bioset_free().
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      5a85071c
    • M
      dm io: fix duplicate bio completion due to missing ref count · feb7695f
      Mike Snitzer 提交于
      If only a subset of the devices associated with multiple regions support
      a given special operation (eg. DISCARD) then the dec_count() that is
      used to set error for the region must increment the io->count.
      
      Otherwise, when the dec_count() is called it can cause the dm-io
      caller's bio to be completed multiple times.  As was reported against
      the dm-mirror target that had mirror legs with a mix of discard
      capabilities.
      
      Bug: https://bugzilla.kernel.org/show_bug.cgi?id=196077Reported-by: NZhang Yi <yizhan@redhat.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      feb7695f