1. 02 11月, 2017 3 次提交
    • S
      md: use lockdep_assert_held · efa4b77b
      Shaohua Li 提交于
      lockdep_assert_held is a better way to assert lock held, and it works
      for UP.
      Signed-off-by: NShaohua Li <shli@fb.com>
      efa4b77b
    • N
      md: remove special meaning of ->quiesce(.., 2) · b03e0ccb
      NeilBrown 提交于
      The '2' argument means "wake up anything that is waiting".
      This is an inelegant part of the design and was added
      to help support management of suspend_lo/suspend_hi setting.
      Now that suspend_lo/hi is managed in mddev_suspend/resume,
      that need is gone.
      These is still a couple of places where we call 'quiesce'
      with an argument of '2', but they can safely be changed to
      call ->quiesce(.., 1); ->quiesce(.., 0) which
      achieve the same result at the small cost of pausing IO
      briefly.
      
      This removes a small "optimization" from suspend_{hi,lo}_store,
      but it isn't clear that optimization served a useful purpose.
      The code now is a lot clearer.
      Suggested-by: NShaohua Li <shli@kernel.org>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      b03e0ccb
    • N
      md: always hold reconfig_mutex when calling mddev_suspend() · 4d5324f7
      NeilBrown 提交于
      Most often mddev_suspend() is called with
      reconfig_mutex held.  Make this a requirement in
      preparation a subsequent patch.  Also require
      reconfig_mutex to be held for mddev_resume(),
      partly for symmetry and partly to guarantee
      no races with incr/decr of mddev->suspend.
      
      Taking the mutex in r5c_disable_writeback_async() is
      a little tricky as this is called from a work queue
      via log->disable_writeback_work, and flush_work()
      is called on that while holding ->reconfig_mutex.
      If the work item hasn't run before flush_work()
      is called, the work function will not be able to
      get the mutex.
      
      So we use mddev_trylock() inside the wait_event() call, and have that
      abort when conf->log is set to NULL, which happens before
      flush_work() is called.
      We wait in mddev->sb_wait and ensure this is woken
      when any of the conditions change.  This requires
      waking mddev->sb_wait in mddev_unlock().  This is only
      like to trigger extra wake_ups of threads that needn't
      be woken when metadata is being written, and that
      doesn't happen often enough that the cost would be
      noticeable.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      4d5324f7
  2. 17 10月, 2017 1 次提交
  3. 26 8月, 2017 1 次提交
  4. 24 8月, 2017 1 次提交
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig 提交于
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different life time rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      74d46992
  5. 08 8月, 2017 2 次提交
    • S
      md/r5cache: fix io_unit handling in r5l_log_endio() · a9501d74
      Song Liu 提交于
      In r5l_log_endio(), once log->io_list_lock is released, the io unit
      may be accessed (or even freed) by other threads. Current code
      doesn't handle the io_unit properly, which leads to potential race
      conditions.
      
      This patch solves this race condition by:
      
      1. Add a pending_stripe count flush_payload. Multiple flush_payloads
         are counted as only one pending_stripe. Flag has_flush_payload is
         added to show whether the io unit has flush_payload;
      2. In r5l_log_endio(), check flags has_null_flush and
         has_flush_payload with log->io_list_lock held. After the lock
         is released, this IO unit is only accessed when we know the
         pending_stripe counter cannot be zeroed by other threads.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      a9501d74
    • S
      md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_set · b44886c5
      Song Liu 提交于
      In r5c_journal_mode_set(), it is necessary to call mddev_lock()
      before accessing conf and conf->log. Otherwise, the conf->log
      may change (and become NULL).
      
      Shaohua: fix unlock in failure cases
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      b44886c5
  6. 19 6月, 2017 1 次提交
  7. 09 6月, 2017 1 次提交
  8. 01 6月, 2017 1 次提交
    • J
      md: Make flush bios explicitely sync · 5a8948f8
      Jan Kara 提交于
      Commit b685d3d6 "block: treat REQ_FUA and REQ_PREFLUSH as
      synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
      definitions.  generic_make_request_checks() however strips REQ_FUA and
      REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
      write cache and thus write effectively becomes asynchronous which can
      lead to performance regressions
      
      Fix the problem by making sure all bios which are synchronous are
      properly marked with REQ_SYNC.
      
      CC: linux-raid@vger.kernel.org
      CC: Shaohua Li <shli@kernel.org>
      Fixes: b685d3d6
      CC: stable@vger.kernel.org
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NShaohua Li <shli@fb.com>
      5a8948f8
  9. 12 5月, 2017 2 次提交
    • S
      md/r5cache: handle sync with data in write back cache · 5ddf0440
      Song Liu 提交于
      Currently, sync of raid456 array cannot make progress when hitting
      data in writeback r5cache.
      
      This patch fixes this issue by flushing cached data of the stripe
      before processing the sync request. This is achived by:
      
      1. In handle_stripe(), do not set STRIPE_SYNCING if the stripe is
         in write back cache;
      2. In r5c_try_caching_write(), handle the stripe in sync with write
         through;
      3. In do_release_stripe(), make stripe in sync write out and send
         it to the state machine.
      
      Shaohua: explictly set STRIPE_HANDLE after write out completed
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      5ddf0440
    • S
      md/r5cache: gracefully handle journal device errors for writeback mode · 70d466f7
      Song Liu 提交于
      For the raid456 with writeback cache, when journal device failed during
      normal operation, it is still possible to persist all data, as all
      pending data is still in stripe cache. However, it is necessary to handle
      journal failure gracefully.
      
      During journal failures, the following logic handles the graceful shutdown
      of journal:
      1. raid5_error() marks the device as Faulty and schedules async work
         log->disable_writeback_work;
      2. In disable_writeback_work (r5c_disable_writeback_async), the mddev is
         suspended, set to write through, and then resumed. mddev_suspend()
         flushes all cached stripes;
      3. All cached stripes need to be flushed carefully to the RAID array.
      
      This patch fixes issues within the process above:
      1. In r5c_update_on_rdev_error() schedule disable_writeback_work for
         journal failures;
      2. In r5c_disable_writeback_async(), wait for MD_SB_CHANGE_PENDING,
         since raid5_error() updates superblock.
      3. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
         to make progress during log_failed;
      4. In delay_towrite(), if log failed only process data in the cache (skip
         new writes in dev->towrite);
      5. In __get_priority_stripe(), process loprio_list during journal device
         failures.
      6. In raid5_remove_disk(), wait for all cached stripes are flushed before
         calling log_exit().
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      70d466f7
  10. 11 5月, 2017 1 次提交
  11. 27 3月, 2017 1 次提交
  12. 26 3月, 2017 1 次提交
  13. 23 3月, 2017 3 次提交
    • N
      md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter · 016c76ac
      NeilBrown 提交于
      md/raid5 needs to keep track of how many stripe_heads are processing a
      bio so that it can delay calling bio_endio() until all stripe_heads
      have completed.  It currently uses 16 bits of ->bi_phys_segments for
      this purpose.
      
      16 bits is only enough for 256M requests, and it is possible for a
      single bio to be larger than this, which causes problems.  Also, the
      bio struct contains a larger counter, __bi_remaining, which has a
      purpose very similar to the purpose of our counter.  So stop using
      ->bi_phys_segments, and instead use __bi_remaining.
      
      This means we don't need to initialize the counter, as our caller
      initializes it to '1'.  It also means we can call bio_endio() directly
      as it tests this counter internally.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      016c76ac
    • N
      md/raid5: call bio_endio() directly rather than queueing for later. · bd83d0a2
      NeilBrown 提交于
      We currently gather bios that need to be returned into a bio_list
      and call bio_endio() on them all together.
      The original reason for this was to avoid making the calls while
      holding a spinlock.
      Locking has changed a lot since then, and that reason is no longer
      valid.
      
      So discard return_io() and various return_bi lists, and just call
      bio_endio() directly as needed.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      bd83d0a2
    • N
      md/raid5: use md_write_start to count stripes, not bios · 49728050
      NeilBrown 提交于
      We use md_write_start() to increase the count of pending writes, and
      md_write_end() to decrement the count.  We currently count bios
      submitted to md/raid5.  Change it count stripe_heads that a WRITE bio
      has been attached to.
      
      So now, raid5_make_request() calls md_write_start() and then
      md_write_end() to keep the count elevated during the setup of the
      request.
      
      add_stripe_bio() calls md_write_start() for each stripe_head, and the
      completion routines always call md_write_end(), instead of only
      calling it when raid5_dec_bi_active_stripes() returns 0.
      make_discard_request also calls md_write_start/end().
      
      The parallel between md_write_{start,end} and use of bi_phys_segments
      can be seen in that:
       Whenever we set bi_phys_segments to 1, we now call md_write_start.
       Whenever we increment it on non-read requests with
         raid5_inc_bi_active_stripes(), we now call md_write_start().
       Whenever we decrement bi_phys_segments on non-read requsts with
          raid5_dec_bi_active_stripes(), we now call md_write_end().
      
      This reduces our dependence on keeping a per-bio count of active
      stripes in bi_phys_segments.
      
      md_write_inc() is added which parallels md_write_start(), but requires
      that a write has already been started, and is certain never to sleep.
      This can be used inside a spinlocked region when adding to a write
      request.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      49728050
  14. 17 3月, 2017 5 次提交
    • S
      md/r5cache: generate R5LOG_PAYLOAD_FLUSH · ea17481f
      Song Liu 提交于
      In r5c_finish_stripe_write_out(), R5LOG_PAYLOAD_FLUSH is append to
      log->current_io.
      
      Appending R5LOG_PAYLOAD_FLUSH in quiesce needs extra writes to
      journal. To simplify the logic, we just skip R5LOG_PAYLOAD_FLUSH in
      quiesce.
      
      Even R5LOG_PAYLOAD_FLUSH supports multiple stripes per payload.
      However, current implementation is one stripe per R5LOG_PAYLOAD_FLUSH,
      which is simpler.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ea17481f
    • S
      md/r5cache: handle R5LOG_PAYLOAD_FLUSH in recovery · 2d4f4687
      Song Liu 提交于
      This patch adds handling of R5LOG_PAYLOAD_FLUSH in journal recovery.
      Next patch will add logic that generate R5LOG_PAYLOAD_FLUSH on flush
      finish.
      
      When R5LOG_PAYLOAD_FLUSH is seen in recovery, pending data and parity
      will be dropped from recovery. This will reduce the number of stripes
      to replay, and thus accelerate the recovery process.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2d4f4687
    • A
      raid5: separate header for log functions · ff875738
      Artur Paszkiewicz 提交于
      Move raid5-cache declarations from raid5.h to raid5-log.h, add inline
      wrappers for functions which will be shared with ppl and use them in
      raid5 core instead of direct calls to raid5-cache.
      
      Remove unused parameter from r5c_cache_data(), move two duplicated
      pr_debug() calls to r5l_init_log().
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ff875738
    • S
      md/r5cache: improve recovery with read ahead page pool · effe6ee7
      Song Liu 提交于
      In r5cache recovery, the journal device is scanned page by page.
      Currently, we use sync_page_io() to read journal device. This is
      not efficient when we have to recovery many stripes from the journal.
      
      To improve the speed of recovery, this patch introduces a read ahead
      page pool (ra_pool) to recovery_ctx. With ra_pool, multiple consecutive
      pages are read in one IO. Then the recovery code read the journal from
      ra_pool.
      
      With ra_pool, r5l_recovery_ctx has become much bigger. Therefore,
      r5l_recovery_log() is refactored so r5l_recovery_ctx is not using
      stack space.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      effe6ee7
    • S
      md/raid5-cache: bump flush stripe batch size · 84890c03
      Shaohua Li 提交于
      Bump the flush stripe batch size to 2048. For my 12 disks raid
      array, the stripes takes:
      12 * 4k * 2048 = 96MB
      
      This is still quite small. A hardware raid card generally has 1GB size,
      which we suggest the raid5-cache has similar cache size.
      
      The advantage of a big batch size is we can dispatch a lot of IO in the
      same time, then we can do some scheduling to make better IO pattern.
      
      Last patch prioritizes stripes, so we don't worry about a big flush
      stripe batch will starve normal stripes.
      Signed-off-by: NShaohua Li <shli@fb.com>
      84890c03
  15. 14 2月, 2017 4 次提交
    • S
      md/raid5-cache: exclude reclaiming stripes in reclaim check · e33fbb9c
      Shaohua Li 提交于
      stripes which are being reclaimed are still accounted into cached
      stripes. The reclaim takes time. r5c_do_reclaim isn't aware of the
      stripes and does unnecessary stripe reclaim. In practice, I saw one
      stripe is reclaimed one time. This will cause bad IO pattern. Fixing
      this by excluding the reclaing stripes in the check.
      
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      e33fbb9c
    • S
      md/raid5-cache: stripe reclaim only counts valid stripes · e8fd52ee
      Shaohua Li 提交于
      When log space is tight, we try to reclaim stripes from log head. There
      are stripes which can't be reclaimed right now if some conditions are
      met. We skip such stripes but accidentally count them, which might cause
      no stripes are claimed. Fixing this by only counting valid stripes.
      
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      e8fd52ee
    • S
      md/r5cache: improve journal device efficiency · 39b99586
      Song Liu 提交于
      It is important to be able to flush all stripes in raid5-cache.
      Therefore, we need reserve some space on the journal device for
      these flushes. If flush operation includes pending writes to the
      stripe, we need to reserve (conf->raid_disk + 1) pages per stripe
      for the flush out. This reduces the efficiency of journal space.
      If we exclude these pending writes from flush operation, we only
      need (conf->max_degraded + 1) pages per stripe.
      
      With this patch, when log space is critical (R5C_LOG_CRITICAL=1),
      pending writes will be excluded from stripe flush out. Therefore,
      we can reduce reserved space for flush out and thus improve journal
      device efficiency.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      39b99586
    • S
      md/r5cache: enable chunk_aligned_read with write back cache · 03b047f4
      Song Liu 提交于
      Chunk aligned read significantly reduces CPU usage of raid456.
      However, it is not safe to fully bypass the write back cache.
      This patch enables chunk aligned read with write back cache.
      
      For chunk aligned read, we track stripes in write back cache at
      a bigger granularity, "big_stripe". Each chunk may contain more
      than one stripe (for example, a 256kB chunk contains 64 4kB-page,
      so this chunk contain 64 stripes). For chunk_aligned_read, these
      stripes are grouped into one big_stripe, so we only need one lookup
      for the whole chunk.
      
      For each big_stripe, struct big_stripe_info tracks how many stripes
      of this big_stripe are in the write back cache. We count how many
      stripes of this big_stripe are in the write back cache. These
      counters are tracked in a radix tree (big_stripe_tree).
      r5c_tree_index() is used to calculate keys for the radix tree.
      
      chunk_aligned_read() calls r5c_big_stripe_cached() to look up
      big_stripe of each chunk in the tree. If this big_stripe is in the
      tree, chunk_aligned_read() aborts. This look up is protected by
      rcu_read_lock().
      
      It is necessary to remember whether a stripe is counted in
      big_stripe_tree. Instead of adding new flag, we reuses existing flags:
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
      two flags are set, the stripe is counted in big_stripe_tree. This
      requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
      r5c_try_caching_write(); and moving clear_bit of
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
      r5c_finish_stripe_write_out().
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      03b047f4
  16. 25 1月, 2017 4 次提交
    • S
      md/r5cache: disable write back for degraded array · 2e38a37f
      Song Liu 提交于
      write-back cache in degraded mode introduces corner cases to the array.
      Although we try to cover all these corner cases, it is safer to just
      disable write-back cache when the array is in degraded mode.
      
      In this patch, we disable writeback cache for degraded mode:
      1. On device failure, if the array enters degraded mode, raid5_error()
         will submit async job r5c_disable_writeback_async to disable
         writeback;
      2. In r5c_journal_mode_store(), it is invalid to enable writeback in
         degraded mode;
      3. In r5c_try_caching_write(), stripes with s->failed>0 will be handled
         in write-through mode.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2e38a37f
    • S
      md/r5cache: flush data only stripes in r5l_recovery_log() · a85dd7b8
      Song Liu 提交于
      For safer operation, all arrays start in write-through mode, which has been
      better tested and is more mature. And actually the write-through/write-mode
      isn't persistent after array restarted, so we always start array in
      write-through mode. However, if recovery found data-only stripes before the
      shutdown (from previous write-back mode), it is not safe to start the array in
      write-through mode, as write-through mode can not handle stripes with data in
      write-back cache. To solve this problem, we flush all data-only stripes in
      r5l_recovery_log(). When r5l_recovery_log() returns, the array starts with
      empty cache in write-through mode.
      
      This logic is implemented in r5c_recovery_flush_data_only_stripes():
      
      1. enable write back cache
      2. flush all stripes
      3. wake up conf->mddev->thread
      4. wait for all stripes get flushed (reuse wait_for_quiescent)
      5. disable write back cache
      
      The wait in 4 will be waked up in release_inactive_stripe_list()
      when conf->active_stripes reaches 0.
      
      It is safe to wake up mddev->thread here because all the resource
      required for the thread has been initialized.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      a85dd7b8
    • S
      md/r5cache: read data into orig_page for prexor of cached data · 86aa1397
      Song Liu 提交于
      With write back cache, we use orig_page to do prexor. This patch
      makes sure we read data into orig_page for it.
      
      Flag R5_OrigPageUPTDODATE is added to show whether orig_page
      has the latest data from raid disk.
      
      We introduce a helper function uptodate_for_rmw() to simplify
      the a couple conditions in handle_stripe_dirtying().
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      86aa1397
    • S
      md/raid5-cache: delete meaningless code · d46d29f0
      Shaohua Li 提交于
      sector_t is unsigned long, it's never < 0
      Reported-by: NJulia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d46d29f0
  17. 06 1月, 2017 4 次提交
  18. 09 12月, 2016 4 次提交
    • S
      md: separate flags for superblock changes · 2953079c
      Shaohua Li 提交于
      The mddev->flags are used for different purposes. There are a lot of
      places we check/change the flags without masking unrelated flags, we
      could check/change unrelated flags. These usage are most for superblock
      write, so spearate superblock related flags. This should make the code
      clearer and also fix real bugs.
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2953079c
    • S
      md/r5cache: after recovery, increase journal seq by 10000 · 3c6edc66
      Song Liu 提交于
      Currently, we increase journal entry seq by 10 after recovery.
      However, this is not sufficient in the following case.
      
      After crash the journal looks like
      
      | seq+0 | +1 | +2 | +3 | +4 | +5 | +6 | +7 | ... | +11 | +12 |
      
      If +1 is not valid, we dropped all entries from +1 to +12; and
      write seq+10:
      
      | seq+0 | +10 | +2 | +3 | +4 | +5 | +6 | +7 | ... | +11 | +12 |
      
      However, if we write a big journal entry with seq+11, it will
      connect with some stale journal entry:
      
      | seq+0 | +10 |                     +11                 | +12 |
      
      To reduce the risk of this issue, we increase seq by 10000 instead.
      
      Shaohua: use 10000 instead of 1000. The risk should be very unlikely. The total
      stripe cache size is less than 2k typically, and several stripes can fit into
      one meta data block. So the total inflight meta data blocks would be quite
      small, which means the the total sequence number used should be quite small.
      The 10000 sequence number increase should be far more than safe.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      3c6edc66
    • S
      md/raid5-cache: fix crc in rewrite_data_only_stripes() · 5c88f403
      Song Liu 提交于
      r5l_recovery_create_empty_meta_block() creates crc for the empty
      metablock. After the metablock is updated, we need clear the
      checksum before recalculate it.
      
      Shaohua: moved checksum calculation out of
      r5l_recovery_create_empty_meta_block. We should calculate it after all fields
      are updated.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      5c88f403
    • J
      md/raid5-cache: no recovery is required when create super-block · d30dfeb9
      JackieLiu 提交于
      When create the super-block information, We do not need to do this
      recovery stage, only need to initialize some variables.
      Signed-off-by: NJackieLiu <liuyun01@kylinos.cn>
      Reviewed-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d30dfeb9