1. 09 4月, 2017 1 次提交
  2. 07 4月, 2017 1 次提交
    • N
      block: trace completion of all bios. · fbbaf700
      NeilBrown 提交于
      Currently only dm and md/raid5 bios trigger
      trace_block_bio_complete().  Now that we have bio_chain() and
      bio_inc_remaining(), it is not possible, in general, for a driver to
      know when the bio is really complete.  Only bio_endio() knows that.
      
      So move the trace_block_bio_complete() call to bio_endio().
      
      Now trace_block_bio_complete() pairs with trace_block_bio_queue().
      Any bio for which a 'queue' event is traced, will subsequently
      generate a 'complete' event.
      
      There are a few cases where completion tracing is not wanted.
      1/ If blk_update_request() has already generated a completion
         trace event at the 'request' level, there is no point generating
         one at the bio level too.  In this case the bi_sector and bi_size
         will have changed, so the bio level event would be wrong
      
      2/ If the bio hasn't actually been queued yet, but is being aborted
         early, then a trace event could be confusing.  Some filesystems
         call bio_endio() but do not want tracing.
      
      3/ The bio_integrity code interposes itself by replacing bi_end_io,
         then restoring it and calling bio_endio() again.  This would produce
         two identical trace events if left like that.
      
      To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
      produce the trace event when this is set.
      We address point 1 above by clearing the flag in blk_update_request().
      We address point 2 above by only setting the flag when
      generic_make_request() is called.
      We address point 3 above by clearing the flag after generating a
      completion event.
      
      When bio_split() is used on a bio, particularly in blk_queue_split(),
      there is an extra complication.  A new bio is split off the front, and
      may be handle directly without going through generic_make_request().
      The old bio, which has been advanced, is passed to
      generic_make_request(), so it will trigger a trace event a second
      time.
      Probably the best result when a split happens is to see a single
      'queue' event for the whole bio, then multiple 'complete' events - one
      for each component.  To achieve this was can:
      - copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
      - avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
      This way, the split-off bio won't create a queue event, the original
      won't either even if it re-submitted to generic_make_request(),
      but both will produce completion events, each for their own range.
      
      So if generic_make_request() is called (which generates a QUEUED
      event), then bi_endio() will create a single COMPLETE event for each
      range that the bio is split into, unless the driver has explicitly
      requested it not to.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      fbbaf700
  3. 15 3月, 2017 1 次提交
  4. 10 3月, 2017 1 次提交
  5. 02 3月, 2017 1 次提交
  6. 17 2月, 2017 1 次提交
  7. 16 2月, 2017 1 次提交
    • M
      md: fast clone bio in bio_clone_mddev() · d7a10308
      Ming Lei 提交于
      Firstly bio_clone_mddev() is used in raid normal I/O and isn't
      in resync I/O path.
      
      Secondly all the direct access to bvec table in raid happens on
      resync I/O except for write behind of raid1, in which we still
      use bio_clone() for allocating new bvec table.
      
      So this patch replaces bio_clone() with bio_clone_fast()
      in bio_clone_mddev().
      
      Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
      suggested by Christoph Hellwig.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d7a10308
  8. 14 2月, 2017 4 次提交
    • S
      md/raid5-cache: exclude reclaiming stripes in reclaim check · e33fbb9c
      Shaohua Li 提交于
      stripes which are being reclaimed are still accounted into cached
      stripes. The reclaim takes time. r5c_do_reclaim isn't aware of the
      stripes and does unnecessary stripe reclaim. In practice, I saw one
      stripe is reclaimed one time. This will cause bad IO pattern. Fixing
      this by excluding the reclaing stripes in the check.
      
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      e33fbb9c
    • S
      md/r5cache: improve journal device efficiency · 39b99586
      Song Liu 提交于
      It is important to be able to flush all stripes in raid5-cache.
      Therefore, we need reserve some space on the journal device for
      these flushes. If flush operation includes pending writes to the
      stripe, we need to reserve (conf->raid_disk + 1) pages per stripe
      for the flush out. This reduces the efficiency of journal space.
      If we exclude these pending writes from flush operation, we only
      need (conf->max_degraded + 1) pages per stripe.
      
      With this patch, when log space is critical (R5C_LOG_CRITICAL=1),
      pending writes will be excluded from stripe flush out. Therefore,
      we can reduce reserved space for flush out and thus improve journal
      device efficiency.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      39b99586
    • S
      md/r5cache: enable chunk_aligned_read with write back cache · 03b047f4
      Song Liu 提交于
      Chunk aligned read significantly reduces CPU usage of raid456.
      However, it is not safe to fully bypass the write back cache.
      This patch enables chunk aligned read with write back cache.
      
      For chunk aligned read, we track stripes in write back cache at
      a bigger granularity, "big_stripe". Each chunk may contain more
      than one stripe (for example, a 256kB chunk contains 64 4kB-page,
      so this chunk contain 64 stripes). For chunk_aligned_read, these
      stripes are grouped into one big_stripe, so we only need one lookup
      for the whole chunk.
      
      For each big_stripe, struct big_stripe_info tracks how many stripes
      of this big_stripe are in the write back cache. We count how many
      stripes of this big_stripe are in the write back cache. These
      counters are tracked in a radix tree (big_stripe_tree).
      r5c_tree_index() is used to calculate keys for the radix tree.
      
      chunk_aligned_read() calls r5c_big_stripe_cached() to look up
      big_stripe of each chunk in the tree. If this big_stripe is in the
      tree, chunk_aligned_read() aborts. This look up is protected by
      rcu_read_lock().
      
      It is necessary to remember whether a stripe is counted in
      big_stripe_tree. Instead of adding new flag, we reuses existing flags:
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
      two flags are set, the stripe is counted in big_stripe_tree. This
      requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
      r5c_try_caching_write(); and moving clear_bit of
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
      r5c_finish_stripe_write_out().
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      03b047f4
    • S
      raid5: only dispatch IO from raid5d for harddisk raid · 765d704d
      Shaohua Li 提交于
      We made raid5 stripe handling multi-thread before. It works well for
      SSD. But for harddisk, the multi-threading creates more disk seek, so
      not always improve performance. For several hard disks based raid5,
      multi-threading is required as raid5d becames a bottleneck especially
      for sequential write.
      
      To overcome the disk seek issue, we only dispatch IO from raid5d if the
      array is harddisk based. Other threads can still handle stripes, but
      can't dispatch IO.
      
      Idealy, we should control IO dispatching order according to IO position
      interrnally. Right now we still depend on block layer, which isn't very
      efficient sometimes though.
      
      My setup has 9 harddisks, each disk can do around 180M/s sequential
      write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
      write. The test machine uses an ATOM CPU. I measure sequential write
      with large iodepth bandwidth to raid array:
      
      without patch: ~600M/s
      without patch and group_thread_cnt=4: 750M/s
      with patch and group_thread_cnt=4: 950M/s
      with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
      
      We are pretty close to the maximum bandwidth in the large iodepth
      iodepth case. The performance gap of small iodepth sequential write
      between software raid and theory value is still very big though, because
      we don't have an efficient pipeline.
      
      Cc: NeilBrown <neilb@suse.com>
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      765d704d
  9. 02 2月, 2017 1 次提交
  10. 25 1月, 2017 4 次提交
    • S
      md/r5cache: disable write back for degraded array · 2e38a37f
      Song Liu 提交于
      write-back cache in degraded mode introduces corner cases to the array.
      Although we try to cover all these corner cases, it is safer to just
      disable write-back cache when the array is in degraded mode.
      
      In this patch, we disable writeback cache for degraded mode:
      1. On device failure, if the array enters degraded mode, raid5_error()
         will submit async job r5c_disable_writeback_async to disable
         writeback;
      2. In r5c_journal_mode_store(), it is invalid to enable writeback in
         degraded mode;
      3. In r5c_try_caching_write(), stripes with s->failed>0 will be handled
         in write-through mode.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2e38a37f
    • S
      md/r5cache: shift complex rmw from read path to write path · 07e83364
      Song Liu 提交于
      Write back cache requires a complex RMW mechanism, where old data is
      read into dev->orig_page for prexor, and then xor is done with
      dev->page. This logic is already implemented in the write path.
      
      However, current read path is not awared of this requirement. When
      the array is optimal, the RMW is not required, as the data are
      read from raid disks. However, when the target stripe is degraded,
      complex RMW is required to generate right data.
      
      To keep read path as clean as possible, we handle read path by
      flushing degraded, in-journal stripes before processing reads to
      missing dev.
      
      Specifically, when there is read requests to a degraded stripe
      with data in journal, handle_stripe_fill() calls
      r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
      will do the complex RMW and flush the stripe to RAID disks. After
      that, read requests are handled.
      
      There is one more corner case when there is non-overwrite bio for
      the missing (or out of sync) dev. handle_stripe_dirtying() will not
      be able to process the non-overwrite bios without constructing the
      data in handle_stripe_fill(). This is fixed by delaying non-overwrite
      bios in handle_stripe_dirtying(). So handle_stripe_fill() works on
      these bios after the stripe is flushed to raid disks.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      07e83364
    • S
      md/raid5: move comment of fetch_block to right location · ba02684d
      Song Liu 提交于
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ba02684d
    • S
      md/r5cache: read data into orig_page for prexor of cached data · 86aa1397
      Song Liu 提交于
      With write back cache, we use orig_page to do prexor. This patch
      makes sure we read data into orig_page for it.
      
      Flag R5_OrigPageUPTDODATE is added to show whether orig_page
      has the latest data from raid disk.
      
      We introduce a helper function uptodate_for_rmw() to simplify
      the a couple conditions in handle_stripe_dirtying().
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      86aa1397
  11. 10 1月, 2017 1 次提交
  12. 06 1月, 2017 1 次提交
  13. 09 12月, 2016 2 次提交
  14. 30 11月, 2016 1 次提交
    • K
      md/raid5: limit request size according to implementation limits · e8d7c332
      Konstantin Khlebnikov 提交于
      Current implementation employ 16bit counter of active stripes in lower
      bits of bio->bi_phys_segments. If request is big enough to overflow
      this counter bio will be completed and freed too early.
      
      Fortunately this not happens in default configuration because several
      other limits prevent that: stripe_cache_size * nr_disks effectively
      limits count of active stripes. And small max_sectors_kb at lower
      disks prevent that during normal read/write operations.
      
      Overflow easily happens in discard if it's enabled by module parameter
      "devices_handle_discard_safely" and stripe_cache_size is set big enough.
      
      This patch limits requests size with 256Mb - 8Kb to prevent overflows.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Neil Brown <neilb@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NShaohua Li <shli@fb.com>
      e8d7c332
  15. 28 11月, 2016 1 次提交
    • S
      md/r5cache: handle alloc_page failure · d7bd398e
      Song Liu 提交于
      RMW of r5c write back cache uses an extra page to store old data for
      prexor. handle_stripe_dirtying() allocates this page by calling
      alloc_page(). However, alloc_page() may fail.
      
      To handle alloc_page() failures, this patch adds an extra page to
      disk_info. When alloc_page fails, handle_stripe() trys to use these
      pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE),
      the stripe is added to delayed_list.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d7bd398e
  16. 22 11月, 2016 1 次提交
  17. 19 11月, 2016 7 次提交
    • S
      md/r5cache: handle FLUSH and FUA · 3bddb7f8
      Song Liu 提交于
      With raid5 cache, we committing data from journal device. When
      there is flush request, we need to flush journal device's cache.
      This was not needed in raid5 journal, because we will flush the
      journal before committing data to raid disks.
      
      This is similar to FUA, except that we also need flush journal for
      FUA. Otherwise, corruptions in earlier meta data will stop recovery
      from reaching FUA data.
      
      slightly changed the code by Shaohua
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      3bddb7f8
    • S
      md/r5cache: r5cache recovery: part 2 · 5aabf7c4
      Song Liu 提交于
      1. In previous patch, we:
            - add new data to r5l_recovery_ctx
            - add new functions to recovery write-back cache
         The new functions are not used in this patch, so this patch does not
         change the behavior of recovery.
      
      2. In this patchpatch, we:
            - modify main recovery procedure r5l_recovery_log() to call new
              functions
            - remove old functions
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      5aabf7c4
    • S
      md/r5cache: sysfs entry journal_mode · 2c7da14b
      Song Liu 提交于
      With write cache, journal_mode is the knob to switch between
      write-back and write-through.
      
      Below is an example:
      
      root@virt-test:~/# cat /sys/block/md0/md/journal_mode
      [write-through] write-back
      root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode
      root@virt-test:~/# cat /sys/block/md0/md/journal_mode
      write-through [write-back]
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2c7da14b
    • S
      md/r5cache: write-out phase and reclaim support · a39f7afd
      Song Liu 提交于
      There are two limited resources, stripe cache and journal disk space.
      For better performance, we priotize reclaim of full stripe writes.
      To free up more journal space, we free earliest data on the journal.
      
      In current implementation, reclaim happens when:
      1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
         if there is no reclaim in the past 5 seconds.
      2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
         or cached stripes is enough for a full stripe (chunk size / 4k)
         (r5c_check_cached_full_stripe)
      3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
      4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
      
      r5c_do_reclaim() contains new logic of reclaim.
      
      For stripe cache:
      
      When stripe cache pressure is high (more than 3/4 stripes are cached,
      or there is empty inactive lists), flush all full stripe. If fewer
      than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
      are flushed, flush some paritial stripes. When stripe cache pressure
      is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
      
      For log space:
      
      To avoid deadlock due to log space, we need to reserve enough space
      to flush cached data. The size of required log space depends on total
      number of cached stripes (stripe_in_journal_count). In current
      implementation, the writing-out phase automatically include pending
      data writes with parity writes (similar to write through case).
      Therefore, we need up to (conf->raid_disks + 1) pages for each cached
      stripe (1 page for meta data, raid_disks pages for all data and
      parity). r5c_log_required_to_flush_cache() calculates log space
      required to flush cache. In the following, we refer to the space
      calculated by r5c_log_required_to_flush_cache() as
      reclaim_required_space.
      
      Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
      R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
      device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
      is set when free space on the log device is less than 2x of
      reclaim_required_space.
      
      r5c_cache keeps all data in cache (not fully committed to RAID) in
      a list (stripe_in_journal_list). These stripes are in the order of their
      first appearance on the journal. So the log tail (last_checkpoint)
      should point to the journal_start of the first item in the list.
      
      When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
      stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
      set, the state machine only writes data that are already in the
      log device (in stripe_in_journal_list).
      
      This patch includes a fix to improve performance by
      Shaohua Li <shli@fb.com>.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      a39f7afd
    • S
      md/r5cache: caching phase of r5cache · 1e6d690b
      Song Liu 提交于
      As described in previous patch, write back cache operates in two
      phases: caching and writing-out. The caching phase works as:
      1. write data to journal
         (r5c_handle_stripe_dirtying, r5c_cache_data)
      2. call bio_endio
         (r5c_handle_data_cached, r5c_return_dev_pending_writes).
      
      Then the writing-out phase is as:
      1. Mark the stripe as write-out (r5c_make_stripe_write_out)
      2. Calcualte parity (reconstruct or RMW)
      3. Write parity (and maybe some other data) to journal device
      4. Write data and parity to RAID disks
      
      This patch implements caching phase. The cache is integrated with
      stripe cache of raid456. It leverages code of r5l_log to write
      data to journal device.
      
      Writing-out phase of the cache is implemented in the next patch.
      
      With r5cache, write operation does not wait for parity calculation
      and write out, so the write latency is lower (1 write to journal
      device vs. read and then write to raid disks). Also, r5cache will
      reduce RAID overhead (multipile IO due to read-modify-write of
      parity) and provide more opportunities of full stripe writes.
      
      This patch adds 2 flags to stripe_head.state:
       - STRIPE_R5C_PARTIAL_STRIPE,
       - STRIPE_R5C_FULL_STRIPE,
      
      Instead of inactive_list, stripes with cached data are tracked in
      r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
      STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
      stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
      are not considered as "active".
      
      For RMW, the code allocates an extra page for each data block
      being updated.  This is stored in r5dev->orig_page and the old data
      is read into it.  Then the prexor calculation subtracts ->orig_page
      from the parity block, and the reconstruct calculation adds the
      ->page data back into the parity block.
      
      r5cache naturally excludes SkipCopy. When the array has write back
      cache, async_copy_data() will not skip copy.
      
      There are some known limitations of the cache implementation:
      
      1. Write cache only covers full page writes (R5_OVERWRITE). Writes
         of smaller granularity are write through.
      2. Only one log io (sh->log_io) for each stripe at anytime. Later
         writes for the same stripe have to wait. This can be improved by
         moving log_io to r5dev.
      3. With writeback cache, read path must enter state machine, which
         is a significant bottleneck for some workloads.
      4. There is no per stripe checkpoint (with r5l_payload_flush) in
         the log, so recovery code has to replay more than necessary data
         (sometimes all the log from last_checkpoint). This reduces
         availability of the array.
      
      This patch includes a fix proposed by ZhengYuan Liu
      <liuzhengyuan@kylinos.cn>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      1e6d690b
    • S
      md/r5cache: State machine for raid5-cache write back mode · 2ded3703
      Song Liu 提交于
      This patch adds state machine for raid5-cache. With log device, the
      raid456 array could operate in two different modes (r5c_journal_mode):
        - write-back (R5C_MODE_WRITE_BACK)
        - write-through (R5C_MODE_WRITE_THROUGH)
      
      Existing code of raid5-cache only has write-through mode. For write-back
      cache, it is necessary to extend the state machine.
      
      With write-back cache, every stripe could operate in two different
      phases:
        - caching
        - writing-out
      
      In caching phase, the stripe handles writes as:
        - write to journal
        - return IO
      
      In writing-out phase, the stripe behaviors as a stripe in write through
      mode R5C_MODE_WRITE_THROUGH.
      
      STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
      writing-out phase.
      
      Please note: this is a "no-op" patch for raid5-cache write-through
      mode.
      
      The following detailed explanation is copied from the raid5-cache.c:
      
      /*
       * raid5 cache state machine
       *
       * With rhe RAID cache, each stripe works in two phases:
       *      - caching phase
       *      - writing-out phase
       *
       * These two phases are controlled by bit STRIPE_R5C_CACHING:
       *   if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
       *   if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
       *
       * When there is no journal, or the journal is in write-through mode,
       * the stripe is always in writing-out phase.
       *
       * For write-back journal, the stripe is sent to caching phase on write
       * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
       * the write-out phase by clearing STRIPE_R5C_CACHING.
       *
       * Stripes in caching phase do not write the raid disks. Instead, all
       * writes are committed from the log device. Therefore, a stripe in
       * caching phase handles writes as:
       *      - write to log device
       *      - return IO
       *
       * Stripes in writing-out phase handle writes as:
       *      - calculate parity
       *      - write pending data and parity to journal
       *      - write data and parity to raid disks
       *      - return IO for pending writes
       */
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2ded3703
    • S
      md/r5cache: move some code to raid5.h · 937621c3
      Song Liu 提交于
      Move some define and inline functions to raid5.h, so they can be
      used in raid5-cache.c
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      937621c3
  18. 08 11月, 2016 2 次提交
  19. 01 11月, 2016 1 次提交
  20. 22 9月, 2016 3 次提交
  21. 10 9月, 2016 1 次提交
  22. 07 9月, 2016 1 次提交
  23. 01 9月, 2016 1 次提交
  24. 25 8月, 2016 1 次提交