1. 24 3月, 2017 1 次提交
  2. 23 3月, 2017 7 次提交
    • N
      md/raid5: don't test ->writes_pending in raid5_remove_disk · 84dd97a6
      NeilBrown 提交于
      This test on ->writes_pending cannot be safe as the counter
      can be incremented at any moment and cannot be locked against.
      
      Change it to test conf->active_stripes, which at least
      can be locked against.  More changes are still needed.
      
      A future patch will change ->writes_pending, and testing it here will
      be very inconvenient.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      84dd97a6
    • N
      Revert "md/raid5: limit request size according to implementation limits" · 97d53438
      NeilBrown 提交于
      This reverts commit e8d7c332.
      
      Now that raid5 doesn't abuse bi_phys_segments any more, we no longer
      need to impose these limits.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      97d53438
    • N
      md/raid5: remove over-loading of ->bi_phys_segments. · 0472a42b
      NeilBrown 提交于
      When a read request, which bypassed the cache, fails, we need to retry
      it through the cache.
      This involves attaching it to a sequence of stripe_heads, and it may not
      be possible to get all the stripe_heads we need at once.
      We do what we can, and record how far we got in ->bi_phys_segments so
      we can pick up again later.
      
      There is only ever one bio which may have a non-zero offset stored in
      ->bi_phys_segments, the one that is either active in the single thread
      which calls retry_aligned_read(), or is in conf->retry_read_aligned
      waiting for retry_aligned_read() to be called again.
      
      So we only need to store one offset value.  This can be in a local
      variable passed between remove_bio_from_retry() and
      retry_aligned_read(), or in the r5conf structure next to the
      ->retry_read_aligned pointer.
      
      Storing it there allows the last usage of ->bi_phys_segments to be
      removed from md/raid5.c.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      0472a42b
    • N
      md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter · 016c76ac
      NeilBrown 提交于
      md/raid5 needs to keep track of how many stripe_heads are processing a
      bio so that it can delay calling bio_endio() until all stripe_heads
      have completed.  It currently uses 16 bits of ->bi_phys_segments for
      this purpose.
      
      16 bits is only enough for 256M requests, and it is possible for a
      single bio to be larger than this, which causes problems.  Also, the
      bio struct contains a larger counter, __bi_remaining, which has a
      purpose very similar to the purpose of our counter.  So stop using
      ->bi_phys_segments, and instead use __bi_remaining.
      
      This means we don't need to initialize the counter, as our caller
      initializes it to '1'.  It also means we can call bio_endio() directly
      as it tests this counter internally.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      016c76ac
    • N
      md/raid5: call bio_endio() directly rather than queueing for later. · bd83d0a2
      NeilBrown 提交于
      We currently gather bios that need to be returned into a bio_list
      and call bio_endio() on them all together.
      The original reason for this was to avoid making the calls while
      holding a spinlock.
      Locking has changed a lot since then, and that reason is no longer
      valid.
      
      So discard return_io() and various return_bi lists, and just call
      bio_endio() directly as needed.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      bd83d0a2
    • N
      md/raid5: simplfy delaying of writes while metadata is updated. · 16d997b7
      NeilBrown 提交于
      If a device fails during a write, we must ensure the failure is
      recorded in the metadata before the completion of the write is
      acknowleged.
      
      Commit c3cce6cd ("md/raid5: ensure device failure recorded before
      write request returns.")  added code for this, but it was
      unnecessarily complicated.  We already had similar functionality for
      handling updates to the bad-block-list, thanks to Commit de393cde
      ("md: make it easier to wait for bad blocks to be acknowledged.")
      
      So revert most of the former commit, and instead avoid collecting
      completed writes if MD_CHANGE_PENDING is set.  raid5d() will then flush
      the metadata and retry the stripe_head.
      As this change can leave a stripe_head ready for handling immediately
      after handle_active_stripes() returns, we change raid5_do_work() to
      pause when MD_CHANGE_PENDING is set, so that it doesn't spin.
      
      We check MD_CHANGE_PENDING *after* analyse_stripe() as it could be set
      asynchronously.  After analyse_stripe(), we have collected stable data
      about the state of devices, which will be used to make decisions.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      16d997b7
    • N
      md/raid5: use md_write_start to count stripes, not bios · 49728050
      NeilBrown 提交于
      We use md_write_start() to increase the count of pending writes, and
      md_write_end() to decrement the count.  We currently count bios
      submitted to md/raid5.  Change it count stripe_heads that a WRITE bio
      has been attached to.
      
      So now, raid5_make_request() calls md_write_start() and then
      md_write_end() to keep the count elevated during the setup of the
      request.
      
      add_stripe_bio() calls md_write_start() for each stripe_head, and the
      completion routines always call md_write_end(), instead of only
      calling it when raid5_dec_bi_active_stripes() returns 0.
      make_discard_request also calls md_write_start/end().
      
      The parallel between md_write_{start,end} and use of bi_phys_segments
      can be seen in that:
       Whenever we set bi_phys_segments to 1, we now call md_write_start.
       Whenever we increment it on non-read requests with
         raid5_inc_bi_active_stripes(), we now call md_write_start().
       Whenever we decrement bi_phys_segments on non-read requsts with
          raid5_dec_bi_active_stripes(), we now call md_write_end().
      
      This reduces our dependence on keeping a per-bio count of active
      stripes in bi_phys_segments.
      
      md_write_inc() is added which parallels md_write_start(), but requires
      that a write has already been started, and is certain never to sleep.
      This can be used inside a spinlocked region when adding to a write
      request.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      49728050
  3. 17 3月, 2017 7 次提交
    • A
      raid5-ppl: runtime PPL enabling or disabling · ba903a3e
      Artur Paszkiewicz 提交于
      Allow writing to 'consistency_policy' attribute when the array is
      active. Add a new function 'change_consistency_policy' to the
      md_personality operations structure to handle the change in the
      personality code. Values "ppl" and "resync" are accepted and
      turn PPL on and off respectively.
      
      When enabling PPL its location and size should first be set using
      'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be
      written at this location on each member device.
      
      Enabling or disabling PPL is performed under a suspended array.  The
      raid5_reset_stripe_cache function frees the stripe cache and allocates
      it again in order to allocate or free the ppl_pages for the stripes in
      the stripe cache.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ba903a3e
    • A
      raid5-ppl: support disk hot add/remove with PPL · 6358c239
      Artur Paszkiewicz 提交于
      Add a function to modify the log by removing an rdev when a drive fails
      or adding when a spare/replacement is activated as a raid member.
      
      Removing a disk just clears the child log rdev pointer. No new stripes
      will be accepted for this child log in ppl_write_stripe() and running io
      units will be processed without writing PPL to the device.
      
      Adding a disk sets the child log rdev pointer and writes an empty PPL
      header.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      6358c239
    • A
      raid5-ppl: load and recover the log · 4536bf9b
      Artur Paszkiewicz 提交于
      Load the log from each disk when starting the array and recover if the
      array is dirty.
      
      The initial empty PPL is written by mdadm. When loading the log we
      verify the header checksum and signature. For external metadata arrays
      the signature is verified in userspace, so here we read it from the
      header, verifying only if it matches on all disks, and use it later when
      writing PPL.
      
      In addition to the header checksum, each header entry also contains a
      checksum of its partial parity data. If the header is valid, recovery is
      performed for each entry until an invalid entry is found. If the array
      is not degraded and recovery using PPL fully succeeds, there is no need
      to resync the array because data and parity will be consistent, so in
      this case resync will be disabled.
      
      Due to compatibility with IMSM implementations on other systems, we
      can't assume that the recovery data block size is always 4K. Writes
      generated by MD raid5 don't have this issue, but when recovering PPL
      written in other environments it is possible to have entries with
      512-byte sector granularity. The recovery code takes this into account
      and also the logical sector size of the underlying drives.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      4536bf9b
    • A
      raid5-ppl: Partial Parity Log write logging implementation · 3418d036
      Artur Paszkiewicz 提交于
      Implement the calculation of partial parity for a stripe and PPL write
      logging functionality. The description of PPL is added to the
      documentation. More details can be found in the comments in raid5-ppl.c.
      
      Attach a page for holding the partial parity data to stripe_head.
      Allocate it only if mddev has the MD_HAS_PPL flag set.
      
      Partial parity is the xor of not modified data chunks of a stripe and is
      calculated as follows:
      
      - reconstruct-write case:
        xor data from all not updated disks in a stripe
      
      - read-modify-write case:
        xor old data and parity from all updated disks in a stripe
      
      Implement it using the async_tx API and integrate into raid_run_ops().
      It must be called when we still have access to old data, so do it when
      STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
      stored into sh->ppl_page.
      
      Partial parity is not meaningful for full stripe write and is not stored
      in the log or used for recovery, so don't attempt to calculate it when
      stripe has STRIPE_FULL_WRITE.
      
      Put the PPL metadata structures to md_p.h because userspace tools
      (mdadm) will also need to read/write PPL.
      
      Warn about using PPL with enabled disk volatile write-back cache for
      now. It can be removed once disk cache flushing before writing PPL is
      implemented.
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      3418d036
    • A
      raid5: separate header for log functions · ff875738
      Artur Paszkiewicz 提交于
      Move raid5-cache declarations from raid5.h to raid5-log.h, add inline
      wrappers for functions which will be shared with ppl and use them in
      raid5 core instead of direct calls to raid5-cache.
      
      Remove unused parameter from r5c_cache_data(), move two duplicated
      pr_debug() calls to r5l_init_log().
      Signed-off-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ff875738
    • S
      md/raid5: sort bios · aaf9f12e
      Shaohua Li 提交于
      Previous patch (raid5: only dispatch IO from raid5d for harddisk raid)
      defers IO dispatching. The goal is to create better IO pattern. At that
      time, we don't sort the deffered IO and hope the block layer can do IO
      merge and sort. Now the raid5-cache writeback could create large amount
      of bios. And if we enable muti-thread for stripe handling, we can't
      control when to dispatch IO to raid disks. In a lot of time, we are
      dispatching IO which block layer can't do merge effectively.
      
      This patch moves further for the IO dispatching defer. We accumulate
      bios, but we don't dispatch all the bios after a threshold is met. This
      'dispatch partial portion of bios' stragety allows bios coming in a
      large time window are sent to disks together. At the dispatching time,
      there is large chance the block layer can merge the bios. To make this
      more effective, we dispatch IO in ascending order. This increases
      request merge chance and reduces disk seek.
      Signed-off-by: NShaohua Li <shli@fb.com>
      aaf9f12e
    • S
      md/raid5: prioritize stripes for writeback · 535ae4eb
      Shaohua Li 提交于
      In raid5-cache writeback mode, we have two types of stripes to handle.
      - stripes which aren't cached yet
      - stripes which are cached and flushing out to raid disks
      
      Upperlayer is more sensistive to latency of the first type of stripes
      generally. But we only one handle list for all these stripes, where the
      two types of stripes are mixed together. When reclaim flushes a lot of
      stripes, the first type of stripes could be noticeably delayed. On the
      other hand, if the log space is tight, we'd like to handle the second
      type of stripes faster and free log space.
      
      This patch destinguishes the two types stripes. They are added into
      different handle list. When we try to get a stripe to handl, we prefer
      the first type of stripes unless log space is tight.
      
      This should have no impact for !writeback case.
      Signed-off-by: NShaohua Li <shli@fb.com>
      535ae4eb
  4. 15 3月, 2017 1 次提交
  5. 10 3月, 2017 1 次提交
  6. 02 3月, 2017 1 次提交
  7. 17 2月, 2017 1 次提交
  8. 16 2月, 2017 1 次提交
    • M
      md: fast clone bio in bio_clone_mddev() · d7a10308
      Ming Lei 提交于
      Firstly bio_clone_mddev() is used in raid normal I/O and isn't
      in resync I/O path.
      
      Secondly all the direct access to bvec table in raid happens on
      resync I/O except for write behind of raid1, in which we still
      use bio_clone() for allocating new bvec table.
      
      So this patch replaces bio_clone() with bio_clone_fast()
      in bio_clone_mddev().
      
      Also kill bio_clone_mddev() and call bio_clone_fast() directly, as
      suggested by Christoph Hellwig.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d7a10308
  9. 14 2月, 2017 4 次提交
    • S
      md/raid5-cache: exclude reclaiming stripes in reclaim check · e33fbb9c
      Shaohua Li 提交于
      stripes which are being reclaimed are still accounted into cached
      stripes. The reclaim takes time. r5c_do_reclaim isn't aware of the
      stripes and does unnecessary stripe reclaim. In practice, I saw one
      stripe is reclaimed one time. This will cause bad IO pattern. Fixing
      this by excluding the reclaing stripes in the check.
      
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      e33fbb9c
    • S
      md/r5cache: improve journal device efficiency · 39b99586
      Song Liu 提交于
      It is important to be able to flush all stripes in raid5-cache.
      Therefore, we need reserve some space on the journal device for
      these flushes. If flush operation includes pending writes to the
      stripe, we need to reserve (conf->raid_disk + 1) pages per stripe
      for the flush out. This reduces the efficiency of journal space.
      If we exclude these pending writes from flush operation, we only
      need (conf->max_degraded + 1) pages per stripe.
      
      With this patch, when log space is critical (R5C_LOG_CRITICAL=1),
      pending writes will be excluded from stripe flush out. Therefore,
      we can reduce reserved space for flush out and thus improve journal
      device efficiency.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      39b99586
    • S
      md/r5cache: enable chunk_aligned_read with write back cache · 03b047f4
      Song Liu 提交于
      Chunk aligned read significantly reduces CPU usage of raid456.
      However, it is not safe to fully bypass the write back cache.
      This patch enables chunk aligned read with write back cache.
      
      For chunk aligned read, we track stripes in write back cache at
      a bigger granularity, "big_stripe". Each chunk may contain more
      than one stripe (for example, a 256kB chunk contains 64 4kB-page,
      so this chunk contain 64 stripes). For chunk_aligned_read, these
      stripes are grouped into one big_stripe, so we only need one lookup
      for the whole chunk.
      
      For each big_stripe, struct big_stripe_info tracks how many stripes
      of this big_stripe are in the write back cache. We count how many
      stripes of this big_stripe are in the write back cache. These
      counters are tracked in a radix tree (big_stripe_tree).
      r5c_tree_index() is used to calculate keys for the radix tree.
      
      chunk_aligned_read() calls r5c_big_stripe_cached() to look up
      big_stripe of each chunk in the tree. If this big_stripe is in the
      tree, chunk_aligned_read() aborts. This look up is protected by
      rcu_read_lock().
      
      It is necessary to remember whether a stripe is counted in
      big_stripe_tree. Instead of adding new flag, we reuses existing flags:
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
      two flags are set, the stripe is counted in big_stripe_tree. This
      requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
      r5c_try_caching_write(); and moving clear_bit of
      STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
      r5c_finish_stripe_write_out().
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      03b047f4
    • S
      raid5: only dispatch IO from raid5d for harddisk raid · 765d704d
      Shaohua Li 提交于
      We made raid5 stripe handling multi-thread before. It works well for
      SSD. But for harddisk, the multi-threading creates more disk seek, so
      not always improve performance. For several hard disks based raid5,
      multi-threading is required as raid5d becames a bottleneck especially
      for sequential write.
      
      To overcome the disk seek issue, we only dispatch IO from raid5d if the
      array is harddisk based. Other threads can still handle stripes, but
      can't dispatch IO.
      
      Idealy, we should control IO dispatching order according to IO position
      interrnally. Right now we still depend on block layer, which isn't very
      efficient sometimes though.
      
      My setup has 9 harddisks, each disk can do around 180M/s sequential
      write. So in theory, the raid5 can do 180 * 8 = 1440M/s sequential
      write. The test machine uses an ATOM CPU. I measure sequential write
      with large iodepth bandwidth to raid array:
      
      without patch: ~600M/s
      without patch and group_thread_cnt=4: 750M/s
      with patch and group_thread_cnt=4: 950M/s
      with patch, group_thread_cnt=4, skip_copy=1: 1150M/s
      
      We are pretty close to the maximum bandwidth in the large iodepth
      iodepth case. The performance gap of small iodepth sequential write
      between software raid and theory value is still very big though, because
      we don't have an efficient pipeline.
      
      Cc: NeilBrown <neilb@suse.com>
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      765d704d
  10. 02 2月, 2017 1 次提交
  11. 25 1月, 2017 4 次提交
    • S
      md/r5cache: disable write back for degraded array · 2e38a37f
      Song Liu 提交于
      write-back cache in degraded mode introduces corner cases to the array.
      Although we try to cover all these corner cases, it is safer to just
      disable write-back cache when the array is in degraded mode.
      
      In this patch, we disable writeback cache for degraded mode:
      1. On device failure, if the array enters degraded mode, raid5_error()
         will submit async job r5c_disable_writeback_async to disable
         writeback;
      2. In r5c_journal_mode_store(), it is invalid to enable writeback in
         degraded mode;
      3. In r5c_try_caching_write(), stripes with s->failed>0 will be handled
         in write-through mode.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2e38a37f
    • S
      md/r5cache: shift complex rmw from read path to write path · 07e83364
      Song Liu 提交于
      Write back cache requires a complex RMW mechanism, where old data is
      read into dev->orig_page for prexor, and then xor is done with
      dev->page. This logic is already implemented in the write path.
      
      However, current read path is not awared of this requirement. When
      the array is optimal, the RMW is not required, as the data are
      read from raid disks. However, when the target stripe is degraded,
      complex RMW is required to generate right data.
      
      To keep read path as clean as possible, we handle read path by
      flushing degraded, in-journal stripes before processing reads to
      missing dev.
      
      Specifically, when there is read requests to a degraded stripe
      with data in journal, handle_stripe_fill() calls
      r5c_make_stripe_write_out() and exits. Then handle_stripe_dirtying()
      will do the complex RMW and flush the stripe to RAID disks. After
      that, read requests are handled.
      
      There is one more corner case when there is non-overwrite bio for
      the missing (or out of sync) dev. handle_stripe_dirtying() will not
      be able to process the non-overwrite bios without constructing the
      data in handle_stripe_fill(). This is fixed by delaying non-overwrite
      bios in handle_stripe_dirtying(). So handle_stripe_fill() works on
      these bios after the stripe is flushed to raid disks.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      07e83364
    • S
      md/raid5: move comment of fetch_block to right location · ba02684d
      Song Liu 提交于
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ba02684d
    • S
      md/r5cache: read data into orig_page for prexor of cached data · 86aa1397
      Song Liu 提交于
      With write back cache, we use orig_page to do prexor. This patch
      makes sure we read data into orig_page for it.
      
      Flag R5_OrigPageUPTDODATE is added to show whether orig_page
      has the latest data from raid disk.
      
      We introduce a helper function uptodate_for_rmw() to simplify
      the a couple conditions in handle_stripe_dirtying().
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      86aa1397
  12. 10 1月, 2017 1 次提交
  13. 06 1月, 2017 1 次提交
  14. 09 12月, 2016 2 次提交
  15. 30 11月, 2016 1 次提交
    • K
      md/raid5: limit request size according to implementation limits · e8d7c332
      Konstantin Khlebnikov 提交于
      Current implementation employ 16bit counter of active stripes in lower
      bits of bio->bi_phys_segments. If request is big enough to overflow
      this counter bio will be completed and freed too early.
      
      Fortunately this not happens in default configuration because several
      other limits prevent that: stripe_cache_size * nr_disks effectively
      limits count of active stripes. And small max_sectors_kb at lower
      disks prevent that during normal read/write operations.
      
      Overflow easily happens in discard if it's enabled by module parameter
      "devices_handle_discard_safely" and stripe_cache_size is set big enough.
      
      This patch limits requests size with 256Mb - 8Kb to prevent overflows.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Neil Brown <neilb@suse.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NShaohua Li <shli@fb.com>
      e8d7c332
  16. 28 11月, 2016 1 次提交
    • S
      md/r5cache: handle alloc_page failure · d7bd398e
      Song Liu 提交于
      RMW of r5c write back cache uses an extra page to store old data for
      prexor. handle_stripe_dirtying() allocates this page by calling
      alloc_page(). However, alloc_page() may fail.
      
      To handle alloc_page() failures, this patch adds an extra page to
      disk_info. When alloc_page fails, handle_stripe() trys to use these
      pages. When these pages are used by other stripe (R5C_EXTRA_PAGE_IN_USE),
      the stripe is added to delayed_list.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      d7bd398e
  17. 22 11月, 2016 1 次提交
  18. 19 11月, 2016 4 次提交
    • S
      md/r5cache: handle FLUSH and FUA · 3bddb7f8
      Song Liu 提交于
      With raid5 cache, we committing data from journal device. When
      there is flush request, we need to flush journal device's cache.
      This was not needed in raid5 journal, because we will flush the
      journal before committing data to raid disks.
      
      This is similar to FUA, except that we also need flush journal for
      FUA. Otherwise, corruptions in earlier meta data will stop recovery
      from reaching FUA data.
      
      slightly changed the code by Shaohua
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      3bddb7f8
    • S
      md/r5cache: r5cache recovery: part 2 · 5aabf7c4
      Song Liu 提交于
      1. In previous patch, we:
            - add new data to r5l_recovery_ctx
            - add new functions to recovery write-back cache
         The new functions are not used in this patch, so this patch does not
         change the behavior of recovery.
      
      2. In this patchpatch, we:
            - modify main recovery procedure r5l_recovery_log() to call new
              functions
            - remove old functions
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      5aabf7c4
    • S
      md/r5cache: sysfs entry journal_mode · 2c7da14b
      Song Liu 提交于
      With write cache, journal_mode is the knob to switch between
      write-back and write-through.
      
      Below is an example:
      
      root@virt-test:~/# cat /sys/block/md0/md/journal_mode
      [write-through] write-back
      root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode
      root@virt-test:~/# cat /sys/block/md0/md/journal_mode
      write-through [write-back]
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2c7da14b
    • S
      md/r5cache: write-out phase and reclaim support · a39f7afd
      Song Liu 提交于
      There are two limited resources, stripe cache and journal disk space.
      For better performance, we priotize reclaim of full stripe writes.
      To free up more journal space, we free earliest data on the journal.
      
      In current implementation, reclaim happens when:
      1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
         if there is no reclaim in the past 5 seconds.
      2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
         or cached stripes is enough for a full stripe (chunk size / 4k)
         (r5c_check_cached_full_stripe)
      3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
      4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
      
      r5c_do_reclaim() contains new logic of reclaim.
      
      For stripe cache:
      
      When stripe cache pressure is high (more than 3/4 stripes are cached,
      or there is empty inactive lists), flush all full stripe. If fewer
      than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
      are flushed, flush some paritial stripes. When stripe cache pressure
      is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
      
      For log space:
      
      To avoid deadlock due to log space, we need to reserve enough space
      to flush cached data. The size of required log space depends on total
      number of cached stripes (stripe_in_journal_count). In current
      implementation, the writing-out phase automatically include pending
      data writes with parity writes (similar to write through case).
      Therefore, we need up to (conf->raid_disks + 1) pages for each cached
      stripe (1 page for meta data, raid_disks pages for all data and
      parity). r5c_log_required_to_flush_cache() calculates log space
      required to flush cache. In the following, we refer to the space
      calculated by r5c_log_required_to_flush_cache() as
      reclaim_required_space.
      
      Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
      R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
      device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
      is set when free space on the log device is less than 2x of
      reclaim_required_space.
      
      r5c_cache keeps all data in cache (not fully committed to RAID) in
      a list (stripe_in_journal_list). These stripes are in the order of their
      first appearance on the journal. So the log tail (last_checkpoint)
      should point to the journal_start of the first item in the list.
      
      When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
      stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
      set, the state machine only writes data that are already in the
      log device (in stripe_in_journal_list).
      
      This patch includes a fix to improve performance by
      Shaohua Li <shli@fb.com>.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      a39f7afd