1. 25 Jan, 2017 3 commits
    •
      md/r5cache: flush data only stripes in r5l_recovery_log() · a85dd7b8
      Song Liu authored
      For safer operation, all arrays start in write-through mode, which has been
      better tested and is more mature. The write-through/write-back mode setting
      is also not persistent across array restarts, so we always start the array
      in write-through mode. However, if recovery finds data-only stripes left
      over from before the shutdown (from a previous write-back mode), it is not
      safe to start the array in write-through mode, because write-through mode
      cannot handle stripes with data in the write-back cache. To solve this
      problem, we flush all data-only stripes in r5l_recovery_log(). When
      r5l_recovery_log() returns, the array starts with an empty cache in
      write-through mode.
      
      This logic is implemented in r5c_recovery_flush_data_only_stripes():
      
      1. enable write back cache
      2. flush all stripes
      3. wake up conf->mddev->thread
      4. wait for all stripes to get flushed (reuse wait_for_quiescent)
      5. disable write back cache
      
      The wait in step 4 will be woken up in release_inactive_stripe_list()
      when conf->active_stripes reaches 0.
      
      It is safe to wake up mddev->thread here because all the resources
      required for the thread have been initialized.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      a85dd7b8
    •
      md/r5cache: read data into orig_page for prexor of cached data · 86aa1397
      Song Liu authored
      With write back cache, we use orig_page to do prexor. This patch
      makes sure we read data into orig_page for it.
      
      Flag R5_OrigPageUPTDODATE is added to show whether orig_page
      has the latest data from raid disk.
      
      We introduce a helper function uptodate_for_rmw() to simplify
      a couple of conditions in handle_stripe_dirtying().
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      86aa1397
    •
      md/raid5-cache: delete meaningless code · d46d29f0
      Shaohua Li authored
      sector_t is unsigned long, so it is never < 0
      Reported-by: Julia Lawall <julia.lawall@lip6.fr>
      Signed-off-by: Shaohua Li <shli@fb.com>
      d46d29f0
  2. 06 Jan, 2017 4 commits
  3. 09 Dec, 2016 4 commits
    •
      md: separate flags for superblock changes · 2953079c
      Shaohua Li authored
      The mddev->flags are used for different purposes. There are a lot of
      places where we check/change the flags without masking unrelated flags,
      so we could check/change unrelated flags. These usages are mostly for
      superblock writes, so separate the superblock-related flags. This should
      make the code clearer and also fix real bugs.
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      2953079c
    •
      md/r5cache: after recovery, increase journal seq by 10000 · 3c6edc66
      Song Liu authored
      Currently, we increase journal entry seq by 10 after recovery.
      However, this is not sufficient in the following case.
      
      After crash the journal looks like
      
      | seq+0 | +1 | +2 | +3 | +4 | +5 | +6 | +7 | ... | +11 | +12 |
      
      If +1 is not valid, we drop all entries from +1 to +12 and
      write seq+10:
      
      | seq+0 | +10 | +2 | +3 | +4 | +5 | +6 | +7 | ... | +11 | +12 |
      
      However, if we write a big journal entry with seq+11, it will
      connect with some stale journal entry:
      
      | seq+0 | +10 |                     +11                 | +12 |
      
      To reduce the risk of this issue, we increase seq by 10000 instead.
      
      Shaohua: use 10000 instead of 1000. The risk should be very low. The total
      stripe cache size is less than 2k typically, and several stripes can fit into
      one meta data block. So the total number of inflight meta data blocks would
      be quite small, which means the total number of sequence numbers in use
      should be quite small. The 10000 sequence number increase should be far more
      than safe.
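The scenario above can be checked with a toy model. This is only a sketch under stated assumptions, not the kernel's recovery code: the function name and the model are hypothetical. Stale blocks sit at positions 1..nr_stale after the last valid entry and carry seq+1..seq+nr_stale; the block at position 1 is overwritten by an empty meta block with seq+bump, and later entries may each span one or more blocks. A stale block at position i can be mistaken for a valid continuation if the new entries exactly cover positions 1..i-1 and the last of them has sequence number seq+i-1.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model (hypothetical, not kernel code).
 * Returns 1 if some stale block at position i (stale seq offset == i)
 * could directly follow j freshly written entries whose seq offsets are
 * bump, bump + 1, ..., bump + j - 1, i.e. if i == bump + j.
 * The first new entry is the one-block empty meta block, so covering
 * positions 1..i-1 needs exactly one entry when i == 2 and between
 * 2 and i-1 entries otherwise.
 */
int stale_entry_can_chain(uint64_t bump, uint64_t nr_stale)
{
	for (uint64_t i = 2; i <= nr_stale; i++) {
		uint64_t min_new = (i == 2) ? 1 : 2;
		uint64_t max_new = i - 1;

		for (uint64_t j = min_new; j <= max_new; j++)
			if (i == bump + j)
				return 1;
	}
	return 0;
}
```

With 12 stale entries, a bump of 10 allows the chain shown in the diagram (empty block at seq+10, one big entry at seq+11, stale seq+12), while a bump of 10000 would need more than 10000 stale entries in this model.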
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      3c6edc66
    •
      md/raid5-cache: fix crc in rewrite_data_only_stripes() · 5c88f403
      Song Liu authored
      r5l_recovery_create_empty_meta_block() creates crc for the empty
      metablock. After the metablock is updated, we need to clear the
      checksum before recalculating it.
      
      Shaohua: moved checksum calculation out of
      r5l_recovery_create_empty_meta_block. We should calculate it after all fields
      are updated.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      5c88f403
    •
      md/raid5-cache: no recovery is required when create super-block · d30dfeb9
      JackieLiu authored
      When creating the super-block information, we do not need to run the
      recovery stage; we only need to initialize some variables.
      Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
      Reviewed-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      d30dfeb9
  4. 06 Dec, 2016 2 commits
    •
      md/r5cache: do r5c_update_log_state after log recovery · 3d7e7e1d
      Zhengyuan Liu authored
      We should update the log state after log recovery. The current code may
      get a wrong log state, since log->log_start isn't initialized until we
      call r5l_recovery_log.
      
      At the log recovery stage, no lock is needed as there is no race condition.
      The next_checkpoint field will be initialized in r5l_recovery_log too.
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Shaohua Li <shli@fb.com>
      3d7e7e1d
    •
      md/raid5-cache: adjust the write position of the empty block if no data blocks · 43b96748
      JackieLiu authored
      When recovery is complete, we first write an empty block and record its
      position, then rewrite the data-only stripes. The location of the empty
      block is then written into the super block as the last checkpoint
      position, and we should update last_checkpoint to this empty block
      position.
      
      ------------------------------------------------------------------
      |  old log       | empty block | data only stripes | invalid log |
      ------------------------------------------------------------------
      ^                ^                                 ^
      |                |- log->last_checkpoint           |- log->log_start
      |                |- log->last_cp_seq               |- log->next_checkpoint
      |- log->seq=n    |- log->seq=10+n
      
      At the same time, if there are no data-only stripes, this scenario may appear:
      | meta1 | meta2 | meta3 |
      meta 1 is valid, meta 2 is invalid, and meta 3 could be valid.
      The solution is to create a new meta in meta2 with its seq == meta1's
      seq + 10 and let the superblock point to meta2.
      Signed-off-by: JackieLiu <liuyun01@kylinos.cn>
      Reviewed-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Reviewed-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      43b96748
  5. 03 Dec, 2016 1 commit
  6. 30 Nov, 2016 7 commits
  7. 28 Nov, 2016 2 commits
  8. 24 Nov, 2016 1 commit
  9. 22 Nov, 2016 1 commit
  10. 19 Nov, 2016 9 commits
    •
      md/r5cache: handle FLUSH and FUA · 3bddb7f8
      Song Liu authored
      With raid5 cache, we commit data from the journal device. When
      there is a flush request, we need to flush the journal device's cache.
      This was not needed in raid5 journal, because we flush the
      journal before committing data to raid disks.
      
      This is similar to FUA, except that we also need to flush the journal
      for FUA. Otherwise, corruptions in earlier meta data will stop recovery
      from reaching FUA data.
      
      Code slightly changed by Shaohua.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      3bddb7f8
    •
      md/r5cache: r5cache recovery: part 2 · 5aabf7c4
      Song Liu authored
      1. In the previous patch, we:
            - added new data to r5l_recovery_ctx
            - added new functions to recover write-back cache
         The new functions were not used in that patch, so it did not
         change the behavior of recovery.
      
      2. In this patch, we:
            - modify main recovery procedure r5l_recovery_log() to call new
              functions
            - remove old functions
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      5aabf7c4
    •
      md/r5cache: r5cache recovery: part 1 · b4c625c6
      Song Liu authored
      Recovery of write-back cache has different logic from write-through only
      cache. Specifically, for write-back cache, the recovery needs to scan
      through all active journal entries before flushing data out. Therefore,
      a large portion of the recovery logic is rewritten here.
      
      To make the diffs cleaner, we split the rewrite as follows:
      
      1. In this patch, we:
            - add new data to r5l_recovery_ctx
            - add new functions to recover write-back cache
         The new functions are not used in this patch, so this patch does not
         change the behavior of recovery.
      
      2. In next patch, we:
            - modify main recovery procedure r5l_recovery_log() to call new
              functions
            - remove old functions
      
      With cache feature, there are 2 different scenarios of recovery:
      1. Data-Parity stripe: a stripe with complete parity in journal.
      2. Data-Only stripe: a stripe with only data in journal (or partial
         parity).
      
      The code differentiates Data-Parity stripes from Data-Only stripes with
      flag STRIPE_R5C_CACHING.
      
      For Data-Parity stripes, we use the same procedure as raid5 journal,
      where all the data and parity are replayed to the RAID devices.
      
      For Data-Only stripes, we need to finish calculating parity and
      finish the full reconstruct write or RMW write. For simplicity, in
      the recovery, we load the stripe to stripe cache. Once the array is
      started, the stripe cache state machine will handle these stripes
      through normal write path.
      
      r5c_recovery_flush_log contains the main procedure of recovery. The
      recovery code first scans through the journal and loads data to
      stripe cache. The code keeps track of all these stripes in a list
      (using sh->lru and ctx->cached_list); stripes in the list are
      ordered by their first appearance on the journal.
      During the scan, the recovery code assesses each stripe as
      Data-Parity or Data-Only.
      
      During scan, the array may run out of stripe cache. In these cases,
      the recovery code will also call raid5_set_cache_size to increase
      stripe cache size. If the array still runs out of stripe cache
      because there isn't enough memory, the array will not assemble.
      
      At the end of scan, the recovery code replays all Data-Parity
      stripes, and sets proper states for Data-Only stripes. The recovery
      code also increases seq number by 10 and rewrites all Data-Only
      stripes to journal. This is to avoid confusion after repeated
      crashes. More details are explained in raid5-cache.c before
      r5c_recovery_rewrite_data_only_stripes().
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      b4c625c6
    •
      md/r5cache: refactoring journal recovery code · 9ed988f5
      Song Liu authored
      1. rename r5l_read_meta_block() as r5l_recovery_read_meta_block();
      2. pull the code that initializes r5l_meta_block from
         r5l_log_write_empty_meta_block() to a separate function
         r5l_recovery_create_empty_meta_block(), so that we can reuse this
         piece of code.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      9ed988f5
    •
      md/r5cache: sysfs entry journal_mode · 2c7da14b
      Song Liu authored
      With write cache, journal_mode is the knob to switch between
      write-back and write-through.
      
      Below is an example:
      
      root@virt-test:~/# cat /sys/block/md0/md/journal_mode
      [write-through] write-back
      root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode
      root@virt-test:~/# cat /sys/block/md0/md/journal_mode
      write-through [write-back]
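The sysfs output above marks the active mode with square brackets. As a minimal sketch, a userspace tool could extract the active mode from such a line as follows. The function name is hypothetical and the only assumption about the format is what the example shows (exactly one bracketed token).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the bracketed token of a journal_mode line
 * (e.g. "[write-through] write-back") into out.
 * Returns 0 on success, -1 if no bracketed token is found or if it
 * does not fit into out (including the terminating NUL).
 */
int active_journal_mode(const char *line, char *out, size_t out_len)
{
	const char *open = strchr(line, '[');
	const char *close = open ? strchr(open, ']') : NULL;
	size_t len;

	if (!open || !close)
		return -1;

	len = (size_t)(close - open - 1);
	if (len + 1 > out_len)
		return -1;

	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}
```

Reading the file itself (for example with fgets() on /sys/block/md0/md/journal_mode) is left out to keep the sketch free of I/O.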
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      2c7da14b
    •
      md/r5cache: write-out phase and reclaim support · a39f7afd
      Song Liu authored
      There are two limited resources: stripe cache and journal disk space.
      For better performance, we prioritize reclaim of full stripe writes.
      To free up more journal space, we free the earliest data on the journal.
      
      In the current implementation, reclaim happens:
      1. periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds), if
         there has been no reclaim in the past 5 seconds
      2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
         or the cached stripes are enough for a full stripe (chunk size / 4k)
         (r5c_check_cached_full_stripe)
      3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
      4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
      
      r5c_do_reclaim() contains new logic of reclaim.
      
      For stripe cache:
      
      When stripe cache pressure is high (more than 3/4 of stripes are cached,
      or there are empty inactive lists), flush all full stripes. If fewer
      than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
      are flushed, flush some partial stripes. When stripe cache pressure
      is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
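The stripe cache policy described above can be sketched as a small decision function. This is a hedged illustration derived only from the text, not the body of r5c_do_reclaim(): the names are hypothetical, and the handling of the exact 1/2 and 3/4 boundaries is an assumption.

```c
#include <assert.h>

/* Hypothetical names; they model the three outcomes described above. */
enum reclaim_action {
	RECLAIM_NONE,              /* low pressure: nothing to flush */
	RECLAIM_FULL_ONLY,         /* moderate pressure: flush all full stripes */
	RECLAIM_FULL_THEN_PARTIAL  /* high pressure: full stripes, then partial */
};

/* Integer arithmetic avoids rounding: cached/total > 3/4 <=> 4*cached > 3*total. */
enum reclaim_action pick_reclaim_action(unsigned int cached,
					unsigned int total,
					int has_empty_inactive_list)
{
	if (has_empty_inactive_list || cached * 4 > total * 3)
		return RECLAIM_FULL_THEN_PARTIAL;
	if (cached * 2 > total)
		return RECLAIM_FULL_ONLY;
	return RECLAIM_NONE;
}

/*
 * Under high pressure, top up with partial stripes if fewer than
 * group (R5C_RECLAIM_STRIPE_GROUP in the text) full stripes were flushed.
 */
unsigned int partial_to_flush(unsigned int full_flushed, unsigned int group)
{
	return full_flushed < group ? group - full_flushed : 0;
}
```

For example, with 100 stripes, 80 cached stripes fall in the high-pressure branch and 60 in the moderate one.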
      
      For log space:
      
      To avoid deadlock due to log space, we need to reserve enough space
      to flush cached data. The size of required log space depends on total
      number of cached stripes (stripe_in_journal_count). In the current
      implementation, the writing-out phase automatically includes pending
      data writes with parity writes (similar to the write-through case).
      Therefore, we need up to (conf->raid_disks + 1) pages for each cached
      stripe (1 page for meta data, raid_disks pages for all data and
      parity). r5c_log_required_to_flush_cache() calculates log space
      required to flush cache. In the following, we refer to the space
      calculated by r5c_log_required_to_flush_cache() as
      reclaim_required_space.
      
      Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
      R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
      device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
      is set when free space on the log device is less than 2x of
      reclaim_required_space.
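As a rough sketch of the arithmetic above, the following computes the required log space in pages and derives the two pressure flags from it. The struct and function names are hypothetical, and counting in pages rather than sectors is a simplification made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the R5C_LOG_TIGHT / R5C_LOG_CRITICAL flags. */
struct log_pressure {
	int tight;     /* free space < 3x reclaim_required_space */
	int critical;  /* free space < 2x reclaim_required_space */
};

/* Up to (raid_disks + 1) pages per cached stripe: 1 meta page plus
 * raid_disks pages for data and parity. */
uint64_t required_pages_to_flush(unsigned int raid_disks,
				 uint64_t cached_stripes)
{
	return ((uint64_t)raid_disks + 1) * cached_stripes;
}

struct log_pressure classify_log_space(uint64_t free_pages, uint64_t required)
{
	struct log_pressure p;

	p.tight = free_pages < 3 * required;
	p.critical = free_pages < 2 * required;
	return p;
}
```

For a 4-disk array with 100 cached stripes this gives 500 pages, so the log would count as tight below 1500 free pages and critical below 1000.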
      
      r5c_cache keeps all data in cache (not fully committed to RAID) in
      a list (stripe_in_journal_list). These stripes are in the order of their
      first appearance on the journal. So the log tail (last_checkpoint)
      should point to the journal_start of the first item in the list.
      
      When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
      stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
      set, the state machine only writes data that are already in the
      log device (in stripe_in_journal_list).
      
      This patch includes a fix to improve performance by
      Shaohua Li <shli@fb.com>.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      a39f7afd
    •
      md/r5cache: caching phase of r5cache · 1e6d690b
      Song Liu authored
      As described in previous patch, write back cache operates in two
      phases: caching and writing-out. The caching phase works as:
      1. write data to journal
         (r5c_handle_stripe_dirtying, r5c_cache_data)
      2. call bio_endio
         (r5c_handle_data_cached, r5c_return_dev_pending_writes).
      
      Then the writing-out phase is as:
      1. Mark the stripe as write-out (r5c_make_stripe_write_out)
      2. Calculate parity (reconstruct or RMW)
      3. Write parity (and maybe some other data) to journal device
      4. Write data and parity to RAID disks
      
      This patch implements caching phase. The cache is integrated with
      stripe cache of raid456. It leverages code of r5l_log to write
      data to journal device.
      
      Writing-out phase of the cache is implemented in the next patch.
      
      With r5cache, a write operation does not wait for parity calculation
      and write out, so the write latency is lower (1 write to journal
      device vs. read and then write to raid disks). Also, r5cache will
      reduce RAID overhead (multiple IO due to read-modify-write of
      parity) and provide more opportunities for full stripe writes.
      
      This patch adds 2 flags to stripe_head.state:
       - STRIPE_R5C_PARTIAL_STRIPE,
       - STRIPE_R5C_FULL_STRIPE,
      
      Instead of inactive_list, stripes with cached data are tracked in
      r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
      STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
      stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
      are not considered as "active".
      
      For RMW, the code allocates an extra page for each data block
      being updated.  This is stored in r5dev->orig_page and the old data
      is read into it.  Then the prexor calculation subtracts ->orig_page
      from the parity block, and the reconstruct calculation adds the
      ->page data back into the parity block.
      
      r5cache naturally excludes SkipCopy. When the array has write back
      cache, async_copy_data() will not skip copy.
      
      There are some known limitations of the cache implementation:
      
      1. Write cache only covers full page writes (R5_OVERWRITE). Writes
         of smaller granularity are write through.
      2. Only one log io (sh->log_io) for each stripe at any time. Later
         writes for the same stripe have to wait. This can be improved by
         moving log_io to r5dev.
      3. With writeback cache, read path must enter state machine, which
         is a significant bottleneck for some workloads.
      4. There is no per stripe checkpoint (with r5l_payload_flush) in
         the log, so recovery code has to replay more than necessary data
         (sometimes all the log from last_checkpoint). This reduces
         availability of the array.
      
      This patch includes a fix proposed by ZhengYuan Liu
      <liuzhengyuan@kylinos.cn>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      1e6d690b
    •
      md/r5cache: State machine for raid5-cache write back mode · 2ded3703
      Song Liu authored
      This patch adds state machine for raid5-cache. With log device, the
      raid456 array could operate in two different modes (r5c_journal_mode):
        - write-back (R5C_MODE_WRITE_BACK)
        - write-through (R5C_MODE_WRITE_THROUGH)
      
      Existing code of raid5-cache only has write-through mode. For write-back
      cache, it is necessary to extend the state machine.
      
      With write-back cache, every stripe could operate in two different
      phases:
        - caching
        - writing-out
      
      In caching phase, the stripe handles writes as:
        - write to journal
        - return IO
      
      In writing-out phase, the stripe behaves as a stripe in write-through
      mode (R5C_MODE_WRITE_THROUGH).
      
      STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
      writing-out phase.
      
      Please note: this is a "no-op" patch for raid5-cache write-through
      mode.
      
      The following detailed explanation is copied from the raid5-cache.c:
      
      /*
       * raid5 cache state machine
       *
       * With the RAID cache, each stripe works in two phases:
       *      - caching phase
       *      - writing-out phase
       *
       * These two phases are controlled by bit STRIPE_R5C_CACHING:
       *   if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
       *   if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
       *
       * When there is no journal, or the journal is in write-through mode,
       * the stripe is always in writing-out phase.
       *
       * For write-back journal, the stripe is sent to caching phase on write
       * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
       * the write-out phase by clearing STRIPE_R5C_CACHING.
       *
       * Stripes in caching phase do not write the raid disks. Instead, all
       * writes are committed from the log device. Therefore, a stripe in
       * caching phase handles writes as:
       *      - write to log device
       *      - return IO
       *
       * Stripes in writing-out phase handle writes as:
       *      - calculate parity
       *      - write pending data and parity to journal
       *      - write data and parity to raid disks
       *      - return IO for pending writes
       */
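The two-phase behavior in the comment above can be modeled with a single flag. This is a toy sketch with hypothetical type and function names, not the kernel's stripe_head handling; it ignores locking and the case where a write arrives while a stripe is already in the middle of write-out.

```c
#include <assert.h>

enum stripe_phase { PHASE_WRITING_OUT, PHASE_CACHING };
enum journal_mode { MODE_NONE, MODE_WRITE_THROUGH, MODE_WRITE_BACK };

/* Toy stripe: `caching` models the STRIPE_R5C_CACHING bit. */
struct toy_stripe {
	int caching;
};

enum stripe_phase phase_of(const struct toy_stripe *sh)
{
	return sh->caching ? PHASE_CACHING : PHASE_WRITING_OUT;
}

/* A write sends the stripe to the caching phase only with a
 * write-back journal; otherwise it stays in the writing-out phase. */
void on_write(struct toy_stripe *sh, enum journal_mode mode)
{
	if (mode == MODE_WRITE_BACK)
		sh->caching = 1;
}

/* Models r5c_make_stripe_write_out(): clear the bit to start write-out. */
void make_stripe_write_out(struct toy_stripe *sh)
{
	sh->caching = 0;
}
```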
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      2ded3703
    •
      md/r5cache: Check array size in r5l_init_log · c757ec95
      Song Liu authored
      Currently, r5l_write_stripe checks meta size for each stripe write,
      which is not necessary.
      
      With this patch, r5l_init_log checks maximal meta size of the array,
      which is (r5l_meta_block + raid_disks x r5l_payload_data_parity).
      If this is too big to fit in one page, r5l_init_log aborts.
      
      With current meta data, r5l_log supports raid_disks up to 203.
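The limit of 203 can be reproduced with simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the concrete sizes used in the note after it (a 4096-byte page, a 32-byte meta block header, and 20 bytes per disk for one payload entry with one checksum) are assumptions chosen for this example rather than values stated in the commit message.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Largest raid_disks such that
 *   meta_header + raid_disks * per_disk_payload <= page_size.
 * All three sizes are supplied by the caller; see the assumptions in
 * the surrounding text.
 */
unsigned int max_raid_disks(size_t page_size, size_t meta_header,
			    size_t per_disk_payload)
{
	if (per_disk_payload == 0 || page_size < meta_header)
		return 0;
	return (unsigned int)((page_size - meta_header) / per_disk_payload);
}
```

Under those assumed sizes, (4096 - 32) / 20 rounds down to 203, which matches the figure quoted above.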
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      c757ec95
  11. 18 Nov, 2016 1 commit
  12. 08 Nov, 2016 1 commit
  13. 01 Nov, 2016 1 commit
  14. 29 Oct, 2016 1 commit
  15. 25 Oct, 2016 2 commits
    •
      md/raid5: write an empty meta-block when creating log super-block · 56056c2e
      Zhengyuan Liu authored
      If the superblock points to an invalid meta block, r5l_load_log will set
      create_super to true and create a new superblock. This runtime path
      would always be taken if no write I/O has been issued to the array since
      it was created. Writing an empty meta block when we first create the log
      superblock avoids this unnecessary action.
      
      Another reason is the correctness of log recovery. Currently we have
      the code below to guarantee that log recovery is correct.
      
              if (ctx.seq > log->last_cp_seq + 1) {
                      int ret;
      
                      ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
                      if (ret)
                              return ret;
                      log->seq = ctx.seq + 11;
                      log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
                      r5l_write_super(log, ctx.pos);
              } else {
                      log->log_start = ctx.pos;
                      log->seq = ctx.seq;
              }
      
      If we just created an array with a journal device, log->log_start and
      log->last_checkpoint should both be 0. Suppose we then write three meta
      blocks which are all valid except the middle one, and a crash happens.
      ctx.seq would equal log->last_cp_seq + 1, and log->log_start would be
      set to the position of the invalid middle meta block after recovery.
      This leads to problems which could be avoided with this patch.
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Shaohua Li <shli@fb.com>
      56056c2e
    •
      md/raid5: initialize next_checkpoint field before use · 28cd88e2
      Zhengyuan Liu authored
      This field was not initialized when we load/recover the log; it was
      only assigned when IO to the raid disk finished. So r5l_quiesce may
      use a wrong next_checkpoint to reclaim log space, which would confuse
      the reclaimable space calculation.
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Shaohua Li <shli@fb.com>
      28cd88e2