1. 19 Nov 2016, 2 commits
    • md/r5cache: State machine for raid5-cache write back mode · 2ded3703
      Song Liu committed
      This patch adds a state machine for raid5-cache. With a log device, the
      raid456 array can operate in two different modes (r5c_journal_mode):
        - write-back (R5C_MODE_WRITE_BACK)
        - write-through (R5C_MODE_WRITE_THROUGH)
      
      The existing raid5-cache code supports only write-through mode. For the
      write-back cache, the state machine must be extended.
      
      With the write-back cache, every stripe can operate in one of two
      phases:
        - caching
        - writing-out
      
      In the caching phase, the stripe handles writes as:
        - write to journal
        - return IO
      
      In the writing-out phase, the stripe behaves like a stripe in
      write-through mode (R5C_MODE_WRITE_THROUGH).
      
      STRIPE_R5C_CACHING is added to sh->state to differentiate between the
      caching and writing-out phases.
      
      Please note: this is a "no-op" patch for raid5-cache write-through
      mode.
      
      The following detailed explanation is copied from raid5-cache.c:
      
      /*
       * raid5 cache state machine
       *
       * With the RAID cache, each stripe works in two phases:
       *      - caching phase
       *      - writing-out phase
       *
       * These two phases are controlled by bit STRIPE_R5C_CACHING:
       *   if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
       *   if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
       *
       * When there is no journal, or the journal is in write-through mode,
       * the stripe is always in writing-out phase.
       *
       * For write-back journal, the stripe is sent to caching phase on write
       * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
       * the write-out phase by clearing STRIPE_R5C_CACHING.
       *
       * Stripes in caching phase do not write the raid disks. Instead, all
       * writes are committed from the log device. Therefore, a stripe in
       * caching phase handles writes as:
       *      - write to log device
       *      - return IO
       *
       * Stripes in writing-out phase handle writes as:
       *      - calculate parity
       *      - write pending data and parity to journal
       *      - write data and parity to raid disks
       *      - return IO for pending writes
       */
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
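      The phase decision described above reduces to a single bit test. As a
      minimal illustrative sketch (not part of the patch; the helper name is
      hypothetical), using the STRIPE_R5C_CACHING bit this commit adds:

        /* hypothetical helper: a stripe's phase is just one bit in sh->state */
        static bool r5c_stripe_is_caching(struct stripe_head *sh)
        {
                /* set:   caching phase - write to journal, return IO     */
                /* clear: writing-out phase - behave as in write-through  */
                return test_bit(STRIPE_R5C_CACHING, &sh->state);
        }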
    • md/r5cache: Check array size in r5l_init_log · c757ec95
      Song Liu committed
      Currently, r5l_write_stripe checks the meta size for each stripe write,
      which is not necessary.
      
      With this patch, r5l_init_log checks the maximal meta size for the
      array, which is (r5l_meta_block + raid_disks x r5l_payload_data_parity).
      If that is too big to fit in one page, r5l_init_log aborts.
      
      With the current metadata, r5l_log supports up to 203 raid_disks.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
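      The 203-disk figure checks out if one assumes the v4.9-era layouts (a
      32-byte r5l_meta_block and 20 bytes per disk: a 16-byte
      r5l_payload_data_parity plus one __le32 checksum): 32 + 20 x 203 = 4092
      bytes fits a 4096-byte page, while 204 disks would need 4112. A sketch
      of the check under that assumption (not the exact patch hunk):

        /* sketch: abort log init if a full-stripe meta block could ever
         * overflow one page (assumes one checksum per payload descriptor) */
        if (sizeof(struct r5l_meta_block) +
            (sizeof(struct r5l_payload_data_parity) + sizeof(__le32)) *
            conf->raid_disks > PAGE_SIZE)
                return -EINVAL;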
  2. 18 Nov 2016, 1 commit
  3. 08 Nov 2016, 1 commit
  4. 29 Oct 2016, 1 commit
  5. 25 Oct 2016, 2 commits
    • md/raid5: write an empty meta-block when creating log super-block · 56056c2e
      Zhengyuan Liu committed
      If the superblock points to an invalid meta block, r5l_load_log will
      set create_super to true and create a new superblock. This path is
      always taken if no write I/O has been issued to the array since it was
      created. Writing an empty meta block avoids this unnecessary action the
      first time we create the log superblock.
      
      Another reason is the correctness of log recovery. Currently we have
      the code below to guarantee that log recovery is correct.
      
              if (ctx.seq > log->last_cp_seq + 1) {
                      int ret;
      
                      ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
                      if (ret)
                              return ret;
                      log->seq = ctx.seq + 11;
                      log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
                      r5l_write_super(log, ctx.pos);
              } else {
                      log->log_start = ctx.pos;
                      log->seq = ctx.seq;
              }
      
      If we have just created an array with a journal device, log->log_start
      and log->last_checkpoint should both be 0. Suppose we then write three
      meta blocks, all valid except the middle one, and a crash happens:
      after recovery, ctx.seq would equal log->last_cp_seq + 1 and
      log->log_start would be set to the position of the middle (invalid)
      meta block, which leads to problems that this patch avoids.
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Shaohua Li <shli@fb.com>
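      As a rough sketch (simplified, error handling omitted), the
      create_super path in r5l_load_log() with this patch looks like:

        if (create_super) {
                log->last_cp_seq = prandom_u32();
                cp = 0;
                /* seed the log with a valid empty meta block so recovery
                 * after an early crash never parses garbage at cp */
                r5l_log_write_empty_meta_block(log, cp, log->last_cp_seq);
                /* point the superblock at it before any data is logged */
                r5l_write_super(log, cp);
        }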
    • md/raid5: initialize next_checkpoint field before use · 28cd88e2
      Zhengyuan Liu committed
      This field was not initialized when loading/recovering the log; it was
      assigned only when I/O to the raid disks finished. So r5l_quiesce may
      use a wrong next_checkpoint to reclaim log space, which would confuse
      the reclaimable-space calculation.
      Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: Shaohua Li <shli@fb.com>
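      The fix amounts to seeding the field when the log is loaded, roughly
      (a sketch, not the exact hunk):

        /* in r5l_load_log(), once the checkpoint position cp is known: */
        log->last_checkpoint = cp;
        log->next_checkpoint = cp;  /* was left 0 until the first IO completed */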
  6. 01 Sep 2016, 1 commit
    • raid5-cache: fix a deadlock in superblock write · 8e018c21
      Shaohua Li committed
      There is a potential deadlock in superblock write. Discard could zero
      data, so before a discard we must make sure the superblock has been
      updated to the new log tail. Updating the superblock (either by calling
      md_update_sb() directly or by depending on the md thread) requires
      holding the reconfig mutex. On the other hand, raid5_quiesce is called
      with reconfig_mutex held. The first step of raid5_quiesce() is to wait
      for all IO to finish, and hence for the reclaim thread, while the
      reclaim thread is calling this function and waiting for the reconfig
      mutex. So there is a deadlock. We work around this issue with a
      trylock. The downside of the solution is that we could miss a discard
      if we can't take the reconfig mutex. But this should happen rarely
      (mainly at raid array stop), so a missed discard shouldn't be a big
      problem.
      
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
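      A sketch of the reclaim-side workaround (mddev_trylock() and
      mddev_unlock() are existing md locking helpers; the surrounding code
      is simplified):

        /* flag the superblock update, then back off instead of blocking
         * if reconfig_mutex is contended (e.g. during array stop) */
        set_bit(MD_CHANGE_DEVS, &mddev->flags);
        set_bit(MD_CHANGE_PENDING, &mddev->flags);
        if (!mddev_trylock(mddev))
                return;         /* skip this discard round; rare in practice */
        md_update_sb(mddev, 1);
        mddev_unlock(mddev);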
  7. 08 Aug 2016, 1 commit
    • block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe committed
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokenness linger,
      rename the member to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: Jens Axboe <axboe@fb.com>
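      A sketch of the post-rename idiom (bio_set_op_attrs() and bio_op()
      are the accessors of that era; exact flag names vary by kernel
      version, and the caller below is hypothetical):

        /* old, now a compile error:  bio->bi_rw = WRITE | REQ_SYNC;  */
        /* new: set the op and flags via the accessor, read them back */
        bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC);
        if (bio_op(bio) == REQ_OP_WRITE && (bio->bi_opf & REQ_SYNC))
                handle_sync_write(bio);  /* hypothetical caller */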
  8. 08 Jun 2016, 3 commits
  9. 10 May 2016, 1 commit
    • md: set MD_CHANGE_PENDING in an atomic region · 85ad1d13
      Guoqing Jiang committed
      Some code waits for a metadata update by:
      
      1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
      2. setting MD_CHANGE_PENDING and waking the management thread
      3. waiting for MD_CHANGE_PENDING to be cleared
      
      If the first two are done without locking, the code in md_update_sb()
      which checks if it needs to repeat might test if an update is needed
      before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
      in the wait returning early.
      
      So make sure all places that set MD_CHANGE_PENDING are atomic, and
      bit_clear_unless (suggested by Neil) is introduced for the purpose.
      
      Cc: Martin Kepplinger <martink@posteo.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: <linux-kernel@vger.kernel.org>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
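      A sketch of the resulting retry pattern in md_update_sb():
      bit_clear_unless(ptr, clear, test) atomically clears the 'clear'
      bits only if no 'test' bit is set, and returns whether it cleared:

        do {
                /* ... write the superblock out ... */
        } while (!bit_clear_unless(&mddev->flags, BIT(MD_CHANGE_PENDING),
                                   BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_CLEAN)));
        /* a DEVS/CLEAN request that races in keeps PENDING set and forces
         * another pass, so waiters can no longer return early */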
  10. 14 Apr 2016, 1 commit
  11. 14 Jan 2016, 2 commits
  12. 06 Jan 2016, 6 commits
  13. 01 Nov 2015, 18 commits