1. 19 11月, 2016 5 次提交
    • S
      md/r5cache: sysfs entry journal_mode · 2c7da14b
      Song Liu 提交于
      With write cache, journal_mode is the knob to switch between
      write-back and write-through.
      
      Below is an example:
      
      root@virt-test:~/# cat /sys/block/md0/md/journal_mode
      [write-through] write-back
      root@virt-test:~/# echo write-back > /sys/block/md0/md/journal_mode
      root@virt-test:~/# cat /sys/block/md0/md/journal_mode
      write-through [write-back]
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2c7da14b
    • S
      md/r5cache: write-out phase and reclaim support · a39f7afd
      Song Liu 提交于
      There are two limited resources, stripe cache and journal disk space.
      For better performance, we priotize reclaim of full stripe writes.
      To free up more journal space, we free earliest data on the journal.
      
      In current implementation, reclaim happens when:
      1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
         if there is no reclaim in the past 5 seconds.
      2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
         or cached stripes is enough for a full stripe (chunk size / 4k)
         (r5c_check_cached_full_stripe)
      3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
      4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
      
      r5c_do_reclaim() contains new logic of reclaim.
      
      For stripe cache:
      
      When stripe cache pressure is high (more than 3/4 stripes are cached,
      or there is empty inactive lists), flush all full stripe. If fewer
      than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
      are flushed, flush some paritial stripes. When stripe cache pressure
      is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
      
      For log space:
      
      To avoid deadlock due to log space, we need to reserve enough space
      to flush cached data. The size of required log space depends on total
      number of cached stripes (stripe_in_journal_count). In current
      implementation, the writing-out phase automatically include pending
      data writes with parity writes (similar to write through case).
      Therefore, we need up to (conf->raid_disks + 1) pages for each cached
      stripe (1 page for meta data, raid_disks pages for all data and
      parity). r5c_log_required_to_flush_cache() calculates log space
      required to flush cache. In the following, we refer to the space
      calculated by r5c_log_required_to_flush_cache() as
      reclaim_required_space.
      
      Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
      R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
      device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
      is set when free space on the log device is less than 2x of
      reclaim_required_space.
      
      r5c_cache keeps all data in cache (not fully committed to RAID) in
      a list (stripe_in_journal_list). These stripes are in the order of their
      first appearance on the journal. So the log tail (last_checkpoint)
      should point to the journal_start of the first item in the list.
      
      When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
      stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
      set, the state machine only writes data that are already in the
      log device (in stripe_in_journal_list).
      
      This patch includes a fix to improve performance by
      Shaohua Li <shli@fb.com>.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      a39f7afd
    • S
      md/r5cache: caching phase of r5cache · 1e6d690b
      Song Liu 提交于
      As described in previous patch, write back cache operates in two
      phases: caching and writing-out. The caching phase works as:
      1. write data to journal
         (r5c_handle_stripe_dirtying, r5c_cache_data)
      2. call bio_endio
         (r5c_handle_data_cached, r5c_return_dev_pending_writes).
      
      Then the writing-out phase is as:
      1. Mark the stripe as write-out (r5c_make_stripe_write_out)
      2. Calcualte parity (reconstruct or RMW)
      3. Write parity (and maybe some other data) to journal device
      4. Write data and parity to RAID disks
      
      This patch implements caching phase. The cache is integrated with
      stripe cache of raid456. It leverages code of r5l_log to write
      data to journal device.
      
      Writing-out phase of the cache is implemented in the next patch.
      
      With r5cache, write operation does not wait for parity calculation
      and write out, so the write latency is lower (1 write to journal
      device vs. read and then write to raid disks). Also, r5cache will
      reduce RAID overhead (multipile IO due to read-modify-write of
      parity) and provide more opportunities of full stripe writes.
      
      This patch adds 2 flags to stripe_head.state:
       - STRIPE_R5C_PARTIAL_STRIPE,
       - STRIPE_R5C_FULL_STRIPE,
      
      Instead of inactive_list, stripes with cached data are tracked in
      r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
      STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
      stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
      are not considered as "active".
      
      For RMW, the code allocates an extra page for each data block
      being updated.  This is stored in r5dev->orig_page and the old data
      is read into it.  Then the prexor calculation subtracts ->orig_page
      from the parity block, and the reconstruct calculation adds the
      ->page data back into the parity block.
      
      r5cache naturally excludes SkipCopy. When the array has write back
      cache, async_copy_data() will not skip copy.
      
      There are some known limitations of the cache implementation:
      
      1. Write cache only covers full page writes (R5_OVERWRITE). Writes
         of smaller granularity are write through.
      2. Only one log io (sh->log_io) for each stripe at anytime. Later
         writes for the same stripe have to wait. This can be improved by
         moving log_io to r5dev.
      3. With writeback cache, read path must enter state machine, which
         is a significant bottleneck for some workloads.
      4. There is no per stripe checkpoint (with r5l_payload_flush) in
         the log, so recovery code has to replay more than necessary data
         (sometimes all the log from last_checkpoint). This reduces
         availability of the array.
      
      This patch includes a fix proposed by ZhengYuan Liu
      <liuzhengyuan@kylinos.cn>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      1e6d690b
    • S
      md/r5cache: State machine for raid5-cache write back mode · 2ded3703
      Song Liu 提交于
      This patch adds state machine for raid5-cache. With log device, the
      raid456 array could operate in two different modes (r5c_journal_mode):
        - write-back (R5C_MODE_WRITE_BACK)
        - write-through (R5C_MODE_WRITE_THROUGH)
      
      Existing code of raid5-cache only has write-through mode. For write-back
      cache, it is necessary to extend the state machine.
      
      With write-back cache, every stripe could operate in two different
      phases:
        - caching
        - writing-out
      
      In caching phase, the stripe handles writes as:
        - write to journal
        - return IO
      
      In writing-out phase, the stripe behaviors as a stripe in write through
      mode R5C_MODE_WRITE_THROUGH.
      
      STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
      writing-out phase.
      
      Please note: this is a "no-op" patch for raid5-cache write-through
      mode.
      
      The following detailed explanation is copied from the raid5-cache.c:
      
      /*
       * raid5 cache state machine
       *
       * With rhe RAID cache, each stripe works in two phases:
       *      - caching phase
       *      - writing-out phase
       *
       * These two phases are controlled by bit STRIPE_R5C_CACHING:
       *   if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
       *   if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
       *
       * When there is no journal, or the journal is in write-through mode,
       * the stripe is always in writing-out phase.
       *
       * For write-back journal, the stripe is sent to caching phase on write
       * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
       * the write-out phase by clearing STRIPE_R5C_CACHING.
       *
       * Stripes in caching phase do not write the raid disks. Instead, all
       * writes are committed from the log device. Therefore, a stripe in
       * caching phase handles writes as:
       *      - write to log device
       *      - return IO
       *
       * Stripes in writing-out phase handle writes as:
       *      - calculate parity
       *      - write pending data and parity to journal
       *      - write data and parity to raid disks
       *      - return IO for pending writes
       */
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2ded3703
    • S
      md/r5cache: move some code to raid5.h · 937621c3
      Song Liu 提交于
      Move some define and inline functions to raid5.h, so they can be
      used in raid5-cache.c
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      937621c3
  2. 08 11月, 2016 2 次提交
  3. 22 9月, 2016 3 次提交
  4. 10 9月, 2016 1 次提交
  5. 07 9月, 2016 1 次提交
  6. 01 9月, 2016 1 次提交
  7. 25 8月, 2016 3 次提交
    • S
      raid5: avoid unnecessary bio data set · 45c91d80
      Shaohua Li 提交于
      bio_reset doesn't change bi_io_vec and bi_max_vecs, so we don't need to
      set them every time. bi_private will be set before the bio is
      dispatched.
      Signed-off-by: NShaohua Li <shli@fb.com>
      45c91d80
    • S
      raid5: fix memory leak of bio integrity data · 5f9d1fde
      Shaohua Li 提交于
      Yi reported a memory leak of raid5 with DIF/DIX enabled disks. raid5
      doesn't alloc/free bio, instead it reuses bios. There are two issues in
      current code:
      1. the code calls bio_init (from
      init_stripe->raid5_build_block->bio_init) then bio_reset (ops_run_io).
      The bio is reused, so likely there is integrity data attached. bio_init
      will clear a pointer to integrity data and makes bio_reset can't release
      the data
      2. bio_reset is called before dispatching bio. After bio is finished,
      it's possible we don't free bio's integrity data (eg, we don't call
      bio_reset again)
      Both issues will cause memory leak. The patch moves bio_init to stripe
      creation and bio_reset to bio end io. This will fix the two issues.
      Reported-by: NYi Zhang <yizhan@redhat.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      5f9d1fde
    • S
      r5cache: set MD_JOURNAL_CLEAN correctly · 486b0f7b
      Song Liu 提交于
      Currently, the code sets MD_JOURNAL_CLEAN when the array has
      MD_FEATURE_JOURNAL and the recovery_cp is MaxSector. The array
      will be MD_JOURNAL_CLEAN even if the journal device is missing.
      
      With this patch, the MD_JOURNAL_CLEAN is only set when the journal
      device presents.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      486b0f7b
  8. 08 8月, 2016 1 次提交
    • J
      block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe 提交于
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokeness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      1eff9d32
  9. 06 8月, 2016 1 次提交
    • A
      md: Prevent IO hold during accessing to faulty raid5 array · 11367799
      Alexey Obitotskiy 提交于
      After array enters in faulty state (e.g. number of failed drives
      becomes more then accepted for raid5 level) it sets error flags
      (one of this flags is MD_CHANGE_PENDING). For internal metadata
      arrays MD_CHANGE_PENDING cleared into md_update_sb, but not for
      external metadata arrays. MD_CHANGE_PENDING flag set prevents to
      finish all new or non-finished IOs to array and hold them in
      pending state. In some cases this can leads to deadlock situation.
      
      For example, we have faulty array (2 of 4 drives failed) and
      udev handle array state changes and blkid started (or other
      userspace application that used array to read/write) but unable
      to finish reads due to IO hold. At the same time we unable to get
      exclusive access to array (to stop array in our case) because
      another external application still use this array.
      
      Fix makes possible to return IO with errors immediately.
      So external application can finish working with array and
      give exclusive access to other applications to perform
      required management actions with array.
      Signed-off-by: NAlexey Obitotskiy <aleksey.obitotskiy@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      11367799
  10. 02 8月, 2016 1 次提交
    • Z
      raid5: fix incorrectly counter of conf->empty_inactive_list_nr · ff00d3b4
      ZhengYuan Liu 提交于
      The counter conf->empty_inactive_list_nr is only used for determine if the
      raid5 is congested which is deal with in function raid5_congested().
      It was increased in get_free_stripe() when conf->inactive_list got to be
      empty and decreased in release_inactive_stripe_list() when splice
      temp_inactive_list to conf->inactive_list. However, this may have a
      problem when raid5_get_active_stripe or stripe_add_to_batch_list was called,
      because these two functions may call list_del_init(&sh->lru) to delete sh from
      "conf->inactive_list + hash" which may cause "conf->inactive_list + hash" to
      be empty when atomic_inc_not_zero(&sh->count) got false. So a check should be
      done at these two point and increase empty_inactive_list_nr accordingly.
      Otherwise the counter may get to be negative number which would influence
      async readahead from VFS.
      Signed-off-by: NZhengYuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: NShaohua Li <shli@fb.com>
      ff00d3b4
  11. 21 7月, 2016 1 次提交
  12. 14 6月, 2016 5 次提交
  13. 08 6月, 2016 3 次提交
  14. 26 5月, 2016 1 次提交
  15. 10 5月, 2016 2 次提交
    • G
      md: set MD_CHANGE_PENDING in a atomic region · 85ad1d13
      Guoqing Jiang 提交于
      Some code waits for a metadata update by:
      
      1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
      2. setting MD_CHANGE_PENDING and waking the management thread
      3. waiting for MD_CHANGE_PENDING to be cleared
      
      If the first two are done without locking, the code in md_update_sb()
      which checks if it needs to repeat might test if an update is needed
      before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
      in the wait returning early.
      
      So make sure all places that set MD_CHANGE_PENDING are atomicial, and
      bit_clear_unless (suggested by Neil) is introduced for the purpose.
      
      Cc: Martin Kepplinger <martink@posteo.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: <linux-kernel@vger.kernel.org>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NGuoqing Jiang <gqjiang@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      85ad1d13
    • H
      md: raid5: add prerequisite to run underneath dm-raid · fe67d19a
      Heinz Mauelshagen 提交于
      In case md runs underneath the dm-raid target, the mddev does not have
      a request queue or gendisk, thus avoid accesses.
      
      This patch adds a missing conditional to the raid5 personality.
      Signed-of-by: NHeinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      fe67d19a
  16. 30 4月, 2016 1 次提交
  17. 18 3月, 2016 1 次提交
  18. 10 3月, 2016 2 次提交
    • S
      md/raid5: output stripe state for debug · fb3229d5
      Shaohua Li 提交于
      Neil recently fixed an obscure race in break_stripe_batch_list. Debug would be
      quite convenient if we know the stripe state. This is what this patch does.
      Signed-off-by: NShaohua Li <shli@fb.com>
      fb3229d5
    • N
      md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_list · 550da24f
      NeilBrown 提交于
      break_stripe_batch_list breaks up a batch and copies some flags from
      the batch head to the members, preserving others.
      
      It doesn't preserve or copy STRIPE_PREREAD_ACTIVE.  This is not
      normally a problem as STRIPE_PREREAD_ACTIVE is cleared when a
      stripe_head is added to a batch, and is not set on stripe_heads
      already in a batch.
      
      However there is no locking to ensure one thread doesn't set the flag
      after it has just been cleared in another.  This does occasionally happen.
      
      md/raid5 maintains a count of the number of stripe_heads with
      STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes.  When
      break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently
      this could becomes incorrect and will never again return to zero.
      
      md/raid5 delays the handling of some stripe_heads until
      preread_active_stripes becomes zero.  So when the above mention race
      happens, those stripe_heads become blocked and never progress,
      resulting is write to the array handing.
      
      So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE
      in the members of a batch.
      
      URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741
      URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153
      URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz
      Reported-by: Martin Svec <martin.svec@zoner.cz> (and others)
      Tested-by: NTom Weber <linux@junkyard.4t2.com>
      Fixes: 1b956f7a ("md/raid5: be more selective about distributing flags across batch.")
      Cc: stable@vger.kernel.org (v4.1 and later)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      550da24f
  19. 27 2月, 2016 2 次提交
    • S
      RAID5: revert e9e4c377 to fix a livelock · 6ab2a4b8
      Shaohua Li 提交于
      Revert commit
      e9e4c377(md/raid5: per hash value and exclusive wait_for_stripe)
      
      The problem is raid5_get_active_stripe waits on
      conf->wait_for_stripe[hash]. Assume hash is 0. My test release stripes
      in this order:
      - release all stripes with hash 0
      - raid5_get_active_stripe still sleeps since active_stripes >
        max_nr_stripes * 3 / 4
      - release all stripes with hash other than 0. active_stripes becomes 0
      - raid5_get_active_stripe still sleeps, since nobody wakes up
        wait_for_stripe[0]
      The system live locks. The problem is active_stripes isn't a per-hash
      count. Revert the patch makes the live lock go away.
      
      Cc: stable@vger.kernel.org (v4.2+)
      Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: NeilBrown <neilb@suse.de>
      Signed-off-by: NShaohua Li <shli@fb.com>
      6ab2a4b8
    • S
      RAID5: check_reshape() shouldn't call mddev_suspend · 27a353c0
      Shaohua Li 提交于
      check_reshape() is called from raid5d thread. raid5d thread shouldn't
      call mddev_suspend(), because mddev_suspend() waits for all IO finish
      but IO is handled in raid5d thread, we could easily deadlock here.
      
      This issue is introduced by
      738a2738 ("md/raid5: fix allocation of 'scribble' array.")
      
      Cc: stable@vger.kernel.org (v4.1+)
      Reported-and-tested-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      27a353c0
  20. 26 2月, 2016 1 次提交
  21. 21 1月, 2016 1 次提交
  22. 06 1月, 2016 1 次提交
    • S
      raid5-cache: add journal hot add/remove support · f6b6ec5c
      Shaohua Li 提交于
      Add support for journal disk hot add/remove. Mostly trival checks in md
      part. The raid5 part is a little tricky. For hot-remove, we can't wait
      pending write as it's called from raid5d. The wait will cause deadlock.
      We simplily fail the hot-remove. A hot-remove retry can success
      eventually since if journal disk is faulty all pending write will be
      failed and finish. For hot-add, since an array supporting journal but
      without journal disk will be marked read-only, we are safe to hot add
      journal without stopping IO (should be read IO, while journal only
      handles write IO).
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      f6b6ec5c