1. 19 11月, 2016 4 次提交
    • S
      md/r5cache: write-out phase and reclaim support · a39f7afd
      Song Liu 提交于
      There are two limited resources, stripe cache and journal disk space.
      For better performance, we priotize reclaim of full stripe writes.
      To free up more journal space, we free earliest data on the journal.
      
      In current implementation, reclaim happens when:
      1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 30 seconds) reclaim
         if there is no reclaim in the past 5 seconds.
      2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (256) cached full stripes,
         or cached stripes is enough for a full stripe (chunk size / 4k)
         (r5c_check_cached_full_stripe)
      3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
      4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
      
      r5c_do_reclaim() contains new logic of reclaim.
      
      For stripe cache:
      
      When stripe cache pressure is high (more than 3/4 stripes are cached,
      or there is empty inactive lists), flush all full stripe. If fewer
      than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
      are flushed, flush some paritial stripes. When stripe cache pressure
      is moderate (1/2 to 3/4 of stripes are cached), flush all full stripes.
      
      For log space:
      
      To avoid deadlock due to log space, we need to reserve enough space
      to flush cached data. The size of required log space depends on total
      number of cached stripes (stripe_in_journal_count). In current
      implementation, the writing-out phase automatically include pending
      data writes with parity writes (similar to write through case).
      Therefore, we need up to (conf->raid_disks + 1) pages for each cached
      stripe (1 page for meta data, raid_disks pages for all data and
      parity). r5c_log_required_to_flush_cache() calculates log space
      required to flush cache. In the following, we refer to the space
      calculated by r5c_log_required_to_flush_cache() as
      reclaim_required_space.
      
      Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
      R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
      device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
      is set when free space on the log device is less than 2x of
      reclaim_required_space.
      
      r5c_cache keeps all data in cache (not fully committed to RAID) in
      a list (stripe_in_journal_list). These stripes are in the order of their
      first appearance on the journal. So the log tail (last_checkpoint)
      should point to the journal_start of the first item in the list.
      
      When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
      stripes at the head of stripe_in_journal. When R5C_LOG_CRITICAL is
      set, the state machine only writes data that are already in the
      log device (in stripe_in_journal_list).
      
      This patch includes a fix to improve performance by
      Shaohua Li <shli@fb.com>.
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      a39f7afd
    • S
      md/r5cache: caching phase of r5cache · 1e6d690b
      Song Liu 提交于
      As described in previous patch, write back cache operates in two
      phases: caching and writing-out. The caching phase works as:
      1. write data to journal
         (r5c_handle_stripe_dirtying, r5c_cache_data)
      2. call bio_endio
         (r5c_handle_data_cached, r5c_return_dev_pending_writes).
      
      Then the writing-out phase is as:
      1. Mark the stripe as write-out (r5c_make_stripe_write_out)
      2. Calcualte parity (reconstruct or RMW)
      3. Write parity (and maybe some other data) to journal device
      4. Write data and parity to RAID disks
      
      This patch implements caching phase. The cache is integrated with
      stripe cache of raid456. It leverages code of r5l_log to write
      data to journal device.
      
      Writing-out phase of the cache is implemented in the next patch.
      
      With r5cache, write operation does not wait for parity calculation
      and write out, so the write latency is lower (1 write to journal
      device vs. read and then write to raid disks). Also, r5cache will
      reduce RAID overhead (multipile IO due to read-modify-write of
      parity) and provide more opportunities of full stripe writes.
      
      This patch adds 2 flags to stripe_head.state:
       - STRIPE_R5C_PARTIAL_STRIPE,
       - STRIPE_R5C_FULL_STRIPE,
      
      Instead of inactive_list, stripes with cached data are tracked in
      r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
      STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
      stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
      are not considered as "active".
      
      For RMW, the code allocates an extra page for each data block
      being updated.  This is stored in r5dev->orig_page and the old data
      is read into it.  Then the prexor calculation subtracts ->orig_page
      from the parity block, and the reconstruct calculation adds the
      ->page data back into the parity block.
      
      r5cache naturally excludes SkipCopy. When the array has write back
      cache, async_copy_data() will not skip copy.
      
      There are some known limitations of the cache implementation:
      
      1. Write cache only covers full page writes (R5_OVERWRITE). Writes
         of smaller granularity are write through.
      2. Only one log io (sh->log_io) for each stripe at anytime. Later
         writes for the same stripe have to wait. This can be improved by
         moving log_io to r5dev.
      3. With writeback cache, read path must enter state machine, which
         is a significant bottleneck for some workloads.
      4. There is no per stripe checkpoint (with r5l_payload_flush) in
         the log, so recovery code has to replay more than necessary data
         (sometimes all the log from last_checkpoint). This reduces
         availability of the array.
      
      This patch includes a fix proposed by ZhengYuan Liu
      <liuzhengyuan@kylinos.cn>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      1e6d690b
    • S
      md/r5cache: State machine for raid5-cache write back mode · 2ded3703
      Song Liu 提交于
      This patch adds state machine for raid5-cache. With log device, the
      raid456 array could operate in two different modes (r5c_journal_mode):
        - write-back (R5C_MODE_WRITE_BACK)
        - write-through (R5C_MODE_WRITE_THROUGH)
      
      Existing code of raid5-cache only has write-through mode. For write-back
      cache, it is necessary to extend the state machine.
      
      With write-back cache, every stripe could operate in two different
      phases:
        - caching
        - writing-out
      
      In caching phase, the stripe handles writes as:
        - write to journal
        - return IO
      
      In writing-out phase, the stripe behaviors as a stripe in write through
      mode R5C_MODE_WRITE_THROUGH.
      
      STRIPE_R5C_CACHING is added to sh->state to differentiate caching and
      writing-out phase.
      
      Please note: this is a "no-op" patch for raid5-cache write-through
      mode.
      
      The following detailed explanation is copied from the raid5-cache.c:
      
      /*
       * raid5 cache state machine
       *
       * With rhe RAID cache, each stripe works in two phases:
       *      - caching phase
       *      - writing-out phase
       *
       * These two phases are controlled by bit STRIPE_R5C_CACHING:
       *   if STRIPE_R5C_CACHING == 0, the stripe is in writing-out phase
       *   if STRIPE_R5C_CACHING == 1, the stripe is in caching phase
       *
       * When there is no journal, or the journal is in write-through mode,
       * the stripe is always in writing-out phase.
       *
       * For write-back journal, the stripe is sent to caching phase on write
       * (r5c_handle_stripe_dirtying). r5c_make_stripe_write_out() kicks off
       * the write-out phase by clearing STRIPE_R5C_CACHING.
       *
       * Stripes in caching phase do not write the raid disks. Instead, all
       * writes are committed from the log device. Therefore, a stripe in
       * caching phase handles writes as:
       *      - write to log device
       *      - return IO
       *
       * Stripes in writing-out phase handle writes as:
       *      - calculate parity
       *      - write pending data and parity to journal
       *      - write data and parity to raid disks
       *      - return IO for pending writes
       */
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      2ded3703
    • S
      md/r5cache: move some code to raid5.h · 937621c3
      Song Liu 提交于
      Move some define and inline functions to raid5.h, so they can be
      used in raid5-cache.c
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      937621c3
  2. 07 9月, 2016 1 次提交
  3. 27 2月, 2016 2 次提交
    • S
      RAID5: revert e9e4c377 to fix a livelock · 6ab2a4b8
      Shaohua Li 提交于
      Revert commit
      e9e4c377(md/raid5: per hash value and exclusive wait_for_stripe)
      
      The problem is raid5_get_active_stripe waits on
      conf->wait_for_stripe[hash]. Assume hash is 0. My test release stripes
      in this order:
      - release all stripes with hash 0
      - raid5_get_active_stripe still sleeps since active_stripes >
        max_nr_stripes * 3 / 4
      - release all stripes with hash other than 0. active_stripes becomes 0
      - raid5_get_active_stripe still sleeps, since nobody wakes up
        wait_for_stripe[0]
      The system live locks. The problem is active_stripes isn't a per-hash
      count. Revert the patch makes the live lock go away.
      
      Cc: stable@vger.kernel.org (v4.2+)
      Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: NeilBrown <neilb@suse.de>
      Signed-off-by: NShaohua Li <shli@fb.com>
      6ab2a4b8
    • S
      RAID5: check_reshape() shouldn't call mddev_suspend · 27a353c0
      Shaohua Li 提交于
      check_reshape() is called from raid5d thread. raid5d thread shouldn't
      call mddev_suspend(), because mddev_suspend() waits for all IO finish
      but IO is handled in raid5d thread, we could easily deadlock here.
      
      This issue is introduced by
      738a2738 ("md/raid5: fix allocation of 'scribble' array.")
      
      Cc: stable@vger.kernel.org (v4.1+)
      Reported-and-tested-by: NArtur Paszkiewicz <artur.paszkiewicz@intel.com>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      27a353c0
  4. 01 11月, 2015 3 次提交
  5. 24 10月, 2015 4 次提交
    • S
      raid5: log reclaim support · 0576b1c6
      Shaohua Li 提交于
      This is the reclaim support for raid5 log. A stripe write will have
      following steps:
      
      1. reconstruct the stripe, read data/calculate parity. ops_run_io
      prepares to write data/parity to raid disks
      2. hijack ops_run_io. stripe data/parity is appending to log disk
      3. flush log disk cache
      4. ops_run_io run again and do normal operation. stripe data/parity is
      written in raid array disks. raid core can return io to upper layer.
      5. flush cache of all raid array disks
      6. update super block
      7. log disk space used by the stripe can be reused
      
      In practice, several stripes consist of an io_unit and we will batch
      several io_unit in different steps, but the whole process doesn't
      change.
      
      It's possible io return just after data/parity hit log disk, but then
      read IO will need read from log disk. For simplicity, IO return happens
      at step 4, where read IO can directly read from raid disks.
      
      Currently reclaim run if there is specific reclaimable space (1/4 disk
      size or 10G) or we are out of space. Reclaim is just to free log disk
      spaces, it doesn't impact data consistency. The size based force reclaim
      is to make sure log isn't too big, so recovery doesn't scan log too
      much.
      
      Recovery make sure raid disks and log disk have the same data of a
      stripe. If crash happens before 4, recovery might/might not recovery
      stripe's data/parity depending on if data/parity and its checksum
      matches. In either case, this doesn't change the syntax of an IO write.
      After step 3, stripe is guaranteed recoverable, because stripe's
      data/parity is persistent in log disk. In some cases, log disk content
      and raid disks content of a stripe are the same, but recovery will still
      copy log disk content to raid disks, this doesn't impact data
      consistency. space reuse happens after superblock update and cache
      flush.
      
      There is one situation we want to avoid. A broken meta in the middle of
      a log causes recovery can't find meta at the head of log. If operations
      require meta at the head persistent in log, we must make sure meta
      before it persistent in log too. The case is stripe data/parity is in
      log and we start write stripe to raid disks (before step 4). stripe
      data/parity must be persistent in log before we do the write to raid
      disks. The solution is we restrictly maintain io_unit list order. In
      this case, we only write stripes of an io_unit to raid disks till the
      io_unit is the first one whose data/parity is in log.
      
      The io_unit list order is important for other cases too. For example,
      some io_unit are reclaimable and others not. They can be mixed in the
      list, we shouldn't reuse space of an unreclaimable io_unit.
      
      Includes fixes to problems which were...
      Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      0576b1c6
    • S
      raid5: add basic stripe log · f6bed0ef
      Shaohua Li 提交于
      This introduces a simple log for raid5. Data/parity writing to raid
      array first writes to the log, then write to raid array disks. If
      crash happens, we can recovery data from the log. This can speed up
      raid resync and fix write hole issue.
      
      The log structure is pretty simple. Data/meta data is stored in block
      unit, which is 4k generally. It has only one type of meta data block.
      The meta data block can track 3 types of data, stripe data, stripe
      parity and flush block. MD superblock will point to the last valid
      meta data block. Each meta data block has checksum/seq number, so
      recovery can scan the log correctly. We store a checksum of stripe
      data/parity to the metadata block, so meta data and stripe data/parity
      can be written to log disk together. otherwise, meta data write must
      wait till stripe data/parity is finished.
      
      For stripe data, meta data block will record stripe data sector and
      size. Currently the size is always 4k. This meta data record can be made
      simpler if we just fix write hole (eg, we can record data of a stripe's
      different disks together), but this format can be extended to support
      caching in the future, which must record data address/size.
      
      For stripe parity, meta data block will record stripe sector. It's
      size should be 4k (for raid5) or 8k (for raid6). We always store p
      parity first. This format should work for caching too.
      
      flush block indicates a stripe is in raid array disks. Fixing write
      hole doesn't need this type of meta data, it's for caching extension.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      f6bed0ef
    • S
      raid5: add a new state for stripe log handling · b70abcb2
      Shaohua Li 提交于
      When a stripe finishes construction, we write the stripe to raid in
      ops_run_io normally. With log, we do a bunch of other operations before
      the stripe is written to raid. Mainly write the stripe to log disk,
      flush disk cache and so on. The operations are still driven by raid5d
      and run in the stripe state machine. We introduce a new state for such
      stripe (trapped into log). The stripe is in this state from the time it
      first enters ops_run_io (finish construction) to the time it is written
      to raid. Since we know the state is only for log, we bypass other
      check/operation in handle_stripe.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      b70abcb2
    • S
      raid5: export some functions · 6d036f7d
      Shaohua Li 提交于
      Next several patches use some raid5 functions, rename them with raid5
      prefix and export out.
      Signed-off-by: NShaohua Li <shli@fb.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      6d036f7d
  6. 01 9月, 2015 2 次提交
    • N
      md/raid5: ensure device failure recorded before write request returns. · c3cce6cd
      NeilBrown 提交于
      When a write to one of the devices of a RAID5/6 fails, the failure is
      recorded in the metadata of the other devices so that after a restart
      the data on the failed drive wont be trusted even if that drive seems
      to be working again (maybe a cable was unplugged).
      
      Similarly when we record a bad-block in response to a write failure,
      we must not let the write complete until the bad-block update is safe.
      
      Currently there is no interlock between the write request completing
      and the metadata update.  So it is possible that the write will
      complete, the app will confirm success in some way, and then the
      machine will crash before the metadata update completes.
      
      This is an extremely small hole for a racy to fit in, but it is
      theoretically possible and so should be closed.
      
      So:
       - set MD_CHANGE_PENDING when requesting a metadata update for a
         failed device, so we can know with certainty when it completes
       - queue requests that completed when MD_CHANGE_PENDING is set to
         only be processed after the metadata update completes
       - call raid_end_bio_io() on bios in that queue when the time comes.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      c3cce6cd
    • N
      md/raid5: use bio_list for the list of bios to return. · 34a6f80e
      NeilBrown 提交于
      This will make it easier to splice two lists together which will
      be needed in future patch.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      34a6f80e
  7. 22 7月, 2015 1 次提交
    • N
      md/raid5: avoid races when changing cache size. · 2d5b569b
      NeilBrown 提交于
      Cache size can grow or shrink due to various pressures at
      any time.  So when we resize the cache as part of a 'grow'
      operation (i.e. change the size to allow more devices) we need
      to blocks that automatic growing/shrinking.
      
      So introduce a mutex.  auto grow/shrink uses mutex_trylock()
      and just doesn't bother if there is a blockage.
      Resizing the whole cache holds the mutex to ensure that
      the correct number of new stripes is allocated.
      
      This bug can result in some stripes not being freed when an
      array is stopped.  This leads to the kmem_cache not being
      freed and a subsequent array can try to use the same kmem_cache
      and get confused.
      
      Fixes: edbe83ab ("md/raid5: allow the stripe_cache to grow and shrink.")
      Cc: stable@vger.kernel.org (4.1 - please delay until 2 weeks after release of 4.2)
      Signed-off-by: NNeilBrown <neilb@suse.com>
      2d5b569b
  8. 17 6月, 2015 2 次提交
    • Y
      md/raid5: per hash value and exclusive wait_for_stripe · e9e4c377
      Yuanhan Liu 提交于
      I noticed heavy spin lock contention at get_active_stripe() with fsmark
      multiple thread write workloads.
      
      Here is how this hot contention comes from. We have limited stripes, and
      it's a multiple thread write workload. Hence, those stripes will be taken
      soon, which puts later processes to sleep for waiting free stripes. When
      enough stripes(>= 1/4 total stripes) are released, all process are woken,
      trying to get the lock. But there is one only being able to get this lock
      for each hash lock, making other processes spinning out there for acquiring
      the lock.
      
      Thus, it's effectiveless to wakeup all processes and let them battle for
      a lock that permits one to access only each time. Instead, we could make
      it be a exclusive wake up: wake up one process only. That avoids the heavy
      spin lock contention naturally.
      
      To do the exclusive wake up, we've to split wait_for_stripe into multiple
      wait queues, to make it per hash value, just like the hash lock.
      
      Here are some test results I have got with this patch applied(all test run
      3 times):
      
      `fsmark.files_per_sec'
      =====================
      
      next-20150317                 this patch
      -------------------------     -------------------------
      metric_value     ±stddev      metric_value     ±stddev     change      testbox/benchmark/testcase-params
      -------------------------     -------------------------   --------     ------------------------------
            25.600     ±0.0              92.700     ±2.5          262.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
            25.600     ±0.0              77.800     ±0.6          203.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
            32.000     ±0.0              93.800     ±1.7          193.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
            32.000     ±0.0              81.233     ±1.7          153.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
            48.800     ±14.5             99.667     ±2.0          104.2%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
             6.400     ±0.0              12.800     ±0.0          100.0%     ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
            63.133     ±8.2              82.800     ±0.7           31.2%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
           245.067     ±0.7             306.567     ±7.9           25.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-f2fs-4M-30G-fsyncBeforeClose
            17.533     ±0.3              21.000     ±0.8           19.8%     ivb44/fsmark/1x-1t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose
           188.167     ±1.9             215.033     ±3.1           14.3%     ivb44/fsmark/1x-1t-4BRD_12G-RAID5-btrfs-4M-30G-NoSync
           254.500     ±1.8             290.733     ±2.4           14.2%     ivb44/fsmark/1x-1t-9BRD_6G-RAID5-btrfs-4M-30G-NoSync
      
      `time.system_time'
      =====================
      
      next-20150317                 this patch
      -------------------------    -------------------------
      metric_value     ±stddev     metric_value     ±stddev     change       testbox/benchmark/testcase-params
      -------------------------    -------------------------    --------     ------------------------------
          7235.603     ±1.2             185.163     ±1.9          -97.4%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
          7666.883     ±2.9             202.750     ±1.0          -97.4%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
         14567.893     ±0.7             421.230     ±0.4          -97.1%     ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
          3697.667     ±14.0            148.190     ±1.7          -96.0%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
          5572.867     ±3.8             310.717     ±1.4          -94.4%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
          5565.050     ±0.5             313.277     ±1.5          -94.4%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
          2420.707     ±17.1            171.043     ±2.7          -92.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
          3743.300     ±4.6             379.827     ±3.5          -89.9%     ivb44/fsmark/1x-64t-3HDD-RAID5-ext4-4M-40G-fsyncBeforeClose
          3308.687     ±6.3             363.050     ±2.0          -89.0%     ivb44/fsmark/1x-64t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose
      
      Where,
      
           1x: where 'x' means iterations or loop, corresponding to the 'L' option of fsmark
      
           1t, 64t: where 't' means thread
      
           4M: means the single file size, corresponding to the '-s' option of fsmark
           40G, 30G, 120G: means the total test size
      
           4BRD_12G: BRD is the ramdisk, where '4' means 4 ramdisk, and where '12G' means
                     the size of one ramdisk. So, it would be 48G in total. And we made a
                     raid on those ramdisk
      
      As you can see, though there are no much performance gain for hard disk
      workload, the system time is dropped heavily, up to 97%. And as expected,
      the performance increased a lot, up to 260%, for fast device(ram disk).
      
      v2: use bits instead of array to note down wait queue need to wake up.
      Signed-off-by: NYuanhan Liu <yuanhan.liu@linux.intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      e9e4c377
    • Y
      md/raid5: split wait_for_stripe and introduce wait_for_quiescent · b1b46486
      Yuanhan Liu 提交于
      I noticed heavy spin lock contention at get_active_stripe(), introduced
      at being wake up stage, where a bunch of processes try to re-hold the
      spin lock again.
      
      After giving some thoughts on this issue, I found the lock could be
      relieved(and even avoided) if we turn the wait_for_stripe to per
      waitqueue for each lock hash and make the wake up exclusive: wake up
      one process each time, which avoids the lock contention naturally.
      
      Before go hacking with wait_for_stripe, I found it actually has 2
      usages: for the array to enter or leave the quiescent state, and also
      to wait for an available stripe in each of the hash lists.
      
      So this patch splits the first usage off into a separate wait_queue,
      wait_for_quiescent, and the next patch will turn the second usage into
      one waitqueue for each hash value, and make it exclusive, to relieve
      the lock contention.
      
      v2: wake_up(wait_for_quiescent) when (active_stripes == 0)
          Commit log refactor suggestion from Neil.
      Signed-off-by: NYuanhan Liu <yuanhan.liu@linux.intel.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      b1b46486
  9. 28 5月, 2015 2 次提交
    • N
      md/raid5: be more selective about distributing flags across batch. · 1b956f7a
      NeilBrown 提交于
      When a batch of stripes is broken up, we keep some of the flags
      that were per-stripe, and copy other flags from the head to all
      others.
      
      This only happens while a stripe is being handled, so many of the
      flags are irrelevant.
      
      The "SYNC_FLAGS" (which I've renamed to make it clear there are
      several) and STRIPE_DEGRADED are set per-stripe and so need to be
      preserved.  STRIPE_INSYNC is the only flag that is set on the head
      that needs to be propagated to all others.
      
      For safety, add a WARN_ON if others are set, except:
       STRIPE_HANDLE - this is safe and per-stripe and we are going to set
            in several cases anyway
       STRIPE_INSYNC
       STRIPE_IO_STARTED - this is just a hint and doesn't hurt.
       STRIPE_ON_PLUG_LIST
       STRIPE_ON_RELEASE_LIST - It is a point pointless for a batched
                 stripe to be on one of these lists, but it can happen
                 as can be safely ignored.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      1b956f7a
    • N
      md/raid5: close race between STRIPE_BIT_DELAY and batching. · d0852df5
      NeilBrown 提交于
      When we add a write to a stripe we need to make sure the bitmap
      bit is set.  While doing that the stripe is not locked so it could
      be added to a batch after which further changes to STRIPE_BIT_DELAY
      and ->bm_seq are ineffective.
      
      So we need to hold off adding to a stripe until bitmap_startwrite has
      completed at least once, and we need to avoid further changes to
      STRIPE_BIT_DELAY once the stripe has been added to a batch.
      
      If a bitmap_startwrite() completes after the stripe was added to a
      batch, it will not have set the bit, only incremented a counter, so no
      extra delay of the stripe is needed.
      Reported-by: NShaohua Li <shli@kernel.org>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d0852df5
  10. 22 4月, 2015 10 次提交
    • N
      md/raid5: allow the stripe_cache to grow and shrink. · edbe83ab
      NeilBrown 提交于
      The default setting of 256 stripe_heads is probably
      much too small for many configurations.  So it is best to make it
      auto-configure.
      
      Shrinking the cache under memory pressure is easy.  The only
      interesting part here is that we put a fairly high cost
      ('seeks') on shrinking the cache as the cost is greater than
      just having to read more data, it reduces parallelism.
      
      Growing the cache on demand needs to be done carefully.  If we allow
      fast growth, that can upset memory balance as lots of dirty memory can
      quickly turn into lots of memory queued in the stripe_cache.
      It is important for the raid5 block device to appear congested to
      allow write-throttling to work.
      
      So we only add stripes slowly. We set a flag when an allocation
      fails because all stripes are in use, allocate at a convenient
      time when that flag is set, and don't allow it to be set again
      until at least one stripe_head has been released for re-use.
      
      This means that a spurt of requests will only cause one stripe_head
      to be allocated, but a steady stream of requests will slowly
      increase the cache size - until memory pressure puts it back again.
      
      It could take hours to reach a steady state.
      
      The value written to, and displayed in, stripe_cache_size is
      used as a minimum.  The cache can grow above this and shrink back
      down to it.  The actual size is not directly visible, though it can
      be deduced to some extent by watching stripe_cache_active.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      edbe83ab
    • N
      md/raid5: change ->inactive_blocked to a bit-flag. · 5423399a
      NeilBrown 提交于
      This allows us to easily add more (atomic) flags.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5423399a
    • M
      md/raid5: introduce configuration option rmw_level · d06f191f
      Markus Stockhausen 提交于
      Depending on the available coding we allow optimized rmw logic for write
      operations. To support easier testing this patch allows manual control
      of the rmw/rcw descision through the interface /sys/block/mdX/md/rmw_level.
      
      The configuration can handle three levels of control.
      
      rmw_level=0: Disable rmw for all RAID types. Hardware assisted P/Q
      calculation has no implementation path yet to factor in/out chunks of
      a syndrome. Enforcing this level can be benefical for slow CPUs with
      hardware syndrome support and fast SSDs.
      
      rmw_level=1: Estimate rmw IOs and rcw IOs. Execute rmw only if we will
      save IOs. This equals the "old" unpatched behaviour and will be the
      default.
      
      rmw_level=2: Execute rmw even if calculated IOs for rmw and rcw are
      equal. We might have higher CPU consumption because of calculating the
      parity twice but it can be benefical otherwise. E.g. RAID4 with fast
      dedicated parity disk/SSD. The option is implemented just to be
      forward-looking and will ONLY work with this patch!
      Signed-off-by: NMarkus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d06f191f
    • M
      md/raid5: activate raid6 rmw feature · 584acdd4
      Markus Stockhausen 提交于
      Glue it altogehter. The raid6 rmw path should work the same as the
      already existing raid5 logic. So emulate the prexor handling/flags
      and split functions as needed.
      
      1) Enable xor_syndrome() in the async layer.
      
      2) Split ops_run_prexor() into RAID4/5 and RAID6 logic. Xor the syndrome
      at the start of a rmw run as we did it before for the single parity.
      
      3) Take care of rmw run in ops_run_reconstruct6(). Again process only
      the changed pages to get syndrome back into sync.
      
      4) Enhance set_syndrome_sources() to fill NULL pages if we are in a rmw
      run. The lower layers will calculate start & end pages from that and
      call the xor_syndrome() correspondingly.
      
      5) Adapt the several places where we ignored Q handling up to now.
      
      Performance numbers for a single E5630 system with a mix of 10 7200k
      desktop/server disks. 300 seconds random write with 8 threads onto a
      3,2TB (10*400GB) RAID6 64K chunk without spare (group_thread_cnt=4)
      
      bsize   rmw_level=1   rmw_level=0   rmw_level=1   rmw_level=0
              skip_copy=1   skip_copy=1   skip_copy=0   skip_copy=0
         4K      115 KB/s      141 KB/s      165 KB/s      140 KB/s
         8K      225 KB/s      275 KB/s      324 KB/s      274 KB/s
        16K      434 KB/s      536 KB/s      640 KB/s      534 KB/s
        32K      751 KB/s    1,051 KB/s    1,234 KB/s    1,045 KB/s
        64K    1,339 KB/s    1,958 KB/s    2,282 KB/s    1,962 KB/s
       128K    2,673 KB/s    3,862 KB/s    4,113 KB/s    3,898 KB/s
       256K    7,685 KB/s    7,539 KB/s    7,557 KB/s    7,638 KB/s
       512K   19,556 KB/s   19,558 KB/s   19,652 KB/s   19,688 Kb/s
      Signed-off-by: NMarkus Stockhausen <stockhausen@collogia.de>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      584acdd4
    • S
      raid5: handle expansion/resync case with stripe batching · dabc4ec6
      shli@kernel.org 提交于
      expansion/resync can grab a stripe when the stripe is in batch list. Since all
      stripes in batch list must be in the same state, we can't allow some stripes
      run into expansion/resync. So we delay expansion/resync for stripe in batch
      list.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      dabc4ec6
    • S
      raid5: handle io error of batch list · 72ac7330
      shli@kernel.org 提交于
      If io error happens in any stripe of a batch list, the batch list will be
      split, then normal process will run for the stripes in the list.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      72ac7330
    • S
      RAID5: batch adjacent full stripe write · 59fc630b
      shli@kernel.org 提交于
      stripe cache is 4k size. Even adjacent full stripe writes are handled in 4k
      unit. Idealy we should use big size for adjacent full stripe writes. Bigger
      stripe cache size means less stripes runing in the state machine so can reduce
      cpu overhead. And also bigger size can cause bigger IO size dispatched to under
      layer disks.
      
      With below patch, we will automatically batch adjacent full stripe write
      together. Such stripes will be added to the batch list. Only the first stripe
      of the list will be put to handle_list and so run handle_stripe(). Some steps
      of handle_stripe() are extended to cover all stripes of the list, including
      ops_run_io, ops_run_biodrain and so on. With this patch, we have less stripes
      running in handle_stripe() and we send IO of whole stripe list together to
      increase IO size.
      
      Stripes added to a batch list have some limitations. A batch list can only
      include full stripe write and can't cross chunk boundary to make sure stripes
      have the same parity disks. Stripes in a batch list must be in the same state
      (no written, toread and so on). If a stripe is in a batch list, all new
      read/write to add_stripe_bio will be blocked to overlap conflict till the batch
      list is handled. The limitations will make sure stripes in a batch list be in
      exactly the same state in the life circly.
      
      I did test running 160k randwrite in a RAID5 array with 32k chunk size and 6
      PCIe SSD. This patch improves around 30% performance and IO size to under layer
      disk is exactly 32k. I also run a 4k randwrite test in the same array to make
      sure the performance isn't changed with the patch.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      59fc630b
    • S
      raid5: track overwrite disk count · 7a87f434
      shli@kernel.org 提交于
      Track overwrite disk count, so we can know if a stripe is a full stripe write.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      7a87f434
    • S
      raid5: add a new flag to track if a stripe can be batched · da41ba65
      shli@kernel.org 提交于
      A freshly new stripe with write request can be batched. Any time the stripe is
      handled or new read is queued, the flag will be cleared.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      da41ba65
    • S
      raid5: use flex_array for scribble data · 46d5b785
      shli@kernel.org 提交于
      Use flex_array for scribble data. Next patch will batch several stripes
      together, so scribble data should be able to cover several stripes, so this
      patch also allocates scribble data for stripes across a chunk.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      46d5b785
  11. 04 2月, 2015 1 次提交
    • N
      md: make ->congested robust against personality changes. · 5c675f83
      NeilBrown 提交于
      There is currently no locking around calls to the 'congested'
      bdi function.  If called at an awkward time while an array is
      being converted from one level (or personality) to another, there
      is a tiny chance of running code in an unreferenced module etc.
      
      So add a 'congested' function to the md_personality operations
      structure, and call it with appropriate locking from a central
      'mddev_congested'.
      
      When the array personality is changing the array will be 'suspended'
      so no IO is processed.
      If mddev_congested detects this, it simply reports that the
      array is congested, which is a safe guess.
      As mddev_suspend calls synchronize_rcu(), mddev_congested can
      avoid races by included the whole call inside an rcu_read_lock()
      region.
      This require that the congested functions for all subordinate devices
      can be run under rcu_lock.  Fortunately this is the case.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      5c675f83
  12. 14 10月, 2014 1 次提交
  13. 29 5月, 2014 1 次提交
    • S
      raid5: add an option to avoid copy data from bio to stripe cache · d592a996
      Shaohua Li 提交于
      The stripe cache has two goals:
      1. cache data, so next time if data can be found in stripe cache, disk access
      can be avoided.
      2. stable data. data is copied from bio to stripe cache and calculated parity.
      data written to disk is from stripe cache, so if upper layer changes bio data,
      data written to disk isn't impacted.
      
      In my environment, I can guarantee 2 will not happen. And BDI_CAP_STABLE_WRITES
      can guarantee 2 too. For 1, it's not common too. block plug mechanism will
      dispatch a bunch of sequentail small requests together. And since I'm using
      SSD, I'm using small chunk size. It's rare case stripe cache is really useful.
      
      So I'd like to avoid the copy from bio to stripe cache and it's very helpful
      for performance. In my 1M randwrite tests, avoid the copy can increase the
      performance more than 30%.
      
      Of course, this shouldn't be enabled by default. It's reported enabling
      BDI_CAP_STABLE_WRITES can harm some workloads before, so I added an option to
      control it.
      
      Neilb:
        changed BUG_ON to WARN_ON
        Removed some assignments from raid5_build_block which are now not needed.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      d592a996
  14. 19 11月, 2013 1 次提交
  15. 14 11月, 2013 1 次提交
    • S
      raid5: relieve lock contention in get_active_stripe() · 566c09c5
      Shaohua Li 提交于
      get_active_stripe() is the last place we have lock contention. It has two
      paths. One is stripe isn't found and new stripe is allocated, the other is
      stripe is found.
      
      The first path basically calls __find_stripe and init_stripe. It accesses
      conf->generation, conf->previous_raid_disks, conf->raid_disks,
      conf->prev_chunk_sectors, conf->chunk_sectors, conf->max_degraded,
      conf->prev_algo, conf->algorithm, the stripe_hashtbl and inactive_list. Except
      stripe_hashtbl and inactive_list, other fields are changed very rarely.
      
      With this patch, we split inactive_list and add new hash locks. Each free
      stripe belongs to a specific inactive list. Which inactive list is determined
      by stripe's lock_hash. Note, even a stripe hasn't a sector assigned, it has a
      lock_hash assigned. Stripe's inactive list is protected by a hash lock, which
      is determined by it's lock_hash too. The lock_hash is derivied from current
      stripe_hashtbl hash, which guarantees any stripe_hashtbl list will be assigned
      to a specific lock_hash, so we can use new hash lock to protect stripe_hashtbl
      list too. The goal of the new hash locks introduced is we can only use the new
      locks in the first path of get_active_stripe(). Since we have several hash
      locks, lock contention is relieved significantly.
      
      The first path of get_active_stripe() accesses other fields, since they are
      changed rarely, changing them now need take conf->device_lock and all hash
      locks. For a slow path, this isn't a problem.
      
      If we need lock device_lock and hash lock, we always lock hash lock first. The
      tricky part is release_stripe and friends. We need take device_lock first.
      Neil's suggestion is we put inactive stripes to a temporary list and readd it
      to inactive_list after device_lock is released. In this way, we add stripes to
      temporary list with device_lock hold and remove stripes from the list with hash
      lock hold. So we don't allow concurrent access to the temporary list, which
      means we need allocate temporary list for all participants of release_stripe.
      
      One downside is free stripes are maintained in their inactive list, they can't
      across between the lists. By default, we have total 256 stripes and 8 lists, so
      each list will have 32 stripes. It's possible one list has free stripe but
      other list hasn't. The chance should be rare because stripes allocation are
      even distributed. And we can always allocate more stripes for cache, several
      mega bytes memory isn't a big deal.
      
      This completely removes the lock contention of the first path of
      get_active_stripe(). It slows down the second code path a little bit though
      because we now need takes two locks, but since the hash lock isn't contended,
      the overhead should be quite small (several atomic instructions). The second
      path of get_active_stripe() (basically sequential write or big request size
      randwrite) still has lock contentions.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      566c09c5
  16. 14 10月, 2013 1 次提交
  17. 02 9月, 2013 1 次提交
    • S
      raid5: only wakeup necessary threads · bfc90cb0
      Shaohua Li 提交于
      If there are not enough stripes to handle, we'd better not always
      queue all available work_structs. If one worker can only handle small
      or even none stripes, it will impact request merge and create lock
      contention.
      
      With this patch, the number of work_struct running will depend on
      pending stripes number. Note: some statistics info used in the patch
      are accessed without locking protection. This should doesn't matter,
      we just try best to avoid queue unnecessary work_struct.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      bfc90cb0
  18. 28 8月, 2013 2 次提交
    • N
      md/raid5: use seqcount to protect access to shape in make_request. · c46501b2
      NeilBrown 提交于
      make_request() access various shape parameters (raid_disks, chunk_size
      etc) which might be changed by raid5_start_reshape().
      
      If the later is called at and awkward time during the form, the wrong
      stripe_head might be used.
      
      So introduce a 'seqcount' and after finding a stripe_head make sure
      there is no reason to expect that we got the wrong one.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      c46501b2
    • S
      raid5: offload stripe handle to workqueue · 851c30c9
      Shaohua Li 提交于
      This is another attempt to create multiple threads to handle raid5 stripes.
      This time I use workqueue.
      
      raid5 handles request (especially write) in stripe unit. A stripe is page size
      aligned/long and acrosses all disks. Writing to any disk sector, raid5 runs a
      state machine for the corresponding stripe, which includes reading some disks
      of the stripe, calculating parity, and writing some disks of the stripe. The
      state machine is running in raid5d thread currently. Since there is only one
      thread, it doesn't scale well for high speed storage. An obvious solution is
      multi-threading.
      
      To get better performance, we have some requirements:
      a. locality. stripe corresponding to request submitted from one cpu is better
      handled in thread in local cpu or local node. local cpu is preferred but some
      times could be a bottleneck, for example, parity calculation is too heavy.
      local node running has wide adaptability.
      b. configurablity. Different setup of raid5 array might need diffent
      configuration. Especially the thread number. More threads don't always mean
      better performance because of lock contentions.
      
      My original implementation is creating some kernel threads. There are
      interfaces to control which cpu's stripe each thread should handle. And
      userspace can set affinity of the threads. This provides biggest flexibility
      and configurability. But it's hard to use and apparently a new thread pool
      implementation is disfavor.
      
      Recent workqueue improvement is quite promising. unbound workqueue will be
      bound to numa node. If WQ_SYSFS is set in workqueue, there are sysfs option to
      do affinity setting. For example, we can only include one HT sibling in
      affinity. Since work is non-reentrant by default, and we can control running
      thread number by limiting dispatched work_struct number.
      
      In this patch, I created several stripe worker group. A group is a numa node.
      stripes from cpus of one node will be added to a group list. Workqueue thread
      of one node will only handle stripes of worker group of the node. In this way,
      stripe handling has numa node locality. And as I said, we can control thread
      number by limiting dispatched work_struct number.
      
      The work_struct callback function handles several stripes in one run. A typical
      work queue usage is to run one unit in each work_struct. In raid5 case, the
      unit is a stripe. But we can't do that:
      a. Though handling a stripe doesn't need lock because of reference accounting
      and stripe isn't in any list, queuing a work_struct for each stripe will make
      workqueue lock contended very heavily.
      b. blk_start_plug()/blk_finish_plug() should surround stripe handle, as we
      might dispatch request. If each work_struct only handles one stripe, such block
      plug is meaningless.
      
      This implementation can't do very fine grained configuration. But the numa
      binding is most popular usage model, should be enough for most workloads.
      
      Note: since we have only one stripe queue, switching to multi-thread might
      decrease request size dispatching down to low level layer. The impact depends
      on thread number, raid configuration and workload. So multi-thread raid5 might
      not be proper for all setups.
      
      Changes V1 -> V2:
      1. remove WQ_NON_REENTRANT
      2. disabling multi-threading by default
      3. Add more descriptions in changelog
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      851c30c9