1. 21 Jul, 2016 1 commit
  2. 14 Jun, 2016 5 commits
  3. 08 Jun, 2016 3 commits
  4. 26 May, 2016 1 commit
  5. 10 May, 2016 2 commits
    • md: set MD_CHANGE_PENDING in an atomic region · 85ad1d13
      Committed by Guoqing Jiang
      Some code waits for a metadata update by:
      
      1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
      2. setting MD_CHANGE_PENDING and waking the management thread
      3. waiting for MD_CHANGE_PENDING to be cleared
      
      If the first two are done without locking, the code in md_update_sb()
      which checks if it needs to repeat might test if an update is needed
      before step 1, then clear MD_CHANGE_PENDING after step 2, resulting
      in the wait returning early.
      
      So make sure every place that sets MD_CHANGE_PENDING does so atomically;
      bit_clear_unless (suggested by Neil) is introduced for this purpose.
      
      Cc: Martin Kepplinger <martink@posteo.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: <linux-kernel@vger.kernel.org>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      85ad1d13
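The "clear unless" idea described in this commit can be sketched in user space with C11 atomics. This is a minimal illustration, not the kernel's bit_clear_unless: the function names, the flag values, and the use of `<stdatomic.h>` are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits, named after the flags in the commit message. */
enum {
    CHANGE_DEVS    = 1u << 0,
    CHANGE_CLEAN   = 1u << 1,
    CHANGE_PENDING = 1u << 2,
};

/*
 * Atomically clear the bits in `clear` unless any bit in `test` is set.
 * Returns true if the bits were cleared, false if a `test` bit blocked it.
 */
static bool clear_unless(atomic_uint *flags, unsigned clear, unsigned test)
{
    unsigned old;

    assert(flags != NULL);
    old = atomic_load(flags);
    do {
        if (old & test)
            return false;   /* an update was requested: keep PENDING set */
        /* On failure the CAS reloads `old`, so the test is re-evaluated. */
    } while (!atomic_compare_exchange_weak(flags, &old, old & ~clear));
    return true;
}

/* Set several bits in one atomic step (steps 1 and 2 of the message combined). */
static void set_mask(atomic_uint *flags, unsigned mask)
{
    atomic_fetch_or(flags, mask);
}
```

Because the waiter sets the "update needed" bit and the pending bit in a single atomic operation, the updater can never observe the pending bit without also seeing the request that accompanies it.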
    • md: raid5: add prerequisite to run underneath dm-raid · fe67d19a
      Committed by Heinz Mauelshagen
      In case md runs underneath the dm-raid target, the mddev does not have
      a request queue or gendisk, thus avoid accesses.
      
      This patch adds a missing conditional to the raid5 personality.
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      fe67d19a
  6. 30 Apr, 2016 1 commit
  7. 18 Mar, 2016 1 commit
  8. 10 Mar, 2016 2 commits
    • md/raid5: output stripe state for debug · fb3229d5
      Committed by Shaohua Li
      Neil recently fixed an obscure race in break_stripe_batch_list. Debugging
      would be much easier if we knew the stripe state, so this patch outputs it.
      Signed-off-by: Shaohua Li <shli@fb.com>
      fb3229d5
    • md/raid5: preserve STRIPE_PREREAD_ACTIVE in break_stripe_batch_list · 550da24f
      Committed by NeilBrown
      break_stripe_batch_list breaks up a batch and copies some flags from
      the batch head to the members, preserving others.
      
      It doesn't preserve or copy STRIPE_PREREAD_ACTIVE.  This is not
      normally a problem as STRIPE_PREREAD_ACTIVE is cleared when a
      stripe_head is added to a batch, and is not set on stripe_heads
      already in a batch.
      
      However there is no locking to ensure one thread doesn't set the flag
      after it has just been cleared in another.  This does occasionally happen.
      
      md/raid5 maintains a count of the number of stripe_heads with
      STRIPE_PREREAD_ACTIVE set: conf->preread_active_stripes.  When
      break_stripe_batch_list clears STRIPE_PREREAD_ACTIVE inadvertently
      this count can become incorrect and will never again return to zero.
      
      md/raid5 delays the handling of some stripe_heads until
      preread_active_stripes becomes zero.  So when the above-mentioned race
      happens, those stripe_heads become blocked and never progress,
      resulting in writes to the array hanging.
      
      So: change break_stripe_batch_list to preserve STRIPE_PREREAD_ACTIVE
      in the members of a batch.
      
      URL: https://bugzilla.kernel.org/show_bug.cgi?id=108741
      URL: https://bugzilla.redhat.com/show_bug.cgi?id=1258153
      URL: http://thread.gmane.org/5649C0E9.2030204@zoner.cz
      Reported-by: Martin Svec <martin.svec@zoner.cz> (and others)
      Tested-by: Tom Weber <linux@junkyard.4t2.com>
      Fixes: 1b956f7a ("md/raid5: be more selective about distributing flags across batch.")
      Cc: stable@vger.kernel.org (v4.1 and later)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      550da24f
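The preserve-versus-copy behaviour described in this commit can be sketched as a pair of bit masks. This is a simplified illustration under stated assumptions: the flag names and bit positions are hypothetical, and the real function handles more flags and updates the state atomically.

```c
#include <stdint.h>

/* Hypothetical state bits for the sketch. */
enum {
    ST_DEGRADED       = 1u << 0,  /* preserved on each member */
    ST_PREREAD_ACTIVE = 1u << 1,  /* preserved only after the fix */
    ST_INSYNC         = 1u << 2,  /* copied from the batch head */
    ST_BATCH_READY    = 1u << 3,  /* neither preserved nor copied */
};

/*
 * Compute a member's state when a batch is broken up: keep the member's own
 * bits that are in `preserve`, take the head's bits that are in `copy`,
 * and drop everything else.
 */
static uint32_t member_state_after_break(uint32_t member, uint32_t head,
                                         uint32_t preserve, uint32_t copy)
{
    return (member & preserve) | (head & copy);
}
```

With ST_PREREAD_ACTIVE missing from the preserve mask, a member that raced and had the flag set loses it silently, which is what leaves the counter of preread-active stripes permanently off by one.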
  9. 27 Feb, 2016 2 commits
    • RAID5: revert e9e4c377 to fix a livelock · 6ab2a4b8
      Committed by Shaohua Li
      Revert commit
      e9e4c377 ("md/raid5: per hash value and exclusive wait_for_stripe")
      
      The problem is that raid5_get_active_stripe waits on
      conf->wait_for_stripe[hash]. Assume hash is 0. My test releases stripes
      in this order:
      - release all stripes with hash 0
      - raid5_get_active_stripe still sleeps since active_stripes >
        max_nr_stripes * 3 / 4
      - release all stripes with hash other than 0. active_stripes becomes 0
      - raid5_get_active_stripe still sleeps, since nobody wakes up
        wait_for_stripe[0]
      The system livelocks. The problem is that active_stripes isn't a
      per-hash count. Reverting the patch makes the livelock go away.
      
      Cc: stable@vger.kernel.org (v4.2+)
      Cc: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Cc: NeilBrown <neilb@suse.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      6ab2a4b8
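The release order described in this commit can be replayed in a small single-threaded model. This is only a sketch: the constants, the simplified wake condition (proceed once active_stripes drops below 3/4 of the maximum), and all names are assumptions made for illustration, not the kernel's code.

```c
#include <stdbool.h>

#define NR_HASH     8   /* hypothetical number of hash buckets */
#define MAX_STRIPES 16  /* hypothetical max_nr_stripes */

struct sim {
    int  active;       /* global count, not per hash */
    bool waiter_done;  /* the waiter for hash 0 has made progress */
};

/* A woken waiter re-checks its condition; it only runs when woken. */
static void wake_waiter(struct sim *s)
{
    if (s->active < MAX_STRIPES * 3 / 4)
        s->waiter_done = true;
}

/* Release one stripe of bucket `hash` and wake the matching wait queue. */
static void release_stripe(struct sim *s, int hash, bool per_hash_wait)
{
    s->active--;
    /* Per-hash queues: only releases in bucket 0 wake the hash-0 waiter. */
    if (!per_hash_wait || hash == 0)
        wake_waiter(s);
}

/* Release bucket 0 first, then the rest; report whether the waiter ran. */
static bool waiter_makes_progress(bool per_hash_wait)
{
    struct sim s = { MAX_STRIPES, false };
    int per_bucket = MAX_STRIPES / NR_HASH;

    for (int hash = 0; hash < NR_HASH; hash++)
        for (int i = 0; i < per_bucket; i++)
            release_stripe(&s, hash, per_hash_wait);
    return s.waiter_done;
}
```

In this model the per-hash variant leaves the waiter asleep even though the global count has reached zero, while a single shared wait queue lets it proceed.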
    • RAID5: check_reshape() shouldn't call mddev_suspend · 27a353c0
      Committed by Shaohua Li
      check_reshape() is called from the raid5d thread. The raid5d thread
      shouldn't call mddev_suspend(), because mddev_suspend() waits for all IO
      to finish, but IO is handled in the raid5d thread, so we could easily
      deadlock here.
      
      This issue was introduced by
      738a2738 ("md/raid5: fix allocation of 'scribble' array.")
      
      Cc: stable@vger.kernel.org (v4.1+)
      Reported-and-tested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      27a353c0
  10. 26 Feb, 2016 1 commit
  11. 21 Jan, 2016 1 commit
  12. 06 Jan, 2016 2 commits
  13. 01 Nov, 2015 10 commits
  14. 31 Oct, 2015 1 commit
    • md/raid5: fix locking in handle_stripe_clean_event() · b8a9d66d
      Committed by Roman Gushchin
      After commit 566c09c5 ("raid5: relieve lock contention in get_active_stripe()")
      __find_stripe() is called under conf->hash_locks + hash.
      But handle_stripe_clean_event() calls remove_hash() under
      conf->device_lock.
      
      Under some circumstances the hash chain can become circular,
      and we get an infinite loop with interrupts disabled and the hash
      lock held in __find_stripe(). This leads to a hard lockup on multiple
      CPUs, followed by a system crash.
      
      I was able to reproduce this behavior on raid6 over 6 ssd disks.
      The devices_handle_discard_safely option should be set to enable trim
      support. The following script was used:
      
      for i in `seq 1 32`; do
          dd if=/dev/zero of=large$i bs=10M count=100 &
      done
      
      neilb: original was against a 3.x kernel.  I forward-ported
        to 4.3-rc.  This version is suitable for any kernel since
        Commit: 59fc630b ("RAID5: batch adjacent full stripe write")
        (v4.1+).  I'll post a version for earlier kernels to stable.
      Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
      Fixes: 566c09c5 ("raid5: relieve lock contention in get_active_stripe()")
      Signed-off-by: NeilBrown <neilb@suse.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <stable@vger.kernel.org> # 3.13 - 4.2
      b8a9d66d
  15. 24 Oct, 2015 4 commits
    • raid5: log reclaim support · 0576b1c6
      Committed by Shaohua Li
      This is the reclaim support for the raid5 log. A stripe write has the
      following steps:
      
      1. reconstruct the stripe: read data/calculate parity. ops_run_io
      prepares to write data/parity to the raid disks
      2. hijack ops_run_io: stripe data/parity is appended to the log disk
      3. flush the log disk cache
      4. ops_run_io runs again and does the normal operation: stripe
      data/parity is written to the raid array disks. The raid core can
      return IO to the upper layer.
      5. flush the cache of all raid array disks
      6. update the superblock
      7. the log disk space used by the stripe can be reused
      
      In practice, several stripes make up an io_unit and we batch
      several io_units in different steps, but the whole process doesn't
      change.
      
      It's possible to return IO just after data/parity hits the log disk, but
      then read IO would need to read from the log disk. For simplicity, IO
      return happens at step 4, where read IO can read directly from the raid
      disks.
      
      Currently reclaim runs if there is a specific amount of reclaimable
      space (1/4 disk size or 10G) or we are out of space. Reclaim only frees
      log disk space; it doesn't impact data consistency. The size-based
      forced reclaim makes sure the log isn't too big, so recovery doesn't
      have to scan too much of the log.
      
      Recovery makes sure the raid disks and the log disk have the same data
      for a stripe. If a crash happens before step 4, recovery might or might
      not recover the stripe's data/parity, depending on whether the
      data/parity and its checksum match. In either case, this doesn't change
      the semantics of an IO write. After step 3, the stripe is guaranteed
      recoverable, because the stripe's data/parity is persistent on the log
      disk. In some cases, the log disk content and the raid disk content of a
      stripe are the same, but recovery will still copy the log disk content
      to the raid disks; this doesn't impact data consistency. Space reuse
      happens after the superblock update and cache flush.
      
      There is one situation we want to avoid. Broken metadata in the middle
      of the log prevents recovery from finding the metadata at the head of
      the log. If an operation requires the metadata at the head to be
      persistent in the log, we must make sure the metadata before it is
      persistent in the log too. The case is that stripe data/parity is in the
      log and we start writing the stripe to the raid disks (before step 4).
      Stripe data/parity must be persistent in the log before we do the write
      to the raid disks. The solution is to strictly maintain io_unit list
      order. In this case, we only write the stripes of an io_unit to the raid
      disks once the io_unit is the first one whose data/parity is in the log.
      
      The io_unit list order is important for other cases too. For example,
      some io_units are reclaimable and others are not. They can be mixed in
      the list, and we shouldn't reuse the space of an unreclaimable io_unit.
      
      Includes fixes to problems which were...
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      0576b1c6
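The reclaim trigger described in this commit (a size threshold of 1/4 of the log or 10G, or running out of space) might be sketched as below. The message does not say how the two sizes combine; taking the smaller of the two is an assumption here, as are the function names and the 512-byte sector unit.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9
/* 10 GiB expressed in 512-byte sectors (assumed unit for the sketch). */
#define RECLAIM_CAP_SECTORS ((uint64_t)10 << (30 - SECTOR_SHIFT))

/* Reclaimable space that triggers reclaim: min(log size / 4, 10 GiB). */
static uint64_t reclaim_target(uint64_t log_sectors)
{
    uint64_t quarter = log_sectors / 4;

    return quarter < RECLAIM_CAP_SECTORS ? quarter : RECLAIM_CAP_SECTORS;
}

/* Reclaim runs when enough space is reclaimable or the log is out of space. */
static bool should_reclaim(uint64_t log_sectors, uint64_t reclaimable_sectors,
                           bool out_of_space)
{
    return out_of_space || reclaimable_sectors >= reclaim_target(log_sectors);
}
```

Capping the threshold keeps a very large log device from accumulating so much unreclaimed data that recovery has to scan a long stretch of the log.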
    • raid5: add basic stripe log · f6bed0ef
      Committed by Shaohua Li
      This introduces a simple log for raid5. Data/parity destined for the
      raid array is first written to the log, then written to the raid array
      disks. If a crash happens, we can recover data from the log. This can
      speed up raid resync and fix the write hole issue.
      
      The log structure is pretty simple. Data/metadata is stored in block
      units, generally 4k. There is only one type of metadata block.
      The metadata block can track 3 types of data: stripe data, stripe
      parity and flush block. The MD superblock points to the last valid
      metadata block. Each metadata block has a checksum/seq number, so
      recovery can scan the log correctly. We store a checksum of the stripe
      data/parity in the metadata block, so metadata and stripe data/parity
      can be written to the log disk together. Otherwise, the metadata write
      would have to wait until the stripe data/parity write finished.
      
      For stripe data, the metadata block records the stripe data sector and
      size. Currently the size is always 4k. This metadata record could be
      simpler if we only fixed the write hole (e.g., we could record the data
      of a stripe's different disks together), but this format can be extended
      to support caching in the future, which must record data address/size.
      
      For stripe parity, the metadata block records the stripe sector. Its
      size should be 4k (for raid5) or 8k (for raid6). We always store p
      parity first. This format should work for caching too.
      
      A flush block indicates that a stripe is on the raid array disks. Fixing
      the write hole doesn't need this type of metadata; it's for the caching
      extension.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      f6bed0ef
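The three record types described in this commit could be modelled roughly as follows. This is an illustrative in-memory sketch only: the struct layout, field names, and helper functions are hypothetical and do not reproduce the on-disk format defined by the patch.

```c
#include <stdint.h>

#define LOG_BLOCK_SECTORS 8u  /* one 4k block = 8 sectors of 512 bytes */

/* The three kinds of data a metadata block can track. */
enum payload_type { PAYLOAD_DATA, PAYLOAD_PARITY, PAYLOAD_FLUSH };

/* Hypothetical record for one tracked item. */
struct payload {
    enum payload_type type;
    uint32_t size_sectors;  /* 4k for data; 4k or 8k for parity */
    uint64_t location;      /* data sector or stripe sector */
    uint32_t checksum[2];   /* checksum of each 4k block, p parity first */
};

/* Stripe data: records the data sector; the size is always one 4k block. */
static struct payload data_payload(uint64_t data_sector, uint32_t csum)
{
    struct payload p = { PAYLOAD_DATA, LOG_BLOCK_SECTORS, data_sector,
                         { csum, 0 } };
    return p;
}

/* Stripe parity: one block for raid5 (p), two for raid6 (p then q). */
static struct payload parity_payload(uint64_t stripe_sector, int is_raid6,
                                     uint32_t p_csum, uint32_t q_csum)
{
    struct payload p = { PAYLOAD_PARITY,
                         is_raid6 ? 2 * LOG_BLOCK_SECTORS : LOG_BLOCK_SECTORS,
                         stripe_sector,
                         { p_csum, is_raid6 ? q_csum : 0 } };
    return p;
}
```

Storing the data/parity checksums inside the record is what lets the metadata block and the blocks it describes be written to the log together.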
    • raid5: add a new state for stripe log handling · b70abcb2
      Committed by Shaohua Li
      When a stripe finishes construction, we normally write the stripe to
      raid in ops_run_io. With the log, we do a bunch of other operations
      before the stripe is written to raid, mainly writing the stripe to the
      log disk, flushing the disk cache and so on. The operations are still
      driven by raid5d and run in the stripe state machine. We introduce a new
      state for such a stripe (trapped into log). The stripe is in this state
      from the time it first enters ops_run_io (finishing construction) to the
      time it is written to raid. Since we know the state is only for the log,
      we bypass the other checks/operations in handle_stripe.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      b70abcb2
    • raid5: export some functions · 6d036f7d
      Committed by Shaohua Li
      The next several patches use some raid5 functions, so rename them with
      a raid5 prefix and export them.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      6d036f7d
  16. 12 Oct, 2015 1 commit
    • md-cluster: Use a small window for resync · c40f341f
      Committed by Goldwyn Rodrigues
      Suspending the entire device for resync could take too long. Resync
      in small chunks.
      
      The cluster's resync window (32M) is maintained in r1conf as
      cluster_sync_low and cluster_sync_high and processed in
      raid1's sync_request(). If the current resync is outside the cluster
      resync window:
      
      1. Set the cluster_sync_low to curr_resync_completed.
      2. Check if the sync will fit in the new window; if not, issue a
         wait_barrier() and set cluster_sync_low to sector_nr.
      3. Set cluster_sync_high to cluster_sync_low + resync_window.
      4. Send a message to all nodes so they can add it to their suspension
         list.
      
      bitmap_cond_end_sync is modified to allow forcing a sync in order
      to bring curr_resync_completed up to date with the sector passed.
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      c40f341f
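The four window-update steps listed in this commit can be sketched as a small pure function. This is a reading of the commit message under stated assumptions: the struct, the function name, the "outside the window" test, and the 32 MiB window expressed in 512-byte sectors are all hypothetical, and the real code lives in raid1's sync_request().

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* 32 MiB in 512-byte sectors (assumed meaning of "32M"). */
#define RESYNC_WINDOW_SECTORS ((sector_t)(32u << 20) >> 9)

struct resync_window {
    sector_t low;   /* cluster_sync_low */
    sector_t high;  /* cluster_sync_high */
};

/*
 * Returns true if the window moved, in which case the caller would send the
 * new range to the other nodes (step 4). *need_barrier is set when the
 * request did not fit in the window that starts at curr_resync_completed.
 */
static bool advance_window(struct resync_window *w, sector_t sector_nr,
                           sector_t nr_sectors, sector_t curr_resync_completed,
                           bool *need_barrier)
{
    *need_barrier = false;
    if (sector_nr >= w->low && sector_nr + nr_sectors <= w->high)
        return false;                        /* still inside the window */

    w->low = curr_resync_completed;          /* step 1 */
    if (sector_nr + nr_sectors > w->low + RESYNC_WINDOW_SECTORS) {
        *need_barrier = true;                /* step 2: wait_barrier() */
        w->low = sector_nr;
    }
    w->high = w->low + RESYNC_WINDOW_SECTORS; /* step 3 */
    return true;
}
```

Only the small range between low and high needs to be suspended on the other nodes, rather than the whole device.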
  17. 02 Oct, 2015 2 commits