1. 01 Sep 2016, 1 commit
    • raid5-cache: fix a deadlock in superblock write · 8e018c21
      Shaohua Li committed
      There is a potential deadlock in the superblock write. Discard could
      zero data, so before a discard we must make sure the superblock is
      updated to the new log tail. Updating the superblock (either by
      calling md_update_sb() directly or by depending on the md thread)
      must hold the reconfig mutex. On the other hand, raid5_quiesce() is
      called with reconfig_mutex held. The first step of raid5_quiesce()
      is waiting for all IO to finish, hence waiting for the reclaim
      thread, while the reclaim thread is calling this function and
      waiting for the reconfig mutex. So there is a deadlock. We work
      around this issue with a trylock, as sketched below. The downside of
      this solution is that we could miss a discard if we can't take the
      reconfig mutex. But this should happen rarely (mainly at raid array
      stop), so a missed discard shouldn't be a big problem.
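      
      A minimal sketch of the workaround, condensed from the reclaim path
      (r5l_discard_log_space() is a hypothetical stand-in for the actual
      discard step; mddev_trylock()/mddev_unlock() and md_update_sb() are
      the usual md helpers):
      
          static void r5l_write_super_and_discard_space(struct r5l_log *log,
                                                        sector_t end)
          {
                  struct mddev *mddev = log->rdev->mddev;
      
                  r5l_write_super(log, end);
      
                  /*
                   * Taking reconfig_mutex unconditionally here can deadlock
                   * against raid5_quiesce(), which holds the mutex while
                   * waiting for the reclaim thread. If we can't get it,
                   * skip the discard: we only miss a space-reuse
                   * opportunity, never data.
                   */
                  if (!mddev_trylock(mddev))
                          return;
      
                  md_update_sb(mddev, 1); /* superblock records new log tail */
                  mddev_unlock(mddev);
      
                  r5l_discard_log_space(log, end); /* hypothetical discard step */
          }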
      
      Cc: NeilBrown <neilb@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  2. 08 Aug 2016, 1 commit
    • block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe committed
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokenness linger,
      rename the member to force old and out-of-tree code to break
      at compile time instead of at runtime.
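      
      For illustration, a sketch of how callers were expected to touch the
      renamed field around the time of this change (bio_set_op_attrs() and
      bio_op() were the accessors of that era; the surrounding function is
      invented):
      
          static void submit_sync_write(struct bio *bio)
          {
                  /*
                   * Old code such as "bio->bi_rw = WRITE | REQ_SYNC;" now
                   * fails to compile, which is the point of the rename.
                   */
                  bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_SYNC);
      
                  WARN_ON(bio_op(bio) != REQ_OP_WRITE); /* op lives in bi_opf */
                  submit_bio(bio);
          }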
      
      No intended functional changes in this commit.
      Signed-off-by: Jens Axboe <axboe@fb.com>
  3. 08 Jun 2016, 3 commits
  4. 10 May 2016, 1 commit
    • md: set MD_CHANGE_PENDING in a atomic region · 85ad1d13
      Guoqing Jiang committed
      Some code waits for a metadata update by:
      
      1. flagging that it is needed (MD_CHANGE_DEVS or MD_CHANGE_CLEAN)
      2. setting MD_CHANGE_PENDING and waking the management thread
      3. waiting for MD_CHANGE_PENDING to be cleared
      
      If the first two steps are done without locking, the code in
      md_update_sb() which checks whether it needs to repeat might test
      if an update is needed before step 1, then clear MD_CHANGE_PENDING
      after step 2, resulting in the wait returning early.
      
      So make sure all places that set MD_CHANGE_PENDING do so
      atomically, and bit_clear_unless (suggested by Neil) is introduced
      for that purpose.
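      
      A sketch of the two sides after this change; set_mask_bits() and
      bit_clear_unless() are the <linux/bitops.h> helpers, the wrapper
      function is invented:
      
          /* Requester: steps 1 and 2 as a single atomic bit operation. */
          static void request_metadata_update(struct mddev *mddev)
          {
                  set_mask_bits(&mddev->flags, 0,
                                BIT(MD_CHANGE_CLEAN) | BIT(MD_CHANGE_PENDING));
                  md_wakeup_thread(mddev->thread);
                  wait_event(mddev->sb_wait,
                             !test_bit(MD_CHANGE_PENDING, &mddev->flags));
          }
      
          /* In md_update_sb(): clear PENDING only if no new request was
           * flagged meanwhile; otherwise write the superblock again. */
          if (!bit_clear_unless(&mddev->flags, BIT(MD_CHANGE_PENDING),
                                BIT(MD_CHANGE_DEVS) | BIT(MD_CHANGE_CLEAN)))
                  goto repeat;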
      
      Cc: Martin Kepplinger <martink@posteo.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: <linux-kernel@vger.kernel.org>
      Reviewed-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
  5. 14 Apr 2016, 1 commit
  6. 14 Jan 2016, 2 commits
  7. 06 Jan 2016, 6 commits
  8. 01 Nov 2015, 22 commits
  9. 24 Oct 2015, 3 commits
    • raid5: log recovery · 355810d1
      Shaohua Li committed
      This is the log recovery support. The process is quite
      straightforward: we scan the log and read all valid
      meta/data/parity into memory. If a stripe's data/parity checksum is
      correct, the stripe is recovered; otherwise it is discarded and we
      don't scan the log any further. The reclaim process guarantees that
      a stripe which has started to be flushed to the raid disks has
      complete data/parity with a correct checksum in the log. To recover
      a stripe, we just copy its data/parity to the corresponding raid
      disks.
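      
      A sketch of the scan loop under these rules (helper names are
      illustrative, not the exact driver symbols):
      
          while (r5l_read_meta_block(log, ctx) == 0) {
                  /* meta block has valid magic/checksum/seq */
                  if (r5l_recovery_flush_one_meta(log, ctx))
                          break; /* bad data/parity checksum: stop scanning */
                  ctx->seq++;
                  ctx->pos = r5l_ring_add(log, ctx->pos,
                                          ctx->meta_total_blocks);
          }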
      
      The tricky part is the superblock update after recovery. We can't
      let the superblock point to the last valid meta block. The log
      might look like:
      
      | meta 1 | meta 2 | meta 3 |
      
      where meta 1 is valid, meta 2 is invalid, and meta 3 could be
      valid. If the superblock points to meta 1, we write a new valid
      meta 2n. If a crash happens again, the new recovery will start from
      meta 1; since meta 2n is valid, recovery will think meta 3 is valid
      too, which is wrong. The solution is to create a new meta block in
      meta 2's place with its seq == meta 1's seq + 10 and let the
      superblock point to that block. Recovery will then not treat meta 3
      as a valid meta block, because its seq is wrong.
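      
      A sketch of that fix-up at the end of recovery (helper names follow
      the style of the driver but are illustrative here):
      
          if (ctx.seq > log->last_cp_seq + 1) {
                  /* ctx.pos is where the first invalid block (meta 2)
                   * sat: seal the log with an empty meta block whose seq
                   * jumps ahead, so a stale meta 3 can never look like
                   * the continuation of the new log. */
                  r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
                  r5l_write_super(log, ctx.pos); /* superblock -> new meta 2 */
          }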
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • raid5: log reclaim support · 0576b1c6
      Shaohua Li committed
      This is the reclaim support for the raid5 log. A stripe write goes
      through the following steps (condensed as code after the list):
      
      1. reconstruct the stripe: read data and calculate parity.
         ops_run_io prepares to write data/parity to the raid disks
      2. hijack ops_run_io: stripe data/parity is appended to the log
         disk instead
      3. flush the log disk cache
      4. ops_run_io runs again and performs the normal operation: stripe
         data/parity is written to the raid array disks. The raid core
         can return the io to the upper layer.
      5. flush the caches of all raid array disks
      6. update the superblock
      7. the log disk space used by the stripe can be reused
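      
      The condensed sketch (step numbers in comments; function names are
      illustrative, not the exact driver symbols):
      
          static void stripe_write_through_log(struct r5l_log *log,
                                               struct stripe_head *sh)
          {
                  r5l_append_stripe(log, sh);   /* steps 1-2: to log disk */
                  r5l_flush_log_cache(log);     /* step 3 */
                  write_stripe_to_raid(sh);     /* step 4: normal ops_run_io */
                  /* io can be returned to the upper layer from here on */
                  flush_member_disk_caches(sh); /* step 5 */
                  r5l_write_super(log, next_log_tail(log)); /* step 6 */
                  r5l_free_log_space(log, sh);  /* step 7 */
          }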
      
      In practice, an io_unit consists of several stripes and we batch
      several io_units at each step, but the overall process doesn't
      change.
      
      It's possible to complete the io as soon as data/parity hits the
      log disk, but then a read IO would need to read from the log disk.
      For simplicity, IO completion happens at step 4, where read IO can
      read directly from the raid disks.
      
      Currently reclaim runs when there is a certain amount of
      reclaimable space (1/4 of the disk size or 10G) or when we are out
      of space. Reclaim only frees log disk space; it doesn't impact data
      consistency. The size-based forced reclaim keeps the log from
      growing too big, so recovery doesn't have to scan too much of it.
      
      Recovery makes sure the raid disks and the log disk hold the same
      data for a stripe. If a crash happens before step 4, recovery might
      or might not recover the stripe's data/parity, depending on whether
      the data/parity and its checksum match. In either case, this
      doesn't change the semantics of an IO write. After step 3, a stripe
      is guaranteed recoverable, because its data/parity is persistent in
      the log disk. In some cases the log disk content and the raid
      disks' content for a stripe are the same, but recovery will still
      copy the log disk content to the raid disks; this doesn't impact
      data consistency either. Space reuse happens only after the
      superblock update and the cache flush.
      
      There is one situation we want to avoid: a broken meta block in the
      middle of the log means recovery can't find the meta blocks at the
      head of the log. So if an operation requires a meta block at the
      head to be persistent in the log, we must make sure every meta
      block before it is persistent in the log too. This case arises when
      a stripe's data/parity is in the log and we start writing the
      stripe to the raid disks (before step 4): the stripe's data/parity
      must be persistent in the log before we write to the raid disks.
      The solution is that we strictly maintain the io_unit list order.
      In this case, we only write the stripes of an io_unit to the raid
      disks once that io_unit is the first one whose data/parity is in
      the log.
      
      The io_unit list order is important for other cases too. For
      example, some io_units are reclaimable and others are not; they can
      be mixed in the list, and we shouldn't reuse the space of an
      unreclaimable io_unit. A sketch of walking the ordered list
      follows.
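      
      Illustrative types, list head, and state test; the real driver's
      names may differ:
      
          struct r5l_io_unit *io, *next;
      
          list_for_each_entry_safe(io, next, &log->running_ios,
                                   log_sibling) {
                  if (!io_unit_in_log(io))
                          break; /* not yet persistent in the log: this
                                  * one and everything after it must wait */
                  write_io_unit_stripes_to_raid(io);
          }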
      
      Includes fixes to problems which were...
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
    • raid5: add basic stripe log · f6bed0ef
      Shaohua Li committed
      This introduces a simple log for raid5. Data/parity written to the
      raid array goes to the log first, then to the raid array disks. If
      a crash happens, we can recover the data from the log. This can
      speed up raid resync and fix the write hole issue.
      
      The log structure is pretty simple. Data and metadata are stored in
      block units, generally 4k. There is only one type of metadata
      block, which can track three kinds of payload: stripe data, stripe
      parity, and flush blocks. The MD superblock points to the last
      valid metadata block. Each metadata block carries a checksum and a
      sequence number, so recovery can scan the log correctly. We store a
      checksum of the stripe data/parity in the metadata block, so
      metadata and stripe data/parity can be written to the log disk
      together; otherwise the metadata write would have to wait until the
      stripe data/parity write finished.
      
      For stripe data, the metadata block records the stripe data sector
      and size. Currently the size is always 4k. This metadata record
      could be made simpler if we only had to fix the write hole (e.g. we
      could record the data of a stripe's different disks together), but
      this format can be extended to support caching in the future, which
      must record data address and size.
      
      For stripe parity, the metadata block records the stripe sector.
      Its size should be 4k (for raid5) or 8k (for raid6). We always
      store the P parity first. This format should work for caching too;
      a sketch of the on-disk records follows.
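      
      The layout below follows the description above; take field names
      and exact widths as illustrative:
      
          struct r5l_meta_block {
                  __le32 magic;
                  __le32 checksum;     /* of this 4k block */
                  __u8   version;
                  __u8   __zero_pad;
                  __le16 meta_size;
                  __le64 seq;          /* lets recovery order the blocks */
                  __le64 position;     /* sector of this block in the log */
                  /* followed by payload records: data, parity or flush */
          } __packed;
      
          struct r5l_payload_data_parity {
                  __le16 type;         /* data or parity */
                  __le16 flags;
                  __le32 size;         /* sectors: 4k data, 4k/8k parity */
                  __le64 location;     /* data: raid sector;
                                          parity: stripe sector */
                  __le32 checksum[];   /* per page, so data/parity can be
                                          written out together with meta */
          } __packed;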
      
      A flush block indicates that a stripe is in the raid array disks.
      Fixing the write hole doesn't need this type of metadata; it is for
      the caching extension.
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: NeilBrown <neilb@suse.com>