1. 31 Mar 2018 (1 commit)
    • Btrfs: scrub: batch rebuild for raid56 · 6ca1765b
      Committed by Liu Bo
      For raid56, writes and rebuilds always use BTRFS_STRIPE_LEN (64K) as
      the unit, but scrub_extent() works in units of blocksize, so the
      rebuild process may be triggered for every block of the same stripe.
      
      A typical example: when we're replacing a disk that has disappeared,
      all reads from that disk get -EIO, and every block (4K, if blocksize
      is 4K) goes through the following path:
      
      scrub_handle_errored_block
        scrub_recheck_block # re-read pages one by one
        scrub_recheck_block # rebuild by calling raid56_parity_recover()
                              page by page
      
      Although most reads during rebuild can be avoided thanks to the raid56
      stripe cache, the parity recovery calculation (xor or the raid6
      algorithms) still has to be done (BTRFS_STRIPE_LEN / blocksize) times,
      i.e. 16 times for a 4K blocksize.
      
      Make this smarter by doing raid56 scrub/replace at stripe-length
      granularity (see the sketch after this entry).
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
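      For scale, a minimal sketch of the unit arithmetic. The constant value
      mirrors the kernel's BTRFS_STRIPE_LEN, but the helper below is
      illustrative and not taken from the patch:

              #define BTRFS_STRIPE_LEN (64 * 1024)    /* mirrors the kernel's SZ_64K */

              /*
               * Block-by-block rebuild runs the parity computation once per
               * block, i.e. BTRFS_STRIPE_LEN / blocksize times per stripe
               * (16 times for 4K blocks); batching at stripe length runs it
               * once.
               */
              static inline unsigned int rebuild_ops_per_stripe(unsigned int blocksize,
                                                                int batch_by_stripe)
              {
                      return batch_by_stripe ? 1 : BTRFS_STRIPE_LEN / blocksize;
              }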
  2. 26 Mar 2018 (3 commits)
  3. 22 Jan 2018 (8 commits)
  4. 02 Nov 2017 (1 commit)
    • btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents · c995ab3c
      Committed by Zygo Blaxell
      The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and
      offset (encoded as a single logical address) to a list of extent refs.
      LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping
      (extent ref -> extent bytenr and offset, or logical address).  These are
      useful capabilities for programs that manipulate extents and extent
      references from userspace (e.g. dedup and defrag utilities).
      
      When the extents are uncompressed (and not encrypted and not otherwise
      encoded),
      check_extent_in_eb performs filtering of the extent refs to remove any
      extent refs which do not contain the same extent offset as the 'logical'
      parameter's extent offset.  This prevents LOGICAL_INO from returning
      references to more than a single block.
      
      To find the set of extent references to an uncompressed extent from [a, b),
      userspace has to run a loop like this pseudocode:
      
      	for (i = a; i < b; ++i)
      		extent_ref_set += LOGICAL_INO(i);
      
      At each iteration of the loop (up to 32768 iterations for a 128M
      extent), the data we are interested in is collected in the kernel,
      then discarded by the filter in check_extent_in_eb.
      
      When the extents are compressed (or encrypted or other), the 'logical'
      parameter must be an extent bytenr (the 'a' parameter in the loop).
      No filtering by extent offset is done (or possible?) so the result is
      the complete set of extent refs for the entire extent.  This removes
      the need for the loop, since we get all the extent refs in one call.
      
      Add an 'ignore_offset' argument to iterate_inodes_from_logical,
      [...several levels of function call graph...], and check_extent_in_eb, so
      that we can disable the extent offset filtering for uncompressed extents.
      This flag can be set by an improved version of the LOGICAL_INO ioctl to
      get either behavior as desired.
      
      There is no functional change in this patch.  The new flag is always
      false.
      Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor coding style fixes ]
      Signed-off-by: David Sterba <dsterba@suse.com>
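      As a hedged illustration of the intended endpoint: the sketch below
      assumes the LOGICAL_INO_V2 ioctl and its
      BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET flag, which this patch only
      prepares for and which land in a follow-up, so treat the interface
      details as assumptions rather than part of this commit:

              #include <stdint.h>
              #include <sys/ioctl.h>
              #include <linux/btrfs.h>

              /*
               * One call replaces the per-block loop above: ask for every
               * ref to the extent containing 'logical', ignoring the
               * extent offset.
               */
              static int logical_ino_all_refs(int fd, __u64 logical,
                                              struct btrfs_data_container *inodes,
                                              __u64 inodes_size)
              {
                      struct btrfs_logical_ino_args args = {
                              .logical = logical,
                              .size    = inodes_size,
                              .flags   = BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET,
                              .inodes  = (__u64)(uintptr_t)inodes,
                      };

                      return ioctl(fd, BTRFS_IOC_LOGICAL_INO_V2, &args);
              }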
  5. 30 Oct 2017 (1 commit)
  6. 24 Aug 2017 (1 commit)
    • block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Committed by Christoph Hellwig
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different lifetime rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
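      In practice the conversion is mechanical; a minimal sketch, assuming
      the bio_set_dev() helper that this change introduces (the surrounding
      function is illustrative, not from the patch):

              #include <linux/bio.h>
              #include <linux/blkdev.h>

              /*
               * Instead of storing the block_device in bio->bi_bdev,
               * record the gendisk and the partition index on the bio.
               */
              static void start_io(struct bio *bio, struct block_device *bdev,
                                   sector_t sector)
              {
                      bio_set_dev(bio, bdev); /* sets bi_disk and bi_partno */
                      bio->bi_iter.bi_sector = sector;
                      submit_bio(bio);
              }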
  7. 21 Aug 2017 (4 commits)
  8. 18 Aug 2017 (1 commit)
  9. 16 Aug 2017 (2 commits)
  10. 30 Jun 2017 (2 commits)
  11. 20 Jun 2017 (9 commits)
  12. 09 Jun 2017 (1 commit)
  13. 18 Apr 2017 (6 commits)
    • btrfs: scrub: Fix RAID56 recovery race condition · 28d70e23
      Committed by Qu Wenruo
      When scrubbing a RAID5 filesystem with recoverable data corruption
      (only one data stripe corrupted), scrub sometimes reports more csum
      errors than expected, and sometimes even reports an unrecoverable
      error.
      
      The problem can be easily reproduced by the following steps:
      1) Create a btrfs with RAID5 data profile with 3 devs
      2) Mount it with nospace_cache or space_cache=v2
         To avoid extra data space usage.
      3) Create a 128K file and sync the fs, unmount it
         Now the 128K file lies at the beginning of the data chunk
      4) Locate the physical bytenr of data chunk on dev3
         Dev3 is the 1st data stripe.
      5) Corrupt the first 64K of the data chunk stripe on dev3
      6) Mount the fs and scrub it
      
      The correct csum error count is 16 (assuming x86_64 with 4K pages).
      A larger csum error count is reported with roughly 1/3 probability,
      and an unrecoverable error with roughly 1/10 probability.
      
      The root cause is a race condition in the RAID5/6 recovery code,
      stemming from the fact that a full scrub is initiated per device.
      
      For other mirror-based profiles, each mirror is independent of the
      others, so the race causes no real problem there.
      
      For example:
      |      Corrupted        |       Correct          |      Correct        |
      |   Scrub dev3 (D1)     |    Scrub dev2 (D2)     |   Scrub dev1 (P)    |
      ------------------------------------------------------------------------
      Read out D1             |Read out D2             |Read full stripe     |
      Check csum              |Check csum              |Check parity         |
      Csum mismatch           |Csum match, continue    |Parity mismatch      |
      handle_errored_block    |                        |handle_errored_block |
       Read out full stripe   |                        | Read out full stripe|
       D1 csum error(err++)   |                        | D1 csum error(err++)|
       Recover D1             |                        | Recover D1          |
      
      So D1's csum error is counted twice, simply because
      handle_errored_block() lacks sufficient protection and the race can
      happen.
      
      In an even worse case, when D1's recovery code is re-writing D1/D2/P
      while P's recovery code is reading out the full stripe, we can end up
      with an unrecoverable error.
      
      This patch uses the previously introduced lock_full_stripe() and
      unlock_full_stripe() to protect the whole of
      scrub_handle_errored_block() for RAID56 recovery, so there are no
      extra csum errors and no unrecoverable errors.
      Reported-by: Goffredo Baroncelli <kreijack@libero.it>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
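      A minimal sketch of the resulting locking pattern; the helper
      signatures follow the companion patch below, but the surrounding
      function is illustrative, not lifted from scrub.c:

              /* Serialize recovery for the full stripe containing 'logical'. */
              static int recover_raid56_block(struct btrfs_fs_info *fs_info,
                                              u64 logical)
              {
                      bool locked = false;
                      int ret;

                      ret = lock_full_stripe(fs_info, logical, &locked);
                      if (ret < 0)
                              return ret;

                      /* ... recheck csums, read the full stripe, rebuild ... */

                      unlock_full_stripe(fs_info, logical, locked);
                      return 0;
              }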
    • btrfs: scrub: Introduce full stripe lock for RAID56 · 0966a7b1
      Committed by Qu Wenruo
      Unlike mirror-based profiles, RAID5/6 recovery needs to read out the
      whole full stripe.
      
      Without proper protection, this can easily cause a race condition.
      
      Introduce two new functions for RAID5/6, lock_full_stripe() and
      unlock_full_stripe(), backed by an rb_tree of per-full-stripe mutexes,
      so scrub callers can lock a full stripe to avoid the race.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor comment adjustments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
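      A sketch of the data structure this implies; the field names mirror
      the patch, but take them as illustrative:

              /*
               * One lock per RAID5/6 full stripe, keyed by the logical
               * start of the full stripe and kept in a per-block-group
               * rb_tree.
               */
              struct full_stripe_lock {
                      struct rb_node node;    /* rb_tree linkage, sorted by logical */
                      u64 logical;            /* start of the full stripe */
                      u64 refs;               /* concurrent lockers; freed at zero */
                      struct mutex mutex;     /* serializes recovery on this stripe */
              };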
    • Btrfs: switch to div64_u64 if with a u64 divisor · 42c61ab6
      Committed by Liu Bo
      This fixes code where div_u64 is called with a u64 divisor: div_u64
      takes a u32 divisor, so passing a u64 silently truncates it (see the
      sketch after this entry).
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
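      A minimal sketch of the distinction, using the math64.h helpers (the
      function and its arguments are illustrative):

              #include <linux/math64.h>

              static u64 stripe_index(u64 offset, u64 stripe_len)
              {
                      /*
                       * Buggy with a u64 divisor: div_u64() takes a u32
                       * divisor, so stripe_len would be silently truncated:
                       *
                       *      return div_u64(offset, stripe_len);
                       *
                       * Correct: div64_u64() accepts a full u64 divisor.
                       */
                      return div64_u64(offset, stripe_len);
              }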
    • Btrfs: update scrub_parity to use u64 stripe_len · 972d7219
      Committed by Liu Bo
      Commit 3d8da678 ("Btrfs: fix divide error upon chunk's stripe_len")
      changed stripe_len in struct map_lookup to u64, but didn't update
      stripe_len in struct scrub_parity.
      
      This updates the type and switches to div64_u64_rem to match the u64
      divisor.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
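      For the remainder case, a sketch with assumed names (only the
      div64_u64_rem() call itself comes from math64.h):

              #include <linux/math64.h>

              /* Offset of 'logical' within its full stripe, with u64 length. */
              static u64 offset_in_full_stripe(u64 logical, u64 full_stripe_len)
              {
                      u64 rem;

                      div64_u64_rem(logical, full_stripe_len, &rem);
                      return rem;
              }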
    • btrfs: use clear_page where appropriate · 619a9742
      Committed by David Sterba
      There's a helper to clear a whole page, with arch-specific optimized
      code. The replaced cases do not seem to be in performance-critical
      code, but we might still gain a few percent.
      Signed-off-by: David Sterba <dsterba@suse.com>
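      The transformation is mechanical; a sketch (the wrapper is
      illustrative, assuming an already-mapped page):

              #include <linux/mm.h>

              static void zero_one_page(struct page *page)
              {
                      /* Was: memset(page_address(page), 0, PAGE_SIZE); */
                      clear_page(page_address(page));
              }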
    • btrfs: Prevent scrub recheck from racing with dev replace · e501bfe3
      Committed by Qu Wenruo
      scrub_setup_recheck_block() calls btrfs_map_sblock() and then accesses
      bbio without the protection of bio_counter.
      
      This can lead to a use-after-free if it races with a dev-replace
      cancel.
      
      Fix it by increasing bio_counter before calling btrfs_map_sblock() and
      decreasing it when the corresponding recovery is finished.
      
      Cc: Liu Bo <bo.li.liu@oracle.com>
      Reported-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
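      The pattern, sketched with the existing btrfs bio-counter helpers (the
      wrapper function and its error handling are illustrative):

              /* Hold the bio counter across the mapping and any use of bbio. */
              static int map_sblock_protected(struct btrfs_fs_info *fs_info,
                                              u64 logical, u64 *mapped_length,
                                              struct btrfs_bio **bbio)
              {
                      int ret;

                      btrfs_bio_counter_inc_blocked(fs_info);
                      ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
                                             logical, mapped_length, bbio);
                      if (ret)
                              btrfs_bio_counter_dec(fs_info);
                      /* On success, drop the counter once recovery finishes. */
                      return ret;
              }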