1. 18 April 2017, 14 commits
    • btrfs: scrub: Fix RAID56 recovery race condition · 28d70e23
      Qu Wenruo authored
      When scrubbing a RAID5 filesystem with recoverable data corruption
      (only one data stripe corrupted), scrub sometimes reports more csum
      errors than expected, and sometimes even reports an unrecoverable
      error.
      
      The problem can be easily reproduced by the following steps:
      1) Create a btrfs with RAID5 data profile with 3 devs
      2) Mount it with nospace_cache or space_cache=v2
         To avoid extra data space usage.
      3) Create a 128K file and sync the fs, unmount it
         Now the 128K file lies at the beginning of the data chunk
      4) Locate the physical bytenr of data chunk on dev3
         Dev3 is the 1st data stripe.
      5) Corrupt the first 64K of the data chunk stripe on dev3
      6) Mount the fs and scrub it
      
      The correct csum error count should be 16 (assuming 4K pages, as on
      x86_64). A larger csum error count is reported with roughly a 1/3
      chance, and an unrecoverable error with roughly a 1/10 chance.
      
      The root cause is a race condition in the RAID5/6 recovery code: a
      full scrub is initiated per device, so several per-device scrub
      workers can try to recover the same full stripe at the same time.
      
      For the other, mirror-based profiles each mirror is independent of
      the others, so the same race does not cause any real problem.
      
      For example:
              Corrupted       |       Correct          |      Correct        |
      |   Scrub dev3 (D1)     |    Scrub dev2 (D2)     |    Scrub dev1(P)    |
      ------------------------------------------------------------------------
      Read out D1             |Read out D2             |Read full stripe     |
      Check csum              |Check csum              |Check parity         |
      Csum mismatch           |Csum match, continue    |Parity mismatch      |
      handle_errored_block    |                        |handle_errored_block |
       Read out full stripe   |                        | Read out full stripe|
       D1 csum error(err++)   |                        | D1 csum error(err++)|
       Recover D1             |                        | Recover D1          |
      
      So D1's csum error is accounted twice, simply because
      handle_errored_block() does not have enough protection and the race
      can happen.
      
      In an even worse case, D1's recovery code may be re-writing D1/D2/P
      while P's recovery code is still reading out the full stripe, which
      can cause an unrecoverable error.
      
      This patch uses the previously introduced lock_full_stripe() and
      unlock_full_stripe() to protect the whole scrub_handle_errored_block()
      function for RAID56 recovery, so neither extra csum errors nor
      unrecoverable errors are reported (see the sketch after this entry).
      Reported-by: Goffredo Baroncelli <kreijack@libero.it>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      28d70e23
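
      For illustration only, here is a minimal userspace model of that
      serialization (not the kernel patch; the pthread mutex merely stands
      in for lock_full_stripe()/unlock_full_stripe()): the whole
      check-and-recover step for one full stripe runs under a single
      per-stripe lock, so the csum error for D1 is accounted exactly once
      even when two per-device scrub workers hit the same stripe.

      #include <pthread.h>
      #include <stdio.h>

      /* One full stripe with a bad D1 and a shared csum-error counter. */
      static pthread_mutex_t full_stripe_lock = PTHREAD_MUTEX_INITIALIZER;
      static int d1_is_bad = 1;
      static int csum_errors;

      static void *scrub_one_device(void *name)
      {
              /* stands in for scrub_handle_errored_block() on this stripe */
              pthread_mutex_lock(&full_stripe_lock);
              if (d1_is_bad) {
                      csum_errors++;          /* error accounted once...      */
                      d1_is_bad = 0;          /* ...then D1 rebuilt from D2+P */
                      printf("%s: recovered D1\n", (const char *)name);
              }
              pthread_mutex_unlock(&full_stripe_lock);
              return NULL;
      }

      int main(void)
      {
              pthread_t t1, t2;

              /* two per-device scrub workers race on the same full stripe */
              pthread_create(&t1, NULL, scrub_one_device, (void *)"scrub dev3");
              pthread_create(&t2, NULL, scrub_one_device, (void *)"scrub dev1");
              pthread_join(t1, NULL);
              pthread_join(t2, NULL);

              printf("csum errors accounted: %d\n", csum_errors); /* always 1 */
              return 0;
      }
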
    • btrfs: scrub: Introduce full stripe lock for RAID56 · 0966a7b1
      Qu Wenruo authored
      Unlike mirror-based profiles, RAID5/6 recovery needs to read out the
      whole full stripe, and without proper protection this can easily
      cause a race condition.
      
      Introduce two new functions for RAID5/6: lock_full_stripe() and
      unlock_full_stripe().
      They are backed by an rb_tree of mutexes, one per full stripe, so
      scrub callers can use them to lock a full stripe and avoid the race.
      A simplified model of such a lock table follows this entry.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ minor comment adjustments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      0966a7b1
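
      As a rough userspace sketch of that structure (the kernel keeps the
      entries in an rb_tree; a linked list is used here only to keep the
      sketch short, and all other names are illustrative): each full stripe
      gets a refcounted entry holding its own mutex, created on first use
      and freed when the last user drops it.

      #include <pthread.h>
      #include <stdint.h>
      #include <stdlib.h>

      struct full_stripe_lock {
              uint64_t fstripe_start;          /* logical start of full stripe */
              int refs;
              pthread_mutex_t mutex;
              struct full_stripe_lock *next;
      };

      static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
      static struct full_stripe_lock *table;

      /* lock_full_stripe(): find or insert the entry, then take its mutex */
      static struct full_stripe_lock *lock_full_stripe(uint64_t fstripe_start)
      {
              struct full_stripe_lock *e;

              pthread_mutex_lock(&table_lock);
              for (e = table; e; e = e->next)
                      if (e->fstripe_start == fstripe_start)
                              break;
              if (!e) {
                      e = calloc(1, sizeof(*e));   /* error handling omitted */
                      e->fstripe_start = fstripe_start;
                      pthread_mutex_init(&e->mutex, NULL);
                      e->next = table;
                      table = e;
              }
              e->refs++;
              pthread_mutex_unlock(&table_lock);

              pthread_mutex_lock(&e->mutex);   /* serializes this full stripe */
              return e;
      }

      /* unlock_full_stripe(): drop the mutex, free the entry on last ref */
      static void unlock_full_stripe(struct full_stripe_lock *e)
      {
              struct full_stripe_lock **p;

              pthread_mutex_unlock(&e->mutex);

              pthread_mutex_lock(&table_lock);
              if (--e->refs == 0) {
                      for (p = &table; *p != e; p = &(*p)->next)
                              ;
                      *p = e->next;
                      free(e);
              }
              pthread_mutex_unlock(&table_lock);
      }

      int main(void)
      {
              struct full_stripe_lock *l = lock_full_stripe(64 * 1024);

              /* ...check csums / recover the full stripe here... */
              unlock_full_stripe(l);
              return 0;
      }
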
    • Btrfs: switch to div64_u64 if with a u64 divisor · 42c61ab6
      Liu Bo authored
      This fixes code paths that call div_u64() with a u64 divisor:
      div_u64() only takes a u32 divisor, so such a value gets silently
      truncated. A small demonstration follows this entry.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      42c61ab6
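
      A minimal userspace sketch of the difference, using stand-ins with
      the same shape as the kernel helpers from linux/math64.h (div_u64()
      takes a u32 divisor, div64_u64() a full u64 one); the values are made
      up for illustration:

      #include <stdint.h>
      #include <stdio.h>

      static uint64_t div_u64(uint64_t dividend, uint32_t divisor)
      {
              return dividend / divisor;
      }

      static uint64_t div64_u64(uint64_t dividend, uint64_t divisor)
      {
              return dividend / divisor;
      }

      int main(void)
      {
              uint64_t dividend = 1ULL << 40;
              uint64_t divisor = (1ULL << 32) | 4;    /* does not fit in u32 */

              /* div_u64() silently truncates the divisor to 4 here, which is
               * exactly the class of bug the commit removes. */
              printf("div_u64:   %llu\n",
                     (unsigned long long)div_u64(dividend, divisor));
              printf("div64_u64: %llu\n",
                     (unsigned long long)div64_u64(dividend, divisor));
              return 0;
      }
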
    • Btrfs: update scrub_parity to use u64 stripe_len · 972d7219
      Liu Bo authored
      Commit 3d8da678 ("Btrfs: fix divide error upon chunk's stripe_len")
      changed stripe_len in struct map_lookup to u64, but didn't update
      stripe_len in struct scrub_parity.
      
      This updates the type and switches to div64_u64_rem() to match the
      u64 divisor; a small sketch of that helper follows this entry.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      972d7219
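
      A userspace stand-in with the same shape as the kernel's
      div64_u64_rem() (quotient returned, remainder through a pointer,
      both dividend and divisor a full u64); the stripe_len arithmetic is
      only an illustration:

      #include <stdint.h>
      #include <stdio.h>

      static uint64_t div64_u64_rem(uint64_t dividend, uint64_t divisor,
                                    uint64_t *remainder)
      {
              *remainder = dividend % divisor;
              return dividend / divisor;
      }

      int main(void)
      {
              uint64_t stripe_len = 64 * 1024;     /* now u64 in scrub_parity */
              uint64_t logical = 3 * stripe_len + 4096;
              uint64_t offset;
              uint64_t stripe_nr = div64_u64_rem(logical, stripe_len, &offset);

              printf("stripe %llu, offset %llu\n",
                     (unsigned long long)stripe_nr, (unsigned long long)offset);
              return 0;
      }
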
    • btrfs: use clear_page where appropriate · 619a9742
      David Sterba authored
      There's a helper to clear a whole page, with arch-specific optimized
      code. The replaced call sites do not seem to be performance critical,
      but we still might gain a few percent. A userspace sketch of the
      substitution follows this entry.
      Signed-off-by: David Sterba <dsterba@suse.com>
      619a9742
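
      A trivial userspace stand-in for the substitution (in the kernel,
      clear_page() is the arch-optimized equivalent of zeroing PAGE_SIZE
      bytes; everything below is just a model):

      #include <stdlib.h>
      #include <string.h>

      #define PAGE_SIZE 4096

      /* Userspace stand-in: the kernel's clear_page() zeroes one page like
       * memset(page, 0, PAGE_SIZE), but with arch-optimized code. */
      static void clear_page(void *page)
      {
              memset(page, 0, PAGE_SIZE);
      }

      int main(void)
      {
              void *page = aligned_alloc(PAGE_SIZE, PAGE_SIZE);

              if (!page)
                      return 1;
              clear_page(page);          /* was: memset(page, 0, PAGE_SIZE) */
              free(page);
              return 0;
      }
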
    • btrfs: Prevent scrub recheck from racing with dev replace · e501bfe3
      Qu Wenruo authored
      scrub_setup_recheck_block() calls btrfs_map_sblock() and then
      accesses the bbio without the protection of bio_counter.
      
      This can lead to a use-after-free if it races with a dev-replace
      cancel.
      
      Fix it by increasing bio_counter before calling btrfs_map_sblock()
      and decreasing it when the corresponding recover is finished. A toy
      model of this rule follows this entry.
      
      Cc: Liu Bo <bo.li.liu@oracle.com>
      Reported-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e501bfe3
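
      For illustration, a toy pthread model of the bio_counter rule (not
      kernel code; all names are stand-ins): the recheck path bumps the
      counter before mapping the block and drops it after the recover, and
      the dev-replace cancel path waits for the counter to reach zero
      before the target device may go away.

      #include <pthread.h>
      #include <stdio.h>

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t zero = PTHREAD_COND_INITIALIZER;
      static int bio_counter;
      static int target_dev_alive = 1;

      static void bio_counter_inc(void)
      {
              pthread_mutex_lock(&lock);
              bio_counter++;
              pthread_mutex_unlock(&lock);
      }

      static void bio_counter_dec(void)
      {
              pthread_mutex_lock(&lock);
              if (--bio_counter == 0)
                      pthread_cond_signal(&zero);
              pthread_mutex_unlock(&lock);
      }

      static void *scrub_recheck(void *arg)
      {
              (void)arg;
              bio_counter_inc();             /* taken before the mapping call */
              pthread_mutex_lock(&lock);
              printf("recheck: target dev %s\n",
                     target_dev_alive ? "still alive" : "already gone");
              pthread_mutex_unlock(&lock);
              bio_counter_dec();             /* dropped after the recover */
              return NULL;
      }

      static void *dev_replace_cancel(void *arg)
      {
              (void)arg;
              pthread_mutex_lock(&lock);
              while (bio_counter > 0)        /* wait instead of racing the free */
                      pthread_cond_wait(&zero, &lock);
              target_dev_alive = 0;          /* now safe to tear down the dev */
              pthread_mutex_unlock(&lock);
              return NULL;
      }

      int main(void)
      {
              pthread_t a, b;

              pthread_create(&a, NULL, scrub_recheck, NULL);
              pthread_create(&b, NULL, dev_replace_cancel, NULL);
              pthread_join(a, NULL);
              pthread_join(b, NULL);
              return 0;
      }
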
    • btrfs: Wait for in-flight bios before freeing target device for raid56 · ae6529c3
      Qu Wenruo authored
      When a raid56 dev-replace is cancelled by a running scrub, we free
      the target device without waiting for in-flight bios, causing the
      following NULL pointer dereference or general protection fault.
      
       BUG: unable to handle kernel NULL pointer dereference at 00000000000005e0
       IP: generic_make_request_checks+0x4d/0x610
       CPU: 1 PID: 11676 Comm: kworker/u4:14 Tainted: G  O    4.11.0-rc2 #72
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
       Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
       task: ffff88002875b4c0 task.stack: ffffc90001334000
       RIP: 0010:generic_make_request_checks+0x4d/0x610
       Call Trace:
        ? generic_make_request+0xc7/0x360
        generic_make_request+0x24/0x360
        ? generic_make_request+0xc7/0x360
        submit_bio+0x64/0x120
        ? page_in_rbio+0x4d/0x80 [btrfs]
        ? rbio_orig_end_io+0x80/0x80 [btrfs]
        finish_rmw+0x3f4/0x540 [btrfs]
        validate_rbio_for_rmw+0x36/0x40 [btrfs]
        raid_rmw_end_io+0x7a/0x90 [btrfs]
        bio_endio+0x56/0x60
        end_workqueue_fn+0x3c/0x40 [btrfs]
        btrfs_scrubparity_helper+0xef/0x620 [btrfs]
        btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
        process_one_work+0x2af/0x720
        ? process_one_work+0x22b/0x720
        worker_thread+0x4b/0x4f0
        kthread+0x10f/0x150
        ? process_one_work+0x720/0x720
        ? kthread_create_on_node+0x40/0x40
        ret_from_fork+0x2e/0x40
       RIP: generic_make_request_checks+0x4d/0x610 RSP: ffffc90001337bb8
      
      In btrfs_dev_replace_finishing(), when scrub finishes normally, we
      call btrfs_rm_dev_replace_blocked() to wait for bios before
      destroying the target device.
      
      However, when dev-replace is aborted, either due to an error or
      because it was cancelled by scrub, we do not wait for bios, and this
      can lead to a use-after-free if there are still bios holding the
      target device.
      
      Furthermore, for raid56 scrub, at least two places call
      btrfs_map_sblock() without the protection of bio_counter, leading to
      the same problem.
      
      This patch fixes the problem (see the sketch after this entry):
      1) Wait for bio_counter before freeing the target device when
         canceling a replace.
      2) When calling btrfs_map_sblock() for raid56, use bio_counter to
         protect the call.
      
      Cc: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ae6529c3
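
      A minimal userspace model of the ordering rule (not the kernel patch;
      wait_for_in_flight_bios() and dev_replace_finishing() are stand-ins
      for btrfs_rm_dev_replace_blocked() and btrfs_dev_replace_finishing()):
      both the normal and the aborted/canceled paths must wait for
      outstanding bios before the target device is freed.

      #include <stdbool.h>
      #include <stdio.h>
      #include <stdlib.h>

      struct target_dev {
              int id;
      };

      static int in_flight_bios = 2;     /* e.g. raid56 rmw bios still running */

      static void wait_for_in_flight_bios(void)
      {
              while (in_flight_bios > 0) {
                      printf("waiting for %d in-flight bio(s)\n", in_flight_bios);
                      in_flight_bios--;              /* pretend they complete */
              }
      }

      static void dev_replace_finishing(struct target_dev *tgt, bool aborted)
      {
              /* Before the fix this wait happened only when scrub finished
               * normally; the aborted path freed tgt with bios in flight. */
              wait_for_in_flight_bios();
              printf("%s: freeing target device %d\n",
                     aborted ? "aborted" : "finished", tgt->id);
              free(tgt);
      }

      int main(void)
      {
              struct target_dev *tgt = malloc(sizeof(*tgt));

              if (!tgt)
                      return 1;
              tgt->id = 1;
              dev_replace_finishing(tgt, true);
              return 0;
      }
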
    • btrfs: scrub: Don't append on-disk pages for raid56 scrub · 9a33944b
      Qu Wenruo authored
      In the following situation, scrub calculates wrong parity and
      overwrites the correct one:
      
      RAID5 full stripe:
      
      Before
      |     Dev 1      |     Dev  2     |     Dev 3     |
      | Data stripe 1  | Data stripe 2  | Parity Stripe |
      --------------------------------------------------- 0
      | 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
      --------------------------------------------------- 4K
      |     0xcdcd     |     0xcdcd     |     0x0000    |
      ...
      |     0xcdcd     |     0xcdcd     |     0x0000    |
      --------------------------------------------------- 64K
      
      After scrubbing dev3 only:
      
      |     Dev 1      |     Dev  2     |     Dev 3     |
      | Data stripe 1  | Data stripe 2  | Parity Stripe |
      --------------------------------------------------- 0
      | 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
      --------------------------------------------------- 4K
      |     0xcdcd     |     0xcdcd     |     0x0000    |
      ...
      |     0xcdcd     |     0xcdcd     |     0x0000    |
      --------------------------------------------------- 64K
      
      The reason is that after the raid56 read-rebuild, rbio->stripe_pages
      are all correctly recovered (0xcd for the data stripes).
      
      However, when we check and repair parity in
      scrub_parity_check_and_repair(), we append the pages in the
      sparity->spages list to rbio->bio_pages[], and those pages contain
      the old on-disk data.
      
      When we submit the parity data to disk, we calculate parity from
      rbio->bio_pages[] first and fall back to rbio->stripe_pages[] only
      where no bio page is present.
      
      The patch fixes it by not appending the pages from sparity->spages,
      so finish_parity_scrub() uses rbio->stripe_pages[], which holds the
      correct data. A toy model of this page-selection fallback follows
      this entry.
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9a33944b
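
      A toy userspace model of that fallback (not the kernel code; page
      sizes and names are made up): parity is computed from bio_pages[i]
      when present, otherwise from the rebuilt stripe_pages[i], so stuffing
      the stale on-disk page into bio_pages[] yields the wrong parity while
      leaving it NULL yields the right one.

      #include <stdio.h>
      #include <string.h>

      #define PAGE_SZ 8                       /* toy page size    */
      #define NR_DATA 2                       /* two data stripes */

      static unsigned char stripe_pages[NR_DATA][PAGE_SZ];  /* rebuilt: 0xcd */
      static unsigned char *bio_pages[NR_DATA];              /* stale if set  */

      static const unsigned char *page_for_parity(int i)
      {
              return bio_pages[i] ? bio_pages[i] : stripe_pages[i];
      }

      int main(void)
      {
              unsigned char stale[PAGE_SZ] = { 0 };   /* old on-disk D1: zeros */
              unsigned char parity[PAGE_SZ];
              int i, b;

              memset(stripe_pages, 0xcd, sizeof(stripe_pages));

              /* The bug: the old on-disk page gets appended here.
               * The fix is simply to leave bio_pages[0] == NULL. */
              bio_pages[0] = stale;

              for (b = 0; b < PAGE_SZ; b++) {
                      parity[b] = 0;
                      for (i = 0; i < NR_DATA; i++)
                              parity[b] ^= page_for_parity(i)[b];
              }
              /* With bio_pages[0] set:  0x00 ^ 0xcd = 0xcd (wrong parity).
               * With it left NULL:      0xcd ^ 0xcd = 0x00 (correct).     */
              printf("computed parity byte: 0x%02x\n", parity[0]);
              return 0;
      }
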
    • btrfs: drop redundant parameters from btrfs_map_sblock · 825ad4c9
      David Sterba authored
      All callers pass 0 for mirror_num and 1 for need_raid_map.
      Signed-off-by: David Sterba <dsterba@suse.com>
      825ad4c9
    • Btrfs: set scrub page's io_error if failing to submit io · 1bcd7aa1
      Liu Bo authored
      Scrub repairs data in units called scrub_blocks, each of which may
      contain several pages. Scrub always tries to find a good copy of a
      whole block, but if there is no such copy, it tries to repair page
      by page.
      
      If we don't set a page's io_error while checking the bad copy, we may
      skip that page in the last step, when repairing the bad copy from the
      good copy. A small model of this follows this entry.
      
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1bcd7aa1
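
      A small userspace model of the rule (the struct and function names
      are stand-ins, not the kernel's): a page whose read bio could not
      even be submitted must be marked io_error, otherwise the final
      repair-from-good-copy loop treats it as clean and skips it.

      #include <stdbool.h>
      #include <stdio.h>

      struct scrub_page {
              bool io_error;
              bool repaired;
      };

      static bool submit_read(struct scrub_page *p)
      {
              (void)p;
              return false;            /* simulate a failed bio submission */
      }

      int main(void)
      {
              struct scrub_page pages[4] = { 0 };
              int i;

              /* page-by-page check of the bad copy */
              for (i = 0; i < 4; i++)
                      if (!submit_read(&pages[i]))
                              pages[i].io_error = true;   /* the missing assignment */

              /* last step: only pages marked bad get copied from the good copy */
              for (i = 0; i < 4; i++)
                      if (pages[i].io_error)
                              pages[i].repaired = true;

              for (i = 0; i < 4; i++)
                      printf("page %d repaired: %d\n", i, pages[i].repaired);
              return 0;
      }
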
    • btrfs: convert scrub_ctx.refs from atomic_t to refcount_t · 99f4cdb1
      Elena Reshetova authored
      The refcount_t type and its corresponding API should be used instead
      of atomic_t when the variable is used as a reference counter. This
      avoids accidental refcount overflows that might lead to
      use-after-free situations. A userspace sketch of the conversion
      pattern follows this entry; the same pattern applies to the three
      conversions below.
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David Windsor <dwindsor@gmail.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      99f4cdb1
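
      A rough userspace sketch of the conversion pattern (the refcount_t
      stand-in below only saturates on increment; the kernel's real
      refcount_t does more, e.g. warning on increment-from-zero and on
      underflow):

      #include <limits.h>
      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>

      typedef struct { atomic_uint val; } refcount_t;

      static void refcount_set(refcount_t *r, unsigned int n)
      {
              atomic_store(&r->val, n);
      }

      static void refcount_inc(refcount_t *r)
      {
              unsigned int old = atomic_load(&r->val);

              while (old != UINT_MAX &&
                     !atomic_compare_exchange_weak(&r->val, &old, old + 1))
                      ;       /* retry; saturate at UINT_MAX, never wrap to 0 */
      }

      static bool refcount_dec_and_test(refcount_t *r)
      {
              return atomic_fetch_sub(&r->val, 1) == 1;
      }

      int main(void)
      {
              refcount_t refs;                     /* e.g. scrub_ctx.refs       */

              refcount_set(&refs, 1);              /* was atomic_set(&refs, 1)  */
              refcount_inc(&refs);                 /* was atomic_inc(&refs)     */

              if (!refcount_dec_and_test(&refs))   /* was atomic_dec_and_test() */
                      printf("still referenced\n");
              if (refcount_dec_and_test(&refs))
                      printf("last reference dropped, free the scrub_ctx\n");
              return 0;
      }
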
    • btrfs: convert scrub_parity.refs from atomic_t to refcount_t · 78a76450
      Elena Reshetova authored
      The refcount_t type and its corresponding API should be used instead
      of atomic_t when the variable is used as a reference counter. This
      avoids accidental refcount overflows that might lead to
      use-after-free situations.
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David Windsor <dwindsor@gmail.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      78a76450
    • btrfs: convert scrub_block.refs from atomic_t to refcount_t · 186debd6
      Elena Reshetova authored
      The refcount_t type and its corresponding API should be used instead
      of atomic_t when the variable is used as a reference counter. This
      avoids accidental refcount overflows that might lead to
      use-after-free situations.
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David Windsor <dwindsor@gmail.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      186debd6
    • btrfs: convert scrub_recover.refs from atomic_t to refcount_t · 6f615018
      Elena Reshetova authored
      The refcount_t type and its corresponding API should be used instead
      of atomic_t when the variable is used as a reference counter. This
      avoids accidental refcount overflows that might lead to
      use-after-free situations.
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David Windsor <dwindsor@gmail.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6f615018
  2. 28 February 2017, 3 commits
  3. 17 February 2017, 2 commits
  4. 06 December 2016, 5 commits
  5. 29 November 2016, 1 commit
  6. 01 November 2016, 1 commit
  7. 27 September 2016, 1 commit
  8. 26 July 2016, 2 commits
  9. 08 June 2016, 2 commits
  10. 30 May 2016, 3 commits
    • Btrfs: fix race setting block group back to RW mode during device replace · 1a1a8b73
      Filipe Manana authored
      After it finishes processing a device extent, the device replace code
      sets the block group back to RW mode and only then sets the left
      cursor to match the logical end address of the block group, so that
      future writes into extents belonging to the block group go to both
      the source (old) and target (new) devices. However, from the moment
      we turn the block group back to RW mode until we update the left
      cursor's value, there is a short window in which extents can be
      allocated from the block group and written to, in which case they
      will not be copied/written to the target (new) device. Fix this by
      updating the left cursor's value before turning the block group back
      to RW mode (see the sketch after this entry).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      1a1a8b73
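
      A sequential sketch of the corrected ordering (the struct, fields and
      function below are illustrative stand-ins, not the kernel code):
      publish the new left cursor first, then let the block group become
      writable again, so any write that lands in it is already mirrored to
      the target device.

      #include <stdio.h>

      struct replace_state {
              unsigned long long cursor_left;   /* published progress cursor    */
              int bg_readonly;                  /* block group write protection */
      };

      /* called when the device extent of one block group has been copied */
      static void finish_device_extent(struct replace_state *rs,
                                       unsigned long long bg_logical_end)
      {
              /* fixed order: publish the cursor first... */
              rs->cursor_left = bg_logical_end;
              /* ...then allow new allocations/writes in the block group again */
              rs->bg_readonly = 0;
      }

      int main(void)
      {
              struct replace_state rs = { .cursor_left = 0, .bg_readonly = 1 };

              finish_device_extent(&rs, 1ULL << 30);
              printf("cursor_left=%llu readonly=%d\n",
                     rs.cursor_left, rs.bg_readonly);
              return 0;
      }
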
    • Btrfs: fix unprotected assignment of the left cursor for device replace · 81e87a73
      Filipe Manana authored
      We were assigning new values to fields of the device replace object
      without holding the respective lock after processing each device
      extent. This matters for the left cursor field, which can be accessed
      by a concurrent task running __btrfs_map_block (which, correctly,
      takes the device replace lock).
      So update these fields while holding the device replace lock, as in
      the sketch after this entry.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      81e87a73
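
      A small pthread sketch of the rule (everything here is a stand-in for
      the real device replace object and its lock): readers such as
      __btrfs_map_block take the lock before looking at the cursors, so the
      writer must take the same lock when updating them.

      #include <pthread.h>
      #include <stdio.h>

      struct dev_replace {
              pthread_mutex_t lock;            /* the device replace lock */
              unsigned long long cursor_left;
              unsigned long long cursor_right;
      };

      /* called by the replace/scrub thread after each device extent */
      static void update_cursors(struct dev_replace *dr,
                                 unsigned long long left,
                                 unsigned long long right)
      {
              pthread_mutex_lock(&dr->lock);   /* same lock the readers take */
              dr->cursor_left = left;
              dr->cursor_right = right;
              pthread_mutex_unlock(&dr->lock);
      }

      int main(void)
      {
              struct dev_replace dr = { .cursor_left = 0, .cursor_right = 0 };

              pthread_mutex_init(&dr.lock, NULL);
              update_cursors(&dr, 1ULL << 30, 2ULL << 30);
              printf("cursor_left=%llu cursor_right=%llu\n",
                     dr.cursor_left, dr.cursor_right);
              return 0;
      }
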
    • Btrfs: fix race setting block group readonly during device replace · f0e9b7d6
      Filipe Manana authored
      When we do a device replace, for each device extent we find on the
      source device we set the corresponding block group to readonly mode,
      to prevent writes into it while we copy the device extent from the
      source to the target device. However, just before we set the block
      group to readonly mode, some concurrent task might have already
      allocated an extent from it, or decided it could perform a NOCOW
      write into one of its extents. This can make the device replace
      process miss copying an extent, because it searches for extents using
      the extent tree's commit root and only sets the left cursor to the
      logical end address of the block group once it has finished searching
      for all extents belonging to it. That is a problem if the respective
      ordered extents finish while we are searching using the commit root
      and no transaction commit happens while we iterate the tree, since it
      is the delayed references created by the ordered extents (when they
      complete) that insert the extent items into the extent tree (using
      the non-commit root, of course).
      Example:
      
                CPU 1                                            CPU 2
      
       btrfs_dev_replace_start()
         btrfs_scrub_dev()
           scrub_enumerate_chunks()
             --> finds device extent belonging
                 to block group X
      
                                     <transaction N starts>
      
                                                            starts buffered write
                                                            against some inode
      
                                                             writepages is run against
                                                             that inode forcing delalloc
                                                             to run
      
                                                            btrfs_writepages()
                                                              extent_writepages()
                                                                extent_write_cache_pages()
                                                                  __extent_writepage()
                                                                    writepage_delalloc()
                                                                      run_delalloc_range()
                                                                        cow_file_range()
                                                                          btrfs_reserve_extent()
                                                                            --> allocates an extent
                                                                                from block group X
                                                                                (which is not yet
                                                                                 in RO mode)
                                                                          btrfs_add_ordered_extent()
                                                                            --> creates ordered extent Y
                                                              flush_epd_write_bio()
                                                                --> bio against the extent from
                                                                    block group X is submitted
      
             btrfs_inc_block_group_ro(bg X)
               --> sets block group X to readonly
      
             scrub_chunk(bg X)
               scrub_stripe(device extent from srcdev)
                 --> keeps searching for extent items
                     belonging to the block group using
                     the extent tree's commit root
                 --> it never blocks due to
                     fs_info->scrub_pause_req as no
                     one tries to commit transaction N
                 --> copies all extents found from the
                     source device into the target device
                 --> finishes search loop
      
                                                              bio completes
      
                                                              ordered extent Y completes
                                                              and creates delayed data
                                                              reference which will add an
                                                              extent item to the extent
                                                              tree when run (typically
                                                              at transaction commit time)
      
                                                                --> so the task doing the
                                                                    scrub/device replace
                                                                    at CPU 1 misses this
                                                                    and does not copy this
                                                                    extent into the new/target
                                                                    device
      
             btrfs_dec_block_group_ro(bg X)
               --> turns block group X back to RW mode
      
             dev_replace->cursor_left is set to the
             logical end offset of block group X
      
      So fix this by waiting for all COW and NOCOW writes after setting a
      block group to readonly mode; a sequential sketch of this ordering
      follows this entry.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      f0e9b7d6
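
      A sequential sketch of the fixed ordering (the helper names below are
      stand-ins, not the actual kernel functions): flip the block group to
      RO, then wait for the writes that already made it in (reserved/NOCOW
      writes and their ordered extents), and only then scan the commit root
      and copy extents to the target device.

      #include <stdio.h>

      static int pending_writes = 3;   /* writes that raced with the RO flip */

      static void set_block_group_ro(void)
      {
              printf("block group -> RO\n");
      }

      /* the step added by the fix: let already-started writes finish first */
      static void wait_for_pending_writes(void)
      {
              while (pending_writes > 0) {
                      printf("waiting, %d write(s) left\n", pending_writes);
                      pending_writes--;      /* pretend ordered extents complete */
              }
      }

      static void scrub_block_group_from_commit_root(void)
      {
              printf("copying extents to the target device\n");
      }

      int main(void)
      {
              set_block_group_ro();
              wait_for_pending_writes();
              scrub_block_group_from_commit_root();
              return 0;
      }
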
  11. 26 May 2016, 2 commits
  12. 16 May 2016, 1 commit
  13. 06 May 2016, 3 commits