1. 26 9月, 2022 8 次提交
  2. 23 8月, 2022 1 次提交
    • Q
      btrfs: don't merge pages into bio if their page offset is not contiguous · 4a445b7b
      Qu Wenruo 提交于
      [BUG]
      Zygo reported on latest development branch, he could hit
      ASSERT()/BUG_ON() caused crash when doing RAID5 recovery (intentionally
      corrupt one disk, and let btrfs to recover the data during read/scrub).
      
      And The following minimal reproducer can cause extent state leakage at
      rmmod time:
      
        mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null
        mount $dev1 $mnt
        fsstress -w -d $mnt -n 25 -s 1660807876
        sync
        fssum -A -f -w /tmp/fssum.saved $mnt
        umount $mnt
      
        # Wipe the dev1 but keeps its super block
        xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1
        mount $dev1 $mnt
        fssum -r /tmp/fssum.saved $mnt > /dev/null
        umount $mnt
        rmmod btrfs
      
      This will lead to the following extent states leakage:
      
        BTRFS: state leak: start 499712 end 503807 state 5 in tree 1 refs 1
        BTRFS: state leak: start 495616 end 499711 state 5 in tree 1 refs 1
        BTRFS: state leak: start 491520 end 495615 state 5 in tree 1 refs 1
        BTRFS: state leak: start 487424 end 491519 state 5 in tree 1 refs 1
        BTRFS: state leak: start 483328 end 487423 state 5 in tree 1 refs 1
        BTRFS: state leak: start 479232 end 483327 state 5 in tree 1 refs 1
        BTRFS: state leak: start 475136 end 479231 state 5 in tree 1 refs 1
        BTRFS: state leak: start 471040 end 475135 state 5 in tree 1 refs 1
      
      [CAUSE]
      Since commit 7aa51232 ("btrfs: pass a btrfs_bio to
      btrfs_repair_one_sector"), we always use btrfs_bio->file_offset to
      determine the file offset of a page.
      
      But that usage assume that, one bio has all its page having a continuous
      page offsets.
      
      Unfortunately that's not true, btrfs only requires the logical bytenr
      contiguous when assembling its bios.
      
      From above script, we have one bio looks like this:
      
        fssum-27671  submit_one_bio: bio logical=217739264 len=36864
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=466944 <<<
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=724992 <<<
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=729088
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=733184
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=737280
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=741376
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=745472
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=749568
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=753664
      
      Note that the 1st and the 2nd page has non-contiguous page offsets.
      
      This means, at repair time, we will have completely wrong file offset
      passed in:
      
         kworker/u32:2-19927  btrfs_repair_one_sector: r/i=5/261 page_off=729088 file_off=475136 bio_offset=8192
      
      Since the file offset is incorrect, we latter incorrectly set the extent
      states, and no way to really release them.
      
      Thus later it causes the leakage.
      
      In fact, this can be even worse, since the file offset is incorrect, we
      can hit cases like the incorrect file offset belongs to a HOLE, and
      later cause btrfs_num_copies() to trigger error, finally hit
      BUG_ON()/ASSERT() later.
      
      [FIX]
      Add an extra condition in btrfs_bio_add_page() for uncompressed IO.
      
      Now we will have more strict requirement for bio pages:
      
      - They should all have the same mapping
        (the mapping check is already implied by the call chain)
      
      - Their logical bytenr should be adjacent
        This is the same as the old condition.
      
      - Their page_offset() (file offset) should be adjacent
        This is the new check.
        This would result a slightly increased amount of bios from btrfs
        (needs holes and inside the same stripe boundary to trigger).
      
        But this would greatly reduce the confusion, as it's pretty common
        to assume a btrfs bio would only contain continuous page cache.
      
      Later we may need extra cleanups, as we no longer needs to handle gaps
      between page offsets in endio functions.
      
      Currently this should be the minimal patch to fix commit 7aa51232
      ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector").
      Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Fixes: 7aa51232 ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector")
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4a445b7b
  3. 17 8月, 2022 1 次提交
    • J
      btrfs: fix lockdep splat with reloc root extent buffers · b40130b2
      Josef Bacik 提交于
      We have been hitting the following lockdep splat with btrfs/187 recently
      
        WARNING: possible circular locking dependency detected
        5.19.0-rc8+ #775 Not tainted
        ------------------------------------------------------
        btrfs/752500 is trying to acquire lock:
        ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        but task is already holding lock:
        ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_init_new_buffer+0x7d/0x2c0
      	 btrfs_alloc_tree_block+0x120/0x3b0
      	 __btrfs_cow_block+0x136/0x600
      	 btrfs_cow_block+0x10b/0x230
      	 btrfs_search_slot+0x53b/0xb70
      	 btrfs_lookup_inode+0x2a/0xa0
      	 __btrfs_update_delayed_inode+0x5f/0x280
      	 btrfs_async_run_delayed_root+0x24c/0x290
      	 btrfs_work_helper+0xf2/0x3e0
      	 process_one_work+0x271/0x590
      	 worker_thread+0x52/0x3b0
      	 kthread+0xf0/0x120
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (btrfs-tree-01){++++}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_search_slot+0x3c3/0xb70
      	 do_relocation+0x10c/0x6b0
      	 relocate_tree_blocks+0x317/0x6d0
      	 relocate_block_group+0x1f1/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
      	 __lock_acquire+0x1122/0x1e10
      	 lock_acquire+0xc2/0x2d0
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_lock_root_node+0x31/0x50
      	 btrfs_search_slot+0x1cb/0xb70
      	 replace_path+0x541/0x9f0
      	 merge_reloc_root+0x1d6/0x610
      	 merge_reloc_roots+0xe2/0x260
      	 relocate_block_group+0x2c8/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        other info that might help us debug this:
      
        Chain exists of:
          btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-tree-01/1);
      				 lock(btrfs-tree-01);
      				 lock(btrfs-tree-01/1);
          lock(btrfs-treloc-02#2);
      
         *** DEADLOCK ***
      
        7 locks held by btrfs/752500:
         #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
         #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
         #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
         #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
         #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        stack backtrace:
        CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
      
         dump_stack_lvl+0x56/0x73
         check_noncircular+0xd6/0x100
         ? lock_is_held_type+0xe2/0x140
         __lock_acquire+0x1122/0x1e10
         lock_acquire+0xc2/0x2d0
         ? __btrfs_tree_lock+0x24/0x110
         down_write_nested+0x41/0x80
         ? __btrfs_tree_lock+0x24/0x110
         __btrfs_tree_lock+0x24/0x110
         btrfs_lock_root_node+0x31/0x50
         btrfs_search_slot+0x1cb/0xb70
         ? lock_release+0x137/0x2d0
         ? _raw_spin_unlock+0x29/0x50
         ? release_extent_buffer+0x128/0x180
         replace_path+0x541/0x9f0
         merge_reloc_root+0x1d6/0x610
         merge_reloc_roots+0xe2/0x260
         relocate_block_group+0x2c8/0x560
         btrfs_relocate_block_group+0x23e/0x400
         btrfs_relocate_chunk+0x4c/0x140
         btrfs_balance+0x755/0xe40
         btrfs_ioctl+0x1ea2/0x2c90
         ? lock_is_held_type+0xe2/0x140
         ? lock_is_held_type+0xe2/0x140
         ? __x64_sys_ioctl+0x88/0xc0
         __x64_sys_ioctl+0x88/0xc0
         do_syscall_64+0x38/0x90
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      This isn't necessarily new, it's just tricky to hit in practice.  There
      are two competing things going on here.  With relocation we create a
      snapshot of every fs tree with a reloc tree.  Any extent buffers that
      get initialized here are initialized with the reloc root lockdep key.
      However since it is a snapshot, any blocks that are currently in cache
      that originally belonged to the fs tree will have the normal tree
      lockdep key set.  This creates the lock dependency of
      
        reloc tree -> normal tree
      
      for the extent buffer locking during the first phase of the relocation
      as we walk down the reloc root to relocate blocks.
      
      However this is problematic because the final phase of the relocation is
      merging the reloc root into the original fs root.  This involves
      searching down to any keys that exist in the original fs root and then
      swapping the relocated block and the original fs root block.  We have to
      search down to the fs root first, and then go search the reloc root for
      the block we need to replace.  This creates the dependency of
      
        normal tree -> reloc tree
      
      which is why lockdep complains.
      
      Additionally even if we were to fix this particular mismatch with a
      different nesting for the merge case, we're still slotting in a block
      that has a owner of the reloc root objectid into a normal tree, so that
      block will have its lockdep key set to the tree reloc root, and create a
      lockdep splat later on when we wander into that block from the fs root.
      
      Unfortunately the only solution here is to make sure we do not set the
      lockdep key to the reloc tree lockdep key normally, and then reset any
      blocks we wander into from the reloc root when we're doing the merged.
      
      This solves the problem of having mixed tree reloc keys intermixed with
      normal tree keys, and then allows us to make sure in the merge case we
      maintain the lock order of
      
        normal tree -> reloc tree
      
      We handle this by setting a bit on the reloc root when we do the search
      for the block we want to relocate, and any block we search into or COW
      at that point gets set to the reloc tree key.  This works correctly
      because we only ever COW down to the parent node, so we aren't resetting
      the key for the block we're linking into the fs root.
      
      With this patch we no longer have the lockdep splat in btrfs/187.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b40130b2
  4. 26 7月, 2022 2 次提交
    • C
      btrfs: fix repair of compressed extents · 81bd9328
      Christoph Hellwig 提交于
      Currently the checksum of compressed extents is verified based on the
      compressed data and the lower btrfs_bio, but the actual repair process
      is driven by end_bio_extent_readpage on the upper btrfs_bio for the
      decompressed data.
      
      This has a bunch of issues, including not being able to properly
      communicate the failed mirror up in case that the I/O submission got
      preempted, a general loss of if an error was an I/O error or a checksum
      verification failure, but most importantly that this design causes
      btrfs_clean_io_failure to eventually write back the uncompressed good
      data onto the disk sectors that are supposed to contain compressed data.
      
      Fix this by moving the repair to the lower btrfs_bio.  To do so, a fair
      amount of code has to be reshuffled:
      
       a) the lower btrfs_bio now needs a valid csum pointer.  The easiest way
          to achieve that is to pass NULL btrfs_lookup_bio_sums and just use
          the btrfs_bio management of csums.  For a compressed_bio that is
          split into multiple btrfs_bios this means additional memory
          allocations, but the code becomes a lot more regular.
       b) checksum verification now runs directly on the lower btrfs_bio instead
          of the compressed_bio.  This actually nicely simplifies the end I/O
          processing.
       c) btrfs_repair_one_sector can't just look up the logical address for
          the file offset any more, as there is no corresponding relative
          offsets that apply to the file offset and the logic address for
          compressed extents.  Instead require that the saved bvec_iter in the
          btrfs_bio is filled out for all read bios and use that, which again
          removes a fair amount of code.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      81bd9328
    • C
      btrfs: pass a btrfs_bio to btrfs_repair_one_sector · 7aa51232
      Christoph Hellwig 提交于
      Pass the btrfs_bio instead of the plain bio to btrfs_repair_one_sector,
      and remove the start and failed_mirror arguments in favor of deriving
      them from the btrfs_bio.  For this to work ensure that the file_offset
      field is also initialized for buffered I/O.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NBoris Burkov <boris@bur.io>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7aa51232
  5. 25 7月, 2022 23 次提交
  6. 16 7月, 2022 1 次提交
  7. 15 7月, 2022 1 次提交
  8. 21 6月, 2022 1 次提交
  9. 16 5月, 2022 2 次提交