1. 07 Oct, 2020 5 commits
    • J
      btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks · 9631e4cc
      Committed by Josef Bacik
      When we COW a block we are holding a lock on the original block, and
      then we lock the new COW block.  Because our lockdep maps are based on
      root + level, this will make lockdep complain.  We need a way to
      indicate a subclass for locking the COW'ed block, so plumb through our
      btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer,
      and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks.
      
      The reason I've added all this extra infrastructure is because there
      will be need of different nesting classes in follow up patches.

      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Q
      btrfs: extent-tree: kill the BUG_ON() in insert_inline_extent_backref() · 07cce5cf
      Committed by Qu Wenruo
      [BUG]
      With a crafted image, btrfs can panic at insert_inline_extent_backref():
      
        kernel BUG at fs/btrfs/extent-tree.c:1857!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 0 PID: 1117 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9
        RIP: 0010:insert_inline_extent_backref+0xcc/0xe0
        RSP: 0018:ffffac4dc1287be8 EFLAGS: 00010293
        RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000000000001
        RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffffac4dc1287c28 R08: ffffac4dc1287ab8 R09: ffffac4dc1287ac0
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
        R13: ffff8febef88a540 R14: ffff8febeaa7bc30 R15: 0000000000000000
        FS: 0000000000000000(0000) GS:ffff8febf7a00000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f663ace94c0 CR3: 0000000235698006 CR4: 00000000000206f0
        Call Trace:
        ? _cond_resched+0x1a/0x50
        __btrfs_inc_extent_ref.isra.64+0x7e/0x240
        ? btrfs_merge_delayed_refs+0xa5/0x330
        __btrfs_run_delayed_refs+0x653/0x1120
        btrfs_run_delayed_refs+0xdb/0x1b0
        btrfs_commit_transaction+0x52/0x950
        ? start_transaction+0x94/0x450
        transaction_kthread+0x163/0x190
        kthread+0x105/0x140
        ? btrfs_cleanup_transaction+0x560/0x560
        ? kthread_destroy_worker+0x50/0x50
        ret_from_fork+0x35/0x40
        Modules linked in:
        ---[ end trace 2ad8b3de903cf825 ]---
      
      [CAUSE]
      Due to extent tree corruption (each item is still valid by itself,
      but the cross references are bad), we can allocate an extent which is
      still in the extent tree.  The offending tree block in this case is
      from the csum tree.  The newly allocated tree block is also for the
      csum tree.

      Then we will try to insert a tree block ref for the existing tree block
      ref.

      For a tree extent item, a tree block can never be shared directly by
      the same tree twice.  We have a BUG_ON() to catch this problem, but
      that is not proper error handling.

      [FIX]
      Replace that BUG_ON() with a proper error message and a leaf dump for
      debug builds.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202829
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
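The change described above, turning a fatal assertion into a reported error, can be sketched in plain C. This is a minimal illustration under stated assumptions, not the kernel code: `struct backref_lookup`, `check_inline_backref()` and the `-EIO` return value are hypothetical stand-ins, since the commit message does not say which error code the patch returns.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, simplified view of the state the check looks at. */
struct backref_lookup {
	bool found_existing;       /* a backref for this root already exists */
	bool is_tree_block;        /* the extent item describes a tree block */
	unsigned long long bytenr; /* start of the extent, for the message */
};

/*
 * Returns 0 when the backref may be inserted or updated, or a negative
 * errno when the extent tree is inconsistent. The old behaviour would
 * correspond to crashing here; -EIO is only a placeholder error code.
 */
static int check_inline_backref(const struct backref_lookup *l)
{
	if (l->found_existing && l->is_tree_block) {
		/* A tree block cannot be referenced twice by the same tree. */
		fprintf(stderr,
			"adding a ref to an existing tree block ref, bytenr %llu\n",
			l->bytenr);
		return -EIO;
	}
	return 0;
}
```

A caller would propagate the negative return value up the stack instead of letting a crafted image bring the machine down.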
    • Q
      btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent() · 1c2a07f5
      Committed by Qu Wenruo
      __btrfs_free_extent() is doing two things:
      
      1. Reduce the refs number of an extent backref
         Either it's an inline extent backref (inside EXTENT/METADATA item) or
         a keyed extent backref (SHARED_* item).
         We only need to locate that backref line, and either reduce the
         number or remove the backref line completely.
      
      2. Update the refs count in EXTENT/METADATA_ITEM
      
      During step 1), we will try to locate the EXTENT/METADATA_ITEM without
      triggering another btrfs_search_slot(), as a fast path.

      Only when we fail to locate that item will we trigger another
      btrfs_search_slot() to get that EXTENT/METADATA_ITEM after we have
      updated/deleted the backref line.

      And we have a lot of strict checks on things like refs_to_drop against
      extent refs and special case checks for single ref extents.

      There are 7 BUG_ON()s.  Although they're doing correct checks, they can
      be triggered by crafted images.
      
      This patch improves the function:
      
      - Introduce two examples to show what __btrfs_free_extent() is doing:
        one inline backref case and one keyed case.  Should cover most cases.

      - Replace all BUG_ON()s with proper error messages and an optional
        leaf dump

      - Add comments to show the overall flow
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202819
      [ The report triggers one BUG_ON() in __btrfs_free_extent() ]
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: call btrfs_try_granting_tickets when unpinning anything · 2732798c
      Committed by Josef Bacik
      When unpinning we were only calling btrfs_try_granting_tickets() if
      global_rsv->space_info == space_info, which is problematic because we
      use ticketing for SYSTEM chunks, and want to use it for DATA as well.
      Fix this by moving this call outside of that if statement.

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Q
      btrfs: tracepoints: output proper root owner for trace_find_free_extent() · 437490fe
      Committed by Qu Wenruo
      The current trace event always outputs results like this:
      
       find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA)
       find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA)
       find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA)
       find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA)
       find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA)
       find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA)
      
      It's saying we're allocating a data extent for the EXTENT tree, which
      is not even possible.

      It's because we always use the EXTENT tree as the owner for
      trace_find_free_extent() without using the @root from
      btrfs_reserve_extent().

      This patch changes the parameter to use the proper @root for
      trace_find_free_extent().
      
      Now it looks much better:
      
       find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
       find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA)
       find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=1(DATA)
       find_free_extent: root=5(FS_TREE) len=4096 empty_size=0 flags=1(DATA)
       find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA)
       find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
       find_free_extent: root=7(CSUM_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
       find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)
       find_free_extent: root=1(ROOT_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP)

      Reported-by: Hans van Kranenburg <hans@knorrie.org>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
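The `flags=` field in the trace lines above is a bitmask. The sketch below decodes it the way the output suggests, using only the three bit values that can be read off the sample lines (1 is DATA, 4 is METADATA, 36 is METADATA|DUP). The macro and function names are hypothetical, and real block group flags include more bits than shown here.

```c
#include <stddef.h>
#include <string.h>

/* Bit values inferred from the trace lines, not a complete list. */
#define BG_DATA     (1ULL << 0) /* 1  */
#define BG_METADATA (1ULL << 2) /* 4  */
#define BG_DUP      (1ULL << 5) /* 32 */

/*
 * Write a "A|B" style description of the known bits in flags into buf.
 * len is the size of buf and must be at least 1.
 */
static void describe_flags(unsigned long long flags, char *buf, size_t len)
{
	static const struct {
		unsigned long long bit;
		const char *name;
	} names[] = {
		{ BG_DATA, "DATA" },
		{ BG_METADATA, "METADATA" },
		{ BG_DUP, "DUP" },
	};
	size_t i;

	buf[0] = '\0';
	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (!(flags & names[i].bit))
			continue;
		/* Separate names with '|' once something has been written. */
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, names[i].name, len - strlen(buf) - 1);
	}
}
```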
  2. 07 Sep, 2020 1 commit
    • Q
      btrfs: require only sector size alignment for parent eb bytenr · ea57788e
      Committed by Qu Wenruo
      [BUG]
      A completely sane converted fs will cause a kernel warning at balance
      time:
      
        [ 1557.188633] BTRFS info (device sda7): relocating block group 8162107392 flags data
        [ 1563.358078] BTRFS info (device sda7): found 11722 extents
        [ 1563.358277] BTRFS info (device sda7): leaf 7989321728 gen 95 total ptrs 213 free space 3458 owner 2
        [ 1563.358280] 	item 0 key (7984947200 169 0) itemoff 16250 itemsize 33
        [ 1563.358281] 		extent refs 1 gen 90 flags 2
        [ 1563.358282] 		ref#0: tree block backref root 4
        [ 1563.358285] 	item 1 key (7985602560 169 0) itemoff 16217 itemsize 33
        [ 1563.358286] 		extent refs 1 gen 93 flags 258
        [ 1563.358287] 		ref#0: shared block backref parent 7985602560
        [ 1563.358288] 			(parent 7985602560 is NOT ALIGNED to nodesize 16384)
        [ 1563.358290] 	item 2 key (7985635328 169 0) itemoff 16184 itemsize 33
        ...
        [ 1563.358995] BTRFS error (device sda7): eb 7989321728 invalid extent inline ref type 182
        [ 1563.358996] ------------[ cut here ]------------
        [ 1563.359005] WARNING: CPU: 14 PID: 2930 at 0xffffffff9f231766
      
      The transaction is then aborted, and the balance obviously fails.
      
      [CAUSE]
      The mentioned inline ref type 182 is completely sane: it's
      BTRFS_SHARED_BLOCK_REF_KEY.  An extra check makes the kernel believe
      it's invalid.

      Commit 64ecdb64 ("Btrfs: add one more sanity check for shared ref
      type") introduced extra checks for backref type.

      One of the requirements is that the parent bytenr must be aligned to
      node size, which is not correct.
      
      One example is like this:
      
      0	1G  1G+4K		2G 2G+4K
      	|   |///////////////////|//|  <- A chunk starts at 1G+4K
                  |   |	<- A tree block get reserved at bytenr 1G+4K
      
      Then we have a valid tree block at bytenr 1G+4K, but it is not aligned
      to nodesize (16K).

      Such a chunk is not ideal, but the current kernel can handle it pretty
      well.  We may warn about such tree blocks in the future, but should not
      reject them.
      
      [FIX]
      Change the alignment requirement from node size alignment to sector size
      alignment.
      
      Also, to make our lives a little easier, output @iref when
      btrfs_get_extent_inline_ref_type() fails, so we can locate the item
      more easily.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205475
      Fixes: 64ecdb64 ("Btrfs: add one more sanity check for shared ref type")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ update comments and messages ]
      Signed-off-by: David Sterba <dsterba@suse.com>
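The difference between the two alignment rules is easy to check with the numbers from this commit message. The sketch below is an illustration only: the helper names are hypothetical, and the sizes (4K sectors, 16K nodes) are the ones quoted in the text and the log.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_4K  (1ULL << 12)
#define SZ_16K (1ULL << 14)
#define SZ_1G  (1ULL << 30)

/* True when x is a multiple of align; align must be a power of two. */
static bool is_aligned(uint64_t x, uint64_t align)
{
	return (x & (align - 1)) == 0;
}

/* Old rule: the parent bytenr had to be aligned to the node size. */
static bool parent_ok_nodesize(uint64_t parent)
{
	return is_aligned(parent, SZ_16K);
}

/* New rule: sector size alignment is enough. */
static bool parent_ok_sectorsize(uint64_t parent)
{
	return is_aligned(parent, SZ_4K);
}
```

Both the 1G+4K example and the parent 7985602560 from the log pass the sector size check while failing the node size check, which is why the old rule rejected a valid filesystem.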
  3. 27 Aug, 2020 1 commit
  4. 21 Aug, 2020 1 commit
    • B
      btrfs: detect nocow for swap after snapshot delete · a84d5d42
      Committed by Boris Burkov
      can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for
      detecting a must cow condition which is not exactly accurate, but saves
      unnecessary tree traversal. The incorrect assumption is that if the
      extent was created in a generation smaller than the last snapshot
      generation, it must be referenced by that snapshot. That is true, except
      the snapshot could have since been deleted, without affecting the last
      snapshot generation.
      
      The original patch claimed a performance win from this check, but it
      also leads to a bug where you are unable to use a swapfile if you ever
      snapshotted the subvolume it's in. Make the check slower and more strict
      for the swapon case, without modifying the general cow checks as a
      compromise. Turning swap on does not seem to be a particularly
      performance sensitive operation, so incurring a possibly unnecessary
      btrfs_search_slot seems worthwhile for the added usability.
      
      Note: Until the snapshot is completely cleaned after deletion,
      check_committed_refs will still cause the logic to think that cow is
      necessary, so the user must wait until 'btrfs subvolume sync' has
      finished before activating the swapfile with swapon.
      
      CC: stable@vger.kernel.org # 5.4+
      Suggested-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
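The trade-off in this commit can be illustrated with a toy model. Everything in it is an assumption made for the sketch: the function name, the `strict` flag, the `really_shared` input (standing in for the result of the slower reference lookup), and the exact comparison operator are not taken from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of the "must cow" decision.
 *
 * extent_gen:        generation in which the extent was created
 * last_snapshot_gen: generation of the last snapshot of the subvolume
 * strict:            skip the cheap heuristic (the swapon case)
 * really_shared:     what a full reference lookup would report
 */
static bool must_cow(uint64_t extent_gen, uint64_t last_snapshot_gen,
		     bool strict, bool really_shared)
{
	/*
	 * Cheap heuristic: an extent older than the last snapshot is assumed
	 * to be referenced by it. This is wrong if that snapshot has since
	 * been deleted, because the last snapshot generation is not reset.
	 */
	if (!strict && extent_gen <= last_snapshot_gen)
		return true;

	/* Strict path: pay for the lookup and trust its answer. */
	return really_shared;
}
```

With an old extent whose snapshot is gone, the heuristic path still answers "must cow", while the strict path allows nocow.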
  5. 12 Aug, 2020 1 commit
    • Q
      btrfs: trim: fix underflow in trim length to prevent access beyond device boundary · c57dd1f2
      Committed by Qu Wenruo
      [BUG]
      The following script can lead to many accesses beyond the device boundary:
      
        mkfs.btrfs -f $dev -b 10G
        mount $dev $mnt
        trimfs $mnt
        btrfs filesystem resize 1:-1G $mnt
        trimfs $mnt
      
      [CAUSE]
      Since commit 929be17a ("btrfs: Switch btrfs_trim_free_extents to
      find_first_clear_extent_bit"), we try to avoid trimming ranges that
      are already trimmed.

      So we check device->alloc_state by finding the first range which has
      neither CHUNK_TRIMMED nor CHUNK_ALLOCATED set.

      But if we shrink the device, those bits are not cleared, thus we could
      easily get a range that starts beyond the shrunk device size.

      As a result, the returned @start and @end are both beyond the device
      size.  Then we call "end = min(end, device->total_bytes - 1);", making
      @end smaller than the device size while @start is still beyond it.

      Finally we compute "len = end - start + 1", which underflows and
      leads to the beyond-device-boundary access.
      
      [FIX]
      This patch will fix the problem in two ways:
      
      - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
        This is the root fix
      
      - Add extra safety check when trimming free device extents
        We check and warn if the returned range is already beyond current
        device.
      
      Link: https://github.com/kdave/btrfs-progs/issues/282
      Fixes: 929be17a ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
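The underflow is plain unsigned arithmetic and can be reproduced outside the kernel. The sketch below uses hypothetical helper names. The first function mirrors the two expressions quoted in the commit message. The second adds a range check of the kind the safety check describes, though the exact condition used by the patch is an assumption here.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_1G (1ULL << 30)

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Unguarded computation: wraps around when start lies beyond the device. */
static uint64_t trim_len_unguarded(uint64_t start, uint64_t end,
				   uint64_t total_bytes)
{
	end = min_u64(end, total_bytes - 1);
	return end - start + 1; /* unsigned wrap if start > end + 1 */
}

/*
 * Guarded computation: refuses ranges that start at or beyond the current
 * device size. Returns false and leaves *len untouched in that case.
 */
static bool trim_len_checked(uint64_t start, uint64_t end,
			     uint64_t total_bytes, uint64_t *len)
{
	if (start >= total_bytes)
		return false;
	end = min_u64(end, total_bytes - 1);
	*len = end - start + 1;
	return true;
}
```

For a device shrunk from 10G to 9G and a stale range starting at 9.5G, the unguarded length comes out as a huge value close to 2^64, far larger than the device itself.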
  6. 27 Jul, 2020 2 commits
    • Q
      btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree · f3e3d9cc
      Committed by Qu Wenruo
      [BUG]
      There is a bug report that bad signal timing could lead to a read-only
      fs during balance:
      
        BTRFS info (device xvdb): balance: start -d -m -s
        BTRFS info (device xvdb): relocating block group 73001861120 flags metadata
        BTRFS info (device xvdb): found 12236 extents, stage: move data extents
        BTRFS info (device xvdb): relocating block group 71928119296 flags data
        BTRFS info (device xvdb): found 3 extents, stage: move data extents
        BTRFS info (device xvdb): found 3 extents, stage: update data pointers
        BTRFS info (device xvdb): relocating block group 60922265600 flags metadata
        BTRFS: error (device xvdb) in btrfs_drop_snapshot:5505: errno=-4 unknown
        BTRFS info (device xvdb): forced readonly
        BTRFS info (device xvdb): balance: ended with status: -4
      
      [CAUSE]
      The direct cause is the -EINTR from the following call chain when a
      fatal signal is pending:
      
       relocate_block_group()
       |- clean_dirty_subvols()
          |- btrfs_drop_snapshot()
             |- btrfs_start_transaction()
                |- btrfs_delayed_refs_rsv_refill()
                   |- btrfs_reserve_metadata_bytes()
                      |- __reserve_metadata_bytes()
                         |- wait_reserve_ticket()
                            |- prepare_to_wait_event();
                            |- ticket->error = -EINTR;
      
      Normally this behavior is fine for most btrfs_start_transaction()
      callers, as they need to catch any other error anyway, handle the
      signal the same way, and exit ASAP.

      However for balance, especially for the clean_dirty_subvols() case, we're
      already doing cleanup work, and getting -EINTR from
      btrfs_drop_snapshot() could cause a lot of unexpected problems, from
      the mentioned forced read-only report to later balance errors due to
      half dropped reloc trees.
      
      [FIX]
      Fix this problem by using btrfs_join_transaction() if
      btrfs_drop_snapshot() is called from relocation context.
      
      Since btrfs_join_transaction() won't get interrupted by signal, we can
      continue the cleanup.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Q
      btrfs: qgroup: free per-trans reserved space when a subvolume gets dropped · a3cf0e43
      Committed by Qu Wenruo
      [BUG]
      Sometimes fsstress could lead to a qgroup warning for cases like
      generic/013:
      
        BTRFS warning (device dm-3): qgroup 0/259 has unreleased space, type 1 rsv 81920
        ------------[ cut here ]------------
        WARNING: CPU: 9 PID: 24535 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
        Call Trace:
         btrfs_put_super+0x15/0x17 [btrfs]
         generic_shutdown_super+0x72/0x110
         kill_anon_super+0x18/0x30
         btrfs_kill_super+0x17/0x30 [btrfs]
         deactivate_locked_super+0x3b/0xa0
         deactivate_super+0x40/0x50
         cleanup_mnt+0x135/0x190
         __cleanup_mnt+0x12/0x20
         task_work_run+0x64/0xb0
         __prepare_exit_to_usermode+0x1bc/0x1c0
         __syscall_return_slowpath+0x47/0x230
         do_syscall_64+0x64/0xb0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        ---[ end trace 6c341cdf9b6cc3c1 ]---
        BTRFS error (device dm-3): qgroup reserved space leaked
      
      Yet subvolume 259 is no longer in that filesystem.
      
      [CAUSE]
      Normally per-trans qgroup reserved space is freed when a transaction is
      committed, in commit_fs_roots().
      
      However, a completely dropped subvolume is gone and thus no longer in
      the fs_roots_radix, so its per-trans reserved qgroup space will never
      be freed.
      
      Since the subvolume is already gone, leaked per-trans space won't cause
      any trouble for end users.
      
      [FIX]
      Just call btrfs_qgroup_free_meta_all_pertrans() before a subvolume is
      completely dropped.

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 25 May, 2020 5 commits
    • Q
      btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE · 92a7cc42
      Committed by Qu Wenruo
      The name BTRFS_ROOT_REF_COWS is not very clear about its meaning.

      In fact, that bit can only be set on these trees:

      - Subvolume roots
      - Data reloc root
      - Reloc roots for above roots

      All other trees won't get this bit set.  So just by the result, it is
      obvious that roots with this bit set can have tree blocks shared with
      other trees, either shared by snapshots or by reloc roots (a special
      snapshot created by relocation).

      This patch renames BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
      make it easier to understand, and updates all comments mentioning
      "reference counted" to follow the rename.

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      btrfs: rename member 'trimming' of block group to a more generic name · 6b7304af
      Committed by Filipe Manana
      Back in 2014, in commit 04216820 ("Btrfs: fix race between fs trimming
      and block group remove/allocation"), I added the 'trimming' member to the
      block group structure. Its purpose was to prevent races between trimming
      and block group deletion/allocation by pinning the block group in a way
      that prevents its logical address and device extents from being reused
      while trimming is in progress for a block group. That covers the case
      where one task deletes the block group and then another task allocates
      a new block group that gets the same logical address and device extents
      while the trimming task is still in progress.
      
      After the previous fix for scrub (patch "btrfs: fix a race between scrub
      and block group removal/allocation"), scrub now also has the same needs that
      trimming has, so the member name 'trimming' no longer makes sense.
      Since there is already a 'pinned' member in the block group that refers
      to space reservations (pinned bytes), rename the member to 'frozen',
      add a comment on top of it to describe its general purpose and rename
      the helpers to increment and decrement the counter as well, to match
      the new member name.
      
      The next patch in the series will move the helpers into a more suitable
      file (from free-space-cache.c to block-group.c).

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Y
      btrfs: remove unused function heads_to_leaves · eec5b6e0
      Committed by YueHaibing
      There are no callers in-tree anymore since commit 64403612 ("btrfs:
      rework btrfs_check_space_for_delayed_refs").

      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: don't force read-only after error in drop snapshot · 7c09c030
      Committed by David Sterba
      Deleting a subvolume on a full filesystem leads to ENOSPC followed by a
      forced read-only. This is not a transaction abort and the filesystem is
      otherwise ok, so the error should be just propagated to the callers.
      
      This is caused by an unnecessary call to btrfs_handle_fs_error for all
      errors except EAGAIN. This does not make sense, as the standard
      transaction abort mechanism is in btrfs_drop_snapshot so all relevant
      failures are handled.

      Originally in commit cb1b69f4 ("Btrfs: forced readonly when
      btrfs_drop_snapshot() fails") there was no return value at all, so the
      btrfs_std_error made some sense, but once the error handling and
      propagation has been implemented we don't need it anymore.

      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: add missing annotation for btrfs_lock_cluster() · c142c6a4
      Committed by Jules Irenge
      Sparse reports a warning at btrfs_lock_cluster()
      
      warning: context imbalance in btrfs_lock_cluster()
      	- wrong count
      
      The root cause is the missing annotation at btrfs_lock_cluster()
      Add the missing __acquires(&cluster->refill_lock) annotation.

      Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
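For readers unfamiliar with sparse context annotations, the sketch below shows the general shape of such an annotation in a self-contained form. The macro definitions follow the pattern the kernel uses (they only expand to something when sparse defines `__CHECKER__`), but the lock type and the `toy_*` functions are hypothetical and do no real locking.

```c
/*
 * Context annotations are only meaningful to sparse, which defines
 * __CHECKER__. For a normal compiler they expand to nothing.
 */
#ifdef __CHECKER__
# define __acquires(x) __attribute__((context(x, 0, 1)))
# define __releases(x) __attribute__((context(x, 1, 0)))
# define __acquire(x)  __context__(x, 1)
# define __release(x)  __context__(x, -1)
#else
# define __acquires(x)
# define __releases(x)
# define __acquire(x)  (void)0
# define __release(x)  (void)0
#endif

/* Hypothetical stand-ins for a lock and the structure that embeds it. */
struct toy_lock {
	int held;
};

struct toy_cluster {
	struct toy_lock refill_lock;
};

/* Returns with the lock held, so the function is marked __acquires. */
static void toy_lock_cluster(struct toy_cluster *cluster)
	__acquires(&cluster->refill_lock)
{
	cluster->refill_lock.held = 1;
	__acquire(&cluster->refill_lock);
}

/* Counterpart that drops the lock again. */
static void toy_unlock_cluster(struct toy_cluster *cluster)
	__releases(&cluster->refill_lock)
{
	__release(&cluster->refill_lock);
	cluster->refill_lock.held = 0;
}
```

Without the annotation on the locking function, sparse sees a function that exits with a different lock count than it entered with and reports a context imbalance.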
  8. 24 Mar, 2020 24 commits