1. 03 11月, 2022 1 次提交
    • F
      btrfs: fix inode list leak during backref walking at resolve_indirect_refs() · 5614dc3a
      Filipe Manana 提交于
      During backref walking, at resolve_indirect_refs(), if we get an error
      we jump to the 'out' label and call ulist_free() on the 'parents' ulist,
      which frees all the elements in the ulist - however that does not free
      any inode lists that may be attached to elements, through the 'aux' field
      of a ulist node, so we end up leaking lists if we have any attached to
      the unodes.
      
      Fix this by calling free_leaf_list() instead of ulist_free() when we exit
      from resolve_indirect_refs(). The static function free_leaf_list() is
      moved up for this to be possible and it's slightly simplified by removing
      unnecessary code.
      
      Fixes: 3301958b ("Btrfs: add inodes before dropping the extent lock in find_all_leafs")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5614dc3a
  2. 11 10月, 2022 3 次提交
    • F
      btrfs: ignore fiemap path cache if we have multiple leaves for a data extent · 63c84b46
      Filipe Manana 提交于
      The path cache used during fiemap used to determine the sharedness of
      extent buffers in a path from a leaf containing a file extent item
      pointing to our data extent up to the root node of the tree, is meant to
      be used for a single path. Having a single path is by far the most common
      case, and therefore worth to optimize for, but it's possible to actually
      have multiple paths because we have 2 or more leaves.
      
      If we have multiple leaves, the 'level' variable keeps getting incremented
      in each iteration of the while loop at btrfs_is_data_extent_shared(),
      which means we will treat the second leaf in the 'tmp' ulist as a level 1
      node, and so forth. In the worst case this can lead to getting a level
      greater than or equals to BTRFS_MAX_LEVEL (8), which will trigger a
      WARN_ON_ONCE() in the functions to lookup from or store in the path cache
      (lookup_backref_shared_cache() and store_backref_shared_cache()). If the
      current level never goes beyond 8, due to shared nodes in the paths and
      a fs tree height smaller than 8, it can still result in incorrectly
      marking one leaf as shared because some other leaf is shared and is stored
      one level below that other leaf, as when storing a true sharedness value
      in the cache results in updating the sharedness to true of all entries in
      the cache below the current level.
      
      Having multiple leaves happens in a case like the following:
      
        - We have a file extent item point to data extent at bytenr X, for
          a file range [0, 1M[ for example;
      
        - At this moment we have an extent data ref for the extent, with
          an offset of 0 and a count of 1;
      
        - A write into the middle of the extent happens, file range [64K, 128K)
          so the file extent item is split into two (at btrfs_drop_extents()):
      
          1) One for file range [0, 64K), with a length (num_bytes field) of
             64K and an extent offset of 0;
      
          2) Another one for file range [128K, 1M), with a length of 896K
             (1M - 128K) and an extent offset of 128K.
      
        - At this moment the two file extent items are located in the same
          leaf;
      
        - A new file extent item for the range [64K, 128K), pointing to a new
          data extent, is inserted in the leaf. This results in a leaf split
          and now those two file extent items pointing to data extent X end
          up located in different leaves;
      
        - Once delayed refs are run, we still have a single extent data ref
          item for our data extent at bytenr X, for offset 0, but now with a
          count of 2 instead of 1;
      
        - So during fiemap, at btrfs_is_data_extent_shared(), after we call
          find_parent_nodes() for the data extent, we get two leaves, since
          we have two file extent items point to data extent at bytenr X that
          are located in two different leaves.
      
      So skip the use of the path cache when we get more than one leaf.
      
      Fixes: 12a824dc ("btrfs: speedup checking for extent sharedness during fiemap")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      63c84b46
    • F
      btrfs: fix processing of delayed tree block refs during backref walking · 943553ef
      Filipe Manana 提交于
      During backref walking, when processing a delayed reference with a type of
      BTRFS_TREE_BLOCK_REF_KEY, we have two bugs there:
      
      1) We are accessing the delayed references extent_op, and its key, without
         the protection of the delayed ref head's lock;
      
      2) If there's no extent op for the delayed ref head, we end up with an
         uninitialized key in the stack, variable 'tmp_op_key', and then pass
         it to add_indirect_ref(), which adds the reference to the indirect
         refs rb tree.
      
         This is wrong, because indirect references should have a NULL key
         when we don't have access to the key, and in that case they should be
         added to the indirect_missing_keys rb tree and not to the indirect rb
         tree.
      
         This means that if have BTRFS_TREE_BLOCK_REF_KEY delayed ref resulting
         from freeing an extent buffer, therefore with a count of -1, it will
         not cancel out the corresponding reference we have in the extent tree
         (with a count of 1), since both references end up in different rb
         trees.
      
         When using fiemap, where we often need to check if extents are shared
         through shared subtrees resulting from snapshots, it means we can
         incorrectly report an extent as shared when it's no longer shared.
         However this is temporary because after the transaction is committed
         the extent is no longer reported as shared, as running the delayed
         reference results in deleting the tree block reference from the extent
         tree.
      
         Outside the fiemap context, the result is unpredictable, as the key was
         not initialized but it's used when navigating the rb trees to insert
         and search for references (prelim_ref_compare()), and we expect all
         references in the indirect rb tree to have valid keys.
      
      The following reproducer triggers the second bug:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount -o compress $DEV $MNT
      
         # With a compressed 128M file we get a tree height of 2 (level 1 root).
         xfs_io -f -c "pwrite -b 1M 0 128M" $MNT/foo
      
         btrfs subvolume snapshot $MNT $MNT/snap
      
         # Fiemap should output 0x2008 in the flags column.
         # 0x2000 means shared extent
         # 0x8 means encoded extent (because it's compressed)
         echo
         echo "fiemap after snapshot, range [120M, 120M + 128K):"
         xfs_io -c "fiemap -v 120M 128K" $MNT/foo
         echo
      
         # Overwrite one extent and fsync to flush delalloc and COW a new path
         # in the snapshot's tree.
         #
         # After this we have a BTRFS_DROP_DELAYED_REF delayed ref of type
         # BTRFS_TREE_BLOCK_REF_KEY with a count of -1 for every COWed extent
         # buffer in the path.
         #
         # In the extent tree we have inline references of type
         # BTRFS_TREE_BLOCK_REF_KEY, with a count of 1, for the same extent
         # buffers, so they should cancel each other, and the extent buffers in
         # the fs tree should no longer be considered as shared.
         #
         echo "Overwriting file range [120M, 120M + 128K)..."
         xfs_io -c "pwrite -b 128K 120M 128K" $MNT/snap/foo
         xfs_io -c "fsync" $MNT/snap/foo
      
         # Fiemap should output 0x8 in the flags column. The extent in the range
         # [120M, 120M + 128K) is no longer shared, it's now exclusive to the fs
         # tree.
         echo
         echo "fiemap after overwrite range [120M, 120M + 128K):"
         xfs_io -c "fiemap -v 120M 128K" $MNT/foo
         echo
      
         umount $MNT
      
      Running it before this patch:
      
         $ ./test.sh
         (...)
         wrote 134217728/134217728 bytes at offset 0
         128 MiB, 128 ops; 0.1152 sec (1.085 GiB/sec and 1110.5809 ops/sec)
         Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
      
         fiemap after snapshot, range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
         Overwriting file range [120M, 120M + 128K)...
         wrote 131072/131072 bytes at offset 125829120
         128 KiB, 1 ops; 0.0001 sec (683.060 MiB/sec and 5464.4809 ops/sec)
      
         fiemap after overwrite range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
      The extent in the range [120M, 120M + 128K) is still reported as shared
      (0x2000 bit set) after overwriting that range and flushing delalloc, which
      is not correct - an entire path was COWed in the snapshot's tree and the
      extent is now only referenced by the original fs tree.
      
      Running it after this patch:
      
         $ ./test.sh
         (...)
         wrote 134217728/134217728 bytes at offset 0
         128 MiB, 128 ops; 0.1198 sec (1.043 GiB/sec and 1068.2067 ops/sec)
         Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
      
         fiemap after snapshot, range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
         Overwriting file range [120M, 120M + 128K)...
         wrote 131072/131072 bytes at offset 125829120
         128 KiB, 1 ops; 0.0001 sec (694.444 MiB/sec and 5555.5556 ops/sec)
      
         fiemap after overwrite range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256   0x8
      
      Now the extent is not reported as shared anymore.
      
      So fix this by passing a NULL key pointer to add_indirect_ref() when
      processing a delayed reference for a tree block if there's no extent op
      for our delayed ref head with a defined key. Also access the extent op
      only after locking the delayed ref head's lock.
      
      The reproducer will be converted later to a test case for fstests.
      
      Fixes: 86d5f994 ("btrfs: convert prelimary reference tracking to use rbtrees")
      Fixes: a6dbceaf ("btrfs: Remove unused op_key var from add_delayed_refs")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      943553ef
    • F
      btrfs: fix processing of delayed data refs during backref walking · 4fc7b572
      Filipe Manana 提交于
      When processing delayed data references during backref walking and we are
      using a share context (we are being called through fiemap), whenever we
      find a delayed data reference for an inode different from the one we are
      interested in, then we immediately exit and consider the data extent as
      shared. This is wrong, because:
      
      1) This might be a DROP reference that will cancel out a reference in the
         extent tree;
      
      2) Even if it's an ADD reference, it may be followed by a DROP reference
         that cancels it out.
      
      In either case we should not exit immediately.
      
      Fix this by never exiting when we find a delayed data reference for
      another inode - instead add the reference and if it does not cancel out
      other delayed reference, we will exit early when we call
      extent_is_shared() after processing all delayed references. If we find
      a drop reference, then signal the code that processes references from
      the extent tree (add_inline_refs() and add_keyed_refs()) to not exit
      immediately if it finds there a reference for another inode, since we
      have delayed drop references that may cancel it out. In this later case
      we exit once we don't have references in the rb trees that cancel out
      each other and have two references for different inodes.
      
      Example reproducer for case 1):
      
         $ cat test-1.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite 0 64K" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
      
         echo
         echo "fiemap after cloning:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         rm -f $MNT/bar
         echo
         echo "fiemap after removing file bar:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         umount $MNT
      
      Running it before this patch, the extent is still listed as shared, it has
      the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
      
         $ ./test-1.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
      Example reproducer for case 2):
      
         $ cat test-2.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite 0 64K" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
      
         # Flush delayed references to the extent tree and commit current
         # transaction.
         sync
      
         echo
         echo "fiemap after cloning:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         rm -f $MNT/bar
         echo
         echo "fiemap after removing file bar:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         umount $MNT
      
      Running it before this patch, the extent is still listed as shared, it has
      the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
      
         $ ./test-2.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
      After this patch, after deleting bar in both tests, the extent is not
      reported with the 0x2000 flag anymore, it gets only the flag 0x1
      (which is FIEMAP_EXTENT_LAST):
      
         $ ./test-1.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128   0x1
      
         $ ./test-2.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128   0x1
      
      These tests will later be converted to a test case for fstests.
      
      Fixes: dc046b10 ("Btrfs: make fiemap not blow when you have lots of snapshots")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4fc7b572
  3. 07 10月, 2022 1 次提交
    • F
      btrfs: add missing path cache update during fiemap · 96dbcc00
      Filipe Manana 提交于
      When looking the stored result for a cached path node, if the stored
      result is valid and has a value of true, we must update all the nodes for
      all levels below it with a result of true as well. This is necessary when
      moving from one leaf in the fs tree to the next one, as well as when
      moving from a node at any level to the next node at the same level.
      
      Currently this logic is missing as it was somehow forgotten by a recent
      patch with the subject: "btrfs: speedup checking for extent sharedness
      during fiemap".
      
      This adds the missing logic, which is the counter part to what we do
      when adding a shared node to the cache at store_backref_shared_cache().
      
      Fixes: 12a824dc ("btrfs: speedup checking for extent sharedness during fiemap")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      96dbcc00
  4. 26 9月, 2022 3 次提交
    • F
      btrfs: skip unnecessary extent buffer sharedness checks during fiemap · b8f164e3
      Filipe Manana 提交于
      During fiemap, for each file extent we find, we must check if it's shared
      or not. The sharedness check starts by verifying if the extent is directly
      shared (its refcount in the extent tree is > 1), and if it is not directly
      shared, then we will check if every node in the subvolume b+tree leading
      from the root to the leaf that has the file extent item (in reverse order),
      is shared (through snapshots).
      
      However this second step is not needed if our extent was created in a
      transaction more recent than the last transaction where a snapshot of the
      inode's root happened, because it can't be shared indirectly (through
      shared subtrees) without a snapshot created in a more recent transaction.
      
      So grab the generation of the extent from the extent map and pass it to
      btrfs_is_data_extent_shared(), which will skip this second phase when the
      generation is more recent than the root's last snapshot value. Note that
      we skip this optimization if the extent map is the result of merging 2
      or more extent maps, because in this case its generation is the maximum
      of the generations of all merged extent maps.
      
      The fact the we use extent maps and they can be merged despite the
      underlying extents being distinct (different file extent items in the
      subvolume b+tree and different extent items in the extent b+tree), can
      result in some bugs when reporting shared extents. But this is a problem
      of the current implementation of fiemap relying on extent maps.
      One example where we get incorrect results is:
      
          $ cat fiemap-bug.sh
          #!/bin/bash
      
          DEV=/dev/sdj
          MNT=/mnt/sdj
      
          mkfs.btrfs -f $DEV
          mount $DEV $MNT
      
          # Create a file with two 256K extents.
          # Since there is no other write activity, they will be contiguous,
          # and their extent maps merged, despite having two distinct extents.
          xfs_io -f -c "pwrite -S 0xab 0 256K" \
                    -c "fsync" \
                    -c "pwrite -S 0xcd 256K 256K" \
                    -c "fsync" \
                    $MNT/foo
      
          # Now clone only the second extent into another file.
          xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
      
          # Filefrag will report a single 512K extent, and say it's not shared.
          echo
          filefrag -v $MNT/foo
      
          umount $MNT
      
      Running the reproducer:
      
          $ ./fiemap-bug.sh
          wrote 262144/262144 bytes at offset 0
          256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
          wrote 262144/262144 bytes at offset 262144
          256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
          linked 262144/262144 bytes at offset 0
          256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
      
          Filesystem type is: 9123683e
          File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
           ext:     logical_offset:        physical_offset: length:   expected: flags:
             0:        0..     127:       3328..      3455:    128:             last,eof
          /mnt/sdj/foo: 1 extent found
      
      We end up reporting that we have a single 512K that is not shared, however
      we have two 256K extents, and the second one is shared. Changing the
      reproducer to clone instead the first extent into file 'bar', makes us
      report a single 512K extent that is shared, which is algo incorrect since
      we have two 256K extents and only the first one is shared.
      
      This is z problem that existed before this change, and remains after this
      change, as it can't be easily fixed. The next patch in the series reworks
      fiemap to primarily use file extent items instead of extent maps (except
      for checking for delalloc ranges), with the goal of improving its
      scalability and performance, but it also ends up fixing this particular
      bug caused by extent map merging.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b8f164e3
    • F
      btrfs: speedup checking for extent sharedness during fiemap · 12a824dc
      Filipe Manana 提交于
      One of the most expensive tasks performed during fiemap is to check if
      an extent is shared. This task has two major steps:
      
      1) Check if the data extent is shared. This implies checking the extent
         item in the extent tree, checking delayed references, etc. If we
         find the data extent is directly shared, we terminate immediately;
      
      2) If the data extent is not directly shared (its extent item has a
         refcount of 1), then it may be shared if we have snapshots that share
         subtrees of the inode's subvolume b+tree. So we check if the leaf
         containing the file extent item is shared, then its parent node, then
         the parent node of the parent node, etc, until we reach the root node
         or we find one of them is shared - in which case we stop immediately.
      
      During fiemap we process the extents of a file from left to right, from
      file offset 0 to EOF. This means that we iterate b+tree leaves from left
      to right, and has the implication that we keep repeating that second step
      above several times for the same b+tree path of the inode's subvolume
      b+tree.
      
      For example, if we have two file extent items in leaf X, and the path to
      leaf X is A -> B -> C -> X, then when we try to determine if the data
      extent referenced by the first extent item is shared, we check if the data
      extent is shared - if it's not, then we check if leaf X is shared, if not,
      then we check if node C is shared, if not, then check if node B is shared,
      if not than check if node A is shared. When we move to the next file
      extent item, after determining the data extent is not shared, we repeat
      the checks for X, C, B and A - doing all the expensive searches in the
      extent tree, delayed refs, etc. If we have thousands of tile extents, then
      we keep repeating the sharedness checks for the same paths over and over.
      
      On a file that has no shared extents or only a small portion, it's easy
      to see that this scales terribly with the number of extents in the file
      and the sizes of the extent and subvolume b+trees.
      
      This change eliminates the repeated sharedness check on extent buffers
      by caching the results of the last path used. The results can be used as
      long as no snapshots were created since they were cached (for not shared
      extent buffers) or no roots were dropped since they were cached (for
      shared extent buffers). This greatly reduces the time spent by fiemap for
      files with thousands of extents and/or large extent and subvolume b+trees.
      
      Example performance test:
      
          $ cat fiemap-perf-test.sh
          #!/bin/bash
      
          DEV=/dev/sdi
          MNT=/mnt/sdi
      
          mkfs.btrfs -f $DEV
          mount -o compress=lzo $DEV $MNT
      
          # 40G gives 327680 128K file extents (due to compression).
          xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
      
          umount $MNT
          mount -o compress=lzo $DEV $MNT
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata not cached)"
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata cached)"
      
          umount $MNT
      
      Before this patch:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 3597 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 2107 milliseconds (metadata cached)
      
      After this patch:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 1646 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 698 milliseconds (metadata cached)
      
      That's about 2.2x faster when no metadata is cached, and about 3x faster
      when all metadata is cached. On a real filesystem with many other files,
      data, directories, etc, the b+trees will be 2 or 3 levels higher,
      therefore this optimization will have a higher impact.
      
      Several reports of a slow fiemap show up often, the two Link tags below
      refer to two recent reports of such slowness. This patch, together with
      the next ones in the series, is meant to address that.
      
      Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
      Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      12a824dc
    • F
      btrfs: rename btrfs_check_shared() to a more descriptive name · 8eedadda
      Filipe Manana 提交于
      The function btrfs_check_shared() is supposed to be used to check if a
      data extent is shared, but its name is too generic, may easily cause
      confusion in the sense that it may be used for metadata extents.
      
      So rename it to btrfs_is_data_extent_shared(), which will also make it
      less confusing after the next change that adds a backref lookup cache for
      the b+tree nodes that lead to the leaf that contains the file extent item
      that points to the target data extent.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8eedadda
  5. 25 7月, 2022 3 次提交
  6. 14 3月, 2022 1 次提交
    • Q
      btrfs: unify the error handling pattern for read_tree_block() · 4eb150d6
      Qu Wenruo 提交于
      We had an error handling pattern for read_tree_block() like this:
      
      	eb = read_tree_block();
      	if (IS_ERR(eb)) {
      		/*
      		 * Handling error here
      		 * Normally ended up with return or goto out.
      		 */
      	} else if (!extent_buffer_uptodate(eb)) {
      		/*
      		 * Different error handling here
      		 * Normally also ended up with return or goto out;
      		 */
      	}
      
      This is fine, but if we want to add extra check for each
      read_tree_block(), the existing if-else-if is not that expandable and
      will take reader some seconds to figure out there is no extra branch.
      
      Here we change it to a more common way, without the extra else:
      
      	eb = read_tree_block();
      	if (IS_ERR(eb)) {
      		/*
      		 * Handling error here
      		 */
      		return eb or goto out;
      	}
      	if (!extent_buffer_uptodate(eb)) {
      		/*
      		 * Different error handling here
      		 */
      		return eb or goto out;
      	}
      
      This also removes some oddball call sites which uses some creative way
      to check error.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4eb150d6
  7. 03 1月, 2022 9 次提交
  8. 23 8月, 2021 2 次提交
  9. 22 7月, 2021 1 次提交
    • F
      btrfs: fix lock inversion problem when doing qgroup extent tracing · 8949b9a1
      Filipe Manana 提交于
      At btrfs_qgroup_trace_extent_post() we call btrfs_find_all_roots() with a
      NULL value as the transaction handle argument, which makes that function
      take the commit_root_sem semaphore, which is necessary when we don't hold
      a transaction handle or any other mechanism to prevent a transaction
      commit from wiping out commit roots.
      
      However btrfs_qgroup_trace_extent_post() can be called in a context where
      we are holding a write lock on an extent buffer from a subvolume tree,
      namely from btrfs_truncate_inode_items(), called either during truncate
      or unlink operations. In this case we end up with a lock inversion problem
      because the commit_root_sem is a higher level lock, always supposed to be
      acquired before locking any extent buffer.
      
      Lockdep detects this lock inversion problem since we switched the extent
      buffer locks from custom locks to semaphores, and when running btrfs/158
      from fstests, it reported the following trace:
      
      [ 9057.626435] ======================================================
      [ 9057.627541] WARNING: possible circular locking dependency detected
      [ 9057.628334] 5.14.0-rc2-btrfs-next-93 #1 Not tainted
      [ 9057.628961] ------------------------------------------------------
      [ 9057.629867] kworker/u16:4/30781 is trying to acquire lock:
      [ 9057.630824] ffff8e2590f58760 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.632542]
                     but task is already holding lock:
      [ 9057.633551] ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
      [ 9057.635255]
                     which lock already depends on the new lock.
      
      [ 9057.636292]
                     the existing dependency chain (in reverse order) is:
      [ 9057.637240]
                     -> #1 (&fs_info->commit_root_sem){++++}-{3:3}:
      [ 9057.638138]        down_read+0x46/0x140
      [ 9057.638648]        btrfs_find_all_roots+0x41/0x80 [btrfs]
      [ 9057.639398]        btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
      [ 9057.640283]        btrfs_add_delayed_data_ref+0x418/0x490 [btrfs]
      [ 9057.641114]        btrfs_free_extent+0x35/0xb0 [btrfs]
      [ 9057.641819]        btrfs_truncate_inode_items+0x424/0xf70 [btrfs]
      [ 9057.642643]        btrfs_evict_inode+0x454/0x4f0 [btrfs]
      [ 9057.643418]        evict+0xcf/0x1d0
      [ 9057.643895]        do_unlinkat+0x1e9/0x300
      [ 9057.644525]        do_syscall_64+0x3b/0xc0
      [ 9057.645110]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 9057.645835]
                     -> #0 (btrfs-tree-00){++++}-{3:3}:
      [ 9057.646600]        __lock_acquire+0x130e/0x2210
      [ 9057.647248]        lock_acquire+0xd7/0x310
      [ 9057.647773]        down_read_nested+0x4b/0x140
      [ 9057.648350]        __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.649175]        btrfs_read_lock_root_node+0x31/0x40 [btrfs]
      [ 9057.650010]        btrfs_search_slot+0x537/0xc00 [btrfs]
      [ 9057.650849]        scrub_print_warning_inode+0x89/0x370 [btrfs]
      [ 9057.651733]        iterate_extent_inodes+0x1e3/0x280 [btrfs]
      [ 9057.652501]        scrub_print_warning+0x15d/0x2f0 [btrfs]
      [ 9057.653264]        scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
      [ 9057.654295]        scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
      [ 9057.655111]        btrfs_work_helper+0xf8/0x400 [btrfs]
      [ 9057.655831]        process_one_work+0x247/0x5a0
      [ 9057.656425]        worker_thread+0x55/0x3c0
      [ 9057.656993]        kthread+0x155/0x180
      [ 9057.657494]        ret_from_fork+0x22/0x30
      [ 9057.658030]
                     other info that might help us debug this:
      
      [ 9057.659064]  Possible unsafe locking scenario:
      
      [ 9057.659824]        CPU0                    CPU1
      [ 9057.660402]        ----                    ----
      [ 9057.660988]   lock(&fs_info->commit_root_sem);
      [ 9057.661581]                                lock(btrfs-tree-00);
      [ 9057.662348]                                lock(&fs_info->commit_root_sem);
      [ 9057.663254]   lock(btrfs-tree-00);
      [ 9057.663690]
                      *** DEADLOCK ***
      
      [ 9057.664437] 4 locks held by kworker/u16:4/30781:
      [ 9057.665023]  #0: ffff8e25922a1148 ((wq_completion)btrfs-scrub){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
      [ 9057.666260]  #1: ffffabb3451ffe70 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
      [ 9057.667639]  #2: ffff8e25922da198 (&ret->mutex){+.+.}-{3:3}, at: scrub_handle_errored_block.isra.0+0x5d2/0x1640 [btrfs]
      [ 9057.669017]  #3: ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
      [ 9057.670408]
                     stack backtrace:
      [ 9057.670976] CPU: 7 PID: 30781 Comm: kworker/u16:4 Not tainted 5.14.0-rc2-btrfs-next-93 #1
      [ 9057.672030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [ 9057.673492] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
      [ 9057.674258] Call Trace:
      [ 9057.674588]  dump_stack_lvl+0x57/0x72
      [ 9057.675083]  check_noncircular+0xf3/0x110
      [ 9057.675611]  __lock_acquire+0x130e/0x2210
      [ 9057.676132]  lock_acquire+0xd7/0x310
      [ 9057.676605]  ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.677313]  ? lock_is_held_type+0xe8/0x140
      [ 9057.677849]  down_read_nested+0x4b/0x140
      [ 9057.678349]  ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.679068]  __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.679760]  btrfs_read_lock_root_node+0x31/0x40 [btrfs]
      [ 9057.680458]  btrfs_search_slot+0x537/0xc00 [btrfs]
      [ 9057.681083]  ? _raw_spin_unlock+0x29/0x40
      [ 9057.681594]  ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
      [ 9057.682336]  scrub_print_warning_inode+0x89/0x370 [btrfs]
      [ 9057.683058]  ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
      [ 9057.683834]  ? scrub_write_block_to_dev_replace+0xb0/0xb0 [btrfs]
      [ 9057.684632]  iterate_extent_inodes+0x1e3/0x280 [btrfs]
      [ 9057.685316]  scrub_print_warning+0x15d/0x2f0 [btrfs]
      [ 9057.685977]  ? ___ratelimit+0xa4/0x110
      [ 9057.686460]  scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
      [ 9057.687316]  scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
      [ 9057.688021]  btrfs_work_helper+0xf8/0x400 [btrfs]
      [ 9057.688649]  ? lock_is_held_type+0xe8/0x140
      [ 9057.689180]  process_one_work+0x247/0x5a0
      [ 9057.689696]  worker_thread+0x55/0x3c0
      [ 9057.690175]  ? process_one_work+0x5a0/0x5a0
      [ 9057.690731]  kthread+0x155/0x180
      [ 9057.691158]  ? set_kthread_struct+0x40/0x40
      [ 9057.691697]  ret_from_fork+0x22/0x30
      
      Fix this by making btrfs_find_all_roots() never attempt to lock the
      commit_root_sem when it is called from btrfs_qgroup_trace_extent_post().
      
      We can't just pass a non-NULL transaction handle to btrfs_find_all_roots()
      from btrfs_qgroup_trace_extent_post(), because that would make backref
      lookup not use commit roots and acquire read locks on extent buffers, and
      therefore could deadlock when btrfs_qgroup_trace_extent_post() is called
      from the btrfs_truncate_inode_items() code path which has acquired a write
      lock on an extent buffer of the subvolume btree.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8949b9a1
  10. 22 6月, 2021 1 次提交
  11. 19 4月, 2021 1 次提交
  12. 09 2月, 2021 3 次提交
    • J
      btrfs: do not warn if we can't find the reloc root when looking up backref · f78743fb
      Josef Bacik 提交于
      The backref code is looking for a reloc_root that corresponds to the
      given fs root.  However any number of things could have gone wrong while
      initializing that reloc_root, like ENOMEM while trying to allocate the
      root itself, or EIO while trying to write the root item.  This would
      result in no corresponding reloc_root being in the reloc root cache, and
      thus would return NULL when we do the find_reloc_root() call.
      
      Because of this we do not want to WARN_ON().  This presumably was meant
      to catch developer errors, cases where we messed up adding the reloc
      root.  However we can easily hit this case with error injection, and
      thus should not do a WARN_ON().
      
      CC: stable@vger.kernel.org # 5.10+
      Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f78743fb
    • N
      6e353e3b
    • J
      btrfs: do not cleanup upper nodes in btrfs_backref_cleanup_node · 7e2a870a
      Josef Bacik 提交于
      Zygo reported the following panic when testing my error handling patches
      for relocation:
      
        kernel BUG at fs/btrfs/backref.c:2545!
        invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 3 PID: 8472 Comm: btrfs Tainted: G        W 14
        Hardware name: QEMU Standard PC (i440FX + PIIX,
      
        Call Trace:
         btrfs_backref_error_cleanup+0x4df/0x530
         build_backref_tree+0x1a5/0x700
         ? _raw_spin_unlock+0x22/0x30
         ? release_extent_buffer+0x225/0x280
         ? free_extent_buffer.part.52+0xd7/0x140
         relocate_tree_blocks+0x2a6/0xb60
         ? kasan_unpoison_shadow+0x35/0x50
         ? do_relocation+0xc10/0xc10
         ? kasan_kmalloc+0x9/0x10
         ? kmem_cache_alloc_trace+0x6a3/0xcb0
         ? free_extent_buffer.part.52+0xd7/0x140
         ? rb_insert_color+0x342/0x360
         ? add_tree_block.isra.36+0x236/0x2b0
         relocate_block_group+0x2eb/0x780
         ? merge_reloc_roots+0x470/0x470
         btrfs_relocate_block_group+0x26e/0x4c0
         btrfs_relocate_chunk+0x52/0x120
         btrfs_balance+0xe2e/0x18f0
         ? pvclock_clocksource_read+0xeb/0x190
         ? btrfs_relocate_chunk+0x120/0x120
         ? lock_contended+0x620/0x6e0
         ? do_raw_spin_lock+0x1e0/0x1e0
         ? do_raw_spin_unlock+0xa8/0x140
         btrfs_ioctl_balance+0x1f9/0x460
         btrfs_ioctl+0x24c8/0x4380
         ? __kasan_check_read+0x11/0x20
         ? check_chain_key+0x1f4/0x2f0
         ? __asan_loadN+0xf/0x20
         ? btrfs_ioctl_get_supported_features+0x30/0x30
         ? kvm_sched_clock_read+0x18/0x30
         ? check_chain_key+0x1f4/0x2f0
         ? lock_downgrade+0x3f0/0x3f0
         ? handle_mm_fault+0xad6/0x2150
         ? do_vfs_ioctl+0xfc/0x9d0
         ? ioctl_file_clone+0xe0/0xe0
         ? check_flags.part.50+0x6c/0x1e0
         ? check_flags.part.50+0x6c/0x1e0
         ? check_flags+0x26/0x30
         ? lock_is_held_type+0xc3/0xf0
         ? syscall_enter_from_user_mode+0x1b/0x60
         ? do_syscall_64+0x13/0x80
         ? rcu_read_lock_sched_held+0xa1/0xd0
         ? __kasan_check_read+0x11/0x20
         ? __fget_light+0xae/0x110
         __x64_sys_ioctl+0xc3/0x100
         do_syscall_64+0x37/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This occurs because of this check
      
        if (RB_EMPTY_NODE(&upper->rb_node))
      	  BUG_ON(!list_empty(&node->upper));
      
      As we are dropping the backref node, if we discover that our upper node
      in the edge we just cleaned up isn't linked into the cache that we are
      now done with this node, thus the BUG_ON().
      
      However this is an erroneous assumption, as we will look up all the
      references for a node first, and then process the pending edges.  All of
      the 'upper' nodes in our pending edges won't be in the cache's rb_tree
      yet, because they haven't been processed.  We could very well have many
      edges still left to cleanup on this node.
      
      The fact is we simply do not need this check, we can just process all of
      the edges only for this node, because below this check we do the
      following
      
        if (list_empty(&upper->lower)) {
      	  list_add_tail(&upper->lower, &cache->leaves);
      	  upper->lowest = 1;
        }
      
      If the upper node truly isn't used yet, then we add it to the
      cache->leaves list to be cleaned up later.  If it is still used then the
      last child node that has it linked into its node will add it to the
      leaves list and then it will be cleaned up.
      
      Fix this problem by dropping this logic altogether.  With this fix I no
      longer see the panic when testing with error injection in the backref
      code.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7e2a870a
  13. 18 1月, 2021 1 次提交
    • J
      btrfs: do not double free backref nodes on error · 49ecc679
      Josef Bacik 提交于
      Zygo reported the following KASAN splat:
      
        BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420
        Read of size 8 at addr ffff888112402950 by task btrfs/28836
      
        CPU: 0 PID: 28836 Comm: btrfs Tainted: G        W         5.10.0-e35f27394290-for-next+ #23
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
        Call Trace:
         dump_stack+0xbc/0xf9
         ? btrfs_backref_cleanup_node+0x18a/0x420
         print_address_description.constprop.8+0x21/0x210
         ? record_print_text.cold.34+0x11/0x11
         ? btrfs_backref_cleanup_node+0x18a/0x420
         ? btrfs_backref_cleanup_node+0x18a/0x420
         kasan_report.cold.10+0x20/0x37
         ? btrfs_backref_cleanup_node+0x18a/0x420
         __asan_load8+0x69/0x90
         btrfs_backref_cleanup_node+0x18a/0x420
         btrfs_backref_release_cache+0x83/0x1b0
         relocate_block_group+0x394/0x780
         ? merge_reloc_roots+0x4a0/0x4a0
         btrfs_relocate_block_group+0x26e/0x4c0
         btrfs_relocate_chunk+0x52/0x120
         btrfs_balance+0xe2e/0x1900
         ? check_flags.part.50+0x6c/0x1e0
         ? btrfs_relocate_chunk+0x120/0x120
         ? kmem_cache_alloc_trace+0xa06/0xcb0
         ? _copy_from_user+0x83/0xc0
         btrfs_ioctl_balance+0x3a7/0x460
         btrfs_ioctl+0x24c8/0x4360
         ? __kasan_check_read+0x11/0x20
         ? check_chain_key+0x1f4/0x2f0
         ? __asan_loadN+0xf/0x20
         ? btrfs_ioctl_get_supported_features+0x30/0x30
         ? kvm_sched_clock_read+0x18/0x30
         ? check_chain_key+0x1f4/0x2f0
         ? lock_downgrade+0x3f0/0x3f0
         ? handle_mm_fault+0xad6/0x2150
         ? do_vfs_ioctl+0xfc/0x9d0
         ? ioctl_file_clone+0xe0/0xe0
         ? check_flags.part.50+0x6c/0x1e0
         ? check_flags.part.50+0x6c/0x1e0
         ? check_flags+0x26/0x30
         ? lock_is_held_type+0xc3/0xf0
         ? syscall_enter_from_user_mode+0x1b/0x60
         ? do_syscall_64+0x13/0x80
         ? rcu_read_lock_sched_held+0xa1/0xd0
         ? __kasan_check_read+0x11/0x20
         ? __fget_light+0xae/0x110
         __x64_sys_ioctl+0xc3/0x100
         do_syscall_64+0x37/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f4c4bdfe427
      
        Allocated by task 28836:
         kasan_save_stack+0x21/0x50
         __kasan_kmalloc.constprop.18+0xbe/0xd0
         kasan_kmalloc+0x9/0x10
         kmem_cache_alloc_trace+0x410/0xcb0
         btrfs_backref_alloc_node+0x46/0xf0
         btrfs_backref_add_tree_node+0x60d/0x11d0
         build_backref_tree+0xc5/0x700
         relocate_tree_blocks+0x2be/0xb90
         relocate_block_group+0x2eb/0x780
         btrfs_relocate_block_group+0x26e/0x4c0
         btrfs_relocate_chunk+0x52/0x120
         btrfs_balance+0xe2e/0x1900
         btrfs_ioctl_balance+0x3a7/0x460
         btrfs_ioctl+0x24c8/0x4360
         __x64_sys_ioctl+0xc3/0x100
         do_syscall_64+0x37/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Freed by task 28836:
         kasan_save_stack+0x21/0x50
         kasan_set_track+0x20/0x30
         kasan_set_free_info+0x1f/0x30
         __kasan_slab_free+0xf3/0x140
         kasan_slab_free+0xe/0x10
         kfree+0xde/0x200
         btrfs_backref_error_cleanup+0x452/0x530
         build_backref_tree+0x1a5/0x700
         relocate_tree_blocks+0x2be/0xb90
         relocate_block_group+0x2eb/0x780
         btrfs_relocate_block_group+0x26e/0x4c0
         btrfs_relocate_chunk+0x52/0x120
         btrfs_balance+0xe2e/0x1900
         btrfs_ioctl_balance+0x3a7/0x460
         btrfs_ioctl+0x24c8/0x4360
         __x64_sys_ioctl+0xc3/0x100
         do_syscall_64+0x37/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This occurred because we freed our backref node in
      btrfs_backref_error_cleanup(), but then tried to free it again in
      btrfs_backref_release_cache().  This is because
      btrfs_backref_release_cache() will cycle through all of the
      cache->leaves nodes and free them up.  However
      btrfs_backref_error_cleanup() freed the backref node with
      btrfs_backref_free_node(), which simply kfree()d the backref node
      without unlinking it from the cache.  Change this to a
      btrfs_backref_drop_node(), which does the appropriate cleanup and
      removes the node from the cache->leaves list, so when we go to free the
      remaining cache we don't trip over items we've already dropped.
      
      Fixes: 75bfb9af ("Btrfs: cleanup error handling in build_backref_tree")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      49ecc679
  14. 08 12月, 2020 3 次提交
  15. 26 10月, 2020 1 次提交
    • J
      btrfs: add a helper to read the tree_root commit root for backref lookup · 49d11bea
      Josef Bacik 提交于
      I got the following lockdep splat with tree locks converted to rwsem
      patches on btrfs/104:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0+ #102 Not tainted
        ------------------------------------------------------
        btrfs-cleaner/903 is trying to acquire lock:
        ffff8e7fab6ffe30 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170
      
        but task is already holding lock:
        ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (&fs_info->commit_root_sem){++++}-{3:3}:
      	 down_read+0x40/0x130
      	 caching_thread+0x53/0x5a0
      	 btrfs_work_helper+0xfa/0x520
      	 process_one_work+0x238/0x540
      	 worker_thread+0x55/0x3c0
      	 kthread+0x13a/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #2 (&caching_ctl->mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7b0
      	 btrfs_cache_block_group+0x1e0/0x510
      	 find_free_extent+0xb6e/0x12f0
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb1/0x330
      	 alloc_tree_block_no_bg_flush+0x4f/0x60
      	 __btrfs_cow_block+0x11d/0x580
      	 btrfs_cow_block+0x10c/0x220
      	 commit_cowonly_roots+0x47/0x2e0
      	 btrfs_commit_transaction+0x595/0xbd0
      	 sync_filesystem+0x74/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x36/0xa0
      	 cleanup_mnt+0x12d/0x190
      	 task_work_run+0x5c/0xa0
      	 exit_to_user_mode_prepare+0x1df/0x200
      	 syscall_exit_to_user_mode+0x54/0x280
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&space_info->groups_sem){++++}-{3:3}:
      	 down_read+0x40/0x130
      	 find_free_extent+0x2ed/0x12f0
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb1/0x330
      	 alloc_tree_block_no_bg_flush+0x4f/0x60
      	 __btrfs_cow_block+0x11d/0x580
      	 btrfs_cow_block+0x10c/0x220
      	 commit_cowonly_roots+0x47/0x2e0
      	 btrfs_commit_transaction+0x595/0xbd0
      	 sync_filesystem+0x74/0x90
      	 generic_shutdown_super+0x22/0x100
      	 kill_anon_super+0x14/0x30
      	 btrfs_kill_super+0x12/0x20
      	 deactivate_locked_super+0x36/0xa0
      	 cleanup_mnt+0x12d/0x190
      	 task_work_run+0x5c/0xa0
      	 exit_to_user_mode_prepare+0x1df/0x200
      	 syscall_exit_to_user_mode+0x54/0x280
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (btrfs-root-00){++++}-{3:3}:
      	 __lock_acquire+0x1167/0x2150
      	 lock_acquire+0xb9/0x3d0
      	 down_read_nested+0x43/0x130
      	 __btrfs_tree_read_lock+0x32/0x170
      	 __btrfs_read_lock_root_node+0x3a/0x50
      	 btrfs_search_slot+0x614/0x9d0
      	 btrfs_find_root+0x35/0x1b0
      	 btrfs_read_tree_root+0x61/0x120
      	 btrfs_get_root_ref+0x14b/0x600
      	 find_parent_nodes+0x3e6/0x1b30
      	 btrfs_find_all_roots_safe+0xb4/0x130
      	 btrfs_find_all_roots+0x60/0x80
      	 btrfs_qgroup_trace_extent_post+0x27/0x40
      	 btrfs_add_delayed_data_ref+0x3fd/0x460
      	 btrfs_free_extent+0x42/0x100
      	 __btrfs_mod_ref+0x1d7/0x2f0
      	 walk_up_proc+0x11c/0x400
      	 walk_up_tree+0xf0/0x180
      	 btrfs_drop_snapshot+0x1c7/0x780
      	 btrfs_clean_one_deleted_snapshot+0xfb/0x110
      	 cleaner_kthread+0xd4/0x140
      	 kthread+0x13a/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          btrfs-root-00 --> &caching_ctl->mutex --> &fs_info->commit_root_sem
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&fs_info->commit_root_sem);
      				 lock(&caching_ctl->mutex);
      				 lock(&fs_info->commit_root_sem);
          lock(btrfs-root-00);
      
         *** DEADLOCK ***
      
        3 locks held by btrfs-cleaner/903:
         #0: ffff8e7fab628838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
         #1: ffff8e7faadac640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
         #2: ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
      
        stack backtrace:
        CPU: 0 PID: 903 Comm: btrfs-cleaner Not tainted 5.9.0+ #102
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb0
         check_noncircular+0xcf/0xf0
         __lock_acquire+0x1167/0x2150
         ? __bfs+0x42/0x210
         lock_acquire+0xb9/0x3d0
         ? __btrfs_tree_read_lock+0x32/0x170
         down_read_nested+0x43/0x130
         ? __btrfs_tree_read_lock+0x32/0x170
         __btrfs_tree_read_lock+0x32/0x170
         __btrfs_read_lock_root_node+0x3a/0x50
         btrfs_search_slot+0x614/0x9d0
         ? find_held_lock+0x2b/0x80
         btrfs_find_root+0x35/0x1b0
         ? do_raw_spin_unlock+0x4b/0xa0
         btrfs_read_tree_root+0x61/0x120
         btrfs_get_root_ref+0x14b/0x600
         find_parent_nodes+0x3e6/0x1b30
         btrfs_find_all_roots_safe+0xb4/0x130
         btrfs_find_all_roots+0x60/0x80
         btrfs_qgroup_trace_extent_post+0x27/0x40
         btrfs_add_delayed_data_ref+0x3fd/0x460
         btrfs_free_extent+0x42/0x100
         __btrfs_mod_ref+0x1d7/0x2f0
         walk_up_proc+0x11c/0x400
         walk_up_tree+0xf0/0x180
         btrfs_drop_snapshot+0x1c7/0x780
         ? btrfs_clean_one_deleted_snapshot+0x73/0x110
         btrfs_clean_one_deleted_snapshot+0xfb/0x110
         cleaner_kthread+0xd4/0x140
         ? btrfs_alloc_root+0x50/0x50
         kthread+0x13a/0x150
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
        BTRFS info (device sdb): disk space caching is enabled
        BTRFS info (device sdb): has skinny extents
      
      This happens because qgroups does a backref lookup when we create a
      delayed ref.  From here it may have to look up a root from an indirect
      ref, which does a normal lookup on the tree_root, which takes the read
      lock on the tree_root nodes.
      
      To fix this we need to add a variant for looking up roots that searches
      the commit root of the tree_root.  Then when we do the backref search
      using the commit root we are sure to not take any locks on the tree_root
      nodes.  This gets rid of the lockdep splat when running btrfs/104.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      49d11bea
  16. 07 10月, 2020 1 次提交
  17. 11 8月, 2020 1 次提交
  18. 22 7月, 2020 1 次提交
    • F
      btrfs: fix double free on ulist after backref resolution failure · 580c079b
      Filipe Manana 提交于
      At btrfs_find_all_roots_safe() we allocate a ulist and set the **roots
      argument to point to it. However if later we fail due to an error returned
      by find_parent_nodes(), we free that ulist but leave a dangling pointer in
      the **roots argument. Upon receiving the error, a caller of this function
      can attempt to free the same ulist again, resulting in an invalid memory
      access.
      
      One such scenario is during qgroup accounting:
      
      btrfs_qgroup_account_extents()
      
       --> calls btrfs_find_all_roots() passes &new_roots (a stack allocated
           pointer) to btrfs_find_all_roots()
      
         --> btrfs_find_all_roots() just calls btrfs_find_all_roots_safe()
             passing &new_roots to it
      
           --> allocates ulist and assigns its address to **roots (which
               points to new_roots from btrfs_qgroup_account_extents())
      
           --> find_parent_nodes() returns an error, so we free the ulist
               and leave **roots pointing to it after returning
      
       --> btrfs_qgroup_account_extents() sees btrfs_find_all_roots() returned
           an error and jumps to the label 'cleanup', which just tries to
           free again the same ulist
      
      Stack trace example:
      
       ------------[ cut here ]------------
       BTRFS: tree first key check failed
       WARNING: CPU: 1 PID: 1763215 at fs/btrfs/disk-io.c:422 btrfs_verify_level_key+0xe0/0x180 [btrfs]
       Modules linked in: dm_snapshot dm_thin_pool (...)
       CPU: 1 PID: 1763215 Comm: fsstress Tainted: G        W         5.8.0-rc3-btrfs-next-64 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_verify_level_key+0xe0/0x180 [btrfs]
       Code: 28 5b 5d (...)
       RSP: 0018:ffffb89b473779a0 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: ffff90397759bf08 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
       RBP: ffff9039a419c000 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: ffffb89b43301000 R12: 000000000000005e
       R13: ffffb89b47377a2e R14: ffffb89b473779af R15: 0000000000000000
       FS:  00007fc47e1e1000(0000) GS:ffff9039ac200000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fc47e1df000 CR3: 00000003d9e4e001 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        read_block_for_search+0xf6/0x350 [btrfs]
        btrfs_next_old_leaf+0x242/0x650 [btrfs]
        resolve_indirect_refs+0x7cf/0x9e0 [btrfs]
        find_parent_nodes+0x4ea/0x12c0 [btrfs]
        btrfs_find_all_roots_safe+0xbf/0x130 [btrfs]
        btrfs_qgroup_account_extents+0x9d/0x390 [btrfs]
        btrfs_commit_transaction+0x4f7/0xb20 [btrfs]
        btrfs_sync_file+0x3d4/0x4d0 [btrfs]
        do_fsync+0x38/0x70
        __x64_sys_fdatasync+0x13/0x20
        do_syscall_64+0x5c/0xe0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7fc47e2d72e3
       Code: Bad RIP value.
       RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
       RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3
       RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003
       RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003
       R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8
       R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffffb8eb5e85>] copy_process+0x755/0x1eb0
       softirqs last  enabled at (0): [<ffffffffb8eb5e85>] copy_process+0x755/0x1eb0
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace 8639237550317b48 ]---
       BTRFS error (device sdc): tree first key mismatch detected, bytenr=62324736 parent_transid=94 key expected=(262,108,1351680) has=(259,108,1921024)
       general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
       CPU: 2 PID: 1763215 Comm: fsstress Tainted: G        W         5.8.0-rc3-btrfs-next-64 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:ulist_release+0x14/0x60 [btrfs]
       Code: c7 07 00 (...)
       RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282
       RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840
       RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840
       R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840
       FS:  00007fc47e1e1000(0000) GS:ffff9039ac600000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f8c1c0a51c8 CR3: 00000003d9e4e004 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        ulist_free+0x13/0x20 [btrfs]
        btrfs_qgroup_account_extents+0xf3/0x390 [btrfs]
        btrfs_commit_transaction+0x4f7/0xb20 [btrfs]
        btrfs_sync_file+0x3d4/0x4d0 [btrfs]
        do_fsync+0x38/0x70
        __x64_sys_fdatasync+0x13/0x20
        do_syscall_64+0x5c/0xe0
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7fc47e2d72e3
       Code: Bad RIP value.
       RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
       RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3
       RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003
       RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003
       R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8
       R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50
       Modules linked in: dm_snapshot dm_thin_pool (...)
       ---[ end trace 8639237550317b49 ]---
       RIP: 0010:ulist_release+0x14/0x60 [btrfs]
       Code: c7 07 00 (...)
       RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282
       RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840
       RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840
       R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840
       FS:  00007fc47e1e1000(0000) GS:ffff9039ad200000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f6a776f7d40 CR3: 00000003d9e4e002 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      Fix this by making btrfs_find_all_roots_safe() set *roots to NULL after
      it frees the ulist.
      
      Fixes: 8da6d581 ("Btrfs: added btrfs_find_all_roots()")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      580c079b
  19. 25 5月, 2020 3 次提交
    • D
      btrfs: simplify root lookup by id · 56e9357a
      David Sterba 提交于
      The main function to lookup a root by its id btrfs_get_fs_root takes the
      whole key, while only using the objectid. The value of offset is preset
      to (u64)-1 but not actually used until btrfs_find_root that does the
      actual search.
      
      Switch btrfs_get_fs_root to use only objectid and remove all local
      variables that existed just for the lookup. The actual key for search is
      set up in btrfs_get_fs_root, reusing another key variable.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      56e9357a
    • Q
      btrfs: rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE · 92a7cc42
      Qu Wenruo 提交于
      The name BTRFS_ROOT_REF_COWS is not very clear about the meaning.
      
      In fact, that bit can only be set to those trees:
      
      - Subvolume roots
      - Data reloc root
      - Reloc roots for above roots
      
      All other trees won't get this bit set.  So just by the result, it is
      obvious that, roots with this bit set can have tree blocks shared with
      other trees.  Either shared by snapshots, or by reloc roots (an special
      snapshot created by relocation).
      
      This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
      make it easier to understand, and update all comment mentioning
      "reference counted" to follow the rename.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      92a7cc42
    • Q
      btrfs: backref: distinguish reloc and non-reloc use of indirect resolution · 876de781
      Qu Wenruo 提交于
      For relocation tree detection, relocation backref cache uses
      btrfs_should_ignore_reloc_root() which uses relocation-specific checks
      like checking the DEAD_RELOC_ROOT bit.
      
      However for general purpose backref cache, we can rely on that check, as
      it's possible that relocation is also running.
      
      For generic purposed backref cache, we detect reloc root by
      SHARED_BLOCK_REF item.  Only reloc root node has its parent bytenr
      pointing back to itself.
      
      And in that case, backref cache will mark the reloc root node useless,
      dropping any child orphan nodes.
      
      So only call btrfs_should_ignore_reloc_root() if the backref cache is
      for relocation.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      876de781