1. 29 3月, 2023 1 次提交
    • F
      btrfs: ignore fiemap path cache when there are multiple paths for a node · 2280d425
      Filipe Manana 提交于
      During fiemap, when walking backreferences to determine if a b+tree
      node/leaf is shared, we may find a tree block (leaf or node) for which
      two parents were added to the references ulist. This happens if we get
      for example one direct ref (shared tree block ref) and one indirect ref
      (non-shared tree block ref) for the tree block at the current level,
      which can happen during relocation.
      
      In that case the fiemap path cache can not be used since it's meant for
      a single path, with one tree block at each possible level, so having
      multiple references for a tree block at any level may result in getting
      the level counter exceed BTRFS_MAX_LEVEL and eventually trigger the
      warning:
      
         WARN_ON_ONCE(level >= BTRFS_MAX_LEVEL)
      
      at lookup_backref_shared_cache() and at store_backref_shared_cache().
      This is harmless since the code ignores any level >= BTRFS_MAX_LEVEL, the
      warning is there just to catch any unexpected case like the one described
      above. However if a user finds this it may be scary and get reported.
      
      So just ignore the path cache once we find a tree block for which there
      are more than one reference, which is the less common case, and update
      the cache with the sharedness check result for all levels below the level
      for which we found multiple references.
      Reported-by: NJarno Pelkonen <jarno.pelkonen@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAKv8qLmDNAGJGCtsevxx_VZ_YOvvs1L83iEJkTgyA4joJertng@mail.gmail.com/
      Fixes: 12a824dc ("btrfs: speedup checking for extent sharedness during fiemap")
      CC: stable@vger.kernel.org # 6.1+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2280d425
  2. 16 2月, 2023 2 次提交
    • F
      btrfs: skip backref walking during fiemap if we know the leaf is shared · e2fd8306
      Filipe Manana 提交于
      During fiemap, when checking if a data extent is shared we are doing the
      backref walking even if we already know the leaf is shared, which is a
      waste of time since if the leaf shared then the data extent is also
      shared. So skip the backref walking when we know we are in a shared leaf.
      
      The following test was measures the gains for a case where all leaves
      are shared due to a snapshot:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         # Use compression to quickly create files with a lot of extents
         # (each with a size of 128K).
         mount -o compress=lzo $DEV $MNT
      
         # 40G gives 327680 extents, each with a size of 128K.
         xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
      
         # Add some more files to increase the size of the fs and extent
         # trees (in the real world there's a lot of files and extents
         # from other files).
         xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
         xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
         xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3
      
         # Create a snapshot so all the extents become indirectly shared
         # through subtrees, with a generation less than or equals to the
         # generation used to create the snapshot.
         btrfs subvolume snapshot -r $MNT $MNT/snap1
      
         # Unmount and mount again to clear cached metadata.
         umount $MNT
         mount -o compress=lzo $DEV $MNT
      
         start=$(date +%s%N)
         # The filefrag tool  uses the fiemap ioctl.
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata not cached)"
         echo
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata cached)"
      
         umount $MNT
      
      The results were the following on a non-debug kernel (Debian's default
      kernel config).
      
      Before this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 1821 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 399 milliseconds (metadata cached)
      
      After this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 591 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 123 milliseconds (metadata cached)
      
      That's a speedup of 3.1x and 3.2x.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e2fd8306
    • F
      btrfs: assert commit root semaphore is held when accessing backref cache · 4e4488d4
      Filipe Manana 提交于
      During fiemap, when accessing the cache that stores the sharedness of an
      extent, we need to either be holding a transaction handle or the commit
      root semaphore. I left comments about this in the comment that precedes
      store_backref_shared_cache() and lookup_backref_shared_cache(), but have
      actually not enforced it through assertions. So assert that the commit
      root semaphore is held if we are not holding a transaction handle.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4e4488d4
  3. 21 12月, 2022 1 次提交
    • B
      btrfs: fix resolving backrefs for inline extent followed by prealloc · 560840af
      Boris Burkov 提交于
      If a file consists of an inline extent followed by a regular or prealloc
      extent, then a legitimate attempt to resolve a logical address in the
      non-inline region will result in add_all_parents reading the invalid
      offset field of the inline extent. If the inline extent item is placed
      in the leaf eb s.t. it is the first item, attempting to access the
      offset field will not only be meaningless, it will go past the end of
      the eb and cause this panic:
      
        [17.626048] BTRFS warning (device dm-2): bad eb member end: ptr 0x3fd4 start 30834688 member offset 16377 size 8
        [17.631693] general protection fault, probably for non-canonical address 0x5088000000000: 0000 [#1] SMP PTI
        [17.635041] CPU: 2 PID: 1267 Comm: btrfs Not tainted 5.12.0-07246-g75175d5adc74-dirty #199
        [17.637969] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [17.641995] RIP: 0010:btrfs_get_64+0xe7/0x110
        [17.649890] RSP: 0018:ffffc90001f73a08 EFLAGS: 00010202
        [17.651652] RAX: 0000000000000001 RBX: ffff88810c42d000 RCX: 0000000000000000
        [17.653921] RDX: 0005088000000000 RSI: ffffc90001f73a0f RDI: 0000000000000001
        [17.656174] RBP: 0000000000000ff9 R08: 0000000000000007 R09: c0000000fffeffff
        [17.658441] R10: ffffc90001f73790 R11: ffffc90001f73788 R12: ffff888106afe918
        [17.661070] R13: 0000000000003fd4 R14: 0000000000003f6f R15: cdcdcdcdcdcdcdcd
        [17.663617] FS:  00007f64e7627d80(0000) GS:ffff888237c80000(0000) knlGS:0000000000000000
        [17.666525] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [17.668664] CR2: 000055d4a39152e8 CR3: 000000010c596002 CR4: 0000000000770ee0
        [17.671253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [17.673634] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [17.676034] PKRU: 55555554
        [17.677004] Call Trace:
        [17.677877]  add_all_parents+0x276/0x480
        [17.679325]  find_parent_nodes+0xfae/0x1590
        [17.680771]  btrfs_find_all_leafs+0x5e/0xa0
        [17.682217]  iterate_extent_inodes+0xce/0x260
        [17.683809]  ? btrfs_inode_flags_to_xflags+0x50/0x50
        [17.685597]  ? iterate_inodes_from_logical+0xa1/0xd0
        [17.687404]  iterate_inodes_from_logical+0xa1/0xd0
        [17.689121]  ? btrfs_inode_flags_to_xflags+0x50/0x50
        [17.691010]  btrfs_ioctl_logical_to_ino+0x131/0x190
        [17.692946]  btrfs_ioctl+0x104a/0x2f60
        [17.694384]  ? selinux_file_ioctl+0x182/0x220
        [17.695995]  ? __x64_sys_ioctl+0x84/0xc0
        [17.697394]  __x64_sys_ioctl+0x84/0xc0
        [17.698697]  do_syscall_64+0x33/0x40
        [17.700017]  entry_SYSCALL_64_after_hwframe+0x44/0xae
        [17.701753] RIP: 0033:0x7f64e72761b7
        [17.709355] RSP: 002b:00007ffefb067f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [17.712088] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f64e72761b7
        [17.714667] RDX: 00007ffefb067fb0 RSI: 00000000c0389424 RDI: 0000000000000003
        [17.717386] RBP: 00007ffefb06d188 R08: 000055d4a390d2b0 R09: 00007f64e7340a60
        [17.719938] R10: 0000000000000231 R11: 0000000000000246 R12: 0000000000000001
        [17.722383] R13: 0000000000000000 R14: 00000000c0389424 R15: 000055d4a38fd2a0
        [17.724839] Modules linked in:
      
      Fix the bug by detecting the inline extent item in add_all_parents and
      skipping to the next extent item.
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NBoris Burkov <boris@bur.io>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      560840af
  4. 06 12月, 2022 26 次提交
    • C
      btrfs: move struct btrfs_tree_parent_check out of disk-io.h · 27137fac
      Christoph Hellwig 提交于
      Move struct btrfs_tree_parent_check out of disk-io.h so that volumes.h
      an various .c files don't have to include disk-io.h just for it.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ use tree-checker.h for the structure ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      27137fac
    • Q
      btrfs: concentrate all tree block parentness check parameters into one structure · 789d6a3a
      Qu Wenruo 提交于
      There are several different tree block parentness check parameters used
      across several helpers:
      
      - level
        Mandatory
      
      - transid
        Under most cases it's mandatory, but there are several backref cases
        which skips this check.
      
      - owner_root
      - first_key
        Utilized by most top-down tree search routine. Otherwise can be
        skipped.
      
      Those four members are not always mandatory checks, and some of them are
      the same u64, which means if some arguments got swapped compiler will
      not catch it.
      
      Furthermore if we're going to further expand the parentness check, we
      need to modify quite some helpers just to add one more parameter.
      
      This patch will concentrate all these members into a structure called
      btrfs_tree_parent_check, and pass that structure for the following
      helpers:
      
      - btrfs_read_extent_buffer()
      - read_tree_block()
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      789d6a3a
    • F
      btrfs: send: skip resolution of our own backref when finding clone source · adf02418
      Filipe Manana 提交于
      When doing backref walking to determine a source range to clone from, it
      is worthless to collect and resolve our own data backref, as we can't
      obviously use it as a clone source and it represents the range we want to
      clone into. Collecting the backref implies doing the extra work to resolve
      it, doing the search for a file extent item in a subvolume tree, etc.
      Skipping the data backref is valid as long as we only have the send root
      as the single clone root, otherwise the leaf with the file extent item may
      be accessible from another clone root due to shared subtrees created by
      snapshots, and therefore we have to collect the backref and resolve it.
      
      So add a callback to the backref walking code to guide it to skip data
      backrefs.
      
      This change is part of a patchset comprised of the following patches:
      
        01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
        02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
        03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
        04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
        05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
        06/17 btrfs: send: update comment at find_extent_clone()
        07/17 btrfs: send: drop unnecessary backref context field initializations
        08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
        09/17 btrfs: send: optimize clone detection to increase extent sharing
        10/17 btrfs: use a single argument for extent offset in backref walking functions
        11/17 btrfs: use a structure to pass arguments to backref walking functions
        12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
        13/17 btrfs: constify ulist parameter of ulist_next()
        14/17 btrfs: send: cache leaf to roots mapping during backref walking
        15/17 btrfs: send: skip unnecessary backref iterations
        16/17 btrfs: send: avoid double extent tree search when finding clone source
        17/17 btrfs: send: skip resolution of our own backref when finding clone source
      
      The following test was run on non-debug kernel (Debian's default kernel
      config) before and after applying the patchset:
      
         $ cat test-send-many-shared-extents.sh
         #!/bin/bash
      
         DEV=/dev/sdh
         MNT=/mnt/sdh
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         num_files=50000
         num_clones_per_file=50
      
         for ((i = 1; i <= $num_files; i++)); do
             xfs_io -f -c "pwrite 0 64K" $MNT/file_$i > /dev/null
             echo -ne "\r$i files created..."
         done
         echo
      
         btrfs subvolume snapshot -r $MNT $MNT/snap1
      
         cloned=0
         for ((i = 1; i <= $num_clones_per_file; i++)); do
             for ((j = 1; j <= $num_files; j++)); do
                 cp --reflink=always $MNT/file_$j $MNT/file_${j}_clone_${i}
                 cloned=$((cloned + 1))
                 echo -ne "\r$cloned / $((num_files * num_clones_per_file)) clone operations"
             done
         done
         echo
      
         btrfs subvolume snapshot -r $MNT $MNT/snap2
      
         # Unmount and mount again to clear all cached metadata (and data).
         umount $DEV
         mount $DEV $MNT
      
         start=$(date +%s%N)
         btrfs send $MNT/snap2 > /dev/null
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000000 ))
         echo -e "\nFull send took $dur seconds"
      
         # Unmount and mount again to clear all cached metadata (and data).
         umount $DEV
         mount $DEV $MNT
      
         start=$(date +%s%N)
         btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000000 ))
         echo -e "\nIncremental send took $dur seconds"
      
         umount $MNT
      
      Before applying the patchset:
      
         (...)
         Full send took 1108 seconds
         (...)
         Incremental send took 1135 seconds
      
      After applying the whole patchset:
      
         (...)
         Full send took 268 seconds            (-75.8%)
         (...)
         Incremental send took 316 seconds     (-72.2%)
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      adf02418
    • F
      btrfs: send: avoid double extent tree search when finding clone source · f73853c7
      Filipe Manana 提交于
      At find_extent_clone() we search twice for the extent item corresponding
      to the data extent that the current file extent items points to:
      
      1) Once with a call to extent_from_logical();
      
      2) Once again during backref walking, through iterate_extent_inodes()
         which eventually leads to find_parent_nodes() where we will search
         again the extent tree for the same extent item.
      
      The extent tree can be huge, so doing this one extra search for every
      extent we want to send adds up and it's expensive.
      
      The first call is there since the send code was introduced and it
      accomplishes two things:
      
      1) Check that the extent is flagged as a data extent in the extent tree.
         But it can not be anything else, otherwise we wouldn't have a file
         extent item in the send root pointing to it.
         This was probably added to catch bugs in the early days where send was
         yet too young and the interaction with everything else was far from
         perfect;
      
      2) Check how many direct references there are on the extent, and if
         there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the
         backred walking as it may take too long and slowdown send.
      
      So improve on this by having a callback in the backref walking code that
      is called when it finds the extent item in the extent tree, and have those
      checks done in the callback. When the callback returns anything different
      from 0, it stops the backref walking code. This way we do a single search
      on the extent tree for the extent item of our data extent.
      
      Also, before this change we were only checking the number of references on
      the data extent against SEND_MAX_EXTENT_REFS, but after starting backref
      walking we will end up resolving backrefs for extent buffers in the path
      from a leaf having a file extent item pointing to our data extent, up to
      roots of trees from which the extent buffer is accessible from, due to
      shared subtrees resulting from snapshoting. We were therefore allowing for
      the possibility for send taking too long due to some node in the path from
      the leaf to a root node being shared too many times. After this change we
      check for reference counts being greater than SEND_MAX_EXTENT_REFS for
      both data extents and metadata extents.
      
      This change is part of a patchset comprised of the following patches:
      
        01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
        02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
        03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
        04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
        05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
        06/17 btrfs: send: update comment at find_extent_clone()
        07/17 btrfs: send: drop unnecessary backref context field initializations
        08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
        09/17 btrfs: send: optimize clone detection to increase extent sharing
        10/17 btrfs: use a single argument for extent offset in backref walking functions
        11/17 btrfs: use a structure to pass arguments to backref walking functions
        12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
        13/17 btrfs: constify ulist parameter of ulist_next()
        14/17 btrfs: send: cache leaf to roots mapping during backref walking
        15/17 btrfs: send: skip unnecessary backref iterations
        16/17 btrfs: send: avoid double extent tree search when finding clone source
        17/17 btrfs: send: skip resolution of our own backref when finding clone source
      
      Performance test results are in the changelog of patch 17/17.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f73853c7
    • F
      btrfs: send: skip unnecessary backref iterations · 88ffb665
      Filipe Manana 提交于
      When looking for a clone source for an extent, we are iterating over all
      the backreferences for an extent. This is often a waste of time, because
      once we find a good clone source we could stop immediately instead of
      continuing backref walking, which is expensive.
      
      Basically what happens currently is this:
      
      1) Call iterate_extent_inodes() to iterate over all the backreferences;
      
      2) It calls btrfs_find_all_leafs() which in turn calls the main function
         to walk over backrefs and collect them - find_parent_nodes();
      
      3) Then we collect all the references for our target data extent from the
         extent tree (and delayed refs if any), add them to the rb trees,
         resolve all the indirect backreferences and search for all the file
         extent items in fs trees, building a list of inodes for each one of
         them (struct extent_inode_elem);
      
      4) Then back at iterate_extent_inodes() we find all the roots associated
         to each found leaf, and call the callback __iterate_backrefs defined
         at send.c for each inode in the inode list associated to each leaf.
      
      Some times one the first backreferences we find in a fs tree is optimal
      to satisfy the clone operation that send wants to perform, and in that
      case we could stop immediately and avoid resolving all the remaining
      indirect backreferences (search fs trees for the respective file extent
      items, etc). This possibly if when we find a fs tree leaf with a file
      extent item we are able to know what are all the roots that can lead to
      the leaf - this is now possible after the previous patch in the series
      that adds a cache that maps leaves to a list of roots. So we can now
      shortcircuit backref walking during send, by having the callback we
      pass to iterate_extent_inodes() to be called when we find a file extent
      item for an indirect backreference, and have it return a special value
      when it found a suitable backreference and it does not need to look for
      more backreferences. This change does that.
      
      This change is part of a patchset comprised of the following patches:
      
        01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
        02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
        03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
        04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
        05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
        06/17 btrfs: send: update comment at find_extent_clone()
        07/17 btrfs: send: drop unnecessary backref context field initializations
        08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
        09/17 btrfs: send: optimize clone detection to increase extent sharing
        10/17 btrfs: use a single argument for extent offset in backref walking functions
        11/17 btrfs: use a structure to pass arguments to backref walking functions
        12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
        13/17 btrfs: constify ulist parameter of ulist_next()
        14/17 btrfs: send: cache leaf to roots mapping during backref walking
        15/17 btrfs: send: skip unnecessary backref iterations
        16/17 btrfs: send: avoid double extent tree search when finding clone source
        17/17 btrfs: send: skip resolution of our own backref when finding clone source
      
      Performance test results are in the changelog of patch 17/17.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      88ffb665
    • F
      btrfs: send: cache leaf to roots mapping during backref walking · 66d04209
      Filipe Manana 提交于
      During a send operation, when doing backref walking to determine which
      inodes/offsets/roots we can clone from, the most repetitive and expensive
      step is to map each leaf that has file extent items pointing to the target
      data extent to the IDs of the roots from which the leaves are accessible,
      which happens at iterate_extent_inodes(). That step requires finding every
      parent node of a leaf, then the parent of each parent, and so on until we
      reach a root node. So it's a naturally expensive operation, and repetitive
      because each leaf can have hundreds of file extent items (for a nodesize
      of 16K, that can be slightly over 200 file extent items). There's also
      temporal locality, as we process all file extent items from a leave before
      moving the next leaf.
      
      This change caches the mapping of leaves to root IDs, to avoid repeating
      those computations over and over again. The cache is limited to a maximum
      of 128 entries, with each entry being a struct with a size of 128 bytes,
      so the maximum cache size is 16K plus any nodes internally allocated by
      the maple tree that is used to index pointers to those structs. The cache
      is invalidated whenever we detect relocation happened since we started
      filling the cache, because if relocation happened then extent buffers for
      leaves and nodes of the trees used by a send operation may have been
      reallocated.
      
      This cache also allows for another important optimization that is
      introduced in the next patch in the series.
      
      This change is part of a patchset comprised of the following patches:
      
        01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
        02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
        03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
        04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
        05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
        06/17 btrfs: send: update comment at find_extent_clone()
        07/17 btrfs: send: drop unnecessary backref context field initializations
        08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
        09/17 btrfs: send: optimize clone detection to increase extent sharing
        10/17 btrfs: use a single argument for extent offset in backref walking functions
        11/17 btrfs: use a structure to pass arguments to backref walking functions
        12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
        13/17 btrfs: constify ulist parameter of ulist_next()
        14/17 btrfs: send: cache leaf to roots mapping during backref walking
        15/17 btrfs: send: skip unnecessary backref iterations
        16/17 btrfs: send: avoid double extent tree search when finding clone source
        17/17 btrfs: send: skip resolution of our own backref when finding clone source
      
      Performance test results are in the changelog of patch 17/17.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      66d04209
    • F
      btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes() · 1baea6f1
      Filipe Manana 提交于
      At iterate_extent_inodes() we collect a ulist of leaves for a given extent
      with a call to btrfs_find_all_leafs() and then we enter a loop where we
      iterate over all the collected leaves. Each iteration of that loop does a
      call to btrfs_find_all_roots_safe(), to determine all roots from which a
      leaf is accessible, and that results in allocating and releasing a ulist
      to store the root IDs.
      
      Instead of allocating and releasing the roots ulist on every iteration,
      allocate a ulist before entering the loop and keep using it on each
      iteration, reinitializing the ulist at the end of each iteration.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1baea6f1
    • F
      btrfs: use a structure to pass arguments to backref walking functions · a2c8d27e
      Filipe Manana 提交于
      The public backref walking functions have quite a lot of arguments that
      are passed down the call stack to find_parent_nodes(), the core function
      of the backref walking code.
      
      The next patches in series will need to add even arguments to these
      functions that should be passed not only to find_parent_nodes(), but also
      to other functions used by the later (directly or even lower in the call
      stack).
      
      So create a structure to hold all these arguments and state used by the
      main backref walking function, find_parent_nodes(), and use it as the
      argument for the public backref walking functions iterate_extent_inodes(),
      btrfs_find_all_leafs() and btrfs_find_all_roots().
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a2c8d27e
    • F
      btrfs: use a single argument for extent offset in backref walking functions · 6ce6ba53
      Filipe Manana 提交于
      The interface for find_parent_nodes() has two extent offset related
      arguments:
      
      1) One u64 pointer argument for the extent offset;
      
      2) One boolean argument to tell if the extent offset should be ignored or
         not.
      
      These are confusing, becase the extent offset pointer can be NULL and in
      some cases callers pass a NULL value as a way to tell the backref walking
      code to ignore offsets in file extent items (and simply consider all file
      extent items that point to the target data extent).
      
      The boolean argument was added in commit c995ab3c ("btrfs: add a flag
      to iterate_inodes_from_logical to find all extent refs for uncompressed
      extents"), but it was never really necessary, it was enough if it could
      find a way to get a NULL value passed to the "extent_item_pos" argument of
      find_parent_nodes(). The arguments are also passed to functions called
      by find_parent_nodes() and respective helper functions, which further
      makes everything more complicated than needed.
      
      Then we have several backref walking related functions that end up calling
      find_parent_nodes(), either directly or through some other function that
      they call, and for many we have to use an "extent_item_pos" (u64) argument
      and a boolean "ignore_offset" argument too.
      
      This is confusing and not really necessary. So use a single argument to
      specify the extent offset, as a simple u64 and not as a pointer, but
      using a special value of (u64)-1, defined as a documented constant, to
      indicate when the extent offset should be ignored.
      
      This is also preparation work for the upcoming patches in the series that
      add other arguments to find_parent_nodes() and other related functions
      that use it.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6ce6ba53
    • F
      btrfs: send: optimize clone detection to increase extent sharing · c7499a64
      Filipe Manana 提交于
      Currently send does not do the best decisions when it comes to decide
      between multiple clone sources, which results in clone operations for
      partial extent ranges, which has the following disadvantages:
      
      1) We get less shared extents at the destination;
      
      2) We have to read more data during the send operation and emit more
         write commands.
      
      Besides not being optimal behaviour, it also breaks user expectations and
      is often reported by users, with a recent example in the Link tag at the
      bottom of this change log.
      
      Part of the reason for this non-optimal behaviour is that the backref
      walking code does not provide information about the length of the file
      extent items that were found for each backref, so send is blind about
      which backref is the best to chose as a cloning source.
      
      The other existing reasons are just silliness, namely always prefering
      the inode with the lowest number when multiple are found for the same
      root and when we can clone from multiple roots, always prefer the send
      root over any of the other clone roots. This does not make any sense
      since any inode or root is fine and as good as any other inode/root.
      
      Fix this by making backref walking pass information about the number of
      bytes referenced by each file extent item and then have send's backref
      callback pick the inode with the highest number of bytes for each root.
      Finally select the root from which we can clone more bytes from.
      
      Example reproducer:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite -S 0xab -b 2M 0 2M" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
         cp --reflink=always $MNT/foo $MNT/baz
         sync
      
         # Overwrite the second half of file foo.
         xfs_io -c "pwrite -S 0xcd -b 1M 1M 1M" $MNT/foo
         sync
      
         echo
         echo "*** fiemap in the original filesystem ***"
         echo
         xfs_io -c "fiemap -v" $MNT/foo
         xfs_io -c "fiemap -v" $MNT/bar
         xfs_io -c "fiemap -v" $MNT/baz
         echo
      
         btrfs filesystem du $MNT
      
         btrfs subvolume snapshot -r $MNT $MNT/snap
      
         btrfs send -f /tmp/send_stream $MNT/snap
      
         umount $MNT
         mkfs.btrfs -f $DEV &> /dev/null
         mount $DEV $MNT
      
         btrfs receive -f /tmp/send_stream $MNT
      
         echo
         echo "*** fiemap in the new filesystem ***"
         echo
         xfs_io -r -c "fiemap -v" $MNT/snap/foo
         xfs_io -r -c "fiemap -v" $MNT/snap/bar
         xfs_io -r -c "fiemap -v" $MNT/snap/baz
         echo
      
         btrfs filesystem du $MNT
      
         rm -f /tmp/send_stream
         rm -f /tmp/snap.fssum
      
         umount $MNT
      
      Before this change:
      
         $ ./test.sh
         (...)
      
         *** fiemap in the original filesystem ***
      
         /mnt/sdi/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048   0x1
         /mnt/sdi/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
      
              Total   Exclusive  Set shared  Filename
            2.00MiB     1.00MiB           -  /mnt/sdi/foo
            2.00MiB       0.00B           -  /mnt/sdi/bar
            2.00MiB       0.00B           -  /mnt/sdi/baz
            6.00MiB     1.00MiB     2.00MiB  /mnt/sdi
      
         Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
         At subvol /mnt/sdi/snap
         At subvol snap
      
         *** fiemap in the new filesystem ***
      
         /mnt/sdi/snap/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/snap/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048   0x1
         /mnt/sdi/snap/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    32768..34815      2048   0x1
      
              Total   Exclusive  Set shared  Filename
            2.00MiB       0.00B           -  /mnt/sdi/snap/foo
            2.00MiB     1.00MiB           -  /mnt/sdi/snap/bar
            2.00MiB     1.00MiB           -  /mnt/sdi/snap/baz
            6.00MiB     2.00MiB           -  /mnt/sdi/snap
            6.00MiB     2.00MiB     2.00MiB  /mnt/sdi
      
      We end up with two 1M extents that are not shared for files bar and baz.
      
      After this change:
      
         $ ./test.sh
         (...)
      
         *** fiemap in the original filesystem ***
      
         /mnt/sdi/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048   0x1
         /mnt/sdi/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
      
              Total   Exclusive  Set shared  Filename
            2.00MiB     1.00MiB           -  /mnt/sdi/foo
            2.00MiB       0.00B           -  /mnt/sdi/bar
            2.00MiB       0.00B           -  /mnt/sdi/baz
            6.00MiB     1.00MiB     2.00MiB  /mnt/sdi
         Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
         At subvol /mnt/sdi/snap
         At subvol snap
      
         *** fiemap in the new filesystem ***
      
         /mnt/sdi/snap/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..4095]:       26624..30719      4096 0x2001
         /mnt/sdi/snap/bar:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048 0x2001
         /mnt/sdi/snap/baz:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..2047]:       26624..28671      2048 0x2000
            1: [2048..4095]:    30720..32767      2048 0x2001
      
              Total   Exclusive  Set shared  Filename
            2.00MiB       0.00B           -  /mnt/sdi/snap/foo
            2.00MiB       0.00B           -  /mnt/sdi/snap/bar
            2.00MiB       0.00B           -  /mnt/sdi/snap/baz
            6.00MiB       0.00B           -  /mnt/sdi/snap
            6.00MiB       0.00B     3.00MiB  /mnt/sdi
      
      Now there's a much better sharing, files bar and baz share 1M of the
      extent of file foo and the second extent of files bar and baz is shared
      between themselves.
      
      This will later be turned into a test case for fstests.
      
      Link: https://lore.kernel.org/linux-btrfs/20221008005704.795b44b0@crass-HP-ZBook-15-G2/Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c7499a64
    • J
      btrfs: move relocation prototypes into relocation.h · 67707479
      Josef Bacik 提交于
      Move these out of ctree.h into relocation.h to cut down on code in
      ctree.h
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      67707479
    • J
      btrfs: move extent-tree helpers into their own header file · a0231804
      Josef Bacik 提交于
      Move all the extent tree related prototypes to extent-tree.h out of
      ctree.h, and then go include it everywhere needed so everything
      compiles.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a0231804
    • D
      btrfs: sink gfp_t parameter to btrfs_backref_iter_alloc · d68194b2
      David Sterba 提交于
      There's only one caller that passes GFP_NOFS, we can drop the parameter
      an use the flags directly.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d68194b2
    • J
      btrfs: move accessor helpers into accessors.h · 07e81dc9
      Josef Bacik 提交于
      This is a large patch, but because they're all macros it's impossible to
      split up.  Simply copy all of the item accessors in ctree.h and paste
      them in accessors.h, and then update any files to include the header so
      everything compiles.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ reformat comments, style fixups ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      07e81dc9
    • J
      btrfs: move fs wide helpers out of ctree.h · c7f13d42
      Josef Bacik 提交于
      We have several fs wide related helpers in ctree.h.  The bulk of these
      are the incompat flag test helpers, but there are things such as
      btrfs_fs_closing() and the read only helpers that also aren't directly
      related to the ctree code.  Move these into a fs.h header, which will
      serve as the location for file system wide related helpers.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c7f13d42
    • F
      btrfs: avoid unnecessary resolution of indirect backrefs during fiemap · 6976201f
      Filipe Manana 提交于
      During fiemap, when determining if a data extent is shared or not, if we
      don't find the extent is directly shared, then we need to determine if
      it's shared through subtrees. For that we need to resolve the indirect
      reference we found in order to figure out the path in the inode's fs tree,
      which is a path starting at the fs tree's root node and going down to the
      leaf that contains the file extent item that points to the data extent.
      We then proceed to determine if any extent buffer in that path is shared
      with other trees or not.
      
      However when the generation of the data extent is more recent than the
      last generation used to snapshot the root, we don't need to determine
      the path, since the data extent can not be shared through snapshots.
      For this case we currently still determine the leaf of that path (at
      find_parent_nodes(), but then stop determining the other nodes in the
      path (at btrfs_is_data_extent_shared()) as it's pointless.
      
      So do the check of the data extent's generation earlier, at
      find_parent_nodes(), before trying to resolve the indirect reference to
      determine the leaf in the path. This saves us from doing one expensive
      b+tree search in the fs tree of our target inode, as well as other minor
      work.
      
      The following test was run on a non-debug kernel (Debian's default kernel
      config):
      
         $ cat test-fiemap.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         # Use compression to quickly create files with a lot of extents
         # (each with a size of 128K).
         mount -o compress=lzo $DEV $MNT
      
         # 40G gives 327680 extents, each with a size of 128K.
         xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
      
         # Add some more files to increase the size of the fs and extent
         # trees (in the real world there's a lot of files and extents
         # from other files).
         xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
         xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
         xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3
      
         umount $MNT
         mount -o compress=lzo $DEV $MNT
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata not cached)"
         echo
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata cached)"
      
         umount $MNT
      
      Before applying this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 1285 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 742 milliseconds (metadata cached)
      
      After applying this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 689 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 393 milliseconds (metadata cached)
      
      That's a -46.4% total reduction for the metadata not cached case, and
      a -47.0% reduction for the cached metadata case.
      
      The test is somewhat limited in the sense the gains may be higher in
      practice, because in the test the filesystem is small, so we have small
      fs and extent trees, plus there's no concurrent access to the trees as
      well, therefore no lock contention there.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6976201f
    • F
      btrfs: avoid duplicated resolution of indirect backrefs during fiemap · 877c1476
      Filipe Manana 提交于
      During fiemap, when determining if a data extent is shared or not, if we
      don't find the extent is directly shared, then we need to determine if
      it's shared through subtrees. For that we need to resolve the indirect
      reference we found in order to figure out the path in the inode's fs tree,
      which is a path starting at the fs tree's root node and going down to the
      leaf that contains the file extent item that points to the data extent.
      We then proceed to determine if any extent buffer in that path is shared
      with other trees or not.
      
      Currently whenever we find the data extent that a file extent item points
      to is not directly shared, we always resolve the path in the fs tree, and
      then check if any extent buffer in the path is shared. This is a lot of
      work and when we have file extent items that belong to the same leaf, we
      have the same path, so we only need to calculate it once.
      
      This change does that, it keeps track of the current and previous leaf,
      and when we find that a data extent is not directly shared, we try to
      compute the fs tree path only once and then use it for every other file
      extent item in the same leaf, using the existing cached path result for
      the leaf as long as the cache results are valid.
      
      This saves us from doing expensive b+tree searches in the fs tree of our
      target inode, as well as other minor work.
      
      The following test was run on a non-debug kernel (Debian's default kernel
      config):
      
         $ cat test-with-snapshots.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         # Use compression to quickly create files with a lot of extents
         # (each with a size of 128K).
         mount -o compress=lzo $DEV $MNT
      
         # 40G gives 327680 extents, each with a size of 128K.
         xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
      
         # Add some more files to increase the size of the fs and extent
         # trees (in the real world there's a lot of files and extents
         # from other files).
         xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
         xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
         xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3
      
         # Create a snapshot so all the extents become indirectly shared
         # through subtrees, with a generation less than or equals to the
         # generation used to create the snapshot.
         btrfs subvolume snapshot -r $MNT $MNT/snap1
      
         umount $MNT
         mount -o compress=lzo $DEV $MNT
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata not cached)"
         echo
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata cached)"
      
         umount $MNT
      
      Result before applying this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 1204 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 729 milliseconds (metadata cached)
      
      Result after applying this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 732 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 421 milliseconds (metadata cached)
      
      That's a -46.1% total reduction for the metadata not cached case, and
      a -42.2% reduction for the cached metadata case.
      
      The test is somewhat limited in the sense the gains may be higher in
      practice, because in the test the filesystem is small, so we have small
      fs and extent trees, plus there's no concurrent access to the trees as
      well, therefore no lock contention there.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      877c1476
    • F
      btrfs: move up backref sharedness cache store and lookup functions · 583f4ac5
      Filipe Manana 提交于
      Move the static functions to lookup and store sharedness check of an
      extent buffer to a location above find_all_parents(), because in the
      next patch the lookup function will be used by find_all_parents().
      The store function is also moved just because it's the counter part
      to the lookup function and it's best to have their definitions close
      together.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      583f4ac5
    • F
      btrfs: cache sharedness of the last few data extents during fiemap · 73e339e6
      Filipe Manana 提交于
      During fiemap we process all the file extent items of an inode, by their
      file offset order (left to right b+tree order), and then check if the data
      extent they point at is shared or not. Until now we didn't cache those
      results, we only did it for b+tree nodes/leaves since for each unique
      b+tree path we have access to hundreds of file extent items. However, it
      is also common to repeat checking the sharedness of a particular data
      extent in a very short time window, and the cases that lead to that are
      the following:
      
      1) COW writes.
      
         If have a file extent item like this:
      
                        [ bytenr X, offset = 0, num_bytes = 512K ]
         file offset    0                                        512K
      
         Then a 4K write into file offset 64K happens, we end up with the
         following file extent item layout:
      
                        [ bytenr X, offset = 0, num_bytes = 64K ]
         file offset    0                                       64K
      
                        [ bytenr Y, offset = 0, num_bytes = 4K ]
         file offset   64K                                     68K
      
                        [ bytenr X, offset = 68K, num_bytes = 444K ]
         file offset   68K                                         512K
      
         So during fiemap we well check for the sharedness of the data extent
         with bytenr X twice. Typically for COW writes and for at least
         moderately updated files, we end up with many file extent items that
         point to different sections of the same data extent.
      
      2) Writing into a NOCOW file after a snapshot is taken.
      
         This happens if the target extent was created in a generation older
         than the generation where the last snapshot for the root (the tree the
         inode belongs to) was made.
      
         This leads to a scenario like the previous one.
      
      3) Writing into sections of a preallocated extent.
      
         For example if a file has the following layout:
      
         [ bytenr X, offset = 0, num_bytes = 1M, type = prealloc ]
         0                                                       1M
      
         After doing a 4K write into file offset 0 and another 4K write into
         offset 512K, we get the following layout:
      
            [ bytenr X, offset = 0, num_bytes = 4K, type = regular ]
            0                                                      4K
      
            [ bytenr X, offset = 4K, num_bytes = 508K, type = prealloc ]
           4K                                                          512K
      
            [ bytenr X, offset = 512K, num_bytes = 4K, type = regular ]
         512K                                                         516K
      
            [ bytenr X, offset = 516K, num_bytes = 508K, type = prealloc ]
         516K                                                            1M
      
         So we end up with 4 consecutive file extent items pointing to the data
         extent at bytenr X.
      
      4) Hole punching in the middle of an extent.
      
         For example if a file has the following file extent item:
      
         [ bytenr X, offset = 0, num_bytes = 8M ]
         0                                      8M
      
         And then hole is punched for the file range [4M, 6M[, we our file
         extent item split into two:
      
         [ bytenr X, offset = 0, num_bytes = 4M  ]
         0                                       4M
      
         [ 2M hole, implicit or explicit depending on NO_HOLES feature ]
         4M                                                            6M
      
         [ bytenr X, offset = 6M, num_bytes = 2M  ]
         6M                                       8M
      
         Again, we end up with two file extent items pointing to the same
         data extent.
      
      5) When reflinking (clone and deduplication) within the same file.
         This is probably the least common case of all.
      
      In cases 1, 2, 4 and 4, when we have multiple file extent items that point
      to the same data extent, their distance is usually short, typically
      separated by a few slots in a b+tree leaf (or across sibling leaves). For
      case 5, the distance can vary a lot, but it's typically the less common
      case.
      
      This change caches the result of the sharedness checks for data extents,
      but only for the last 8 extents that we notice that our inode refers to
      with multiple file extent items. Whenever we want to check if a data
      extent is shared, we lookup the cache which consists of doing a linear
      scan of an 8 elements array, and if we find the data extent there, we
      return the result and don't check the extent tree and delayed refs.
      
      The array/cache is small so that doing the search has no noticeable
      negative impact on the performance in case we don't have file extent items
      within a distance of 8 slots that point to the same data extent.
      
      Slots in the cache/array are overwritten in a simple round robin fashion,
      as that approach fits very well.
      
      Using this simple approach with only the last 8 data extents seen is
      effective as usually when multiple file extents items point to the same
      data extent, their distance is within 8 slots. It also uses very little
      memory and the time to cache a result or lookup the cache is negligible.
      
      The following test was run on non-debug kernel (Debian's default kernel
      config) to measure the impact in the case of COW writes (first example
      given above), where we run fiemap after overwriting 33% of the blocks of
      a file:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         FILE_SIZE=$((1 * 1024 * 1024  * 1024))
      
         # Create the file full of 1M extents.
         xfs_io -f -s -c "pwrite -b 1M -S 0xab 0 $FILE_SIZE" $MNT/foobar
      
         block_count=$((FILE_SIZE / 4096))
         # Overwrite about 33% of the file blocks.
         overwrite_count=$((block_count / 3))
      
         echo -e "\nOverwriting $overwrite_count 4K blocks (out of $block_count)..."
         RANDOM=123
         for ((i = 1; i <= $overwrite_count; i++)); do
             off=$(((RANDOM % block_count) * 4096))
             xfs_io -c "pwrite -S 0xcd $off 4K" $MNT/foobar > /dev/null
             echo -ne "\r$i blocks overwritten..."
         done
         echo -e "\n"
      
         # Unmount and mount to clear all cached metadata.
         umount $MNT
         mount $DEV $MNT
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds"
      
         umount $MNT
      
      Result before applying this patch:
      
         fiemap took 128 milliseconds
      
      Result after applying this patch:
      
         fiemap took 92 milliseconds   (-28.1%)
      
      The test is somewhat limited in the sense the gains may be higher in
      practice, because in the test the filesystem is small, so we have small
      fs and extent trees, plus there's no concurrent access to the trees as
      well, therefore no lock contention there.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      73e339e6
    • F
      btrfs: remove useless logic when finding parent nodes · 56f5c199
      Filipe Manana 提交于
      At find_parent_nodes(), at its last step, when iterating over all direct
      references, we are checking if we have a share context and if we have
      a reference with a different root from the one in the share context.
      However that logic is pointless because of two reasons:
      
      1) After the previous patch in the series (subject "btrfs: remove roots
         ulist when checking data extent sharedness"), the roots argument is
         always NULL when using a share check context (struct share_check), so
         this code is never triggered;
      
      2) Even before that previous patch, we could not hit this code because
         if we had a reference with a root different from the one in our share
         context, then we would have exited earlier when doing either of the
         following:
      
            - Adding a second direct ref to the direct refs red black tree
              resulted in extent_is_shared() returning true when called from
              add_direct_ref() -> add_prelim_ref(), after processing delayed
              references or while processing references in the extent tree;
      
            - When adding a second reference to the indirect refs red black
              tree (same as above, extent_is_shared() returns true);
      
            - If we only have one indirect reference and no direct references,
              then when resolving it at resolve_indirect_refs() we immediately
              return that the target extent is shared, therefore never reaching
              that loop that iterates over all direct references at
              find_parent_nodes();
      
            - If we have 1 indirect reference and 1 direct reference, then we
              also exit early because extent_is_shared() ends up returning true
              when called through add_prelim_ref() (by add_direct_ref() or
              add_indirect_ref()) or add_delayed_refs(). Same applies as when
              having a combination of direct, indirect and indirect with missing
              key references.
      
         This logic had been obsoleted since commit 3ec4d323 ("btrfs:
         allow backref search checks for shared extents"), which introduced the
         early exits in case an extent is shared.
      
      So just remove that logic, and assert at find_parent_nodes() that when we
      have a share context we don't have a roots ulist and that we haven't found
      the extent to be directly shared after processing delayed references and
      all references from the extent tree.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      56f5c199
    • F
      btrfs: remove roots ulist when checking data extent sharedness · b6296858
      Filipe Manana 提交于
      Currently btrfs_is_data_extent_shared() is passing a ulist for the roots
      argument of find_parent_nodes(), however it does not use that ulist for
      anything and for this context that list always ends up with at most one
      element.
      
      Since find_parent_nodes() is able to deal with a NULL ulist for its roots
      argument, make btrfs_is_data_extent_shared() pass it NULL and avoid the
      burden of allocating memory for the unnused roots ulist, initializing it,
      releasing it and allocating one struct ulist_node for it during the call
      to find_parent_nodes().
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b6296858
    • F
      btrfs: move ulists to data extent sharedness check context · 84a7949d
      Filipe Manana 提交于
      When calling btrfs_is_data_extent_shared() we pass two ulists that were
      allocated by the caller. This is because the single caller, fiemap, calls
      btrfs_is_data_extent_shared() multiple times and the ulists can be reused,
      instead of allocating new ones before each call and freeing them after
      each call.
      
      Now that we have a context structure/object that we pass to
      btrfs_is_data_extent_shared(), we can move those ulists to it, and hide
      their allocation and the context's allocation in a helper function, as
      well as the freeing of the ulists and the context object. This allows to
      reduce the number of parameters passed to btrfs_is_data_extent_shared(),
      the need to pass the ulists from extent_fiemap() to fiemap_process_hole()
      and having the caller deal with allocating and releasing the ulists.
      
      Also rename one of the ulists from 'tmp' / 'tmp_ulist' to 'refs', since
      that's a much better name as it reflects what the list is used for (and
      matching the argument name for find_parent_nodes()).
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      84a7949d
    • F
      btrfs: turn the backref sharedness check cache into a context object · 61dbb952
      Filipe Manana 提交于
      Right now we are using a struct btrfs_backref_shared_cache to pass state
      across multiple btrfs_is_data_extent_shared() calls. The structure's name
      closely follows its current purpose, which is to cache previous checks
      for the sharedness of metadata extents. However we will start using the
      structure for more things other than caching sharedness checks, so rename
      it to struct btrfs_backref_share_check_ctx.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      61dbb952
    • F
      btrfs: directly pass the inode to btrfs_is_data_extent_shared() · ceb707da
      Filipe Manana 提交于
      Currently we pass a root and an inode number as arguments for
      btrfs_is_data_extent_shared() and the inode number is always from an
      inode that belongs to that root (it wouldn't make sense otherwise).
      In every context that we call btrfs_is_data_extent_shared() (fiemap only),
      we have an inode available, so directly pass the inode to the function
      instead of a root and inode number. This reduces the number of parameters
      and it makes the function's signature conform to most other functions we
      have.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ceb707da
    • F
      btrfs: remove checks for a 0 inode number during backref walking · a0a5472a
      Filipe Manana 提交于
      When doing backref walking to determine if an extent is shared, we are
      testing if the inode number, stored in the 'inum' field of struct
      share_check, is 0. However that can never be case, since the all instances
      of the structure are created at btrfs_is_data_extent_shared(), which
      always initializes it with the inode number from a fs tree (and the number
      for any inode from any tree can never be 0). So remove the checks.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a0a5472a
    • F
      btrfs: remove checks for a root with id 0 during backref walking · c9024219
      Filipe Manana 提交于
      When doing backref walking to determine if an extent is shared, we are
      testing the root_objectid of the given share_check struct is 0, but that
      is an impossible case, since btrfs_is_data_extent_shared() always
      initializes the root_objectid field with the id of the given root, and
      no root can have an objectid of 0. So remove those checks.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c9024219
  5. 03 11月, 2022 2 次提交
    • F
      btrfs: fix inode list leak during backref walking at find_parent_nodes() · 92876eec
      Filipe Manana 提交于
      During backref walking, at find_parent_nodes(), if we are dealing with a
      data extent and we get an error while resolving the indirect backrefs, at
      resolve_indirect_refs(), or in the while loop that iterates over the refs
      in the direct refs rbtree, we end up leaking the inode lists attached to
      the direct refs we have in the direct refs rbtree that were not yet added
      to the refs ulist passed as argument to find_parent_nodes(). Since they
      were not yet added to the refs ulist and prelim_release() does not free
      the lists, on error the caller can only free the lists attached to the
      refs that were added to the refs ulist, all the remaining refs get their
      inode lists never freed, therefore leaking their memory.
      
      Fix this by having prelim_release() always free any attached inode list
      to each ref found in the rbtree, and have find_parent_nodes() set the
      ref's inode list to NULL once it transfers ownership of the inode list
      to a ref added to the refs ulist passed to find_parent_nodes().
      
      Fixes: 86d5f994 ("btrfs: convert prelimary reference tracking to use rbtrees")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      92876eec
    • F
      btrfs: fix inode list leak during backref walking at resolve_indirect_refs() · 5614dc3a
      Filipe Manana 提交于
      During backref walking, at resolve_indirect_refs(), if we get an error
      we jump to the 'out' label and call ulist_free() on the 'parents' ulist,
      which frees all the elements in the ulist - however that does not free
      any inode lists that may be attached to elements, through the 'aux' field
      of a ulist node, so we end up leaking lists if we have any attached to
      the unodes.
      
      Fix this by calling free_leaf_list() instead of ulist_free() when we exit
      from resolve_indirect_refs(). The static function free_leaf_list() is
      moved up for this to be possible and it's slightly simplified by removing
      unnecessary code.
      
      Fixes: 3301958b ("Btrfs: add inodes before dropping the extent lock in find_all_leafs")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5614dc3a
  6. 11 10月, 2022 3 次提交
    • F
      btrfs: ignore fiemap path cache if we have multiple leaves for a data extent · 63c84b46
      Filipe Manana 提交于
      The path cache used during fiemap used to determine the sharedness of
      extent buffers in a path from a leaf containing a file extent item
      pointing to our data extent up to the root node of the tree, is meant to
      be used for a single path. Having a single path is by far the most common
      case, and therefore worth to optimize for, but it's possible to actually
      have multiple paths because we have 2 or more leaves.
      
      If we have multiple leaves, the 'level' variable keeps getting incremented
      in each iteration of the while loop at btrfs_is_data_extent_shared(),
      which means we will treat the second leaf in the 'tmp' ulist as a level 1
      node, and so forth. In the worst case this can lead to getting a level
      greater than or equals to BTRFS_MAX_LEVEL (8), which will trigger a
      WARN_ON_ONCE() in the functions to lookup from or store in the path cache
      (lookup_backref_shared_cache() and store_backref_shared_cache()). If the
      current level never goes beyond 8, due to shared nodes in the paths and
      a fs tree height smaller than 8, it can still result in incorrectly
      marking one leaf as shared because some other leaf is shared and is stored
      one level below that other leaf, as when storing a true sharedness value
      in the cache results in updating the sharedness to true of all entries in
      the cache below the current level.
      
      Having multiple leaves happens in a case like the following:
      
        - We have a file extent item point to data extent at bytenr X, for
          a file range [0, 1M[ for example;
      
        - At this moment we have an extent data ref for the extent, with
          an offset of 0 and a count of 1;
      
        - A write into the middle of the extent happens, file range [64K, 128K)
          so the file extent item is split into two (at btrfs_drop_extents()):
      
          1) One for file range [0, 64K), with a length (num_bytes field) of
             64K and an extent offset of 0;
      
          2) Another one for file range [128K, 1M), with a length of 896K
             (1M - 128K) and an extent offset of 128K.
      
        - At this moment the two file extent items are located in the same
          leaf;
      
        - A new file extent item for the range [64K, 128K), pointing to a new
          data extent, is inserted in the leaf. This results in a leaf split
          and now those two file extent items pointing to data extent X end
          up located in different leaves;
      
        - Once delayed refs are run, we still have a single extent data ref
          item for our data extent at bytenr X, for offset 0, but now with a
          count of 2 instead of 1;
      
        - So during fiemap, at btrfs_is_data_extent_shared(), after we call
          find_parent_nodes() for the data extent, we get two leaves, since
          we have two file extent items point to data extent at bytenr X that
          are located in two different leaves.
      
      So skip the use of the path cache when we get more than one leaf.
      
      Fixes: 12a824dc ("btrfs: speedup checking for extent sharedness during fiemap")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      63c84b46
    • F
      btrfs: fix processing of delayed tree block refs during backref walking · 943553ef
      Filipe Manana 提交于
      During backref walking, when processing a delayed reference with a type of
      BTRFS_TREE_BLOCK_REF_KEY, we have two bugs there:
      
      1) We are accessing the delayed references extent_op, and its key, without
         the protection of the delayed ref head's lock;
      
      2) If there's no extent op for the delayed ref head, we end up with an
         uninitialized key in the stack, variable 'tmp_op_key', and then pass
         it to add_indirect_ref(), which adds the reference to the indirect
         refs rb tree.
      
         This is wrong, because indirect references should have a NULL key
         when we don't have access to the key, and in that case they should be
         added to the indirect_missing_keys rb tree and not to the indirect rb
         tree.
      
         This means that if have BTRFS_TREE_BLOCK_REF_KEY delayed ref resulting
         from freeing an extent buffer, therefore with a count of -1, it will
         not cancel out the corresponding reference we have in the extent tree
         (with a count of 1), since both references end up in different rb
         trees.
      
         When using fiemap, where we often need to check if extents are shared
         through shared subtrees resulting from snapshots, it means we can
         incorrectly report an extent as shared when it's no longer shared.
         However this is temporary because after the transaction is committed
         the extent is no longer reported as shared, as running the delayed
         reference results in deleting the tree block reference from the extent
         tree.
      
         Outside the fiemap context, the result is unpredictable, as the key was
         not initialized but it's used when navigating the rb trees to insert
         and search for references (prelim_ref_compare()), and we expect all
         references in the indirect rb tree to have valid keys.
      
      The following reproducer triggers the second bug:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount -o compress $DEV $MNT
      
         # With a compressed 128M file we get a tree height of 2 (level 1 root).
         xfs_io -f -c "pwrite -b 1M 0 128M" $MNT/foo
      
         btrfs subvolume snapshot $MNT $MNT/snap
      
         # Fiemap should output 0x2008 in the flags column.
         # 0x2000 means shared extent
         # 0x8 means encoded extent (because it's compressed)
         echo
         echo "fiemap after snapshot, range [120M, 120M + 128K):"
         xfs_io -c "fiemap -v 120M 128K" $MNT/foo
         echo
      
         # Overwrite one extent and fsync to flush delalloc and COW a new path
         # in the snapshot's tree.
         #
         # After this we have a BTRFS_DROP_DELAYED_REF delayed ref of type
         # BTRFS_TREE_BLOCK_REF_KEY with a count of -1 for every COWed extent
         # buffer in the path.
         #
         # In the extent tree we have inline references of type
         # BTRFS_TREE_BLOCK_REF_KEY, with a count of 1, for the same extent
         # buffers, so they should cancel each other, and the extent buffers in
         # the fs tree should no longer be considered as shared.
         #
         echo "Overwriting file range [120M, 120M + 128K)..."
         xfs_io -c "pwrite -b 128K 120M 128K" $MNT/snap/foo
         xfs_io -c "fsync" $MNT/snap/foo
      
         # Fiemap should output 0x8 in the flags column. The extent in the range
         # [120M, 120M + 128K) is no longer shared, it's now exclusive to the fs
         # tree.
         echo
         echo "fiemap after overwrite range [120M, 120M + 128K):"
         xfs_io -c "fiemap -v 120M 128K" $MNT/foo
         echo
      
         umount $MNT
      
      Running it before this patch:
      
         $ ./test.sh
         (...)
         wrote 134217728/134217728 bytes at offset 0
         128 MiB, 128 ops; 0.1152 sec (1.085 GiB/sec and 1110.5809 ops/sec)
         Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
      
         fiemap after snapshot, range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
         Overwriting file range [120M, 120M + 128K)...
         wrote 131072/131072 bytes at offset 125829120
         128 KiB, 1 ops; 0.0001 sec (683.060 MiB/sec and 5464.4809 ops/sec)
      
         fiemap after overwrite range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
      The extent in the range [120M, 120M + 128K) is still reported as shared
      (0x2000 bit set) after overwriting that range and flushing delalloc, which
      is not correct - an entire path was COWed in the snapshot's tree and the
      extent is now only referenced by the original fs tree.
      
      Running it after this patch:
      
         $ ./test.sh
         (...)
         wrote 134217728/134217728 bytes at offset 0
         128 MiB, 128 ops; 0.1198 sec (1.043 GiB/sec and 1068.2067 ops/sec)
         Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
      
         fiemap after snapshot, range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
         Overwriting file range [120M, 120M + 128K)...
         wrote 131072/131072 bytes at offset 125829120
         128 KiB, 1 ops; 0.0001 sec (694.444 MiB/sec and 5555.5556 ops/sec)
      
         fiemap after overwrite range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256   0x8
      
      Now the extent is not reported as shared anymore.
      
      So fix this by passing a NULL key pointer to add_indirect_ref() when
      processing a delayed reference for a tree block if there's no extent op
      for our delayed ref head with a defined key. Also access the extent op
      only after locking the delayed ref head's lock.
      
      The reproducer will be converted later to a test case for fstests.
      
      Fixes: 86d5f994 ("btrfs: convert prelimary reference tracking to use rbtrees")
      Fixes: a6dbceaf ("btrfs: Remove unused op_key var from add_delayed_refs")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      943553ef
    • F
      btrfs: fix processing of delayed data refs during backref walking · 4fc7b572
      Filipe Manana 提交于
      When processing delayed data references during backref walking and we are
      using a share context (we are being called through fiemap), whenever we
      find a delayed data reference for an inode different from the one we are
      interested in, then we immediately exit and consider the data extent as
      shared. This is wrong, because:
      
      1) This might be a DROP reference that will cancel out a reference in the
         extent tree;
      
      2) Even if it's an ADD reference, it may be followed by a DROP reference
         that cancels it out.
      
      In either case we should not exit immediately.
      
      Fix this by never exiting when we find a delayed data reference for
      another inode - instead add the reference and if it does not cancel out
      other delayed reference, we will exit early when we call
      extent_is_shared() after processing all delayed references. If we find
      a drop reference, then signal the code that processes references from
      the extent tree (add_inline_refs() and add_keyed_refs()) to not exit
      immediately if it finds there a reference for another inode, since we
      have delayed drop references that may cancel it out. In this later case
      we exit once we don't have references in the rb trees that cancel out
      each other and have two references for different inodes.
      
      Example reproducer for case 1):
      
         $ cat test-1.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite 0 64K" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
      
         echo
         echo "fiemap after cloning:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         rm -f $MNT/bar
         echo
         echo "fiemap after removing file bar:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         umount $MNT
      
      Running it before this patch, the extent is still listed as shared, it has
      the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
      
         $ ./test-1.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
      Example reproducer for case 2):
      
         $ cat test-2.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite 0 64K" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
      
         # Flush delayed references to the extent tree and commit current
         # transaction.
         sync
      
         echo
         echo "fiemap after cloning:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         rm -f $MNT/bar
         echo
         echo "fiemap after removing file bar:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         umount $MNT
      
      Running it before this patch, the extent is still listed as shared, it has
      the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
      
         $ ./test-2.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
      After this patch, after deleting bar in both tests, the extent is not
      reported with the 0x2000 flag anymore, it gets only the flag 0x1
      (which is FIEMAP_EXTENT_LAST):
      
         $ ./test-1.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128   0x1
      
         $ ./test-2.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128   0x1
      
      These tests will later be converted to a test case for fstests.
      
      Fixes: dc046b10 ("Btrfs: make fiemap not blow when you have lots of snapshots")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4fc7b572
  7. 07 10月, 2022 1 次提交
    • F
      btrfs: add missing path cache update during fiemap · 96dbcc00
      Filipe Manana 提交于
      When looking the stored result for a cached path node, if the stored
      result is valid and has a value of true, we must update all the nodes for
      all levels below it with a result of true as well. This is necessary when
      moving from one leaf in the fs tree to the next one, as well as when
      moving from a node at any level to the next node at the same level.
      
      Currently this logic is missing as it was somehow forgotten by a recent
      patch with the subject: "btrfs: speedup checking for extent sharedness
      during fiemap".
      
      This adds the missing logic, which is the counter part to what we do
      when adding a shared node to the cache at store_backref_shared_cache().
      
      Fixes: 12a824dc ("btrfs: speedup checking for extent sharedness during fiemap")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      96dbcc00
  8. 26 9月, 2022 3 次提交
    • F
      btrfs: skip unnecessary extent buffer sharedness checks during fiemap · b8f164e3
      Filipe Manana 提交于
      During fiemap, for each file extent we find, we must check if it's shared
      or not. The sharedness check starts by verifying if the extent is directly
      shared (its refcount in the extent tree is > 1), and if it is not directly
      shared, then we will check if every node in the subvolume b+tree leading
      from the root to the leaf that has the file extent item (in reverse order),
      is shared (through snapshots).
      
      However this second step is not needed if our extent was created in a
      transaction more recent than the last transaction where a snapshot of the
      inode's root happened, because it can't be shared indirectly (through
      shared subtrees) without a snapshot created in a more recent transaction.
      
      So grab the generation of the extent from the extent map and pass it to
      btrfs_is_data_extent_shared(), which will skip this second phase when the
      generation is more recent than the root's last snapshot value. Note that
      we skip this optimization if the extent map is the result of merging 2
      or more extent maps, because in this case its generation is the maximum
      of the generations of all merged extent maps.
      
      The fact the we use extent maps and they can be merged despite the
      underlying extents being distinct (different file extent items in the
      subvolume b+tree and different extent items in the extent b+tree), can
      result in some bugs when reporting shared extents. But this is a problem
      of the current implementation of fiemap relying on extent maps.
      One example where we get incorrect results is:
      
          $ cat fiemap-bug.sh
          #!/bin/bash
      
          DEV=/dev/sdj
          MNT=/mnt/sdj
      
          mkfs.btrfs -f $DEV
          mount $DEV $MNT
      
          # Create a file with two 256K extents.
          # Since there is no other write activity, they will be contiguous,
          # and their extent maps merged, despite having two distinct extents.
          xfs_io -f -c "pwrite -S 0xab 0 256K" \
                    -c "fsync" \
                    -c "pwrite -S 0xcd 256K 256K" \
                    -c "fsync" \
                    $MNT/foo
      
          # Now clone only the second extent into another file.
          xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
      
          # Filefrag will report a single 512K extent, and say it's not shared.
          echo
          filefrag -v $MNT/foo
      
          umount $MNT
      
      Running the reproducer:
      
          $ ./fiemap-bug.sh
          wrote 262144/262144 bytes at offset 0
          256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
          wrote 262144/262144 bytes at offset 262144
          256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
          linked 262144/262144 bytes at offset 0
          256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
      
          Filesystem type is: 9123683e
          File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
           ext:     logical_offset:        physical_offset: length:   expected: flags:
             0:        0..     127:       3328..      3455:    128:             last,eof
          /mnt/sdj/foo: 1 extent found
      
      We end up reporting that we have a single 512K that is not shared, however
      we have two 256K extents, and the second one is shared. Changing the
      reproducer to clone instead the first extent into file 'bar', makes us
      report a single 512K extent that is shared, which is algo incorrect since
      we have two 256K extents and only the first one is shared.
      
      This is z problem that existed before this change, and remains after this
      change, as it can't be easily fixed. The next patch in the series reworks
      fiemap to primarily use file extent items instead of extent maps (except
      for checking for delalloc ranges), with the goal of improving its
      scalability and performance, but it also ends up fixing this particular
      bug caused by extent map merging.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b8f164e3
    • F
      btrfs: speedup checking for extent sharedness during fiemap · 12a824dc
      Filipe Manana 提交于
      One of the most expensive tasks performed during fiemap is to check if
      an extent is shared. This task has two major steps:
      
      1) Check if the data extent is shared. This implies checking the extent
         item in the extent tree, checking delayed references, etc. If we
         find the data extent is directly shared, we terminate immediately;
      
      2) If the data extent is not directly shared (its extent item has a
         refcount of 1), then it may be shared if we have snapshots that share
         subtrees of the inode's subvolume b+tree. So we check if the leaf
         containing the file extent item is shared, then its parent node, then
         the parent node of the parent node, etc, until we reach the root node
         or we find one of them is shared - in which case we stop immediately.
      
      During fiemap we process the extents of a file from left to right, from
      file offset 0 to EOF. This means that we iterate b+tree leaves from left
      to right, and has the implication that we keep repeating that second step
      above several times for the same b+tree path of the inode's subvolume
      b+tree.
      
      For example, if we have two file extent items in leaf X, and the path to
      leaf X is A -> B -> C -> X, then when we try to determine if the data
      extent referenced by the first extent item is shared, we check if the data
      extent is shared - if it's not, then we check if leaf X is shared, if not,
      then we check if node C is shared, if not, then check if node B is shared,
      if not than check if node A is shared. When we move to the next file
      extent item, after determining the data extent is not shared, we repeat
      the checks for X, C, B and A - doing all the expensive searches in the
      extent tree, delayed refs, etc. If we have thousands of tile extents, then
      we keep repeating the sharedness checks for the same paths over and over.
      
      On a file that has no shared extents or only a small portion, it's easy
      to see that this scales terribly with the number of extents in the file
      and the sizes of the extent and subvolume b+trees.
      
      This change eliminates the repeated sharedness check on extent buffers
      by caching the results of the last path used. The results can be used as
      long as no snapshots were created since they were cached (for not shared
      extent buffers) or no roots were dropped since they were cached (for
      shared extent buffers). This greatly reduces the time spent by fiemap for
      files with thousands of extents and/or large extent and subvolume b+trees.
      
      Example performance test:
      
          $ cat fiemap-perf-test.sh
          #!/bin/bash
      
          DEV=/dev/sdi
          MNT=/mnt/sdi
      
          mkfs.btrfs -f $DEV
          mount -o compress=lzo $DEV $MNT
      
          # 40G gives 327680 128K file extents (due to compression).
          xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
      
          umount $MNT
          mount -o compress=lzo $DEV $MNT
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata not cached)"
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata cached)"
      
          umount $MNT
      
      Before this patch:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 3597 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 2107 milliseconds (metadata cached)
      
      After this patch:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 1646 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 698 milliseconds (metadata cached)
      
      That's about 2.2x faster when no metadata is cached, and about 3x faster
      when all metadata is cached. On a real filesystem with many other files,
      data, directories, etc, the b+trees will be 2 or 3 levels higher,
      therefore this optimization will have a higher impact.
      
      Several reports of a slow fiemap show up often, the two Link tags below
      refer to two recent reports of such slowness. This patch, together with
      the next ones in the series, is meant to address that.
      
      Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
      Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      12a824dc
    • F
      btrfs: rename btrfs_check_shared() to a more descriptive name · 8eedadda
      Filipe Manana 提交于
      The function btrfs_check_shared() is supposed to be used to check if a
      data extent is shared, but its name is too generic, may easily cause
      confusion in the sense that it may be used for metadata extents.
      
      So rename it to btrfs_is_data_extent_shared(), which will also make it
      less confusing after the next change that adds a backref lookup cache for
      the b+tree nodes that lead to the leaf that contains the file extent item
      that points to the target data extent.
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8eedadda
  9. 25 7月, 2022 1 次提交