1. 26 Sep, 2022 3 commits
  2. 25 Jul, 2022 4 commits
  3. 18 May, 2022 1 commit
    • btrfs: allow defrag to convert inline extents to regular extents · d8101a0c
      Authored by Qu Wenruo
      Btrfs defaults to max_inline=2K to make small writes inlined into
      metadata.
      
      The default value used to always be a win: even though
      DUP/RAID1/RAID10 double the metadata usage, an inlined extent should
      still use less physical space than a 4K regular extent.
      
      But since the introduction of RAID1C3 and RAID1C4 this is no longer the
      case. Users may find inlined extents wasting too much space, and want
      to convert those inlined extents back to regular extents.
      
      Unfortunately defrag unconditionally skips all inline extents, even
      when the user is trying to convert them back to regular extents.
      
      So this patch will add a small exception for defrag_collect_targets() to
      allow defragging inline extents, if and only if the inlined extents are
      larger than max_inline, allowing users to convert them to regular ones.
      
      This also allows us to defrag extents like the following:
      
      	item 6 key (257 EXTENT_DATA 0) itemoff 15794 itemsize 69
      		generation 7 type 0 (inline)
      		inline extent data size 48 ram_bytes 4096 compression 1 (zlib)
      	item 7 key (257 EXTENT_DATA 4096) itemoff 15741 itemsize 53
      		generation 7 type 1 (regular)
      		extent data disk byte 13631488 nr 4096
      		extent data offset 0 nr 16384 ram 16384
      		extent compression 1 (zlib)
      
      Previously we were unable to do any defrag, since the first extent is
      inlined, and the second one has no extent to merge with.
      
      Now we can defrag it to just one single extent, saving 48 bytes of
      metadata space.
      
      	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
      		generation 8 type 1 (regular)
      		extent data disk byte 13635584 nr 4096
      		extent data offset 0 nr 20480 ram 20480
      		extent compression 1 (zlib)
      
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
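The exception described in the commit above can be illustrated with a small Python model. This is only a sketch based on the commit message, not the kernel code: the function name `skip_inline_extent` and its parameters are hypothetical, and the 2K default is the `max_inline` value quoted in the message.

```python
# Simplified model of the inline-extent exception; not kernel code.
MAX_INLINE_DEFAULT = 2048  # max_inline=2K, the default quoted in the commit message


def skip_inline_extent(is_inline: bool, extent_len: int,
                       max_inline: int = MAX_INLINE_DEFAULT) -> bool:
    """Return True when defrag should skip the extent.

    An inline extent is skipped only if it would still fit under the
    current max_inline limit; larger inline extents become defrag
    targets so that they can be rewritten as regular extents.
    """
    return is_inline and extent_len <= max_inline
```

In this model, the 4096-byte inline extent from the dump above is no longer skipped under the 2K default, while a 1K inline extent still is.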
  4. 16 May, 2022 8 commits
  5. 10 May, 2022 1 commit
  6. 25 Apr, 2022 1 commit
  7. 18 Apr, 2022 2 commits
  8. 06 Apr, 2022 1 commit
  9. 25 Mar, 2022 1 commit
    • btrfs: avoid defragging extents whose next extents are not targets · 75a36a7d
      Authored by Qu Wenruo
      [BUG]
      There is a report that autodefrag is defragging a single sector, which
      is a complete waste of IO and does not help defragmentation:
      
         btrfs-cleaner-808 defrag_one_locked_range: root=256 ino=651122 start=0 len=4096
      
      [CAUSE]
      In defrag_collect_targets(), we check if the current range (A) can be merged
      with next one (B).
      
      If mergeable, we will add range A into target for defrag.
      
      However there is a catch for autodefrag, when checking mergeability
      against range B, we intentionally pass 0 as @newer_than, hoping to get a
      higher chance to merge with the next extent.
      
      But in the next iteration, range B will be looked up by
      defrag_lookup_extent(), with a non-zero @newer_than.
      
      And if range B is not really newer, it will be rejected directly,
      causing only range A to be defragged, while we expect to defrag both
      range A and B.
      
      [FIX]
      Since the root cause is the difference in check condition of
      defrag_check_next_extent() and defrag_collect_targets(), we fix it by:
      
      1. Pass @newer_than to defrag_check_next_extent()
      2. Pass @extent_thresh to defrag_check_next_extent()
      
      This makes the check between defrag_collect_targets() and
      defrag_check_next_extent() more consistent.
      
      While there are still some minor differences, the remaining checks
      focus on runtime flags like writeback/delalloc, which are mostly
      transient and safe to be checked only in defrag_collect_targets().
      
      Link: https://github.com/btrfs/linux/issues/423#issuecomment-1066981856
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
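The mismatch described in the commit above can be reproduced with a small Python model. It is a sketch built from the commit message alone: the `Extent` class, `collect_targets()` and the `consistent_check` switch are hypothetical names, and only the generation check is modeled.

```python
# Simplified model of the @newer_than mismatch; not kernel code.
from dataclasses import dataclass
from typing import List


@dataclass
class Extent:
    start: int
    length: int
    generation: int


def collect_targets(extents: List[Extent], newer_than: int,
                    consistent_check: bool) -> List[Extent]:
    """Pick extents to defrag, using only the generation check.

    With consistent_check=False the next extent is tested against 0
    (the old behavior); with True it is tested against newer_than.
    """
    targets: List[Extent] = []
    for i, cur in enumerate(extents):
        if cur.generation < newer_than:
            continue  # the current extent itself is too old
        nxt = extents[i + 1] if i + 1 < len(extents) else None
        next_threshold = newer_than if consistent_check else 0
        next_ok = nxt is not None and nxt.generation >= next_threshold
        # An extent can also be added if it directly follows the last target.
        prev_ok = bool(targets) and \
            targets[-1].start + targets[-1].length == cur.start
        if next_ok or prev_ok:
            targets.append(cur)
    return targets
```

With range A at generation 10, range B at generation 5 and `newer_than=8`, the old check selects A alone (a single-sector defrag), while the consistent check selects nothing.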
  10. 14 Mar, 2022 8 commits
  11. 24 Feb, 2022 5 commits
    • btrfs: defrag: don't use merged extent map for their generation check · 199257a7
      Authored by Qu Wenruo
      For extent maps, if they are not compressed extents and are adjacent by
      logical addresses and file offsets, they can be merged into one larger
      extent map.
      
      Such a merged extent map will have the highest generation of all the
      original ones.
      
      But this brings a problem for autodefrag, as it relies on accurate
      extent_map::generation to determine if one extent should be defragged.
      
      For merged extent maps, their higher generation can mark some older
      extents to be defragged while the original extent map doesn't meet the
      minimal generation threshold.
      
      Thus this will cause extra IO.
      
      To solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED,
      to indicate if the extent map is merged from one or more ems.
      
      And for autodefrag, if we find a merged extent map, and its generation
      meets the generation requirement, we just don't use this one, and go
      back to defrag_get_extent() to read extent maps from subvolume trees.
      
      This could cause more read IO, but should result in less defrag data
      being written, so in the long run it should be a win for autodefrag.
      
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
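The generation problem in the commit above can be shown with a short Python model. This is a sketch under stated assumptions: `ExtentMap`, `merge()` and `needs_reread()` are hypothetical names, and the `merged` field only stands in for the EXTENT_FLAG_MERGED flag named in the message.

```python
# Simplified model of merged extent map generations; not kernel code.
from dataclasses import dataclass


@dataclass
class ExtentMap:
    start: int
    length: int
    generation: int
    merged: bool = False  # stands in for EXTENT_FLAG_MERGED


def merge(a: ExtentMap, b: ExtentMap) -> ExtentMap:
    """Merge two adjacent extent maps; the result keeps the higher generation."""
    return ExtentMap(a.start, a.length + b.length,
                     max(a.generation, b.generation), merged=True)


def needs_reread(em: ExtentMap, newer_than: int) -> bool:
    """A merged map that passes the generation check cannot be trusted,
    so the caller should re-read the unmerged extents instead."""
    return em.merged and em.generation >= newer_than
```

For example, merging a generation 5 map with a generation 10 map yields generation 10 for the whole range, so with a threshold of 8 the older half would be defragged unless the maps are re-read individually.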
    • btrfs: defrag: bring back the old file extent search behavior · d5633b0d
      Authored by Qu Wenruo
      For defrag, we don't really want to use btrfs_get_extent() to iterate
      all extent maps of an inode.
      
      The reasons are:
      
      - btrfs_get_extent() can merge extent maps
        And the resulting em has the higher generation of the two, causing
        defrag to mark unnecessary parts of such a merged large extent map.
      
        This in fact can result in extra IO for autodefrag in v5.16+ kernels.
      
        However this patch is not going to completely solve the problem, as
        one can still use read() to trigger extent map reading, and get
        them merged.
      
        The complete solution for the extent map merging generation problem
        will come as a standalone fix.
      
      - btrfs_get_extent() caches the extent map result
        Normally it's fine, but for defrag the target range may not get
        another read/write for a long long time.
        Such cache would only increase the memory usage.
      
      - btrfs_get_extent() doesn't skip older extent maps
        Unlike the old find_new_extent(), which uses btrfs_search_forward() to
        skip older subtrees, it will pick up unnecessary extent maps.
      
      This patch will fix the regression by introducing defrag_get_extent() to
      replace the btrfs_get_extent() call.
      
      This helper will:
      
      - Not cache the file extent we found
        It will search the file extent and manually convert it to em.
      
      - Use btrfs_search_forward() to skip entire ranges which were modified
        in the past
      
      This should reduce the IO for autodefrag.
      
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defrag: remove an ambiguous condition for rejection · 550f133f
      Authored by Qu Wenruo
      From the very beginning of btrfs defrag, there is a check to reject
      extents which meet both conditions:
      
      - Physically adjacent
      
        We may want to defrag physically adjacent extents to reduce the number
        of extents or the size of subvolume tree.
      
      - Larger than 128K
      
        This may be there for compressed extents, but unfortunately 128K is
        exactly the max capacity for compressed extents.
        And the check is > 128K, thus it never rejects compressed extents.
      
        Furthermore, the compressed extent capacity bug is fixed by the
        previous patch, so there is no reason for that check anymore.
      
      The original check only rejects a very small range (the target extent
      size is > 128K, and the default extent threshold is 256K), and for
      compressed extents it doesn't work at all.
      
      So it's better just to remove the rejection, and allow us to defrag
      physically adjacent extents.
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defrag: don't defrag extents which are already at max capacity · 979b25c3
      Authored by Qu Wenruo
      [BUG]
      For compressed extents, defrag ioctl will always try to defrag any
      compressed extents, wasting not only IO but also CPU time to
      compress/decompress:
      
         mkfs.btrfs -f $DEV
         mount -o compress $DEV $MNT
         xfs_io -f -c "pwrite -S 0xab 0 128K" $MNT/foobar
         sync
         xfs_io -f -c "pwrite -S 0xcd 128K 128K" $MNT/foobar
         sync
         echo "=== before ==="
         xfs_io -c "fiemap -v" $MNT/foobar
         btrfs filesystem defrag $MNT/foobar
         sync
         echo "=== after ==="
         xfs_io -c "fiemap -v" $MNT/foobar
      
      Then it shows that the two 128K extents just get COWed for no benefit,
      with extra IO/CPU spent:
      
          === before ===
          /mnt/btrfs/file1:
           EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
             0: [0..255]:        26624..26879       256   0x8
             1: [256..511]:      26632..26887       256   0x9
          === after ===
          /mnt/btrfs/file1:
           EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
             0: [0..255]:        26640..26895       256   0x8
             1: [256..511]:      26648..26903       256   0x9
      
      This affects not only v5.16 (after the defrag rework), but also v5.15
      (before the defrag rework).
      
      [CAUSE]
      From the very beginning, btrfs defrag never checks if one extent is
      already at its max capacity (128K for compressed extents, 128M
      otherwise).
      
      And the default extent size threshold is 256K, which is already beyond
      the compressed extent max size.
      
      This means that by default the btrfs defrag ioctl will mark all
      compressed extents which are not adjacent to a hole/preallocated range
      for defrag.
      
      [FIX]
      Introduce a helper to grab the maximum extent size, and then in
      defrag_collect_targets() and defrag_check_next_extent(), reject extents
      which are already at their max capacity.
      
      Reported-by: Filipe Manana <fdmanana@suse.com>
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
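The max-capacity check described in the commit above can be sketched in Python. The limits (128K for compressed extents, 128M otherwise) and the 256K default threshold come from the commit message; the function names are hypothetical and the model ignores every other defrag condition.

```python
# Simplified model of the max-capacity rejection; not kernel code.
KIB = 1024
MAX_COMPRESSED_EXTENT = 128 * KIB          # max capacity for compressed extents
MAX_REGULAR_EXTENT = 128 * KIB * KIB       # 128M for everything else
DEFAULT_EXTENT_THRESH = 256 * KIB          # default extent size threshold


def extent_max_capacity(compressed: bool) -> int:
    """Return the largest size an extent of this kind can reach."""
    return MAX_COMPRESSED_EXTENT if compressed else MAX_REGULAR_EXTENT


def is_defrag_target(length: int, compressed: bool,
                     extent_thresh: int = DEFAULT_EXTENT_THRESH) -> bool:
    """Reject extents that are large enough already or cannot grow further."""
    if length >= extent_thresh:
        return False
    if length >= extent_max_capacity(compressed):
        return False  # already at max capacity, rewriting gains nothing
    return True
```

Under this model the two 128K compressed extents from the reproducer are no longer targets, while a 64K compressed extent still is.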
    • btrfs: defrag: don't try to merge regular extents with preallocated extents · 7093f152
      Authored by Qu Wenruo
      [BUG]
      With older kernels (before v5.16), btrfs will defrag preallocated
      extents. With newer kernels (v5.16 and newer) btrfs will not defrag
      preallocated extents, but it will defrag the extent just before the
      preallocated extent, even if it's just a single sector.
      
      This can be exposed by the following small script:
      
      	mkfs.btrfs -f $dev > /dev/null
      
      	mount $dev $mnt
      	xfs_io -f -c "pwrite 0 4k" -c sync -c "falloc 4k 16K" $mnt/file
      	xfs_io -c "fiemap -v" $mnt/file
      	btrfs fi defrag $mnt/file
      	sync
      	xfs_io -c "fiemap -v" $mnt/file
      
      The output looks like this on older kernels:
      
      /mnt/btrfs/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..7]:          26624..26631         8   0x0
         1: [8..39]:         26632..26663        32 0x801
      /mnt/btrfs/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..39]:         26664..26703        40   0x1
      
      This defrags the single sector along with the preallocated extent, and
      replaces them with a regular extent at a new location (caused by data
      COW).
      This wastes most of the data IO just for the preallocated range.
      
      On the other hand, v5.16 is slightly better:
      
      /mnt/btrfs/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..7]:          26624..26631         8   0x0
         1: [8..39]:         26632..26663        32 0x801
      /mnt/btrfs/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..7]:          26664..26671         8   0x0
         1: [8..39]:         26632..26663        32 0x801
      
      The preallocated range is not defragged, but the sector before it still
      gets defragged, which is unnecessary.
      
      [CAUSE]
      One of the functions reused by the old and new behavior is
      defrag_check_next_extent(). It determines if we should defrag the
      current extent by checking the next one.
      
      It only checks if the next extent is a hole or inlined, but it doesn't
      check if it's preallocated.
      
      On the other hand, outside of this function, both old and new kernels
      will reject preallocated extents.
      
      This inconsistency causes the behavior above.
      
      [FIX]
      - Also check if next extent is preallocated
        If so, don't defrag current extent.
      
      - Add comments for each branch why we reject the extent
      
      This will reduce the IO caused by defrag ioctl and autodefrag.
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
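The next-extent check after this fix can be summarized with a small Python model. It is a sketch based on the commit message only: `ExtentKind` and `next_extent_mergeable()` are hypothetical names, and the real function also looks at other properties that are left out here.

```python
# Simplified model of the next-extent check; not kernel code.
from enum import Enum, auto
from typing import Optional


class ExtentKind(Enum):
    REGULAR = auto()
    HOLE = auto()
    INLINE = auto()
    PREALLOC = auto()


def next_extent_mergeable(next_kind: Optional[ExtentKind]) -> bool:
    """Decide whether the current extent could merge with the next one."""
    if next_kind is None:
        return False  # no next extent at all
    if next_kind in (ExtentKind.HOLE, ExtentKind.INLINE):
        return False  # already rejected before the fix
    if next_kind is ExtentKind.PREALLOC:
        return False  # the check added by this commit
    return True
```

In the reproducer above, the 4K regular extent is followed by a preallocated extent, so in this model it is not considered mergeable and stays where it is.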
  12. 16 Feb, 2022 1 commit
    • btrfs: defrag: allow defrag_one_cluster() to skip large extent which is not a target · 966d879b
      Authored by Qu Wenruo
      In the rework of btrfs_defrag_file(), we always call
      defrag_one_cluster() and increase the offset by cluster size, which is
      only 256K.
      
      But there are cases where we have a large extent (e.g. 128M) which
      doesn't need to be defragged at all.
      
      Before the refactor, we could directly skip the range, but now we have
      to scan that extent map again and again until the cluster moves past
      the non-target extent.
      
      Fix the problem by allowing defrag_one_cluster() to increase
      btrfs_defrag_ctrl::last_scanned to the end of an extent, if and only if
      the last extent of the cluster is not a target.
      
      The test script looks like this:
      
      	mkfs.btrfs -f $dev > /dev/null
      
      	mount $dev $mnt
      
      	# As btrfs ioctl uses 32M as extent_threshold
      	xfs_io -f -c "pwrite 0 64M" $mnt/file1
      	sync
      	# Some fragmented range to defrag
      	xfs_io -s -c "pwrite 65548k 4k" \
      		  -c "pwrite 65544k 4k" \
      		  -c "pwrite 65540k 4k" \
      		  -c "pwrite 65536k 4k" \
      		  $mnt/file1
      	sync
      
      	echo "=== before ==="
      	xfs_io -c "fiemap -v" $mnt/file1
      	echo "=== after ==="
      	btrfs fi defrag $mnt/file1
      	sync
      	xfs_io -c "fiemap -v" $mnt/file1
      	umount $mnt
      
      With extra ftrace put into defrag_one_cluster(), before the patch it
      would result in tons of loops:
      
      (As defrag_one_cluster() is inlined, the function name is its caller)
      
        btrfs-126062  [005] .....  4682.816026: btrfs_defrag_file: r/i=5/257 start=0 len=262144
        btrfs-126062  [005] .....  4682.816027: btrfs_defrag_file: r/i=5/257 start=262144 len=262144
        btrfs-126062  [005] .....  4682.816028: btrfs_defrag_file: r/i=5/257 start=524288 len=262144
        btrfs-126062  [005] .....  4682.816028: btrfs_defrag_file: r/i=5/257 start=786432 len=262144
        btrfs-126062  [005] .....  4682.816028: btrfs_defrag_file: r/i=5/257 start=1048576 len=262144
        ...
        btrfs-126062  [005] .....  4682.816043: btrfs_defrag_file: r/i=5/257 start=67108864 len=262144
      
      But with this patch there will be just one loop, then directly to the
      end of the extent:
      
        btrfs-130471  [014] .....  5434.029558: defrag_one_cluster: r/i=5/257 start=0 len=262144
        btrfs-130471  [014] .....  5434.029559: defrag_one_cluster: r/i=5/257 start=67108864 len=16384
      
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
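The effect of the skip described in the commit above can be estimated with a small Python model. This is a sketch, not the kernel loop: `count_cluster_scans()` and the `(start, length, is_target)` tuples are hypothetical, and the layout mirrors the test script (one 64M non-target extent followed by four 4K target extents).

```python
# Simplified model of the cluster scan loop; not kernel code.
from typing import List, Tuple

CLUSTER_SIZE = 256 * 1024  # cluster size quoted in the commit message


def count_cluster_scans(file_size: int,
                        extents: List[Tuple[int, int, bool]],
                        skip_non_target: bool) -> int:
    """Count how many clusters are scanned for a file.

    extents holds (start, length, is_target) tuples sorted by start.
    With skip_non_target=True the cursor jumps to the end of the last
    extent in a cluster when that extent is not a defrag target.
    """
    scans = 0
    cur = 0
    while cur < file_size:
        scans += 1
        cluster_end = min(cur + CLUSTER_SIZE, file_size)
        next_cur = cluster_end
        if skip_non_target:
            last_byte = cluster_end - 1
            for start, length, is_target in extents:
                if start <= last_byte < start + length:
                    if not is_target:
                        next_cur = max(cluster_end, start + length)
                    break
        cur = next_cur
    return scans
```

For this layout the model scans one cluster per 256K of the file without the skip, and only two clusters with it, which matches the shape of the two traces above.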
  13. 10 Feb, 2022 2 commits
  14. 31 Jan, 2022 2 commits
    • btrfs: fix use of uninitialized variable at rm device ioctl · 37b45995
      Authored by Tom Rix
      Clang static analysis reports this problem:
      ioctl.c:3333:8: warning: 3rd function call argument is an
        uninitialized value
          ret = exclop_start_or_cancel_reloc(fs_info,
      
      cancel is only set in one branch of an if-check and is always used.  So
      initialize it to false.
      
      Fixes: 1a15eb72 ("btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Tom Rix <trix@redhat.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix use-after-free after failure to create a snapshot · 28b21c55
      Authored by Filipe Manana
      At ioctl.c:create_snapshot(), we allocate a pending snapshot structure and
      then attach it to the transaction's list of pending snapshots. After that
      we call btrfs_commit_transaction(), and if that returns an error we jump
      to 'fail' label, where we kfree() the pending snapshot structure. This can
      result in a later use-after-free of the pending snapshot:
      
      1) We allocated the pending snapshot and added it to the transaction's
         list of pending snapshots;
      
      2) We call btrfs_commit_transaction(), and it fails either at the first
         call to btrfs_run_delayed_refs() or btrfs_start_dirty_block_groups().
         In both cases, we don't abort the transaction and we release our
         transaction handle. We jump to the 'fail' label and free the pending
         snapshot structure. We return with the pending snapshot still in the
         transaction's list;
      
      3) Another task commits the transaction. This time there's no error at
         all, and then during the transaction commit it accesses a pointer
         to the pending snapshot structure that the snapshot creation task
         has already freed, resulting in a use-after-free.
      
      This issue could actually be detected by smatch, which produced the
      following warning:
      
        fs/btrfs/ioctl.c:843 create_snapshot() warn: '&pending_snapshot->list' not removed from list
      
      So fix this by not having the snapshot creation ioctl directly add the
      pending snapshot to the transaction's list. Instead add the pending
      snapshot to the transaction handle, and then at btrfs_commit_transaction()
      we add the snapshot to the list only when we can guarantee that any error
      returned after that point will result in a transaction abort, in which
      case the ioctl code can safely free the pending snapshot and no one can
      access it anymore.
      
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>