1. 16 May 2022, 5 commits
    • btrfs: move common inode creation code into btrfs_create_new_inode() · caae78e0
      Omar Sandoval committed
      All of our inode creation code paths duplicate the calls to
      btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation
      additionally duplicates property inheritance and the call to
      btrfs_set_inode_index(). Fix this by moving the common code into
      btrfs_create_new_inode(). This accomplishes a few things at once:
      
      1. It reduces code duplication.
      
      2. It allows us to set up the inode completely before inserting the
         inode item, removing calls to btrfs_update_inode().
      
      3. It fixes a leak of an inode on disk in some error cases. For example,
         in btrfs_create(), if btrfs_new_inode() succeeds, then we have
         inserted an inode item and its inode ref. However, if something after
         that fails (e.g., btrfs_init_inode_security()), then we end the
         transaction and then decrement the link count on the inode. If the
         transaction is committed and the system crashes before the failed
         inode is deleted, then we leak that inode on disk. Instead, this
         refactoring aborts the transaction when we can't recover more
         gracefully.
      
      4. It exposes various ways that subvolume creation diverges from mkdir
         in terms of inheriting flags, properties, permissions, and POSIX
         ACLs, a lot of which appears to be accidental. This patch explicitly
         does _not_ change the existing non-standard behavior, but it makes
         those differences more clear in the code and documents them so that
         we can discuss whether they should be changed.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      caae78e0
    • btrfs: reserve correct number of items for inode creation · 3538d68d
      Omar Sandoval committed
      The various inode creation code paths do not account for the compression
      property, POSIX ACLs, or the parent inode item when starting a
      transaction. Fix it by refactoring all of these code paths to use a new
      function, btrfs_new_inode_prepare(), which computes the correct number
      of items. To do so, it needs to know whether POSIX ACLs will be created,
      so move the ACL creation into that function. To reduce the number of
      arguments that need to be passed around for inode creation, define
      struct btrfs_new_inode_args containing all of the relevant information.
      
      btrfs_new_inode_prepare() will also be a good place to set up the
      fscrypt context and encrypted filename in the future.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3538d68d
    • btrfs: allocate inode outside of btrfs_new_inode() · a1fd0c35
      Omar Sandoval committed
      Instead of calling new_inode() and inode_init_owner() inside of
      btrfs_new_inode(), do it in the callers. This allows us to pass in just
      the inode instead of the mnt_userns and mode and removes the need for
      memalloc_nofs_{save,restores}() since we do it before starting a
      transaction. In create_subvol(), it also means we no longer have to look
      up the inode again to instantiate it. This also paves the way for some
      more cleanups in later patches.
      
      This also removes the comments about Smack checking i_op, which are no
      longer true since commit 5d6c3191 ("xattr: Add
      __vfs_{get,set,remove}xattr helpers"). Now it checks inode->i_opflags &
      IOP_XATTR, which is set based on sb->s_xattr.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a1fd0c35
    • btrfs: remove redundant name and name_len parameters to create_subvol · 70dc55f4
      Omar Sandoval committed
      The passed dentry already contains the name.
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      70dc55f4
    • btrfs: fix anon_dev leak in create_subvol() · 2256e901
      Omar Sandoval committed
      When btrfs_qgroup_inherit(), btrfs_alloc_tree_block, or
      btrfs_insert_root() fail in create_subvol(), we return without freeing
      anon_dev. Reorganize the error handling in create_subvol() to fix this.
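
      As a generic illustration of the goto-unwind shape this reorganization
      moves toward (the names below are stand-ins, not the actual btrfs code):

        #include <stdlib.h>

        struct res { int dummy; };

        static struct res *acquire_res(void) { return calloc(1, sizeof(struct res)); }
        static int fallible_step(void) { return -1; }    /* pretend it fails */

        static int create_thing(void)
        {
            struct res *anon = acquire_res();    /* must be released on every path */
            int ret;

            if (!anon)
                return -1;

            ret = fallible_step();
            if (ret)
                goto out;    /* the old code returned here directly, leaking anon */

            /* ... more fallible steps, each jumping to out on error ... */
        out:
            free(anon);      /* single exit point releases the resource */
            return ret;
        }

        int main(void) { return create_thing() ? 1 : 0; }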
      Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2256e901
  2. 06 April 2022, 1 commit
  3. 25 March 2022, 1 commit
    • btrfs: avoid defragging extents whose next extents are not targets · 75a36a7d
      Qu Wenruo committed
      [BUG]
      There is a report that autodefrag is defragging a single sector, which
      is a complete waste of IO and does not help defragmentation at all:
      
         btrfs-cleaner-808 defrag_one_locked_range: root=256 ino=651122 start=0 len=4096
      
      [CAUSE]
      In defrag_collect_targets(), we check if the current range (A) can be merged
      with the next one (B).
      
      If mergeable, we will add range A into target for defrag.
      
      However there is a catch for autodefrag, when checking mergeability
      against range B, we intentionally pass 0 as @newer_than, hoping to get a
      higher chance to merge with the next extent.
      
      But in the next iteration, range B will be looked up by defrag_lookup_extent(),
      with a non-zero @newer_than.

      And if range B is not really newer, it will be rejected directly, causing
      only range A to be defragged, while we expect to defrag both ranges A and
      B.
      
      [FIX]
      Since the root cause is the difference in check condition of
      defrag_check_next_extent() and defrag_collect_targets(), we fix it by:
      
      1. Pass @newer_than to defrag_check_next_extent()
      2. Pass @extent_thresh to defrag_check_next_extent()
      
      This makes the check between defrag_collect_targets() and
      defrag_check_next_extent() more consistent.
      
      While there are still some minor differences, the remaining checks
      focus on runtime flags like writeback/delalloc, which are mostly
      transient and safe to check only in defrag_collect_targets().
      
      Link: https://github.com/btrfs/linux/issues/423#issuecomment-1066981856
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      75a36a7d
  4. 14 March 2022, 8 commits
  5. 24 February 2022, 5 commits
    • btrfs: defrag: don't use merged extent map for their generation check · 199257a7
      Qu Wenruo committed
      For extent maps, if they are not compressed extents and are adjacent by
      logical addresses and file offsets, they can be merged into one larger
      extent map.
      
      Such a merged extent map will have the highest generation of all the
      original ones.
      
      But this brings a problem for autodefrag, as it relies on accurate
      extent_map::generation to determine if one extent should be defragged.
      
      For merged extent maps, the higher generation can mark some older
      extents for defrag even though their original extent maps don't meet
      the minimal generation threshold.
      
      Thus this will cause extra IO.
      
      To solve the problem, introduce a new flag, EXTENT_FLAG_MERGED, to
      indicate whether the extent map was merged from one or more ems.

      And for autodefrag, if we find a merged extent map whose generation
      meets the generation requirement, we just don't use it, and go back
      to defrag_get_extent() to read extent maps from the subvolume tree.

      This could cause more read IO, but should result in less defrag data
      being written, so in the long run it should be a win for autodefrag.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      199257a7
    • btrfs: defrag: bring back the old file extent search behavior · d5633b0d
      Qu Wenruo committed
      For defrag, we don't really want to use btrfs_get_extent() to iterate
      all extent maps of an inode.
      
      The reasons are:
      
      - btrfs_get_extent() can merge extent maps
        And the resulting em has the higher generation of the two, causing defrag
        to mark unnecessary parts of such a merged large extent map.

        This can in fact result in extra IO for autodefrag on v5.16+ kernels.

        However this patch is not going to completely solve the problem, as
        one can still use read() to trigger extent map reading, and get
        them merged.

        The complete solution for the extent map merging generation problem
        will come as a standalone fix.
      
      - btrfs_get_extent() caches the extent map result
        Normally it's fine, but for defrag the target range may not get
        another read/write for a long long time.
        Such a cache would only increase memory usage.
      
      - btrfs_get_extent() doesn't skip older extent maps
        Unlike the old find_new_extent(), which uses btrfs_search_forward() to
        skip older subtrees, it will pick up unnecessary extent maps.
      
      This patch will fix the regression by introducing defrag_get_extent() to
      replace the btrfs_get_extent() call.
      
      This helper will:
      
      - Not cache the file extent we found
        It will search the file extent and manually convert it to em.
      
      - Use btrfs_search_forward() to skip entire ranges which were modified in
        the past
      
      This should reduce the IO for autodefrag.
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d5633b0d
    • btrfs: defrag: remove an ambiguous condition for rejection · 550f133f
      Qu Wenruo committed
      From the very beginning of btrfs defrag, there is a check to reject
      extents which meet both conditions:
      
      - Physically adjacent
      
        We may want to defrag physically adjacent extents to reduce the number
        of extents or the size of subvolume tree.
      
      - Larger than 128K
      
        This may be there for compressed extents, but unfortunately 128K is
        exactly the max capacity for compressed extents.
        And the check is > 128K, thus it never rejects compressed extents.
      
        Furthermore, since the compressed extent capacity bug is fixed by the
        previous patch, there is no reason for that check anymore.
      
      The original check covers only a very small range to reject (the target
      extent size must be > 128K while the default extent threshold is 256K),
      and for compressed extents it doesn't work at all.

      So it's better to just remove the rejection and allow defragging
      physically adjacent extents.
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      550f133f
    • btrfs: defrag: don't defrag extents which are already at max capacity · 979b25c3
      Qu Wenruo committed
      [BUG]
      The defrag ioctl will always try to defrag compressed extents, wasting
      not only IO but also CPU time to compress/decompress:
      
         mkfs.btrfs -f $DEV
         mount -o compress $DEV $MNT
         xfs_io -f -c "pwrite -S 0xab 0 128K" $MNT/foobar
         sync
         xfs_io -f -c "pwrite -S 0xcd 128K 128K" $MNT/foobar
         sync
         echo "=== before ==="
         xfs_io -c "fiemap -v" $MNT/foobar
         btrfs filesystem defrag $MNT/foobar
         sync
         echo "=== after ==="
         xfs_io -c "fiemap -v" $MNT/foobar
      
      Then it shows that the two 128K extents just get COWed for no benefit,
      with extra IO/CPU spent:
      
          === before ===
          /mnt/btrfs/file1:
           EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
             0: [0..255]:        26624..26879       256   0x8
             1: [256..511]:      26632..26887       256   0x9
          === after ===
          /mnt/btrfs/file1:
           EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
             0: [0..255]:        26640..26895       256   0x8
             1: [256..511]:      26648..26903       256   0x9
      
      This affects not only v5.16 (after the defrag rework), but also v5.15
      (before the defrag rework).
      
      [CAUSE]
      From the very beginning, btrfs defrag never checks if one extent is
      already at its max capacity (128K for compressed extents, 128M
      otherwise).
      
      And the default extent size threshold is 256K, which is already beyond
      the compressed extent max size.
      
      This means that, by default, the btrfs defrag ioctl will mark every
      compressed extent which is not adjacent to a hole/preallocated range
      for defrag.
      
      [FIX]
      Introduce a helper to grab the maximum extent size, and then in
      defrag_collect_targets() and defrag_check_next_extent(), reject extents
      which are already at their max capacity.
      Reported-by: Filipe Manana <fdmanana@suse.com>
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      979b25c3
    • btrfs: defrag: don't try to merge regular extents with preallocated extents · 7093f152
      Qu Wenruo committed
      [BUG]
      With older kernels (before v5.16), btrfs will defrag preallocated extents.
      With newer kernels (v5.16 and newer), btrfs will not defrag
      preallocated extents, but it will defrag the extent just before the
      preallocated extent, even if it's just a single sector.
      
      This can be exposed by the following small script:
      
      	mkfs.btrfs -f $dev > /dev/null
      
      	mount $dev $mnt
      	xfs_io -f -c "pwrite 0 4k" -c sync -c "falloc 4k 16K" $mnt/file
      	xfs_io -c "fiemap -v" $mnt/file
      	btrfs fi defrag $mnt/file
      	sync
      	xfs_io -c "fiemap -v" $mnt/file
      
      The output looks like this on older kernels:
      
      /mnt/btrfs/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..7]:          26624..26631         8   0x0
         1: [8..39]:         26632..26663        32 0x801
      /mnt/btrfs/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..39]:         26664..26703        40   0x1
      
      This defrags the single sector along with the preallocated extent, and
      replaces them with a regular extent in a new location (caused by data
      COW).
      This wastes most of the data IO just on the preallocated range.
      
      On the other hand, v5.16 is slightly better:
      
      /mnt/btrfs/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..7]:          26624..26631         8   0x0
         1: [8..39]:         26632..26663        32 0x801
      /mnt/btrfs/file:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..7]:          26664..26671         8   0x0
         1: [8..39]:         26632..26663        32 0x801
      
      The preallocated range is not defragged, but the sector before it still
      gets defragged, which is unnecessary.
      
      [CAUSE]
      One of the functions reused by both the old and new behavior is
      defrag_check_next_extent(); it determines whether we should defrag the
      current extent by checking the next one.

      It only checks if the next extent is a hole or inlined, but it doesn't
      check if it's preallocated.

      On the other hand, outside of that function, both old and new kernels
      reject preallocated extents.

      This inconsistency causes the behavior above.
      
      [FIX]
      - Also check whether the next extent is preallocated
        If so, don't defrag the current extent.
      
      - Add comments for each branch why we reject the extent
      
      This will reduce the IO caused by defrag ioctl and autodefrag.
      
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7093f152
  6. 16 February 2022, 1 commit
    • btrfs: defrag: allow defrag_one_cluster() to skip large extent which is not a target · 966d879b
      Qu Wenruo committed
      In the rework of btrfs_defrag_file(), we always call
      defrag_one_cluster() and increase the offset by cluster size, which is
      only 256K.
      
      But there are cases where we have a large extent (e.g. 128M) which
      doesn't need to be defragged at all.
      
      Before the refactoring, we could directly skip the range, but now we have
      to scan that extent map again and again until the cluster moves past the
      non-target extent.
      
      Fix the problem by allowing defrag_one_cluster() to increase
      btrfs_defrag_ctrl::last_scanned to the end of an extent, if and only if
      the last extent of the cluster is not a target.
      
      The test script looks like this:
      
      	mkfs.btrfs -f $dev > /dev/null
      
      	mount $dev $mnt
      
      	# As btrfs ioctl uses 32M as extent_threshold
      	xfs_io -f -c "pwrite 0 64M" $mnt/file1
      	sync
      	# Some fragmented range to defrag
      	xfs_io -s -c "pwrite 65548k 4k" \
      		  -c "pwrite 65544k 4k" \
      		  -c "pwrite 65540k 4k" \
      		  -c "pwrite 65536k 4k" \
      		  $mnt/file1
      	sync
      
      	echo "=== before ==="
      	xfs_io -c "fiemap -v" $mnt/file1
      	echo "=== after ==="
      	btrfs fi defrag $mnt/file1
      	sync
      	xfs_io -c "fiemap -v" $mnt/file1
      	umount $mnt
      
      With extra ftrace put into defrag_one_cluster(), before the patch it
      would result in tons of loops:
      
      (As defrag_one_cluster() is inlined, the function name is its caller)
      
        btrfs-126062  [005] .....  4682.816026: btrfs_defrag_file: r/i=5/257 start=0 len=262144
        btrfs-126062  [005] .....  4682.816027: btrfs_defrag_file: r/i=5/257 start=262144 len=262144
        btrfs-126062  [005] .....  4682.816028: btrfs_defrag_file: r/i=5/257 start=524288 len=262144
        btrfs-126062  [005] .....  4682.816028: btrfs_defrag_file: r/i=5/257 start=786432 len=262144
        btrfs-126062  [005] .....  4682.816028: btrfs_defrag_file: r/i=5/257 start=1048576 len=262144
        ...
        btrfs-126062  [005] .....  4682.816043: btrfs_defrag_file: r/i=5/257 start=67108864 len=262144
      
      But with this patch there is just one loop, which then jumps directly to
      the end of the extent:
      
        btrfs-130471  [014] .....  5434.029558: defrag_one_cluster: r/i=5/257 start=0 len=262144
        btrfs-130471  [014] .....  5434.029559: defrag_one_cluster: r/i=5/257 start=67108864 len=16384
      
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      966d879b
  7. 10 February 2022, 2 commits
  8. 31 January 2022, 2 commits
    • btrfs: fix use of uninitialized variable at rm device ioctl · 37b45995
      Tom Rix committed
      Clang static analysis reports this problem
      ioctl.c:3333:8: warning: 3rd function call argument is an
        uninitialized value
          ret = exclop_start_or_cancel_reloc(fs_info,
      
      cancel is only set in one branch of an if-check but is always used, so
      initialize it to false.
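
      A minimal stand-alone sketch of this bug class (stand-in names, not the
      btrfs code): a flag assigned in only one branch but read on every path.

        #include <stdbool.h>
        #include <stdio.h>

        static int start_or_cancel(bool cancel)
        {
            return cancel ? 1 : 0;
        }

        int main(int argc, char **argv)
        {
            bool cancel = false;    /* the fix: initialize to false */

            if (argc > 1)           /* only this branch assigns cancel */
                cancel = true;

            /* cancel is read unconditionally; without the initializer above,
             * the argc == 1 path would pass an indeterminate value here */
            printf("%d\n", start_or_cancel(cancel));
            return 0;
        }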
      
      Fixes: 1a15eb72 ("btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Tom Rix <trix@redhat.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      37b45995
    • btrfs: fix use-after-free after failure to create a snapshot · 28b21c55
      Filipe Manana committed
      At ioctl.c:create_snapshot(), we allocate a pending snapshot structure and
      then attach it to the transaction's list of pending snapshots. After that
      we call btrfs_commit_transaction(), and if that returns an error we jump
      to 'fail' label, where we kfree() the pending snapshot structure. This can
      result in a later use-after-free of the pending snapshot:
      
      1) We allocated the pending snapshot and added it to the transaction's
         list of pending snapshots;
      
      2) We call btrfs_commit_transaction(), and it fails either at the first
         call to btrfs_run_delayed_refs() or btrfs_start_dirty_block_groups().
         In both cases, we don't abort the transaction and we release our
         transaction handle. We jump to the 'fail' label and free the pending
         snapshot structure. We return with the pending snapshot still in the
         transaction's list;
      
      3) Another task commits the transaction. This time there's no error at
         all, and then during the transaction commit it accesses a pointer
         to the pending snapshot structure that the snapshot creation task
         has already freed, resulting in a use-after-free.
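
      The hazard pattern, reduced to a stand-alone sketch (generic list code,
      not the actual btrfs structures):

        #include <stdlib.h>

        struct node { struct node *next; };
        struct list { struct node head; };

        static void list_add(struct list *l, struct node *n)
        {
            n->next = l->head.next;
            l->head.next = n;
        }

        int main(void)
        {
            struct list pending = { { NULL } };
            struct node *snap = malloc(sizeof(*snap));

            if (!snap)
                return 1;
            list_add(&pending, snap);    /* entry published on a shared list */

            /* error path: freeing without unlinking first ... */
            free(snap);
            /* ... leaves pending.head.next dangling, so whoever walks the list
             * later (here, the next transaction commit) touches freed memory.
             * The fix defers publishing the entry until no later failure can
             * return to the caller without aborting the transaction. */
            return 0;
        }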
      
      This issue could actually be detected by smatch, which produced the
      following warning:
      
        fs/btrfs/ioctl.c:843 create_snapshot() warn: '&pending_snapshot->list' not removed from list
      
      So fix this by not having the snapshot creation ioctl directly add the
      pending snapshot to the transaction's list. Instead add the pending
      snapshot to the transaction handle, and then at btrfs_commit_transaction()
      we add the snapshot to the list only when we can guarantee that any error
      returned after that point will result in a transaction abort, in which
      case the ioctl code can safely free the pending snapshot and no one can
      access it anymore.
      
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      28b21c55
  9. 25 January 2022, 3 commits
    • btrfs: update writeback index when starting defrag · 27cdfde1
      Filipe Manana committed
      When starting a defrag, we should update the writeback index of the
      inode's mapping in case it currently has a value beyond the start of the
      range we are defragging. This can help performance and often results in
      getting fewer extents after writeback - for example, if the current value
      of the writeback index sits somewhere in the middle of a range that
      gets dirtied by the defrag, then after writeback we can get two smaller
      extents instead of a single, larger extent.
      
      We used to have this before the refactoring in 5.16, but it was removed
      without any reason to do so. Originally it was added in kernel 3.1, by
      commit 2a0f7f57 ("Btrfs: fix recursive auto-defrag"), in order to
      fix a loop with autodefrag resulting in dirtying and writing pages over
      and over, but some testing on current code did not show that happening,
      at least with the test described in that commit.
      
      So add back the behaviour, as at the very least it is a nice to have
      optimization.
      
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      CC: stable@vger.kernel.org # 5.16
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      27cdfde1
    • btrfs: add back missing dirty page rate limiting to defrag · 3c9d31c7
      Filipe Manana committed
      A defrag operation can dirty a lot of pages, especially if operating on
      the entire file or a large file range. Any task dirtying pages should
      periodically call balance_dirty_pages_ratelimited(), as stated in that
      function's comments, otherwise they can leave too many dirty pages in
      the system. This is what we did before the refactoring in 5.16, and
      it should have remained, just like in the buffered write path and
      relocation. So restore that behaviour.
      
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3c9d31c7
    • btrfs: fix deadlock when reserving space during defrag · 0cb5950f
      Filipe Manana committed
      When defragging we can end up collecting a range for defrag that already
      has pages under delalloc (dirty), as long as the respective extent map
      for that range is not mapped to a hole or a prealloc extent and is not
      from an old generation.
      
      Most of the time that is harmless from a functional perspective at
      least, however it can result in a deadlock:
      
      1) At defrag_collect_targets() we find an extent map that meets all
         requirements but there's delalloc for the range it covers, and we add
         its range to list of ranges to defrag;
      
      2) The defrag_collect_targets() function is called at defrag_one_range(),
         after it locked a range that overlaps the range of the extent map;
      
      3) At defrag_one_range(), while the range is still locked, we call
         defrag_one_locked_target() for the range associated to the extent
         map we collected at step 1);
      
      4) Then finally at defrag_one_locked_target() we do a call to
         btrfs_delalloc_reserve_space(), which will reserve data and metadata
         space. If the space reservations can not be satisfied right away, the
         flusher might be kicked in and start flushing delalloc and wait for
         the respective ordered extents to complete. If this happens we will
         deadlock, because both flushing delalloc and finishing an ordered
         extent, requires locking the range in the inode's io tree, which was
         already locked at defrag_collect_targets().
      
      So fix this by skipping extent maps for which there's already delalloc.
      
      Fixes: eb793cf8 ("btrfs: defrag: introduce helper to collect target file extents")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0cb5950f
  10. 24 January 2022, 1 commit
    • fsnotify: invalidate dcache before IN_DELETE event · a37d9a17
      Amir Goldstein committed
      Apparently, there are some applications that use the IN_DELETE event as an
      invalidation mechanism and expect that if they try to open a file with
      the name reported in the delete event, it should not contain the
      content of the deleted file.
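
      For context, a minimal sketch of that usage pattern (illustrative only;
      the watched directory is whatever the application caches):

        #include <stdio.h>
        #include <sys/inotify.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
            char buf[4096];
            int fd = inotify_init1(0);

            if (fd < 0 || argc < 2)
                return 1;
            /* watch e.g. a cache directory for deletions */
            if (inotify_add_watch(fd, argv[1], IN_DELETE) < 0)
                return 1;

            for (;;) {
                ssize_t len = read(fd, buf, sizeof(buf));
                ssize_t off = 0;

                while (off < len) {
                    struct inotify_event *ev = (struct inotify_event *)(buf + off);

                    /* on IN_DELETE the application drops any cached copy and
                     * expects a later open() of ev->name to never return the
                     * deleted file's content */
                    if (ev->len && (ev->mask & IN_DELETE))
                        printf("invalidate %s\n", ev->name);
                    off += sizeof(*ev) + ev->len;
                }
            }
        }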
      
      Commit 49246466 ("fsnotify: move fsnotify_nameremove() hook out of
      d_delete()") moved the fsnotify delete hook before d_delete() so fsnotify
      will have access to a positive dentry.
      
      This allowed a race where opening the deleted file via cached dentry
      is now possible after receiving the IN_DELETE event.
      
      To fix the regression, create a new hook fsnotify_delete() that takes
      the unlinked inode as an argument and use a helper d_delete_notify() to
      pin the inode, so we can pass it to fsnotify_delete() after d_delete().
      
      Backporting hint: this regression is from v5.3. Although patch will
      apply with only trivial conflicts to v5.4 and v5.10, it won't build,
      because fsnotify_delete() implementation is different in each of those
      versions (see fsnotify_link()).
      
      A follow up patch will fix the fsnotify_unlink/rmdir() calls in pseudo
      filesystem that do not need to call d_delete().
      
      Link: https://lore.kernel.org/r/20220120215305.282577-1-amir73il@gmail.com
      Reported-by: Ivan Delalande <colona@arista.com>
      Link: https://lore.kernel.org/linux-fsdevel/YeNyzoDM5hP5LtGW@visor/
      Fixes: 49246466 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()")
      Cc: stable@vger.kernel.org # v5.3+
      Signed-off-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      a37d9a17
  11. 20 January 2022, 4 commits
    • btrfs: defrag: properly update range->start for autodefrag · c080b414
      Qu Wenruo committed
      [BUG]
      After commit 7b508037 ("btrfs: defrag: use defrag_one_cluster() to
      implement btrfs_defrag_file()"), autodefrag no longer properly re-defrags
      a file from the previously finished location.
      
      [CAUSE]
      The recent refactoring of defrag only focuses on defrag ioctl subpage
      support and doesn't take autodefrag into consideration.

      There are two problems involved which prevent autodefrag from restarting
      its scan:
      
      - No range.start update
        Previously, when one defrag target was found, range->start would be
        updated to indicate where the next search should start from.

        But now btrfs_defrag_file() doesn't update it anymore, making
        autodefrag rescan from file offset 0 every time.

        This also makes autodefrag mark the same range dirty again and
        again, causing extra IO.

      - No proper quick exit for defrag_one_cluster()
        Currently if we reach or exceed the @max_sectors limit, we just exit
        defrag_one_cluster() and let the next defrag_one_cluster() call do a
        quick exit.
        This makes @cur increase, leaving no way to properly know which range
        was defragged and which range was skipped.
      
      [FIX]
      The fix involves the following changes:
      
      - Update range->start to the next cluster start
        This is a little different from the old behavior.
        Previously, range->start was updated to the next defrag target.

        But in the end, the behavior should still be pretty much the same,
        as now we skip to the next defrag target inside btrfs_defrag_file().

        Thus if autodefrag decides to re-scan, we still do the skip,
        just at a different time.

      - Make defrag_one_cluster() return >0 to indicate a quick exit
        So that btrfs_defrag_file() can also do a quick exit, without
        increasing @cur to the range end, and reuse @cur to update
        @range->start.

      - Add a comment for btrfs_defrag_file() to mention the range->start update
        Currently only autodefrag utilizes this behavior, as the defrag ioctl
        doesn't set the @max_to_defrag parameter, thus unless interrupted it
        will always try to defrag the whole range.
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c080b414
    • btrfs: defrag: fix wrong number of defragged sectors · 484167da
      Qu Wenruo committed
      [BUG]
      There are users of the autodefrag mount option reporting an obvious
      increase in IO:
      
      > If I compare the write average (in total, I don't have it per process)
      > when taking idle periods on the same machine:
      >     Linux 5.16:
      >         without autodefrag: ~ 10KiB/s
      >         with autodefrag: between 1 and 2MiB/s.
      >
      >     Linux 5.15:
      >         with autodefrag:~ 10KiB/s (around the same as without
      > autodefrag on 5.16)
      
      [CAUSE]
      When the autodefrag mount option is enabled, btrfs_defrag_file() will be
      called with @max_sectors = BTRFS_DEFRAG_BATCH (1024) to limit how many
      sectors we can defrag in one try.

      The number of sectors defragged is then used to determine if we need to
      re-defrag.
      
      But commit b18c3ab2 ("btrfs: defrag: introduce helper to defrag one
      cluster") uses the wrong unit to increase @sectors_defragged, which
      should be in units of sectors, not bytes.

      This means that if we have defragged any sector, then @sectors_defragged
      will be >= sectorsize (normally 4096), which is larger than
      BTRFS_DEFRAG_BATCH.

      This makes the @max_sectors check in defrag_one_cluster() underflow,
      rendering the whole @max_sectors check useless.

      This causes way more IO with the autodefrag mount option, as there is
      now no limit on how many sectors can really be defragged.
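
      A stand-alone sketch of the unit mismatch and the resulting wrap-around
      (the budget subtraction below is a simplified stand-in for the real
      check, not the btrfs code):

        #include <stdint.h>
        #include <stdio.h>

        #define BTRFS_DEFRAG_BATCH 1024    /* budget, in sectors */

        int main(void)
        {
            uint64_t sectorsize = 4096;
            uint64_t max_sectors = BTRFS_DEFRAG_BATCH;
            uint64_t defragged_len = 16 * sectorsize;    /* 64K defragged */
            uint64_t sectors_defragged;

            /* buggy: counted in bytes, 65536 > 1024, so the remaining budget
             * wraps around to a huge value, i.e. no limit at all */
            sectors_defragged = defragged_len;
            printf("bytes unit,   budget left: %llu\n",
                   (unsigned long long)(max_sectors - sectors_defragged));

            /* fixed: counted in sectors, the budget shrinks as expected */
            sectors_defragged = defragged_len / sectorsize;
            printf("sectors unit, budget left: %llu\n",
                   (unsigned long long)(max_sectors - sectors_defragged));
            return 0;
        }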
      
      [FIX]
      Fix the problems by:
      
      - Use sector as unit when increasing @sectors_defragged
      
      - Include @sectors_defragged > @max_sectors case to break the loop
      
      - Add extra comment on the return value of btrfs_defrag_file()
      Reported-by: Anthony Ruhier <aruhier@mailbox.org>
      Fixes: b18c3ab2 ("btrfs: defrag: introduce helper to defrag one cluster")
      Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      484167da
    • btrfs: allow defrag to be interruptible · b767c2fc
      Filipe Manana committed
      During defrag, at btrfs_defrag_file(), we have a loop that iterates
      over a file range in steps no larger than 256K subranges. If the range
      is too long, there's no way to interrupt it. So make the loop check in
      each iteration if there's a signal pending, and if there is, break and
      return -EAGAIN to userspace.
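
      A userspace analogue of the pattern (signal_pending() has no stand-alone
      equivalent, so a SIGINT flag stands in here; the loop body is invented
      for illustration):

        #include <errno.h>
        #include <signal.h>
        #include <stdio.h>
        #include <unistd.h>

        static volatile sig_atomic_t got_signal;

        static void on_sigint(int sig) { (void)sig; got_signal = 1; }

        /* walk a range in 256K steps; bail out as soon as a signal arrived */
        static int defrag_like_loop(unsigned long long start, unsigned long long end)
        {
            for (unsigned long long cur = start; cur < end; cur += 256 * 1024) {
                if (got_signal)
                    return -EAGAIN;    /* same errno the ioctl reports */
                usleep(1000);          /* stand-in for the real work */
            }
            return 0;
        }

        int main(void)
        {
            signal(SIGINT, on_sigint);
            printf("%d\n", defrag_like_loop(0, 1ULL << 30));
            return 0;
        }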
      
      Before kernel 5.16, we used to allow defrag to be cancelled through a
      signal, but that was lost with commit 7b508037 ("btrfs: defrag:
      use defrag_one_cluster() to implement btrfs_defrag_file()").
      
      This change adds back the possibility to cancel a defrag with a signal
      and keeps the same semantics, returning -EAGAIN to user space (and not
      the usually more expected -EINTR).
      
      This is also motivated by a recent bug on 5.16 where defragging a 1 byte
      file resulted in iterating from file range 0 to (u64)-1, as hitting the
      bug triggered a too long loop, basically requiring one to reboot the
      machine, as it was not possible to cancel defrag.
      
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b767c2fc
    • btrfs: fix too long loop when defragging a 1 byte file · 6b34cd8e
      Filipe Manana committed
      When attempting to defrag a file with a single byte, we can end up in a
      too long loop, which is nearly infinite because at btrfs_defrag_file()
      we end up with the variable last_byte assigned with a value of
      18446744073709551615 (which is (u64)-1). The problem comes from the fact
      we end up doing:
      
          last_byte = round_up(last_byte, fs_info->sectorsize) - 1;
      
      So if last_byte was assigned 0, which is i_size - 1, we underflow and
      end up with the value 18446744073709551615.
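
      The wrap-around is plain unsigned arithmetic; a stand-alone sketch
      (round_up() here is a local stand-in for the kernel macro):

        #include <stdint.h>
        #include <stdio.h>

        /* stand-in for the kernel's round_up() for power-of-two alignment */
        #define round_up(x, a) (((x) + (a) - 1) & ~((uint64_t)(a) - 1))

        int main(void)
        {
            uint64_t sectorsize = 4096;
            uint64_t i_size = 1;                 /* a 1 byte file */
            uint64_t last_byte = i_size - 1;     /* 0 */

            /* round_up(0, 4096) is 0, so subtracting 1 wraps to (u64)-1 */
            last_byte = round_up(last_byte, sectorsize) - 1;
            printf("%llu\n", (unsigned long long)last_byte);
            return 0;
        }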
      
      This is trivial to reproduce and the following script triggers it:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        echo -n "X" > $MNT/foobar
      
        btrfs filesystem defragment $MNT/foobar
      
        umount $MNT
      
      So fix this by not decrementing last_byte by 1 before doing the sector
      size round up. Also, to make it easier to follow, make the round up right
      after computing last_byte.
      Reported-by: Anthony Ruhier <aruhier@mailbox.org>
      Fixes: 7b508037 ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()")
      Link: https://lore.kernel.org/linux-btrfs/0a269612-e43f-da22-c5bc-b34b1b56ebe8@mailbox.org/
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6b34cd8e
  12. 07 January 2022, 4 commits
  13. 03 January 2022, 2 commits
  14. 16 December 2021, 1 commit
    • btrfs: fix warning when freeing leaf after subvolume creation failure · 212a58fd
      Filipe Manana committed
      When creating a subvolume, at ioctl.c:create_subvol(), if we fail to
      insert the root item for the new subvolume into the root tree, we can
      trigger the following warning:
      
      [78961.741046] WARNING: CPU: 0 PID: 4079814 at fs/btrfs/extent-tree.c:3357 btrfs_free_tree_block+0x2af/0x310 [btrfs]
      [78961.743344] Modules linked in:
      [78961.749440]  dm_snapshot dm_thin_pool (...)
      [78961.773648] CPU: 0 PID: 4079814 Comm: fsstress Not tainted 5.16.0-rc4-btrfs-next-108 #1
      [78961.775198] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [78961.777266] RIP: 0010:btrfs_free_tree_block+0x2af/0x310 [btrfs]
      [78961.778398] Code: 17 00 48 85 (...)
      [78961.781067] RSP: 0018:ffffaa4001657b28 EFLAGS: 00010202
      [78961.781877] RAX: 0000000000000213 RBX: ffff897f8a796910 RCX: 0000000000000000
      [78961.782780] RDX: 0000000000000000 RSI: 0000000011004000 RDI: 00000000ffffffff
      [78961.783764] RBP: ffff8981f490e800 R08: 0000000000000001 R09: 0000000000000000
      [78961.784740] R10: 0000000000000000 R11: 0000000000000001 R12: ffff897fc963fcc8
      [78961.785665] R13: 0000000000000001 R14: ffff898063548000 R15: ffff898063548000
      [78961.786620] FS:  00007f31283c6b80(0000) GS:ffff8982ace00000(0000) knlGS:0000000000000000
      [78961.787717] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [78961.788598] CR2: 00007f31285c3000 CR3: 000000023fcc8003 CR4: 0000000000370ef0
      [78961.789568] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [78961.790585] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [78961.791684] Call Trace:
      [78961.792082]  <TASK>
      [78961.792359]  create_subvol+0x5d1/0x9a0 [btrfs]
      [78961.793054]  btrfs_mksubvol+0x447/0x4c0 [btrfs]
      [78961.794009]  ? preempt_count_add+0x49/0xa0
      [78961.794705]  __btrfs_ioctl_snap_create+0x123/0x190 [btrfs]
      [78961.795712]  ? _copy_from_user+0x66/0xa0
      [78961.796382]  btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
      [78961.797392]  btrfs_ioctl+0xd1e/0x35c0 [btrfs]
      [78961.798172]  ? __slab_free+0x10a/0x360
      [78961.798820]  ? rcu_read_lock_sched_held+0x12/0x60
      [78961.799664]  ? lock_release+0x223/0x4a0
      [78961.800321]  ? lock_acquired+0x19f/0x420
      [78961.800992]  ? rcu_read_lock_sched_held+0x12/0x60
      [78961.801796]  ? trace_hardirqs_on+0x1b/0xe0
      [78961.802495]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
      [78961.803358]  ? kmem_cache_free+0x321/0x3c0
      [78961.804071]  ? __x64_sys_ioctl+0x83/0xb0
      [78961.804711]  __x64_sys_ioctl+0x83/0xb0
      [78961.805348]  do_syscall_64+0x3b/0xc0
      [78961.805969]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [78961.806830] RIP: 0033:0x7f31284bc957
      [78961.807517] Code: 3c 1c 48 f7 d8 (...)
      
      This is because we are calling btrfs_free_tree_block() on an extent
      buffer that is dirty. Fix that by cleaning the extent buffer, with
      btrfs_clean_tree_block(), before freeing it.
      
      This was triggered by test case generic/475 from fstests.
      
      Fixes: 67addf29 ("btrfs: fix metadata extent leak after failure to create subvolume")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      212a58fd