1. 05 9月, 2022 2 次提交
  2. 24 8月, 2022 4 次提交
    • A
      btrfs: add info when mount fails due to stale replace target · f2c3bec2
      Anand Jain 提交于
      If the replace target device reappears after the suspended replace is
      cancelled, it blocks the mount operation as it can't find the matching
      replace-item in the metadata. As shown below,
      
         BTRFS error (device sda5): replace devid present without an active replace item
      
      To overcome this situation, the user can run the command
      
         btrfs device scan --forget <replace target device>
      
      and try the mount command again. And also, to avoid repeating the issue,
      superblock on the devid=0 must be wiped.
      
         wipefs -a device-path-to-devid=0.
      
      This patch adds some info when this situation occurs.
      Reported-by: NSamuel Greiner <samuel@balkonien.org>
      Link: https://lore.kernel.org/linux-btrfs/b4f62b10-b295-26ea-71f9-9a5c9299d42c@balkonien.org/T/
      CC: stable@vger.kernel.org # 5.0+
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f2c3bec2
    • A
      btrfs: replace: drop assert for suspended replace · 59a39919
      Anand Jain 提交于
      If the filesystem mounts with the replace-operation in a suspended state
      and try to cancel the suspended replace-operation, we hit the assert. The
      assert came from the commit fe97e2e1 ("btrfs: dev-replace: replace's
      scrub must not be running in suspended state") that was actually not
      required. So just remove it.
      
       $ mount /dev/sda5 /btrfs
      
          BTRFS info (device sda5): cannot continue dev_replace, tgtdev is missing
          BTRFS info (device sda5): you may cancel the operation after 'mount -o degraded'
      
       $ mount -o degraded /dev/sda5 /btrfs <-- success.
      
       $ btrfs replace cancel /btrfs
      
          kernel: assertion failed: ret != -ENOTCONN, in fs/btrfs/dev-replace.c:1131
          kernel: ------------[ cut here ]------------
          kernel: kernel BUG at fs/btrfs/ctree.h:3750!
      
      After the patch:
      
       $ btrfs replace cancel /btrfs
      
          BTRFS info (device sda5): suspended dev_replace from /dev/sda5 (devid 1) to <missing disk> canceled
      
      Fixes: fe97e2e1 ("btrfs: dev-replace: replace's scrub must not be running in suspended state")
      CC: stable@vger.kernel.org # 5.0+
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      59a39919
    • F
      btrfs: fix silent failure when deleting root reference · 47bf225a
      Filipe Manana 提交于
      At btrfs_del_root_ref(), if btrfs_search_slot() returns an error, we end
      up returning from the function with a value of 0 (success). This happens
      because the function returns the value stored in the variable 'err',
      which is 0, while the error value we got from btrfs_search_slot() is
      stored in the 'ret' variable.
      
      So fix it by setting 'err' with the error value.
      
      Fixes: 8289ed9f ("btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      47bf225a
    • O
      btrfs: fix space cache corruption and potential double allocations · ced8ecf0
      Omar Sandoval 提交于
      When testing space_cache v2 on a large set of machines, we encountered a
      few symptoms:
      
      1. "unable to add free space :-17" (EEXIST) errors.
      2. Missing free space info items, sometimes caught with a "missing free
         space info for X" error.
      3. Double-accounted space: ranges that were allocated in the extent tree
         and also marked as free in the free space tree, ranges that were
         marked as allocated twice in the extent tree, or ranges that were
         marked as free twice in the free space tree. If the latter made it
         onto disk, the next reboot would hit the BUG_ON() in
         add_new_free_space().
      4. On some hosts with no on-disk corruption or error messages, the
         in-memory space cache (dumped with drgn) disagreed with the free
         space tree.
      
      All of these symptoms have the same underlying cause: a race between
      caching the free space for a block group and returning free space to the
      in-memory space cache for pinned extents causes us to double-add a free
      range to the space cache. This race exists when free space is cached
      from the free space tree (space_cache=v2) or the extent tree
      (nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
      struct btrfs_block_group::last_byte_to_unpin and struct
      btrfs_block_group::progress are supposed to protect against this race,
      but commit d0c2f4fa ("btrfs: make concurrent fsyncs wait less when
      waiting for a transaction commit") subtly broke this by allowing
      multiple transactions to be unpinning extents at the same time.
      
      Specifically, the race is as follows:
      
      1. An extent is deleted from an uncached block group in transaction A.
      2. btrfs_commit_transaction() is called for transaction A.
      3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
         ref for the deleted extent.
      4. __btrfs_free_extent() -> do_free_extent_accounting() ->
         add_to_free_space_tree() adds the deleted extent back to the free
         space tree.
      5. do_free_extent_accounting() -> btrfs_update_block_group() ->
         btrfs_cache_block_group() queues up the block group to get cached.
         block_group->progress is set to block_group->start.
      6. btrfs_commit_transaction() for transaction A calls
         switch_commit_roots(). It sets block_group->last_byte_to_unpin to
         block_group->progress, which is block_group->start because the block
         group hasn't been cached yet.
      7. The caching thread gets to our block group. Since the commit roots
         were already switched, load_free_space_tree() sees the deleted extent
         as free and adds it to the space cache. It finishes caching and sets
         block_group->progress to U64_MAX.
      8. btrfs_commit_transaction() advances transaction A to
         TRANS_STATE_SUPER_COMMITTED.
      9. fsync calls btrfs_commit_transaction() for transaction B. Since
         transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
         commit is for fsync, it advances.
      10. btrfs_commit_transaction() for transaction B calls
          switch_commit_roots(). This time, the block group has already been
          cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
      11. btrfs_commit_transaction() for transaction A calls
          btrfs_finish_extent_commit(), which calls unpin_extent_range() for
          the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
          transaction B!), so it adds the deleted extent to the space cache
          again!
      
      This explains all of our symptoms above:
      
      * If the sequence of events is exactly as described above, when the free
        space is re-added in step 11, it will fail with EEXIST.
      * If another thread reallocates the deleted extent in between steps 7
        and 11, then step 11 will silently re-add that space to the space
        cache as free even though it is actually allocated. Then, if that
        space is allocated *again*, the free space tree will be corrupted
        (namely, the wrong item will be deleted).
      * If we don't catch this free space tree corruption, it will continue
        to get worse as extents are deleted and reallocated.
      
      The v1 space_cache is synchronously loaded when an extent is deleted
      (btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
      with load_cache_only=1), so it is not normally affected by this bug.
      However, as noted above, if we fail to load the space cache, we will
      fall back to caching from the extent tree and may hit this bug.
      
      The easiest fix for this race is to also make caching from the free
      space tree or extent tree synchronous. Josef tested this and found no
      performance regressions.
      
      A few extra changes fall out of this change. Namely, this fix does the
      following, with step 2 being the crucial fix:
      
      1. Factor btrfs_caching_ctl_wait_done() out of
         btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
         that we already hold a reference to.
      2. Change the call in btrfs_cache_block_group() of
         btrfs_wait_space_cache_v1_finished() to
         btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
         space_cache option.
      3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
         space_cache_v1_done().
      4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
         `bool wait` to more accurately describe its new meaning.
      5. Change a few callers which had a separate call to
         btrfs_wait_block_group_cache_done() to use wait = true instead.
      6. Make btrfs_wait_block_group_cache_done() static now that it's not
         used outside of block-group.c anymore.
      
      Fixes: d0c2f4fa ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
      CC: stable@vger.kernel.org # 5.12+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ced8ecf0
  3. 23 8月, 2022 5 次提交
    • J
      btrfs: don't allow large NOWAIT direct reads · 79d3d1d1
      Josef Bacik 提交于
      Dylan and Jens reported a problem where they had an io_uring test that
      was returning short reads, and bisected it to ee5b46a3 ("btrfs:
      increase direct io read size limit to 256 sectors").
      
      The root cause is their test was doing larger reads via io_uring with
      NOWAIT and async.  This was triggering a page fault during the direct
      read, however the first page was able to work just fine and thus we
      submitted a 4k read for a larger iocb.
      
      Btrfs allows for partial IO's in this case specifically because we don't
      allow page faults, and thus we'll attempt to do any io that we can,
      submit what we could, come back and fault in the rest of the range and
      try to do the remaining IO.
      
      However for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
      partial dio is done, which is incorrect.  In the sync case we can exit
      the iomap code, submit more io's, and return with the amount of IO we
      were able to complete successfully.
      
      We were always doing short reads in this case, but for NOWAIT we were
      getting saved by the fact that we were limiting direct reads to
      sectorsize, and if we were larger than that we would return EAGAIN.
      
      Fix the regression by simply returning EAGAIN in the NOWAIT case with
      larger reads, that way io_uring can retry and get the larger IO and have
      the fault logic handle everything properly.
      
      This still leaves the AIO short read case, but that existed before this
      change.  The way to properly fix this would be to handle partial iocb
      completions, but that's a lot of work, for now deal with the regression
      in the most straightforward way possible.
      Reported-by: NDylan Yudaken <dylany@fb.com>
      Fixes: ee5b46a3 ("btrfs: increase direct io read size limit to 256 sectors")
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      79d3d1d1
    • Q
      btrfs: don't merge pages into bio if their page offset is not contiguous · 4a445b7b
      Qu Wenruo 提交于
      [BUG]
      Zygo reported on latest development branch, he could hit
      ASSERT()/BUG_ON() caused crash when doing RAID5 recovery (intentionally
      corrupt one disk, and let btrfs to recover the data during read/scrub).
      
      And The following minimal reproducer can cause extent state leakage at
      rmmod time:
      
        mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null
        mount $dev1 $mnt
        fsstress -w -d $mnt -n 25 -s 1660807876
        sync
        fssum -A -f -w /tmp/fssum.saved $mnt
        umount $mnt
      
        # Wipe the dev1 but keeps its super block
        xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1
        mount $dev1 $mnt
        fssum -r /tmp/fssum.saved $mnt > /dev/null
        umount $mnt
        rmmod btrfs
      
      This will lead to the following extent states leakage:
      
        BTRFS: state leak: start 499712 end 503807 state 5 in tree 1 refs 1
        BTRFS: state leak: start 495616 end 499711 state 5 in tree 1 refs 1
        BTRFS: state leak: start 491520 end 495615 state 5 in tree 1 refs 1
        BTRFS: state leak: start 487424 end 491519 state 5 in tree 1 refs 1
        BTRFS: state leak: start 483328 end 487423 state 5 in tree 1 refs 1
        BTRFS: state leak: start 479232 end 483327 state 5 in tree 1 refs 1
        BTRFS: state leak: start 475136 end 479231 state 5 in tree 1 refs 1
        BTRFS: state leak: start 471040 end 475135 state 5 in tree 1 refs 1
      
      [CAUSE]
      Since commit 7aa51232 ("btrfs: pass a btrfs_bio to
      btrfs_repair_one_sector"), we always use btrfs_bio->file_offset to
      determine the file offset of a page.
      
      But that usage assume that, one bio has all its page having a continuous
      page offsets.
      
      Unfortunately that's not true, btrfs only requires the logical bytenr
      contiguous when assembling its bios.
      
      From above script, we have one bio looks like this:
      
        fssum-27671  submit_one_bio: bio logical=217739264 len=36864
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=466944 <<<
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=724992 <<<
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=729088
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=733184
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=737280
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=741376
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=745472
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=749568
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=753664
      
      Note that the 1st and the 2nd page has non-contiguous page offsets.
      
      This means, at repair time, we will have completely wrong file offset
      passed in:
      
         kworker/u32:2-19927  btrfs_repair_one_sector: r/i=5/261 page_off=729088 file_off=475136 bio_offset=8192
      
      Since the file offset is incorrect, we latter incorrectly set the extent
      states, and no way to really release them.
      
      Thus later it causes the leakage.
      
      In fact, this can be even worse, since the file offset is incorrect, we
      can hit cases like the incorrect file offset belongs to a HOLE, and
      later cause btrfs_num_copies() to trigger error, finally hit
      BUG_ON()/ASSERT() later.
      
      [FIX]
      Add an extra condition in btrfs_bio_add_page() for uncompressed IO.
      
      Now we will have more strict requirement for bio pages:
      
      - They should all have the same mapping
        (the mapping check is already implied by the call chain)
      
      - Their logical bytenr should be adjacent
        This is the same as the old condition.
      
      - Their page_offset() (file offset) should be adjacent
        This is the new check.
        This would result a slightly increased amount of bios from btrfs
        (needs holes and inside the same stripe boundary to trigger).
      
        But this would greatly reduce the confusion, as it's pretty common
        to assume a btrfs bio would only contain continuous page cache.
      
      Later we may need extra cleanups, as we no longer needs to handle gaps
      between page offsets in endio functions.
      
      Currently this should be the minimal patch to fix commit 7aa51232
      ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector").
      Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Fixes: 7aa51232 ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector")
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4a445b7b
    • F
      btrfs: update generation of hole file extent item when merging holes · e6e3dec6
      Filipe Manana 提交于
      When punching a hole into a file range that is adjacent with a hole and we
      are not using the no-holes feature, we expand the range of the adjacent
      file extent item that represents a hole, to save metadata space.
      
      However we don't update the generation of hole file extent item, which
      means a full fsync will not log that file extent item if the fsync happens
      in a later transaction (since commit 7f30c072 ("btrfs: stop copying
      old file extents when doing a full fsync")).
      
      For example, if we do this:
      
          $ mkfs.btrfs -f -O ^no-holes /dev/sdb
          $ mount /dev/sdb /mnt
          $ xfs_io -f -c "pwrite -S 0xab 2M 2M" /mnt/foobar
          $ sync
      
      We end up with 2 file extent items in our file:
      
      1) One that represents the hole for the file range [0, 2M), with a
         generation of 7;
      
      2) Another one that represents an extent covering the range [2M, 4M).
      
      After that if we do the following:
      
          $ xfs_io -c "fpunch 2M 2M" /mnt/foobar
      
      We end up with a single file extent item in the file, which represents a
      hole for the range [0, 4M) and with a generation of 7 - because we end
      dropping the data extent for range [2M, 4M) and then update the file
      extent item that represented the hole at [0, 2M), by increasing
      length from 2M to 4M.
      
      Then doing a full fsync and power failing:
      
          $ xfs_io -c "fsync" /mnt/foobar
          <power failure>
      
      will result in the full fsync not logging the file extent item that
      represents the hole for the range [0, 4M), because its generation is 7,
      which is lower than the generation of the current transaction (8).
      As a consequence, after mounting again the filesystem (after log replay),
      the region [2M, 4M) does not have a hole, it still points to the
      previous data extent.
      
      So fix this by always updating the generation of existing file extent
      items representing holes when we merge/expand them. This solves the
      problem and it's the same approach as when we merge prealloc extents that
      got written (at btrfs_mark_extent_written()). Setting the generation to
      the current transaction's generation is also what we do when merging
      the new hole extent map with the previous one or the next one.
      
      A test case for fstests, covering both cases of hole file extent item
      merging (to the left and to the right), will be sent soon.
      
      Fixes: 7f30c072 ("btrfs: stop copying old file extents when doing a full fsync")
      CC: stable@vger.kernel.org # 5.18+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e6e3dec6
    • Z
      btrfs: fix possible memory leak in btrfs_get_dev_args_from_path() · 9ea0106a
      Zixuan Fu 提交于
      In btrfs_get_dev_args_from_path(), btrfs_get_bdev_and_sb() can fail if
      the path is invalid. In this case, btrfs_get_dev_args_from_path()
      returns directly without freeing args->uuid and args->fsid allocated
      before, which causes memory leak.
      
      To fix these possible leaks, when btrfs_get_bdev_and_sb() fails,
      btrfs_put_dev_args_from_path() is called to clean up the memory.
      Reported-by: NTOTE Robot <oslab@tsinghua.edu.cn>
      Fixes: faa775c4 ("btrfs: add a btrfs_get_dev_args_from_path helper")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: NBoris Burkov <boris@bur.io>
      Signed-off-by: NZixuan Fu <r33s3n6@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9ea0106a
    • G
      btrfs: check if root is readonly while setting security xattr · b5111127
      Goldwyn Rodrigues 提交于
      For a filesystem which has btrfs read-only property set to true, all
      write operations including xattr should be denied. However, security
      xattr can still be changed even if btrfs ro property is true.
      
      This happens because xattr_permission() does not have any restrictions
      on security.*, system.*  and in some cases trusted.* from VFS and
      the decision is left to the underlying filesystem. See comments in
      xattr_permission() for more details.
      
      This patch checks if the root is read-only before performing the set
      xattr operation.
      
      Testcase:
      
        DEV=/dev/vdb
        MNT=/mnt
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
        echo "file one" > $MNT/f1
      
        setfattr -n "security.one" -v 2 $MNT/f1
        btrfs property set /mnt ro true
      
        setfattr -n "security.one" -v 1 $MNT/f1
      
        umount $MNT
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b5111127
  4. 17 8月, 2022 6 次提交
    • J
      btrfs: tree-checker: check for overlapping extent items · 899b7f69
      Josef Bacik 提交于
      We're seeing a weird problem in production where we have overlapping
      extent items in the extent tree.  It's unclear where these are coming
      from, and in debugging we realized there's no check in the tree checker
      for this sort of problem.  Add a check to the tree-checker to make sure
      that the extents do not overlap each other.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      899b7f69
    • F
      btrfs: fix warning during log replay when bumping inode link count · 769030e1
      Filipe Manana 提交于
      During log replay, at add_link(), we may increment the link count of
      another inode that has a reference that conflicts with a new reference
      for the inode currently being processed.
      
      During log replay, at add_link(), we may drop (unlink) a reference from
      some inode in the subvolume tree if that reference conflicts with a new
      reference found in the log for the inode we are currently processing.
      
      After the unlink, If the link count has decreased from 1 to 0, then we
      increment the link count to prevent the inode from being deleted if it's
      evicted by an iput() call, because we may have references to add to that
      inode later on (and we will fixup its link count later during log replay).
      
      However incrementing the link count from 0 to 1 triggers a warning:
      
        $ cat fs/inode.c
        (...)
        void inc_nlink(struct inode *inode)
        {
              if (unlikely(inode->i_nlink == 0)) {
                       WARN_ON(!(inode->i_state & I_LINKABLE));
                       atomic_long_dec(&inode->i_sb->s_remove_count);
              }
        (...)
      
      The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's
      never set during log replay.
      
      Most of the time, the warning isn't triggered even if we dropped the last
      reference of the conflicting inode, and this is because:
      
      1) The conflicting inode was previously marked for fixup, through a call
         to link_to_fixup_dir(), which increments the inode's link count;
      
      2) And the last iput() on the inode has not triggered eviction of the
         inode, nor was eviction triggered after the iput(). So at add_link(),
         even if we unlink the last reference of the inode, its link count ends
         up being 1 and not 0.
      
      So this means that if eviction is triggered after link_to_fixup_dir() is
      called, at add_link() we will read the inode back from the subvolume tree
      and have it with a correct link count, matching the number of references
      it has on the subvolume tree. So if when we are at add_link() the inode
      has exactly one reference only, its link count is 1, and after the unlink
      its link count becomes 0.
      
      So fix this by using set_nlink() instead of inc_nlink(), as the former
      accepts a transition from 0 to 1 and it's what we use in other similar
      contexts (like at link_to_fixup_dir().
      
      Also make add_inode_ref() use set_nlink() instead of inc_nlink() to
      bump the link count from 0 to 1.
      
      The warning is actually harmless, but it may scare users. Josef also ran
      into it recently.
      
      CC: stable@vger.kernel.org # 5.1+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      769030e1
    • F
      btrfs: fix lost error handling when looking up extended ref on log replay · 7a6b75b7
      Filipe Manana 提交于
      During log replay, when processing inode references, if we get an error
      when looking up for an extended reference at __add_inode_ref(), we ignore
      it and proceed, returning success (0) if no other error happens after the
      lookup. This is obviously wrong because in case an extended reference
      exists and it encodes some name not in the log, we need to unlink it,
      otherwise the filesystem state will not match the state it had after the
      last fsync.
      
      So just make __add_inode_ref() return an error it gets from the extended
      reference lookup.
      
      Fixes: f186373f ("btrfs: extended inode refs")
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7a6b75b7
    • J
      btrfs: fix lockdep splat with reloc root extent buffers · b40130b2
      Josef Bacik 提交于
      We have been hitting the following lockdep splat with btrfs/187 recently
      
        WARNING: possible circular locking dependency detected
        5.19.0-rc8+ #775 Not tainted
        ------------------------------------------------------
        btrfs/752500 is trying to acquire lock:
        ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        but task is already holding lock:
        ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_init_new_buffer+0x7d/0x2c0
      	 btrfs_alloc_tree_block+0x120/0x3b0
      	 __btrfs_cow_block+0x136/0x600
      	 btrfs_cow_block+0x10b/0x230
      	 btrfs_search_slot+0x53b/0xb70
      	 btrfs_lookup_inode+0x2a/0xa0
      	 __btrfs_update_delayed_inode+0x5f/0x280
      	 btrfs_async_run_delayed_root+0x24c/0x290
      	 btrfs_work_helper+0xf2/0x3e0
      	 process_one_work+0x271/0x590
      	 worker_thread+0x52/0x3b0
      	 kthread+0xf0/0x120
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (btrfs-tree-01){++++}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_search_slot+0x3c3/0xb70
      	 do_relocation+0x10c/0x6b0
      	 relocate_tree_blocks+0x317/0x6d0
      	 relocate_block_group+0x1f1/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
      	 __lock_acquire+0x1122/0x1e10
      	 lock_acquire+0xc2/0x2d0
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_lock_root_node+0x31/0x50
      	 btrfs_search_slot+0x1cb/0xb70
      	 replace_path+0x541/0x9f0
      	 merge_reloc_root+0x1d6/0x610
      	 merge_reloc_roots+0xe2/0x260
      	 relocate_block_group+0x2c8/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        other info that might help us debug this:
      
        Chain exists of:
          btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-tree-01/1);
      				 lock(btrfs-tree-01);
      				 lock(btrfs-tree-01/1);
          lock(btrfs-treloc-02#2);
      
         *** DEADLOCK ***
      
        7 locks held by btrfs/752500:
         #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
         #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
         #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
         #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
         #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        stack backtrace:
        CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
      
         dump_stack_lvl+0x56/0x73
         check_noncircular+0xd6/0x100
         ? lock_is_held_type+0xe2/0x140
         __lock_acquire+0x1122/0x1e10
         lock_acquire+0xc2/0x2d0
         ? __btrfs_tree_lock+0x24/0x110
         down_write_nested+0x41/0x80
         ? __btrfs_tree_lock+0x24/0x110
         __btrfs_tree_lock+0x24/0x110
         btrfs_lock_root_node+0x31/0x50
         btrfs_search_slot+0x1cb/0xb70
         ? lock_release+0x137/0x2d0
         ? _raw_spin_unlock+0x29/0x50
         ? release_extent_buffer+0x128/0x180
         replace_path+0x541/0x9f0
         merge_reloc_root+0x1d6/0x610
         merge_reloc_roots+0xe2/0x260
         relocate_block_group+0x2c8/0x560
         btrfs_relocate_block_group+0x23e/0x400
         btrfs_relocate_chunk+0x4c/0x140
         btrfs_balance+0x755/0xe40
         btrfs_ioctl+0x1ea2/0x2c90
         ? lock_is_held_type+0xe2/0x140
         ? lock_is_held_type+0xe2/0x140
         ? __x64_sys_ioctl+0x88/0xc0
         __x64_sys_ioctl+0x88/0xc0
         do_syscall_64+0x38/0x90
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      This isn't necessarily new, it's just tricky to hit in practice.  There
      are two competing things going on here.  With relocation we create a
      snapshot of every fs tree with a reloc tree.  Any extent buffers that
      get initialized here are initialized with the reloc root lockdep key.
      However since it is a snapshot, any blocks that are currently in cache
      that originally belonged to the fs tree will have the normal tree
      lockdep key set.  This creates the lock dependency of
      
        reloc tree -> normal tree
      
      for the extent buffer locking during the first phase of the relocation
      as we walk down the reloc root to relocate blocks.
      
      However this is problematic because the final phase of the relocation is
      merging the reloc root into the original fs root.  This involves
      searching down to any keys that exist in the original fs root and then
      swapping the relocated block and the original fs root block.  We have to
      search down to the fs root first, and then go search the reloc root for
      the block we need to replace.  This creates the dependency of
      
        normal tree -> reloc tree
      
      which is why lockdep complains.
      
      Additionally even if we were to fix this particular mismatch with a
      different nesting for the merge case, we're still slotting in a block
      that has a owner of the reloc root objectid into a normal tree, so that
      block will have its lockdep key set to the tree reloc root, and create a
      lockdep splat later on when we wander into that block from the fs root.
      
      Unfortunately the only solution here is to make sure we do not set the
      lockdep key to the reloc tree lockdep key normally, and then reset any
      blocks we wander into from the reloc root when we're doing the merged.
      
      This solves the problem of having mixed tree reloc keys intermixed with
      normal tree keys, and then allows us to make sure in the merge case we
      maintain the lock order of
      
        normal tree -> reloc tree
      
      We handle this by setting a bit on the reloc root when we do the search
      for the block we want to relocate, and any block we search into or COW
      at that point gets set to the reloc tree key.  This works correctly
      because we only ever COW down to the parent node, so we aren't resetting
      the key for the block we're linking into the fs root.
      
      With this patch we no longer have the lockdep splat in btrfs/187.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b40130b2
    • J
      btrfs: move lockdep class helpers to locking.c · 0a27a047
      Josef Bacik 提交于
      These definitions exist in disk-io.c, which is not related to the
      locking.  Move this over to locking.h/c where it makes more sense.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0a27a047
    • Z
      btrfs: unset reloc control if transaction commit fails in prepare_to_relocate() · 85f02d6c
      Zixuan Fu 提交于
      In btrfs_relocate_block_group(), the rc is allocated.  Then
      btrfs_relocate_block_group() calls
      
      relocate_block_group()
        prepare_to_relocate()
          set_reloc_control()
      
      that assigns rc to the variable fs_info->reloc_ctl. When
      prepare_to_relocate() returns, it calls
      
      btrfs_commit_transaction()
        btrfs_start_dirty_block_groups()
          btrfs_alloc_path()
            kmem_cache_zalloc()
      
      which may fail for example (or other errors could happen). When the
      failure occurs, btrfs_relocate_block_group() detects the error and frees
      rc and doesn't set fs_info->reloc_ctl to NULL. After that, in
      btrfs_init_reloc_root(), rc is retrieved from fs_info->reloc_ctl and
      then used, which may cause a use-after-free bug.
      
      This possible bug can be triggered by calling btrfs_ioctl_balance()
      before calling btrfs_ioctl_defrag().
      
      To fix this possible bug, in prepare_to_relocate(), check if
      btrfs_commit_transaction() fails. If the failure occurs,
      unset_reloc_control() is called to set fs_info->reloc_ctl to NULL.
      
      The error log in our fault-injection testing is shown as follows:
      
        [   58.751070] BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x7ca/0x920 [btrfs]
        ...
        [   58.753577] Call Trace:
        ...
        [   58.755800]  kasan_report+0x45/0x60
        [   58.756066]  btrfs_init_reloc_root+0x7ca/0x920 [btrfs]
        [   58.757304]  record_root_in_trans+0x792/0xa10 [btrfs]
        [   58.757748]  btrfs_record_root_in_trans+0x463/0x4f0 [btrfs]
        [   58.758231]  start_transaction+0x896/0x2950 [btrfs]
        [   58.758661]  btrfs_defrag_root+0x250/0xc00 [btrfs]
        [   58.759083]  btrfs_ioctl_defrag+0x467/0xa00 [btrfs]
        [   58.759513]  btrfs_ioctl+0x3c95/0x114e0 [btrfs]
        ...
        [   58.768510] Allocated by task 23683:
        [   58.768777]  ____kasan_kmalloc+0xb5/0xf0
        [   58.769069]  __kmalloc+0x227/0x3d0
        [   58.769325]  alloc_reloc_control+0x10a/0x3d0 [btrfs]
        [   58.769755]  btrfs_relocate_block_group+0x7aa/0x1e20 [btrfs]
        [   58.770228]  btrfs_relocate_chunk+0xf1/0x760 [btrfs]
        [   58.770655]  __btrfs_balance+0x1326/0x1f10 [btrfs]
        [   58.771071]  btrfs_balance+0x3150/0x3d30 [btrfs]
        [   58.771472]  btrfs_ioctl_balance+0xd84/0x1410 [btrfs]
        [   58.771902]  btrfs_ioctl+0x4caa/0x114e0 [btrfs]
        ...
        [   58.773337] Freed by task 23683:
        ...
        [   58.774815]  kfree+0xda/0x2b0
        [   58.775038]  free_reloc_control+0x1d6/0x220 [btrfs]
        [   58.775465]  btrfs_relocate_block_group+0x115c/0x1e20 [btrfs]
        [   58.775944]  btrfs_relocate_chunk+0xf1/0x760 [btrfs]
        [   58.776369]  __btrfs_balance+0x1326/0x1f10 [btrfs]
        [   58.776784]  btrfs_balance+0x3150/0x3d30 [btrfs]
        [   58.777185]  btrfs_ioctl_balance+0xd84/0x1410 [btrfs]
        [   58.777621]  btrfs_ioctl+0x4caa/0x114e0 [btrfs]
        ...
      Reported-by: NTOTE Robot <oslab@tsinghua.edu.cn>
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: NSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NZixuan Fu <r33s3n6@gmail.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      85f02d6c
  5. 28 7月, 2022 1 次提交
    • J
      btrfs: reset RO counter on block group if we fail to relocate · 74944c87
      Josef Bacik 提交于
      With the automatic block group reclaim code we will preemptively try to
      mark the block group RO before we start the relocation.  We do this to
      make sure we should actually try to relocate the block group.
      
      However if we hit an error during the actual relocation we won't clean
      up our RO counter and the block group will remain RO.  This was observed
      internally with file systems reporting less space available from df when
      we had failed background relocations.
      
      Fix this by doing the dec_ro in the error case.
      
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: NBoris Burkov <boris@bur.io>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      74944c87
  6. 26 7月, 2022 5 次提交
  7. 25 7月, 2022 17 次提交
    • C
      btrfs: repair all known bad mirrors · c144c63f
      Christoph Hellwig 提交于
      When there is more than a single level of redundancy there can also be
      multiple bad mirrors, and the current read repair code only repairs the
      last bad one.
      
      Restructure btrfs_repair_one_sector so that it records the originally
      failed mirror and the number of copies, and then repair all known bad
      copies until we reach the originally failed copy in clean_io_failure.
      Note that this also means the read repair reads will always start from
      the next bad mirror and not mirror 0.
      
      This fixes btrfs/265 in xfstests.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c144c63f
    • C
      btrfs: merge btrfs_dev_stat_print_on_error with its only caller · d28beb3e
      Christoph Hellwig 提交于
      Fold it into the only caller.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d28beb3e
    • F
      btrfs: join running log transaction when logging new name · 723df2bc
      Filipe Manana 提交于
      When logging a new name, in case of a rename, we pin the log before
      changing it. We then either delete a directory entry from the log or
      insert a key range item to mark the old name for deletion on log replay.
      
      However when doing one of those log changes we may have another task that
      started writing out the log (at btrfs_sync_log()) and it started before
      we pinned the log root. So we may end up changing a log tree while its
      writeback is being started by another task syncing the log. This can lead
      to inconsistencies in a log tree and other unexpected results during log
      replay, because we can get some committed node pointing to a node/leaf
      that ends up not getting written to disk before the next log commit.
      
      The problem, conceptually, started to happen in commit 88d2beec
      ("btrfs: avoid logging all directory changes during renames"), because
      there we started to update the log without joining its current transaction
      first.
      
      However the problem only became visible with commit 259c4b96
      ("btrfs: stop doing unnecessary log updates during a rename"), and that is
      because we used to pin the log at btrfs_rename() and then before entering
      btrfs_log_new_name(), when unlinking the old dentry, we ended up at
      btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(). Both
      of them join the current log transaction, effectively waiting for any log
      transaction writeout (due to acquiring the root's log_mutex). This made it
      safe even after leaving the current log transaction, because we remained
      with the log pinned when we called btrfs_log_new_name().
      
      Then in commit 259c4b96 ("btrfs: stop doing unnecessary log updates
      during a rename"), we removed the log pinning from btrfs_rename() and
      stopped calling btrfs_del_inode_ref_in_log() and
      btrfs_del_dir_entries_in_log() during the rename, and started to do all
      the needed work at btrfs_log_new_name(), but without joining the current
      log transaction, only pinning the log, which is racy because another task
      may have started writeout of the log tree right before we pinned the log.
      
      Both commits landed in kernel 5.18, so it doesn't make any practical
      difference which should be blamed, but I'm blaming the second commit only
      because with the first one, by chance, the problem did not happen due to
      the fact we joined the log transaction after pinning the log and unpinned
      it only after calling btrfs_log_new_name().
      
      So make btrfs_log_new_name() join the current log transaction instead of
      pinning it, so that we never do log updates if it's writeout is starting.
      
      Fixes: 259c4b96 ("btrfs: stop doing unnecessary log updates during a rename")
      CC: stable@vger.kernel.org # 5.18+
      Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Tested-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      723df2bc
    • N
      btrfs: simplify error handling in btrfs_lookup_dentry · fc8b235f
      Nikolay Borisov 提交于
      In btrfs_lookup_dentry releasing the reference of the sub_root and the
      running orphan cleanup should only happen if the dentry found actually
      represents a subvolume. This can only be true in the 'else' branch as
      otherwise either fixup_tree_root_location returned an ENOENT error, in
      which case sub_root wouldn't have been changed or if we got a different
      errno this means btrfs_get_fs_root couldn't have executed successfully
      again meaning sub_root will equal to root. So simplify all the branches
      by moving the code into the 'else'.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fc8b235f
    • F
      btrfs: send: always use the rbtree based inode ref management infrastructure · 0d8869fb
      Filipe Manana 提交于
      After the patch "btrfs: send: fix sending link commands for existing file
      paths", we now have two infrastructures to detect and eliminate duplicated
      inode references (due to names that got removed and re-added between the
      send and parent snapshots):
      
      1) One that works on a single inode ref/extref item;
      
      2) A new one that works acrosss all ref/extref items for an inode, and
         it's also more efficient because even in the single ref/extref item
         case, it does not do a linear search for all the names encoded in the
         ref/extref item, it uses red black trees to speedup up the search.
      
      There's no good reason to keep both infrastructures, we can use the new
      one everywhere, and it's always more efficient.
      
      So remove the old infrastructure and change all sites that are using it
      to use the new one.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0d8869fb
    • B
      btrfs: send: fix sending link commands for existing file paths · 3aa5bd36
      BingJing Chang 提交于
      There is a bug sending link commands for existing file paths. When we're
      processing an inode, we go over all references. All the new file paths are
      added to the "new_refs" list. And all the deleted file paths are added to
      the "deleted_refs" list. In the end, when we finish processing the inode,
      we iterate over all the items in the "new_refs" list and send link commands
      for those file paths. After that, we go over all the items in the
      "deleted_refs" list and send unlink commands for them. If there are
      duplicated file paths in both lists, we will try to create them before we
      remove them. Then the receiver gets an -EEXIST error when trying the link
      operations.
      
      Example for having duplicated file paths in both list:
      
        $ btrfs subvolume create vol
      
        # create a file and 2000 hard links to the same inode
        $ touch vol/foo
        $ for i in {1..2000}; do link vol/foo vol/$i ; done
      
        # take a snapshot for a parent snapshot
        $ btrfs subvolume snapshot -r vol snap1
      
        # remove 2000 hard links and re-create the last 1000 links
        $ for i in {1..2000}; do rm vol/$i; done;
        $ for i in {1001..2000}; do link vol/foo vol/$i; done
      
        # take another one for a send snapshot
        $ btrfs subvolume snapshot -r vol snap2
      
        $ mkdir receive_dir
        $ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
        At subvol snap2
        link 1238 -> foo
        ERROR: link 1238 -> foo failed: File exists
      
      In this case, we will have the same file paths added to both lists. In the
      parent snapshot, reference paths {1..1237} are stored in inode references,
      but reference paths {1238..2000} are stored in inode extended references.
      In the send snapshot, all reference paths {1001..2000} are stored in inode
      references. During the incremental send, we process their inode references
      first. In record_changed_ref(), we iterate all its inode references in the
      send/parent snapshot. For every inode reference, we also use find_iref() to
      check whether the same file path also appears in the parent/send snapshot
      or not. Inode references {1238..2000} which appear in the send snapshot but
      not in the parent snapshot are added to the "new_refs" list. On the other
      hand, Inode references {1..1000} which appear in the parent snapshot but
      not in the send snapshot are added to the "deleted_refs" list. Next, when
      we process their inode extended references, reference paths {1238..2000}
      are added to the "deleted_refs" list because all of them only appear in the
      parent snapshot. Now two lists contain items as below:
      "new_refs" list: {1238..2000}
      "deleted_refs" list: {1..1000}, {1238..2000}
      
      Reference paths {1238..2000} appear in both lists. And as the processing
      order mentioned about before, the receiver gets an -EEXIST error when trying
      the link operations.
      
      To fix the bug, the idea is to process the "deleted_refs" list before
      the "new_refs" list. However, it's not easy to reshuffle the processing
      order. For one reason, if we do so, we may unlink all the existing paths
      first, there's no valid path anymore for links. And it's inefficient
      because we do a bunch of unlinks followed by links for the same paths.
      Moreover, it makes less sense to have duplications in both lists. A
      reference path cannot not only be regarded as new but also has been seen in
      the past, or we won't call it a new path. However, it's also not a good
      idea to make find_iref() check a reference against all inode references
      and all inode extended references because it may result in large disk
      reads.
      
      So we introduce two rbtrees to make the references easier for lookups.
      And we also introduce record_new_ref_if_needed() and
      record_deleted_ref_if_needed() for changed_ref() to check and remove
      duplicated references early.
      Reviewed-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NBingJing Chang <bingjingc@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3aa5bd36
    • B
      btrfs: send: introduce recorded_ref_alloc and recorded_ref_free · 71ecfc13
      BingJing Chang 提交于
      Introduce wrappers to allocate and free recorded_ref structures.
      Reviewed-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NBingJing Chang <bingjingc@synology.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      71ecfc13
    • N
      btrfs: zoned: wait until zone is finished when allocation didn't progress · 2ce543f4
      Naohiro Aota 提交于
      When the allocated position doesn't progress, we cannot submit IOs to
      finish a block group, but there should be ongoing IOs that will finish a
      block group. So, in that case, we wait for a zone to be finished and retry
      the allocation after that.
      
      Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH for fs_info->flags to
      indicate we need a zone finish to have proceeded. The flag is set when the
      allocator detected it cannot activate a new block group. And, it is cleared
      once a zone is finished.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2ce543f4
    • N
      btrfs: zoned: write out partially allocated region · 898793d9
      Naohiro Aota 提交于
      cow_file_range() works in an all-or-nothing way: if it fails to allocate an
      extent for a part of the given region, it gives up all the region including
      the successfully allocated parts. On cow_file_range(), run_delalloc_zoned()
      writes data for the region only when it successfully allocate all the
      region.
      
      This all-or-nothing allocation and write-out are problematic when available
      space in all the block groups are get tight with the active zone
      restriction. btrfs_reserve_extent() try hard to utilize the left space in
      the active block groups and gives up finally and fails with
      -ENOSPC. However, if we send IOs for the successfully allocated region, we
      can finish a zone and can continue on the rest of the allocation on a newly
      allocated block group.
      
      This patch implements the partial write-out for run_delalloc_zoned(). With
      this patch applied, cow_file_range() returns -EAGAIN to tell the caller to
      do something to progress the further allocation, and tells the successfully
      allocated region with done_offset. Furthermore, the zoned extent allocator
      returns -EAGAIN to tell cow_file_range() going back to the caller side.
      
      Actually, we still need to wait for an IO to complete to continue the
      allocation. The next patch implements that part.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      898793d9
    • N
      btrfs: zoned: activate necessary block group · b6a98021
      Naohiro Aota 提交于
      There are two places where allocating a chunk is not enough. These two
      places are trying to ensure the space by allocating a chunk. To meet the
      condition for active_total_bytes, we also need to activate a block group
      there.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b6a98021
    • N
      btrfs: zoned: activate metadata block group on flush_space · b0931513
      Naohiro Aota 提交于
      For metadata space on zoned filesystem, reaching ALLOC_CHUNK{,_FORCE}
      means we don't have enough space left in the active_total_bytes. Before
      allocating a new chunk, we can try to activate an existing block group
      in this case.
      
      Also, allocating a chunk is not enough to grant a ticket for metadata
      space on zoned filesystem we need to activate the block group to
      increase the active_total_bytes.
      
      btrfs_zoned_activate_one_bg() implements the activation feature. It will
      activate a block group by (maybe) finishing a block group. It will give up
      activating a block group if it cannot finish any block group.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b0931513
    • N
      btrfs: zoned: disable metadata overcommit for zoned · 79417d04
      Naohiro Aota 提交于
      The metadata overcommit makes the space reservation flexible but it is also
      harmful to active zone tracking. Since we cannot finish a block group from
      the metadata allocation context, we might not activate a new block group
      and might not be able to actually write out the overcommit reservations.
      
      So, disable metadata overcommit for zoned filesystems. We will ensure
      the reservations are under active_total_bytes in the following patches.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      79417d04
    • N
      btrfs: zoned: introduce space_info->active_total_bytes · 6a921de5
      Naohiro Aota 提交于
      The active_total_bytes, like the total_bytes, accounts for the total bytes
      of active block groups in the space_info.
      
      With an introduction of active_total_bytes, we can check if the reserved
      bytes can be written to the block groups without activating a new block
      group. The check is necessary for metadata allocation on zoned
      filesystem. We cannot finish a block group, which may require waiting
      for the current transaction, from the metadata allocation context.
      Instead, we need to ensure the ongoing allocation (reserved bytes) fits
      in active block groups.
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6a921de5
    • N
      btrfs: zoned: finish least available block group on data bg allocation · 393f646e
      Naohiro Aota 提交于
      When we run out of active zones and no sufficient space is left in any
      block groups, we need to finish one block group to make room to activate a
      new block group.
      
      However, we cannot do this for metadata block groups because we can cause a
      deadlock by waiting for a running transaction commit. So, do that only for
      a data block group.
      
      Furthermore, the block group to be finished has two requirements. First,
      the block group must not have reserved bytes left. Having reserved bytes
      means we have an allocated region but did not yet send bios for it. If that
      region is allocated by the thread calling btrfs_zone_finish(), it results
      in a deadlock.
      
      Second, the block group to be finished must not be a SYSTEM block
      group. Finishing a SYSTEM block group easily breaks further chunk
      allocation by nullifying the SYSTEM free space.
      
      In a certain case, we cannot find any zone finish candidate or
      btrfs_zone_finish() may fail. In that case, we fall back to split the
      allocation bytes and fill the last spaces left in the block groups.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      393f646e
    • N
      btrfs: let can_allocate_chunk return error · bb9950d3
      Naohiro Aota 提交于
      For the later patch, convert the return type from bool to int and return
      errors. No functional changes.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bb9950d3
    • N
      btrfs: use fs_info->max_extent_size in get_extent_max_capacity() · d7601566
      Naohiro Aota 提交于
      Use fs_info->max_extent_size also in get_extent_max_capacity() for the
      completeness. This is only used for defrag and not really necessary to fix
      the metadata reservation size. But, it still suppresses unnecessary defrag
      operations.
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d7601566
    • N
      btrfs: convert count_max_extents() to use fs_info->max_extent_size · 7d7672bc
      Naohiro Aota 提交于
      If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the number
      of extents needed, btrfs release the metadata reservation too much on its
      way to write out the data.
      
      Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size,
      convert count_max_extents() to use it instead, and fix the calculation of
      the metadata reservation.
      
      CC: stable@vger.kernel.org # 5.12+
      Fixes: d8e3fb10 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7d7672bc