1. 26 9月, 2022 21 次提交
  2. 13 9月, 2022 3 次提交
    • N
      btrfs: zoned: wait for extent buffer IOs before finishing a zone · 2dd7e7bc
      Naohiro Aota 提交于
      Before sending REQ_OP_ZONE_FINISH to a zone, we need to ensure that
      ongoing IOs already finished. Or, we will see a "Zone Is Full" error for
      the IOs, as the ZONE_FINISH command makes the zone full.
      
      We ensure that with btrfs_wait_block_group_reservations() and
      btrfs_wait_ordered_roots() for a data block group. And, for a metadata
      block group, the comparison of alloc_offset vs meta_write_pointer mostly
      ensures IOs for the allocated region already sent. However, there still
      can be a little time frame where the IOs are sent but not yet completed.
      
      Introduce wait_eb_writebacks() to ensure such IOs are completed for a
      metadata block group. It walks the buffer_radix to find extent buffers in
      the block group and calls wait_on_extent_buffer_writeback() on them.
      
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      CC: stable@vger.kernel.org # 5.19+
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2dd7e7bc
    • F
      btrfs: fix hang during unmount when stopping a space reclaim worker · a362bb86
      Filipe Manana 提交于
      Often when running generic/562 from fstests we can hang during unmount,
      resulting in a trace like this:
      
        Sep 07 11:52:00 debian9 unknown: run fstests generic/562 at 2022-09-07 11:52:00
        Sep 07 11:55:32 debian9 kernel: INFO: task umount:49438 blocked for more than 120 seconds.
        Sep 07 11:55:32 debian9 kernel:       Not tainted 6.0.0-rc2-btrfs-next-122 #1
        Sep 07 11:55:32 debian9 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        Sep 07 11:55:32 debian9 kernel: task:umount          state:D stack:    0 pid:49438 ppid: 25683 flags:0x00004000
        Sep 07 11:55:32 debian9 kernel: Call Trace:
        Sep 07 11:55:32 debian9 kernel:  <TASK>
        Sep 07 11:55:32 debian9 kernel:  __schedule+0x3c8/0xec0
        Sep 07 11:55:32 debian9 kernel:  ? rcu_read_lock_sched_held+0x12/0x70
        Sep 07 11:55:32 debian9 kernel:  schedule+0x5d/0xf0
        Sep 07 11:55:32 debian9 kernel:  schedule_timeout+0xf1/0x130
        Sep 07 11:55:32 debian9 kernel:  ? lock_release+0x224/0x4a0
        Sep 07 11:55:32 debian9 kernel:  ? lock_acquired+0x1a0/0x420
        Sep 07 11:55:32 debian9 kernel:  ? trace_hardirqs_on+0x2c/0xd0
        Sep 07 11:55:32 debian9 kernel:  __wait_for_common+0xac/0x200
        Sep 07 11:55:32 debian9 kernel:  ? usleep_range_state+0xb0/0xb0
        Sep 07 11:55:32 debian9 kernel:  __flush_work+0x26d/0x530
        Sep 07 11:55:32 debian9 kernel:  ? flush_workqueue_prep_pwqs+0x140/0x140
        Sep 07 11:55:32 debian9 kernel:  ? trace_clock_local+0xc/0x30
        Sep 07 11:55:32 debian9 kernel:  __cancel_work_timer+0x11f/0x1b0
        Sep 07 11:55:32 debian9 kernel:  ? close_ctree+0x12b/0x5b3 [btrfs]
        Sep 07 11:55:32 debian9 kernel:  ? __trace_bputs+0x10b/0x170
        Sep 07 11:55:32 debian9 kernel:  close_ctree+0x152/0x5b3 [btrfs]
        Sep 07 11:55:32 debian9 kernel:  ? evict_inodes+0x166/0x1c0
        Sep 07 11:55:32 debian9 kernel:  generic_shutdown_super+0x71/0x120
        Sep 07 11:55:32 debian9 kernel:  kill_anon_super+0x14/0x30
        Sep 07 11:55:32 debian9 kernel:  btrfs_kill_super+0x12/0x20 [btrfs]
        Sep 07 11:55:32 debian9 kernel:  deactivate_locked_super+0x2e/0xa0
        Sep 07 11:55:32 debian9 kernel:  cleanup_mnt+0x100/0x160
        Sep 07 11:55:32 debian9 kernel:  task_work_run+0x59/0xa0
        Sep 07 11:55:32 debian9 kernel:  exit_to_user_mode_prepare+0x1a6/0x1b0
        Sep 07 11:55:32 debian9 kernel:  syscall_exit_to_user_mode+0x16/0x40
        Sep 07 11:55:32 debian9 kernel:  do_syscall_64+0x48/0x90
        Sep 07 11:55:32 debian9 kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd
        Sep 07 11:55:32 debian9 kernel: RIP: 0033:0x7fcde59a57a7
        Sep 07 11:55:32 debian9 kernel: RSP: 002b:00007ffe914217c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        Sep 07 11:55:32 debian9 kernel: RAX: 0000000000000000 RBX: 00007fcde5ae8264 RCX: 00007fcde59a57a7
        Sep 07 11:55:32 debian9 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055b57556cdd0
        Sep 07 11:55:32 debian9 kernel: RBP: 000055b57556cba0 R08: 0000000000000000 R09: 00007ffe91420570
        Sep 07 11:55:32 debian9 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        Sep 07 11:55:32 debian9 kernel: R13: 000055b57556cdd0 R14: 000055b57556ccb8 R15: 0000000000000000
        Sep 07 11:55:32 debian9 kernel:  </TASK>
      
      What happens is the following:
      
      1) The cleaner kthread tries to start a transaction to delete an unused
         block group, but the metadata reservation can not be satisfied right
         away, so a reservation ticket is created and it starts the async
         metadata reclaim task (fs_info->async_reclaim_work);
      
      2) Writeback for all the filler inodes with an i_size of 2K starts
         (generic/562 creates a lot of 2K files with the goal of filling
         metadata space). We try to create an inline extent for them, but we
         fail when trying to insert the inline extent with -ENOSPC (at
         cow_file_range_inline()) - since this is not critical, we fallback
         to non-inline mode (back to cow_file_range()), reserve extents, create
         extent maps and create the ordered extents;
      
      3) An unmount starts, enters close_ctree();
      
      4) The async reclaim task is flushing stuff, entering the flush states one
         by one, until it reaches RUN_DELAYED_IPUTS. There it runs all current
         delayed iputs.
      
         After running the delayed iputs and before calling
         btrfs_wait_on_delayed_iputs(), one or more ordered extents complete,
         and btrfs_add_delayed_iput() is called for each one through
         btrfs_finish_ordered_io() -> btrfs_put_ordered_extent(). This results
         in bumping fs_info->nr_delayed_iputs from 0 to some positive value.
      
         So the async reclaim task blocks at btrfs_wait_on_delayed_iputs() waiting
         for fs_info->nr_delayed_iputs to become 0;
      
      5) The current transaction is committed by the transaction kthread, we then
         start unpinning extents and end up calling btrfs_try_granting_tickets()
         through unpin_extent_range(), since we released some space.
         This results in satisfying the ticket created by the cleaner kthread at
         step 1, waking up the cleaner kthread;
      
      6) At close_ctree() we ask the cleaner kthread to park;
      
      7) The cleaner kthread starts the transaction, deletes the unused block
         group, and then calls kthread_should_park(), which returns true, so it
         parks. And at this point we have the delayed iputs added by the
         completion of the ordered extents still pending;
      
      8) Then later at close_ctree(), when we call:
      
             cancel_work_sync(&fs_info->async_reclaim_work);
      
         We hang forever, since the cleaner was parked and no one else can run
         delayed iputs after that, while the reclaim task is waiting for the
         remaining delayed iputs to be completed.
      
      Fix this by waiting for all ordered extents to complete and running the
      delayed iputs before attempting to stop the async reclaim tasks. Note that
      we can not wait for ordered extents with btrfs_wait_ordered_roots() (or
      other similar functions) because that waits for the BTRFS_ORDERED_COMPLETE
      flag to be set on an ordered extent, but the delayed iput is added after
      that, when doing the final btrfs_put_ordered_extent(). So instead wait for
      the work queues used for executing ordered extent completion to be empty,
      which works because we do the final put on an ordered extent at
      btrfs_finish_ordered_io() (while we are in the unmount context).
      
      Fixes: d6fd0ae2 ("Btrfs: fix missing delayed iputs on unmount")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a362bb86
    • F
      btrfs: fix hang during unmount when stopping block group reclaim worker · 8a1f1e3d
      Filipe Manana 提交于
      During early unmount, at close_ctree(), we try to stop the block group
      reclaim task with cancel_work_sync(), but that may hang if the block group
      reclaim task is currently at btrfs_relocate_block_group() waiting for the
      flag BTRFS_FS_UNFINISHED_DROPS to be cleared from fs_info->flags. During
      unmount we only clear that flag later, after trying to stop the block
      group reclaim task.
      
      Fix that by clearing BTRFS_FS_UNFINISHED_DROPS before trying to stop the
      block group reclaim task and after setting BTRFS_FS_CLOSING_START, so that
      if the reclaim task is waiting on that bit, it will stop immediately after
      being woken, because it sees the filesystem is closing (with a call to
      btrfs_fs_closing()), and then returns immediately with -EINTR.
      
      Fixes: 31e70e52 ("btrfs: fix hang during unmount when block group reclaim task is running")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8a1f1e3d
  3. 06 9月, 2022 1 次提交
    • Q
      btrfs: fix the max chunk size and stripe length calculation · 5da431b7
      Qu Wenruo 提交于
      [BEHAVIOR CHANGE]
      Since commit f6fca391 ("btrfs: store chunk size in space-info
      struct"), btrfs no longer can create larger data chunks than 1G:
      
        mkfs.btrfs -f -m raid1 -d raid0 $dev1 $dev2 $dev3 $dev4
        mount $dev1 $mnt
      
        btrfs balance start --full $mnt
        btrfs balance start --full $mnt
        umount $mnt
      
        btrfs ins dump-tree -t chunk $dev1 | grep "DATA|RAID0" -C 2
      
      Before that offending commit, what we got is a 4G data chunk:
      
      	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
      		length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
      		io_align 65536 io_width 65536 sector_size 4096
      		num_stripes 4 sub_stripes 1
      
      Now what we got is only 1G data chunk:
      
      	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 6271533056) itemoff 15491 itemsize 176
      		length 1073741824 owner 2 stripe_len 65536 type DATA|RAID0
      		io_align 65536 io_width 65536 sector_size 4096
      		num_stripes 4 sub_stripes 1
      
      This will increase the number of data chunks by the number of devices,
      not only increase system chunk usage, but also greatly increase mount
      time.
      
      Without a proper reason, we should not change the max chunk size.
      
      [CAUSE]
      Previously, we set max data chunk size to 10G, while max data stripe
      length to 1G.
      
      Commit f6fca391 ("btrfs: store chunk size in space-info struct")
      completely ignored the 10G limit, but use 1G max stripe limit instead,
      causing above shrink in max data chunk size.
      
      [FIX]
      Fix the max data chunk size to 10G, and in decide_stripe_size_regular()
      we limit stripe_size to 1G manually.
      
      This should only affect data chunks, as for metadata chunks we always
      set the max stripe size the same as max chunk size (256M or 1G
      depending on fs size).
      
      Now the same script result the same old result:
      
      	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
      		length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
      		io_align 65536 io_width 65536 sector_size 4096
      		num_stripes 4 sub_stripes 1
      Reported-by: NWang Yugui <wangyugui@e16-tech.com>
      Fixes: f6fca391 ("btrfs: store chunk size in space-info struct")
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5da431b7
  4. 05 9月, 2022 3 次提交
  5. 24 8月, 2022 4 次提交
    • A
      btrfs: add info when mount fails due to stale replace target · f2c3bec2
      Anand Jain 提交于
      If the replace target device reappears after the suspended replace is
      cancelled, it blocks the mount operation as it can't find the matching
      replace-item in the metadata. As shown below,
      
         BTRFS error (device sda5): replace devid present without an active replace item
      
      To overcome this situation, the user can run the command
      
         btrfs device scan --forget <replace target device>
      
      and try the mount command again. And also, to avoid repeating the issue,
      superblock on the devid=0 must be wiped.
      
         wipefs -a device-path-to-devid=0.
      
      This patch adds some info when this situation occurs.
      Reported-by: NSamuel Greiner <samuel@balkonien.org>
      Link: https://lore.kernel.org/linux-btrfs/b4f62b10-b295-26ea-71f9-9a5c9299d42c@balkonien.org/T/
      CC: stable@vger.kernel.org # 5.0+
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f2c3bec2
    • A
      btrfs: replace: drop assert for suspended replace · 59a39919
      Anand Jain 提交于
      If the filesystem mounts with the replace-operation in a suspended state
      and try to cancel the suspended replace-operation, we hit the assert. The
      assert came from the commit fe97e2e1 ("btrfs: dev-replace: replace's
      scrub must not be running in suspended state") that was actually not
      required. So just remove it.
      
       $ mount /dev/sda5 /btrfs
      
          BTRFS info (device sda5): cannot continue dev_replace, tgtdev is missing
          BTRFS info (device sda5): you may cancel the operation after 'mount -o degraded'
      
       $ mount -o degraded /dev/sda5 /btrfs <-- success.
      
       $ btrfs replace cancel /btrfs
      
          kernel: assertion failed: ret != -ENOTCONN, in fs/btrfs/dev-replace.c:1131
          kernel: ------------[ cut here ]------------
          kernel: kernel BUG at fs/btrfs/ctree.h:3750!
      
      After the patch:
      
       $ btrfs replace cancel /btrfs
      
          BTRFS info (device sda5): suspended dev_replace from /dev/sda5 (devid 1) to <missing disk> canceled
      
      Fixes: fe97e2e1 ("btrfs: dev-replace: replace's scrub must not be running in suspended state")
      CC: stable@vger.kernel.org # 5.0+
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      59a39919
    • F
      btrfs: fix silent failure when deleting root reference · 47bf225a
      Filipe Manana 提交于
      At btrfs_del_root_ref(), if btrfs_search_slot() returns an error, we end
      up returning from the function with a value of 0 (success). This happens
      because the function returns the value stored in the variable 'err',
      which is 0, while the error value we got from btrfs_search_slot() is
      stored in the 'ret' variable.
      
      So fix it by setting 'err' with the error value.
      
      Fixes: 8289ed9f ("btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      47bf225a
    • O
      btrfs: fix space cache corruption and potential double allocations · ced8ecf0
      Omar Sandoval 提交于
      When testing space_cache v2 on a large set of machines, we encountered a
      few symptoms:
      
      1. "unable to add free space :-17" (EEXIST) errors.
      2. Missing free space info items, sometimes caught with a "missing free
         space info for X" error.
      3. Double-accounted space: ranges that were allocated in the extent tree
         and also marked as free in the free space tree, ranges that were
         marked as allocated twice in the extent tree, or ranges that were
         marked as free twice in the free space tree. If the latter made it
         onto disk, the next reboot would hit the BUG_ON() in
         add_new_free_space().
      4. On some hosts with no on-disk corruption or error messages, the
         in-memory space cache (dumped with drgn) disagreed with the free
         space tree.
      
      All of these symptoms have the same underlying cause: a race between
      caching the free space for a block group and returning free space to the
      in-memory space cache for pinned extents causes us to double-add a free
      range to the space cache. This race exists when free space is cached
      from the free space tree (space_cache=v2) or the extent tree
      (nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
      struct btrfs_block_group::last_byte_to_unpin and struct
      btrfs_block_group::progress are supposed to protect against this race,
      but commit d0c2f4fa ("btrfs: make concurrent fsyncs wait less when
      waiting for a transaction commit") subtly broke this by allowing
      multiple transactions to be unpinning extents at the same time.
      
      Specifically, the race is as follows:
      
      1. An extent is deleted from an uncached block group in transaction A.
      2. btrfs_commit_transaction() is called for transaction A.
      3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
         ref for the deleted extent.
      4. __btrfs_free_extent() -> do_free_extent_accounting() ->
         add_to_free_space_tree() adds the deleted extent back to the free
         space tree.
      5. do_free_extent_accounting() -> btrfs_update_block_group() ->
         btrfs_cache_block_group() queues up the block group to get cached.
         block_group->progress is set to block_group->start.
      6. btrfs_commit_transaction() for transaction A calls
         switch_commit_roots(). It sets block_group->last_byte_to_unpin to
         block_group->progress, which is block_group->start because the block
         group hasn't been cached yet.
      7. The caching thread gets to our block group. Since the commit roots
         were already switched, load_free_space_tree() sees the deleted extent
         as free and adds it to the space cache. It finishes caching and sets
         block_group->progress to U64_MAX.
      8. btrfs_commit_transaction() advances transaction A to
         TRANS_STATE_SUPER_COMMITTED.
      9. fsync calls btrfs_commit_transaction() for transaction B. Since
         transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
         commit is for fsync, it advances.
      10. btrfs_commit_transaction() for transaction B calls
          switch_commit_roots(). This time, the block group has already been
          cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
      11. btrfs_commit_transaction() for transaction A calls
          btrfs_finish_extent_commit(), which calls unpin_extent_range() for
          the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
          transaction B!), so it adds the deleted extent to the space cache
          again!
      
      This explains all of our symptoms above:
      
      * If the sequence of events is exactly as described above, when the free
        space is re-added in step 11, it will fail with EEXIST.
      * If another thread reallocates the deleted extent in between steps 7
        and 11, then step 11 will silently re-add that space to the space
        cache as free even though it is actually allocated. Then, if that
        space is allocated *again*, the free space tree will be corrupted
        (namely, the wrong item will be deleted).
      * If we don't catch this free space tree corruption, it will continue
        to get worse as extents are deleted and reallocated.
      
      The v1 space_cache is synchronously loaded when an extent is deleted
      (btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
      with load_cache_only=1), so it is not normally affected by this bug.
      However, as noted above, if we fail to load the space cache, we will
      fall back to caching from the extent tree and may hit this bug.
      
      The easiest fix for this race is to also make caching from the free
      space tree or extent tree synchronous. Josef tested this and found no
      performance regressions.
      
      A few extra changes fall out of this change. Namely, this fix does the
      following, with step 2 being the crucial fix:
      
      1. Factor btrfs_caching_ctl_wait_done() out of
         btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
         that we already hold a reference to.
      2. Change the call in btrfs_cache_block_group() of
         btrfs_wait_space_cache_v1_finished() to
         btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
         space_cache option.
      3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
         space_cache_v1_done().
      4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
         `bool wait` to more accurately describe its new meaning.
      5. Change a few callers which had a separate call to
         btrfs_wait_block_group_cache_done() to use wait = true instead.
      6. Make btrfs_wait_block_group_cache_done() static now that it's not
         used outside of block-group.c anymore.
      
      Fixes: d0c2f4fa ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
      CC: stable@vger.kernel.org # 5.12+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ced8ecf0
  6. 23 8月, 2022 5 次提交
    • J
      btrfs: don't allow large NOWAIT direct reads · 79d3d1d1
      Josef Bacik 提交于
      Dylan and Jens reported a problem where they had an io_uring test that
      was returning short reads, and bisected it to ee5b46a3 ("btrfs:
      increase direct io read size limit to 256 sectors").
      
      The root cause is their test was doing larger reads via io_uring with
      NOWAIT and async.  This was triggering a page fault during the direct
      read, however the first page was able to work just fine and thus we
      submitted a 4k read for a larger iocb.
      
      Btrfs allows for partial IO's in this case specifically because we don't
      allow page faults, and thus we'll attempt to do any io that we can,
      submit what we could, come back and fault in the rest of the range and
      try to do the remaining IO.
      
      However for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
      partial dio is done, which is incorrect.  In the sync case we can exit
      the iomap code, submit more io's, and return with the amount of IO we
      were able to complete successfully.
      
      We were always doing short reads in this case, but for NOWAIT we were
      getting saved by the fact that we were limiting direct reads to
      sectorsize, and if we were larger than that we would return EAGAIN.
      
      Fix the regression by simply returning EAGAIN in the NOWAIT case with
      larger reads, that way io_uring can retry and get the larger IO and have
      the fault logic handle everything properly.
      
      This still leaves the AIO short read case, but that existed before this
      change.  The way to properly fix this would be to handle partial iocb
      completions, but that's a lot of work, for now deal with the regression
      in the most straightforward way possible.
      Reported-by: NDylan Yudaken <dylany@fb.com>
      Fixes: ee5b46a3 ("btrfs: increase direct io read size limit to 256 sectors")
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      79d3d1d1
    • Q
      btrfs: don't merge pages into bio if their page offset is not contiguous · 4a445b7b
      Qu Wenruo 提交于
      [BUG]
      Zygo reported on latest development branch, he could hit
      ASSERT()/BUG_ON() caused crash when doing RAID5 recovery (intentionally
      corrupt one disk, and let btrfs to recover the data during read/scrub).
      
      And The following minimal reproducer can cause extent state leakage at
      rmmod time:
      
        mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null
        mount $dev1 $mnt
        fsstress -w -d $mnt -n 25 -s 1660807876
        sync
        fssum -A -f -w /tmp/fssum.saved $mnt
        umount $mnt
      
        # Wipe the dev1 but keeps its super block
        xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1
        mount $dev1 $mnt
        fssum -r /tmp/fssum.saved $mnt > /dev/null
        umount $mnt
        rmmod btrfs
      
      This will lead to the following extent states leakage:
      
        BTRFS: state leak: start 499712 end 503807 state 5 in tree 1 refs 1
        BTRFS: state leak: start 495616 end 499711 state 5 in tree 1 refs 1
        BTRFS: state leak: start 491520 end 495615 state 5 in tree 1 refs 1
        BTRFS: state leak: start 487424 end 491519 state 5 in tree 1 refs 1
        BTRFS: state leak: start 483328 end 487423 state 5 in tree 1 refs 1
        BTRFS: state leak: start 479232 end 483327 state 5 in tree 1 refs 1
        BTRFS: state leak: start 475136 end 479231 state 5 in tree 1 refs 1
        BTRFS: state leak: start 471040 end 475135 state 5 in tree 1 refs 1
      
      [CAUSE]
      Since commit 7aa51232 ("btrfs: pass a btrfs_bio to
      btrfs_repair_one_sector"), we always use btrfs_bio->file_offset to
      determine the file offset of a page.
      
      But that usage assume that, one bio has all its page having a continuous
      page offsets.
      
      Unfortunately that's not true, btrfs only requires the logical bytenr
      contiguous when assembling its bios.
      
      From above script, we have one bio looks like this:
      
        fssum-27671  submit_one_bio: bio logical=217739264 len=36864
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=466944 <<<
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=724992 <<<
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=729088
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=733184
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=737280
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=741376
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=745472
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=749568
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=753664
      
      Note that the 1st and the 2nd page has non-contiguous page offsets.
      
      This means, at repair time, we will have completely wrong file offset
      passed in:
      
         kworker/u32:2-19927  btrfs_repair_one_sector: r/i=5/261 page_off=729088 file_off=475136 bio_offset=8192
      
      Since the file offset is incorrect, we latter incorrectly set the extent
      states, and no way to really release them.
      
      Thus later it causes the leakage.
      
      In fact, this can be even worse, since the file offset is incorrect, we
      can hit cases like the incorrect file offset belongs to a HOLE, and
      later cause btrfs_num_copies() to trigger error, finally hit
      BUG_ON()/ASSERT() later.
      
      [FIX]
      Add an extra condition in btrfs_bio_add_page() for uncompressed IO.
      
      Now we will have more strict requirement for bio pages:
      
      - They should all have the same mapping
        (the mapping check is already implied by the call chain)
      
      - Their logical bytenr should be adjacent
        This is the same as the old condition.
      
      - Their page_offset() (file offset) should be adjacent
        This is the new check.
        This would result a slightly increased amount of bios from btrfs
        (needs holes and inside the same stripe boundary to trigger).
      
        But this would greatly reduce the confusion, as it's pretty common
        to assume a btrfs bio would only contain continuous page cache.
      
      Later we may need extra cleanups, as we no longer needs to handle gaps
      between page offsets in endio functions.
      
      Currently this should be the minimal patch to fix commit 7aa51232
      ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector").
      Reported-by: NZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Fixes: 7aa51232 ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector")
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4a445b7b
    • F
      btrfs: update generation of hole file extent item when merging holes · e6e3dec6
      Filipe Manana 提交于
      When punching a hole into a file range that is adjacent with a hole and we
      are not using the no-holes feature, we expand the range of the adjacent
      file extent item that represents a hole, to save metadata space.
      
      However we don't update the generation of hole file extent item, which
      means a full fsync will not log that file extent item if the fsync happens
      in a later transaction (since commit 7f30c072 ("btrfs: stop copying
      old file extents when doing a full fsync")).
      
      For example, if we do this:
      
          $ mkfs.btrfs -f -O ^no-holes /dev/sdb
          $ mount /dev/sdb /mnt
          $ xfs_io -f -c "pwrite -S 0xab 2M 2M" /mnt/foobar
          $ sync
      
      We end up with 2 file extent items in our file:
      
      1) One that represents the hole for the file range [0, 2M), with a
         generation of 7;
      
      2) Another one that represents an extent covering the range [2M, 4M).
      
      After that if we do the following:
      
          $ xfs_io -c "fpunch 2M 2M" /mnt/foobar
      
      We end up with a single file extent item in the file, which represents a
      hole for the range [0, 4M) and with a generation of 7 - because we end
      dropping the data extent for range [2M, 4M) and then update the file
      extent item that represented the hole at [0, 2M), by increasing
      length from 2M to 4M.
      
      Then doing a full fsync and power failing:
      
          $ xfs_io -c "fsync" /mnt/foobar
          <power failure>
      
      will result in the full fsync not logging the file extent item that
      represents the hole for the range [0, 4M), because its generation is 7,
      which is lower than the generation of the current transaction (8).
      As a consequence, after mounting again the filesystem (after log replay),
      the region [2M, 4M) does not have a hole, it still points to the
      previous data extent.
      
      So fix this by always updating the generation of existing file extent
      items representing holes when we merge/expand them. This solves the
      problem and it's the same approach as when we merge prealloc extents that
      got written (at btrfs_mark_extent_written()). Setting the generation to
      the current transaction's generation is also what we do when merging
      the new hole extent map with the previous one or the next one.
      
      A test case for fstests, covering both cases of hole file extent item
      merging (to the left and to the right), will be sent soon.
      
      Fixes: 7f30c072 ("btrfs: stop copying old file extents when doing a full fsync")
      CC: stable@vger.kernel.org # 5.18+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e6e3dec6
    • Z
      btrfs: fix possible memory leak in btrfs_get_dev_args_from_path() · 9ea0106a
      Zixuan Fu 提交于
      In btrfs_get_dev_args_from_path(), btrfs_get_bdev_and_sb() can fail if
      the path is invalid. In this case, btrfs_get_dev_args_from_path()
      returns directly without freeing args->uuid and args->fsid allocated
      before, which causes memory leak.
      
      To fix these possible leaks, when btrfs_get_bdev_and_sb() fails,
      btrfs_put_dev_args_from_path() is called to clean up the memory.
      Reported-by: NTOTE Robot <oslab@tsinghua.edu.cn>
      Fixes: faa775c4 ("btrfs: add a btrfs_get_dev_args_from_path helper")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: NBoris Burkov <boris@bur.io>
      Signed-off-by: NZixuan Fu <r33s3n6@gmail.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      9ea0106a
    • G
      btrfs: check if root is readonly while setting security xattr · b5111127
      Goldwyn Rodrigues 提交于
      For a filesystem which has btrfs read-only property set to true, all
      write operations including xattr should be denied. However, security
      xattr can still be changed even if btrfs ro property is true.
      
      This happens because xattr_permission() does not have any restrictions
      on security.*, system.*  and in some cases trusted.* from VFS and
      the decision is left to the underlying filesystem. See comments in
      xattr_permission() for more details.
      
      This patch checks if the root is read-only before performing the set
      xattr operation.
      
      Testcase:
      
        DEV=/dev/vdb
        MNT=/mnt
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
        echo "file one" > $MNT/f1
      
        setfattr -n "security.one" -v 2 $MNT/f1
        btrfs property set /mnt ro true
      
        setfattr -n "security.one" -v 1 $MNT/f1
      
        umount $MNT
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b5111127
  7. 17 8月, 2022 3 次提交
    • J
      btrfs: tree-checker: check for overlapping extent items · 899b7f69
      Josef Bacik 提交于
      We're seeing a weird problem in production where we have overlapping
      extent items in the extent tree.  It's unclear where these are coming
      from, and in debugging we realized there's no check in the tree checker
      for this sort of problem.  Add a check to the tree-checker to make sure
      that the extents do not overlap each other.
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      899b7f69
    • F
      btrfs: fix warning during log replay when bumping inode link count · 769030e1
      Filipe Manana 提交于
      During log replay, at add_link(), we may increment the link count of
      another inode that has a reference that conflicts with a new reference
      for the inode currently being processed.
      
      During log replay, at add_link(), we may drop (unlink) a reference from
      some inode in the subvolume tree if that reference conflicts with a new
      reference found in the log for the inode we are currently processing.
      
      After the unlink, If the link count has decreased from 1 to 0, then we
      increment the link count to prevent the inode from being deleted if it's
      evicted by an iput() call, because we may have references to add to that
      inode later on (and we will fixup its link count later during log replay).
      
      However incrementing the link count from 0 to 1 triggers a warning:
      
        $ cat fs/inode.c
        (...)
        void inc_nlink(struct inode *inode)
        {
              if (unlikely(inode->i_nlink == 0)) {
                       WARN_ON(!(inode->i_state & I_LINKABLE));
                       atomic_long_dec(&inode->i_sb->s_remove_count);
              }
        (...)
      
      The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's
      never set during log replay.
      
      Most of the time, the warning isn't triggered even if we dropped the last
      reference of the conflicting inode, and this is because:
      
      1) The conflicting inode was previously marked for fixup, through a call
         to link_to_fixup_dir(), which increments the inode's link count;
      
      2) And the last iput() on the inode has not triggered eviction of the
         inode, nor was eviction triggered after the iput(). So at add_link(),
         even if we unlink the last reference of the inode, its link count ends
         up being 1 and not 0.
      
      So this means that if eviction is triggered after link_to_fixup_dir() is
      called, at add_link() we will read the inode back from the subvolume tree
      and have it with a correct link count, matching the number of references
      it has on the subvolume tree. So if when we are at add_link() the inode
      has exactly one reference only, its link count is 1, and after the unlink
      its link count becomes 0.
      
      So fix this by using set_nlink() instead of inc_nlink(), as the former
      accepts a transition from 0 to 1 and it's what we use in other similar
      contexts (like at link_to_fixup_dir().
      
      Also make add_inode_ref() use set_nlink() instead of inc_nlink() to
      bump the link count from 0 to 1.
      
      The warning is actually harmless, but it may scare users. Josef also ran
      into it recently.
      
      CC: stable@vger.kernel.org # 5.1+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      769030e1
    • F
      btrfs: fix lost error handling when looking up extended ref on log replay · 7a6b75b7
      Filipe Manana 提交于
      During log replay, when processing inode references, if we get an error
      when looking up for an extended reference at __add_inode_ref(), we ignore
      it and proceed, returning success (0) if no other error happens after the
      lookup. This is obviously wrong because in case an extended reference
      exists and it encodes some name not in the log, we need to unlink it,
      otherwise the filesystem state will not match the state it had after the
      last fsync.
      
      So just make __add_inode_ref() return an error it gets from the extended
      reference lookup.
      
      Fixes: f186373f ("btrfs: extended inode refs")
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7a6b75b7