1. September 26, 2022 - 18 commits
    • J
      btrfs: remove lock protection for BLOCK_GROUP_FLAG_TO_COPY · 9283b9e0
      Committed by Josef Bacik
      We use this flag during device replace for zoned devices; we were simply
      taking the lock because the flag was in a bit field and we needed the lock
      to be safe against other modifications in the bit field.  With the bit
      helpers we no longer require that locking.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9283b9e0
    • J
      btrfs: convert block group bit field to use bit helpers · 3349b57f
      Committed by Josef Bacik
      We use a bit field in the btrfs_block_group for different flags; however,
      this is awkward because we have to hold the block_group->lock for any
      modification of any of these fields, and it makes the code clunky for a
      few of these flags.  Convert these to a proper flags setup so we can
      utilize the bit helpers.
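      As an illustration of the resulting pattern, here is a minimal sketch
      with simplified, hypothetical names (not the exact btrfs definitions):
      flags live in an unsigned long and are updated with the atomic bit
      helpers, so no spinlock is needed for simple flag changes:

        #include <linux/bitops.h>
        #include <linux/types.h>

        /* hypothetical flag indices, one bit each */
        enum example_block_group_flags {
        	EXAMPLE_BLOCK_GROUP_FLAG_TO_COPY,
        	EXAMPLE_BLOCK_GROUP_FLAG_REMOVED,
        };

        struct example_block_group {
        	unsigned long runtime_flags;	/* replaces the old bit field */
        };

        static void example_mark_to_copy(struct example_block_group *bg)
        {
        	/* atomic, no bg->lock needed */
        	set_bit(EXAMPLE_BLOCK_GROUP_FLAG_TO_COPY, &bg->runtime_flags);
        }

        static bool example_is_to_copy(struct example_block_group *bg)
        {
        	return test_bit(EXAMPLE_BLOCK_GROUP_FLAG_TO_COPY, &bg->runtime_flags);
        }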
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3349b57f
    • J
      btrfs: handle space_info setting of bg in btrfs_add_bg_to_space_info · 723de71d
      Committed by Josef Bacik
      We previously had the pattern of
      
      	btrfs_update_space_info(all, the, bg, fields, &space_info);
      	link_block_group(bg);
      	bg->space_info = space_info;
      
      Now that we're passing the bg into btrfs_add_bg_to_space_info we can do
      the linking in that function, transforming this to simply
      
      	btrfs_add_bg_to_space_info(fs_info, bg);
      
      and put the link_block_group() and bg->space_info assignment directly in
      btrfs_add_bg_to_space_info.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      723de71d
    • J
      btrfs: simplify arguments of btrfs_update_space_info and rename · 9d4b0a12
      Committed by Josef Bacik
      This function has grown a bunch of new arguments, and it just boils down
      to passing in all the block group fields as arguments.  Simplify this by
      passing in the block group itself and updating the space_info fields
      based on the block group fields directly.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9d4b0a12
    • J
      btrfs: use btrfs_fs_closing for background bg work · 2f12741f
      Committed by Josef Bacik
      For both unused bg deletion and async balance work we'll happily run if
      the fs is closing.  However, I want to move these to their own worker
      thread, and they can be long-running jobs, so add a check to see if
      we're closing and simply bail.
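      A minimal sketch of the early-bail check this adds (the worker function
      below is a placeholder, btrfs_fs_closing() is the existing helper, and
      the snippet assumes the btrfs internal headers are available):

        /* long-running background work: bail out early if the fs is closing */
        static void example_bg_worker(struct btrfs_fs_info *fs_info)
        {
        	if (btrfs_fs_closing(fs_info))
        		return;

        	/* ... unused block group deletion or async balance work ... */
        }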
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2f12741f
    • O
      btrfs: rename btrfs_insert_file_extent() to btrfs_insert_hole_extent() · d1f68ba0
      Committed by Omar Sandoval
      btrfs_insert_file_extent() is only ever used to insert holes, so rename
      it and remove the redundant parameters.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d1f68ba0
    • D
      btrfs: sysfs: use sysfs_streq for string matching · 7f298f22
      Committed by David Sterba
      We have our own string matching helper that duplicates what sysfs_streq
      does, with the slight difference that it skips initial whitespace. So far
      this is used for the drive allocation policy. Initial whitespace in
      written sysfs values should rather be discouraged, so we should use the
      standard helper.
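      For reference, a small sketch of matching a written sysfs value with the
      standard helper (the policy strings here are made up for illustration):

        #include <linux/errno.h>
        #include <linux/string.h>

        /* buf comes from a sysfs ->store() callback and may end with '\n' */
        static int example_parse_alloc_policy(const char *buf)
        {
        	/* sysfs_streq() tolerates a single trailing newline */
        	if (sysfs_streq(buf, "regular"))
        		return 0;
        	if (sysfs_streq(buf, "round-robin"))
        		return 1;
        	return -EINVAL;
        }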
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7f298f22
    • Q
      btrfs: scrub: try to fix super block errors · f9eab5f0
      Committed by Qu Wenruo
      [BUG]
      The following script shows that, although scrub can detect super block
      errors, it never tries to fix them:
      
      	mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2
      	xfs_io -c "pwrite 67108864 4k" $dev2
      
      	mount $dev1 $mnt
      	btrfs scrub start -B $dev2
      	btrfs scrub start -Br $dev2
      	umount $mnt
      
      The first scrub reports the super error correctly:
      
        scrub done for f3289218-abd3-41ac-a630-202f766c0859
        Scrub started:    Tue Aug  2 14:44:11 2022
        Status:           finished
        Duration:         0:00:00
        Total to scrub:   1.26GiB
        Rate:             0.00B/s
        Error summary:    super=1
          Corrected:      0
          Uncorrectable:  0
          Unverified:     0
      
      But the second read-only scrub still reports the same super error:
      
        Scrub started:    Tue Aug  2 14:44:11 2022
        Status:           finished
        Duration:         0:00:00
        Total to scrub:   1.26GiB
        Rate:             0.00B/s
        Error summary:    super=1
          Corrected:      0
          Uncorrectable:  0
          Unverified:     0
      
      [CAUSE]
      The comment already shows that super block errors can easily be fixed by
      committing a transaction:
      
      	/*
      	 * If we find an error in a super block, we just report it.
      	 * They will get written with the next transaction commit
      	 * anyway
      	 */
      
      But the truth is that such an assumption is not always valid, and since
      scrub should try to repair every error it finds (except for read-only
      scrub), we should really commit a transaction actively to fix this.
      
      [FIX]
      Just commit a transaction if we find any super block errors, after
      everything else is done.
      
      We cannot do this just after scrub_supers(), as
      btrfs_commit_transaction() will try to pause and wait for the running
      scrub, thus we cannot call it with scrub_lock held.
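      A rough sketch of the idea, placed after scrub has finished and
      scrub_lock has been dropped (the condition, error handling and the scrub
      context field names are simplifying assumptions; treat this as an
      outline, not the literal patch):

        /* after scrub finished and scrub_lock was released */
        if (sctx->stat.super_errors && !readonly) {
        	struct btrfs_trans_handle *trans;

        	trans = btrfs_start_transaction(fs_info->tree_root, 0);
        	if (!IS_ERR(trans))
        		/* a commit rewrites all super blocks, fixing the bad copies */
        		ret = btrfs_commit_transaction(trans);
        }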
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f9eab5f0
    • Q
      btrfs: scrub: properly report super block errors in system log · e69bf81c
      Committed by Qu Wenruo
      [PROBLEM]

      Unlike data/metadata corruption, if scrub detects an error in the
      super block, the only error message is from the updated device status:
      
        BTRFS info (device dm-1): scrub: started on devid 2
        BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
      
      This is not helpful at all.
      
      [CAUSE]
      Unlike data/metadata error reporting, there is no visible report in the
      kernel dmesg for super block errors.
      
      In fact, return value of scrub_checksum_super() is intentionally
      skipped, thus scrub_handle_errored_block() will never be called for
      super blocks.
      
      [FIX]
      Make super block errors output an error message; now the full
      dmesg looks like this:
      
        BTRFS info (device dm-1): scrub: started on devid 2
        BTRFS warning (device dm-1): super block error on device /dev/mapper/test-scratch2, physical 67108864
        BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
        BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0
        BTRFS info (device dm-1): scrub: started on devid 2
      
      This fix involves:
      
      - Move the super_errors reporting to scrub_handle_errored_block()
        This allows the device status message to show after the super block
        error message.
        But now we no longer distinguish super block corruption from generation
        mismatch; all are now counted as corruption.
      
      - Properly check the return value from scrub_checksum_super()
      - Add extra super block error reporting for scrub_print_warning().
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e69bf81c
    • A
      btrfs: fix alignment of VMA for memory mapped files on THP · b0c58223
      Committed by Alexander Zhu
      With CONFIG_READ_ONLY_THP_FOR_FS, the Linux kernel supports using THPs for
      read-only mmapped files, such as shared libraries. However, the kernel
      makes no attempt to actually align those mappings on 2MB boundaries,
      which makes it impossible to use those THPs most of the time. This issue
      applies to general file mapping THP as well as existing setups using
      CONFIG_READ_ONLY_THP_FOR_FS. This is easily fixed by using
      thp_get_unmapped_area for the unmapped_area function in btrfs, which
      is what ext2, ext4, fuse, and xfs all use.
      
      Initially btrfs had been left out in commit 8c07fc452ac0 ("btrfs: fix
      alignment of VMA for memory mapped files on THP") as btrfs does not support
      DAX. However, commit 1854bc6e ("mm/readahead: Align file mappings
      for non-DAX") removed the DAX requirement. We should now be able to call
      thp_get_unmapped_area() for btrfs.
      
      The problem can be seen in /proc/PID/smaps where THPeligible is set to 0
      on mappings to eligible shared object files as shown below.
      
      Before this patch:
      
        7fc6a7e18000-7fc6a80cc000 r-xp 00000000 00:1e 199856
        /usr/lib64/libcrypto.so.1.1.1k
        Size:               2768 kB
        THPeligible:    0
        VmFlags: rd ex mr mw me
      
      With this patch the library is mapped at a 2MB aligned address:
      
        7fbdfe200000-7fbdfe4b4000 r-xp 00000000 00:1e 199856
        /usr/lib64/libcrypto.so.1.1.1k
        Size:               2768 kB
        THPeligible:    1
        VmFlags: rd ex mr mw me
      
      This fixes the alignment of VMAs for any mmap of a file that has the
      rd and ex permissions and size >= 2MB. The VMA alignment and
      THPeligible field for anonymous memory are handled separately and
      are thus not affected by this change.
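      The shape of the change is small; a sketch of wiring the hook into a
      file_operations table (illustrative only, mirroring what ext4 and xfs
      do rather than reproducing the actual btrfs_file_operations):

        #include <linux/fs.h>
        #include <linux/huge_mm.h>	/* thp_get_unmapped_area() */

        static const struct file_operations example_file_operations = {
        	.mmap			= generic_file_mmap,
        	/* ask the THP code for a 2MB-aligned unmapped area when possible */
        	.get_unmapped_area	= thp_get_unmapped_area,
        };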
      
      CC: stable@vger.kernel.org # 5.18+
      Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b0c58223
    • I
      btrfs: add lockdep annotations for the ordered extents wait event · 5f4403e1
      Committed by Ioannis Angelakopoulos
      This wait event is very similar to the pending ordered wait event in the
      sense that it occurs in a different context than the condition signaling
      for the event. The signaling occurs in btrfs_remove_ordered_extent()
      while the wait event is implemented in btrfs_start_ordered_extent() in
      fs/btrfs/ordered-data.c
      
      However, in this case a thread must not acquire the lockdep map for the
      ordered extents wait event when the ordered extent is related to a free
      space inode. That is because lockdep creates dependencies between locks
      acquired both in execution paths related to normal inodes and paths
      related to free space inodes, thus leading to false positives.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5f4403e1
    • I
      btrfs: change the lockdep class of free space inode's invalidate_lock · 9d7464c8
      Committed by Ioannis Angelakopoulos
      Reinitialize the class of the lockdep map for struct inode's
      mapping->invalidate_lock in load_free_space_cache() function in
      fs/btrfs/free-space-cache.c. This will prevent lockdep from producing
      false positives related to execution paths that make use of free space
      inodes and paths that make use of normal inodes.
      
      Specifically, with this change lockdep will create separate lock
      dependencies that include the invalidate_lock, in the case that free
      space inodes are used and in the case that normal inodes are used.
      
      The lockdep class for this lock was first initialized in
      inode_init_always() in fs/inode.c.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9d7464c8
    • I
      btrfs: add lockdep annotations for pending_ordered wait event · 8b53779e
      Committed by Ioannis Angelakopoulos
      In contrast to the num_writers and num_extwriters wait events, the
      condition for the pending ordered wait event is signaled in a different
      context from the wait event itself. The condition signaling occurs in
      btrfs_remove_ordered_extent() in fs/btrfs/ordered-data.c while the wait
      event is implemented in btrfs_commit_transaction() in
      fs/btrfs/transaction.c
      
      Thus the thread signaling the condition has to acquire the lockdep map
      as a reader at the start of btrfs_remove_ordered_extent() and release it
      after it has signaled the condition. In this case some dependencies
      might be left out due to the placement of the annotation, but it is
      better than no annotation at all.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8b53779e
    • I
      btrfs: add lockdep annotations for transaction states wait events · 3e738c53
      Committed by Ioannis Angelakopoulos
      Add lockdep annotations for the transaction states that have wait
      events:
      
        1) TRANS_STATE_COMMIT_START
        2) TRANS_STATE_UNBLOCKED
        3) TRANS_STATE_SUPER_COMMITTED
        4) TRANS_STATE_COMPLETED
      
      The new macros introduced here to annotate the transaction states wait
      events have the same effect as the generic lockdep annotation macros.
      
      With the exception of the lockdep annotation for TRANS_STATE_COMMIT_START,
      the transaction thread has to acquire the lockdep maps for the
      transaction states as reader after the lockdep map for num_writers is
      released, so that lockdep does not complain.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3e738c53
    • I
      btrfs: add lockdep annotations for num_extwriters wait event · 5a9ba670
      Committed by Ioannis Angelakopoulos
      Similarly to the num_writers wait event in fs/btrfs/transaction.c add a
      lockdep annotation for the num_extwriters wait event.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5a9ba670
    • I
      btrfs: add lockdep annotations for num_writers wait event · e1489b4f
      Committed by Ioannis Angelakopoulos
      Annotate the num_writers wait event in fs/btrfs/transaction.c with
      lockdep in order to catch deadlocks involving this wait event.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e1489b4f
    • I
      btrfs: add macros for annotating wait events with lockdep · ab9a323f
      Committed by Ioannis Angelakopoulos
      Introduce four macros that are used to annotate wait events in btrfs code
      with lockdep:

        1) btrfs_lockdep_init_map
        2) btrfs_lockdep_acquire
        3) btrfs_lockdep_release
        4) btrfs_might_wait_for_event
      
      The btrfs_lockdep_init_map macro is used to initialize a lockdep map.
      
      The btrfs_lockdep_<acquire,release> macros are used by threads to take
      the lockdep map as readers (shared lock) and release it, respectively.
      
      The btrfs_might_wait_for_event macro is used by threads to take the
      lockdep map as writers (exclusive lock) and release it.
      
      In general, the lockdep annotation for wait events works as follows:
      
      The condition for a wait event can be modified and signaled at the same
      time by multiple threads. These threads hold the lockdep map as readers
      when they enter a context in which blocking would prevent signaling the
      condition. Frequently, this occurs when a thread violates a condition
      (lockdep map acquire), before restoring it and signaling it at a later
      point (lockdep map release).
      
      The threads that block on the wait event take the lockdep map as writers
      (exclusive lock). These threads have to block until all the threads that
      hold the lockdep map as readers signal the condition for the wait event
      and release the lockdep map.
      
      The lockdep annotation is used to warn about potential deadlock scenarios
      that involve the threads that modify and signal the wait event condition
      and threads that block on the wait event. A simple example is illustrated
      below:
      
      Without lockdep:
      
      TA                                        TB
      cond = false
                                                lock(A)
                                                wait_event(w, cond)
                                                unlock(A)
      lock(A)
      cond = true
      signal(w)
      unlock(A)
      
      With lockdep:
      
      TA                                        TB
      rwsem_acquire_read(lockdep_map)
      cond = false
                                                lock(A)
                                                rwsem_acquire(lockdep_map)
                                                rwsem_release(lockdep_map)
                                                wait_event(w, cond)
                                                unlock(A)
      lock(A)
      cond = true
      signal(w)
      unlock(A)
      rwsem_release(lockdep_map)
      
      In the second case, with the lockdep annotation, lockdep would warn about
      an ABBA deadlock, while the first case would just deadlock at some point.
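      Roughly, such macros can be built on the generic lockdep primitives like
      this (a sketch with example_ names, not the exact btrfs definitions):

        #include <linux/lockdep.h>

        /* a lockdep map embedded in the owning structure, e.g. fs_info */
        #define example_lockdep_init_map(owner, lock)				\
        	do {								\
        		static struct lock_class_key lock##_key;		\
        		lockdep_init_map(&(owner)->lock##_map, #lock,		\
        				 &lock##_key, 0);			\
        	} while (0)

        /* threads that may delay the condition take the map as readers */
        #define example_lockdep_acquire(owner, lock)				\
        	rwsem_acquire_read(&(owner)->lock##_map, 0, 0, _THIS_IP_)

        #define example_lockdep_release(owner, lock)				\
        	rwsem_release(&(owner)->lock##_map, _THIS_IP_)

        /* a thread about to block on the wait event "takes" it exclusively */
        #define example_might_wait_for_event(owner, lock)			\
        	do {								\
        		rwsem_acquire(&(owner)->lock##_map, 0, 0, _THIS_IP_);	\
        		rwsem_release(&(owner)->lock##_map, _THIS_IP_);		\
        	} while (0)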
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ab9a323f
    • Q
      btrfs: dump extra info if one free space cache has more bitmaps than it should · 62cd9d44
      Committed by Qu Wenruo
      There is an internal report on hitting the following ASSERT() in
      recalculate_thresholds():
      
       	ASSERT(ctl->total_bitmaps <= max_bitmaps);
      
      Above @max_bitmaps is calculated using the following variables:
      
      - bytes_per_bg
        8 * 4096 * 4096 (128M) for x86_64/x86.
      
      - block_group->length
        The length of the block group.
      
      @max_bitmaps is the rounded up value of block_group->length / 128M.
      
      Normally one free space cache should not have more bitmaps than the above
      value, but when it happens the ASSERT() can be triggered if
      CONFIG_BTRFS_ASSERT is also enabled.

      But the ASSERT() itself won't provide enough info to know what is going
      wrong.
      Is the bg too small, thus only allowing one bitmap?
      Or is there something else wrong?

      So although I haven't found extra reports or a crash dump to do further
      investigation, add the extra info to make it more helpful for debugging.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      62cd9d44
  2. September 13, 2022 - 3 commits
    • N
      btrfs: zoned: wait for extent buffer IOs before finishing a zone · 2dd7e7bc
      Committed by Naohiro Aota
      Before sending REQ_OP_ZONE_FINISH to a zone, we need to ensure that
      ongoing IOs have already finished. Otherwise, we will see a "Zone Is Full"
      error for the IOs, as the ZONE_FINISH command makes the zone full.
      
      We ensure that with btrfs_wait_block_group_reservations() and
      btrfs_wait_ordered_roots() for a data block group. And, for a metadata
      block group, the comparison of alloc_offset vs meta_write_pointer mostly
      ensures the IOs for the allocated region have already been sent. However,
      there can still be a small time frame where the IOs are sent but not yet
      completed.
      
      Introduce wait_eb_writebacks() to ensure such IOs are completed for a
      metadata block group. It walks the buffer_radix to find extent buffers in
      the block group and calls wait_on_extent_buffer_writeback() on them.
      
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      CC: stable@vger.kernel.org # 5.19+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2dd7e7bc
    • F
      btrfs: fix hang during unmount when stopping a space reclaim worker · a362bb86
      Committed by Filipe Manana
      Often when running generic/562 from fstests we can hang during unmount,
      resulting in a trace like this:
      
        Sep 07 11:52:00 debian9 unknown: run fstests generic/562 at 2022-09-07 11:52:00
        Sep 07 11:55:32 debian9 kernel: INFO: task umount:49438 blocked for more than 120 seconds.
        Sep 07 11:55:32 debian9 kernel:       Not tainted 6.0.0-rc2-btrfs-next-122 #1
        Sep 07 11:55:32 debian9 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        Sep 07 11:55:32 debian9 kernel: task:umount          state:D stack:    0 pid:49438 ppid: 25683 flags:0x00004000
        Sep 07 11:55:32 debian9 kernel: Call Trace:
        Sep 07 11:55:32 debian9 kernel:  <TASK>
        Sep 07 11:55:32 debian9 kernel:  __schedule+0x3c8/0xec0
        Sep 07 11:55:32 debian9 kernel:  ? rcu_read_lock_sched_held+0x12/0x70
        Sep 07 11:55:32 debian9 kernel:  schedule+0x5d/0xf0
        Sep 07 11:55:32 debian9 kernel:  schedule_timeout+0xf1/0x130
        Sep 07 11:55:32 debian9 kernel:  ? lock_release+0x224/0x4a0
        Sep 07 11:55:32 debian9 kernel:  ? lock_acquired+0x1a0/0x420
        Sep 07 11:55:32 debian9 kernel:  ? trace_hardirqs_on+0x2c/0xd0
        Sep 07 11:55:32 debian9 kernel:  __wait_for_common+0xac/0x200
        Sep 07 11:55:32 debian9 kernel:  ? usleep_range_state+0xb0/0xb0
        Sep 07 11:55:32 debian9 kernel:  __flush_work+0x26d/0x530
        Sep 07 11:55:32 debian9 kernel:  ? flush_workqueue_prep_pwqs+0x140/0x140
        Sep 07 11:55:32 debian9 kernel:  ? trace_clock_local+0xc/0x30
        Sep 07 11:55:32 debian9 kernel:  __cancel_work_timer+0x11f/0x1b0
        Sep 07 11:55:32 debian9 kernel:  ? close_ctree+0x12b/0x5b3 [btrfs]
        Sep 07 11:55:32 debian9 kernel:  ? __trace_bputs+0x10b/0x170
        Sep 07 11:55:32 debian9 kernel:  close_ctree+0x152/0x5b3 [btrfs]
        Sep 07 11:55:32 debian9 kernel:  ? evict_inodes+0x166/0x1c0
        Sep 07 11:55:32 debian9 kernel:  generic_shutdown_super+0x71/0x120
        Sep 07 11:55:32 debian9 kernel:  kill_anon_super+0x14/0x30
        Sep 07 11:55:32 debian9 kernel:  btrfs_kill_super+0x12/0x20 [btrfs]
        Sep 07 11:55:32 debian9 kernel:  deactivate_locked_super+0x2e/0xa0
        Sep 07 11:55:32 debian9 kernel:  cleanup_mnt+0x100/0x160
        Sep 07 11:55:32 debian9 kernel:  task_work_run+0x59/0xa0
        Sep 07 11:55:32 debian9 kernel:  exit_to_user_mode_prepare+0x1a6/0x1b0
        Sep 07 11:55:32 debian9 kernel:  syscall_exit_to_user_mode+0x16/0x40
        Sep 07 11:55:32 debian9 kernel:  do_syscall_64+0x48/0x90
        Sep 07 11:55:32 debian9 kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd
        Sep 07 11:55:32 debian9 kernel: RIP: 0033:0x7fcde59a57a7
        Sep 07 11:55:32 debian9 kernel: RSP: 002b:00007ffe914217c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        Sep 07 11:55:32 debian9 kernel: RAX: 0000000000000000 RBX: 00007fcde5ae8264 RCX: 00007fcde59a57a7
        Sep 07 11:55:32 debian9 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055b57556cdd0
        Sep 07 11:55:32 debian9 kernel: RBP: 000055b57556cba0 R08: 0000000000000000 R09: 00007ffe91420570
        Sep 07 11:55:32 debian9 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
        Sep 07 11:55:32 debian9 kernel: R13: 000055b57556cdd0 R14: 000055b57556ccb8 R15: 0000000000000000
        Sep 07 11:55:32 debian9 kernel:  </TASK>
      
      What happens is the following:
      
      1) The cleaner kthread tries to start a transaction to delete an unused
         block group, but the metadata reservation can not be satisfied right
         away, so a reservation ticket is created and it starts the async
         metadata reclaim task (fs_info->async_reclaim_work);
      
      2) Writeback for all the filler inodes with an i_size of 2K starts
         (generic/562 creates a lot of 2K files with the goal of filling
         metadata space). We try to create an inline extent for them, but we
         fail when trying to insert the inline extent with -ENOSPC (at
         cow_file_range_inline()) - since this is not critical, we fallback
         to non-inline mode (back to cow_file_range()), reserve extents, create
         extent maps and create the ordered extents;
      
      3) An unmount starts, enters close_ctree();
      
      4) The async reclaim task is flushing stuff, entering the flush states one
         by one, until it reaches RUN_DELAYED_IPUTS. There it runs all current
         delayed iputs.
      
         After running the delayed iputs and before calling
         btrfs_wait_on_delayed_iputs(), one or more ordered extents complete,
         and btrfs_add_delayed_iput() is called for each one through
         btrfs_finish_ordered_io() -> btrfs_put_ordered_extent(). This results
         in bumping fs_info->nr_delayed_iputs from 0 to some positive value.
      
         So the async reclaim task blocks at btrfs_wait_on_delayed_iputs() waiting
         for fs_info->nr_delayed_iputs to become 0;
      
      5) The current transaction is committed by the transaction kthread, we then
         start unpinning extents and end up calling btrfs_try_granting_tickets()
         through unpin_extent_range(), since we released some space.
         This results in satisfying the ticket created by the cleaner kthread at
         step 1, waking up the cleaner kthread;
      
      6) At close_ctree() we ask the cleaner kthread to park;
      
      7) The cleaner kthread starts the transaction, deletes the unused block
         group, and then calls kthread_should_park(), which returns true, so it
         parks. And at this point we have the delayed iputs added by the
         completion of the ordered extents still pending;
      
      8) Then later at close_ctree(), when we call:
      
             cancel_work_sync(&fs_info->async_reclaim_work);
      
         We hang forever, since the cleaner was parked and no one else can run
         delayed iputs after that, while the reclaim task is waiting for the
         remaining delayed iputs to be completed.
      
      Fix this by waiting for all ordered extents to complete and running the
      delayed iputs before attempting to stop the async reclaim tasks. Note that
      we can not wait for ordered extents with btrfs_wait_ordered_roots() (or
      other similar functions) because that waits for the BTRFS_ORDERED_COMPLETE
      flag to be set on an ordered extent, but the delayed iput is added after
      that, when doing the final btrfs_put_ordered_extent(). So instead wait for
      the work queues used for executing ordered extent completion to be empty,
      which works because we do the final put on an ordered extent at
      btrfs_finish_ordered_io() (while we are in the unmount context).
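      Conceptually, the fix adds something like the following ordering to
      close_ctree() (the workqueue name below is a placeholder for the queues
      that run btrfs_finish_ordered_io(); this is a sketch of the ordering,
      not the literal diff):

        /* 1) let all ordered extent completion work finish ... */
        flush_workqueue(ordered_completion_wq);		/* placeholder name */

        /* 2) ... so the delayed iputs it queued can be run now */
        btrfs_run_delayed_iputs(fs_info);

        /* 3) only then is it safe to stop the async reclaim tasks */
        cancel_work_sync(&fs_info->async_reclaim_work);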
      
      Fixes: d6fd0ae2 ("Btrfs: fix missing delayed iputs on unmount")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a362bb86
    • F
      btrfs: fix hang during unmount when stopping block group reclaim worker · 8a1f1e3d
      Committed by Filipe Manana
      During early unmount, at close_ctree(), we try to stop the block group
      reclaim task with cancel_work_sync(), but that may hang if the block group
      reclaim task is currently at btrfs_relocate_block_group() waiting for the
      flag BTRFS_FS_UNFINISHED_DROPS to be cleared from fs_info->flags. During
      unmount we only clear that flag later, after trying to stop the block
      group reclaim task.
      
      Fix that by clearing BTRFS_FS_UNFINISHED_DROPS before trying to stop the
      block group reclaim task and after setting BTRFS_FS_CLOSING_START, so that
      if the reclaim task is waiting on that bit, it will stop immediately after
      being woken, because it sees the filesystem is closing (with a call to
      btrfs_fs_closing()), and then returns immediately with -EINTR.
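      The essence of the fix is an ordering change in close_ctree(); roughly
      (flag names are from the message above, while the reclaim work item name
      and the wake-up helper are approximations):

        set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);

        /*
         * Clear the bit the reclaim task may be waiting on and wake waiters,
         * so btrfs_relocate_block_group() sees btrfs_fs_closing() and returns
         * -EINTR instead of blocking forever.
         */
        clear_and_wake_up_bit(BTRFS_FS_UNFINISHED_DROPS, &fs_info->flags);

        /* now this can no longer hang */
        cancel_work_sync(&fs_info->reclaim_bgs_work);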
      
      Fixes: 31e70e52 ("btrfs: fix hang during unmount when block group reclaim task is running")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8a1f1e3d
  3. September 06, 2022 - 1 commit
    • Q
      btrfs: fix the max chunk size and stripe length calculation · 5da431b7
      Committed by Qu Wenruo
      [BEHAVIOR CHANGE]
      Since commit f6fca391 ("btrfs: store chunk size in space-info
      struct"), btrfs can no longer create data chunks larger than 1G:
      
        mkfs.btrfs -f -m raid1 -d raid0 $dev1 $dev2 $dev3 $dev4
        mount $dev1 $mnt
      
        btrfs balance start --full $mnt
        btrfs balance start --full $mnt
        umount $mnt
      
        btrfs ins dump-tree -t chunk $dev1 | grep "DATA|RAID0" -C 2
      
      Before that offending commit, what we got is a 4G data chunk:
      
      	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
      		length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
      		io_align 65536 io_width 65536 sector_size 4096
      		num_stripes 4 sub_stripes 1
      
      Now what we got is only 1G data chunk:
      
      	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 6271533056) itemoff 15491 itemsize 176
      		length 1073741824 owner 2 stripe_len 65536 type DATA|RAID0
      		io_align 65536 io_width 65536 sector_size 4096
      		num_stripes 4 sub_stripes 1
      
      This will increase the number of data chunks by the number of devices,
      which not only increases system chunk usage but also greatly increases
      mount time.
      
      Without a proper reason, we should not change the max chunk size.
      
      [CAUSE]
      Previously, we set the max data chunk size to 10G, while the max data
      stripe length was 1G.

      Commit f6fca391 ("btrfs: store chunk size in space-info struct")
      completely ignored the 10G limit and used the 1G max stripe limit instead,
      causing the above shrink in the max data chunk size.
      
      [FIX]
      Fix the max data chunk size back to 10G, and in decide_stripe_size_regular()
      limit stripe_size to 1G manually.

      This should only affect data chunks, as for metadata chunks we always
      set the max stripe size the same as the max chunk size (256M or 1G
      depending on fs size).

      Now the same script gives the same old result:
      
      	item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 9492758528) itemoff 15491 itemsize 176
      		length 4294967296 owner 2 stripe_len 65536 type DATA|RAID0
      		io_align 65536 io_width 65536 sector_size 4096
      		num_stripes 4 sub_stripes 1
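      A sketch of the kind of clamp described in [FIX] above (simplified; the
      point is that the per-device stripe stays at 1G while the data chunk
      itself may again reach 10G across devices):

        #include <linux/minmax.h>
        #include <linux/sizes.h>
        #include <linux/types.h>

        /* used from decide_stripe_size_regular() for data chunks */
        static u64 example_cap_data_stripe_size(u64 stripe_size)
        {
        	/* per-device stripe length for data chunks stays at 1G */
        	return min_t(u64, stripe_size, SZ_1G);
        }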
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Fixes: f6fca391 ("btrfs: store chunk size in space-info struct")
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5da431b7
  4. September 05, 2022 - 3 commits
  5. August 24, 2022 - 4 commits
    • A
      btrfs: add info when mount fails due to stale replace target · f2c3bec2
      Committed by Anand Jain
      If the replace target device reappears after the suspended replace is
      cancelled, it blocks the mount operation as it can't find the matching
      replace-item in the metadata. As shown below,
      
         BTRFS error (device sda5): replace devid present without an active replace item
      
      To overcome this situation, the user can run the command
      
         btrfs device scan --forget <replace target device>
      
      and try the mount command again. Also, to avoid repeating the issue,
      the superblock on the devid=0 device must be wiped:
      
         wipefs -a device-path-to-devid=0.
      
      This patch adds some info when this situation occurs.
      Reported-by: Samuel Greiner <samuel@balkonien.org>
      Link: https://lore.kernel.org/linux-btrfs/b4f62b10-b295-26ea-71f9-9a5c9299d42c@balkonien.org/T/
      CC: stable@vger.kernel.org # 5.0+
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f2c3bec2
    • A
      btrfs: replace: drop assert for suspended replace · 59a39919
      Committed by Anand Jain
      If the filesystem mounts with the replace operation in a suspended state
      and we try to cancel the suspended replace operation, we hit the assert.
      The assert came from commit fe97e2e1 ("btrfs: dev-replace: replace's
      scrub must not be running in suspended state"), which was actually not
      required. So just remove it.
      
       $ mount /dev/sda5 /btrfs
      
          BTRFS info (device sda5): cannot continue dev_replace, tgtdev is missing
          BTRFS info (device sda5): you may cancel the operation after 'mount -o degraded'
      
       $ mount -o degraded /dev/sda5 /btrfs <-- success.
      
       $ btrfs replace cancel /btrfs
      
          kernel: assertion failed: ret != -ENOTCONN, in fs/btrfs/dev-replace.c:1131
          kernel: ------------[ cut here ]------------
          kernel: kernel BUG at fs/btrfs/ctree.h:3750!
      
      After the patch:
      
       $ btrfs replace cancel /btrfs
      
          BTRFS info (device sda5): suspended dev_replace from /dev/sda5 (devid 1) to <missing disk> canceled
      
      Fixes: fe97e2e1 ("btrfs: dev-replace: replace's scrub must not be running in suspended state")
      CC: stable@vger.kernel.org # 5.0+
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      59a39919
    • F
      btrfs: fix silent failure when deleting root reference · 47bf225a
      Committed by Filipe Manana
      At btrfs_del_root_ref(), if btrfs_search_slot() returns an error, we end
      up returning from the function with a value of 0 (success). This happens
      because the function returns the value stored in the variable 'err',
      which is 0, while the error value we got from btrfs_search_slot() is
      stored in the 'ret' variable.
      
      So fix it by setting 'err' with the error value.
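      The fix is a one-liner in spirit; a sketch of the corrected pattern
      inside btrfs_del_root_ref() (variable names as in the message above):

        ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
        if (ret < 0) {
        	err = ret;	/* previously the error in 'ret' was dropped here */
        	goto out;
        }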
      
      Fixes: 8289ed9f ("btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      47bf225a
    • O
      btrfs: fix space cache corruption and potential double allocations · ced8ecf0
      Committed by Omar Sandoval
      When testing space_cache v2 on a large set of machines, we encountered a
      few symptoms:
      
      1. "unable to add free space :-17" (EEXIST) errors.
      2. Missing free space info items, sometimes caught with a "missing free
         space info for X" error.
      3. Double-accounted space: ranges that were allocated in the extent tree
         and also marked as free in the free space tree, ranges that were
         marked as allocated twice in the extent tree, or ranges that were
         marked as free twice in the free space tree. If the latter made it
         onto disk, the next reboot would hit the BUG_ON() in
         add_new_free_space().
      4. On some hosts with no on-disk corruption or error messages, the
         in-memory space cache (dumped with drgn) disagreed with the free
         space tree.
      
      All of these symptoms have the same underlying cause: a race between
      caching the free space for a block group and returning free space to the
      in-memory space cache for pinned extents causes us to double-add a free
      range to the space cache. This race exists when free space is cached
      from the free space tree (space_cache=v2) or the extent tree
      (nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
      struct btrfs_block_group::last_byte_to_unpin and struct
      btrfs_block_group::progress are supposed to protect against this race,
      but commit d0c2f4fa ("btrfs: make concurrent fsyncs wait less when
      waiting for a transaction commit") subtly broke this by allowing
      multiple transactions to be unpinning extents at the same time.
      
      Specifically, the race is as follows:
      
      1. An extent is deleted from an uncached block group in transaction A.
      2. btrfs_commit_transaction() is called for transaction A.
      3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
         ref for the deleted extent.
      4. __btrfs_free_extent() -> do_free_extent_accounting() ->
         add_to_free_space_tree() adds the deleted extent back to the free
         space tree.
      5. do_free_extent_accounting() -> btrfs_update_block_group() ->
         btrfs_cache_block_group() queues up the block group to get cached.
         block_group->progress is set to block_group->start.
      6. btrfs_commit_transaction() for transaction A calls
         switch_commit_roots(). It sets block_group->last_byte_to_unpin to
         block_group->progress, which is block_group->start because the block
         group hasn't been cached yet.
      7. The caching thread gets to our block group. Since the commit roots
         were already switched, load_free_space_tree() sees the deleted extent
         as free and adds it to the space cache. It finishes caching and sets
         block_group->progress to U64_MAX.
      8. btrfs_commit_transaction() advances transaction A to
         TRANS_STATE_SUPER_COMMITTED.
      9. fsync calls btrfs_commit_transaction() for transaction B. Since
         transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
         commit is for fsync, it advances.
      10. btrfs_commit_transaction() for transaction B calls
          switch_commit_roots(). This time, the block group has already been
          cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
      11. btrfs_commit_transaction() for transaction A calls
          btrfs_finish_extent_commit(), which calls unpin_extent_range() for
          the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
          transaction B!), so it adds the deleted extent to the space cache
          again!
      
      This explains all of our symptoms above:
      
      * If the sequence of events is exactly as described above, when the free
        space is re-added in step 11, it will fail with EEXIST.
      * If another thread reallocates the deleted extent in between steps 7
        and 11, then step 11 will silently re-add that space to the space
        cache as free even though it is actually allocated. Then, if that
        space is allocated *again*, the free space tree will be corrupted
        (namely, the wrong item will be deleted).
      * If we don't catch this free space tree corruption, it will continue
        to get worse as extents are deleted and reallocated.
      
      The v1 space_cache is synchronously loaded when an extent is deleted
      (btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
      with load_cache_only=1), so it is not normally affected by this bug.
      However, as noted above, if we fail to load the space cache, we will
      fall back to caching from the extent tree and may hit this bug.
      
      The easiest fix for this race is to also make caching from the free
      space tree or extent tree synchronous. Josef tested this and found no
      performance regressions.
      
      A few extra changes fall out of this change. Namely, this fix does the
      following, with step 2 being the crucial fix:
      
      1. Factor btrfs_caching_ctl_wait_done() out of
         btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
         that we already hold a reference to.
      2. Change the call in btrfs_cache_block_group() of
         btrfs_wait_space_cache_v1_finished() to
         btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
         space_cache option.
      3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
         space_cache_v1_done().
      4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
         `bool wait` to more accurately describe its new meaning.
      5. Change a few callers which had a separate call to
         btrfs_wait_block_group_cache_done() to use wait = true instead.
      6. Make btrfs_wait_block_group_cache_done() static now that it's not
         used outside of block-group.c anymore.
      
      Fixes: d0c2f4fa ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
      CC: stable@vger.kernel.org # 5.12+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ced8ecf0
  6. August 23, 2022 - 5 commits
    • J
      btrfs: don't allow large NOWAIT direct reads · 79d3d1d1
      Committed by Josef Bacik
      Dylan and Jens reported a problem where they had an io_uring test that
      was returning short reads, and bisected it to ee5b46a3 ("btrfs:
      increase direct io read size limit to 256 sectors").
      
      The root cause is that their test was doing larger reads via io_uring with
      NOWAIT and async.  This was triggering a page fault during the direct
      read; however, the first page worked just fine and thus we
      submitted a 4k read for a larger iocb.

      Btrfs allows partial IOs in this case specifically because we don't
      allow page faults, and thus we'll attempt to do any IO that we can,
      submit what we could, come back, fault in the rest of the range and
      try to do the remaining IO.

      However, for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
      partial dio is done, which is incorrect.  In the sync case we can exit
      the iomap code, submit more IOs, and return with the amount of IO we
      were able to complete successfully.

      We were always doing short reads in this case, but for NOWAIT we were
      getting saved by the fact that we were limiting direct reads to
      sectorsize, and if the read was larger than that we would return EAGAIN.
      
      Fix the regression by simply returning EAGAIN in the NOWAIT case with
      larger reads; that way io_uring can retry, get the larger IO, and have
      the fault logic handle everything properly.
      
      This still leaves the AIO short read case, but that existed before this
      change.  The way to properly fix this would be to handle partial iocb
      completions, but that's a lot of work; for now, deal with the regression
      in the most straightforward way possible.
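      A sketch of the kind of check this adds on the direct read path (the
      placement and exact condition are simplified; the iocb/iov_iter names
      are the usual VFS ones):

        /*
         * Large NOWAIT direct reads cannot be partially completed and resumed
         * safely for async io_uring submissions, so ask the caller to retry
         * without NOWAIT instead.
         */
        if ((iocb->ki_flags & IOCB_NOWAIT) &&
            iov_iter_count(to) > fs_info->sectorsize)
        	return -EAGAIN;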
      Reported-by: Dylan Yudaken <dylany@fb.com>
      Fixes: ee5b46a3 ("btrfs: increase direct io read size limit to 256 sectors")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      79d3d1d1
    • Q
      btrfs: don't merge pages into bio if their page offset is not contiguous · 4a445b7b
      Committed by Qu Wenruo
      [BUG]
      Zygo reported that, on the latest development branch, he could hit an
      ASSERT()/BUG_ON() caused crash when doing RAID5 recovery (intentionally
      corrupting one disk and letting btrfs recover the data during read/scrub).

      The following minimal reproducer can cause extent state leakage at
      rmmod time:
      
        mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null
        mount $dev1 $mnt
        fsstress -w -d $mnt -n 25 -s 1660807876
        sync
        fssum -A -f -w /tmp/fssum.saved $mnt
        umount $mnt
      
        # Wipe the dev1 but keeps its super block
        xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1
        mount $dev1 $mnt
        fssum -r /tmp/fssum.saved $mnt > /dev/null
        umount $mnt
        rmmod btrfs
      
      This will lead to the following extent states leakage:
      
        BTRFS: state leak: start 499712 end 503807 state 5 in tree 1 refs 1
        BTRFS: state leak: start 495616 end 499711 state 5 in tree 1 refs 1
        BTRFS: state leak: start 491520 end 495615 state 5 in tree 1 refs 1
        BTRFS: state leak: start 487424 end 491519 state 5 in tree 1 refs 1
        BTRFS: state leak: start 483328 end 487423 state 5 in tree 1 refs 1
        BTRFS: state leak: start 479232 end 483327 state 5 in tree 1 refs 1
        BTRFS: state leak: start 475136 end 479231 state 5 in tree 1 refs 1
        BTRFS: state leak: start 471040 end 475135 state 5 in tree 1 refs 1
      
      [CAUSE]
      Since commit 7aa51232 ("btrfs: pass a btrfs_bio to
      btrfs_repair_one_sector"), we always use btrfs_bio->file_offset to
      determine the file offset of a page.

      But that usage assumes that all pages in one bio have contiguous
      page offsets.

      Unfortunately that's not true; btrfs only requires the logical bytenr
      to be contiguous when assembling its bios.

      From the above script, we get one bio that looks like this:
      
        fssum-27671  submit_one_bio: bio logical=217739264 len=36864
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=466944 <<<
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=724992 <<<
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=729088
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=733184
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=737280
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=741376
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=745472
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=749568
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=753664
      
      Note that the 1st and the 2nd pages have non-contiguous page offsets.

      This means that, at repair time, we will have a completely wrong file
      offset passed in:
      
         kworker/u32:2-19927  btrfs_repair_one_sector: r/i=5/261 page_off=729088 file_off=475136 bio_offset=8192
      
      Since the file offset is incorrect, we later set incorrect extent
      states, and there is no way to really release them.

      Thus it later causes the leakage.

      In fact, this can be even worse: since the file offset is incorrect, we
      can hit cases where the incorrect file offset belongs to a HOLE, which
      later causes btrfs_num_copies() to trigger an error and finally hits a
      BUG_ON()/ASSERT().
      
      [FIX]
      Add an extra condition in btrfs_bio_add_page() for uncompressed IO.
      
      Now we have a stricter requirement for bio pages:
      
      - They should all have the same mapping
        (the mapping check is already implied by the call chain)
      
      - Their logical bytenr should be adjacent
        This is the same as the old condition.
      
      - Their page_offset() (file offset) should be adjacent
        This is the new check.
        This would result in a slightly increased number of bios from btrfs
        (it needs holes inside the same stripe boundary to trigger).

        But this would greatly reduce the confusion, as it's pretty common
        to assume a btrfs bio only contains contiguous page cache.
      
      Later we may need extra cleanups, as we no longer need to handle gaps
      between page offsets in endio functions.
      
      Currently this should be the minimal patch to fix commit 7aa51232
      ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector").
      Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Fixes: 7aa51232 ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4a445b7b
    • F
      btrfs: update generation of hole file extent item when merging holes · e6e3dec6
      Committed by Filipe Manana
      When punching a hole into a file range that is adjacent to a hole and we
      are not using the no-holes feature, we expand the range of the adjacent
      file extent item that represents a hole, to save metadata space.

      However, we don't update the generation of the hole file extent item, which
      means a full fsync will not log that file extent item if the fsync happens
      in a later transaction (since commit 7f30c072 ("btrfs: stop copying
      old file extents when doing a full fsync")).
      
      For example, if we do this:
      
          $ mkfs.btrfs -f -O ^no-holes /dev/sdb
          $ mount /dev/sdb /mnt
          $ xfs_io -f -c "pwrite -S 0xab 2M 2M" /mnt/foobar
          $ sync
      
      We end up with 2 file extent items in our file:
      
      1) One that represents the hole for the file range [0, 2M), with a
         generation of 7;
      
      2) Another one that represents an extent covering the range [2M, 4M).
      
      After that if we do the following:
      
          $ xfs_io -c "fpunch 2M 2M" /mnt/foobar
      
      We end up with a single file extent item in the file, which represents a
      hole for the range [0, 4M) and with a generation of 7 - because we end up
      dropping the data extent for the range [2M, 4M) and then updating the file
      extent item that represented the hole at [0, 2M), by increasing its
      length from 2M to 4M.
      
      Then doing a full fsync and power failing:
      
          $ xfs_io -c "fsync" /mnt/foobar
          <power failure>
      
      will result in the full fsync not logging the file extent item that
      represents the hole for the range [0, 4M), because its generation is 7,
      which is lower than the generation of the current transaction (8).
      As a consequence, after mounting again the filesystem (after log replay),
      the region [2M, 4M) does not have a hole, it still points to the
      previous data extent.
      
      So fix this by always updating the generation of existing file extent
      items representing holes when we merge/expand them. This solves the
      problem and it's the same approach as when we merge prealloc extents that
      got written (at btrfs_mark_extent_written()). Setting the generation to
      the current transaction's generation is also what we do when merging
      the new hole extent map with the previous one or the next one.
      
      A test case for fstests, covering both cases of hole file extent item
      merging (to the left and to the right), will be sent soon.
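      A sketch of the update when an existing hole file extent item is expanded
      (the btrfs_set_file_extent_* accessors are the usual generated helpers;
      leaf, fi, new_num_bytes and trans are assumed to come from the surrounding
      hole punching code):

        /*
         * Expand the existing hole item and stamp it with the current transid,
         * so a later full fsync in this transaction still copies it to the log.
         */
        btrfs_set_file_extent_num_bytes(leaf, fi, new_num_bytes);
        btrfs_set_file_extent_generation(leaf, fi, trans->transid);
        btrfs_mark_buffer_dirty(leaf);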
      
      Fixes: 7f30c072 ("btrfs: stop copying old file extents when doing a full fsync")
      CC: stable@vger.kernel.org # 5.18+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e6e3dec6
    • Z
      btrfs: fix possible memory leak in btrfs_get_dev_args_from_path() · 9ea0106a
      Committed by Zixuan Fu
      In btrfs_get_dev_args_from_path(), btrfs_get_bdev_and_sb() can fail if
      the path is invalid. In this case, btrfs_get_dev_args_from_path()
      returns directly without freeing args->uuid and args->fsid allocated
      before, which causes a memory leak.
      
      To fix these possible leaks, when btrfs_get_bdev_and_sb() fails,
      btrfs_put_dev_args_from_path() is called to clean up the memory.
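      A sketch of the error path this describes (the btrfs_get_bdev_and_sb()
      argument list is reproduced from memory and may differ; the essential
      part is the cleanup call before returning the error):

        ret = btrfs_get_bdev_and_sb(path, FMODE_READ, NULL, 0, &bdev, &disk_super);
        if (ret) {
        	/* free args->uuid / args->fsid allocated earlier */
        	btrfs_put_dev_args_from_path(args);
        	return ret;
        }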
      Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
      Fixes: faa775c4 ("btrfs: add a btrfs_get_dev_args_from_path helper")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Zixuan Fu <r33s3n6@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9ea0106a
    • G
      btrfs: check if root is readonly while setting security xattr · b5111127
      Committed by Goldwyn Rodrigues
      For a filesystem which has the btrfs read-only property set to true, all
      write operations, including xattr, should be denied. However, security
      xattrs can still be changed even if the btrfs ro property is true.

      This happens because xattr_permission() does not have any restrictions
      on security.*, system.* and, in some cases, trusted.* from the VFS, and
      the decision is left to the underlying filesystem. See comments in
      xattr_permission() for more details.
      
      This patch checks if the root is read-only before performing the set
      xattr operation.
      
      Testcase:
      
        DEV=/dev/vdb
        MNT=/mnt
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
        echo "file one" > $MNT/f1
      
        setfattr -n "security.one" -v 2 $MNT/f1
        btrfs property set /mnt ro true
      
        setfattr -n "security.one" -v 1 $MNT/f1
      
        umount $MNT
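      A sketch of the check described above, at the top of the btrfs xattr set
      handler (btrfs_root_readonly() is the existing helper that reflects the
      ro property; the exact return value is a detail of the real patch):

        /* deny all xattr writes, including security.*, on a read-only subvolume */
        if (btrfs_root_readonly(root))
        	return -EROFS;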
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b5111127
  7. August 17, 2022 - 6 commits
    • J
      btrfs: tree-checker: check for overlapping extent items · 899b7f69
      Committed by Josef Bacik
      We're seeing a weird problem in production where we have overlapping
      extent items in the extent tree.  It's unclear where these are coming
      from, and in debugging we realized there's no check in the tree checker
      for this sort of problem.  Add a check to the tree-checker to make sure
      that the extents do not overlap each other.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      899b7f69
    • F
      btrfs: fix warning during log replay when bumping inode link count · 769030e1
      Committed by Filipe Manana
      During log replay, at add_link(), we may increment the link count of
      another inode that has a reference that conflicts with a new reference
      for the inode currently being processed.
      
      During log replay, at add_link(), we may drop (unlink) a reference from
      some inode in the subvolume tree if that reference conflicts with a new
      reference found in the log for the inode we are currently processing.
      
      After the unlink, if the link count has decreased from 1 to 0, then we
      increment the link count to prevent the inode from being deleted if it's
      evicted by an iput() call, because we may have references to add to that
      inode later on (and we will fix up its link count later during log replay).
      
      However incrementing the link count from 0 to 1 triggers a warning:
      
        $ cat fs/inode.c
        (...)
        void inc_nlink(struct inode *inode)
        {
              if (unlikely(inode->i_nlink == 0)) {
                       WARN_ON(!(inode->i_state & I_LINKABLE));
                       atomic_long_dec(&inode->i_sb->s_remove_count);
              }
        (...)
      
      The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's
      never set during log replay.
      
      Most of the time, the warning isn't triggered even if we dropped the last
      reference of the conflicting inode, and this is because:
      
      1) The conflicting inode was previously marked for fixup, through a call
         to link_to_fixup_dir(), which increments the inode's link count;
      
      2) And the last iput() on the inode has not triggered eviction of the
         inode, nor was eviction triggered after the iput(). So at add_link(),
         even if we unlink the last reference of the inode, its link count ends
         up being 1 and not 0.
      
      So this means that if eviction is triggered after link_to_fixup_dir() is
      called, at add_link() we will read the inode back from the subvolume tree
      and have it with a correct link count, matching the number of references
      it has on the subvolume tree. So if, when we are at add_link(), the inode
      has exactly one reference, its link count is 1, and after the unlink its
      link count becomes 0.
      
      So fix this by using set_nlink() instead of inc_nlink(), as the former
      accepts a transition from 0 to 1 and it's what we use in other similar
      contexts (like at link_to_fixup_dir()).
      
      Also make add_inode_ref() use set_nlink() instead of inc_nlink() to
      bump the link count from 0 to 1.
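
      A sketch of the substitution (the other_inode naming is illustrative; the
      actual call sites are in add_link() and add_inode_ref() in
      fs/btrfs/tree-log.c):

        /* inc_nlink() warns on a 0 -> 1 transition unless I_LINKABLE is set,
         * which never happens during log replay; set_nlink() does not. */
        if (other_inode->i_nlink == 0)
                set_nlink(other_inode, 1);
        else
                inc_nlink(other_inode);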
      
      The warning is actually harmless, but it may scare users. Josef also ran
      into it recently.
      
      CC: stable@vger.kernel.org # 5.1+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      769030e1
    • F
      btrfs: fix lost error handling when looking up extended ref on log replay · 7a6b75b7
      Filipe Manana 提交于
      During log replay, when processing inode references, if we get an error
      when looking up an extended reference at __add_inode_ref(), we ignore
      it and proceed, returning success (0) if no other error happens after the
      lookup. This is obviously wrong because in case an extended reference
      exists and it encodes some name not in the log, we need to unlink it,
      otherwise the filesystem state will not match the state it had after the
      last fsync.
      
      So just make __add_inode_ref() return the error it gets from the extended
      reference lookup.
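
      Conceptually the change is just to stop swallowing that error; a sketch
      with the lookup arguments elided:

        extref = btrfs_lookup_inode_extref(/* ... lookup arguments ... */);
        if (IS_ERR(extref))
                return PTR_ERR(extref);        /* previously this error was ignored */
        if (extref) {
                /* the name exists as an extended ref but not in the log, so it
                 * must be unlinked to match the state after the last fsync */
        }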
      
      Fixes: f186373f ("btrfs: extended inode refs")
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7a6b75b7
    • J
      btrfs: fix lockdep splat with reloc root extent buffers · b40130b2
      Josef Bacik 提交于
      We have been hitting the following lockdep splat with btrfs/187 recently
      
        WARNING: possible circular locking dependency detected
        5.19.0-rc8+ #775 Not tainted
        ------------------------------------------------------
        btrfs/752500 is trying to acquire lock:
        ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        but task is already holding lock:
        ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_init_new_buffer+0x7d/0x2c0
      	 btrfs_alloc_tree_block+0x120/0x3b0
      	 __btrfs_cow_block+0x136/0x600
      	 btrfs_cow_block+0x10b/0x230
      	 btrfs_search_slot+0x53b/0xb70
      	 btrfs_lookup_inode+0x2a/0xa0
      	 __btrfs_update_delayed_inode+0x5f/0x280
      	 btrfs_async_run_delayed_root+0x24c/0x290
      	 btrfs_work_helper+0xf2/0x3e0
      	 process_one_work+0x271/0x590
      	 worker_thread+0x52/0x3b0
      	 kthread+0xf0/0x120
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (btrfs-tree-01){++++}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_search_slot+0x3c3/0xb70
      	 do_relocation+0x10c/0x6b0
      	 relocate_tree_blocks+0x317/0x6d0
      	 relocate_block_group+0x1f1/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
      	 __lock_acquire+0x1122/0x1e10
      	 lock_acquire+0xc2/0x2d0
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_lock_root_node+0x31/0x50
      	 btrfs_search_slot+0x1cb/0xb70
      	 replace_path+0x541/0x9f0
      	 merge_reloc_root+0x1d6/0x610
      	 merge_reloc_roots+0xe2/0x260
      	 relocate_block_group+0x2c8/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        other info that might help us debug this:
      
        Chain exists of:
          btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-tree-01/1);
      				 lock(btrfs-tree-01);
      				 lock(btrfs-tree-01/1);
          lock(btrfs-treloc-02#2);
      
         *** DEADLOCK ***
      
        7 locks held by btrfs/752500:
         #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
         #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
         #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
         #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
         #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        stack backtrace:
        CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
      
         dump_stack_lvl+0x56/0x73
         check_noncircular+0xd6/0x100
         ? lock_is_held_type+0xe2/0x140
         __lock_acquire+0x1122/0x1e10
         lock_acquire+0xc2/0x2d0
         ? __btrfs_tree_lock+0x24/0x110
         down_write_nested+0x41/0x80
         ? __btrfs_tree_lock+0x24/0x110
         __btrfs_tree_lock+0x24/0x110
         btrfs_lock_root_node+0x31/0x50
         btrfs_search_slot+0x1cb/0xb70
         ? lock_release+0x137/0x2d0
         ? _raw_spin_unlock+0x29/0x50
         ? release_extent_buffer+0x128/0x180
         replace_path+0x541/0x9f0
         merge_reloc_root+0x1d6/0x610
         merge_reloc_roots+0xe2/0x260
         relocate_block_group+0x2c8/0x560
         btrfs_relocate_block_group+0x23e/0x400
         btrfs_relocate_chunk+0x4c/0x140
         btrfs_balance+0x755/0xe40
         btrfs_ioctl+0x1ea2/0x2c90
         ? lock_is_held_type+0xe2/0x140
         ? lock_is_held_type+0xe2/0x140
         ? __x64_sys_ioctl+0x88/0xc0
         __x64_sys_ioctl+0x88/0xc0
         do_syscall_64+0x38/0x90
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      This isn't necessarily new, it's just tricky to hit in practice.  There
      are two competing things going on here.  With relocation we create a
      snapshot of every fs tree with a reloc tree.  Any extent buffers that
      get initialized here are initialized with the reloc root lockdep key.
      However since it is a snapshot, any blocks that are currently in cache
      that originally belonged to the fs tree will have the normal tree
      lockdep key set.  This creates the lock dependency of
      
        reloc tree -> normal tree
      
      for the extent buffer locking during the first phase of the relocation
      as we walk down the reloc root to relocate blocks.
      
      However this is problematic because the final phase of the relocation is
      merging the reloc root into the original fs root.  This involves
      searching down to any keys that exist in the original fs root and then
      swapping the relocated block and the original fs root block.  We have to
      search down to the fs root first, and then go search the reloc root for
      the block we need to replace.  This creates the dependency of
      
        normal tree -> reloc tree
      
      which is why lockdep complains.
      
      Additionally, even if we were to fix this particular mismatch with a
      different nesting for the merge case, we're still slotting in a block
      that has an owner of the reloc root objectid into a normal tree, so that
      block will have its lockdep key set to the tree reloc root, and will
      create a lockdep splat later on when we wander into that block from the
      fs root.
      
      Unfortunately the only solution here is to make sure we do not set the
      lockdep key to the reloc tree lockdep key normally, and then reset any
      blocks we wander into from the reloc root when we're doing the merge.
      
      This solves the problem of having tree reloc keys intermixed with
      normal tree keys, and allows us to make sure that in the merge case we
      maintain the lock order of
      
        normal tree -> reloc tree
      
      We handle this by setting a bit on the reloc root when we do the search
      for the block we want to relocate, and any block we search into or COW
      at that point gets set to the reloc tree key.  This works correctly
      because we only ever COW down to the parent node, so we aren't resetting
      the key for the block we're linking into the fs root.
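
      The reset itself boils down to a small helper along these lines (a
      sketch: the root-state bit and the helper name follow the description
      above, and btrfs_set_buffer_lockdep_class() is the existing per-root
      lockdep-class setter):

        void btrfs_maybe_reset_lockdep_class(struct btrfs_root *root,
                                             struct extent_buffer *eb)
        {
                /* Only roots flagged during the reloc-root search get their
                 * buffers reset to that root's lockdep class. */
                if (test_bit(BTRFS_ROOT_RESET_LOCKDEP_CLASS, &root->state))
                        btrfs_set_buffer_lockdep_class(root->root_key.objectid,
                                                       eb, btrfs_header_level(eb));
        }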
      
      With this patch we no longer have the lockdep splat in btrfs/187.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b40130b2
    • J
      btrfs: move lockdep class helpers to locking.c · 0a27a047
      Josef Bacik 提交于
      These definitions exist in disk-io.c, which is not related to the
      locking. Move them over to locking.h/c where they make more sense.
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0a27a047
    • Z
      btrfs: unset reloc control if transaction commit fails in prepare_to_relocate() · 85f02d6c
      Zixuan Fu 提交于
      In btrfs_relocate_block_group(), the reloc control (rc) is allocated.  Then
      btrfs_relocate_block_group() calls
      
      relocate_block_group()
        prepare_to_relocate()
          set_reloc_control()
      
      that assigns rc to the variable fs_info->reloc_ctl. Before returning,
      prepare_to_relocate() calls
      
      btrfs_commit_transaction()
        btrfs_start_dirty_block_groups()
          btrfs_alloc_path()
            kmem_cache_zalloc()
      
      which may fail (for example due to a memory allocation failure, or
      because of other errors). When the failure occurs,
      btrfs_relocate_block_group() detects the error and frees rc, but doesn't
      set fs_info->reloc_ctl to NULL. After that, in
      btrfs_init_reloc_root(), rc is retrieved from fs_info->reloc_ctl and
      then used, which may cause a use-after-free bug.
      
      This possible bug can be triggered by calling btrfs_ioctl_balance()
      before calling btrfs_ioctl_defrag().
      
      To fix this possible bug, check in prepare_to_relocate() whether
      btrfs_commit_transaction() fails. If it does, call unset_reloc_control()
      to set fs_info->reloc_ctl back to NULL.
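
      The shape of the fix at the end of prepare_to_relocate() is roughly (a
      sketch; err is the local error variable):

        err = btrfs_commit_transaction(trans);
        if (err)
                unset_reloc_control(rc);   /* don't leave fs_info->reloc_ctl dangling */

        return err;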
      
      The error log in our fault-injection testing is shown as follows:
      
        [   58.751070] BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x7ca/0x920 [btrfs]
        ...
        [   58.753577] Call Trace:
        ...
        [   58.755800]  kasan_report+0x45/0x60
        [   58.756066]  btrfs_init_reloc_root+0x7ca/0x920 [btrfs]
        [   58.757304]  record_root_in_trans+0x792/0xa10 [btrfs]
        [   58.757748]  btrfs_record_root_in_trans+0x463/0x4f0 [btrfs]
        [   58.758231]  start_transaction+0x896/0x2950 [btrfs]
        [   58.758661]  btrfs_defrag_root+0x250/0xc00 [btrfs]
        [   58.759083]  btrfs_ioctl_defrag+0x467/0xa00 [btrfs]
        [   58.759513]  btrfs_ioctl+0x3c95/0x114e0 [btrfs]
        ...
        [   58.768510] Allocated by task 23683:
        [   58.768777]  ____kasan_kmalloc+0xb5/0xf0
        [   58.769069]  __kmalloc+0x227/0x3d0
        [   58.769325]  alloc_reloc_control+0x10a/0x3d0 [btrfs]
        [   58.769755]  btrfs_relocate_block_group+0x7aa/0x1e20 [btrfs]
        [   58.770228]  btrfs_relocate_chunk+0xf1/0x760 [btrfs]
        [   58.770655]  __btrfs_balance+0x1326/0x1f10 [btrfs]
        [   58.771071]  btrfs_balance+0x3150/0x3d30 [btrfs]
        [   58.771472]  btrfs_ioctl_balance+0xd84/0x1410 [btrfs]
        [   58.771902]  btrfs_ioctl+0x4caa/0x114e0 [btrfs]
        ...
        [   58.773337] Freed by task 23683:
        ...
        [   58.774815]  kfree+0xda/0x2b0
        [   58.775038]  free_reloc_control+0x1d6/0x220 [btrfs]
        [   58.775465]  btrfs_relocate_block_group+0x115c/0x1e20 [btrfs]
        [   58.775944]  btrfs_relocate_chunk+0xf1/0x760 [btrfs]
        [   58.776369]  __btrfs_balance+0x1326/0x1f10 [btrfs]
        [   58.776784]  btrfs_balance+0x3150/0x3d30 [btrfs]
        [   58.777185]  btrfs_ioctl_balance+0xd84/0x1410 [btrfs]
        [   58.777621]  btrfs_ioctl+0x4caa/0x114e0 [btrfs]
        ...
      Reported-by: NTOTE Robot <oslab@tsinghua.edu.cn>
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: NSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NZixuan Fu <r33s3n6@gmail.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      85f02d6c