1. 06 12月, 2022 12 次提交
  2. 29 9月, 2022 3 次提交
    • F
      btrfs: add helper to replace extent map range with a new extent map · a1ba4c08
      Filipe Manana 提交于
      We have several places that need to drop all the extent maps in a given
      file range and then add a new extent map for that range. Currently they
      call btrfs_drop_extent_map_range() to delete all extent maps in the range
      and then keep trying to add the new extent map in a loop that keeps
      retrying while the insertion of the new extent map fails with -EEXIST.
      
      So instead of repeating this logic, add a helper to extent_map.c that
      does these steps and name it btrfs_replace_extent_map_range(). Also add
      a comment about why the retry loop is necessary.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a1ba4c08
    • F
      btrfs: move btrfs_drop_extent_cache() to extent_map.c · 4c0c8cfc
      Filipe Manana 提交于
      The function btrfs_drop_extent_cache() doesn't really belong at file.c
      because what it does is drop a range of extent maps for a file range.
      It directly allocates and manipulates extent maps, by dropping,
      splitting and replacing them in an extent map tree, so it should be
      located at extent_map.c, where all manipulations of an extent map tree
      and its extent maps are supposed to be done.
      
      So move it out of file.c and into extent_map.c. Additionally do the
      following changes:
      
      1) Rename it into btrfs_drop_extent_map_range(), as this makes it more
         clear about what it does. The term "cache" is a bit confusing as it's
         not widely used, "extent maps" or "extent mapping" is much more common;
      
      2) Change its 'skip_pinned' argument from int to bool;
      
      3) Turn several of its local variables from int to bool, since they are
         used as booleans;
      
      4) Move the declaration of some variables out of the function's main
         scope and into the scopes where they are used;
      
      5) Remove pointless assignment of false to 'modified' early in the while
         loop, as later that variable is set and it's not used before that
         second assignment;
      
      6) Remove checks for NULL before calling free_extent_map().
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4c0c8cfc
    • J
      btrfs: make can_nocow_extent nowait compatible · 26ce9114
      Josef Bacik 提交于
      If we have NOWAIT specified on our IOCB and we're writing into a
      PREALLOC or NOCOW extent then we need to be able to tell
      can_nocow_extent that we don't want to wait on any locks or metadata IO.
      Fix can_nocow_extent to allow for NOWAIT.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NStefan Roesch <shr@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      26ce9114
  3. 26 9月, 2022 1 次提交
  4. 17 8月, 2022 2 次提交
    • J
      btrfs: fix lockdep splat with reloc root extent buffers · b40130b2
      Josef Bacik 提交于
      We have been hitting the following lockdep splat with btrfs/187 recently
      
        WARNING: possible circular locking dependency detected
        5.19.0-rc8+ #775 Not tainted
        ------------------------------------------------------
        btrfs/752500 is trying to acquire lock:
        ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        but task is already holding lock:
        ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_init_new_buffer+0x7d/0x2c0
      	 btrfs_alloc_tree_block+0x120/0x3b0
      	 __btrfs_cow_block+0x136/0x600
      	 btrfs_cow_block+0x10b/0x230
      	 btrfs_search_slot+0x53b/0xb70
      	 btrfs_lookup_inode+0x2a/0xa0
      	 __btrfs_update_delayed_inode+0x5f/0x280
      	 btrfs_async_run_delayed_root+0x24c/0x290
      	 btrfs_work_helper+0xf2/0x3e0
      	 process_one_work+0x271/0x590
      	 worker_thread+0x52/0x3b0
      	 kthread+0xf0/0x120
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (btrfs-tree-01){++++}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_search_slot+0x3c3/0xb70
      	 do_relocation+0x10c/0x6b0
      	 relocate_tree_blocks+0x317/0x6d0
      	 relocate_block_group+0x1f1/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
      	 __lock_acquire+0x1122/0x1e10
      	 lock_acquire+0xc2/0x2d0
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_lock_root_node+0x31/0x50
      	 btrfs_search_slot+0x1cb/0xb70
      	 replace_path+0x541/0x9f0
      	 merge_reloc_root+0x1d6/0x610
      	 merge_reloc_roots+0xe2/0x260
      	 relocate_block_group+0x2c8/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        other info that might help us debug this:
      
        Chain exists of:
          btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-tree-01/1);
      				 lock(btrfs-tree-01);
      				 lock(btrfs-tree-01/1);
          lock(btrfs-treloc-02#2);
      
         *** DEADLOCK ***
      
        7 locks held by btrfs/752500:
         #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
         #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
         #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
         #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
         #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        stack backtrace:
        CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
      
         dump_stack_lvl+0x56/0x73
         check_noncircular+0xd6/0x100
         ? lock_is_held_type+0xe2/0x140
         __lock_acquire+0x1122/0x1e10
         lock_acquire+0xc2/0x2d0
         ? __btrfs_tree_lock+0x24/0x110
         down_write_nested+0x41/0x80
         ? __btrfs_tree_lock+0x24/0x110
         __btrfs_tree_lock+0x24/0x110
         btrfs_lock_root_node+0x31/0x50
         btrfs_search_slot+0x1cb/0xb70
         ? lock_release+0x137/0x2d0
         ? _raw_spin_unlock+0x29/0x50
         ? release_extent_buffer+0x128/0x180
         replace_path+0x541/0x9f0
         merge_reloc_root+0x1d6/0x610
         merge_reloc_roots+0xe2/0x260
         relocate_block_group+0x2c8/0x560
         btrfs_relocate_block_group+0x23e/0x400
         btrfs_relocate_chunk+0x4c/0x140
         btrfs_balance+0x755/0xe40
         btrfs_ioctl+0x1ea2/0x2c90
         ? lock_is_held_type+0xe2/0x140
         ? lock_is_held_type+0xe2/0x140
         ? __x64_sys_ioctl+0x88/0xc0
         __x64_sys_ioctl+0x88/0xc0
         do_syscall_64+0x38/0x90
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      This isn't necessarily new, it's just tricky to hit in practice.  There
      are two competing things going on here.  With relocation we create a
      snapshot of every fs tree with a reloc tree.  Any extent buffers that
      get initialized here are initialized with the reloc root lockdep key.
      However since it is a snapshot, any blocks that are currently in cache
      that originally belonged to the fs tree will have the normal tree
      lockdep key set.  This creates the lock dependency of
      
        reloc tree -> normal tree
      
      for the extent buffer locking during the first phase of the relocation
      as we walk down the reloc root to relocate blocks.
      
      However this is problematic because the final phase of the relocation is
      merging the reloc root into the original fs root.  This involves
      searching down to any keys that exist in the original fs root and then
      swapping the relocated block and the original fs root block.  We have to
      search down to the fs root first, and then go search the reloc root for
      the block we need to replace.  This creates the dependency of
      
        normal tree -> reloc tree
      
      which is why lockdep complains.
      
      Additionally even if we were to fix this particular mismatch with a
      different nesting for the merge case, we're still slotting in a block
      that has a owner of the reloc root objectid into a normal tree, so that
      block will have its lockdep key set to the tree reloc root, and create a
      lockdep splat later on when we wander into that block from the fs root.
      
      Unfortunately the only solution here is to make sure we do not set the
      lockdep key to the reloc tree lockdep key normally, and then reset any
      blocks we wander into from the reloc root when we're doing the merged.
      
      This solves the problem of having mixed tree reloc keys intermixed with
      normal tree keys, and then allows us to make sure in the merge case we
      maintain the lock order of
      
        normal tree -> reloc tree
      
      We handle this by setting a bit on the reloc root when we do the search
      for the block we want to relocate, and any block we search into or COW
      at that point gets set to the reloc tree key.  This works correctly
      because we only ever COW down to the parent node, so we aren't resetting
      the key for the block we're linking into the fs root.
      
      With this patch we no longer have the lockdep splat in btrfs/187.
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b40130b2
    • Z
      btrfs: unset reloc control if transaction commit fails in prepare_to_relocate() · 85f02d6c
      Zixuan Fu 提交于
      In btrfs_relocate_block_group(), the rc is allocated.  Then
      btrfs_relocate_block_group() calls
      
      relocate_block_group()
        prepare_to_relocate()
          set_reloc_control()
      
      that assigns rc to the variable fs_info->reloc_ctl. When
      prepare_to_relocate() returns, it calls
      
      btrfs_commit_transaction()
        btrfs_start_dirty_block_groups()
          btrfs_alloc_path()
            kmem_cache_zalloc()
      
      which may fail for example (or other errors could happen). When the
      failure occurs, btrfs_relocate_block_group() detects the error and frees
      rc and doesn't set fs_info->reloc_ctl to NULL. After that, in
      btrfs_init_reloc_root(), rc is retrieved from fs_info->reloc_ctl and
      then used, which may cause a use-after-free bug.
      
      This possible bug can be triggered by calling btrfs_ioctl_balance()
      before calling btrfs_ioctl_defrag().
      
      To fix this possible bug, in prepare_to_relocate(), check if
      btrfs_commit_transaction() fails. If the failure occurs,
      unset_reloc_control() is called to set fs_info->reloc_ctl to NULL.
      
      The error log in our fault-injection testing is shown as follows:
      
        [   58.751070] BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x7ca/0x920 [btrfs]
        ...
        [   58.753577] Call Trace:
        ...
        [   58.755800]  kasan_report+0x45/0x60
        [   58.756066]  btrfs_init_reloc_root+0x7ca/0x920 [btrfs]
        [   58.757304]  record_root_in_trans+0x792/0xa10 [btrfs]
        [   58.757748]  btrfs_record_root_in_trans+0x463/0x4f0 [btrfs]
        [   58.758231]  start_transaction+0x896/0x2950 [btrfs]
        [   58.758661]  btrfs_defrag_root+0x250/0xc00 [btrfs]
        [   58.759083]  btrfs_ioctl_defrag+0x467/0xa00 [btrfs]
        [   58.759513]  btrfs_ioctl+0x3c95/0x114e0 [btrfs]
        ...
        [   58.768510] Allocated by task 23683:
        [   58.768777]  ____kasan_kmalloc+0xb5/0xf0
        [   58.769069]  __kmalloc+0x227/0x3d0
        [   58.769325]  alloc_reloc_control+0x10a/0x3d0 [btrfs]
        [   58.769755]  btrfs_relocate_block_group+0x7aa/0x1e20 [btrfs]
        [   58.770228]  btrfs_relocate_chunk+0xf1/0x760 [btrfs]
        [   58.770655]  __btrfs_balance+0x1326/0x1f10 [btrfs]
        [   58.771071]  btrfs_balance+0x3150/0x3d30 [btrfs]
        [   58.771472]  btrfs_ioctl_balance+0xd84/0x1410 [btrfs]
        [   58.771902]  btrfs_ioctl+0x4caa/0x114e0 [btrfs]
        ...
        [   58.773337] Freed by task 23683:
        ...
        [   58.774815]  kfree+0xda/0x2b0
        [   58.775038]  free_reloc_control+0x1d6/0x220 [btrfs]
        [   58.775465]  btrfs_relocate_block_group+0x115c/0x1e20 [btrfs]
        [   58.775944]  btrfs_relocate_chunk+0xf1/0x760 [btrfs]
        [   58.776369]  __btrfs_balance+0x1326/0x1f10 [btrfs]
        [   58.776784]  btrfs_balance+0x3150/0x3d30 [btrfs]
        [   58.777185]  btrfs_ioctl_balance+0xd84/0x1410 [btrfs]
        [   58.777621]  btrfs_ioctl+0x4caa/0x114e0 [btrfs]
        ...
      Reported-by: NTOTE Robot <oslab@tsinghua.edu.cn>
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: NSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NZixuan Fu <r33s3n6@gmail.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      85f02d6c
  5. 16 5月, 2022 4 次提交
    • L
      btrfs: remove unnecessary check of iput argument · 8aa1e49e
      Lv Ruyi 提交于
      iput() already handles NULL and non-NULL parameter, so it is not needed
      to check that. This unifies all iput calls.
      Reported-by: NZeal Robot <zealci@zte.com.cn>
      Signed-off-by: NLv Ruyi <lv.ruyi@zte.com.cn>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8aa1e49e
    • Y
      btrfs: remove unnecessary type casts · 0d031dc4
      Yu Zhe 提交于
      Explicit type casts are not necessary when it's void* to another pointer
      type.
      Signed-off-by: NYu Zhe <yuzhe@nfschina.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0d031dc4
    • N
      btrfs: assert that relocation is protected with sb_start_write() · 0320b353
      Naohiro Aota 提交于
      Relocation of a data block group creates ordered extents. They can cause
      a hang when a process is trying to thaw the filesystem.
      
      We should have called sb_start_write(), so the filesystem is not being
      frozen. Add an ASSERT to check it is protected.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0320b353
    • F
      btrfs: avoid blocking on space revervation when doing nowait dio writes · d4135134
      Filipe Manana 提交于
      When doing a NOWAIT direct IO write, if we can NOCOW then it means we can
      proceed with the non-blocking, NOWAIT path. However reserving the metadata
      space and qgroup meta space can often result in blocking - flushing
      delalloc, wait for ordered extents to complete, trigger transaction
      commits, etc, going against the semantics of a NOWAIT write.
      
      So make the NOWAIT write path to try to reserve all the metadata it needs
      without resulting in a blocking behaviour - if we get -ENOSPC or -EDQUOT
      then return -EAGAIN to make the caller fallback to a blocking direct IO
      write.
      
      This is part of a patchset comprised of the following patches:
      
        btrfs: avoid blocking on page locks with nowait dio on compressed range
        btrfs: avoid blocking nowait dio when locking file range
        btrfs: avoid double nocow check when doing nowait dio writes
        btrfs: stop allocating a path when checking if cross reference exists
        btrfs: free path at can_nocow_extent() before checking for checksum items
        btrfs: release path earlier at can_nocow_extent()
        btrfs: avoid blocking when allocating context for nowait dio read/write
        btrfs: avoid blocking on space revervation when doing nowait dio writes
      
      The following test was run before and after applying this patchset:
      
        $ cat io-uring-nodatacow-test.sh
        #!/bin/bash
      
        DEV=/dev/sdc
        MNT=/mnt/sdc
      
        MOUNT_OPTIONS="-o ssd -o nodatacow"
        MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
        NUM_JOBS=4
        FILE_SIZE=8G
        RUN_TIME=300
      
        cat <<EOF > /tmp/fio-job.ini
        [io_uring_rw]
        rw=randrw
        fsync=0
        fallocate=posix
        group_reporting=1
        direct=1
        ioengine=io_uring
        iodepth=64
        bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
        filesize=$FILE_SIZE
        runtime=$RUN_TIME
        time_based
        filename=foobar
        directory=$MNT
        numjobs=$NUM_JOBS
        thread
        EOF
      
        echo performance | \
           tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        fio /tmp/fio-job.ini
      
        umount $MNT
      
      The test was run a 12 cores box with 64G of ram, using a non-debug kernel
      config (Debian's default config) and a spinning disk.
      
      Result before the patchset:
      
       READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
      WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
      
      Result after the patchset:
      
       READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec
      WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec
      
      That's about +7.2% throughput for reads and +6.9% for writes.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d4135134
  6. 10 5月, 2022 1 次提交
  7. 09 5月, 2022 1 次提交
  8. 14 3月, 2022 3 次提交
  9. 02 3月, 2022 1 次提交
    • J
      btrfs: do not start relocation until in progress drops are done · b4be6aef
      Josef Bacik 提交于
      We hit a bug with a recovering relocation on mount for one of our file
      systems in production.  I reproduced this locally by injecting errors
      into snapshot delete with balance running at the same time.  This
      presented as an error while looking up an extent item
      
        WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680
        CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8
        RIP: 0010:lookup_inline_extent_backref+0x647/0x680
        RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202
        RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
        RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001
        R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000
        R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000
        FS:  0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0
        Call Trace:
         <TASK>
         insert_inline_extent_backref+0x46/0xd0
         __btrfs_inc_extent_ref.isra.0+0x5f/0x200
         ? btrfs_merge_delayed_refs+0x164/0x190
         __btrfs_run_delayed_refs+0x561/0xfa0
         ? btrfs_search_slot+0x7b4/0xb30
         ? btrfs_update_root+0x1a9/0x2c0
         btrfs_run_delayed_refs+0x73/0x1f0
         ? btrfs_update_root+0x1a9/0x2c0
         btrfs_commit_transaction+0x50/0xa50
         ? btrfs_update_reloc_root+0x122/0x220
         prepare_to_merge+0x29f/0x320
         relocate_block_group+0x2b8/0x550
         btrfs_relocate_block_group+0x1a6/0x350
         btrfs_relocate_chunk+0x27/0xe0
         btrfs_balance+0x777/0xe60
         balance_kthread+0x35/0x50
         ? btrfs_balance+0xe60/0xe60
         kthread+0x16b/0x190
         ? set_kthread_struct+0x40/0x40
         ret_from_fork+0x22/0x30
         </TASK>
      
      Normally snapshot deletion and relocation are excluded from running at
      the same time by the fs_info->cleaner_mutex.  However if we had a
      pending balance waiting to get the ->cleaner_mutex, and a snapshot
      deletion was running, and then the box crashed, we would come up in a
      state where we have a half deleted snapshot.
      
      Again, in the normal case the snapshot deletion needs to complete before
      relocation can start, but in this case relocation could very well start
      before the snapshot deletion completes, as we simply add the root to the
      dead roots list and wait for the next time the cleaner runs to clean up
      the snapshot.
      
      Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that
      had a pending drop_progress key.  If they do then we know we were in the
      middle of the drop operation and set a flag on the fs_info.  Then
      balance can wait until this flag is cleared to start up again.
      
      If there are DEAD_ROOT's that don't have a drop_progress set then we're
      safe to start balance right away as we'll be properly protected by the
      cleaner_mutex.
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b4be6aef
  10. 07 1月, 2022 2 次提交
    • J
      btrfs: add an inode-item.h · 26c2c454
      Josef Bacik 提交于
      We have a few helpers in inode-item.c, and I'm going to make a few
      changes to how we do truncate in the future, so break out these
      definitions into their own header file to trim down ctree.h some and
      make it easier to do the work on truncate in the future.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      26c2c454
    • F
      btrfs: make send work with concurrent block group relocation · d96b3424
      Filipe Manana 提交于
      We don't allow send and balance/relocation to run in parallel in order
      to prevent send failing or silently producing some bad stream. This is
      because while send is using an extent (specially metadata) or about to
      read a metadata extent and expecting it belongs to a specific parent
      node, relocation can run, the transaction used for the relocation is
      committed and the extent gets reallocated while send is still using the
      extent, so it ends up with a different content than expected. This can
      result in just failing to read a metadata extent due to failure of the
      validation checks (parent transid, level, etc), failure to find a
      backreference for a data extent, and other unexpected failures. Besides
      reallocation, there's also a similar problem of an extent getting
      discarded when it's unpinned after the transaction used for block group
      relocation is committed.
      
      The restriction between balance and send was added in commit 9e967495
      ("Btrfs: prevent send failures and crashes due to concurrent relocation"),
      kernel 5.3, while the more general restriction between send and relocation
      was added in commit 1cea5cf0 ("btrfs: ensure relocation never runs
      while we have send operations running"), kernel 5.14.
      
      Both send and relocation can be very long running operations. Relocation
      because it has to do a lot of IO and expensive backreference lookups in
      case there are many snapshots, and send due to read IO when operating on
      very large trees. This makes it inconvenient for users and tools to deal
      with scheduling both operations.
      
      For zoned filesystem we also have automatic block group relocation, so
      send can fail with -EAGAIN when users least expect it or send can end up
      delaying the block group relocation for too long. In the future we might
      also get the automatic block group relocation for non zoned filesystems.
      
      This change makes it possible for send and relocation to run in parallel.
      This is achieved the following way:
      
      1) For all tree searches, send acquires a read lock on the commit root
         semaphore;
      
      2) After each tree search, and before releasing the commit root semaphore,
         the leaf is cloned and placed in the search path (struct btrfs_path);
      
      3) After releasing the commit root semaphore, the changed_cb() callback
         is invoked, which operates on the leaf and writes commands to the pipe
         (or file in case send/receive is not used with a pipe). It's important
         here to not hold a lock on the commit root semaphore, because if we did
         we could deadlock when sending and receiving to the same filesystem
         using a pipe - the send task blocks on the pipe because it's full, the
         receive task, which is the only consumer of the pipe, triggers a
         transaction commit when attempting to create a subvolume or reserve
         space for a write operation for example, but the transaction commit
         blocks trying to write lock the commit root semaphore, resulting in a
         deadlock;
      
      4) Before moving to the next key, or advancing to the next change in case
         of an incremental send, check if a transaction used for relocation was
         committed (or is about to finish its commit). If so, release the search
         path(s) and restart the search, to where we were before, so that we
         don't operate on stale extent buffers. The search restarts are always
         possible because both the send and parent roots are RO, and no one can
         add, remove of update keys (change their offset) in RO trees - the
         only exception is deduplication, but that is still not allowed to run
         in parallel with send;
      
      5) Periodically check if there is contention on the commit root semaphore,
         which means there is a transaction commit trying to write lock it, and
         release the semaphore and reschedule if there is contention, so as to
         avoid causing any significant delays to transaction commits.
      
      This leaves some room for optimizations for send to have less path
      releases and re searching the trees when there's relocation running, but
      for now it's kept simple as it performs quite well (on very large trees
      with resulting send streams in the order of a few hundred gigabytes).
      
      Test case btrfs/187, from fstests, stresses relocation, send and
      deduplication attempting to run in parallel, but without verifying if send
      succeeds and if it produces correct streams. A new test case will be added
      that exercises relocation happening in parallel with send and then checks
      that send succeeds and the resulting streams are correct.
      
      A final note is that for now this still leaves the mutual exclusion
      between send operations and deduplication on files belonging to a root
      used by send operations. A solution for that will be slightly more complex
      but it will eventually be built on top of this change.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d96b3424
  11. 03 1月, 2022 4 次提交
  12. 27 10月, 2021 6 次提交
    • F
      btrfs: fix deadlock between chunk allocation and chunk btree modifications · 2bb2e00e
      Filipe Manana 提交于
      When a task is doing some modification to the chunk btree and it is not in
      the context of a chunk allocation or a chunk removal, it can deadlock with
      another task that is currently allocating a new data or metadata chunk.
      
      These contexts are the following:
      
      * When relocating a system chunk, when we need to COW the extent buffers
        that belong to the chunk btree;
      
      * When adding a new device (ioctl), where we need to add a new device item
        to the chunk btree;
      
      * When removing a device (ioctl), where we need to remove a device item
        from the chunk btree;
      
      * When resizing a device (ioctl), where we need to update a device item in
        the chunk btree and may need to relocate a system chunk that lies beyond
        the new device size when shrinking a device.
      
      The problem happens due to a sequence of steps like the following:
      
      1) Task A starts a data or metadata chunk allocation and it locks the
         chunk mutex;
      
      2) Task B is relocating a system chunk, and when it needs to COW an extent
         buffer of the chunk btree, it has locked both that extent buffer as
         well as its parent extent buffer;
      
      3) Since there is not enough available system space, either because none
         of the existing system block groups have enough free space or because
         the only one with enough free space is in RO mode due to the relocation,
         task B triggers a new system chunk allocation. It blocks when trying to
         acquire the chunk mutex, currently held by task A;
      
      4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
         the new chunk item into the chunk btree and update the existing device
         items there. But in order to do that, it has to lock the extent buffer
         that task B locked at step 2, or its parent extent buffer, but task B
         is waiting on the chunk mutex, which is currently locked by task A,
         therefore resulting in a deadlock.
      
      One example report when the deadlock happens with system chunk relocation:
      
        INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
        Workqueue: events_unbound btrfs_async_reclaim_metadata_space
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
         __down_read_common kernel/locking/rwsem.c:1214 [inline]
         __down_read kernel/locking/rwsem.c:1223 [inline]
         down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
         __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
         btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
         btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
         btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
         btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
         btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
         btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
         do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
         btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
         flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
         btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
         process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
         worker_thread+0x90/0xed0 kernel/workqueue.c:2444
         kthread+0x3e5/0x4d0 kernel/kthread.c:319
         ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
        INFO: task syz-executor:9107 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
         __mutex_lock_common kernel/locking/mutex.c:669 [inline]
         __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
         btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
         find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
         find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
         btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
         btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
         __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
         btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
         btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
         relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
         relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
         relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
         btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
         btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
         __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
         btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
         btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
         btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:874 [inline]
         __se_sys_ioctl fs/ioctl.c:860 [inline]
         __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      So fix this by making sure that whenever we try to modify the chunk btree
      and we are neither in a chunk allocation context nor in a chunk remove
      context, we reserve system space before modifying the chunk btree.
      Reported-by: NHao Sun <sunhao.th@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
      Fixes: 79bd3712 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
      CC: stable@vger.kernel.org # 5.14+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      2bb2e00e
    • N
      btrfs: pull up qgroup checks from delayed-ref core to init time · 681145d4
      Nikolay Borisov 提交于
      Instead of checking whether qgroup processing for a dealyed ref has to
      happen in the core of delayed ref, simply pull the check at init time of
      respective delayed ref structures. This eliminates the final use of
      real_root in delayed-ref core paving the way to making this member
      optional.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      681145d4
    • N
      btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_ref · f42c5da6
      Nikolay Borisov 提交于
      In order to make 'real_root' used only in ref-verify it's required to
      have the necessary context to perform the same checks that this member
      is used for. So add 'mod_root' which will contain the root on behalf of
      which a delayed ref was created and a 'skip_group' parameter which
      will contain callsite-specific override of skip_qgroup.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f42c5da6
    • J
      btrfs: rename setup_extent_mapping in relocation code · 4b01c44f
      Johannes Thumshirn 提交于
      In btrfs code we have two functions called setup_extent_mapping, one in
      the extent_map code and one in the relocation code. While both are
      private to their respective implementation, this can still be confusing
      for the reader.
      
      So rename the version in relocation.c to setup_relocation_extent_mapping.
      No functional changes.
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4b01c44f
    • J
      btrfs: zoned: allow preallocation for relocation inodes · 960a3166
      Johannes Thumshirn 提交于
      Now that we use a dedicated block group and regular writes for data
      relocation, we can preallocate the space needed for a relocated inode,
      just like we do in regular mode.
      
      Essentially this reverts commit 32430c61 ("btrfs: zoned: enable
      relocation on a zoned filesystem") as it is not needed anymore.
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      960a3166
    • J
      btrfs: introduce btrfs_is_data_reloc_root · 37f00a6d
      Johannes Thumshirn 提交于
      There are several places in our codebase where we check if a root is the
      root of the data reloc tree and subsequent patches will introduce more.
      
      Factor out the check into a small helper function instead of open coding
      it multiple times.
      Reviewed-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      37f00a6d