1. 07 Nov 2022, 1 commit
  2. 29 Sep 2022, 1 commit
  3. 26 Sep 2022, 1 commit
  4. 17 Aug 2022, 1 commit
    • btrfs: fix lockdep splat with reloc root extent buffers · b40130b2
      Authored by Josef Bacik
      We have been hitting the following lockdep splat with btrfs/187 recently
      
        WARNING: possible circular locking dependency detected
        5.19.0-rc8+ #775 Not tainted
        ------------------------------------------------------
        btrfs/752500 is trying to acquire lock:
        ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        but task is already holding lock:
        ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_init_new_buffer+0x7d/0x2c0
      	 btrfs_alloc_tree_block+0x120/0x3b0
      	 __btrfs_cow_block+0x136/0x600
      	 btrfs_cow_block+0x10b/0x230
      	 btrfs_search_slot+0x53b/0xb70
      	 btrfs_lookup_inode+0x2a/0xa0
      	 __btrfs_update_delayed_inode+0x5f/0x280
      	 btrfs_async_run_delayed_root+0x24c/0x290
      	 btrfs_work_helper+0xf2/0x3e0
      	 process_one_work+0x271/0x590
      	 worker_thread+0x52/0x3b0
      	 kthread+0xf0/0x120
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (btrfs-tree-01){++++}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_search_slot+0x3c3/0xb70
      	 do_relocation+0x10c/0x6b0
      	 relocate_tree_blocks+0x317/0x6d0
      	 relocate_block_group+0x1f1/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
      	 __lock_acquire+0x1122/0x1e10
      	 lock_acquire+0xc2/0x2d0
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_lock_root_node+0x31/0x50
      	 btrfs_search_slot+0x1cb/0xb70
      	 replace_path+0x541/0x9f0
      	 merge_reloc_root+0x1d6/0x610
      	 merge_reloc_roots+0xe2/0x260
      	 relocate_block_group+0x2c8/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        other info that might help us debug this:
      
        Chain exists of:
          btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-tree-01/1);
      				 lock(btrfs-tree-01);
      				 lock(btrfs-tree-01/1);
          lock(btrfs-treloc-02#2);
      
         *** DEADLOCK ***
      
        7 locks held by btrfs/752500:
         #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
         #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
         #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
         #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
         #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        stack backtrace:
        CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
      
         dump_stack_lvl+0x56/0x73
         check_noncircular+0xd6/0x100
         ? lock_is_held_type+0xe2/0x140
         __lock_acquire+0x1122/0x1e10
         lock_acquire+0xc2/0x2d0
         ? __btrfs_tree_lock+0x24/0x110
         down_write_nested+0x41/0x80
         ? __btrfs_tree_lock+0x24/0x110
         __btrfs_tree_lock+0x24/0x110
         btrfs_lock_root_node+0x31/0x50
         btrfs_search_slot+0x1cb/0xb70
         ? lock_release+0x137/0x2d0
         ? _raw_spin_unlock+0x29/0x50
         ? release_extent_buffer+0x128/0x180
         replace_path+0x541/0x9f0
         merge_reloc_root+0x1d6/0x610
         merge_reloc_roots+0xe2/0x260
         relocate_block_group+0x2c8/0x560
         btrfs_relocate_block_group+0x23e/0x400
         btrfs_relocate_chunk+0x4c/0x140
         btrfs_balance+0x755/0xe40
         btrfs_ioctl+0x1ea2/0x2c90
         ? lock_is_held_type+0xe2/0x140
         ? lock_is_held_type+0xe2/0x140
         ? __x64_sys_ioctl+0x88/0xc0
         __x64_sys_ioctl+0x88/0xc0
         do_syscall_64+0x38/0x90
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      This isn't necessarily new, it's just tricky to hit in practice.  There
      are two competing things going on here.  With relocation we create a
      snapshot of every fs tree with a reloc tree.  Any extent buffers that
      get initialized here are initialized with the reloc root lockdep key.
      However since it is a snapshot, any blocks that are currently in cache
      that originally belonged to the fs tree will have the normal tree
      lockdep key set.  This creates the lock dependency of
      
        reloc tree -> normal tree
      
      for the extent buffer locking during the first phase of the relocation
      as we walk down the reloc root to relocate blocks.
      
      However this is problematic because the final phase of the relocation is
      merging the reloc root into the original fs root.  This involves
      searching down to any keys that exist in the original fs root and then
      swapping the relocated block and the original fs root block.  We have to
      search down to the fs root first, and then go search the reloc root for
      the block we need to replace.  This creates the dependency of
      
        normal tree -> reloc tree
      
      which is why lockdep complains.
      
      Additionally, even if we were to fix this particular mismatch with a
      different nesting for the merge case, we're still slotting in a block
      that has an owner of the reloc root objectid into a normal tree, so
      that block will have its lockdep key set to the tree reloc root, and
      create a lockdep splat later on when we wander into that block from
      the fs root.
      
      Unfortunately the only solution here is to make sure we do not set
      the lockdep key to the reloc tree lockdep key normally, and then
      reset any blocks we wander into from the reloc root when we're doing
      the merge.
      
      This solves the problem of having mixed tree reloc keys intermixed with
      normal tree keys, and then allows us to make sure in the merge case we
      maintain the lock order of
      
        normal tree -> reloc tree
      
      We handle this by setting a bit on the reloc root when we do the search
      for the block we want to relocate, and any block we search into or COW
      at that point gets set to the reloc tree key.  This works correctly
      because we only ever COW down to the parent node, so we aren't resetting
      the key for the block we're linking into the fs root.
      
      With this patch we no longer have the lockdep splat in btrfs/187.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 16 May 2022, 7 commits
    • btrfs: sink parameter is_data to btrfs_set_disk_extent_flags · 2fe6a5a1
      Authored by David Sterba
      The parameter was added in 2009 in the infamous monster commit
      5d4f98a2 ("Btrfs: Mixed back reference (FORWARD ROLLING FORMAT
      CHANGE)") but has not been used since. We can sink it and allow
      further simplifications.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: tree-checker: check extent buffer owner against owner rootid · 88c602ab
      Authored by Qu Wenruo
      Btrfs doesn't check whether the tree block respects the root owner.
      This means, if a tree block is referred to by a parent in the extent
      tree but has an owner of 5, btrfs can still continue reading the
      tree block, as long as it doesn't trigger other sanity checks.
      
      Normally this is fine, but combined with the empty tree check in
      check_leaf(), if we hit an empty extent tree, but the root node has
      the csum tree owner, such an extent buffer can sneak in.
      
      Shrink the hole by:
      
      - Do extra eb owner check at tree read time
      
      - Make sure the owner of the root extent buffer exactly matches the
        root id.
      
      Unfortunately we can't yet completely patch the hole, as there are
      several call sites that can't pass all the info we need:
      
      - For reloc/log trees
        Their owner is key::offset, not key::objectid.
        We need the full root key to do that accurate check.
      
        For now, we just skip the ownership check for those trees.
      
      - For add_data_references() of relocation
        That call site doesn't have any parent/ownership info, as all the
        bytenrs are all from btrfs_find_all_leafs().
      
      - For direct backref items walk
        Direct backref items records the parent bytenr directly, thus unlike
        indirect backref item, we don't do a full tree search.
      
        Thus in that case, we don't have full parent owner to check.
      
      The latter two cases both pass 0 as @owner_root, thus we can skip
      the ownership check when @owner_root is 0.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce btrfs_for_each_slot iterator macro · 62142be3
      Authored by Gabriel Niebler
      There is a common pattern when searching for a key in btrfs:
      
      * Call btrfs_search_slot to find the slot for the key
      * Enter an endless loop:
        * If the found slot is larger than the no. of items in the current
          leaf, check the next leaf
        * If it's still not found in the next leaf, terminate the loop
        * Otherwise do something with the found key
        * Increment the current slot and continue
      
      To reduce code duplication, we can replace this code pattern with an
      iterator macro, similar to the existing for_each_X macros found
      elsewhere in the kernel.  This also makes the code easier to understand
      for newcomers by putting a name to the encapsulated functionality.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove trivial wrapper btrfs_read_buffer() · 6a2e9dc4
      Authored by Filipe Manana
      The function btrfs_read_buffer() is useless, it just calls
      btree_read_extent_buffer_pages() with exactly the same arguments.
      
      So remove it and rename btree_read_extent_buffer_pages() to
      btrfs_read_extent_buffer(), which is a shorter name, has the "btrfs_"
      prefix (since it's used outside disk-io.c) and the name is clear enough
      about what it does.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: update outdated comment for read_block_for_search() · 376a21d7
      Authored by Filipe Manana
      The comment at the top of read_block_for_search() is very outdated, as it
      refers to the blocking versus spinning path locking modes. We no longer
      have these two locking modes after we switched the btree locks from custom
      code to rw semaphores. So update the comment to stop referring to the
      blocking mode and put it more up to date.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: release upper nodes when reading stale btree node from disk · b246666e
      Authored by Filipe Manana
      When reading a btree node (or leaf), at read_block_for_search(), if we
      can't find its extent buffer in the cache (the fs_info->buffer_radix
      radix tree), then we unlock all upper level nodes before reading the
      btree node/leaf from disk, to prevent blocking other tasks for too long.
      
      However if we find that the extent buffer is in the cache but it is not
      up to date, we don't unlock upper level nodes before reading it from disk,
      potentially blocking other tasks on upper level nodes for too long.
      
      Fix this inconsistent behaviour by unlocking upper level nodes if we need
      to read a node/leaf from disk because its in-memory extent buffer is not
      up to date. If we unlocked upper level nodes then we must return -EAGAIN
      to the caller, just like the case where the extent buffer is not cached in
      memory. And like that case, we determine if upper level nodes are locked
      by checking only if the parent node is locked - if it isn't, then no other
      upper level nodes are locked.
      
      This is actually a rare case, as if we have an extent buffer in memory,
      it typically has the uptodate flag set and passes all the checks done by
      btrfs_buffer_uptodate().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid unnecessary btree search restarts when reading node · 4bb59055
      Authored by Filipe Manana
      When reading a btree node, at read_block_for_search(), if we don't find
      the node's (or leaf) extent buffer in the cache, we will read it from
      disk. Since that requires waiting on IO, we release all upper level nodes
      from our path before reading the target node/leaf, and then return -EAGAIN
      to the caller, which will make the caller restart the whole btree search.
      
      However we are causing the restart of btree search even for cases where
      it is not necessary:
      
      1) We have a path with ->skip_locking set to true, typically when doing
         a search on a commit root, so we are never holding locks on any node;
      
      2) We are doing a read search (the "ins_len" argument passed to
         btrfs_search_slot() is 0), or we are doing a search to modify an
         existing key (the "cow" argument passed to btrfs_search_slot() has
         a value of 1 and "ins_len" is 0), in which case we never hold locks
         for upper level nodes;
      
      3) We are doing a search to insert or delete a key, in which case we may
         or may not have upper level nodes locked. That depends on the current
         minimum write lock levels at btrfs_search_slot(), if we had to split
         or merge parent nodes, if we had to COW upper level nodes and if
         we ever visited slot 0 of an upper level node. It's still common to
         not have upper level nodes locked, but our current node must be at
         least at level 1, for insertions, or at least at level 2 for deletions.
         In these cases when we have locks on upper level nodes, they are always
         write locks.
      
      These cases where we are not holding locks on upper level nodes far
      outweigh the cases where we are holding locks, so it's completely wasteful
      to retry the whole search when we have no upper nodes locked.
      
      So change the logic to not return -EAGAIN, and make the caller retry the
      search, when we don't have the parent node locked - when it's not locked
      it means no other upper level nodes are locked as well.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 14 Mar 2022, 5 commits
    • btrfs: unify the error handling of btrfs_read_buffer() · 9a4ffa1b
      Authored by Qu Wenruo
      There is one oddball error handling of btrfs_read_buffer():
      
      	ret = btrfs_read_buffer(tmp, gen, parent_level - 1, &first_key);
      	if (!ret) {
      		*eb_ret = tmp;
      		return 0;
      	}
      	free_extent_buffer(tmp);
      	btrfs_release_path(p);
      	return -EIO;
      
      While all other call sites check the error first.  Unify the behavior.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: unify the error handling pattern for read_tree_block() · 4eb150d6
      Authored by Qu Wenruo
      We had an error handling pattern for read_tree_block() like this:
      
      	eb = read_tree_block();
      	if (IS_ERR(eb)) {
      		/*
      		 * Handling error here
      		 * Normally ended up with return or goto out.
      		 */
      	} else if (!extent_buffer_uptodate(eb)) {
      		/*
      		 * Different error handling here
      		 * Normally also ended up with return or goto out;
      		 */
      	}
      
      This is fine, but if we want to add an extra check for each
      read_tree_block(), the existing if-else-if is not that expandable
      and will take the reader a few seconds to figure out there is no
      extra branch.
      
      Here we change it to a more common way, without the extra else:
      
      	eb = read_tree_block();
      	if (IS_ERR(eb)) {
      		/*
      		 * Handling error here
      		 */
      		return eb or goto out;
      	}
      	if (!extent_buffer_uptodate(eb)) {
      		/*
      		 * Different error handling here
      		 */
      		return eb or goto out;
      	}
      
      This also removes some oddball call sites which uses some creative way
      to check error.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid unnecessary computation when deleting items from a leaf · 0cae23b6
      Authored by Filipe Manana
      When deleting items from a leaf, we always compute the sum of the data
      sizes of the items that are going to be deleted. However we only use
      that sum when the last item to delete is behind the last item in the
      leaf. This unnecessarily wastes CPU time when we are deleting either
      the whole leaf or from some slot > 0 up to the last item in the leaf,
      and both of these cases are common (e.g. truncation operation, either
      as a result of truncate(2) or when logging inodes, deleting checksums
      after removing a large enough extent, etc).
      
      So compute only the sum of the data sizes if the last item to be
      deleted does not match the last item in the leaf.
      
      This change is part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid unnecessary COW of leaves when deleting items from a leaf · 7c4063d1
      Authored by Filipe Manana
      When we delete items from a leaf, if we end up with more than two thirds
      of unused leaf space, we try to delete the leaf by moving all its items
      into its left and right neighbour leaves. Sometimes that is not possible
      because there is not enough free space in the left and right leaves, and
      in that case we end up not deleting our leaf.
      
      The way we are doing this is not ideal and can be improved in the
      following ways:
      
      1) When we call push_leaf_left(), we pass a value of 1 byte to the data
         size parameter of push_leaf_left(). This is not a realistic value
         because no item can have a size less than 25 bytes, which is the
         size of struct btrfs_item. This means that if the left leaf does
         not have enough free space to push any item, we end up COWing it
         even if we end up not changing its content at all.
      
         COWing that leaf means allocating a new metadata extent, marking it
         dirty and doing more IO when committing a transaction or when syncing a
         log tree. For a log tree case, it's particularly more important to
         avoid the useless COW operation, as more IO can imply a higher latency
         for an fsync operation.
      
         So instead of passing 1 as the minimum data size for push_leaf_left(),
         pass the size of the first item in our leaf, as we don't want to COW
         the left leaf if we can't at least push the first item of our leaf;
      
      2) When we call push_leaf_right(), we also pass a value of 1 byte as the
         data size parameter of push_leaf_right(). Like the previous case, it
         will also result in COWing the right leaf even if we are not able to
         move any items into it, since there can't be any item with a size
         smaller than 25 bytes (the size of struct btrfs_item).
      
         So instead of passing 1 as the minimum data size to push_leaf_right(),
         pass a size that corresponds to the sum of the size of all the
         remaining items in our leaf. We are not interested in moving less than
         that, because if we do, we are not able to delete our leaf and we have
         COWed the right leaf for nothing. Plus, moving only some of the
         items of our leaf leaves the tree even less balanced.
      
         Just like the previous case, we want to avoid the useless COW of the
         right leaf, this way we don't have to spend time allocating one new
         metadata extent, and doing more IO when committing a transaction or
         syncing a log tree. For the log tree case it's especially important
         because more IO can result in a higher latency for an fsync operation.
      
      So adjust the minimum data size passed to push_leaf_left() and
      push_leaf_right() as mentioned above.
      
      This change is part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      Not being able to delete a leaf that became less than 1/3 full after
      deleting items from it is actually common. For example, for the fio test
      mentioned in the changelog of patch 6/6, we are only able to delete a
      leaf at btrfs_del_items() about 5.3% of the time, due to its left and
      right neighbour leaves not having enough free space to push all the
      remaining items into them.
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unnecessary leaf free space checks when pushing items · b4e098a9
      Authored by Filipe Manana
      When trying to push items from a leaf into its left and right neighbours,
      we lock the left or right leaf, check if it has the required minimum free
      space, COW the leaf and then check again if it has the minimum required
      free space. This second check is pointless:
      
      1) First and foremost, because it's not needed. We have a write lock on the
         leaf and on its parent node, so no one can come in and change either
         the pre-COW or post-COW version of the leaf for the whole duration of
         the push_leaf_left() and push_leaf_right() calls;
      
      2) The call to btrfs_leaf_free_space() is not trivial, it has a fair
         amount of arithmetic operations and access to fields in the leaf's
         header and items, so it's not very cheap.
      
      So remove the duplicated free space checks.
      
      This change is part of a patchset that is comprised of the following
      patches:
      
        1/6 btrfs: remove unnecessary leaf free space checks when pushing items
        2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
        3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
        4/6 btrfs: remove constraint on number of visited leaves when replacing extents
        5/6 btrfs: remove useless path release in the fast fsync path
        6/6 btrfs: prepare extents to be logged before locking a log tree path
      
      The last patch in the series has some performance test result in its
      changelog.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 07 Jan 2022, 9 commits
    • btrfs: refactor unlock_up · c1227996
      Authored by Nikolay Borisov
      The purpose of this function is to unlock all nodes in a btrfs path
      which are above 'lowest_unlock' and whose slot used is different than 0.
      As such it uses a slightly awkward structure of 'if' statements as
      well as a somewhat cryptic "no_skip" control variable, which denotes
      whether we should check the current level for skipability or not.
      
      This patch does the following (cosmetic) refactorings:
      
      * Renames 'no_skip' to 'check_skip' and makes it a boolean. This
        variable controls whether we are below the lowest_unlock/skip_level
        levels.
      
      * Consolidates the 2 conditions which warrant checking whether the
        current level should be skipped under 1 common if (check_skip)
        branch; this increases the indentation level but is not critical.
      
      * Consolidates the 'skip_level < i && i >= lowest_unlock' and
        'i >= lowest_unlock && i > skip_level' condition into a common branch
        since those are identical.
      
      * Eliminates the local extent_buffer variable, as in this case it
        doesn't add anything to the function's readability.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove stale comment about locking at btrfs_search_slot() · 727e6060
      Authored by Filipe Manana
      The comment refers to the old extent buffer locking code, where we used to
      have custom locks that had blocking and spinning behaviour modes. That is
      not the case anymore, since we have transitioned to rw semaphores, so the
      comment does not offer any value anymore. Remove it.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove BUG_ON() after splitting leaf · bb8e9a60
      Authored by Filipe Manana
      After calling split_leaf() we BUG_ON() if the returned value is greater
      than zero. However split_leaf() only returns 0, in case of success, or a
      negative value in case of an error.
      
      The reason for the BUG_ON() is that if we ever get a positive return
      value from split_leaf(), we can not simply propagate it to the callers
      of btrfs_search_slot(), as that would be interpreted as "key not found"
      and not as an error. That means it could result in callers ending up
      causing some potential silent corruption.
      
      So change the BUG_ON() to an ASSERT(), and in case assertions are
      disabled, produce a warning and set the return value to an error, to make
      it not possible to get into a silent corruption and having the error not
      noticed.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move leaf search logic out of btrfs_search_slot() · 109324cf
      Authored by Filipe Manana
      There's quite a significant amount of code for doing the key search for a
      leaf at btrfs_search_slot(), with a couple labels and gotos in it, plus
      btrfs_search_slot() is already big enough.
      
      So move the logic that does the key search on a leaf into a new helper
      function. This makes it better organized, removing the need for the labels
      and the gotos, as well as reducing the indentation level and the size of
      btrfs_search_slot().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove useless condition check before splitting leaf · e5e1c174
      Authored by Filipe Manana
      When inserting a key, we check if the write_lock_level is less than 1,
      and if so we set it to 1, release the path and retry the tree traversal.
      
      However that is unnecessary, because when ins_len is greater than 0, we
      know that write_lock_level can never be less than 1.
      
      The logic to retry is also buggy, because in case ins_len was decremented,
      due to an exact key match and the search is not meant for item extension
      (path->search_for_extension is 0), we retry without incrementing ins_len,
      which would make the next retry decrement it again by the same amount.
      
      So remove the check for write_lock_level being less than 1 and add an
      assertion to assert it's always >= 1.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: try to unlock parent nodes earlier when inserting a key · e2e58d0f
      Authored by Filipe Manana
      When inserting a new key, we release the write lock on the leaf's parent
      only after doing the binary search on the leaf. This is because if the
      key ends up at slot 0, we will have to update the key at slot 0 of the
      parent node. The same reasoning applies to any other upper level nodes
      when their slot is 0. We also need to keep the parent locked in case the
      leaf does not have enough free space to insert the new key/item, because
      in that case we will split the leaf and we will need to add a new key to
      the parent due to a new leaf resulting from the split operation.
      
      However if the leaf has enough space for the new key and the key does not
      end up at slot 0 of the leaf we could release our write lock on the parent
      before doing the binary search on the leaf to figure out the destination
      slot. That leads to reducing the amount of time other tasks are blocked
      waiting to lock the parent, therefore increasing parallelism when there
      are other tasks that are trying to access other leaves accessible through
      the same parent. This also applies to other upper nodes besides the
      immediate parent, when their slot is 0, since we keep locks on them until
      we figure out if the leaf slot is slot 0 or not.
      
      In fact, having the key end up at slot 0 is rare. Typically it
      only happens when the key is less than or equal to the smallest, the
      "left most", key of the entire btree, during a split attempt when we try
      to push to the right sibling leaf or when the caller just wants to update
      the item of an existing key. It's also very common that a leaf has enough
      space to insert a new key, since after a split we move about half of the
      keys from one into the new leaf.
      
      So unlock the parent, and any other upper level nodes, when during a key
      insertion we notice the key is greater than the first key in the leaf and
      the leaf has enough free space. After unlocking the upper level nodes, do
      the binary search using a low boundary of slot 1 and not slot 0, to figure
      out the slot where the key will be inserted (or where the key already is
      in case it exists and the caller wants to modify its item data).
      This extra comparison, with the first key, is cheap and the key is very
      likely already in a cache line because it immediately follows the header
      of the extent buffer and we have recently read the level field of the
      header (which in fact is the last field of the header).
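      The early-unlock condition described above can be sketched in a small
      userspace model (all names here are illustrative, not the kernel's
      actual API): parents may be unlocked before the leaf's binary search
      only when we are inserting, the leaf has enough free space, and the key
      is strictly greater than the leaf's first key.

```c
#include <stdint.h>

/* Minimal sketch, not the kernel implementation: model the early-unlock
 * decision. A parent lock may be released before the leaf's binary search
 * only when the search is an insertion, the leaf has room for the new item,
 * and the key is strictly greater than the leaf's first key (so slot 0, and
 * thus the parent's key pointer, cannot change). */
struct leaf_model {
	uint64_t first_key;
	int free_space;
};

/* Returns 1 when upper-level locks can be dropped early; in that case the
 * later binary search may use a low boundary of slot 1 instead of 0. */
static int can_unlock_parents_early(const struct leaf_model *leaf,
				    uint64_t new_key, int ins_len)
{
	return ins_len > 0 &&
	       leaf->free_space >= ins_len &&
	       new_key > leaf->first_key;
}
```

      When the function returns 0 the parents stay locked, exactly as before
      the change, so correctness does not depend on the fast path.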
      
      The following fs_mark test was run on a non-debug kernel (debian's default
      kernel config), with a 12 cores intel CPU, and using a NVMe device:
      
        $ cat run-fsmark.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-O no-holes -R free-space-tree"
        FILES=100000
        THREADS=$(nproc --all)
        FILE_SIZE=0
      
        echo "performance" | \
      	tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        OPTS="-S 0 -L 10 -n $FILES -s $FILE_SIZE -t $THREADS -k"
        for ((i = 1; i <= $THREADS; i++)); do
            OPTS="$OPTS -d $MNT/d$i"
        done
      
        fs_mark $OPTS
      
        umount $MNT
      
      Before this change:
      
      FSUse%        Count         Size    Files/sec     App Overhead
           0      1200000            0     165273.6          5958381
           0      2400000            0     190938.3          6284477
           0      3600000            0     181429.1          6044059
           0      4800000            0     173979.2          6223418
           0      6000000            0     139288.0          6384560
           0      7200000            0     163000.4          6520083
           1      8400000            0      57799.2          5388544
           1      9600000            0      66461.6          5552969
           2     10800000            0      49593.5          5163675
           2     12000000            0      57672.1          4889398
      
      After this change:
      
      FSUse%        Count         Size    Files/sec            App Overhead
           0      1200000            0     167987.3 (+1.6%)         6272730
           0      2400000            0     198563.9 (+4.0%)         6048847
           0      3600000            0     197436.6 (+8.8%)         6163637
           0      4800000            0     202880.7 (+16.6%)        6371771
           1      6000000            0     167275.9 (+20.1%)        6556733
           1      7200000            0     204051.2 (+25.2%)        6817091
           1      8400000            0      69622.8 (+20.5%)        5525675
           1      9600000            0      69384.5 (+4.4%)         5700723
           1     10800000            0      61454.1 (+23.9%)        5363754
           3     12000000            0      61908.7 (+7.3%)         5370196
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e2e58d0f
    • F
      btrfs: allow generic_bin_search() to take low boundary as an argument · fb81212c
      Committed by Filipe Manana
      Right now generic_bin_search() always uses a low boundary slot of 0, but
      in the next patch we'll want to often skip slot 0 when searching for a
      key. So make generic_bin_search() have the low boundary slot specified
      as an argument, and move the check for the extent buffer level from
      btrfs_bin_search() to generic_bin_search() to avoid adding another
      wrapper around generic_bin_search().
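      As a rough userspace sketch of the idea (assumed names, not the kernel
      function), a binary search that accepts its low boundary as an argument
      looks like this:

```c
#include <stdint.h>

/* Illustrative sketch of a binary search that takes the low boundary as an
 * argument, in the spirit of passing a first slot to generic_bin_search().
 * Searches a sorted array of keys; returns 0 and stores the slot on an exact
 * match, or returns 1 with the slot where the key would be inserted. */
static int bin_search_from(const uint64_t *keys, int nritems, int first_slot,
			   uint64_t key, int *slot)
{
	int low = first_slot;
	int high = nritems;

	while (low < high) {
		int mid = low + (high - low) / 2;

		if (keys[mid] < key)
			low = mid + 1;
		else if (keys[mid] > key)
			high = mid;
		else {
			*slot = mid;
			return 0;
		}
	}
	*slot = low;
	return 1;
}
```

      Passing first_slot == 0 preserves the old behaviour, while first_slot
      == 1 implements the "skip slot 0" search used by the follow-up patch.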
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fb81212c
    • J
      btrfs: check the root node for uptodate before returning it · 120de408
      Committed by Josef Bacik
      Now that we clear the extent buffer uptodate if we fail to write it out
      we need to check to see if our root node is uptodate before we search
      down it.  Otherwise we could return stale data (or potentially corrupt
      data that was caught by the write verification step) and think that the
      path is OK to search down.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      120de408
    • F
      btrfs: make send work with concurrent block group relocation · d96b3424
      Committed by Filipe Manana
      We don't allow send and balance/relocation to run in parallel in order
      to prevent send failing or silently producing a bad stream. This is
      because while send is using an extent (especially a metadata extent),
      or is about to read a metadata extent expecting it to belong to a
      specific parent node, relocation can run, the transaction used for the
      relocation can be committed and the extent reallocated while send is
      still using it, so send ends up with different content than expected. This can
      result in just failing to read a metadata extent due to failure of the
      validation checks (parent transid, level, etc), failure to find a
      backreference for a data extent, and other unexpected failures. Besides
      reallocation, there's also a similar problem of an extent getting
      discarded when it's unpinned after the transaction used for block group
      relocation is committed.
      
      The restriction between balance and send was added in commit 9e967495
      ("Btrfs: prevent send failures and crashes due to concurrent relocation"),
      kernel 5.3, while the more general restriction between send and relocation
      was added in commit 1cea5cf0 ("btrfs: ensure relocation never runs
      while we have send operations running"), kernel 5.14.
      
      Both send and relocation can be very long running operations. Relocation
      because it has to do a lot of IO and expensive backreference lookups in
      case there are many snapshots, and send due to read IO when operating on
      very large trees. This makes it inconvenient for users and tools to deal
      with scheduling both operations.
      
      For zoned filesystems we also have automatic block group relocation, so
      send can fail with -EAGAIN when users least expect it, or send can end up
      delaying the block group relocation for too long. In the future we might
      also get automatic block group relocation for non-zoned filesystems.
      
      This change makes it possible for send and relocation to run in parallel.
      This is achieved the following way:
      
      1) For all tree searches, send acquires a read lock on the commit root
         semaphore;
      
      2) After each tree search, and before releasing the commit root semaphore,
         the leaf is cloned and placed in the search path (struct btrfs_path);
      
      3) After releasing the commit root semaphore, the changed_cb() callback
         is invoked, which operates on the leaf and writes commands to the pipe
         (or file in case send/receive is not used with a pipe). It's important
         here to not hold a lock on the commit root semaphore, because if we did
         we could deadlock when sending and receiving to the same filesystem
         using a pipe - the send task blocks on the pipe because it's full, the
         receive task, which is the only consumer of the pipe, triggers a
         transaction commit when attempting to create a subvolume or reserve
         space for a write operation for example, but the transaction commit
         blocks trying to write lock the commit root semaphore, resulting in a
         deadlock;
      
      4) Before moving to the next key, or advancing to the next change in case
         of an incremental send, check if a transaction used for relocation was
         committed (or is about to finish its commit). If so, release the search
         path(s) and restart the search, to where we were before, so that we
         don't operate on stale extent buffers. The search restarts are always
         possible because both the send and parent roots are RO, and no one can
         add, remove or update keys (change their offset) in RO trees - the
         only exception is deduplication, but that is still not allowed to run
         in parallel with send;
      
      5) Periodically check if there is contention on the commit root semaphore,
         which means there is a transaction commit trying to write lock it, and
         release the semaphore and reschedule if there is contention, so as to
         avoid causing any significant delays to transaction commits.
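      Step 4 above can be modeled with a tiny userspace sketch (names are
      assumptions for illustration, not the send code's actual fields): track
      the transaction generation the cached path was built against, and force
      a path release plus search restart whenever relocation has committed a
      newer one.

```c
#include <stdint.h>

/* Toy model (assumed names, not kernel code) of the restart check in step 4:
 * a send loop records the last transaction generation it saw; if relocation
 * committed a transaction since, the cached path must be dropped and the
 * search restarted from the current key. */
struct send_ctx_model {
	uint64_t last_reloc_gen;  /* generation when the path was built */
	uint64_t resume_key;      /* where to restart the search */
};

/* Returns 1 when the path must be released and the search redone. */
static int must_restart_search(struct send_ctx_model *sctx,
			       uint64_t cur_reloc_gen, uint64_t cur_key)
{
	if (cur_reloc_gen != sctx->last_reloc_gen) {
		sctx->last_reloc_gen = cur_reloc_gen;
		sctx->resume_key = cur_key;
		return 1;
	}
	return 0;
}
```

      The restart is always safe precisely because the send and parent roots
      are RO, so the key being resumed from cannot have moved or changed.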
      
      This leaves some room for optimization, so that send does fewer path
      releases and less re-searching of the trees when relocation is running, but
      for now it's kept simple as it performs quite well (on very large trees
      with resulting send streams in the order of a few hundred gigabytes).
      
      Test case btrfs/187, from fstests, stresses relocation, send and
      deduplication attempting to run in parallel, but without verifying if send
      succeeds and if it produces correct streams. A new test case will be added
      that exercises relocation happening in parallel with send and then checks
      that send succeeds and the resulting streams are correct.
      
      A final note is that for now this still leaves the mutual exclusion
      between send operations and deduplication on files belonging to a root
      used by send operations. A solution for that will be slightly more complex
      but it will eventually be built on top of this change.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d96b3424
  8. 03 Jan 2022, 5 commits
  9. 16 Dec 2021, 1 commit
    • F
      btrfs: fix invalid delayed ref after subvolume creation failure · 7a163608
      Committed by Filipe Manana
      When creating a subvolume, at ioctl.c:create_subvol(), if we fail to
      insert the new root's root item into the root tree, we are freeing the
      metadata extent we reserved for the new root to prevent a metadata
      extent leak, as we don't abort the transaction at that point (since
      there is nothing at that point that is irreversible).
      
      However we allocated the metadata extent for the new root which we are
      creating for the new subvolume, so its delayed reference refers to the
      ID of this new root. But when we free the metadata extent we pass the
      root of the subvolume where the new subvolume is located to
      btrfs_free_tree_block() - this is incorrect because this will generate
      a delayed reference that refers to the ID of the parent subvolume's root,
      and not to ID of the new root.
      
      This results in a failure when running delayed references that leads to
      a transaction abort and a trace like the following:
      
      [3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs]
      [3868.739857] Code: 68 0f 85 e6 fb ff (...)
      [3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246
      [3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002
      [3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88
      [3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88
      [3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000
      [3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88
      [3868.750725] FS:  00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000
      [3868.752275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0
      [3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [3868.757803] Call Trace:
      [3868.758281]  <TASK>
      [3868.758655]  ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs]
      [3868.759827]  __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs]
      [3868.761047]  btrfs_run_delayed_refs+0x86/0x210 [btrfs]
      [3868.762069]  ? lock_acquired+0x19f/0x420
      [3868.762829]  btrfs_commit_transaction+0x69/0xb20 [btrfs]
      [3868.763860]  ? _raw_spin_unlock+0x29/0x40
      [3868.764614]  ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs]
      [3868.765870]  create_subvol+0x1d8/0x9a0 [btrfs]
      [3868.766766]  btrfs_mksubvol+0x447/0x4c0 [btrfs]
      [3868.767669]  ? preempt_count_add+0x49/0xa0
      [3868.768444]  __btrfs_ioctl_snap_create+0x123/0x190 [btrfs]
      [3868.769639]  ? _copy_from_user+0x66/0xa0
      [3868.770391]  btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
      [3868.771495]  btrfs_ioctl+0xd1e/0x35c0 [btrfs]
      [3868.772364]  ? __slab_free+0x10a/0x360
      [3868.773198]  ? rcu_read_lock_sched_held+0x12/0x60
      [3868.774121]  ? lock_release+0x223/0x4a0
      [3868.774863]  ? lock_acquired+0x19f/0x420
      [3868.775634]  ? rcu_read_lock_sched_held+0x12/0x60
      [3868.776530]  ? trace_hardirqs_on+0x1b/0xe0
      [3868.777373]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
      [3868.778280]  ? kmem_cache_free+0x321/0x3c0
      [3868.779011]  ? __x64_sys_ioctl+0x83/0xb0
      [3868.779718]  __x64_sys_ioctl+0x83/0xb0
      [3868.780387]  do_syscall_64+0x3b/0xc0
      [3868.781059]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [3868.781953] RIP: 0033:0x7f281c59e957
      [3868.782585] Code: 3c 1c 48 f7 d8 4c (...)
      [3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957
      [3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003
      [3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36
      [3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
      [3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc
      [3868.793765]  </TASK>
      [3868.794037] irq event stamp: 0
      [3868.794548] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
      [3868.797086] softirqs last  enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
      [3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [3868.799284] ---[ end trace be24c7002fe27747 ]---
      [3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2
      [3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627
      [3868.802056]  item 0 key (237436928 169 0) itemoff 16250 itemsize 33
      [3868.802863]          extent refs 1 gen 1265 flags 2
      [3868.803447]          ref#0: tree block backref root 1610
      (...)
      [3869.064354]  item 114 key (241008640 169 0) itemoff 12488 itemsize 33
      [3869.065421]          extent refs 1 gen 1268 flags 2
      [3869.066115]          ref#0: tree block backref root 1689
      (...)
      [3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622  owner 0 offset 0
      [3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry
      [3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry
      
      Fix this by passing the new subvolume's root ID to btrfs_free_tree_block().
      This requires changing the root argument of btrfs_free_tree_block() from
      struct btrfs_root * to a u64, since at this point during the subvolume
      creation we have not yet created the struct btrfs_root for the new
      subvolume, and btrfs_free_tree_block() only needs a root ID and nothing
      else from a struct btrfs_root.
      
      This was triggered by test case generic/475 from fstests.
      
      Fixes: 67addf29 ("btrfs: fix metadata extent leak after failure to create subvolume")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7a163608
  10. 27 Oct 2021, 3 commits
    • F
      btrfs: unexport setup_items_for_insert() · f0641656
      Committed by Filipe Manana
      Since setup_items_for_insert() is not used anymore outside of ctree.c,
      make it static and remove its prototype from ctree.h. This also requires
      moving the definition of setup_item_for_insert() from ctree.h to ctree.c
      and move down btrfs_duplicate_item() so that it's defined after
      setup_items_for_insert().
      
      Further, since setup_item_for_insert() is used outside ctree.c, rename it
      to btrfs_setup_item_for_insert().
      
      This patch is part of a small patchset that is comprised of the following
      patches:
      
        btrfs: loop only once over data sizes array when inserting an item batch
        btrfs: unexport setup_items_for_insert()
        btrfs: use single bulk copy operations when logging directories
      
      This is patch 2/3 and performance results, and the specific tests, are
      included in the changelog of patch 3/3.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f0641656
    • F
      btrfs: loop only once over data sizes array when inserting an item batch · b7ef5f3a
      Committed by Filipe Manana
      When inserting a batch of items into a btree, we end up looping over the
      data sizes array 3 times:
      
      1) Once in the caller of btrfs_insert_empty_items(), when it populates the
         array with the data sizes for each item;
      
      2) Once at btrfs_insert_empty_items() to sum the elements of the data
         sizes array and compute the total data size;
      
      3) And then once again at setup_items_for_insert(), where we do exactly
         the same as what we do at btrfs_insert_empty_items(), to compute the
         total data size.
      
      That is not bad for small arrays, but when the arrays have hundreds of
      elements, the time spent on looping is not negligible. For example when
      doing batch inserts of delayed items for dir index items or when logging
      a directory, it's common to have 200 to 260 dir index items in a single
      batch when using a leaf size of 16K and using file names between 8 and 12
      characters. For a 64K leaf size, multiply that by 4. Taking into account
      that during directory logging or when flushing delayed dir index items we
      can have many of those large batches, the time spent on the looping adds
      up quickly.
      
      It's also more important to avoid it at setup_items_for_insert(), since
      we are holding a write lock on a leaf and, in some cases, on upper nodes
      of the btree, which causes us to block other tasks that want to access
      the leaf and nodes for longer than necessary.
      
      So change the code so that setup_items_for_insert() and
      btrfs_insert_empty_items() no longer compute the total data size, and
      instead rely on the caller to supply it. This makes us loop over the
      array only once, where we can both populate the data size array and
      compute the total data size, taking advantage of spatial and temporal
      locality. To make this more manageable, use a structure to contain
      all the relevant details for a batch of items (keys array, data sizes
      array, total data size, number of items), and use it as an argument
      for btrfs_insert_empty_items() and setup_items_for_insert().
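      A minimal userspace sketch of that batching idea (field and function
      names are illustrative, not the exact kernel structure): the total data
      size is accumulated while the arrays are populated, so no later pass
      has to sum the sizes again.

```c
#include <stdint.h>

#define MAX_BATCH 256

/* Sketch of the batching idea: one structure carries the keys, per-item data
 * sizes, the item count and a running total, all filled in a single pass. */
struct item_batch {
	uint64_t keys[MAX_BATCH];
	uint32_t data_sizes[MAX_BATCH];
	int nr;
	uint32_t total_data_size;
};

static void batch_add_item(struct item_batch *batch, uint64_t key,
			   uint32_t data_size)
{
	batch->keys[batch->nr] = key;
	batch->data_sizes[batch->nr] = data_size;
	batch->nr++;
	batch->total_data_size += data_size;  /* single pass: sum as we fill */
}
```

      The insertion helpers can then read total_data_size directly instead of
      looping over data_sizes a second and third time.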
      
      This patch is part of a small patchset that is comprised of the following
      patches:
      
        btrfs: loop only once over data sizes array when inserting an item batch
        btrfs: unexport setup_items_for_insert()
        btrfs: use single bulk copy operations when logging directories
      
      This is patch 1/3 and performance results, and the specific tests, are
      included in the changelog of patch 3/3.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b7ef5f3a
    • F
      btrfs: assert that extent buffers are write locked instead of only locked · 49d0c642
      Committed by Filipe Manana
      We currently use lockdep_assert_held() at btrfs_assert_tree_locked(), and
      that checks that we hold a lock either in read mode or write mode.
      
      However in all contexts we use btrfs_assert_tree_locked(), we actually
      want to check if we are holding a write lock on the extent buffer's rw
      semaphore - it would be a bug if in any of those contexts we were holding
      a read lock instead.
      
      So change btrfs_assert_tree_locked() to use lockdep_assert_held_write()
      instead and, to make it more explicit, rename btrfs_assert_tree_locked()
      to btrfs_assert_tree_write_locked(), so that it's clear we want to check
      we are holding a write lock.
      
      For now there are no contexts where we want to assert that we must have
      a read lock, but in case that is needed in the future, we can add a new
      helper function that just calls out lockdep_assert_held_read().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      49d0c642
  11. 18 Oct 2021, 1 commit
  12. 23 Aug 2021, 4 commits
    • M
      btrfs: introduce btrfs_search_backwards function · 0ff40a91
      Committed by Marcos Paulo de Souza
      It's a common practice to start a search using an offset of (u64)-1, the
      maximum u64 value, meaning that we want the search_slot function to
      position us at the last item with the same objectid and type.
      
      Once we are in this position, it's a matter of starting a backwards
      search by calling btrfs_previous_item, which checks if we need to go to
      a previous leaf and does the other necessary checks, just to be sure we
      end up at the last offset of the same objectid and type.
      
      The new btrfs_search_backwards function does all these steps when
      necessary, and can be used to avoid code duplication.
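      The pattern can be illustrated with a self-contained userspace sketch
      (a linear scan over a sorted key array stands in for the btree descent;
      all names are illustrative):

```c
#include <stdint.h>

struct key3 {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Userspace sketch of the pattern btrfs_search_backwards() captures: search
 * a sorted key array for (objectid, type, (u64)-1); since that offset is the
 * maximum, the insertion point lands just past the last matching item, and
 * one step back yields the highest offset for that objectid/type. Returns
 * the slot or -1 when no such item exists. */
static int search_backwards(const struct key3 *keys, int nritems,
			    uint64_t objectid, uint8_t type)
{
	int slot = nritems;  /* insertion point for (objectid, type, (u64)-1) */

	/* linear scan stands in for the btree descent */
	for (int i = 0; i < nritems; i++) {
		if (keys[i].objectid > objectid ||
		    (keys[i].objectid == objectid && keys[i].type > type)) {
			slot = i;
			break;
		}
	}
	slot--;  /* step to the previous item */
	if (slot >= 0 && keys[slot].objectid == objectid &&
	    keys[slot].type == type)
		return slot;
	return -1;
}
```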
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0ff40a91
    • D
      btrfs: make btrfs_next_leaf static inline · 809d6902
      Committed by David Sterba
      btrfs_next_leaf is a simple wrapper for btrfs_next_old_leaf so move it
      to header to avoid the function call overhead.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      809d6902
    • F
      btrfs: continue readahead of siblings even if target node is in memory · 069a2e37
      Committed by Filipe Manana
      At reada_for_search(), when attempting to readahead a node or leaf's
      siblings, we skip the readahead of the siblings if the node/leaf is
      already in memory. That is probably fine for the READA_FORWARD and
      READA_BACK readahead types, as they are used on contexts where we
      end up reading some consecutive leaves, but usually not the whole btree.
      
      However for a READA_FORWARD_ALWAYS mode, currently only used for full
      send operations, it does not make sense to skip the readahead if the
      target node or leaf is already loaded in memory, since we know the caller
      is visiting every node and leaf of the btree in ascending order.
      
      So change the behaviour to not skip the readahead when the target node is
      already in memory and the readahead mode is READA_FORWARD_ALWAYS.
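      The changed check can be sketched as follows (enum and function names
      here are assumptions for illustration; the real logic lives inside
      reada_for_search()):

```c
#include <stdbool.h>

/* Illustrative model of the changed check: readahead of siblings used to be
 * skipped whenever the target node was already in memory; with
 * READA_FORWARD_ALWAYS (full send walks every node in ascending order) the
 * siblings are read ahead regardless. */
enum reada_mode { READA_BACK, READA_FORWARD, READA_FORWARD_ALWAYS };

static bool should_readahead_siblings(enum reada_mode mode,
				      bool node_in_memory)
{
	if (!node_in_memory)
		return true;
	/* node already cached: only the always-forward mode keeps going */
	return mode == READA_FORWARD_ALWAYS;
}
```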
      
      The following test script was used to measure the improvement on a box
      using an average, consumer grade, spinning disk, with 32GiB of RAM and
      using a non-debug kernel config (Debian's default config).
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
        MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
        MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        # Create files with inline data to make it easier and faster to create
        # large btrees.
        add_files()
        {
            local total=$1
            local start_offset=$2
            local number_jobs=$3
            local total_per_job=$(($total / $number_jobs))
      
            echo "Creating $total new files using $number_jobs jobs"
            for ((n = 0; n < $number_jobs; n++)); do
                (
                    local start_num=$(($start_offset + $n * $total_per_job))
                    for ((i = 1; i <= $total_per_job; i++)); do
                        local file_num=$((start_num + $i))
                        local file_path="$MNT/file_${file_num}"
                        xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                        if [ $? -ne 0 ]; then
                            echo "Failed creating file $file_path"
                            break
                        fi
                    done
                ) &
                worker_pids[$n]=$!
            done
      
            wait ${worker_pids[@]}
      
            sync
            echo
            echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
        }
      
        file_count=2000000
        add_files $file_count 0 4
      
        echo
        echo "Creating snapshot..."
        btrfs subvolume snapshot -r $MNT $MNT/snap1
      
        umount $MNT
      
        echo 3 > /proc/sys/vm/drop_caches
        blockdev --flushbufs $DEV &> /dev/null
        hdparm -F $DEV &> /dev/null
      
        mount $MOUNT_OPTIONS $DEV $MNT
      
        echo
        echo "Testing full send..."
        start=$(date +%s)
        btrfs send $MNT/snap1 > /dev/null
        end=$(date +%s)
        echo
        echo "Full send took $((end - start)) seconds"
      
        umount $MNT
      
      The duration of the full send operations, in seconds, were the following:
      
      Before this change:  85 seconds
      After this change:   76 seconds (-11.2%)
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      069a2e37
    • M
      btrfs: remove max argument from generic_bin_search · 67d5e289
      Committed by Marcos Paulo de Souza
      Both callers use btrfs_header_nritems to feed the max argument.  Remove
      the argument and let generic_bin_search call it itself.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      67d5e289
  13. 07 Jul 2021, 1 commit
    • F
      btrfs: rework chunk allocation to avoid exhaustion of the system chunk array · 79bd3712
      Committed by Filipe Manana
      Commit eafa4fd0 ("btrfs: fix exhaustion of the system chunk array
      due to concurrent allocations") fixed a problem that resulted in
      exhausting the system chunk array in the superblock when there are many
      tasks allocating chunks in parallel. Basically too many tasks enter the
      first phase of chunk allocation without previous tasks having finished
      their second phase of allocation, resulting in too many system chunks
      being allocated. That was originally observed when running the fallocate
      tests of stress-ng on a PowerPC machine, using a node size of 64K.
      
      However that commit also introduced a deadlock where a task in phase 1 of
      the chunk allocation waited for another task that had allocated a system
      chunk to finish its phase 2, but that other task was waiting on an extent
      buffer lock held by the first task, therefore resulting in both tasks not
      making any progress. That change was later reverted by a patch with the
      subject "btrfs: fix deadlock with concurrent chunk allocations involving
      system chunks", since there is no simple and short solution to address it
      and the deadlock is relatively easy to trigger on zoned filesystems, while
      the system chunk array exhaustion is not so common.
      
      This change reworks the chunk allocation to avoid the system chunk array
      exhaustion. It accomplishes that by making the first phase of chunk
      allocation do the updates of the device items in the chunk btree and the
      insertion of the new chunk item in the chunk btree. This is done while
      under the protection of the chunk mutex (fs_info->chunk_mutex), in the
      same critical section that checks for available system space, allocates
      a new system chunk if needed and reserves system chunk space. This way
      we do not have chunk space reserved until the second phase completes.
      
      The same logic is applied to chunk removal as well, since it keeps
      reserved system space long after it is done updating the chunk btree.
      
      For direct allocation of system chunks, the previous behaviour remains,
      because otherwise we would deadlock on extent buffers of the chunk btree.
      Changes to the chunk btree are by large done by chunk allocation and chunk
      removal, which first reserve chunk system space and then later do changes
      to the chunk btree. The other remaining cases are uncommon and correspond
      to adding a device, removing a device and resizing a device. All these
      other cases do not pre-reserve system space, they modify the chunk btree
      right away, so they don't hold reserved space for a long period like chunk
      allocation and chunk removal do.
      
      The diff of this change is huge, but more than half of it is just addition
      of comments describing both how things work regarding chunk allocation and
      removal, including both the new behavior and the parts of the old behavior
      that did not change.
      
      CC: stable@vger.kernel.org # 5.12+
      Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      79bd3712