1. 26 Sep 2022, 10 commits
  2. 05 Sep 2022, 1 commit
    • btrfs: zoned: fix API misuse of zone finish waiting · d5b81ced
      Authored by Naohiro Aota
      The commit 2ce543f4 ("btrfs: zoned: wait until zone is finished when
      allocation didn't progress") implemented a zone finish waiting
      mechanism in the write path of zoned mode. However, pairing
      wait_var_event() with wake_up_all() on fs_info->zone_finish_wait is an
      API mismatch: the two operate on different wait queues, so
      wait_var_event() just hangs because nothing ever wakes it up once it
      goes to sleep.
      
      Instead, we can simply use wait_on_bit_io() and clear_and_wake_up_bit()
      on fs_info->flags with a proper barrier installed.
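      
      As a sketch (not the literal patch), the corrected pairing looks
      roughly like this, assuming the BTRFS_FS_NEED_ZONE_FINISH flag bit
      introduced by commit 2ce543f4 below:
      
        /* Waiter side, in the zoned write path. */
        wait_on_bit_io(&fs_info->flags, BTRFS_FS_NEED_ZONE_FINISH,
                       TASK_UNINTERRUPTIBLE);
      
        /* Waker side, once a zone finish has completed.  Unlike a bare
         * clear_bit()/wake_up_all() pair, this helper contains the
         * barrier that makes the two sides pair up correctly. */
        clear_and_wake_up_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags);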
      
      Fixes: 2ce543f4 ("btrfs: zoned: wait until zone is finished when allocation didn't progress")
      CC: stable@vger.kernel.org # 5.16+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 24 Aug 2022, 1 commit
    • btrfs: fix space cache corruption and potential double allocations · ced8ecf0
      Authored by Omar Sandoval
      When testing space_cache v2 on a large set of machines, we encountered a
      few symptoms:
      
      1. "unable to add free space :-17" (EEXIST) errors.
      2. Missing free space info items, sometimes caught with a "missing free
         space info for X" error.
      3. Double-accounted space: ranges that were allocated in the extent tree
         and also marked as free in the free space tree, ranges that were
         marked as allocated twice in the extent tree, or ranges that were
         marked as free twice in the free space tree. If the latter made it
         onto disk, the next reboot would hit the BUG_ON() in
         add_new_free_space().
      4. On some hosts with no on-disk corruption or error messages, the
         in-memory space cache (dumped with drgn) disagreed with the free
         space tree.
      
      All of these symptoms have the same underlying cause: a race between
      caching the free space for a block group and returning free space to the
      in-memory space cache for pinned extents causes us to double-add a free
      range to the space cache. This race exists when free space is cached
      from the free space tree (space_cache=v2) or the extent tree
      (nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
      struct btrfs_block_group::last_byte_to_unpin and struct
      btrfs_block_group::progress are supposed to protect against this race,
      but commit d0c2f4fa ("btrfs: make concurrent fsyncs wait less when
      waiting for a transaction commit") subtly broke this by allowing
      multiple transactions to be unpinning extents at the same time.
      
      Specifically, the race is as follows:
      
      1. An extent is deleted from an uncached block group in transaction A.
      2. btrfs_commit_transaction() is called for transaction A.
      3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
         ref for the deleted extent.
      4. __btrfs_free_extent() -> do_free_extent_accounting() ->
         add_to_free_space_tree() adds the deleted extent back to the free
         space tree.
      5. do_free_extent_accounting() -> btrfs_update_block_group() ->
         btrfs_cache_block_group() queues up the block group to get cached.
         block_group->progress is set to block_group->start.
      6. btrfs_commit_transaction() for transaction A calls
         switch_commit_roots(). It sets block_group->last_byte_to_unpin to
         block_group->progress, which is block_group->start because the block
         group hasn't been cached yet.
      7. The caching thread gets to our block group. Since the commit roots
         were already switched, load_free_space_tree() sees the deleted extent
         as free and adds it to the space cache. It finishes caching and sets
         block_group->progress to U64_MAX.
      8. btrfs_commit_transaction() advances transaction A to
         TRANS_STATE_SUPER_COMMITTED.
      9. fsync calls btrfs_commit_transaction() for transaction B. Since
         transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
         commit is for fsync, it advances.
      10. btrfs_commit_transaction() for transaction B calls
          switch_commit_roots(). This time, the block group has already been
          cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
      11. btrfs_commit_transaction() for transaction A calls
          btrfs_finish_extent_commit(), which calls unpin_extent_range() for
          the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
          transaction B!), so it adds the deleted extent to the space cache
          again!
      
      This explains all of our symptoms above:
      
      * If the sequence of events is exactly as described above, when the free
        space is re-added in step 11, it will fail with EEXIST.
      * If another thread reallocates the deleted extent in between steps 7
        and 11, then step 11 will silently re-add that space to the space
        cache as free even though it is actually allocated. Then, if that
        space is allocated *again*, the free space tree will be corrupted
        (namely, the wrong item will be deleted).
      * If we don't catch this free space tree corruption, it will continue
        to get worse as extents are deleted and reallocated.
      
      The v1 space_cache is synchronously loaded when an extent is deleted
      (btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
      with load_cache_only=1), so it is not normally affected by this bug.
      However, as noted above, if we fail to load the space cache, we will
      fall back to caching from the extent tree and may hit this bug.
      
      The easiest fix for this race is to also make caching from the free
      space tree or extent tree synchronous. Josef tested this and found no
      performance regressions.
      
      A few extra changes fall out of this. Namely, the fix does the
      following, with step 2 being the crucial one (see the sketch after
      this list):
      
      1. Factor btrfs_caching_ctl_wait_done() out of
         btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
         that we already hold a reference to.
      2. Change the call in btrfs_cache_block_group() of
         btrfs_wait_space_cache_v1_finished() to
         btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
         space_cache option.
      3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
         space_cache_v1_done().
      4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
         `bool wait` to more accurately describe its new meaning.
      5. Change a few callers which had a separate call to
         btrfs_wait_block_group_cache_done() to use wait = true instead.
      6. Make btrfs_wait_block_group_cache_done() static now that it's not
         used outside of block-group.c anymore.
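      
      A rough sketch of the crucial change in step 2; everything beyond
      what the list above states (locals, cleanup) is an assumption:
      
        int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait)
        {
                struct btrfs_caching_ctl *caching_ctl;
                int ret = 0;
      
                /* ... queue up the caching work as before ... */
      
                /* The fix: wait for caching to finish regardless of
                 * which space_cache option is in use. */
                if (wait)
                        ret = btrfs_caching_ctl_wait_done(cache, caching_ctl);
      
                btrfs_put_caching_control(caching_ctl);
                return ret;
        }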
      
      Fixes: d0c2f4fa ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
      CC: stable@vger.kernel.org # 5.12+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 17 Aug 2022, 1 commit
    • btrfs: fix lockdep splat with reloc root extent buffers · b40130b2
      Authored by Josef Bacik
      We have been hitting the following lockdep splat with btrfs/187 recently
      
        WARNING: possible circular locking dependency detected
        5.19.0-rc8+ #775 Not tainted
        ------------------------------------------------------
        btrfs/752500 is trying to acquire lock:
        ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        but task is already holding lock:
        ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_init_new_buffer+0x7d/0x2c0
      	 btrfs_alloc_tree_block+0x120/0x3b0
      	 __btrfs_cow_block+0x136/0x600
      	 btrfs_cow_block+0x10b/0x230
      	 btrfs_search_slot+0x53b/0xb70
      	 btrfs_lookup_inode+0x2a/0xa0
      	 __btrfs_update_delayed_inode+0x5f/0x280
      	 btrfs_async_run_delayed_root+0x24c/0x290
      	 btrfs_work_helper+0xf2/0x3e0
      	 process_one_work+0x271/0x590
      	 worker_thread+0x52/0x3b0
      	 kthread+0xf0/0x120
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (btrfs-tree-01){++++}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_search_slot+0x3c3/0xb70
      	 do_relocation+0x10c/0x6b0
      	 relocate_tree_blocks+0x317/0x6d0
      	 relocate_block_group+0x1f1/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
      	 __lock_acquire+0x1122/0x1e10
      	 lock_acquire+0xc2/0x2d0
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_lock_root_node+0x31/0x50
      	 btrfs_search_slot+0x1cb/0xb70
      	 replace_path+0x541/0x9f0
      	 merge_reloc_root+0x1d6/0x610
      	 merge_reloc_roots+0xe2/0x260
      	 relocate_block_group+0x2c8/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        other info that might help us debug this:
      
        Chain exists of:
          btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-tree-01/1);
      				 lock(btrfs-tree-01);
      				 lock(btrfs-tree-01/1);
          lock(btrfs-treloc-02#2);
      
         *** DEADLOCK ***
      
        7 locks held by btrfs/752500:
         #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
         #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
         #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
         #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
         #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        stack backtrace:
        CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
      
         dump_stack_lvl+0x56/0x73
         check_noncircular+0xd6/0x100
         ? lock_is_held_type+0xe2/0x140
         __lock_acquire+0x1122/0x1e10
         lock_acquire+0xc2/0x2d0
         ? __btrfs_tree_lock+0x24/0x110
         down_write_nested+0x41/0x80
         ? __btrfs_tree_lock+0x24/0x110
         __btrfs_tree_lock+0x24/0x110
         btrfs_lock_root_node+0x31/0x50
         btrfs_search_slot+0x1cb/0xb70
         ? lock_release+0x137/0x2d0
         ? _raw_spin_unlock+0x29/0x50
         ? release_extent_buffer+0x128/0x180
         replace_path+0x541/0x9f0
         merge_reloc_root+0x1d6/0x610
         merge_reloc_roots+0xe2/0x260
         relocate_block_group+0x2c8/0x560
         btrfs_relocate_block_group+0x23e/0x400
         btrfs_relocate_chunk+0x4c/0x140
         btrfs_balance+0x755/0xe40
         btrfs_ioctl+0x1ea2/0x2c90
         ? lock_is_held_type+0xe2/0x140
         ? lock_is_held_type+0xe2/0x140
         ? __x64_sys_ioctl+0x88/0xc0
         __x64_sys_ioctl+0x88/0xc0
         do_syscall_64+0x38/0x90
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      This isn't necessarily new, it's just tricky to hit in practice.  There
      are two competing things going on here.  With relocation we create a
      snapshot of every fs tree with a reloc tree.  Any extent buffers that
      get initialized here are initialized with the reloc root lockdep key.
      However since it is a snapshot, any blocks that are currently in cache
      that originally belonged to the fs tree will have the normal tree
      lockdep key set.  This creates the lock dependency of
      
        reloc tree -> normal tree
      
      for the extent buffer locking during the first phase of the relocation
      as we walk down the reloc root to relocate blocks.
      
      However this is problematic because the final phase of the relocation is
      merging the reloc root into the original fs root.  This involves
      searching down to any keys that exist in the original fs root and then
      swapping the relocated block and the original fs root block.  We have to
      search down to the fs root first, and then go search the reloc root for
      the block we need to replace.  This creates the dependency of
      
        normal tree -> reloc tree
      
      which is why lockdep complains.
      
      Additionally, even if we were to fix this particular mismatch with a
      different nesting for the merge case, we're still slotting in a block
      that has an owner of the reloc root objectid into a normal tree, so
      that block will have its lockdep key set to the reloc tree key, and
      create a lockdep splat later on when we wander into that block from
      the fs root.
      
      Unfortunately the only solution here is to make sure we do not set the
      lockdep key to the reloc tree lockdep key normally, and then reset any
      blocks we wander into from the reloc root when we're doing the merge.
      
      This solves the problem of having reloc tree keys intermixed with
      normal tree keys, and then allows us to make sure in the merge case we
      maintain the lock order of
      
        normal tree -> reloc tree
      
      We handle this by setting a bit on the reloc root when we do the search
      for the block we want to relocate, and any block we search into or COW
      at that point gets set to the reloc tree key.  This works correctly
      because we only ever COW down to the parent node, so we aren't resetting
      the key for the block we're linking into the fs root.
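      
      As a sketch of the mechanism (the bit and helper names here are
      assumptions based on the description above, not necessarily the
      patch's identifiers):
      
        /* Set on the reloc root around the merge-time search. */
        set_bit(BTRFS_ROOT_RESET_LOCKDEP_CLASS, &root->state);
      
        /* Any block we search into or COW from such a root is moved to
         * the reloc tree lockdep key, keeping the order normal -> reloc. */
        if (test_bit(BTRFS_ROOT_RESET_LOCKDEP_CLASS, &root->state))
                btrfs_set_buffer_lockdep_class(root->root_key.objectid,
                                               eb, btrfs_header_level(eb));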
      
      With this patch we no longer have the lockdep splat in btrfs/187.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 26 Jul 2022, 2 commits
    • btrfs: fix repair of compressed extents · 81bd9328
      Authored by Christoph Hellwig
      Currently the checksum of compressed extents is verified based on the
      compressed data and the lower btrfs_bio, but the actual repair process
      is driven by end_bio_extent_readpage on the upper btrfs_bio for the
      decompressed data.
      
      This has a bunch of issues, including not being able to properly
      communicate the failed mirror up in case the I/O submission got
      preempted, a general loss of information about whether an error was an
      I/O error or a checksum verification failure, but most importantly
      that this design causes btrfs_clean_io_failure to eventually write
      back the uncompressed good data onto the disk sectors that are
      supposed to contain compressed data.
      
      Fix this by moving the repair to the lower btrfs_bio.  To do so, a fair
      amount of code has to be reshuffled:
      
       a) the lower btrfs_bio now needs a valid csum pointer.  The easiest
          way to achieve that is to pass NULL to btrfs_lookup_bio_sums and
          just use the btrfs_bio management of csums (see the sketch after
          this list).  For a compressed_bio that is split into multiple
          btrfs_bios this means additional memory allocations, but the code
          becomes a lot more regular.
       b) checksum verification now runs directly on the lower btrfs_bio instead
          of the compressed_bio.  This actually nicely simplifies the end I/O
          processing.
       c) btrfs_repair_one_sector can't just look up the logical address for
          the file offset any more, as for compressed extents there is no
          direct correspondence between the file offset and the logical
          address.  Instead, require that the saved bvec_iter in the
          btrfs_bio is filled out for all read bios and use that, which
          again removes a fair amount of code.
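      
      A sketch of point a), assuming the btrfs_lookup_bio_sums() signature
      of this era, where a NULL destination means the csum array is managed
      inside the btrfs_bio:
      
        /* Let the lower btrfs_bio own its csum array. */
        ret = btrfs_lookup_bio_sums(inode, bio, NULL);
      
        /* On read completion the expected csum of each sector is then
         * found through btrfs_bio(bio)->csum instead of the
         * compressed_bio. */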
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the start argument to check_data_csum and export · 7959bd44
      Authored by Christoph Hellwig
      Derive the value of start from the btrfs_bio now that ->file_offset is
      always valid.  Also export and rename the function so it's available
      outside of inode.c as we'll need that soon.
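      
      A hedged sketch of the result; the exported name and exact parameter
      list are assumptions based on this description:
      
        int btrfs_check_data_csum(struct inode *inode, struct btrfs_bio *bbio,
                                  u32 bio_offset, struct page *page, u32 pgoff)
        {
                /* start is no longer passed in, derive it instead. */
                u64 start = bbio->file_offset + bio_offset;
      
                /* ... verify the sector, report errors against start ... */
                return 0;
        }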
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 25 Jul 2022, 16 commits
    • btrfs: zoned: wait until zone is finished when allocation didn't progress · 2ce543f4
      Authored by Naohiro Aota
      When the allocated position doesn't progress, we cannot submit new IOs
      to finish a block group, but there should be ongoing IOs that will
      eventually finish one. So, in that case, wait for a zone to be
      finished and retry the allocation after that.
      
      Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH for fs_info->flags to
      indicate that a zone finish needs to happen before allocation can
      proceed. The flag is set when the allocator detects that it cannot
      activate a new block group, and it is cleared once a zone is finished.
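      
      A small lifecycle sketch of the flag; note that the wake-up side as
      merged here was an API misuse, later corrected by d5b81ced above:
      
        /* Allocator: no block group could be activated. */
        set_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags);
      
        /* Zone finish completion: let the writers retry. */
        clear_bit(BTRFS_FS_NEED_ZONE_FINISH, &fs_info->flags);
        wake_up_all(&fs_info->zone_finish_wait); /* replaced by d5b81ced */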
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert count_max_extents() to use fs_info->max_extent_size · 7d7672bc
      Authored by Naohiro Aota
      If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the
      number of extents needed, btrfs releases too much of the metadata
      reservation on its way to writing out the data.
      
      Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size,
      convert count_max_extents() to use it instead, and fix the calculation of
      the metadata reservation.
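      
      The converted helper presumably ends up as something like this sketch,
      now taking fs_info to reach max_extent_size:
      
        static inline u32 count_max_extents(struct btrfs_fs_info *fs_info,
                                            u64 size)
        {
                return div_u64(size + fs_info->max_extent_size - 1,
                               fs_info->max_extent_size);
        }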
      
      CC: stable@vger.kernel.org # 5.12+
      Fixes: d8e3fb10 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size · f7b12a62
      Authored by Naohiro Aota
      On a zoned filesystem, data write out is limited by
      max_zone_append_size, and a large ordered extent is split according to
      the size of a bio. On the other hand, the number of extents to be
      written is calculated using BTRFS_MAX_EXTENT_SIZE, and that estimated
      number is used to reserve the metadata bytes to update and/or create
      the metadata items.
      
      The metadata reservation is done at, e.g., btrfs_buffered_write() and
      then released as the estimation changes. Thus, if the number of
      extents increases massively, the reserved metadata can run out.
      
      The increase of the number of extents easily occurs on a zoned
      filesystem if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size, and it
      causes the following warning on a machine with little RAM and with
      metadata over-commit disabled (done in the following patch).
      
      [75721.498492] ------------[ cut here ]------------
      [75721.505624] BTRFS: block rsv 1 returned -28
      [75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G        W         5.18.0-rc2-BTRFS-ZNS+ #109
      [75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
      [75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
      [75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
      [75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
      [75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
      [75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
      [75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
      [75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
      [75721.701878] FS:  0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
      [75721.712601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
      [75721.730499] Call Trace:
      [75721.735166]  <TASK>
      [75721.739886]  btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
      [75721.747545]  ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
      [75721.756145]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.762852]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.769520]  ? push_leaf_left+0x420/0x620 [btrfs]
      [75721.776431]  ? memcpy+0x4e/0x60
      [75721.781931]  split_leaf+0x433/0x12d0 [btrfs]
      [75721.788392]  ? btrfs_get_token_32+0x580/0x580 [btrfs]
      [75721.795636]  ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
      [75721.803759]  ? leaf_space_used+0x15d/0x1a0 [btrfs]
      [75721.811156]  btrfs_search_slot+0x1bc3/0x2790 [btrfs]
      [75721.818300]  ? lock_downgrade+0x7c0/0x7c0
      [75721.824411]  ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
      [75721.832456]  ? split_leaf+0x12d0/0x12d0 [btrfs]
      [75721.839149]  ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
      [75721.846945]  ? free_extent_buffer+0x13/0x20 [btrfs]
      [75721.853960]  ? btrfs_release_path+0x4b/0x190 [btrfs]
      [75721.861429]  btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
      [75721.869313]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.876085]  ? lock_release+0x552/0xf80
      [75721.881957]  ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
      [75721.888886]  ? __kasan_check_write+0x14/0x20
      [75721.895152]  ? do_raw_read_unlock+0x44/0x80
      [75721.901323]  ? _raw_write_lock_irq+0x60/0x80
      [75721.907983]  ? btrfs_global_root+0xb9/0xe0 [btrfs]
      [75721.915166]  ? btrfs_csum_root+0x12b/0x180 [btrfs]
      [75721.921918]  ? btrfs_get_global_root+0x820/0x820 [btrfs]
      [75721.929166]  ? _raw_write_unlock+0x23/0x40
      [75721.935116]  ? unpin_extent_cache+0x1e3/0x390 [btrfs]
      [75721.942041]  btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
      [75721.949906]  ? try_to_wake_up+0x30/0x14a0
      [75721.955700]  ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
      [75721.962661]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.969111]  ? lock_acquire+0x41b/0x4c0
      [75721.974982]  finish_ordered_fn+0x15/0x20 [btrfs]
      [75721.981639]  btrfs_work_helper+0x1af/0xa80 [btrfs]
      [75721.988184]  ? _raw_spin_unlock_irq+0x28/0x50
      [75721.994643]  process_one_work+0x815/0x1460
      [75722.000444]  ? pwq_dec_nr_in_flight+0x250/0x250
      [75722.006643]  ? do_raw_spin_trylock+0xbb/0x190
      [75722.013086]  worker_thread+0x59a/0xeb0
      [75722.018511]  kthread+0x2ac/0x360
      [75722.023428]  ? process_one_work+0x1460/0x1460
      [75722.029431]  ? kthread_complete_and_exit+0x30/0x30
      [75722.036044]  ret_from_fork+0x22/0x30
      [75722.041255]  </TASK>
      [75722.045047] irq event stamp: 0
      [75722.049703] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
      [75722.067533] softirqs last  enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
      [75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [75722.085335] ---[ end trace 0000000000000000 ]---
      
      To fix the estimation, we need to introduce fs_info->max_extent_size
      to replace BTRFS_MAX_EXTENT_SIZE, which allows setting different sizes
      for regular and zoned filesystems.
      
      Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On a
      zoned filesystem, it is set to fs_info->max_zone_append_size.
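      
      A sketch of that initialization, using the field names from this
      message:
      
        /* Regular filesystems keep the historical limit. */
        fs_info->max_extent_size = BTRFS_MAX_EXTENT_SIZE;
      
        /* Zoned filesystems are bounded by the zone append limit. */
        if (btrfs_is_zoned(fs_info))
                fs_info->max_extent_size = fs_info->max_zone_append_size;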
      
      CC: stable@vger.kernel.org # 5.12+
      Fixes: d8e3fb10 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: revive max_zone_append_bytes · c2ae7b77
      Authored by Naohiro Aota
      This patch is basically a revert of commit 5a80d1c6 ("btrfs: zoned:
      remove max_zone_append_size logic"), but without the unnecessary
      ASSERT and check. The max_zone_append_size will be used as a hint to
      estimate the number of extents needed to cover a delalloc/writeback
      region in later commits.
      
      The size of a ZONE APPEND bio is also limited by queue_max_segments(),
      so this commit takes that into account when calculating
      max_zone_append_size. Technically, a bio can be larger than
      queue_max_segments() * PAGE_SIZE if the pages are contiguous, but it
      is still safe to treat "queue_max_segments() * PAGE_SIZE" as an upper
      limit of an extent size for calculating the number of extents needed
      to write data.
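      
      A sketch of the calculation, assuming the usual block layer helpers:
      
        struct request_queue *queue = bdev_get_queue(device->bdev);
      
        /* Cap by the hardware zone-append limit and by the worst case
         * of one page per segment, whichever is smaller. */
        zone_info->max_zone_append_size = min_t(u64,
                (u64)queue_max_zone_append_sectors(queue) << SECTOR_SHIFT,
                (u64)queue_max_segments(queue) << PAGE_SHIFT);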
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: collect commit stats, count, duration · e55958c8
      Authored by Ioannis Angelakopoulos
      Track several stats about transaction commits, to be exported later
      via sysfs:
      
      - number of commits so far
      - duration of the last commit in ns
      - maximum commit duration seen so far in ns
      - total duration for all commits so far in ns
      
      The update of the commit stats occurs after the commit thread has gone
      through all the logic that checks if there is another thread committing
      at the same time. This means that we only account for actual commit work
      in the commit stats we report and not the time the thread spends waiting
      until it is ready to do the commit work.
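      
      The tracked values map naturally onto a small structure; a sketch with
      assumed member names:
      
        struct btrfs_commit_stats {
                u64 commit_count;       /* number of commits so far */
                u64 last_commit_dur;    /* duration of the last commit, ns */
                u64 max_commit_dur;     /* longest commit seen so far, ns */
                u64 total_commit_dur;   /* sum of all commit durations, ns */
        };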
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use named constant for reserved device space · 37f85ec3
      Authored by Qu Wenruo
      There's a reserved space on each device of size 1MiB that can be used
      by bootloaders or to avoid accidental overwrite. Use a symbolic
      constant with an explanatory comment instead of hard coding the value
      and multiple comments.
      
      Note: since btrfs-progs v4.1, mkfs.btrfs will reserve the first 1MiB
      for the primary super block (at offset 64KiB); until then the range
      could have been used by mistake. The kernel has always respected the
      1MiB range for writes.
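      
      A sketch of the change; the constant name is an assumption:
      
        /* The first 1MiB of each device is reserved for bootloaders or
         * to avoid accidental overwrite; never allocate from it. */
        #define BTRFS_DEVICE_RANGE_RESERVED     (SZ_1M)
      
        if (start < BTRFS_DEVICE_RANGE_RESERVED)
                start = BTRFS_DEVICE_RANGE_RESERVED;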
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: pass bits by value not by pointer for extent_state helpers · 6d92b304
      Authored by David Sterba
      The bits are passed by pointer to all extent state helpers for no
      apparent reason; the value is only read and never updated, so remove
      the indirection and pass it directly. Also unify the type to u32 where
      needed.
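      
      The shape of the change, sketched on one internal helper (the
      remaining parameters are elided and the exact signature is assumed):
      
        /* Before: bits passed by pointer, but only ever read. */
        static void set_state_bits(struct extent_io_tree *tree,
                                   struct extent_state *state, u32 *bits);
      
        /* After: passed by value, type unified to u32. */
        static void set_state_bits(struct extent_io_tree *tree,
                                   struct extent_state *state, u32 bits);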
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_super_block::log_root_transid deprecated · 97f09d55
      Authored by Qu Wenruo
      When using "btrfs inspect-internal dump-super" to inspect an fs with
      dirty log, it always shows the log_root_transid as 0:
      
        log_root                30474240
        log_root_transid        0 <<<
        log_root_level          0
      
      It turns out that btrfs_super_block::log_root_transid is never really
      utilized (it is not even read).
      
      This dates back to the introduction of btrfs into the upstream kernel.
      
      In fact, when reading the log tree root, we always use
      btrfs_super_block::generation + 1 as the expected generation, so we
      are completely safe to mark this member deprecated.
      
      In theory we could easily reuse this member for other purposes, but to
      be extra safe, here we follow the example of leafsize and rename the
      member with an "__unused_" prefix. We can also safely remove the
      accessors, since there have never been any callers.
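      
      The on-disk structure then reads roughly like this sketch, following
      the leafsize precedent:
      
        struct btrfs_super_block {
                /* ... */
                __le64 root;
                __le64 chunk_root;
                __le64 log_root;
      
                /* Deprecated, never read; kept to preserve the format. */
                __le64 __unused_log_root_transid;
                /* ... */
        } __attribute__ ((__packed__));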
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove btrfs_end_io_wq · d7b9416f
      Authored by Christoph Hellwig
      All read bios that go through btrfs_map_bio need to be completed in
      user context, and read I/Os are the most common and timing-critical
      ones in almost any file system workload.
      
      Embed a work_struct into struct btrfs_bio and use it to complete all
      read bios submitted through btrfs_map_bio, using the REQ_META flag to
      decide which workqueue they are placed on.
      
      This removes the need for a separate 128 byte allocation (typically
      rounded up to 192 bytes by slab) for all reads, with a size increase
      of 24 bytes for struct btrfs_bio.  Future patches will reorganize
      struct btrfs_bio to make use of this extra space for writes as well.
      
      (All sizes are based on a typical 64-bit non-debug build.)
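      
      A sketch of the deferral, with helper and workqueue names assumed from
      the description:
      
        static void btrfs_end_bio_work(struct work_struct *work)
        {
                struct btrfs_bio *bbio =
                        container_of(work, struct btrfs_bio, end_io_work);
      
                bio_endio(&bbio->bio);
        }
      
        /* On read completion: punt to user context, with the workqueue
         * chosen by REQ_META. */
        INIT_WORK(&bbio->end_io_work, btrfs_end_bio_work);
        queue_work(bbio->bio.bi_opf & REQ_META ?
                   fs_info->endio_meta_workers : fs_info->endio_workers,
                   &bbio->end_io_work);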
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't use btrfs_bio_wq_end_io for compressed writes · fed8a72d
      Authored by Christoph Hellwig
      Compressed write bio completion is the only user of
      btrfs_bio_wq_end_io for writes, and its use is a little suboptimal
      here as we only really need user context for the final completion of a
      compressed_bio structure, not for every single bio completion.
      
      Add a work_struct to struct compressed_bio instead and use that to
      call finish_compressed_bio_write.  This allows removing all handling
      of write bios from the btrfs_bio_wq_end_io infrastructure.
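      
      A sketch of the per-compressed_bio deferral (member and workqueue
      names are assumptions):
      
        static void btrfs_finish_compressed_write_work(struct work_struct *work)
        {
                struct compressed_bio *cb =
                        container_of(work, struct compressed_bio,
                                     write_end_work);
      
                finish_compressed_bio_write(cb);
        }
      
        /* Only the final completion needs user context. */
        queue_work(fs_info->compressed_write_workers, &cb->write_end_work);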
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defer I/O completion based on the btrfs_raid_bio · d34e123d
      Authored by Christoph Hellwig
      Instead of attaching an extra allocation and an indirect call to each
      low-level bio issued by the RAID code, add a work_struct to struct
      btrfs_raid_bio and only defer the per-rbio completion action.  The
      per-bio actions for all the I/Os are trivial and can be safely done
      from interrupt context.
      
      As a nice side effect this also allows sharing the boilerplate code
      for the per-bio completions.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: split btrfs_submit_data_bio to read and write parts · c93104e7
      Authored by Christoph Hellwig
      Split btrfs_submit_data_bio into one helper for reads and one for writes.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: send compressed extents with encoded writes · 3ea4dc5b
      Authored by Omar Sandoval
      Now that all of the pieces are in place, we can use the ENCODED_WRITE
      command to send compressed extents when appropriate.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: factor out a btrfs_csum_ptr helper · a89ce08c
      Authored by Christoph Hellwig
      Add a helper to find the csum for a byte offset into the csum buffer.
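      
      A sketch of the helper; the arithmetic follows directly from the
      per-sector checksum layout:
      
        static u8 *btrfs_csum_ptr(const struct btrfs_fs_info *fs_info,
                                  u8 *csums, u64 offset)
        {
                const u64 offset_in_sectors = offset >> fs_info->sectorsize_bits;
      
                return csums + offset_in_sectors * fs_info->csum_size;
        }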
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce a data checksum checking helper · ae643a74
      Authored by Qu Wenruo
      Although we have several pieces of data csum verification code, we
      never had a function that just verifies the checksum of one sector.
      
      check_data_csum() does extra work for error reporting, thus it
      requires a lot of extra context like the file offset, bio_offset, etc.
      
      btrfs_verify_data_csum() is even worse: it relies on the page Checked
      flag, which means it cannot be used for direct IO pages.
      
      Here we introduce a new helper, btrfs_check_sector_csum(), which only
      accepts a sector within a page and a pointer to the expected checksum.
      
      We use this function to implement check_data_csum(), and export it for
      an incoming patch.
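      
      The new helper's interface, as a hedged sketch (the parameter layout
      is assumed from this description):
      
        /* Verify the checksum of one sector inside @page at @pgoff
         * against @csum_expected; @csum receives the computed checksum
         * so callers can print it on mismatch. */
        int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info,
                                    struct page *page, u32 pgoff,
                                    u8 *csum, u8 * const csum_expected);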
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [hch: keep passing the csum array as an argument, as the callers want
            to print it, rename per request]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix typos in comments · 143823cf
      Authored by David Sterba
      Codespell has found a few typos.
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 16 Jul 2022, 3 commits
  8. 21 Jun 2022, 1 commit
    • btrfs: add missing inode updates on each iteration when replacing extents · 983d8209
      Authored by Filipe Manana
      When replacing file extents, called during fallocate, hole punching,
      clone and deduplication, we may not be able to replace/drop all the
      target file extent items with a single transaction handle. We may get
      -ENOSPC while doing it, in which case we release the transaction handle,
      balance the dirty pages of the btree inode, flush delayed items and get
      a new transaction handle to operate on what's left of the target range.
      
      By dropping and replacing file extent items we have effectively modified
      the inode, so we should bump its iversion and update its mtime/ctime
      before we update the inode item. This is because if the transaction
      we used for partially modifying the inode gets committed by someone
      else after we release it, and a power failure then happens before we
      finish the rest of the range, then after mounting the filesystem our
      inode has an outdated iversion and mtime/ctime, corresponding to the
      values it had before we changed it.
      
      So add the missing iversion and mtime/ctime updates.
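      
      The missing updates, sketched at the point where each iteration
      updates the inode item (standard VFS/btrfs helpers):
      
        /* Each iteration effectively modifies the inode, so make any
         * transaction that may commit carry consistent values. */
        inode_inc_iversion(&inode->vfs_inode);
        inode->vfs_inode.i_mtime = current_time(&inode->vfs_inode);
        inode->vfs_inode.i_ctime = inode->vfs_inode.i_mtime;
        ret = btrfs_update_inode(trans, root, inode);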
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 16 May 2022, 5 commits