1. 25 Jul 2022, 3 commits
  2. 16 May 2022, 8 commits
    • btrfs: zoned: zone finish unused block group · 74e91b12
      Naohiro Aota authored
      While the active zones within an active block group are reset, and their
      active resource is released, the block group itself is kept in the active
      block group list and marked as active. As a result, the list will contain
      more than max_active_zones block groups. That itself is not fatal for the
      device as the zones are properly reset.
      
      However, that inflated list is, of course, bogus. Also, an upcoming
      patch series, which deactivates an active block group on demand, gets
      confused by the wrong list.
      
      So, fix the issue by finishing the unused block group once it becomes
      read-only, so that the active resource can be released at an early stage.
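      
      As a rough sketch of the idea (simplified; inc_block_group_ro() and
      btrfs_zone_finish() are existing helpers, but the exact call site and
      error handling in the unused block group cleanup path are omitted):
      
         /*
          * Once the unused block group has been made read-only, finish
          * its zone right away so the active-zone resource is released
          * early instead of lingering on the active list.
          */
         ret = inc_block_group_ro(block_group, 0);
         if (!ret)
                 btrfs_zone_finish(block_group);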
      
      Fixes: be1a1d7a ("btrfs: zoned: finish fully written block group")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid double search for block group during NOCOW writes · 2306e83e
      Filipe Manana authored
      When doing a NOCOW write, either through direct IO or buffered IO, we do
      two lookups for the block group that contains the target extent: once
      when we call btrfs_inc_nocow_writers() and then later again when we call
      btrfs_dec_nocow_writers() after creating the ordered extent.
      
      The lookups require taking a lock and navigating the red black tree used
      to track all block groups, which can take a non-negligible amount of
      time for a large filesystem with thousands of block groups, and can also
      cause lock contention and cache line bouncing.
      
      Improve on this by doing a single block group search: make
      btrfs_inc_nocow_writers() return the block group to its caller, and have
      the caller pass that block group to btrfs_dec_nocow_writers().
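      
      A sketch of the resulting calling pattern (caller side; error handling
      elided, and fallback_to_cow() is a hypothetical stand-in for the COW
      path):
      
         struct btrfs_block_group *bg;
      
         /*
          * One search: returns the block group with its NOCOW writer
          * count elevated, or NULL if a NOCOW write is not possible.
          */
         bg = btrfs_inc_nocow_writers(fs_info, disk_bytenr);
         if (!bg)
                 return fallback_to_cow();  /* hypothetical helper */
      
         /* ... create the ordered extent for the NOCOW write ... */
      
         /* No second rbtree search: reuse the block group pointer. */
         btrfs_dec_nocow_writers(bg);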
      
      This is part of a patchset comprised of the following patches:
      
        btrfs: remove search start argument from first_logical_byte()
        btrfs: use rbtree with leftmost node cached for tracking lowest block group
        btrfs: use a read/write lock for protecting the block groups tree
        btrfs: return block group directly at btrfs_next_block_group()
        btrfs: avoid double search for block group during NOCOW writes
      
      The following test was used to test these changes from a performance
      perspective:
      
         $ cat test.sh
         #!/bin/bash
      
         modprobe null_blk nr_devices=0
      
         NULL_DEV_PATH=/sys/kernel/config/nullb/nullb0
         mkdir $NULL_DEV_PATH
         if [ $? -ne 0 ]; then
             echo "Failed to create nullb0 directory."
             exit 1
         fi
         echo 2 > $NULL_DEV_PATH/submit_queues
         echo 16384 > $NULL_DEV_PATH/size # 16G
         echo 1 > $NULL_DEV_PATH/memory_backed
         echo 1 > $NULL_DEV_PATH/power
      
         DEV=/dev/nullb0
         MNT=/mnt/nullb0
         LOOP_MNT="$MNT/loop"
         MOUNT_OPTIONS="-o ssd -o nodatacow"
         MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
         cat <<EOF > /tmp/fio-job.ini
         [io_uring_writes]
         rw=randwrite
         fsync=0
         fallocate=posix
         group_reporting=1
         direct=1
         ioengine=io_uring
         iodepth=64
         bs=64k
         filesize=1g
         runtime=300
         time_based
         directory=$LOOP_MNT
         numjobs=8
         thread
         EOF
      
         echo performance | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         echo
         echo "Using config:"
         echo
         cat /tmp/fio-job.ini
         echo
      
         umount $MNT &> /dev/null
         mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
         mount $MOUNT_OPTIONS $DEV $MNT
      
         mkdir $LOOP_MNT
      
         truncate -s 4T $MNT/loopfile
         mkfs.btrfs -f $MKFS_OPTIONS $MNT/loopfile &> /dev/null
         mount $MOUNT_OPTIONS $MNT/loopfile $LOOP_MNT
      
         # Trigger the allocation of about 3500 data block groups, without
         # actually consuming space on the underlying filesystem, just to make
         # the tree of block groups large.
         fallocate -l 3500G $LOOP_MNT/filler
      
         fio /tmp/fio-job.ini
      
         umount $LOOP_MNT
         umount $MNT
      
         echo 0 > $NULL_DEV_PATH/power
         rmdir $NULL_DEV_PATH
      
      The test was run on a non-debug kernel (Debian's default kernel config);
      the results were the following.
      
      Before patchset:
      
        WRITE: bw=1455MiB/s (1526MB/s), 1455MiB/s-1455MiB/s (1526MB/s-1526MB/s), io=426GiB (458GB), run=300006-300006msec
      
      After patchset:
      
        WRITE: bw=1503MiB/s (1577MB/s), 1503MiB/s-1503MiB/s (1577MB/s-1577MB/s), io=440GiB (473GB), run=300006-300006msec
      
        +3.3% write throughput and +3.3% IO done in the same time period.
      
      The test has somewhat limited coverage scope, as with only NOCOW writes
      we get less contention on the red black tree of block groups, since we
      don't have the extra contention caused by COW writes, namely when
      allocating data extents and when pinning and unpinning data extents; but
      on the other hand there's access to the tree in the NOCOW path, when
      incrementing a block group's number of NOCOW writers.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: return block group directly at btrfs_next_block_group() · 8b01f931
      Filipe Manana authored
      At btrfs_next_block_group(), we have this long line with two statements:
      
        cache = btrfs_lookup_first_block_group(...); return cache;
      
      This makes it a bit harder to read due to two statements on the same
      line, so change that to directly return the result of the call to
      btrfs_lookup_first_block_group().
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use a read/write lock for protecting the block groups tree · 16b0c258
      Filipe Manana authored
      Currently we use a spin lock to protect the red black tree that we use to
      track block groups. Most accesses to that tree are actually read only and
      for large filesystems, with thousands of block groups, it actually has
      a bad impact on performance, as concurrent read only searches on the tree
      are serialized.
      
      Read only searches on the tree are very frequent and done when:
      
      1) Pinning and unpinning extents, as we need to lookup the respective
         block group from the tree;
      
      2) Freeing the last reference of a tree block, regardless if we pin the
         underlying extent or add it back to free space cache/tree;
      
      3) During NOCOW writes, both buffered IO and direct IO, we need to check
         if the block group that contains an extent is read only or not and to
         increment the number of NOCOW writers in the block group. For those
         operations we need to search for the block group in the tree.
         Similarly, after creating the ordered extent for the NOCOW write, we
         need to decrement the number of NOCOW writers from the same block
         group, which requires searching for it in the tree;
      
      4) Decreasing the number of extent reservations in a block group;
      
      5) When allocating extents and freeing reserved extents;
      
      6) Adding and removing free space to the free space tree;
      
      7) When releasing delalloc bytes during ordered extent completion;
      
      8) When relocating a block group;
      
      9) During fitrim, to iterate over the block groups;
      
      10) etc;
      
      Write accesses to the tree, to add or remove block groups, are much less
      frequent as they happen only when allocating a new block group or when
      deleting a block group.
      
      We also use the same spin lock to protect the list of currently caching
      block groups. Additions to this list are made when we need to cache a
      block group, because we don't have a free space cache for it (or we have
      but it's invalid), and removals from this list are done when caching of
      the block group's free space finishes. These cases are also not very
      common, but when they happen, they happen only once when the filesystem
      is mounted.
      
      So switch the lock that protects the tree of block groups from a spinning
      lock to a read/write lock.
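      
      In kernel terms the switch is essentially the following (sketch; the
      lock and the tree operations are simplified):
      
         rwlock_t lock;  /* was: spinlock_t */
      
         /* Read-only search: many tasks can hold this concurrently. */
         read_lock(&lock);
         /* ... rbtree lookup of a block group ... */
         read_unlock(&lock);
      
         /* Adding or removing a block group: exclusive access. */
         write_lock(&lock);
         /* ... rb_insert/rb_erase on the tree ... */
         write_unlock(&lock);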
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use rbtree with leftmost node cached for tracking lowest block group · 08dddb29
      Filipe Manana authored
      We keep track of the start offset of the block group with the lowest start
      offset at fs_info->first_logical_byte. This requires explicitly updating
      that field every time we add, delete or lookup a block group to/from the
      red black tree at fs_info->block_group_cache_tree.
      
      Since the block group with the lowest start address happens to always be
      the one at the leftmost node of the tree, we can use a red black tree
      that caches the leftmost node. Then when we need the start address of
      that block group, we can just quickly get the leftmost node in the tree
      and extract the start offset of that node's block group. This avoids the
      need to explicitly keep track of that address in the dedicated member
      fs_info->first_logical_byte, and it also allows the next patch in the
      series to switch the lock that protects the red black tree from a spin
      lock to a read/write lock - without this change that would be tricky,
      because block group searches also update fs_info->first_logical_byte.
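      
      The kernel rbtree API supports this directly (sketch using the generic
      cached-rbtree helpers; the leftmost flag is computed during the
      insertion walk):
      
         struct rb_root_cached tree = RB_ROOT_CACHED;
      
         /* Insertion records whether the new node became the leftmost. */
         rb_insert_color_cached(&bg->cache_node, &tree, leftmost);
      
         /* O(1) access to the block group with the lowest start offset. */
         struct rb_node *node = rb_first_cached(&tree);
         if (node) {
                 struct btrfs_block_group *lowest;
      
                 lowest = rb_entry(node, struct btrfs_block_group,
                                   cache_node);
                 /* lowest->start replaces fs_info->first_logical_byte. */
         }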
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: make auto-reclaim less aggressive · 3687fcb0
      Johannes Thumshirn authored
      The current auto-reclaim algorithm starts reclaiming all block groups
      with a zone_unusable value above a configured threshold. This causes a
      lot of reclaim IO even when there would still be enough free zones on
      the device.
      
      Instead of only accounting a block group's zone_unusable value, also
      take into account the ratio of free to not usable (written as well as
      zone_unusable) bytes a device has.
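      
      A sketch of the device-level check (names approximate; the point is to
      compare the used ratio of the whole device against a threshold before
      reclaiming individual block groups):
      
         static bool zoned_should_reclaim(struct btrfs_fs_info *fs_info)
         {
                 u64 used = 0, total = 0;
                 struct btrfs_device *device;
      
                 list_for_each_entry(device, &fs_info->fs_devices->devices,
                                     dev_list) {
                         total += device->disk_total_bytes;
                         used += device->bytes_used;
                 }
      
                 if (total == 0)
                         return false;
      
                 /* Reclaim only once overall usage crosses the threshold. */
                 return div64_u64(used * 100, total) >=
                        fs_info->bg_reclaim_threshold;
         }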
      Tested-by: Pankaj Raghav <p.raghav@samsung.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: allow block group background reclaim for non-zoned filesystems · ac2f1e63
      Josef Bacik authored
      This will allow us to set a threshold for block groups to be
      automatically relocated even if we don't have zoned devices.
      
      We have found this feature invaluable at Facebook due to how our
      workload interacts with the allocator.  We have been using this in
      production for months with only a single problem that has already been
      fixed.
      Tested-by: Pankaj Raghav <p.raghav@samsung.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_for_each_slot in find_first_block_group · 36dfbbe2
      Gabriel Niebler authored
      This function can be simplified by refactoring to use the new iterator
      macro.  No functional changes.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: Gabriel Niebler <gniebler@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 06 Apr 2022, 3 commits
  4. 14 Mar 2022, 2 commits
    • btrfs: zoned: mark relocation as writing · ca5e4ea0
      Naohiro Aota authored
      There is a hung_task issue with running generic/068 on an SMR
      device. The hang occurs while a process is trying to thaw the
      filesystem. The process is trying to take sb->s_umount to thaw the
      FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is
      waiting for an ordered extent to finish. However, as the FS is frozen,
      the ordered extents never finish.
      
      Having an ordered extent while the FS is frozen is the root cause of
      the hang. The ordered extent is initiated from btrfs_relocate_chunk()
      which is called from btrfs_reclaim_bgs_work().
      
      This commit adds sb_start_write()/sb_end_write() around the
      btrfs_relocate_chunk() call site. For the usual "btrfs balance" command,
      we already call it with mnt_want_write_file() in btrfs_ioctl_balance().
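      
      Roughly, the fix looks like this in the reclaim worker (sketch):
      
         /*
          * Take freeze protection so a frozen filesystem cannot end up
          * with relocation-created ordered extents still in flight.
          */
         sb_start_write(fs_info->sb);
         ret = btrfs_relocate_chunk(fs_info, bg->start);
         sb_end_write(fs_info->sb);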
      
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      CC: stable@vger.kernel.org # 5.13+
      Link: https://github.com/naota/linux/issues/56
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add support for multiple global roots · f7238e50
      Josef Bacik authored
      With extent tree v2 you will be able to create multiple csum, extent,
      and free space trees.  They will be used based on the block group, which
      will now use the block_group_item->chunk_objectid to point to the set of
      global roots that it will use.  When allocating new block groups we'll
      simply mod the gigabyte offset of the block group against the number of
      global roots we have, and that will be the block group's global root id.
      
      From there we can take the bytenr that we're modifying in the respective
      tree, look up the block group and get that block group's corresponding
      global root id.  From there we can get to the appropriate global root
      for that bytenr.
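      
      Conceptually the mapping is a simple modulo (sketch; the helper name
      and the nr_global_roots field are assumptions based on the description
      above):
      
         static u64 calc_global_root_id(struct btrfs_fs_info *fs_info,
                                        u64 bytenr)
         {
                 u64 id;
      
                 /* Gigabyte index of the offset, modulo the number of
                  * global root sets. */
                 div64_u64_rem(div64_u64(bytenr, SZ_1G),
                               fs_info->nr_global_roots, &id);
                 return id;
         }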
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 31 Jan 2022, 2 commits
    • btrfs: skip reserved bytes warning on unmount after log cleanup failure · 40cdc509
      Filipe Manana authored
      After the recent changes made by commit c2e39305 ("btrfs: clear
      extent buffer uptodate when we fail to write it") and its followup fix,
      commit 651740a5 ("btrfs: check WRITE_ERR when trying to read an
      extent buffer"), we can now end up not cleaning up space reservations of
      log tree extent buffers after a transaction abort happens, as well as not
      cleaning up still dirty extent buffers.
      
      This happens because if writeback for a log tree extent buffer failed,
      then we have cleared the bit EXTENT_BUFFER_UPTODATE from the extent buffer
      and we have also set the bit EXTENT_BUFFER_WRITE_ERR on it. Later on,
      when trying to free the log tree with free_log_tree(), which iterates
      over the tree, we can end up getting an -EIO error when trying to read
      a node or a leaf, since read_extent_buffer_pages() returns -EIO if an
      extent buffer does not have EXTENT_BUFFER_UPTODATE set and has the
      EXTENT_BUFFER_WRITE_ERR bit set. Getting that -EIO means that we return
      immediately as we can not iterate over the entire tree.
      
      In that case we never update the reserved space for an extent buffer in
      the respective block group and space_info object.
      
      When this happens we get the following traces when unmounting the fs:
      
      [174957.284509] BTRFS: error (device dm-0) in cleanup_transaction:1913: errno=-5 IO failure
      [174957.286497] BTRFS: error (device dm-0) in free_log_tree:3420: errno=-5 IO failure
      [174957.399379] ------------[ cut here ]------------
      [174957.402497] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:127 btrfs_put_block_group+0x77/0xb0 [btrfs]
      [174957.407523] Modules linked in: btrfs overlay dm_zero (...)
      [174957.424917] CPU: 2 PID: 3206883 Comm: umount Tainted: G        W         5.16.0-rc5-btrfs-next-109 #1
      [174957.426689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [174957.428716] RIP: 0010:btrfs_put_block_group+0x77/0xb0 [btrfs]
      [174957.429717] Code: 21 48 8b bd (...)
      [174957.432867] RSP: 0018:ffffb70d41cffdd0 EFLAGS: 00010206
      [174957.433632] RAX: 0000000000000001 RBX: ffff8b09c3848000 RCX: ffff8b0758edd1c8
      [174957.434689] RDX: 0000000000000001 RSI: ffffffffc0b467e7 RDI: ffff8b0758edd000
      [174957.436068] RBP: ffff8b0758edd000 R08: 0000000000000000 R09: 0000000000000000
      [174957.437114] R10: 0000000000000246 R11: 0000000000000000 R12: ffff8b09c3848148
      [174957.438140] R13: ffff8b09c3848198 R14: ffff8b0758edd188 R15: dead000000000100
      [174957.439317] FS:  00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
      [174957.440402] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [174957.441164] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
      [174957.442117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [174957.443076] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [174957.443948] Call Trace:
      [174957.444264]  <TASK>
      [174957.444538]  btrfs_free_block_groups+0x255/0x3c0 [btrfs]
      [174957.445238]  close_ctree+0x301/0x357 [btrfs]
      [174957.445803]  ? call_rcu+0x16c/0x290
      [174957.446250]  generic_shutdown_super+0x74/0x120
      [174957.446832]  kill_anon_super+0x14/0x30
      [174957.447305]  btrfs_kill_super+0x12/0x20 [btrfs]
      [174957.447890]  deactivate_locked_super+0x31/0xa0
      [174957.448440]  cleanup_mnt+0x147/0x1c0
      [174957.448888]  task_work_run+0x5c/0xa0
      [174957.449336]  exit_to_user_mode_prepare+0x1e5/0x1f0
      [174957.449934]  syscall_exit_to_user_mode+0x16/0x40
      [174957.450512]  do_syscall_64+0x48/0xc0
      [174957.450980]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [174957.451605] RIP: 0033:0x7f328fdc4a97
      [174957.452059] Code: 03 0c 00 f7 (...)
      [174957.454320] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [174957.455262] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
      [174957.456131] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
      [174957.457118] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
      [174957.458005] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
      [174957.459113] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
      [174957.460193]  </TASK>
      [174957.460534] irq event stamp: 0
      [174957.461003] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [174957.461947] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.463147] softirqs last  enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.465116] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [174957.466323] ---[ end trace bc7ee0c490bce3af ]---
      [174957.467282] ------------[ cut here ]------------
      [174957.468184] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:3976 btrfs_free_block_groups+0x330/0x3c0 [btrfs]
      [174957.470066] Modules linked in: btrfs overlay dm_zero (...)
      [174957.483137] CPU: 2 PID: 3206883 Comm: umount Tainted: G        W         5.16.0-rc5-btrfs-next-109 #1
      [174957.484691] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [174957.486853] RIP: 0010:btrfs_free_block_groups+0x330/0x3c0 [btrfs]
      [174957.488050] Code: 00 00 00 ad de (...)
      [174957.491479] RSP: 0018:ffffb70d41cffde0 EFLAGS: 00010206
      [174957.492520] RAX: ffff8b08d79310b0 RBX: ffff8b09c3848000 RCX: 0000000000000000
      [174957.493868] RDX: 0000000000000001 RSI: fffff443055ee600 RDI: ffffffffb1131846
      [174957.495183] RBP: ffff8b08d79310b0 R08: 0000000000000000 R09: 0000000000000000
      [174957.496580] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8b08d7931000
      [174957.498027] R13: ffff8b09c38492b0 R14: dead000000000122 R15: dead000000000100
      [174957.499438] FS:  00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
      [174957.500990] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [174957.502117] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
      [174957.503513] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [174957.504864] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [174957.506167] Call Trace:
      [174957.506654]  <TASK>
      [174957.507047]  close_ctree+0x301/0x357 [btrfs]
      [174957.507867]  ? call_rcu+0x16c/0x290
      [174957.508567]  generic_shutdown_super+0x74/0x120
      [174957.509447]  kill_anon_super+0x14/0x30
      [174957.510194]  btrfs_kill_super+0x12/0x20 [btrfs]
      [174957.511123]  deactivate_locked_super+0x31/0xa0
      [174957.511976]  cleanup_mnt+0x147/0x1c0
      [174957.512610]  task_work_run+0x5c/0xa0
      [174957.513309]  exit_to_user_mode_prepare+0x1e5/0x1f0
      [174957.514231]  syscall_exit_to_user_mode+0x16/0x40
      [174957.515069]  do_syscall_64+0x48/0xc0
      [174957.515718]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [174957.516688] RIP: 0033:0x7f328fdc4a97
      [174957.517413] Code: 03 0c 00 f7 d8 (...)
      [174957.521052] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [174957.522514] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
      [174957.523950] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
      [174957.525375] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
      [174957.526763] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
      [174957.528058] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
      [174957.529404]  </TASK>
      [174957.529843] irq event stamp: 0
      [174957.530256] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [174957.531061] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.532075] softirqs last  enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.533083] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [174957.533865] ---[ end trace bc7ee0c490bce3b0 ]---
      [174957.534452] BTRFS info (device dm-0): space_info 4 has 1070841856 free, is not full
      [174957.535404] BTRFS info (device dm-0): space_info total=1073741824, used=2785280, pinned=0, reserved=49152, may_use=0, readonly=65536 zone_unusable=0
      [174957.537029] BTRFS info (device dm-0): global_block_rsv: size 0 reserved 0
      [174957.537859] BTRFS info (device dm-0): trans_block_rsv: size 0 reserved 0
      [174957.538697] BTRFS info (device dm-0): chunk_block_rsv: size 0 reserved 0
      [174957.539552] BTRFS info (device dm-0): delayed_block_rsv: size 0 reserved 0
      [174957.540403] BTRFS info (device dm-0): delayed_refs_rsv: size 0 reserved 0
      
      This also means that in case we have log tree extent buffers that are
      still dirty, we can end up not cleaning them up in case we find an
      extent buffer with EXTENT_BUFFER_WRITE_ERR set on it, as in that case
      we have no way of iterating over the rest of the tree.
      
      This issue is very often triggered with test cases generic/475 and
      generic/648 from fstests.
      
      The issue could almost be fixed by iterating over the io tree attached to
      each log root which keeps tracks of the range of allocated extent buffers,
      log_root->dirty_log_pages, however that does not work and has some
      inconveniences:
      
      1) After we sync the log, we clear the range of the extent buffers from
         the io tree, so we can't find them after writeback. We could keep the
         ranges in the io tree, with a separate bit to signal they represent
         extent buffers already written, but that means we need to hold on to
         more memory until the transaction commits.
      
         How much more memory is used depends a lot on whether we are able to
         allocate contiguous extent buffers on disk (and how often) for a log
         tree - if we are able to, then a single extent state record can
         represent multiple extent buffers, otherwise we need multiple extent
         state record structures to track each extent buffer.
         In fact, my earlier approach did that:
      
         https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bbd.1641300250.git.fdmanana@suse.com/
      
         However that can cause a very significant negative impact on
         performance, not only due to the extra memory usage but also because
         we get a larger and deeper dirty_log_pages io tree.
         We got a report that, on beefy machines at least, we can get such
         performance drop with fsmark for example:
      
         https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9020/
      
      2) We would be doing it only to deal with an unexpected and exceptional
         case, which is basically failure to read an extent buffer from disk
         due to IO failures. On a healthy system we don't expect transaction
         aborts to happen after all;
      
      3) Instead of relying on iterating the log tree or tracking the ranges
         of extent buffers in the dirty_log_pages io tree, using the radix
         tree that tracks extent buffers (fs_info->buffer_radix) to find all
         log tree extent buffers is not reliable either, because after writeback
         of an extent buffer it can be evicted from memory by the release page
         callback of the btree inode (btree_releasepage()).
      
      Since there's no way to properly clean up a log tree without being able
      to read its extent buffers from disk and without using more memory to
      track the logical ranges of the allocated extent buffers, do the
      following:
      
      1) When we fail to cleanup a log tree, setup a flag that indicates that
         failure;
      
      2) Trigger writeback of all log tree extent buffers that are still dirty,
         and wait for the writeback to complete. This is just to clean up their
         state, page states, page leaks, etc;
      
      3) When unmounting the fs, ignore if the number of bytes reserved in a
         block group and in a space_info is not 0 if, and only if, we failed to
         cleanup a log tree. Also ignore only for metadata block groups and the
         metadata space_info object.
      
      This is far from a perfect solution, but it serves to silence test
      failures such as those from generic/475 and generic/648. However, having
      a non-zero value for the reserved bytes counters on unmount after a
      transaction abort is not such a terrible thing: it's completely harmless
      and does not affect the filesystem integrity in any way.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't start transaction for scrub if the fs is mounted read-only · 2d192fc4
      Qu Wenruo authored
      [BUG]
      The following super simple script would crash btrfs at unmount time, if
      CONFIG_BTRFS_ASSERT is set.
      
       mkfs.btrfs -f $dev
       mount $dev $mnt
       xfs_io -f -c "pwrite 0 4k" $mnt/file
       umount $mnt
       mount -o ro $dev $mnt
       btrfs scrub start -Br $mnt
       umount $mnt
      
      This will trigger the following ASSERT() introduced by commit
      0a31daa4 ("btrfs: add assertion for empty list of transactions at
      late stage of umount").
      
      That patch is definitely not the cause, it just makes enough noise for
      developers.
      
      [CAUSE]
      We will start transaction for the following call chain during scrub:
      
        scrub_enumerate_chunks()
        |- btrfs_inc_block_group_ro()
           |- btrfs_join_transaction()
      
      However, for a read-only mount there is no running transaction at all,
      thus btrfs_join_transaction() will start a new transaction.
      
      Furthermore, since it's a read-only mount, btrfs_sync_fs() will not call
      btrfs_commit_super() to commit the new but empty transaction, which
      leads to the ASSERT().
      
      The bug has been there for a long time. Only the new ASSERT() makes it
      noisy enough to be noticed.
      
      [FIX]
      For read-only scrub on read-only mount, there is no need to start a
      transaction nor to allocate new chunks in btrfs_inc_block_group_ro().
      
      Just do an extra read-only mount check in btrfs_inc_block_group_ro(),
      and if the filesystem is read-only, skip all chunk allocation and call
      inc_block_group_ro() directly.
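      
      A sketch of the added check, early in btrfs_inc_block_group_ro()
      (simplified):
      
         /*
          * On a read-only mount there is no transaction to join and no
          * chunk to allocate, so mark the block group read-only directly.
          */
         if (sb_rdonly(fs_info->sb)) {
                 mutex_lock(&fs_info->ro_block_group_mutex);
                 ret = inc_block_group_ro(cache, 0);
                 mutex_unlock(&fs_info->ro_block_group_mutex);
                 return ret;
         }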
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 07 Jan 2022, 1 commit
    • btrfs: make send work with concurrent block group relocation · d96b3424
      Filipe Manana authored
      We don't allow send and balance/relocation to run in parallel in order
      to prevent send failing or silently producing some bad stream. This is
      because while send is using an extent (especially metadata) or about to
      read a metadata extent, expecting it to belong to a specific parent
      node, relocation can run, the transaction used for the relocation is
      committed and the extent gets reallocated while send is still using the
      extent, so it ends up with a different content than expected. This can
      result in just failing to read a metadata extent due to failure of the
      validation checks (parent transid, level, etc), failure to find a
      backreference for a data extent, and other unexpected failures. Besides
      reallocation, there's also a similar problem of an extent getting
      discarded when it's unpinned after the transaction used for block group
      relocation is committed.
      
      The restriction between balance and send was added in commit 9e967495
      ("Btrfs: prevent send failures and crashes due to concurrent relocation"),
      kernel 5.3, while the more general restriction between send and relocation
      was added in commit 1cea5cf0 ("btrfs: ensure relocation never runs
      while we have send operations running"), kernel 5.14.
      
      Both send and relocation can be very long running operations. Relocation
      because it has to do a lot of IO and expensive backreference lookups in
      case there are many snapshots, and send due to read IO when operating on
      very large trees. This makes it inconvenient for users and tools to deal
      with scheduling both operations.
      
      For zoned filesystem we also have automatic block group relocation, so
      send can fail with -EAGAIN when users least expect it or send can end up
      delaying the block group relocation for too long. In the future we might
      also get the automatic block group relocation for non zoned filesystems.
      
      This change makes it possible for send and relocation to run in parallel.
      This is achieved the following way:
      
      1) For all tree searches, send acquires a read lock on the commit root
         semaphore;
      
      2) After each tree search, and before releasing the commit root semaphore,
         the leaf is cloned and placed in the search path (struct btrfs_path);
      
      3) After releasing the commit root semaphore, the changed_cb() callback
         is invoked, which operates on the leaf and writes commands to the pipe
         (or file in case send/receive is not used with a pipe). It's important
         here to not hold a lock on the commit root semaphore, because if we did
         we could deadlock when sending and receiving to the same filesystem
         using a pipe - the send task blocks on the pipe because it's full, the
         receive task, which is the only consumer of the pipe, triggers a
         transaction commit when attempting to create a subvolume or reserve
         space for a write operation for example, but the transaction commit
         blocks trying to write lock the commit root semaphore, resulting in a
         deadlock;
      
      4) Before moving to the next key, or advancing to the next change in case
         of an incremental send, check if a transaction used for relocation was
         committed (or is about to finish its commit). If so, release the search
         path(s) and restart the search, to where we were before, so that we
         don't operate on stale extent buffers. The search restarts are always
         possible because both the send and parent roots are RO, and no one can
         add, remove or update keys (change their offset) in RO trees - the
         only exception is deduplication, but that is still not allowed to run
         in parallel with send;
      
      5) Periodically check if there is contention on the commit root semaphore,
         which means there is a transaction commit trying to write lock it, and
         if so release the semaphore and reschedule, so as to avoid causing any
         significant delays to transaction commits - see the sketch below.
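      
      A sketch of the contention check from item 5 (simplified from the send
      loop; the bookkeeping needed to restart the search is omitted):
      
         /* Back off if a transaction commit wants the semaphore. */
         if (need_resched() ||
             rwsem_is_contended(&fs_info->commit_root_sem)) {
                 btrfs_release_path(path);
                 up_read(&fs_info->commit_root_sem);
                 cond_resched();
                 down_read(&fs_info->commit_root_sem);
                 /* ... restart the search from the last processed key ... */
         }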
      
      This leaves some room for optimizations for send to have fewer path
      releases and less re-searching of the trees when there's relocation
      running, but for now it's kept simple as it performs quite well (on very
      large trees with resulting send streams in the order of a few hundred
      gigabytes).
      
      Test case btrfs/187, from fstests, stresses relocation, send and
      deduplication attempting to run in parallel, but without verifying if send
      succeeds and if it produces correct streams. A new test case will be added
      that exercises relocation happening in parallel with send and then checks
      that send succeeds and the resulting streams are correct.
      
      A final note is that for now this still leaves the mutual exclusion
      between send operations and deduplication on files belonging to a root
      used by send operations. A solution for that will be slightly more complex
      but it will eventually be built on top of this change.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 03 Jan 2022, 4 commits
  8. 27 Oct 2021, 11 commits
    • btrfs: update comments for chunk allocation -ENOSPC cases · ecd84d54
      Filipe Manana authored
      Update the comments at btrfs_chunk_alloc() and do_chunk_alloc() that
      describe which cases can lead to a failure to allocate metadata and system
      space despite having previously reserved space. This adds one more reason
      that I previously forgot to mention.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix deadlock between chunk allocation and chunk btree modifications · 2bb2e00e
      Filipe Manana authored
      When a task is doing some modification to the chunk btree and it is not in
      the context of a chunk allocation or a chunk removal, it can deadlock with
      another task that is currently allocating a new data or metadata chunk.
      
      These contexts are the following:
      
      * When relocating a system chunk, when we need to COW the extent buffers
        that belong to the chunk btree;
      
      * When adding a new device (ioctl), where we need to add a new device item
        to the chunk btree;
      
      * When removing a device (ioctl), where we need to remove a device item
        from the chunk btree;
      
      * When resizing a device (ioctl), where we need to update a device item in
        the chunk btree and may need to relocate a system chunk that lies beyond
        the new device size when shrinking a device.
      
      The problem happens due to a sequence of steps like the following:
      
      1) Task A starts a data or metadata chunk allocation and it locks the
         chunk mutex;
      
      2) Task B is relocating a system chunk, and when it needs to COW an extent
         buffer of the chunk btree, it has locked both that extent buffer as
         well as its parent extent buffer;
      
      3) Since there is not enough available system space, either because none
         of the existing system block groups have enough free space or because
         the only one with enough free space is in RO mode due to the relocation,
         task B triggers a new system chunk allocation. It blocks when trying to
         acquire the chunk mutex, currently held by task A;
      
      4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
         the new chunk item into the chunk btree and update the existing device
         items there. But in order to do that, it has to lock the extent buffer
         that task B locked at step 2, or its parent extent buffer, but task B
         is waiting on the chunk mutex, which is currently locked by task A,
         therefore resulting in a deadlock.
      
      One example report when the deadlock happens with system chunk relocation:
      
        INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
        Workqueue: events_unbound btrfs_async_reclaim_metadata_space
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
         __down_read_common kernel/locking/rwsem.c:1214 [inline]
         __down_read kernel/locking/rwsem.c:1223 [inline]
         down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
         __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
         btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
         btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
         btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
         btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
         btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
         btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
         do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
         btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
         flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
         btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
         process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
         worker_thread+0x90/0xed0 kernel/workqueue.c:2444
         kthread+0x3e5/0x4d0 kernel/kthread.c:319
         ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
        INFO: task syz-executor:9107 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
         __mutex_lock_common kernel/locking/mutex.c:669 [inline]
         __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
         btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
         find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
         find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
         btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
         btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
         __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
         btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
         btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
         relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
         relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
         relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
         btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
         btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
         __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
         btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
         btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
         btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:874 [inline]
         __se_sys_ioctl fs/ioctl.c:860 [inline]
         __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      So fix this by making sure that whenever we try to modify the chunk btree
      and we are neither in a chunk allocation context nor in a chunk remove
      context, we reserve system space before modifying the chunk btree.
      Reported-by: Hao Sun <sunhao.th@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
      Fixes: 79bd3712 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
      CC: stable@vger.kernel.org # 5.14+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: use greedy gc for auto reclaim · 2ca0ec77
      Johannes Thumshirn authored
      Currently auto reclaim of unusable zones reclaims the block groups in
      the order they have been added to the reclaim list.
      
      Change this to a greedy algorithm by sorting the list so that the block
      groups with the least amount of valid bytes are reclaimed first.
      
      Note: we can't splice the block groups from reclaim_bgs to let the sort
      happen outside of the lock. The block groups can be still in use by
      other parts eg. via bg_list and we must hold unused_bgs_lock while
      processing them.
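      
      The sort itself can be done with the kernel's list_sort() while holding
      unused_bgs_lock, with a comparison helper along these lines (sketch):
      
         /* Sort so block groups with the least used bytes come first. */
         static int reclaim_bgs_cmp(void *unused, const struct list_head *a,
                                    const struct list_head *b)
         {
                 const struct btrfs_block_group *bg1, *bg2;
      
                 bg1 = list_entry(a, struct btrfs_block_group, bg_list);
                 bg2 = list_entry(b, struct btrfs_block_group, bg_list);
      
                 return bg1->used > bg2->used;
         }
      
         /* Called under unused_bgs_lock: */
         list_sort(NULL, &fs_info->reclaim_bgs, reclaim_bgs_cmp);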
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ write note and comment why we can't splice the list ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reduce btrfs_update_block_group alloc argument to bool · 11b66fa6
      Anand Jain authored
      btrfs_update_block_group() accounts for the number of bytes allocated or
      freed. Argument @alloc specifies whether the call is for alloc or free.
      Convert the argument @alloc type from int to bool.
      Reviewed-by: Su Yue <l@damenly.su>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: add a dedicated data relocation block group · c2707a25
      Johannes Thumshirn authored
      Relocation in a zoned filesystem can fail with a transaction abort with
      error -22 (EINVAL). This happens because the relocation code assumes that
      the extents we relocated the data to have the same size the source extents
      had and ensures this by preallocating the extents.
      
      But in a zoned filesystem we currently can't preallocate the extents as
      this would break the sequential write required rule. Therefore it can
      happen that the writeback process kicks in while we're still adding pages
      to a delalloc range and starts writing out dirty pages.
      
      This then creates destination extents that are smaller than the source
      extents, triggering the following safety check in get_new_location():
      
        if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {
                ret = -EINVAL;
                goto out;
        }
      
      Temporarily create a dedicated block group for the relocation process, so
      no non-relocation data writes can interfere with the relocation writes.
      
      This is needed so that we can switch the relocation process on a zoned
      filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a
      scheme like that of a non-zoned filesystem, using REQ_OP_WRITE and
      preallocation.
      
      Fixes: 32430c61 ("btrfs: zoned: enable relocation on a zoned filesystem")
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: activate new block group · eb66a010
      Naohiro Aota authored
      Activate a new block group at btrfs_make_block_group(). We do not check
      the return value; if activation fails, we can try again later, at the
      actual extent allocation phase.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: implement active zone tracking · afba2bc0
      Naohiro Aota authored
      Add a zone_is_active flag to btrfs_block_group. This flag indicates that
      the underlying zones are all active. Such zone-active block groups are
      tracked by fs_info->active_bg_list.
      
      btrfs_dev_{set,clear}_active_zone() take responsibility for the underlying
      device part. They set/clear the bitmap to indicate zone activeness and
      count the number of zones we can activate left.
      
      btrfs_zone_{activate,finish}() take responsibility for the logical part
      and the list management. In addition, btrfs_zone_finish() waits for any
      writes on the block group and sends REQ_OP_ZONE_FINISH to the zone.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: introduce physical_map to btrfs_block_group · dafc340d
      Naohiro Aota authored
      We will use a block group's physical location to track active zones and
      finish fully written zones in the following commits. Since the zone
      activation is done in the extent allocation context, which is already
      holding the tree locks, we can't query the chunk tree for the physical
      locations.
      So, copy the location info into a block group and use it for activation.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: calculate free space from zone capacity · 98173255
      Naohiro Aota authored
      Now that we have introduced capacity in a block group, we need to
      calculate free space using the capacity instead of the length. Thus, we
      account capacity - alloc_pointer bytes as free, and account the bytes in
      [capacity, length] as zone unusable.
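      
      In other words, for a freshly allocated zoned block group (sketch;
      field names as introduced by this series):
      
         u64 free_bytes    = bg->zone_capacity - bg->alloc_offset;
         u64 zone_unusable = bg->length - bg->zone_capacity;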
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: move btrfs_free_excluded_extents out of btrfs_calc_zone_unusable · c46c4247
      Naohiro Aota authored
      btrfs_free_excluded_extents() is not necessary for
      btrfs_calc_zone_unusable() and it makes btrfs_calc_zone_unusable()
      difficult to reuse. Move it out and call btrfs_free_excluded_extents()
      in the proper context.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename and switch to bool btrfs_chunk_readonly · a09f23c3
      Anand Jain authored
      btrfs_chunk_readonly() checks if the given chunk is writeable. It
      returns 1 for readonly and 0 for writeable, so a bool return type
      suffices instead of the current int.
      
      Also, rename btrfs_chunk_readonly() to btrfs_chunk_writeable(), as we
      check if the block group is writeable; this keeps the logic at the
      parent function simpler to understand.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 26 Oct 2021, 1 commit
  10. 23 Aug 2021, 3 commits
  11. 07 Jul 2021, 2 commits
    • btrfs: don't block if we can't acquire the reclaim lock · 9cc0b837
      Johannes Thumshirn authored
      If we can't acquire the reclaim_bgs_lock on block group reclaim, we
      block until it is free. This can potentially stall for a long time.
      
      While reclaim of block groups is necessary for a good user experience on
      a zoned file system, there is still no need to block, as reclaim is best
      effort only, just like when we're deleting unused block groups.
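      
      The non-blocking variant is essentially a one-line change (sketch):
      
         /* Skip this reclaim cycle instead of blocking on the mutex. */
         if (!mutex_trylock(&fs_info->reclaim_bgs_lock))
                 return;
         /* ... reclaim queued block groups ... */
         mutex_unlock(&fs_info->reclaim_bgs_lock);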
      
      CC: stable@vger.kernel.org # 5.13
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rework chunk allocation to avoid exhaustion of the system chunk array · 79bd3712
      Filipe Manana authored
      Commit eafa4fd0 ("btrfs: fix exhaustion of the system chunk array
      due to concurrent allocations") fixed a problem that resulted in
      exhausting the system chunk array in the superblock when there are many
      tasks allocating chunks in parallel. Basically too many tasks enter the
      first phase of chunk allocation without previous tasks having finished
      their second phase of allocation, resulting in too many system chunks
      being allocated. That was originally observed when running the fallocate
      tests of stress-ng on a PowerPC machine, using a node size of 64K.
      
      However that commit also introduced a deadlock where a task in phase 1 of
      the chunk allocation waited for another task that had allocated a system
      chunk to finish its phase 2, but that other task was waiting on an extent
      buffer lock held by the first task, therefore resulting in both tasks not
      making any progress. That change was later reverted by a patch with the
      subject "btrfs: fix deadlock with concurrent chunk allocations involving
      system chunks", since there is no simple and short solution to address it
      and the deadlock is relatively easy to trigger on zoned filesystems, while
      the system chunk array exhaustion is not so common.
      
      This change reworks the chunk allocation to avoid the system chunk array
      exhaustion. It accomplishes that by making the first phase of chunk
      allocation do the updates of the device items in the chunk btree and the
      insertion of the new chunk item in the chunk btree. This is done while
      under the protection of the chunk mutex (fs_info->chunk_mutex), in the
      same critical section that checks for available system space, allocates
      a new system chunk if needed and reserves system chunk space. This way
      we do not have chunk space reserved until the second phase completes.
      
      The same logic is applied to chunk removal as well, since it keeps
      reserved system space long after it is done updating the chunk btree.
      
      For direct allocation of system chunks, the previous behaviour remains,
      because otherwise we would deadlock on extent buffers of the chunk btree.
      Changes to the chunk btree are by and large done by chunk allocation and chunk
      removal, which first reserve chunk system space and then later do changes
      to the chunk btree. The other remaining cases are uncommon and correspond
      to adding a device, removing a device and resizing a device. All these
      other cases do not pre-reserve system space, they modify the chunk btree
      right away, so they don't hold reserved space for a long period like chunk
      allocation and chunk removal do.
      
      The diff of this change is huge, but more than half of it is just addition
      of comments describing both how things work regarding chunk allocation and
      removal, including both the new behavior and the parts of the old behavior
      that did not change.
      
      CC: stable@vger.kernel.org # 5.12+
      Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>