1. 16 May 2022, 3 commits
  2. 06 Apr 2022, 3 commits
  3. 14 Mar 2022, 2 commits
    • btrfs: zoned: mark relocation as writing · ca5e4ea0
      Naohiro Aota committed
      There is a hung_task issue with running generic/068 on an SMR
      device. The hang occurs while a process is trying to thaw the
      filesystem. The process is trying to take sb->s_umount to thaw the
      FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is
      waiting for an ordered extent to finish. However, as the FS is frozen,
      the ordered extents never finish.
      
      Having an ordered extent while the FS is frozen is the root cause of
      the hang. The ordered extent is initiated from btrfs_relocate_chunk()
      which is called from btrfs_reclaim_bgs_work().
      
      This commit adds sb_*_write() around the btrfs_relocate_chunk() call
      site. For the usual "btrfs balance" command, we already take the
      equivalent protection with mnt_want_write_file() in btrfs_ioctl_balance().
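
      A minimal sketch of the shape of the fix, assuming it is applied around
      the call in btrfs_reclaim_bgs_work() (context abbreviated, not the
      literal patch):

        /* fs/btrfs/block-group.c, btrfs_reclaim_bgs_work(), sketched */
        sb_start_write(fs_info->sb);    /* blocks if the FS is being frozen,
                                           like mnt_want_write_file() does for
                                           the balance ioctl */
        ret = btrfs_relocate_chunk(fs_info, bg->start);
        sb_end_write(fs_info->sb);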
      
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      CC: stable@vger.kernel.org # 5.13+
      Link: https://github.com/naota/linux/issues/56
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ca5e4ea0
    • btrfs: add support for multiple global roots · f7238e50
      Josef Bacik committed
      With extent tree v2 you will be able to create multiple csum, extent,
      and free space trees.  They will be used based on the block group, which
      will now use the block_group_item->chunk_objectid to point to the set of
      global roots that it will use.  When allocating new block groups we'll
      simply mod the gigabyte offset of the block group against the number of
      global roots we have, and that will be the block group's global id.
      
      From there we can take the bytenr that we're modifying in the respective
      tree, look up the block group and get that block group's corresponding
      global root id.  From there we can get to the appropriate global root
      for that bytenr.
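
      As an illustration only (the constant and field names below are
      assumptions, not taken from the patch), the mapping boils down to a
      modulo on the block group's 1GiB offset:

        /* Hypothetical helper: pick a global root set for a block group. */
        static u64 bg_global_root_id(struct btrfs_fs_info *fs_info, u64 bg_start)
        {
                /* nr_global_roots is an assumed field name for illustration */
                return div_u64(bg_start, SZ_1G) % fs_info->nr_global_roots;
        }
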
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f7238e50
  4. 31 Jan 2022, 2 commits
    • btrfs: skip reserved bytes warning on unmount after log cleanup failure · 40cdc509
      Filipe Manana committed
      After the recent changes made by commit c2e39305 ("btrfs: clear
      extent buffer uptodate when we fail to write it") and its followup fix,
      commit 651740a5 ("btrfs: check WRITE_ERR when trying to read an
      extent buffer"), we can now end up not cleaning up space reservations of
      log tree extent buffers after a transaction abort happens, as well as not
      cleaning up still dirty extent buffers.
      
      This happens because if writeback for a log tree extent buffer failed,
      then we have cleared the bit EXTENT_BUFFER_UPTODATE from the extent buffer
      and we have also set the bit EXTENT_BUFFER_WRITE_ERR on it. Later on,
      when trying to free the log tree with free_log_tree(), which iterates
      over the tree, we can end up getting an -EIO error when trying to read
      a node or a leaf, since read_extent_buffer_pages() returns -EIO if an
      extent buffer does not have EXTENT_BUFFER_UPTODATE set and has the
      EXTENT_BUFFER_WRITE_ERR bit set. Getting that -EIO means that we return
      immediately, as we cannot iterate over the entire tree.
      
      In that case we never update the reserved space for an extent buffer in
      the respective block group and space_info object.
      
      When this happens we get the following traces when unmounting the fs:
      
      [174957.284509] BTRFS: error (device dm-0) in cleanup_transaction:1913: errno=-5 IO failure
      [174957.286497] BTRFS: error (device dm-0) in free_log_tree:3420: errno=-5 IO failure
      [174957.399379] ------------[ cut here ]------------
      [174957.402497] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:127 btrfs_put_block_group+0x77/0xb0 [btrfs]
      [174957.407523] Modules linked in: btrfs overlay dm_zero (...)
      [174957.424917] CPU: 2 PID: 3206883 Comm: umount Tainted: G        W         5.16.0-rc5-btrfs-next-109 #1
      [174957.426689] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [174957.428716] RIP: 0010:btrfs_put_block_group+0x77/0xb0 [btrfs]
      [174957.429717] Code: 21 48 8b bd (...)
      [174957.432867] RSP: 0018:ffffb70d41cffdd0 EFLAGS: 00010206
      [174957.433632] RAX: 0000000000000001 RBX: ffff8b09c3848000 RCX: ffff8b0758edd1c8
      [174957.434689] RDX: 0000000000000001 RSI: ffffffffc0b467e7 RDI: ffff8b0758edd000
      [174957.436068] RBP: ffff8b0758edd000 R08: 0000000000000000 R09: 0000000000000000
      [174957.437114] R10: 0000000000000246 R11: 0000000000000000 R12: ffff8b09c3848148
      [174957.438140] R13: ffff8b09c3848198 R14: ffff8b0758edd188 R15: dead000000000100
      [174957.439317] FS:  00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
      [174957.440402] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [174957.441164] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
      [174957.442117] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [174957.443076] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [174957.443948] Call Trace:
      [174957.444264]  <TASK>
      [174957.444538]  btrfs_free_block_groups+0x255/0x3c0 [btrfs]
      [174957.445238]  close_ctree+0x301/0x357 [btrfs]
      [174957.445803]  ? call_rcu+0x16c/0x290
      [174957.446250]  generic_shutdown_super+0x74/0x120
      [174957.446832]  kill_anon_super+0x14/0x30
      [174957.447305]  btrfs_kill_super+0x12/0x20 [btrfs]
      [174957.447890]  deactivate_locked_super+0x31/0xa0
      [174957.448440]  cleanup_mnt+0x147/0x1c0
      [174957.448888]  task_work_run+0x5c/0xa0
      [174957.449336]  exit_to_user_mode_prepare+0x1e5/0x1f0
      [174957.449934]  syscall_exit_to_user_mode+0x16/0x40
      [174957.450512]  do_syscall_64+0x48/0xc0
      [174957.450980]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [174957.451605] RIP: 0033:0x7f328fdc4a97
      [174957.452059] Code: 03 0c 00 f7 (...)
      [174957.454320] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [174957.455262] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
      [174957.456131] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
      [174957.457118] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
      [174957.458005] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
      [174957.459113] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
      [174957.460193]  </TASK>
      [174957.460534] irq event stamp: 0
      [174957.461003] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [174957.461947] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.463147] softirqs last  enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.465116] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [174957.466323] ---[ end trace bc7ee0c490bce3af ]---
      [174957.467282] ------------[ cut here ]------------
      [174957.468184] WARNING: CPU: 2 PID: 3206883 at fs/btrfs/block-group.c:3976 btrfs_free_block_groups+0x330/0x3c0 [btrfs]
      [174957.470066] Modules linked in: btrfs overlay dm_zero (...)
      [174957.483137] CPU: 2 PID: 3206883 Comm: umount Tainted: G        W         5.16.0-rc5-btrfs-next-109 #1
      [174957.484691] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [174957.486853] RIP: 0010:btrfs_free_block_groups+0x330/0x3c0 [btrfs]
      [174957.488050] Code: 00 00 00 ad de (...)
      [174957.491479] RSP: 0018:ffffb70d41cffde0 EFLAGS: 00010206
      [174957.492520] RAX: ffff8b08d79310b0 RBX: ffff8b09c3848000 RCX: 0000000000000000
      [174957.493868] RDX: 0000000000000001 RSI: fffff443055ee600 RDI: ffffffffb1131846
      [174957.495183] RBP: ffff8b08d79310b0 R08: 0000000000000000 R09: 0000000000000000
      [174957.496580] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8b08d7931000
      [174957.498027] R13: ffff8b09c38492b0 R14: dead000000000122 R15: dead000000000100
      [174957.499438] FS:  00007f328fb82800(0000) GS:ffff8b0a2d200000(0000) knlGS:0000000000000000
      [174957.500990] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [174957.502117] CR2: 00007fff13563e98 CR3: 0000000404f4e005 CR4: 0000000000370ee0
      [174957.503513] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [174957.504864] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [174957.506167] Call Trace:
      [174957.506654]  <TASK>
      [174957.507047]  close_ctree+0x301/0x357 [btrfs]
      [174957.507867]  ? call_rcu+0x16c/0x290
      [174957.508567]  generic_shutdown_super+0x74/0x120
      [174957.509447]  kill_anon_super+0x14/0x30
      [174957.510194]  btrfs_kill_super+0x12/0x20 [btrfs]
      [174957.511123]  deactivate_locked_super+0x31/0xa0
      [174957.511976]  cleanup_mnt+0x147/0x1c0
      [174957.512610]  task_work_run+0x5c/0xa0
      [174957.513309]  exit_to_user_mode_prepare+0x1e5/0x1f0
      [174957.514231]  syscall_exit_to_user_mode+0x16/0x40
      [174957.515069]  do_syscall_64+0x48/0xc0
      [174957.515718]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [174957.516688] RIP: 0033:0x7f328fdc4a97
      [174957.517413] Code: 03 0c 00 f7 d8 (...)
      [174957.521052] RSP: 002b:00007fff13564ec8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [174957.522514] RAX: 0000000000000000 RBX: 00007f328feea264 RCX: 00007f328fdc4a97
      [174957.523950] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000560b8ae51dd0
      [174957.525375] RBP: 0000560b8ae51ba0 R08: 0000000000000000 R09: 00007fff13563c40
      [174957.526763] R10: 00007f328fe49fc0 R11: 0000000000000246 R12: 0000000000000000
      [174957.528058] R13: 0000560b8ae51dd0 R14: 0000560b8ae51cb0 R15: 0000000000000000
      [174957.529404]  </TASK>
      [174957.529843] irq event stamp: 0
      [174957.530256] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [174957.531061] hardirqs last disabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.532075] softirqs last  enabled at (0): [<ffffffffb0e94214>] copy_process+0x934/0x2040
      [174957.533083] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [174957.533865] ---[ end trace bc7ee0c490bce3b0 ]---
      [174957.534452] BTRFS info (device dm-0): space_info 4 has 1070841856 free, is not full
      [174957.535404] BTRFS info (device dm-0): space_info total=1073741824, used=2785280, pinned=0, reserved=49152, may_use=0, readonly=65536 zone_unusable=0
      [174957.537029] BTRFS info (device dm-0): global_block_rsv: size 0 reserved 0
      [174957.537859] BTRFS info (device dm-0): trans_block_rsv: size 0 reserved 0
      [174957.538697] BTRFS info (device dm-0): chunk_block_rsv: size 0 reserved 0
      [174957.539552] BTRFS info (device dm-0): delayed_block_rsv: size 0 reserved 0
      [174957.540403] BTRFS info (device dm-0): delayed_refs_rsv: size 0 reserved 0
      
      This also means that in case we have log tree extent buffers that are
      still dirty, we can end up not cleaning them up in case we find an
      extent buffer with EXTENT_BUFFER_WRITE_ERR set on it, as in that case
      we have no way to iterate over the rest of the tree.
      
      This issue is very often triggered with test cases generic/475 and
      generic/648 from fstests.
      
      The issue could almost be fixed by iterating over the io tree attached to
      each log root, which keeps track of the ranges of allocated extent
      buffers (log_root->dirty_log_pages), however that does not work and has
      some inconveniences:
      
      1) After we sync the log, we clear the range of the extent buffers from
         the io tree, so we can't find them after writeback. We could keep the
         ranges in the io tree, with a separate bit to signal they represent
         extent buffers already written, but that means we need to hold on to
         more memory until the transaction commits.
      
         How much more memory is used depends a lot on whether we are able to
         allocate contiguous extent buffers on disk (and how often) for a log
         tree - if we are able to, then a single extent state record can
         represent multiple extent buffers, otherwise we need multiple extent
         state record structures to track each extent buffer.
         In fact, my earlier approach did that:
      
         https://lore.kernel.org/linux-btrfs/3aae7c6728257c7ce2279d6660ee2797e5e34bbd.1641300250.git.fdmanana@suse.com/
      
         However that can cause a very significant negative impact on
         performance, not only due to the extra memory usage but also because
         we get a larger and deeper dirty_log_pages io tree.
         We got a report that, on beefy machines at least, we can get such
         performance drop with fsmark for example:
      
         https://lore.kernel.org/linux-btrfs/20220117082426.GE32491@xsang-OptiPlex-9020/
      
      2) We would be doing it only to deal with an unexpected and exceptional
         case, which is basically failure to read an extent buffer from disk
         due to IO failures. On a healthy system we don't expect transaction
         aborts to happen after all;
      
      3) Instead of relying on iterating the log tree or tracking the ranges
         of extent buffers in the dirty_log_pages io tree, using the radix
         tree that tracks extent buffers (fs_info->buffer_radix) to find all
         log tree extent buffers is not reliable either, because after writeback
         of an extent buffer it can be evicted from memory by the release page
         callback of the btree inode (btree_releasepage()).
      
      Since there's no way to properly clean up a log tree without being able
      to read its extent buffers from disk and without using more memory to
      track the logical ranges of the allocated extent buffers, do the
      following:
      
      1) When we fail to cleanup a log tree, setup a flag that indicates that
         failure;
      
      2) Trigger writeback of all log tree extent buffers that are still dirty,
         and wait for the writeback to complete. This is just to cleanup their
         state, page states, page leaks, etc;
      
      3) When unmounting the fs, ignore if the number of bytes reserved in a
         block group and in a space_info is not 0 if, and only if, we failed to
         cleanup a log tree. Also ignore only for metadata block groups and the
         metadata space_info object.
      
      This is far from a perfect solution, but it serves to silence test
      failures such as those from generic/475 and generic/648. However, having
      a non-zero value for the reserved bytes counters on unmount after a
      transaction abort is not such a terrible thing: it's completely harmless
      and does not affect the filesystem integrity in any way.
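
      A rough sketch of the idea; the state bit name here is how I'd expect
      it to look, so check the actual patch:

        /* free_log_tree(): remember that we failed to clean up a log tree. */
        if (ret)
                set_bit(BTRFS_FS_STATE_LOG_CLEANUP_ERROR, &fs_info->fs_state);

        /* btrfs_free_block_groups(): warn about leftover reservations only
         * when log cleanup did not fail. */
        if (!test_bit(BTRFS_FS_STATE_LOG_CLEANUP_ERROR, &fs_info->fs_state))
                WARN_ON(space_info->bytes_reserved > 0);
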
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      40cdc509
    • btrfs: don't start transaction for scrub if the fs is mounted read-only · 2d192fc4
      Qu Wenruo committed
      [BUG]
      The following super simple script would crash btrfs at unmount time, if
      CONFIG_BTRFS_ASSERT is enabled.
      
       mkfs.btrfs -f $dev
       mount $dev $mnt
       xfs_io -f -c "pwrite 0 4k" $mnt/file
       umount $mnt
       mount -o ro $dev $mnt
       btrfs scrub start -Br $mnt
       umount $mnt
      
      This will trigger the ASSERT() introduced by commit
      0a31daa4 ("btrfs: add assertion for empty list of transactions at
      late stage of umount").
      
      That patch is definitely not the cause; it just makes the problem noisy
      enough for developers to notice.
      
      [CAUSE]
      We will start a transaction in the following call chain during scrub:
      
        scrub_enumerate_chunks()
        |- btrfs_inc_block_group_ro()
           |- btrfs_join_transaction()
      
      However for RO mount, there is no running transaction at all, thus
      btrfs_join_transaction() will start a new transaction.
      
      Furthermore, since it's a read-only mount, btrfs_sync_fs() will not call
      btrfs_commit_super() to commit the new but empty transaction, and that
      leads to the ASSERT().
      
      The bug has been there for a long time. Only the new ASSERT() makes it
      noisy enough to be noticed.
      
      [FIX]
      For read-only scrub on read-only mount, there is no need to start a
      transaction nor to allocate new chunks in btrfs_inc_block_group_ro().
      
      Just do an extra read-only mount check in btrfs_inc_block_group_ro(), and
      if it's read-only, skip all chunk allocation and go straight to
      inc_block_group_ro().
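
      Roughly, the early path in btrfs_inc_block_group_ro() then looks like
      this simplified sketch:

        /* On a read-only mount nothing can be written, so no chunk
         * allocation and no transaction are needed; just flip the bg RO. */
        if (sb_rdonly(fs_info->sb)) {
                mutex_lock(&fs_info->ro_block_group_mutex);
                ret = inc_block_group_ro(cache, 0);
                mutex_unlock(&fs_info->ro_block_group_mutex);
                return ret;
        }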
      
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2d192fc4
  5. 07 Jan 2022, 1 commit
    • btrfs: make send work with concurrent block group relocation · d96b3424
      Filipe Manana committed
      We don't allow send and balance/relocation to run in parallel in order
      to prevent send failing or silently producing some bad stream. This is
      because while send is using an extent (especially metadata), or is about
      to read a metadata extent expecting that it belongs to a specific parent
      node, relocation can run, the transaction used for the relocation is
      committed and the extent gets reallocated while send is still using the
      extent, so it ends up with different content than expected. This can
      result in just failing to read a metadata extent due to failure of the
      validation checks (parent transid, level, etc), failure to find a
      backreference for a data extent, and other unexpected failures. Besides
      reallocation, there's also a similar problem of an extent getting
      discarded when it's unpinned after the transaction used for block group
      relocation is committed.
      
      The restriction between balance and send was added in commit 9e967495
      ("Btrfs: prevent send failures and crashes due to concurrent relocation"),
      kernel 5.3, while the more general restriction between send and relocation
      was added in commit 1cea5cf0 ("btrfs: ensure relocation never runs
      while we have send operations running"), kernel 5.14.
      
      Both send and relocation can be very long running operations. Relocation
      because it has to do a lot of IO and expensive backreference lookups in
      case there are many snapshots, and send due to read IO when operating on
      very large trees. This makes it inconvenient for users and tools to deal
      with scheduling both operations.
      
      For zoned filesystems we also have automatic block group relocation, so
      send can fail with -EAGAIN when users least expect it, or send can end up
      delaying the block group relocation for too long. In the future we might
      also get automatic block group relocation for non-zoned filesystems.
      
      This change makes it possible for send and relocation to run in parallel.
      This is achieved the following way:
      
      1) For all tree searches, send acquires a read lock on the commit root
         semaphore;
      
      2) After each tree search, and before releasing the commit root semaphore,
         the leaf is cloned and placed in the search path (struct btrfs_path);
      
      3) After releasing the commit root semaphore, the changed_cb() callback
         is invoked, which operates on the leaf and writes commands to the pipe
         (or file in case send/receive is not used with a pipe). It's important
         here to not hold a lock on the commit root semaphore, because if we did
         we could deadlock when sending and receiving to the same filesystem
         using a pipe - the send task blocks on the pipe because it's full, the
         receive task, which is the only consumer of the pipe, triggers a
         transaction commit when attempting to create a subvolume or reserve
         space for a write operation for example, but the transaction commit
         blocks trying to write lock the commit root semaphore, resulting in a
         deadlock;
      
      4) Before moving to the next key, or advancing to the next change in case
         of an incremental send, check if a transaction used for relocation was
         committed (or is about to finish its commit). If so, release the search
         path(s) and restart the search, to where we were before, so that we
         don't operate on stale extent buffers. The search restarts are always
         possible because both the send and parent roots are RO, and no one can
         add, remove or update keys (change their offset) in RO trees - the
         only exception is deduplication, but that is still not allowed to run
         in parallel with send;
      
      5) Periodically check if there is contention on the commit root semaphore,
         which means there is a transaction commit trying to write lock it, and
         release the semaphore and reschedule if there is contention, so as to
         avoid causing any significant delays to transaction commits.
      
      This leaves some room for optimization, so that send does fewer path
      releases and less re-searching of the trees when relocation is running,
      but for now it's kept simple as it performs quite well (on very large
      trees with resulting send streams in the order of a few hundred
      gigabytes).
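
      The periodic contention check from item 5 can be sketched roughly like
      this (inside send's search loop, with commit_root_sem held for reading):

        if (need_resched() || rwsem_is_contended(&fs_info->commit_root_sem)) {
                btrfs_release_path(path);
                up_read(&fs_info->commit_root_sem);
                cond_resched();         /* let the transaction commit proceed */
                down_read(&fs_info->commit_root_sem);
                /* ... restart the search from the last processed key ... */
        }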
      
      Test case btrfs/187, from fstests, stresses relocation, send and
      deduplication attempting to run in parallel, but without verifying if send
      succeeds and if it produces correct streams. A new test case will be added
      that exercises relocation happening in parallel with send and then checks
      that send succeeds and the resulting streams are correct.
      
      A final note is that for now this still leaves the mutual exclusion
      between send operations and deduplication on files belonging to a root
      used by send operations. A solution for that will be slightly more complex
      but it will eventually be built on top of this change.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d96b3424
  6. 03 Jan 2022, 4 commits
  7. 27 Oct 2021, 11 commits
    • btrfs: update comments for chunk allocation -ENOSPC cases · ecd84d54
      Filipe Manana committed
      Update the comments at btrfs_chunk_alloc() and do_chunk_alloc() that
      describe which cases can lead to a failure to allocate metadata and system
      space despite having previously reserved space. This adds one more reason
      that I previously forgot to mention.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ecd84d54
    • btrfs: fix deadlock between chunk allocation and chunk btree modifications · 2bb2e00e
      Filipe Manana committed
      When a task is doing some modification to the chunk btree and it is not in
      the context of a chunk allocation or a chunk removal, it can deadlock with
      another task that is currently allocating a new data or metadata chunk.
      
      These contexts are the following:
      
      * When relocating a system chunk, when we need to COW the extent buffers
        that belong to the chunk btree;
      
      * When adding a new device (ioctl), where we need to add a new device item
        to the chunk btree;
      
      * When removing a device (ioctl), where we need to remove a device item
        from the chunk btree;
      
      * When resizing a device (ioctl), where we need to update a device item in
        the chunk btree and may need to relocate a system chunk that lies beyond
        the new device size when shrinking a device.
      
      The problem happens due to a sequence of steps like the following:
      
      1) Task A starts a data or metadata chunk allocation and it locks the
         chunk mutex;
      
      2) Task B is relocating a system chunk, and when it needs to COW an extent
         buffer of the chunk btree, it has locked both that extent buffer as
         well as its parent extent buffer;
      
      3) Since there is not enough available system space, either because none
         of the existing system block groups have enough free space or because
         the only one with enough free space is in RO mode due to the relocation,
         task B triggers a new system chunk allocation. It blocks when trying to
         acquire the chunk mutex, currently held by task A;
      
      4) Task A enters btrfs_chunk_alloc_add_chunk_item(), in order to insert
         the new chunk item into the chunk btree and update the existing device
         items there. But in order to do that, it has to lock the extent buffer
         that task B locked at step 2, or its parent extent buffer, but task B
         is waiting on the chunk mutex, which is currently locked by task A,
         therefore resulting in a deadlock.
      
      One example report when the deadlock happens with system chunk relocation:
      
        INFO: task kworker/u9:5:546 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:kworker/u9:5    state:D stack:25936 pid:  546 ppid:     2 flags:0x00004000
        Workqueue: events_unbound btrfs_async_reclaim_metadata_space
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         rwsem_down_read_slowpath+0x4ee/0x9d0 kernel/locking/rwsem.c:993
         __down_read_common kernel/locking/rwsem.c:1214 [inline]
         __down_read kernel/locking/rwsem.c:1223 [inline]
         down_read_nested+0xe6/0x440 kernel/locking/rwsem.c:1590
         __btrfs_tree_read_lock+0x31/0x350 fs/btrfs/locking.c:47
         btrfs_tree_read_lock fs/btrfs/locking.c:54 [inline]
         btrfs_read_lock_root_node+0x8a/0x320 fs/btrfs/locking.c:191
         btrfs_search_slot_get_root fs/btrfs/ctree.c:1623 [inline]
         btrfs_search_slot+0x13b4/0x2140 fs/btrfs/ctree.c:1728
         btrfs_update_device+0x11f/0x500 fs/btrfs/volumes.c:2794
         btrfs_chunk_alloc_add_chunk_item+0x34d/0xea0 fs/btrfs/volumes.c:5504
         do_chunk_alloc fs/btrfs/block-group.c:3408 [inline]
         btrfs_chunk_alloc+0x84d/0xf50 fs/btrfs/block-group.c:3653
         flush_space+0x54e/0xd80 fs/btrfs/space-info.c:670
         btrfs_async_reclaim_metadata_space+0x396/0xa90 fs/btrfs/space-info.c:953
         process_one_work+0x9df/0x16d0 kernel/workqueue.c:2297
         worker_thread+0x90/0xed0 kernel/workqueue.c:2444
         kthread+0x3e5/0x4d0 kernel/kthread.c:319
         ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
        INFO: task syz-executor:9107 blocked for more than 143 seconds.
              Not tainted 5.15.0-rc3+ #1
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        task:syz-executor    state:D stack:23200 pid: 9107 ppid:  7792 flags:0x00004004
        Call Trace:
         context_switch kernel/sched/core.c:4940 [inline]
         __schedule+0xcd9/0x2530 kernel/sched/core.c:6287
         schedule+0xd3/0x270 kernel/sched/core.c:6366
         schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:6425
         __mutex_lock_common kernel/locking/mutex.c:669 [inline]
         __mutex_lock+0xc96/0x1680 kernel/locking/mutex.c:729
         btrfs_chunk_alloc+0x31a/0xf50 fs/btrfs/block-group.c:3631
         find_free_extent_update_loop fs/btrfs/extent-tree.c:3986 [inline]
         find_free_extent+0x25cb/0x3a30 fs/btrfs/extent-tree.c:4335
         btrfs_reserve_extent+0x1f1/0x500 fs/btrfs/extent-tree.c:4415
         btrfs_alloc_tree_block+0x203/0x1120 fs/btrfs/extent-tree.c:4813
         __btrfs_cow_block+0x412/0x1620 fs/btrfs/ctree.c:415
         btrfs_cow_block+0x2f6/0x8c0 fs/btrfs/ctree.c:570
         btrfs_search_slot+0x1094/0x2140 fs/btrfs/ctree.c:1768
         relocate_tree_block fs/btrfs/relocation.c:2694 [inline]
         relocate_tree_blocks+0xf73/0x1770 fs/btrfs/relocation.c:2757
         relocate_block_group+0x47e/0xc70 fs/btrfs/relocation.c:3673
         btrfs_relocate_block_group+0x48a/0xc60 fs/btrfs/relocation.c:4070
         btrfs_relocate_chunk+0x96/0x280 fs/btrfs/volumes.c:3181
         __btrfs_balance fs/btrfs/volumes.c:3911 [inline]
         btrfs_balance+0x1f03/0x3cd0 fs/btrfs/volumes.c:4301
         btrfs_ioctl_balance+0x61e/0x800 fs/btrfs/ioctl.c:4137
         btrfs_ioctl+0x39ea/0x7b70 fs/btrfs/ioctl.c:4949
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:874 [inline]
         __se_sys_ioctl fs/ioctl.c:860 [inline]
         __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      So fix this by making sure that whenever we try to modify the chunk btree
      and we are neither in a chunk allocation context nor in a chunk removal
      context, we reserve system space before modifying the chunk btree.
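
      As a sketch, the pattern for those ioctl paths looks like the following;
      the reserve helper's name and signature here are assumptions, not a
      quote of the patch:

        /* E.g. adding a device item outside chunk allocation/removal: */
        btrfs_reserve_chunk_metadata(trans, true);  /* reserve system space for
                                                       the chunk btree change
                                                       (name/signature assumed) */
        ret = btrfs_add_dev_item(trans, device);    /* modifies the chunk btree */
        btrfs_trans_release_chunk_metadata(trans);
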
      Reported-by: Hao Sun <sunhao.th@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CACkBjsax51i4mu6C0C3vJqQN3NR_iVuucoeG3U1HXjrgzn5FFQ@mail.gmail.com/
      Fixes: 79bd3712 ("btrfs: rework chunk allocation to avoid exhaustion of the system chunk array")
      CC: stable@vger.kernel.org # 5.14+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2bb2e00e
    • btrfs: zoned: use greedy gc for auto reclaim · 2ca0ec77
      Johannes Thumshirn committed
      Currently auto reclaim of unusable zones reclaims the block-groups in
      the order they have been added to the reclaim list.
      
      Change this to a greedy algorithm by sorting the list so that the
      block-groups with the least amount of valid bytes are reclaimed first.
      
      Note: we can't splice the block groups from reclaim_bgs to let the sort
      happen outside of the lock. The block groups can still be in use by
      other parts, e.g. via bg_list, and we must hold unused_bgs_lock while
      processing them.
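
      As a sketch, the sort can be done with list_sort() and a comparator on
      the used bytes, mirroring the description above:

        static int reclaim_bgs_cmp(void *unused, const struct list_head *a,
                                   const struct list_head *b)
        {
                const struct btrfs_block_group *bg1, *bg2;

                bg1 = list_entry(a, struct btrfs_block_group, bg_list);
                bg2 = list_entry(b, struct btrfs_block_group, bg_list);

                /* least used first, so it gets reclaimed first */
                return bg1->used > bg2->used;
        }

        /* In btrfs_reclaim_bgs_work(), while holding unused_bgs_lock: */
        list_sort(NULL, &fs_info->reclaim_bgs, reclaim_bgs_cmp);
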
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ write note and comment why we can't splice the list ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      2ca0ec77
    • btrfs: reduce btrfs_update_block_group alloc argument to bool · 11b66fa6
      Anand Jain committed
      btrfs_update_block_group() accounts for the number of bytes allocated or
      freed. Argument @alloc specifies whether the call is for alloc or free.
      Convert the argument @alloc type from int to bool.
      Reviewed-by: Su Yue <l@damenly.su>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      11b66fa6
    • btrfs: zoned: add a dedicated data relocation block group · c2707a25
      Johannes Thumshirn committed
      Relocation in a zoned filesystem can fail with a transaction abort with
      error -22 (EINVAL). This happens because the relocation code assumes that
      the extents we relocated the data to have the same size as the source
      extents had, and ensures this by preallocating the extents.
      
      But in a zoned filesystem we currently can't preallocate the extents as
      this would break the sequential write required rule. Therefore it can
      happen that the writeback process kicks in while we're still adding pages
      to a delalloc range and starts writing out dirty pages.
      
      This then creates destination extents that are smaller than the source
      extents, triggering the following safety check in get_new_location():
      
        if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {
                ret = -EINVAL;
                goto out;
        }
      
      Temporarily create a dedicated block group for the relocation process, so
      no non-relocation data writes can interfere with the relocation writes.
      
      This is needed so that we can switch the relocation process on a zoned
      filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a
      scheme like in a non-zoned filesystem, using REQ_OP_WRITE and
      preallocation.
      
      Fixes: 32430c61 ("btrfs: zoned: enable relocation on a zoned filesystem")
      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c2707a25
    • btrfs: zoned: activate new block group · eb66a010
      Naohiro Aota committed
      Activate a new block group at btrfs_make_block_group(). We do not check
      the return value; if activation fails, we can try again later at the
      actual extent allocation phase.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      eb66a010
    • btrfs: zoned: implement active zone tracking · afba2bc0
      Naohiro Aota committed
      Add a zone_is_active flag to btrfs_block_group. This flag indicates that
      the underlying zones are all active. Such active block groups are tracked
      in fs_info->active_bg_list.
      
      btrfs_dev_{set,clear}_active_zone() take responsibility for the underlying
      device part. They set/clear the bitmap to indicate zone activeness and
      count the number of zones we can still activate.
      
      btrfs_zone_{activate,finish}() take responsibility for the logical part
      and the list management. In addition, btrfs_zone_finish() waits for any
      writes on the zone and sends REQ_OP_ZONE_FINISH to it.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      afba2bc0
    • btrfs: zoned: introduce physical_map to btrfs_block_group · dafc340d
      Naohiro Aota committed
      We will use a block group's physical location to track active zones and
      finish fully written zones in the following commits. Since the zone
      activation is done in the extent allocation context, which is already
      holding the tree locks, we can't query the chunk tree for the physical
      locations. So, copy the location info into a block group and use it for
      activation.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      dafc340d
    • btrfs: zoned: calculate free space from zone capacity · 98173255
      Naohiro Aota committed
      Now that we introduced capacity in a block group, we need to calculate
      free space using the capacity instead of the length. Thus, we account
      capacity - alloc_pointer bytes as free, and account the bytes in
      [capacity, length] as zone unusable.
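
      In other words, for a freshly loaded block group the accounting is
      roughly as in this sketch (field names as in struct btrfs_block_group):

        /* [0, alloc_offset)        allocated
         * [alloc_offset, capacity) free and still writable
         * [capacity, length)       never writable -> zone unusable */
        free          = cache->zone_capacity - cache->alloc_offset;
        zone_unusable = cache->length - cache->zone_capacity;
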
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      98173255
    • btrfs: zoned: move btrfs_free_excluded_extents out of btrfs_calc_zone_unusable · c46c4247
      Naohiro Aota committed
      btrfs_free_excluded_extents() is not necessary for
      btrfs_calc_zone_unusable(), and it makes btrfs_calc_zone_unusable()
      difficult to reuse. Move it out and call btrfs_free_excluded_extents()
      in the proper context.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c46c4247
    • btrfs: rename and switch to bool btrfs_chunk_readonly · a09f23c3
      Anand Jain committed
      btrfs_chunk_readonly() checks if the given chunk is writeable. It
      returns 1 for readonly and 0 for writeable, so a bool return type
      suffices instead of the current int.
      
      Also, rename btrfs_chunk_readonly() to btrfs_chunk_writeable(), as we
      check if the bg is writeable; this helps keep the logic in the parent
      function simpler to understand.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a09f23c3
  8. 26 Oct 2021, 1 commit
  9. 23 Aug 2021, 3 commits
  10. 07 Jul 2021, 5 commits
    • btrfs: don't block if we can't acquire the reclaim lock · 9cc0b837
      Johannes Thumshirn committed
      If we can't acquire the reclaim_bgs_lock on block group reclaim, we
      block until it is free. This can potentially stall for a long time.
      
      While reclaim of block groups is necessary for a good user experience on
      a zoned file system, there is still no need to block, as reclaim is best
      effort only, just like when we're deleting unused block groups.
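
      A sketch of the non-blocking variant in btrfs_reclaim_bgs_work() (the
      same pattern applies when deleting unused block groups):

        /* Reclaim is best effort: if someone else holds the lock, simply
         * try again on the next transaction commit instead of blocking. */
        if (!mutex_trylock(&fs_info->reclaim_bgs_lock))
                return;
        /* ... relocate the block groups queued on fs_info->reclaim_bgs ... */
        mutex_unlock(&fs_info->reclaim_bgs_lock);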
      
      CC: stable@vger.kernel.org # 5.13
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9cc0b837
    • btrfs: rework chunk allocation to avoid exhaustion of the system chunk array · 79bd3712
      Filipe Manana committed
      Commit eafa4fd0 ("btrfs: fix exhaustion of the system chunk array
      due to concurrent allocations") fixed a problem that resulted in
      exhausting the system chunk array in the superblock when there are many
      tasks allocating chunks in parallel. Basically too many tasks enter the
      first phase of chunk allocation without previous tasks having finished
      their second phase of allocation, resulting in too many system chunks
      being allocated. That was originally observed when running the fallocate
      tests of stress-ng on a PowerPC machine, using a node size of 64K.
      
      However that commit also introduced a deadlock where a task in phase 1 of
      the chunk allocation waited for another task that had allocated a system
      chunk to finish its phase 2, but that other task was waiting on an extent
      buffer lock held by the first task, therefore resulting in both tasks not
      making any progress. That change was later reverted by a patch with the
      subject "btrfs: fix deadlock with concurrent chunk allocations involving
      system chunks", since there is no simple and short solution to address it
      and the deadlock is relatively easy to trigger on zoned filesystems, while
      the system chunk array exhaustion is not so common.
      
      This change reworks the chunk allocation to avoid the system chunk array
      exhaustion. It accomplishes that by making the first phase of chunk
      allocation do the updates of the device items in the chunk btree and the
      insertion of the new chunk item in the chunk btree. This is done while
      under the protection of the chunk mutex (fs_info->chunk_mutex), in the
      same critical section that checks for available system space, allocates
      a new system chunk if needed and reserves system chunk space. This way
      we do not have chunk space reserved until the second phase completes.
      
      The same logic is applied to chunk removal as well, since it keeps
      reserved system space long after it is done updating the chunk btree.
      
      For direct allocation of system chunks, the previous behaviour remains,
      because otherwise we would deadlock on extent buffers of the chunk btree.
      Changes to the chunk btree are by large done by chunk allocation and chunk
      removal, which first reserve chunk system space and then later do changes
      to the chunk btree. The other remaining cases are uncommon and correspond
      to adding a device, removing a device and resizing a device. All these
      other cases do not pre-reserve system space, they modify the chunk btree
      right away, so they don't hold reserved space for a long period like chunk
      allocation and chunk removal do.
      
      The diff of this change is huge, but more than half of it is just addition
      of comments describing both how things work regarding chunk allocation and
      removal, including both the new behavior and the parts of the old behavior
      that did not change.
      
      CC: stable@vger.kernel.org # 5.12+
      Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      79bd3712
    • btrfs: fix deadlock with concurrent chunk allocations involving system chunks · 1cb3db1c
      Filipe Manana committed
      When a task attempting to allocate a new chunk verifies that there is not
      currently enough free space in the system space_info, and there is another
      task that allocated a new system chunk but has not yet finished the
      creation of the respective block group, it waits for that other task to
      finish creating the block group. This is to avoid exhaustion of the system
      chunk array in the superblock, which is limited, when we have a thundering
      herd of tasks allocating new chunks. This problem was described and fixed
      by commit eafa4fd0 ("btrfs: fix exhaustion of the system chunk array
      due to concurrent allocations").
      
      However there are two very similar scenarios where this can lead to a
      deadlock:
      
      1) Task B allocated a new system chunk and task A is waiting on task B
         to finish creation of the respective system block group. However before
         task B ends its transaction handle and finishes the creation of the
         system block group, it attempts to allocate another chunk (like a data
         chunk for an fallocate operation for a very large range). Task B will
         be unable to progress and allocate the new chunk, because task A set
         space_info->chunk_alloc to 1 and therefore it loops at
         btrfs_chunk_alloc() waiting for task A to finish its chunk allocation
         and set space_info->chunk_alloc to 0, but task A is waiting on task B
         to finish creation of the new system block group, therefore resulting
         in a deadlock;
      
      2) Task B allocated a new system chunk and task A is waiting on task B to
         finish creation of the respective system block group. By the time that
         task B enters the final phase of block group allocation, which happens
         at btrfs_create_pending_block_groups(), when it modifies the extent
         tree, the device tree or the chunk tree to insert the items for some
         new block group, it needs to allocate a new chunk, so it ends up at
         btrfs_chunk_alloc() and keeps looping there because task A has set
         space_info->chunk_alloc to 1, but task A is waiting for task B to
         finish creation of the new system block group and release the reserved
         system space, therefore resulting in a deadlock.
      
      In short, the problem is if a task B needs to allocate a new chunk after
      it previously allocated a new system chunk and if another task A is
      currently waiting for task B to complete the allocation of the new system
      chunk.
      
      Unfortunately this deadlock scenario introduced by the previous fix for
      the system chunk array exhaustion problem does not have a simple and short
      fix, and requires a big change to rework the chunk allocation code so that
      chunk btree updates are all made in the first phase of chunk allocation.
      And since this deadlock regression is being frequently hit on zoned
      filesystems and the system chunk array exhaustion problem is triggered
      in more extreme cases (originally observed on PowerPC with a node size
      of 64K when running the fallocate tests from stress-ng), revert the
      changes from that commit. The next patch in the series, with a subject
      of "btrfs: rework chunk allocation to avoid exhaustion of the system
      chunk array" does the necessary changes to fix the system chunk array
      exhaustion problem.
      Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
      Link: https://lore.kernel.org/linux-btrfs/20210621015922.ewgbffxuawia7liz@naota-xeon/
      Fixes: eafa4fd0 ("btrfs: fix exhaustion of the system chunk array due to concurrent allocations")
      CC: stable@vger.kernel.org # 5.12+
      Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1cb3db1c
    • btrfs: zoned: print unusable percentage when reclaiming block groups · 5f93e776
      Johannes Thumshirn committed
      When we're automatically reclaiming a zone because its zone_unusable
      value is above the reclaim threshold, we're only logging what percentage
      of the zone's capacity is used, but not how much of the capacity is
      unusable.
      
      Also print the percentage of the unusable space in the block group
      before we're reclaiming it.
      
      Example:
      
        BTRFS info (device sdg): reclaiming chunk 230686720 with 13% used 86% unusable
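
      A sketch of the updated message, assuming both percentages are computed
      against the block group length:

        btrfs_info(fs_info,
                   "reclaiming chunk %llu with %llu%% used %llu%% unusable",
                   bg->start,
                   div64_u64(bg->used * 100, bg->length),
                   div64_u64(bg->zone_unusable * 100, bg->length));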
      
      CC: stable@vger.kernel.org # 5.13
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5f93e776
    • btrfs: zoned: fix types for u64 division in btrfs_reclaim_bgs_work · 54afaae3
      David Sterba committed
      The types in the calculation of the used percentage in the reclaiming
      messages are both u64, though bg->length is either 1GiB (non-zoned) or
      the zone size in zoned mode. The upper limit on zone size is 8GiB, so
      this could theoretically overflow in the future; right now the values
      fit.
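
      For reference: div_u64() takes a 32-bit divisor, while bg->length can
      exceed 32 bits (zones of up to 8GiB), so the full 64-by-64 helper is the
      safe choice:

        u64 used_pct = div64_u64(bg->used * 100, bg->length);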
      
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      CC: stable@vger.kernel.org # 5.13
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      54afaae3
  11. 22 Jun 2021, 2 commits
    • btrfs: rip out btrfs_space_info::total_bytes_pinned · 138a12d8
      Josef Bacik committed
      We used this in may_commit_transaction() in order to determine if we
      needed to commit the transaction.  However we no longer have that logic,
      and thus have no use for this counter anymore, so delete it.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      138a12d8
    • btrfs: ensure relocation never runs while we have send operations running · 1cea5cf0
      Filipe Manana committed
      Relocation and send do not play well together because while send is
      running a block group can be relocated, a transaction committed and
      the respective disk extents get re-allocated and written to or discarded
      while send is about to do something with the extents.
      
      This was explained in commit 9e967495 ("Btrfs: prevent send failures
      and crashes due to concurrent relocation"), which prevented balance and
      send from running in parallel but it did not address one remaining case
      where chunk relocation can happen: shrinking a device (and device deletion
      which shrinks a device's size to 0 before deleting the device).
      
      We also have now one more case where relocation is triggered: on zoned
      filesystems partially used block groups get relocated by a background
      thread, introduced in commit 18bb8bbf ("btrfs: zoned: automatically
      reclaim zones").
      
      So make sure that instead of preventing balance from running when there
      are ongoing send operations, we prevent relocation from happening.
      This uses the infrastructure recently added by a patch that has the
      subject: "btrfs: add cancellable chunk relocation support".
      
      Also add a spinlock used solely for the mutual exclusion between send and
      relocation; before, fs_info->balance_mutex was used, which would make an
      attempt to run send block waiting for balance to finish, which can take a
      lot of time on large filesystems.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1cea5cf0
  12. 21 Jun 2021, 1 commit
  13. 17 Jun 2021, 1 commit
    • btrfs: zoned: fix negative space_info->bytes_readonly · f9f28e5b
      Naohiro Aota committed
      Consider a block group in use on zoned btrfs:
      
      |<- ZU ->|<- used ->|<---free--->|
                           `- Alloc offset
      ZU: Zone unusable
      
      Marking the block group read-only will migrate the zone unusable bytes
      to the read-only bytes. So, we will have this.
      
      |<- RO ->|<- used ->|<--- RO --->|
      
      RO: Read only
      
      When marking it back to read-write, btrfs_dec_block_group_ro()
      subtracts the above "RO" bytes from space_info->bytes_readonly. Then it
      moves the zone unusable bytes back and subtracts those bytes from
      space_info->bytes_readonly again, leading to a negative bytes_readonly.
      
      This can be observed in the output as eg.:
      
        Data, single: total=512.00MiB, used=165.21MiB, zone_unusable=16.00EiB
        Data, single: total=536870912, used=173256704, zone_unusable=18446744073603186688
      
      This commit fixes the issue by reordering the operations.
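
      A sketch of the corrected order in btrfs_dec_block_group_ro(),
      simplified from the description above:

        if (btrfs_is_zoned(cache->fs_info)) {
                /* Migrate the zone unusable bytes back first ... */
                sinfo->bytes_readonly -= cache->zone_unusable;
                sinfo->bytes_zone_unusable += cache->zone_unusable;
        }
        /* ... and only then subtract the remaining read-only bytes. */
        num_bytes = cache->length - cache->reserved - cache->pinned -
                    cache->bytes_super - cache->zone_unusable - cache->used;
        sinfo->bytes_readonly -= num_bytes;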
      
      Link: https://github.com/naota/linux/issues/37
      Reported-by: David Sterba <dsterba@suse.com>
      Fixes: 169e0da9 ("btrfs: zoned: track unusable bytes for zones")
      CC: stable@vger.kernel.org # 5.12+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f9f28e5b
  14. 21 Apr 2021, 1 commit
    • btrfs: zoned: automatically reclaim zones · 18bb8bbf
      Johannes Thumshirn committed
      When a file gets deleted on a zoned file system, the space freed is not
      returned to the block group's free space, but is migrated to
      zone_unusable.
      
      As this zone_unusable space is behind the current write pointer it is not
      possible to use it for new allocations. In the current implementation a
      zone is reset once all of the block group's space is accounted as zone
      unusable.
      
      This behaviour can lead to premature ENOSPC errors on a busy file system.
      
      Instead of only reclaiming the zone once it is completely unusable,
      kick off a reclaim job once the amount of unusable bytes exceeds a user
      configurable threshold between 51% and 100%. It can be set per mounted
      filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75%
      by default.
      
      Similar to reclaiming unused block groups, these dirty block groups are
      added to a to_reclaim list and then, on a transaction commit, the reclaim
      process is triggered, but only after unused block groups have been
      deleted, which frees space for the relocation process.
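
      A sketch of the threshold check; the helper name is an assumption, and
      bg_reclaim_threshold is the sysfs tunable in percent (75 by default,
      0 disables automatic reclaim):

        if (fs_info->bg_reclaim_threshold &&
            bg->zone_unusable >=
                    div_factor_fine(bg->length, fs_info->bg_reclaim_threshold))
                btrfs_mark_bg_to_reclaim(bg);   /* queue on the reclaim list */
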
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      18bb8bbf