1. 07 Oct 2020 (3 commits)
  2. 27 Aug 2020 (1 commit)
    • btrfs: block-group: fix free-space bitmap threshold · e3e39c72
      Committed by Marcos Paulo de Souza
      [BUG]
      After commit 9afc6649 ("btrfs: block-group: refactor how we read one
      block group item"), cache->length is being assigned after calling
      btrfs_create_block_group_cache. This causes a problem since
      set_free_space_tree_thresholds calculates the free-space threshold to
      decide if the free-space tree should convert from extents to bitmaps.
      
      The current code calls set_free_space_tree_thresholds with cache->length
      being 0, which then makes cache->bitmap_high_thresh zero. This implies
      the system will always use bitmap instead of extents, which is not
      desired if the block group is not fragmented.
      
      This behavior can be seen by a test that expects to repair systems
      with FREE_SPACE_EXTENT and FREE_SPACE_BITMAP, but the current code only
      created FREE_SPACE_BITMAP.
      
      [FIX]
      Call set_free_space_tree_thresholds after setting cache->length. There
      is now a WARN_ON in set_free_space_tree_thresholds to help preventing
      the same mistake to happen again in the future.
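      The ordering problem and its fix can be sketched in plain userspace C. Everything below is illustrative: the struct fields mirror the commit message, but the threshold formula (one bitmap per 128 MiB) is invented for the example and is not the kernel's actual calculation.

```c
#include <assert.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel structures. */
struct block_group {
    unsigned long long length;             /* size of the block group */
    unsigned long long bitmap_high_thresh; /* extents -> bitmaps threshold */
};

/* Sketch of set_free_space_tree_thresholds(): derives the threshold from
 * cache->length. With length == 0 the threshold collapses to 0, which
 * forces bitmaps everywhere -- the bug described above. */
static void set_free_space_tree_thresholds(struct block_group *cache)
{
    if (cache->length == 0)
        fprintf(stderr, "WARN: length not set before thresholds\n");
    /* hypothetical formula: one bitmap per 128 MiB of block group */
    cache->bitmap_high_thresh = cache->length / (128ULL << 20);
}

/* Old ordering: thresholds computed while cache->length is still 0. */
unsigned long long thresh_buggy_order(unsigned long long length)
{
    struct block_group cache = {0};
    set_free_space_tree_thresholds(&cache); /* length still 0: bug */
    cache.length = length;                  /* assigned too late */
    return cache.bitmap_high_thresh;
}

/* Fixed ordering: set cache->length first, then compute thresholds. */
unsigned long long thresh_fixed_order(unsigned long long length)
{
    struct block_group cache = {0};
    cache.length = length;
    set_free_space_tree_thresholds(&cache);
    return cache.bitmap_high_thresh;
}
```

      With a 1 GiB block group, the buggy ordering yields a zero threshold (always bitmaps), while the fixed ordering yields a sane nonzero one.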
      
      Link: https://github.com/kdave/btrfs-progs/issues/251
      Fixes: 9afc6649 ("btrfs: block-group: refactor how we read one block group item")
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 27 Jul 2020 (9 commits)
  4. 17 Jun 2020 (2 commits)
    • btrfs: fix race between block group removal and block group creation · ffcb9d44
      Committed by Filipe Manana
      There is a race between block group removal and block group creation
      when the removal is completed by a task running fitrim or scrub. When
      this happens we end up failing the block group creation with an error
      -EEXIST since we attempt to insert a duplicate block group item key
      in the extent tree. That results in a transaction abort.
      
      The race happens like this:
      
      1) Task A is doing a fitrim, and at btrfs_trim_block_group() it freezes
         block group X with btrfs_freeze_block_group() (until very recently
         that was named btrfs_get_block_group_trimming());
      
      2) Task B starts removing block group X, either because it's now unused
         or due to relocation for example. So at btrfs_remove_block_group(),
         while holding the chunk mutex and the block group's lock, it sets
         the 'removed' flag of the block group and it sets the local variable
         'remove_em' to false, because the block group is currently frozen
         (its 'frozen' counter is > 0, until very recently this counter was
         named 'trimming');
      
      3) Task B unlocks the block group and the chunk mutex;
      
      4) Task A is done trimming the block group and unfreezes the block group
         by calling btrfs_unfreeze_block_group() (until very recently this was
         named btrfs_put_block_group_trimming()). In this function we lock the
         block group and set the local variable 'cleanup' to true because we
         were able to decrement the block group's 'frozen' counter down to 0 and
         the flag 'removed' is set in the block group.
      
         Since 'cleanup' is set to true, it locks the chunk mutex and removes
         the extent mapping representing the block group from the mapping tree;
      
      5) Task C allocates a new block group Y and it picks up the logical address
         that block group X had as the logical address for Y, because X was the
         block group with the highest logical address and now the second block
         group with the highest logical address, the last in the fs mapping tree,
         ends at an offset corresponding to block group X's logical address (this
         logical address selection is done at volumes.c:find_next_chunk()).
      
         At this point the new block group Y does not have yet its item added
         to the extent tree (nor the corresponding device extent items and
         chunk item in the device and chunk trees). The new group Y is added to
         the list of pending block groups in the transaction handle;
      
      6) Before task B proceeds to removing the block group item for block
         group X from the extent tree, which has a key matching:
      
         (X logical offset, BTRFS_BLOCK_GROUP_ITEM_KEY, length)
      
         task C, while ending its transaction handle, calls
         btrfs_create_pending_block_groups(), which finds block group Y and
         tries to insert the block group item for Y into the extent tree.
         This fails with -EEXIST, since the logical offset is the same as
         the one X had and task B hasn't yet deleted the key from the
         extent tree. This failure results in a transaction abort,
         producing a stack like the following:
      
      ------------[ cut here ]------------
       BTRFS: Transaction aborted (error -17)
       WARNING: CPU: 2 PID: 19736 at fs/btrfs/block-group.c:2074 btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
       Modules linked in: btrfs blake2b_generic xor raid6_pq (...)
       CPU: 2 PID: 19736 Comm: fsstress Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
       Code: ff ff ff 48 8b 55 50 f0 48 (...)
       RSP: 0018:ffffa4160a1c7d58 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: ffff961581909d98 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: ffffffffb3d63990 RDI: 0000000000000001
       RBP: ffff9614f3356a58 R08: 0000000000000000 R09: 0000000000000001
       R10: ffff9615b65b0040 R11: 0000000000000000 R12: ffff961581909c10
       R13: ffff9615b0c32000 R14: ffff9614f3356ab0 R15: ffff9614be779000
       FS:  00007f2ce2841e80(0000) GS:ffff9615bae00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000555f18780000 CR3: 0000000131d34005 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_start_dirty_block_groups+0x398/0x4e0 [btrfs]
        btrfs_commit_transaction+0xd0/0xc50 [btrfs]
        ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
        ? __ia32_sys_fdatasync+0x20/0x20
        iterate_supers+0xdb/0x180
        ksys_sync+0x60/0xb0
        __ia32_sys_sync+0xa/0x10
        do_syscall_64+0x5c/0x280
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7f2ce1d4d5b7
       Code: 83 c4 08 48 3d 01 (...)
       RSP: 002b:00007ffd8b558c58 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
       RAX: ffffffffffffffda RBX: 000000000000002c RCX: 00007f2ce1d4d5b7
       RDX: 00000000ffffffff RSI: 00000000186ba07b RDI: 000000000000002c
       RBP: 0000555f17b9e520 R08: 0000000000000012 R09: 000000000000ce00
       R10: 0000000000000078 R11: 0000000000000202 R12: 0000000000000032
       R13: 0000000051eb851f R14: 00007ffd8b558cd0 R15: 0000555f1798ec20
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
       softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace bd7c03622e0b0a9c ]---
      
      Fix this simply by making btrfs_remove_block_group() remove the block
      group's item from the extent tree before it flags the block group as
      removed. Also make the free space deletion from the free space tree
      before flagging the block group as removed, to avoid a similar race
      with adding and removing free space entries for the free space tree.
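      The two interleavings can be modeled with a toy extent tree in userspace C. The 'tree' type, the offset 100, and the step ordering below are illustrative stand-ins for the kernel structures described above, not real btrfs code.

```c
#include <assert.h>

#define EEXIST 17

/* Toy "extent tree": a flat set of logical offsets. */
struct tree { unsigned long long keys[8]; int n; };

static int tree_has(struct tree *t, unsigned long long k)
{
    for (int i = 0; i < t->n; i++)
        if (t->keys[i] == k)
            return 1;
    return 0;
}

static int tree_insert(struct tree *t, unsigned long long k)
{
    if (tree_has(t, k))
        return -EEXIST; /* duplicate block group item key */
    t->keys[t->n++] = k;
    return 0;
}

static void tree_delete(struct tree *t, unsigned long long k)
{
    for (int i = 0; i < t->n; i++)
        if (t->keys[i] == k) {
            t->keys[i] = t->keys[--t->n];
            return;
        }
}

/* Buggy interleaving: task B flags X removed first; task C reuses X's
 * logical address and inserts Y before B deletes X's item -> -EEXIST. */
int interleaving_buggy(void)
{
    struct tree extent_tree = { .keys = {100}, .n = 1 };
    int ret = tree_insert(&extent_tree, 100); /* C: insert Y at X's offset */
    tree_delete(&extent_tree, 100);           /* B: delete X's item too late */
    return ret;
}

/* Fixed interleaving: B deletes X's extent tree item before the group is
 * flagged removed, so C's insertion at the same offset succeeds. */
int interleaving_fixed(void)
{
    struct tree extent_tree = { .keys = {100}, .n = 1 };
    tree_delete(&extent_tree, 100);           /* B: delete item first (fix) */
    return tree_insert(&extent_tree, 100);    /* C: insert Y succeeds */
}
```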
      
      Fixes: 04216820 ("Btrfs: fix race between fs trimming and block group remove/allocation")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix a block group ref counter leak after failure to remove block group · 9fecd132
      Committed by Filipe Manana
      When removing a block group, if we fail to delete the block group's item
      from the extent tree, we jump to the 'out' label and end up decrementing
      the block group's reference count once only (by 1), resulting in a counter
      leak because the block group at that point was already removed from the
      block group cache rbtree - so we have to decrement the reference count
      twice, once for the rbtree and once for our lookup at the start of the
      function.
      
      There is a second bug where if removing the free space tree entries (the
      call to remove_block_group_free_space()) fails we end up jumping to the
      'out_put_group' label but end up decrementing the reference count only
      once, when we should have done it twice, since we have already removed
      the block group from the block group cache rbtree. This happens because
      the reference count decrement for the rbtree reference happens after
      attempting to remove the free space tree entries, which is far away from
      the place where we remove the block group from the rbtree.
      
      To make things less error prone, decrement the reference count for the
      rbtree immediately after removing the block group from it. This also
      eliminates the need for two different exit labels on error, renaming
      'out_put_group' to just 'out' and removing the old 'out'.
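      The fixed refcount discipline can be sketched in userspace C with simplified stand-ins for the kernel helpers; the function and the failure flag below are illustrative only.

```c
#include <assert.h>

struct bg { int refs; };

static void bg_get(struct bg *b) { b->refs++; }
static void bg_put(struct bg *b) { b->refs--; }

/* Sketch of the fixed btrfs_remove_block_group() refcounting: the rbtree
 * reference is dropped immediately after the group leaves the rbtree, so
 * every later error path only has to drop the single lookup reference. */
int remove_block_group_fixed(struct bg *b, int fail_free_space)
{
    bg_get(b);              /* btrfs_lookup_block_group() reference */

    /* ... remove block group from the block group cache rbtree ... */
    bg_put(b);              /* drop the rbtree reference right away (fix) */

    if (fail_free_space) {  /* e.g. remove_block_group_free_space() fails */
        bg_put(b);          /* 'out': drop the lookup reference */
        return -1;
    }

    bg_put(b);              /* success path: drop the lookup reference */
    return 0;
}

/* Start with one reference held by the rbtree; after a failed removal the
 * count must be fully balanced (back to zero here). */
int remaining_refs_after_failure(void)
{
    struct bg b = { .refs = 1 };
    remove_block_group_fixed(&b, 1);
    return b.refs;
}
```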
      
      Fixes: f6033c5e ("btrfs: fix block group leak when removing fails")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 25 May 2020 (10 commits)
  6. 23 Apr 2020 (2 commits)
    • btrfs: fix block group leak when removing fails · f6033c5e
      Committed by Xiyu Yang
      btrfs_remove_block_group() invokes btrfs_lookup_block_group(), which
      returns a local reference of the block group that contains the given
      bytenr to "block_group" with increased refcount.
      
      When btrfs_remove_block_group() returns, "block_group" becomes invalid,
      so the refcount should be decreased to keep refcount balanced.
      
      The reference counting issue happens in several exception handling paths
      of btrfs_remove_block_group(). When those error scenarios occur such as
      btrfs_alloc_path() returns NULL, the function forgets to decrease its
      refcnt increased by btrfs_lookup_block_group() and will cause a refcnt
      leak.
      
      Fix this issue by jumping to "out_put_group" label and calling
      btrfs_put_block_group() when those error scenarios occur.
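      A minimal sketch of the fix, assuming a simplified model of the lookup/put pair; the error value and the 'alloc_fails' knob are illustrative, not the kernel's actual control flow.

```c
#include <assert.h>

struct bg { int refs; };

/* btrfs_lookup_block_group() returns the group with an elevated refcount,
 * so every exit path, including early errors like a failed
 * btrfs_alloc_path(), must drop that reference (the 'out_put_group' jump
 * added by this commit). */
int remove_bg(struct bg *b, int alloc_fails)
{
    b->refs++;              /* btrfs_lookup_block_group() */

    if (alloc_fails) {
        b->refs--;          /* out_put_group: the fix adds this drop */
        return -12;         /* -ENOMEM */
    }

    /* ... normal removal work ... */
    b->refs--;
    return 0;
}

/* The group starts with one reference; after the early failure the count
 * must return to exactly that, i.e. no leaked reference. */
int refs_after_alloc_failure(void)
{
    struct bg b = { .refs = 1 };
    remove_bg(&b, 1);
    return b.refs;
}
```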
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix memory leak of transaction when deleting unused block group · 5150bf19
      Committed by Filipe Manana
      When cleaning pinned extents right before deleting an unused block group,
      we check if there's still a previous transaction running and if so we
      increment its reference count before using it for cleaning pinned ranges
      in its pinned extents iotree. However we ended up never decrementing the
      reference count after using the transaction, resulting in a memory leak.
      
      Fix it by decrementing the reference count.
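      The take/use/put pattern the fix restores can be sketched as follows; 'use_count' and the cleaning step are simplified stand-ins for the transaction's reference counting and its pinned-extents iotree work.

```c
#include <assert.h>

struct trans { int use_count; };

/* If a previous transaction is still running, we take a reference before
 * using its pinned extent io tree, and we must drop it afterwards; the
 * missing drop was the leak fixed by this commit. */
int clean_pinned(struct trans *prev)
{
    if (!prev)
        return 0;

    prev->use_count++;      /* take a reference on the transaction */
    /* ... clean pinned ranges in prev's pinned extents iotree ... */
    prev->use_count--;      /* the fix: put the reference when done */
    return 0;
}

/* The running transaction's count must be unchanged after cleaning. */
int use_count_after_cleaning(void)
{
    struct trans prev = { .use_count = 1 };
    clean_pinned(&prev);
    return prev.use_count;
}
```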
      
      Fixes: fe119a6e ("btrfs: switch to per-transaction pinned extents")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 09 Apr 2020 (1 commit)
    • btrfs: fix reclaim counter leak of space_info objects · d611add4
      Committed by Filipe Manana
      Whenever we add a ticket to a space_info object we increment the object's
      reclaim_size counter with the ticket's bytes, and we decrement it by the
      corresponding amount only when we are able to grant the requested
      space to the ticket. When we are not able to grant the space to a ticket,
      or when the ticket is removed due to a signal (e.g. an application has
      received SIGTERM from the terminal), we never decrement the counter by
      the corresponding bytes from the ticket. This leak can result in the
      space reclaim code later doing much more work than necessary. So fix it
      by decrementing the counter when those two cases happen as well.
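      The accounting invariant can be shown with a tiny model: reclaim_size must shrink by the ticket's bytes on every way a ticket leaves the list, granted, failed, or removed by a signal alike. Struct shapes and sizes below are illustrative.

```c
#include <assert.h>

struct space_info { unsigned long long reclaim_size; };
struct ticket { unsigned long long bytes; };

/* Adding a ticket accounts its bytes into reclaim_size. */
void add_ticket(struct space_info *si, struct ticket *t)
{
    si->reclaim_size += t->bytes;
}

/* The fix: call this on grant, on failure, AND on signal-driven removal,
 * so the counter is decremented in every exit case. */
void remove_ticket(struct space_info *si, struct ticket *t)
{
    si->reclaim_size -= t->bytes;
}

/* Simulate a ticket removed because the task got e.g. SIGTERM: the
 * counter must come back to zero, not stay inflated. */
unsigned long long reclaim_after_signal(void)
{
    struct space_info si = {0};
    struct ticket t = { .bytes = 4096 };

    add_ticket(&si, &t);
    remove_ticket(&si, &t);
    return si.reclaim_size;
}
```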
      
      Fixes: db161806 ("btrfs: account ticket size at add/delete time")
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  8. 24 Mar 2020 (4 commits)
  9. 21 Mar 2020 (1 commit)
  10. 31 Jan 2020 (2 commits)
    • btrfs: take overcommit into account in inc_block_group_ro · a30a3d20
      Committed by Josef Bacik
      inc_block_group_ro checks whether we would have enough room left over
      if we marked this block group as read only, in order to decide whether
      it's ok to mark it read only.
      
      The problem is this calculation _only_ works for data, where our used is
      always less than our total.  For metadata we will overcommit, so this
      will almost always fail for metadata.
      
      Fix this by exporting btrfs_can_overcommit, and then see if we have
      enough space to remove the remaining free space in the block group we
      are trying to mark read only.  If we do then we can mark this block
      group as read only.
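      The difference between the two checks can be sketched numerically; the 50% overcommit margin below is a made-up stand-in for whatever btrfs_can_overcommit actually allows, and the numbers are arbitrary.

```c
#include <assert.h>

struct sinfo {
    unsigned long long total; /* bytes in the space_info */
    unsigned long long used;  /* may exceed total for metadata (overcommit) */
};

/* Old, data-only style check: fails whenever used + remaining free space
 * exceeds total, which for overcommitted metadata is almost always. */
int can_set_ro_old(struct sinfo *s, unsigned long long bg_free)
{
    return s->used + bg_free <= s->total;
}

/* Overcommit-aware check in the spirit of the fix: ask whether the
 * space_info could still overcommit by the block group's remaining free
 * space. The 50% margin here is purely hypothetical. */
int can_set_ro_overcommit(struct sinfo *s, unsigned long long bg_free)
{
    unsigned long long limit = s->total + s->total / 2;
    return s->used + bg_free <= limit;
}

/* Metadata overcommitted past total: old check rejects marking read only,
 * the overcommit-aware check accepts. */
int metadata_example(void)
{
    struct sinfo meta = { .total = 100, .used = 110 };
    return can_set_ro_old(&meta, 5) == 0 &&
           can_set_ro_overcommit(&meta, 5) == 1;
}
```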
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix force usage in inc_block_group_ro · a7a63acc
      Committed by Josef Bacik
      For some reason we've translated the do_chunk_alloc that goes into
      btrfs_inc_block_group_ro to force in inc_block_group_ro, but these are
      two different things.
      
      force for inc_block_group_ro is used when we are forcing the block group
      read only no matter what, for example when the underlying chunk is
      marked read only.  We need to not do the space check here as this block
      group needs to be read only.
      
      btrfs_inc_block_group_ro() has a do_chunk_alloc flag that indicates that
      we need to pre-allocate a chunk before marking the block group read
      only.  This has nothing to do with forcing, and in fact we _always_ want
      to do the space check in this case, so unconditionally pass false for
      force in this case.
      
      Then fixup inc_block_group_ro to honor force as it's expected and
      documented to do.
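      The two knobs can be separated in a small userspace model; both functions below are simplified sketches of the control flow the commit describes, not the kernel code.

```c
#include <assert.h>

/* 'force': mark the block group read only no matter what (e.g. the
 * underlying chunk is read only); otherwise the space check applies. */
int inc_block_group_ro(int force, int has_space)
{
    if (!force && !has_space)
        return -28; /* -ENOSPC: space check rejected */
    return 0;
}

/* 'do_chunk_alloc': pre-allocate a chunk before marking read only. The
 * fix is to never translate this into 'force'; the space check always
 * applies on this path. */
int btrfs_inc_block_group_ro(int do_chunk_alloc, int has_space)
{
    if (do_chunk_alloc) {
        /* ... allocate a new chunk here in the real code ... */
    }
    return inc_block_group_ro(/* force = */ 0, has_space);
}
```

      Forcing succeeds even without space, while the do_chunk_alloc path still honors the space check.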
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  11. 24 Jan 2020 (2 commits)
  12. 20 Jan 2020 (3 commits)
    • btrfs: remove unnecessary wrapper get_alloc_profile · ef0a82da
      Committed by Johannes Thumshirn
      btrfs_get_alloc_profile() is a simple wrapper over get_alloc_profile().
      The only difference is btrfs_get_alloc_profile() is visible to other
      functions in btrfs while get_alloc_profile() is static and thus only
      visible to functions in block-group.c.
      
      Let's just fold get_alloc_profile() into btrfs_get_alloc_profile() to
      get rid of the unnecessary second function.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Johannes Thumshirn <jth@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: handle empty block_group removal for async discard · 6e80d4f8
      Committed by Dennis Zhou
      block_group removal is a little tricky. It can race with the extent
      allocator, the cleaner thread, and balancing. The current path is for a
      block_group to be added to the unused_bgs list. Then, when the cleaner
      thread comes around, it starts a transaction and then proceeds with
      removing the block_group. Extents that are pinned are subsequently
      removed from the pinned trees and then eventually a discard is issued
      for the entire block_group.
      
      Async discard introduces another player into the game, the discard
      workqueue. While it has none of the racing issues, the new problem is
      ensuring we don't leave free space untrimmed prior to forgetting the
      block_group.  This is handled by placing fully free block_groups on a
      separate discard queue. This is necessary to maintain discarding order
      as in the future we will slowly trim even fully free block_groups. The
      ordering helps us make progress on the same block_group rather than say
      the last fully freed block_group or needing to search through the fully
      freed block groups at the beginning of a list and insert after.
      
      The new order of events is: a fully freed block group gets placed on the
      unused discard queue first. Once it's processed, it will be placed on
      the unused_bgs list and then the original sequence of events will
      happen, just without the final whole block_group discard.
      
      The mount flags can change when processing unused_bgs, so when flipping
      from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the
      discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt
      free block groups on the discard_list to the unused_bg queue which will
      do the final discard for us.
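      The queue hopping described above can be written down as a plain state machine; the state names are a simplification of the kernel's lists, not actual btrfs identifiers.

```c
#include <assert.h>

enum bg_state {
    BG_FULLY_FREED,       /* block group just became empty */
    BG_ON_DISCARD_QUEUE,  /* waiting for the discard workqueue to trim it */
    BG_ON_UNUSED_BGS,     /* back on unused_bgs for the cleaner thread */
    BG_REMOVED            /* deleted, with no final whole-group discard */
};

/* Advance a fully freed block group one step through the async-discard
 * pipeline: trim first, then unused_bgs, then removal. */
enum bg_state bg_step(enum bg_state s)
{
    switch (s) {
    case BG_FULLY_FREED:      return BG_ON_DISCARD_QUEUE;
    case BG_ON_DISCARD_QUEUE: return BG_ON_UNUSED_BGS;
    case BG_ON_UNUSED_BGS:    return BG_REMOVED;
    default:                  return BG_REMOVED;
    }
}
```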
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add the beginning of async discard, discard workqueue · b0643e59
      Committed by Dennis Zhou
      When discard is enabled, every time a pinned extent is released back to
      the block_group's free space cache, a discard is issued for the extent.
      This approach is overeager when it comes to discarding and to helping
      the SSD maintain enough free space to prevent severe garbage collection
      situations.
      
      This adds the beginning of async discard. Instead of issuing a discard
      prior to returning an extent to the free space cache, the extent is
      just marked as untrimmed. The block_group is then added to an LRU which
      feeds into a workqueue that issues discards at a much slower rate. Full
      discarding of unused block groups is still done and will be addressed
      in a future patch of the series.
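      The deferred-trim idea can be sketched as a userspace simulation: release marks extents untrimmed, and a background pass trims a bounded batch per run. The batch limit and data layout are invented for illustration.

```c
#include <assert.h>

#define MAX_EXT 16

struct free_space {
    int trimmed[MAX_EXT]; /* per-extent trim state */
    int n;                /* number of extents in the cache */
};

/* Release path: no synchronous discard, just mark the extent untrimmed. */
void release_extent(struct free_space *fs)
{
    fs->trimmed[fs->n++] = 0;
}

/* Background worker: trim at most 'batch' extents per pass, modeling the
 * workqueue's much slower discard rate. Returns how many were trimmed. */
int discard_workfn(struct free_space *fs, int batch)
{
    int done = 0;
    for (int i = 0; i < fs->n && done < batch; i++) {
        if (!fs->trimmed[i]) {
            fs->trimmed[i] = 1; /* the real code issues a discard here */
            done++;
        }
    }
    return done;
}

/* Release five extents, run one slow pass of batch size 2: three remain
 * untrimmed, awaiting later passes. */
int untrimmed_after_one_pass(void)
{
    struct free_space fs = {0};
    for (int i = 0; i < 5; i++)
        release_extent(&fs);
    discard_workfn(&fs, 2);

    int untrimmed = 0;
    for (int i = 0; i < fs.n; i++)
        untrimmed += !fs.trimmed[i];
    return untrimmed;
}
```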
      
      For now, we don't persist the discard state of extents and bitmaps.
      Therefore, our failure recovery mode will be to consider extents
      untrimmed. This lets us handle failure and unmounting as one and the
      same.
      
      On a number of Facebook webservers, I collected data every minute
      accounting the time we spent in btrfs_finish_extent_commit() (col. 1)
      and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit()
      is where we discard extents synchronously before returning them to the
      free space cache.
      
      discard=sync:
                       p99 total per minute       p99 total per minute
            Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
          ---------------------------------------------------------------
           Drive A  |           434           |          1170
           Drive B  |           880           |          2330
           Drive C  |          2943           |          3920
           Drive D  |          4763           |          5701
      
      discard=async:
                       p99 total per minute       p99 total per minute
            Drive   |   extent_commit() (ms)  |    commit_trans() (ms)
          --------------------------------------------------------------
           Drive A  |           134           |           956
           Drive B  |            64           |          1972
           Drive C  |            59           |          1032
           Drive D  |            62           |          1200
      
      While it's not great that the stats are cumulative over 1 minute, all
      of these servers are running the same workload and the delta between
      the two is substantial. We are spending significantly less time in
      btrfs_finish_extent_commit(), which is responsible for discarding.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>