1. 05 12月, 2019 1 次提交
    • F
      Btrfs: allow clear_extent_dirty() to receive a cached extent state record · e9a7415a
      Filipe Manana 提交于
      [ Upstream commit 0e6ec385b55f6001da8c6b1532494241e52c550d ]
      
      We can have a lot freed extents during the life span of transaction, so
      the red black tree that keeps track of the ranges of each freed extent
      (fs_info->freed_extents[]) can get quite big. When finishing a
      transaction commit we find each range, process it (discard the extents,
      unpin them) and then remove it from the red black tree.
      
      We can use an extent state record as a cache when searching for a range,
      so that when we clean the range we can use the cached extent state we
      passed to the search function instead of iterating the red black tree
      again. Doing things as fast as possible when finishing a transaction (in
      state TRANS_STATE_UNBLOCKED) is convenient as it reduces the time we
      block another task that wants to commit the next transaction.
      
      So change clear_extent_dirty() to allow an optional extent state record to
      be passed as an argument, which will be passed down to __clear_extent_bit.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      e9a7415a
  2. 06 11月, 2019 1 次提交
    • Q
      btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() · 6bcbe350
      Qu Wenruo 提交于
      [ Upstream commit 8702ba9396bf7bbae2ab93c94acd4bd37cfa4f09 ]
      
      [Background]
      Btrfs qgroup uses two types of reserved space for METADATA space,
      PERTRANS and PREALLOC.
      
      PERTRANS is metadata space reserved for each transaction started by
      btrfs_start_transaction().
      While PREALLOC is for delalloc, where we reserve space before joining a
      transaction, and finally it will be converted to PERTRANS after the
      writeback is done.
      
      [Inconsistency]
      However there is inconsistency in how we handle PREALLOC metadata space.
      
      The most obvious one is:
      In btrfs_buffered_write():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
      
      We always free qgroup PREALLOC meta space.
      
      While in btrfs_truncate_block():
      	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
      
      We only free qgroup PREALLOC meta space when something went wrong.
      
      [The Correct Behavior]
      The correct behavior should be the one in btrfs_buffered_write(), we
      should always free PREALLOC metadata space.
      
      The reason is, the btrfs_delalloc_* mechanism works by:
      - Reserve metadata first, even it's not necessary
        In btrfs_delalloc_reserve_metadata()
      
      - Free the unused metadata space
        Normally in:
        btrfs_delalloc_release_extents()
        |- btrfs_inode_rsv_release()
           Here we do calculation on whether we should release or not.
      
      E.g. for 64K buffered write, the metadata rsv works like:
      
      /* The first page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=0
      total:		num_bytes=calc_inode_reservations()
      /* The first page caused one outstanding extent, thus needs metadata
         rsv */
      
      /* The 2nd page */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed
      /* The 2nd page doesn't cause new outstanding extent, needs no new meta
         rsv, so we free what we have reserved */
      
      /* The 3rd~16th pages */
      reserve_meta:	num_bytes=calc_inode_reservations()
      free_meta:	num_bytes=calc_inode_reservations()
      total:		not changed (still space for one outstanding extent)
      
      This means, if btrfs_delalloc_release_extents() determines to free some
      space, then those space should be freed NOW.
      So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
      than btrfs_qgroup_convert_reserved_meta().
      
      The good news is:
      - The callers are not that hot
        The hottest caller is in btrfs_buffered_write(), which is already
        fixed by commit 336a8bb8 ("btrfs: Fix wrong
        btrfs_delalloc_release_extents parameter"). Thus it's not that
        easy to cause false EDQUOT.
      
      - The trans commit in advance for qgroup would hide the bug
        Since commit f5fef4593653 ("btrfs: qgroup: Make qgroup async transaction
        commit more aggressive"), when btrfs qgroup metadata free space is slow,
        it will try to commit transaction and free the wrongly converted
        PERTRANS space, so it's not that easy to hit such bug.
      
      [FIX]
      So to fix the problem, remove the @qgroup_free parameter for
      btrfs_delalloc_release_extents(), and always pass true to
      btrfs_inode_rsv_release().
      Reported-by: NFilipe Manana <fdmanana@suse.com>
      Fixes: 43b18595 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      6bcbe350
  3. 29 10月, 2019 1 次提交
  4. 05 10月, 2019 1 次提交
    • Q
      btrfs: extent-tree: Make sure we only allocate extents from block groups with the same type · c5dbd74f
      Qu Wenruo 提交于
      [ Upstream commit 2a28468e525f3924efed7f29f2bc5a2926e7e19a ]
      
      [BUG]
      With fuzzed image and MIXED_GROUPS super flag, we can hit the following
      BUG_ON():
      
        kernel BUG at fs/btrfs/delayed-ref.c:491!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 0 PID: 1849 Comm: sync Tainted: G           O      5.2.0-custom #27
        RIP: 0010:update_existing_head_ref.cold+0x44/0x46 [btrfs]
        Call Trace:
         add_delayed_ref_head+0x20c/0x2d0 [btrfs]
         btrfs_add_delayed_tree_ref+0x1fc/0x490 [btrfs]
         btrfs_free_tree_block+0x123/0x380 [btrfs]
         __btrfs_cow_block+0x435/0x500 [btrfs]
         btrfs_cow_block+0x110/0x240 [btrfs]
         btrfs_search_slot+0x230/0xa00 [btrfs]
         ? __lock_acquire+0x105e/0x1e20
         btrfs_insert_empty_items+0x67/0xc0 [btrfs]
         alloc_reserved_file_extent+0x9e/0x340 [btrfs]
         __btrfs_run_delayed_refs+0x78e/0x1240 [btrfs]
         ? kvm_clock_read+0x18/0x30
         ? __sched_clock_gtod_offset+0x21/0x50
         btrfs_run_delayed_refs.part.0+0x4e/0x180 [btrfs]
         btrfs_run_delayed_refs+0x23/0x30 [btrfs]
         btrfs_commit_transaction+0x53/0x9f0 [btrfs]
         btrfs_sync_fs+0x7c/0x1c0 [btrfs]
         ? __ia32_sys_fdatasync+0x20/0x20
         sync_fs_one_sb+0x23/0x30
         iterate_supers+0x95/0x100
         ksys_sync+0x62/0xb0
         __ia32_sys_sync+0xe/0x20
         do_syscall_64+0x65/0x240
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      This situation is caused by several factors:
      - Fuzzed image
        The extent tree of this fs missed one backref for extent tree root.
        So we can allocated space from that slot.
      
      - MIXED_BG feature
        Super block has MIXED_BG flag.
      
      - No mixed block groups exists
        All block groups are just regular ones.
      
      This makes data space_info->block_groups[] contains metadata block
      groups.  And when we reserve space for data, we can use space in
      metadata block group.
      
      Then we hit the following file operations:
      
      - fallocate
        We need to allocate data extents.
        find_free_extent() choose to use the metadata block to allocate space
        from, and choose the space of extent tree root, since its backref is
        missing.
      
        This generate one delayed ref head with is_data = 1.
      
      - extent tree update
        We need to update extent tree at run_delayed_ref time.
      
        This generate one delayed ref head with is_data = 0, for the same
        bytenr of old extent tree root.
      
      Then we trigger the BUG_ON().
      
      [FIX]
      The quick fix here is to check block_group->flags before using it.
      
      The problem can only happen for MIXED_GROUPS fs. Regular filesystems
      won't have space_info with DATA|METADATA flag, and no way to hit the
      bug.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203255Reported-by: NJungyeon Yoon <jungyeon.yoon@gmail.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      c5dbd74f
  5. 16 9月, 2019 1 次提交
    • F
      Btrfs: fix race between block group removal and block group allocation · 1d064876
      Filipe Manana 提交于
      [ Upstream commit 8eaf40c0e24e98899a0f3ac9d25a33aafe13822a ]
      
      If a task is removing the block group that currently has the highest start
      offset amongst all existing block groups, there is a short time window
      where it races with a concurrent block group allocation, resulting in a
      transaction abort with an error code of EEXIST.
      
      The following diagram explains the race in detail:
      
            Task A                                                        Task B
      
       btrfs_remove_block_group(bg offset X)
      
         remove_extent_mapping(em offset X)
           -> removes extent map X from the
              tree of extent maps
              (fs_info->mapping_tree), so the
              next call to find_next_chunk()
              will return offset X
      
                                                         btrfs_alloc_chunk()
                                                           find_next_chunk()
                                                             --> returns offset X
      
                                                           __btrfs_alloc_chunk(offset X)
                                                             btrfs_make_block_group()
                                                               btrfs_create_block_group_cache()
                                                                 --> creates btrfs_block_group_cache
                                                                     object with a key corresponding
                                                                     to the block group item in the
                                                                     extent, the key is:
                                                                     (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
      
                                                               --> adds the btrfs_block_group_cache object
                                                                   to the list new_bgs of the transaction
                                                                   handle
      
                                                         btrfs_end_transaction(trans handle)
                                                           __btrfs_end_transaction()
                                                             btrfs_create_pending_block_groups()
                                                               --> sees the new btrfs_block_group_cache
                                                                   in the new_bgs list of the transaction
                                                                   handle
                                                               --> its call to btrfs_insert_item() fails
                                                                   with -EEXIST when attempting to insert
                                                                   the block group item key
                                                                   (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
                                                                   because task A has not removed that key yet
                                                               --> aborts the running transaction with
                                                                   error -EEXIST
      
         btrfs_del_item()
           -> removes the block group's key from
              the extent tree, key is
              (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
      
      A sample transaction abort trace:
      
        [78912.403537] ------------[ cut here ]------------
        [78912.403811] BTRFS: Transaction aborted (error -17)
        [78912.404082] WARNING: CPU: 2 PID: 20465 at fs/btrfs/extent-tree.c:10551 btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
        (...)
        [78912.405642] CPU: 2 PID: 20465 Comm: btrfs Tainted: G        W         5.0.0-btrfs-next-46 #1
        [78912.405941] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
        [78912.406586] RIP: 0010:btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
        (...)
        [78912.407636] RSP: 0018:ffff9d3d4b7e3b08 EFLAGS: 00010282
        [78912.407997] RAX: 0000000000000000 RBX: ffff90959a3796f0 RCX: 0000000000000006
        [78912.408369] RDX: 0000000000000007 RSI: 0000000000000001 RDI: ffff909636b16860
        [78912.408746] RBP: ffff909626758a58 R08: 0000000000000000 R09: 0000000000000000
        [78912.409144] R10: ffff9095ff462400 R11: 0000000000000000 R12: ffff90959a379588
        [78912.409521] R13: ffff909626758ab0 R14: ffff9095036c0000 R15: ffff9095299e1158
        [78912.409899] FS:  00007f387f16f700(0000) GS:ffff909636b00000(0000) knlGS:0000000000000000
        [78912.410285] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [78912.410673] CR2: 00007f429fc87cbc CR3: 000000014440a004 CR4: 00000000003606e0
        [78912.411095] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [78912.411496] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [78912.411898] Call Trace:
        [78912.412318]  __btrfs_end_transaction+0x5b/0x1c0 [btrfs]
        [78912.412746]  btrfs_inc_block_group_ro+0xcf/0x160 [btrfs]
        [78912.413179]  scrub_enumerate_chunks+0x188/0x5b0 [btrfs]
        [78912.413622]  ? __mutex_unlock_slowpath+0x100/0x2a0
        [78912.414078]  btrfs_scrub_dev+0x2ef/0x720 [btrfs]
        [78912.414535]  ? __sb_start_write+0xd4/0x1c0
        [78912.414963]  ? mnt_want_write_file+0x24/0x50
        [78912.415403]  btrfs_ioctl+0x17fb/0x3120 [btrfs]
        [78912.415832]  ? lock_acquire+0xa6/0x190
        [78912.416256]  ? do_vfs_ioctl+0xa2/0x6f0
        [78912.416685]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
        [78912.417116]  do_vfs_ioctl+0xa2/0x6f0
        [78912.417534]  ? __fget+0x113/0x200
        [78912.417954]  ksys_ioctl+0x70/0x80
        [78912.418369]  __x64_sys_ioctl+0x16/0x20
        [78912.418812]  do_syscall_64+0x60/0x1b0
        [78912.419231]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [78912.419644] RIP: 0033:0x7f3880252dd7
        (...)
        [78912.420957] RSP: 002b:00007f387f16ed68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [78912.421426] RAX: ffffffffffffffda RBX: 000055f5becc1df0 RCX: 00007f3880252dd7
        [78912.421889] RDX: 000055f5becc1df0 RSI: 00000000c400941b RDI: 0000000000000003
        [78912.422354] RBP: 0000000000000000 R08: 00007f387f16f700 R09: 0000000000000000
        [78912.422790] R10: 00007f387f16f700 R11: 0000000000000246 R12: 0000000000000000
        [78912.423202] R13: 00007ffda49c266f R14: 0000000000000000 R15: 00007f388145e040
        [78912.425505] ---[ end trace eb9bfe7c426fc4d3 ]---
      
      Fix this by calling remove_extent_mapping(), at btrfs_remove_block_group(),
      only at the very end, after removing the block group item key from the
      extent tree (and removing the free space tree entry if we are using the
      free space tree feature).
      
      Fixes: 04216820 ("Btrfs: fix race between fs trimming and block group remove/allocation")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      1d064876
  6. 31 5月, 2019 2 次提交
  7. 22 5月, 2019 1 次提交
  8. 03 4月, 2019 1 次提交
  9. 10 1月, 2019 1 次提交
    • J
      btrfs: run delayed items before dropping the snapshot · 6911b074
      Josef Bacik 提交于
      commit 0568e82dbe2510fc1fa664f58e5c997d3f1e649e upstream.
      
      With my delayed refs patches in place we started seeing a large amount
      of aborts in __btrfs_free_extent:
      
       BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964  owner 1 offset 0
       Call Trace:
        ? btrfs_merge_delayed_refs+0xaf/0x340
        __btrfs_run_delayed_refs+0x6ea/0xfc0
        ? btrfs_set_path_blocking+0x31/0x60
        btrfs_run_delayed_refs+0xeb/0x180
        btrfs_commit_transaction+0x179/0x7f0
        ? btrfs_check_space_for_delayed_refs+0x30/0x50
        ? should_end_transaction.isra.19+0xe/0x40
        btrfs_drop_snapshot+0x41c/0x7c0
        btrfs_clean_one_deleted_snapshot+0xb5/0xd0
        cleaner_kthread+0xf6/0x120
        kthread+0xf8/0x130
        ? btree_invalidatepage+0x90/0x90
        ? kthread_bind+0x10/0x10
        ret_from_fork+0x35/0x40
      
      This was because btrfs_drop_snapshot depends on the root not being
      modified while it's dropping the snapshot.  It will unlock the root node
      (and really every node) as it walks down the tree, only to re-lock it
      when it needs to do something.  This is a problem because if we modify
      the tree we could cow a block in our path, which frees our reference to
      that block.  Then once we get back to that shared block we'll free our
      reference to it again, and get ENOENT when trying to lookup our extent
      reference to that block in __btrfs_free_extent.
      
      This is ultimately happening because we have delayed items left to be
      processed for our deleted snapshot _after_ all of the inodes are closed
      for the snapshot.  We only run the delayed inode item if we're deleting
      the inode, and even then we do not run the delayed insertions or delayed
      removals.  These can be run at any point after our final inode does its
      last iput, which is what triggers the snapshot deletion.  We can end up
      with the snapshot deletion happening and then have the delayed items run
      on that file system, resulting in the above problem.
      
      This problem has existed forever, however my patches made it much easier
      to hit as I wake up the cleaner much more often to deal with delayed
      iputs, which made us more likely to start the snapshot dropping work
      before the transaction commits, which is when the delayed items would
      generally be run.  Before, generally speaking, we would run the delayed
      items, commit the transaction, and wakeup the cleaner thread to start
      deleting snapshots, which means we were less likely to hit this problem.
      You could still hit it if you had multiple snapshots to be deleted and
      ended up with lots of delayed items, but it was definitely harder.
      
      Fix for now by simply running all the delayed items before starting to
      drop the snapshot.  We could make this smarter in the future by making
      the delayed items per-root, and then simply drop any delayed items for
      roots that we are going to delete.  But for now just a quick and easy
      solution is the safest.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6911b074
  10. 14 11月, 2018 13 次提交
    • J
      btrfs: fix insert_reserved error handling · 186b5248
      Josef Bacik 提交于
      commit 80ee54bfe8a3850015585ebc84e8d207fcae6831 upstream.
      
      We were not handling the reserved byte accounting properly for data
      references.  Metadata was fine, if it errored out the error paths would
      free the bytes_reserved count and pin the extent, but it even missed one
      of the error cases.  So instead move this handling up into
      run_one_delayed_ref so we are sure that both cases are properly cleaned
      up in case of a transaction abort.
      
      CC: stable@vger.kernel.org # 4.18+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      186b5248
    • J
      btrfs: don't use ctl->free_space for max_extent_size · a746cfd0
      Josef Bacik 提交于
      commit fb5c39d7a887108087de6ff93d3f326b01b4ef41 upstream.
      
      max_extent_size is supposed to be the largest contiguous range for the
      space info, and ctl->free_space is the total free space in the block
      group.  We need to keep track of these separately and _only_ use the
      max_free_space if we don't have a max_extent_size, as that means our
      original request was too large to search any of the block groups for and
      therefore wouldn't have a max_extent_size set.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a746cfd0
    • J
      btrfs: reset max_extent_size properly · e982beca
      Josef Bacik 提交于
      commit 21a94f7acf0f748599ea552af5d9ee7d7e41c72f upstream.
      
      If we use up our block group before allocating a new one we'll easily
      get a max_extent_size that's set really really low, which will result in
      a lot of fragmentation.  We need to make sure we're resetting the
      max_extent_size when we add a new chunk or add new space.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e982beca
    • F
      Btrfs: fix deadlock when writing out free space caches · ea9c846f
      Filipe Manana 提交于
      commit 5ce555578e0919237fa4bda92b4670e2dd176f85 upstream.
      
      When writing out a block group free space cache we can end deadlocking
      with ourselves on an extent buffer lock resulting in a warning like the
      following:
      
        [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G
          W I      4.16.8 #1
        [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
        [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
        [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
        [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
        [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
        [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
        [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
        [245043.404780] FS:  0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
        [245043.406050] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
        [245043.408670] Call Trace:
        [245043.409977]  btrfs_search_slot+0x761/0xa60 [btrfs]
        [245043.411278]  btrfs_insert_empty_items+0x62/0xb0 [btrfs]
        [245043.412572]  btrfs_insert_item+0x5b/0xc0 [btrfs]
        [245043.413922]  btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
        [245043.415216]  do_chunk_alloc+0x1e5/0x2a0 [btrfs]
        [245043.416487]  find_free_extent+0xcd0/0xf60 [btrfs]
        [245043.417813]  btrfs_reserve_extent+0x96/0x1e0 [btrfs]
        [245043.419105]  btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
        [245043.420378]  __btrfs_cow_block+0x127/0x550 [btrfs]
        [245043.421652]  btrfs_cow_block+0xee/0x190 [btrfs]
        [245043.422979]  btrfs_search_slot+0x227/0xa60 [btrfs]
        [245043.424279]  ? btrfs_update_inode_item+0x59/0x100 [btrfs]
        [245043.425538]  ? iput+0x72/0x1e0
        [245043.426798]  write_one_cache_group.isra.49+0x20/0x90 [btrfs]
        [245043.428131]  btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
        [245043.429419]  btrfs_commit_transaction+0x11b/0x880 [btrfs]
        [245043.430712]  ? start_transaction+0x8e/0x410 [btrfs]
        [245043.432006]  transaction_kthread+0x184/0x1a0 [btrfs]
        [245043.433341]  kthread+0xf0/0x130
        [245043.434628]  ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
        [245043.435928]  ? kthread_create_worker_on_cpu+0x40/0x40
        [245043.437236]  ret_from_fork+0x1f/0x30
        [245043.441054] ---[ end trace 15abaa2aaf36827f ]---
      
      This is because at write_one_cache_group() when we are COWing a leaf from
      the extent tree we end up allocating a new block group (chunk) and,
      because we have hit a threshold on the number of bytes reserved for system
      chunks, we attempt to finalize the creation of new block groups from the
      current transaction, by calling btrfs_create_pending_block_groups().
      However here we also need to modify the extent tree in order to insert
      a block group item, and if the location for this new block group item
      happens to be in the same leaf that we were COWing earlier, we deadlock
      since btrfs_search_slot() tries to write lock the extent buffer that we
      locked before at write_one_cache_group().
      
      We have already hit similar cases in the past and commit d9a0540a
      ("Btrfs: fix deadlock when finalizing block group creation") fixed some
      of those cases by delaying the creation of pending block groups at the
      known specific spots that could lead to a deadlock. This change reworks
      that commit to be more generic so that we don't have to add similar logic
      to every possible path that can lead to a deadlock. This is done by
      making __btrfs_cow_block() disallowing the creation of new block groups
      (setting the transaction's can_flush_pending_bgs to false) before it
      attempts to allocate a new extent buffer for either the extent, chunk or
      device trees, since those are the trees that pending block creation
      modifies. Once the new extent buffer is allocated, it allows creation of
      pending block groups to happen again.
      
      This change depends on a recent patch from Josef which is not yet in
      Linus' tree, named "btrfs: make sure we create all new block groups" in
      order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().
      
      Fixes: d9a0540a ("Btrfs: fix deadlock when finalizing block group creation")
      CC: stable@vger.kernel.org # 4.4+
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753
      Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/Reported-by: NE V <eliventer@gmail.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ea9c846f
    • J
      btrfs: make sure we create all new block groups · 1d6d4a03
      Josef Bacik 提交于
      commit 545e3366db823dc3342ca9d7fea803f829c9062f upstream.
      
      Allocating new chunks modifies both the extent and chunk tree, which can
      trigger new chunk allocations.  So instead of doing list_for_each_safe,
      just do while (!list_empty()) so we make sure we don't exit with other
      pending bg's still on our list.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1d6d4a03
    • J
      btrfs: wait on caching when putting the bg cache · 2974abff
      Josef Bacik 提交于
      commit 3aa7c7a31c26321696b92841d5103461c6f3f517 upstream.
      
      While testing my backport I noticed there was a panic if I ran
      generic/416 generic/417 generic/418 all in a row.  This just happened to
      uncover a race where we had outstanding IO after we destroy all of our
      workqueues, and then we'd go to queue the endio work on those free'd
      workqueues.
      
      This is because we aren't waiting for the caching threads to be done
      before freeing everything up, so to fix this make sure we wait on any
      outstanding caching that's being done before we free up the block group,
      so we're sure to be done with all IO by the time we get to
      btrfs_stop_all_workers().  This fixes the panic I was seeing
      consistently in testing.
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/volumes.c:6112!
      SMP PTI
      Modules linked in:
      CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
      Workqueue: btrfs-cache btrfs_cache_helper
      RIP: 0010:btrfs_map_bio+0x346/0x370
      RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202
      RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000
      RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000
      RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000
      R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00
      FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       btree_submit_bio_hook+0x8a/0xd0
       submit_one_bio+0x5d/0x80
       read_extent_buffer_pages+0x18a/0x320
       btree_read_extent_buffer_pages+0xbc/0x200
       ? alloc_extent_buffer+0x359/0x3e0
       read_tree_block+0x3d/0x60
       read_block_for_search.isra.30+0x1a5/0x360
       btrfs_search_slot+0x41b/0xa10
       btrfs_next_old_leaf+0x212/0x470
       caching_thread+0x323/0x490
       normal_work_helper+0xc5/0x310
       process_one_work+0x141/0x340
       worker_thread+0x44/0x3c0
       kthread+0xf8/0x130
       ? process_one_work+0x340/0x340
       ? kthread_bind+0x10/0x10
       ret_from_fork+0x35/0x40
      RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0
      ---[ end trace 827eb13e50846033 ]---
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: disabled
      ---[ end Kernel panic - not syncing: Fatal exception
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2974abff
    • J
      btrfs: keep trim from interfering with transaction commits · 0a7f6c7e
      Jeff Mahoney 提交于
      commit fee7acc361314df6561208c2d3c0882d663dd537 upstream.
      
      Commit 499f377f (btrfs: iterate over unused chunk space in FITRIM)
      fixed free space trimming, but introduced latency when it was running.
      This is due to it pinning the transaction using both a incremented
      refcount and holding the commit root sem for the duration of a single
      trim operation.
      
      This was to ensure safety but it's unnecessary.  We already hold the the
      chunk mutex so we know that the chunk we're using can't be allocated
      while we're trimming it.
      
      In order to check against chunks allocated already in this transaction,
      we need to check the pending chunks list.  To to that safely without
      joining the transaction (or attaching than then having to commit it) we
      need to ensure that the dev root's commit root doesn't change underneath
      us and the pending chunk lists stays around until we're done with it.
      
      We can ensure the former by holding the commit root sem and the latter
      by pinning the transaction.  We do this now, but the critical section
      covers the trim operation itself and we don't need to do that.
      
      This patch moves the pinning and unpinning logic into helpers and unpins
      the transaction after performing the search and check for pending
      chunks.
      
      Limiting the critical section of the transaction pinning improves the
      latency substantially on slower storage (e.g. image files over NFS).
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0a7f6c7e
    • J
      btrfs: don't attempt to trim devices that don't support it · 4d0dfd8f
      Jeff Mahoney 提交于
      commit 0be88e367fd8fbdb45257615d691f4675dda062f upstream.
      
      We check whether any device the file system is using supports discard in
      the ioctl call, but then we attempt to trim free extents on every device
      regardless of whether discard is supported.  Due to the way we mask off
      EOPNOTSUPP, we can end up issuing the trim operations on each free range
      on devices that don't support it, just wasting time.
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4d0dfd8f
    • J
      btrfs: iterate all devices during trim, instead of fs_devices::alloc_list · 76e59a62
      Jeff Mahoney 提交于
      commit d4e329de5e5e21594df2e0dd59da9acee71f133b upstream.
      
      btrfs_trim_fs iterates over the fs_devices->alloc_list while holding the
      device_list_mutex.  The problem is that ->alloc_list is protected by the
      chunk mutex.  We don't want to hold the chunk mutex over the trim of the
      entire file system.  Fortunately, the ->dev_list list is protected by
      the dev_list mutex and while it will give us all devices, including
      read-only devices, we already just skip the read-only devices.  Then we
      can continue to take and release the chunk mutex while scanning each
      device.
      
      Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      76e59a62
    • Q
      btrfs: Ensure btrfs_trim_fs can trim the whole filesystem · d147f4dc
      Qu Wenruo 提交于
      commit 6ba9fc8e628becf0e3ec94083450d089b0dec5f5 upstream.
      
      [BUG]
      fstrim on some btrfs only trims the unallocated space, not trimming any
      space in existing block groups.
      
      [CAUSE]
      Before fstrim_range passed to btrfs_trim_fs(), it gets truncated to
      range [0, super->total_bytes).  So later btrfs_trim_fs() will only be
      able to trim block groups in range [0, super->total_bytes).
      
      While for btrfs, any bytenr aligned to sectorsize is valid, since btrfs
      uses its logical address space, there is nothing limiting the location
      where we put block groups.
      
      For filesystem with frequent balance, it's quite easy to relocate all
      block groups and bytenr of block groups will start beyond
      super->total_bytes.
      
      In that case, btrfs will not trim existing block groups.
      
      [FIX]
      Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
      can get the unmodified range, which is normally set to [0, U64_MAX].
      Reported-by: NChris Murphy <lists@colorremedies.com>
      Fixes: f4c697e6 ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl")
      CC: <stable@vger.kernel.org> # v4.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d147f4dc
    • Q
      btrfs: Enhance btrfs_trim_fs function to handle error better · c9ee7109
      Qu Wenruo 提交于
      commit 93bba24d4b5ad1e5cd8b43f64e66ff9d6355dd20 upstream.
      
      Function btrfs_trim_fs() doesn't handle errors in a consistent way. If
      error happens when trimming existing block groups, it will skip the
      remaining blocks and continue to trim unallocated space for each device.
      
      The return value will only reflect the final error from device trimming.
      
      This patch will fix such behavior by:
      
      1) Recording the last error from block group or device trimming
         The return value will also reflect the last error during trimming.
         Make developer more aware of the problem.
      
      2) Continuing trimming if possible
         If we failed to trim one block group or device, we could still try
         the next block group or device.
      
      3) Report number of failures during block group and device trimming
         It would be less noisy, but still gives user a brief summary of
         what's going wrong.
      
      Such behavior can avoid confusion for cases like failure to trim the
      first block group and then only unallocated space is trimmed.
      Reported-by: NChris Murphy <lists@colorremedies.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add bg_ret and dev_ret to the messages ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c9ee7109
    • Q
      btrfs: locking: Add extra check in btrfs_init_new_buffer() to avoid deadlock · d4f56c44
      Qu Wenruo 提交于
      commit b72c3aba09a53fc7c1824250d71180ca154517a7 upstream.
      
      [BUG]
      For certain crafted image, whose csum root leaf has missing backref, if
      we try to trigger write with data csum, it could cause deadlock with the
      following kernel WARN_ON():
      
        WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
        CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
        Workqueue: btrfs-endio-write btrfs_endio_write_helper
        RIP: 0010:btrfs_tree_lock+0x3e2/0x400
        Call Trace:
         btrfs_alloc_tree_block+0x39f/0x770
         __btrfs_cow_block+0x285/0x9e0
         btrfs_cow_block+0x191/0x2e0
         btrfs_search_slot+0x492/0x1160
         btrfs_lookup_csum+0xec/0x280
         btrfs_csum_file_blocks+0x2be/0xa60
         add_pending_csums+0xaf/0xf0
         btrfs_finish_ordered_io+0x74b/0xc90
         finish_ordered_fn+0x15/0x20
         normal_work_helper+0xf6/0x500
         btrfs_endio_write_helper+0x12/0x20
         process_one_work+0x302/0x770
         worker_thread+0x81/0x6d0
         kthread+0x180/0x1d0
         ret_from_fork+0x35/0x40
      
      [CAUSE]
      That crafted image has missing backref for csum tree root leaf.  And
      when we try to allocate new tree block, since there is no
      EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot
      and use it.
      
      The extent tree of the image looks like:
      
        Normal image                      |       This fuzzed image
        ----------------------------------+--------------------------------
        BG 29360128                       | BG 29360128
         One empty slot                   |  One empty slot
        29364224: backref to UUID tree    | 29364224: backref to UUID tree
         Two empty slots                  |  Two empty slots
        29376512: backref to CSUM tree    |  One empty slot (bad type) <<<
        29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
        ...                               | ...
      
      Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to
      alloc tree block, it's an valid slot for btrfs.
      
      And for finish_ordered_write, when we need to insert csum, we try to CoW
      csum tree root.
      
      By accident, empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K,
      BG_OFFSET + 12K is already used by tree block COW for other trees, the
      next empty slot is BG_OFFSET + 16K, which should be the backref for CSUM
      tree.
      
      But due to the bad type, btrfs can recognize it and still consider it as
      an empty slot, and will try to use it for csum tree CoW.
      
      Then in the following call trace, we will try to lock the new tree
      block, which turns out to be the old csum tree root which is already
      locked:
      
      btrfs_search_slot() called on csum tree root, which is at 29376512
      |- btrfs_cow_block()
         |- btrfs_set_lock_block()
         |  |- Now locks tree block 29376512 (old csum tree root)
         |- __btrfs_cow_block()
            |- btrfs_alloc_tree_block()
               |- btrfs_reserve_extent()
                  | Now it returns tree block 29376512, which extent tree
                  | shows its empty slot, but it's already hold by csum tree
                  |- btrfs_init_new_buffer()
                     |- btrfs_tree_lock()
                        | Triggers WARN_ON(eb->lock_owner == current->pid)
                        |- wait_event()
                           Wait lock owner to release the lock, but it's
                           locked by ourself, so it will deadlock
      
      [FIX]
      This patch will do the lock_owner and current->pid check at
      btrfs_init_new_buffer().
      So above deadlock can be avoided.
      
      Since such problem can only happen in crafted image, we will still
      trigger kernel warning for later aborted transaction, but with a little
      more meaningful warning message.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405Reported-by: NXu Wen <wen.xu@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4f56c44
    • Q
      btrfs: Handle owner mismatch gracefully when walking up tree · 4f2a4e02
      Qu Wenruo 提交于
      commit 65c6e82becec33731f48786e5a30f98662c86b16 upstream.
      
      [BUG]
      When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
      when trying to recover balance:
      
        kernel BUG at fs/btrfs/extent-tree.c:8956!
        invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
        CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
        RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
        RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
        Call Trace:
         walk_up_tree+0x172/0x1f0 [btrfs]
         btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
         merge_reloc_roots+0xe1/0x1d0 [btrfs]
         btrfs_recover_relocation+0x3ea/0x420 [btrfs]
         open_ctree+0x1af3/0x1dd0 [btrfs]
         btrfs_mount_root+0x66b/0x740 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         btrfs_mount+0x16d/0x890 [btrfs]
         mount_fs+0x3b/0x16a
         vfs_kern_mount.part.9+0x54/0x140
         do_mount+0x1fd/0xda0
         ksys_mount+0xba/0xd0
         __x64_sys_mount+0x21/0x30
         do_syscall_64+0x60/0x210
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      [CAUSE]
      Extent tree corruption.  In this particular case, reloc tree root's
      owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
      corrupted and we failed the owner check in walk_up_tree().
      
      [FIX]
      It's pretty hard to take care of every extent tree corruption, but at
      least we can remove such BUG_ON() and exit more gracefully.
      
      And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
      the same root (which is obviously invalid), we needs to make
      __del_reloc_root() more robust to detect such invalid sharing to avoid
      possible NULL dereference as root->node can be NULL in this case.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411Reported-by: NXu Wen <wen.xu@gatech.edu>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4f2a4e02
  11. 23 8月, 2018 1 次提交
  12. 06 8月, 2018 16 次提交