1. 09 Feb, 2021 1 commit
    • J
      btrfs: fix possible free space tree corruption with online conversion · 348ec0f4
      Josef Bacik committed
      stable inclusion
      from stable-5.10.13
      commit 2175bf57dc9522c58d93dcd474758434a3f05c57
      bugzilla: 47995
      
      --------------------------------
      
      commit 2f96e402 upstream.
      
      While running btrfs/011 in a loop I would often ASSERT() while trying to
      add a new free space entry that already existed, or get an EEXIST while
      adding a new block to the extent tree, which is another indication of
      double allocation.
      
      This occurs because when we do the free space tree population, we create
      the new root and then populate the tree and commit the transaction.
      The problem is when you create a new root, the root node and commit root
      node are the same.  During this initial transaction commit we will run
      all of the delayed refs that were paused during the free space tree
      generation, and thus begin to cache block groups.  While caching block
      groups the caching thread will be reading from the main root for the
      free space tree, so as we make allocations we'll be changing the free
      space tree, which can cause us to add the same range twice which results
      in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
      variety of different errors when running delayed refs because of a
      double allocation.
      
      Fix this by marking the fs_info as unsafe to load the free space tree,
      and fall back on the old slow method.  We could be smarter than this,
      for example caching the block group while we're populating the free
      space tree, but since this is a serious problem I've opted for the
      simplest solution.
      
      CC: stable@vger.kernel.org # 4.9+
      Fixes: a5ed9182 ("Btrfs: implement the free space B-tree")
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
  2. 08 Feb, 2021 1 commit
  3. 26 Oct, 2020 1 commit
    • J
      btrfs: drop the path before adding block group sysfs files · 7837fa88
      Josef Bacik committed
      Dave reported a problem with my rwsem conversion patch where we got the
      following lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-default+ #1297 Not tainted
        ------------------------------------------------------
        kswapd0/76 is trying to acquire lock:
        ffff9d5d25df2530 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
      
        but task is already holding lock:
        ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #4 (fs_reclaim){+.+.}-{0:0}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 fs_reclaim_acquire.part.0+0x25/0x30
      	 kmem_cache_alloc+0x30/0x9c0
      	 alloc_inode+0x81/0x90
      	 iget_locked+0xcd/0x1a0
      	 kernfs_get_inode+0x1b/0x130
      	 kernfs_get_tree+0x136/0x210
      	 sysfs_get_tree+0x1a/0x50
      	 vfs_get_tree+0x1d/0xb0
      	 path_mount+0x70f/0xa80
      	 do_mount+0x75/0x90
      	 __x64_sys_mount+0x8e/0xd0
      	 do_syscall_64+0x2d/0x70
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #3 (kernfs_mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 __mutex_lock+0xa0/0xaf0
      	 kernfs_add_one+0x23/0x150
      	 kernfs_create_dir_ns+0x58/0x80
      	 sysfs_create_dir_ns+0x70/0xd0
      	 kobject_add_internal+0xbb/0x2d0
      	 kobject_add+0x7a/0xd0
      	 btrfs_sysfs_add_block_group_type+0x141/0x1d0 [btrfs]
      	 btrfs_read_block_groups+0x1f1/0x8c0 [btrfs]
      	 open_ctree+0x981/0x1108 [btrfs]
      	 btrfs_mount_root.cold+0xe/0xb0 [btrfs]
      	 legacy_get_tree+0x2d/0x60
      	 vfs_get_tree+0x1d/0xb0
      	 fc_mount+0xe/0x40
      	 vfs_kern_mount.part.0+0x71/0x90
      	 btrfs_mount+0x13b/0x3e0 [btrfs]
      	 legacy_get_tree+0x2d/0x60
      	 vfs_get_tree+0x1d/0xb0
      	 path_mount+0x70f/0xa80
      	 do_mount+0x75/0x90
      	 __x64_sys_mount+0x8e/0xd0
      	 do_syscall_64+0x2d/0x70
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (btrfs-extent-00){++++}-{3:3}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 down_read_nested+0x45/0x220
      	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
      	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
      	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
      	 check_committed_ref+0x69/0x200 [btrfs]
      	 btrfs_cross_ref_exist+0x65/0xb0 [btrfs]
      	 run_delalloc_nocow+0x446/0x9b0 [btrfs]
      	 btrfs_run_delalloc_range+0x61/0x6a0 [btrfs]
      	 writepage_delalloc+0xae/0x160 [btrfs]
      	 __extent_writepage+0x262/0x420 [btrfs]
      	 extent_write_cache_pages+0x2b6/0x510 [btrfs]
      	 extent_writepages+0x43/0x90 [btrfs]
      	 do_writepages+0x40/0xe0
      	 __writeback_single_inode+0x62/0x610
      	 writeback_sb_inodes+0x20f/0x500
      	 wb_writeback+0xef/0x4a0
      	 wb_do_writeback+0x49/0x2e0
      	 wb_workfn+0x81/0x340
      	 process_one_work+0x233/0x5d0
      	 worker_thread+0x50/0x3b0
      	 kthread+0x137/0x150
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (btrfs-fs-00){++++}-{3:3}:
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 down_read_nested+0x45/0x220
      	 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs]
      	 __btrfs_read_lock_root_node+0x3a/0x50 [btrfs]
      	 btrfs_search_slot+0x6d4/0xfd0 [btrfs]
      	 btrfs_lookup_inode+0x3a/0xc0 [btrfs]
      	 __btrfs_update_delayed_inode+0x93/0x2c0 [btrfs]
      	 __btrfs_commit_inode_delayed_items+0x7de/0x850 [btrfs]
      	 __btrfs_run_delayed_items+0x8e/0x140 [btrfs]
      	 btrfs_commit_transaction+0x367/0xbc0 [btrfs]
      	 btrfs_mksubvol+0x2db/0x470 [btrfs]
      	 btrfs_mksnapshot+0x7b/0xb0 [btrfs]
      	 __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
      	 btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
      	 btrfs_ioctl+0xd0b/0x2690 [btrfs]
      	 __x64_sys_ioctl+0x6f/0xa0
      	 do_syscall_64+0x2d/0x70
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 check_prev_add+0x91/0xc60
      	 validate_chain+0xa6e/0x2a20
      	 __lock_acquire+0x582/0xac0
      	 lock_acquire+0xca/0x430
      	 __mutex_lock+0xa0/0xaf0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
      	 btrfs_evict_inode+0x3cc/0x560 [btrfs]
      	 evict+0xd6/0x1c0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x54/0x80
      	 super_cache_scan+0x121/0x1a0
      	 do_shrink_slab+0x16d/0x3b0
      	 shrink_slab+0xb1/0x2e0
      	 shrink_node+0x230/0x6a0
      	 balance_pgdat+0x325/0x750
      	 kswapd+0x206/0x4d0
      	 kthread+0x137/0x150
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> kernfs_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(kernfs_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/76:
         #0: ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffffa40b8b58 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
         #2: ffff9d5d322390e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50
      
        stack backtrace:
        CPU: 2 PID: 76 Comm: kswapd0 Not tainted 5.9.0-default+ #1297
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
        Call Trace:
         dump_stack+0x77/0x97
         check_noncircular+0xff/0x110
         ? save_trace+0x50/0x470
         check_prev_add+0x91/0xc60
         validate_chain+0xa6e/0x2a20
         ? save_trace+0x50/0x470
         __lock_acquire+0x582/0xac0
         lock_acquire+0xca/0x430
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         __mutex_lock+0xa0/0xaf0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         ? __lock_acquire+0x582/0xac0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         ? btrfs_evict_inode+0x30b/0x560 [btrfs]
         ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         btrfs_evict_inode+0x3cc/0x560 [btrfs]
         evict+0xd6/0x1c0
         dispose_list+0x48/0x70
         prune_icache_sb+0x54/0x80
         super_cache_scan+0x121/0x1a0
         do_shrink_slab+0x16d/0x3b0
         shrink_slab+0xb1/0x2e0
         shrink_node+0x230/0x6a0
         balance_pgdat+0x325/0x750
         kswapd+0x206/0x4d0
         ? finish_wait+0x90/0x90
         ? balance_pgdat+0x750/0x750
         kthread+0x137/0x150
         ? kthread_mod_delayed_work+0xc0/0xc0
         ret_from_fork+0x1f/0x30
      
      This happens because we are still holding the path open when we start
      adding the sysfs files for the block groups, which creates a dependency
      on fs_reclaim via the tree lock.  Fix this by dropping the path before
      we start doing anything with sysfs.
      Reported-by: David Sterba <dsterba@suse.com>
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 07 Oct, 2020 6 commits
    • J
      btrfs: do not create raid sysfs entries under any locks · 49ea112d
      Josef Bacik committed
      While running xfstests btrfs/177 I got the following lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-rc3+ #5 Not tainted
        ------------------------------------------------------
        kswapd0/100 is trying to acquire lock:
        ffff97066aa56760 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
      
        but task is already holding lock:
        ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (fs_reclaim){+.+.}-{0:0}:
      	 fs_reclaim_acquire+0x65/0x80
      	 slab_pre_alloc_hook.constprop.0+0x20/0x200
      	 kmem_cache_alloc+0x37/0x270
      	 alloc_inode+0x82/0xb0
      	 iget_locked+0x10d/0x2c0
      	 kernfs_get_inode+0x1b/0x130
      	 kernfs_get_tree+0x136/0x240
      	 sysfs_get_tree+0x16/0x40
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x434/0xc00
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (kernfs_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 kernfs_add_one+0x23/0x150
      	 kernfs_create_dir_ns+0x7a/0xb0
      	 sysfs_create_dir_ns+0x60/0xb0
      	 kobject_add_internal+0xc0/0x2c0
      	 kobject_add+0x6e/0x90
      	 btrfs_sysfs_add_block_group_type+0x102/0x160
      	 btrfs_make_block_group+0x167/0x230
      	 btrfs_alloc_chunk+0x54f/0xb80
      	 btrfs_chunk_alloc+0x18e/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_insert_empty_items+0x64/0xb0
      	 btrfs_new_inode+0x225/0x730
      	 btrfs_create+0xab/0x1f0
      	 lookup_open.isra.0+0x52d/0x690
      	 path_openat+0x2a7/0x9e0
      	 do_filp_open+0x75/0x100
      	 do_sys_openat2+0x7b/0x130
      	 __x64_sys_openat+0x46/0x70
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 btrfs_chunk_alloc+0x125/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_lookup_inode+0x2a/0x8f
      	 __btrfs_update_delayed_inode+0x80/0x240
      	 btrfs_commit_inode_delayed_inode+0x119/0x120
      	 btrfs_evict_inode+0x357/0x500
      	 evict+0xcf/0x1f0
      	 do_unlinkat+0x1a9/0x2b0
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x119c/0x1fc0
      	 lock_acquire+0xa7/0x3d0
      	 __mutex_lock+0x7e/0x7e0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x330
      	 btrfs_evict_inode+0x24c/0x500
      	 evict+0xcf/0x1f0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x44/0x50
      	 super_cache_scan+0x161/0x1e0
      	 do_shrink_slab+0x178/0x3c0
      	 shrink_slab+0x17c/0x290
      	 shrink_node+0x2b2/0x6d0
      	 balance_pgdat+0x30a/0x670
      	 kswapd+0x213/0x4c0
      	 kthread+0x138/0x160
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> kernfs_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(kernfs_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/100:
         #0: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffff9fd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
         #2: ffff9706629780e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
      
        stack backtrace:
        CPU: 1 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #5
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb8
         check_noncircular+0x12d/0x150
         __lock_acquire+0x119c/0x1fc0
         lock_acquire+0xa7/0x3d0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         __mutex_lock+0x7e/0x7e0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? lock_acquire+0xa7/0x3d0
         ? find_held_lock+0x2b/0x80
         __btrfs_release_delayed_node.part.0+0x3f/0x330
         btrfs_evict_inode+0x24c/0x500
         evict+0xcf/0x1f0
         dispose_list+0x48/0x70
         prune_icache_sb+0x44/0x50
         super_cache_scan+0x161/0x1e0
         do_shrink_slab+0x178/0x3c0
         shrink_slab+0x17c/0x290
         shrink_node+0x2b2/0x6d0
         balance_pgdat+0x30a/0x670
         kswapd+0x213/0x4c0
         ? _raw_spin_unlock_irqrestore+0x41/0x50
         ? add_wait_queue_exclusive+0x70/0x70
         ? balance_pgdat+0x670/0x670
         kthread+0x138/0x160
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      This happens because when we link in a block group with a new raid index
      type we'll create the corresponding sysfs entries for it.  This is
      problematic because while restriping we're holding the chunk_mutex, and
      while mounting we're holding the tree locks.
      
      Fixing this isn't pretty: we move the call to the sysfs stuff into the
      btrfs_create_pending_block_groups() work, where we're not holding any
      locks.  This creates a slight race where other threads could see that
      there's no sysfs kobj for that raid type, and race to create the
      sysfs dir.  Fix this by wrapping the creation in space_info->lock, so we
      only get one thread calling kobject_add() for the new directory.  We
      don't worry about the lock on cleanup as it only gets deleted on
      unmount.
      
      On mount it's more straightforward: we loop through the space_infos
      already, so just check every raid index in each space_info and add the
      sysfs entries for the corresponding block groups.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: kill the RCU protection for fs_info->space_info · 72804905
      Josef Bacik committed
      We have this thing wrapped in an RCU lock, but it's really not needed.
      We create all the space_info's on mount, and we destroy them on unmount.
      The list never changes and we're protected from messing with it by the
      normal mount/umount path, so kill the RCU stuff around it.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • M
      btrfs: make read_block_group_item return void · 4c448ce8
      Marcos Paulo de Souza committed
      Since its inclusion in 9afc6649 ("btrfs: block-group: refactor how
      we read one block group item") this function has always returned 0, so
      there is no need to check the returned value.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: call btrfs_try_granting_tickets when reserving space · 99ffb43e
      Josef Bacik committed
      If we have compression on we could free up more space than we reserved,
      and thus be able to make a space reservation.  Add the call for this
      scenario.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: call btrfs_try_granting_tickets when freeing reserved bytes · 3308234a
      Josef Bacik committed
      We were missing a call to btrfs_try_granting_tickets in
      btrfs_free_reserved_bytes, so add it to handle the case where we're able
      to satisfy an allocation because we've freed a pending reservation.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • R
      btrfs: delete duplicated words + other fixes in comments · 260db43c
      Randy Dunlap committed
      Delete repeated words in fs/btrfs/.
      {to, the, a, and old}
      and change "into 2 part" to "into 2 parts".
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 27 Aug, 2020 1 commit
    • M
      btrfs: block-group: fix free-space bitmap threshold · e3e39c72
      Marcos Paulo de Souza committed
      [BUG]
      After commit 9afc6649 ("btrfs: block-group: refactor how we read one
      block group item"), cache->length is being assigned after calling
      btrfs_create_block_group_cache. This causes a problem since
      set_free_space_tree_thresholds calculates the free-space threshold to
      decide if the free-space tree should convert from extents to bitmaps.
      
      The current code calls set_free_space_tree_thresholds with cache->length
      being 0, which then makes cache->bitmap_high_thresh zero. This implies
      the system will always use bitmap instead of extents, which is not
      desired if the block group is not fragmented.
      
      This behavior can be seen by a test that expects to repair systems
      with FREE_SPACE_EXTENT and FREE_SPACE_BITMAP, but the current code only
      created FREE_SPACE_BITMAP.
      
      [FIX]
      Call set_free_space_tree_thresholds after setting cache->length. There
      is now a WARN_ON in set_free_space_tree_thresholds to help prevent
      the same mistake from happening again in the future.
      
      Link: https://github.com/kdave/btrfs-progs/issues/251
      Fixes: 9afc6649 ("btrfs: block-group: refactor how we read one block group item")
      CC: stable@vger.kernel.org # 5.8+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 27 Jul, 2020 9 commits
  7. 17 Jun, 2020 2 commits
    • F
      btrfs: fix race between block group removal and block group creation · ffcb9d44
      Filipe Manana committed
      There is a race between block group removal and block group creation
      when the removal is completed by a task running fitrim or scrub. When
      this happens we end up failing the block group creation with an error
      -EEXIST since we attempt to insert a duplicate block group item key
      in the extent tree. That results in a transaction abort.
      
      The race happens like this:
      
      1) Task A is doing a fitrim, and at btrfs_trim_block_group() it freezes
         block group X with btrfs_freeze_block_group() (until very recently
         that was named btrfs_get_block_group_trimming());
      
      2) Task B starts removing block group X, either because it's now unused
         or due to relocation for example. So at btrfs_remove_block_group(),
         while holding the chunk mutex and the block group's lock, it sets
         the 'removed' flag of the block group and it sets the local variable
         'remove_em' to false, because the block group is currently frozen
         (its 'frozen' counter is > 0, until very recently this counter was
         named 'trimming');
      
      3) Task B unlocks the block group and the chunk mutex;
      
      4) Task A is done trimming the block group and unfreezes the block group
         by calling btrfs_unfreeze_block_group() (until very recently this was
         named btrfs_put_block_group_trimming()). In this function we lock the
         block group and set the local variable 'cleanup' to true because we
         were able to decrement the block group's 'frozen' counter down to 0 and
         the flag 'removed' is set in the block group.
      
         Since 'cleanup' is set to true, it locks the chunk mutex and removes
         the extent mapping representing the block group from the mapping tree;
      
      5) Task C allocates a new block group Y and it picks up the logical address
         that block group X had as the logical address for Y, because X was the
         block group with the highest logical address and now the second block
         group with the highest logical address, the last in the fs mapping tree,
         ends at an offset corresponding to block group X's logical address (this
         logical address selection is done at volumes.c:find_next_chunk()).
      
         At this point the new block group Y does not have yet its item added
         to the extent tree (nor the corresponding device extent items and
         chunk item in the device and chunk trees). The new group Y is added to
         the list of pending block groups in the transaction handle;
      
      6) Before task B proceeds to removing the block group item for block
         group X from the extent tree, which has a key matching:
      
         (X logical offset, BTRFS_BLOCK_GROUP_ITEM_KEY, length)
      
         task C while ending its transaction handle calls
         btrfs_create_pending_block_groups(), which finds block group Y and
         tries to insert the block group item for Y into the extent tree, which
         fails with -EEXIST since logical offset is the same that X had and
         task B hasn't yet deleted the key from the extent tree.
         This failure results in a transaction abort, producing a stack like
         the following:
      
      ------------[ cut here ]------------
       BTRFS: Transaction aborted (error -17)
       WARNING: CPU: 2 PID: 19736 at fs/btrfs/block-group.c:2074 btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
       Modules linked in: btrfs blake2b_generic xor raid6_pq (...)
       CPU: 2 PID: 19736 Comm: fsstress Tainted: G        W         5.6.0-rc7-btrfs-next-58 #5
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs]
       Code: ff ff ff 48 8b 55 50 f0 48 (...)
       RSP: 0018:ffffa4160a1c7d58 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: ffff961581909d98 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: ffffffffb3d63990 RDI: 0000000000000001
       RBP: ffff9614f3356a58 R08: 0000000000000000 R09: 0000000000000001
       R10: ffff9615b65b0040 R11: 0000000000000000 R12: ffff961581909c10
       R13: ffff9615b0c32000 R14: ffff9614f3356ab0 R15: ffff9614be779000
       FS:  00007f2ce2841e80(0000) GS:ffff9615bae00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000555f18780000 CR3: 0000000131d34005 CR4: 00000000003606e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_start_dirty_block_groups+0x398/0x4e0 [btrfs]
        btrfs_commit_transaction+0xd0/0xc50 [btrfs]
        ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs]
        ? __ia32_sys_fdatasync+0x20/0x20
        iterate_supers+0xdb/0x180
        ksys_sync+0x60/0xb0
        __ia32_sys_sync+0xa/0x10
        do_syscall_64+0x5c/0x280
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
       RIP: 0033:0x7f2ce1d4d5b7
       Code: 83 c4 08 48 3d 01 (...)
       RSP: 002b:00007ffd8b558c58 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
       RAX: ffffffffffffffda RBX: 000000000000002c RCX: 00007f2ce1d4d5b7
       RDX: 00000000ffffffff RSI: 00000000186ba07b RDI: 000000000000002c
       RBP: 0000555f17b9e520 R08: 0000000000000012 R09: 000000000000ce00
       R10: 0000000000000078 R11: 0000000000000202 R12: 0000000000000032
       R13: 0000000051eb851f R14: 00007ffd8b558cd0 R15: 0000555f1798ec20
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
       softirqs last  enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace bd7c03622e0b0a9c ]---
      
      Fix this simply by making btrfs_remove_block_group() remove the block
      group's item from the extent tree before it flags the block group as
      removed. Also make the free space deletion from the free space tree
      before flagging the block group as removed, to avoid a similar race
      with adding and removing free space entries for the free space tree.
      
      Fixes: 04216820 ("Btrfs: fix race between fs trimming and block group remove/allocation")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
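The ordering problem in the race above can be replayed deterministically with a toy key set. This is a minimal sketch: `toy_extent_tree`, `toy_insert()`, `toy_delete()` and `toy_replay()` are hypothetical names, the "tree" is a flat array, and the tasks are run one after another in the order the commit message describes rather than as real threads.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_KEYS 8

/* Toy "extent tree": a flat set of block group start offsets. */
struct toy_extent_tree {
    uint64_t keys[TOY_MAX_KEYS];
    size_t nr;
};

/* Insert a key; duplicate keys are rejected with -EEXIST. */
static int toy_insert(struct toy_extent_tree *t, uint64_t key)
{
    for (size_t i = 0; i < t->nr; i++)
        if (t->keys[i] == key)
            return -EEXIST;
    if (t->nr == TOY_MAX_KEYS)
        return -ENOSPC;
    t->keys[t->nr++] = key;
    return 0;
}

/* Delete a key if present (order of remaining keys is not preserved). */
static void toy_delete(struct toy_extent_tree *t, uint64_t key)
{
    for (size_t i = 0; i < t->nr; i++) {
        if (t->keys[i] == key) {
            t->keys[i] = t->keys[--t->nr];
            return;
        }
    }
}

/*
 * Replay the interleaving and return the result of task C's insert
 * for the new block group Y, which reuses X's logical address.
 */
static int toy_replay(bool delete_item_before_flagging)
{
    struct toy_extent_tree tree = { .nr = 0 };
    const uint64_t x_start = 1ULL << 30;
    int ret;

    ret = toy_insert(&tree, x_start);   /* block group X exists */
    assert(ret == 0);

    /* Task B starts removing X. */
    if (delete_item_before_flagging)
        toy_delete(&tree, x_start);     /* fixed ordering */
    /* B flags X as removed and unlocks; task A unfreezes X, so the
     * mapping goes away and X's logical address becomes reusable. */

    /* Task C creates Y at X's old address and inserts its item. */
    ret = toy_insert(&tree, x_start);

    /* In the old ordering, B deletes X's item only now: too late. */
    if (!delete_item_before_flagging)
        toy_delete(&tree, x_start);

    return ret;
}
```

Calling `toy_replay(false)` models the old ordering and yields `-EEXIST`, the error behind the transaction abort; `toy_replay(true)` models the fixed ordering and succeeds.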
    • F
      btrfs: fix a block group ref counter leak after failure to remove block group · 9fecd132
      Filipe Manana committed
      When removing a block group, if we fail to delete the block group's item
      from the extent tree, we jump to the 'out' label and end up decrementing
      the block group's reference count once only (by 1), resulting in a counter
      leak because the block group at that point was already removed from the
      block group cache rbtree - so we have to decrement the reference count
      twice, once for the rbtree and once for our lookup at the start of the
      function.
      
      There is a second bug where if removing the free space tree entries (the
      call to remove_block_group_free_space()) fails we end up jumping to the
      'out_put_group' label but end up decrementing the reference count only
      once, when we should have done it twice, since we have already removed
      the block group from the block group cache rbtree. This happens because
      the reference count decrement for the rbtree reference happens after
      attempting to remove the free space tree entries, which is far away from
      the place where we remove the block group from the rbtree.
      
      To make things less error prone, decrement the reference count for the
      rbtree immediately after removing the block group from it. This also
      eliminates the need for two different exit labels on error, renaming
      'out_put_group' to just 'out' and removing the old 'out'.
      
      Fixes: f6033c5e ("btrfs: fix block group leak when removing fails")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
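The reference accounting that this fix settles on can be sketched with a toy counter. The names below (`toy_block_group`, `toy_remove_block_group()`, the `fail_late` switch) are hypothetical and only model the two references the commit message talks about: one held by the rbtree and one taken by the lookup at the start of the function.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy block group with a plain integer reference count. */
struct toy_block_group {
    int refs;        /* starts at 1: the reference held by the rbtree */
    bool in_rbtree;
};

/*
 * Model of the fixed flow: the rbtree reference is dropped right where
 * the block group leaves the rbtree, so every later error path only has
 * to drop the lookup reference at the single 'out' label.
 */
static int toy_remove_block_group(struct toy_block_group *bg, bool fail_late)
{
    int ret = 0;

    assert(bg->in_rbtree);
    bg->refs++;             /* lookup reference */

    bg->in_rbtree = false;  /* remove from the rbtree ... */
    bg->refs--;             /* ... and drop its reference immediately */

    if (fail_late) {        /* e.g. a later deletion step fails */
        ret = -EIO;
        goto out;
    }

    /* remaining removal work would go here */
out:
    bg->refs--;             /* drop the lookup reference */
    return ret;
}
```

Whether or not the late step fails, the counter ends at 0 in this model, which is the balance the two bugs described above were breaking.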
  8. 25 May, 2020 10 commits
  9. 23 Apr, 2020 2 commits
    • X
      btrfs: fix block group leak when removing fails · f6033c5e
      Xiyu Yang committed
      btrfs_remove_block_group() invokes btrfs_lookup_block_group(), which
      returns a local reference of the block group that contains the given
      bytenr to "block_group" with increased refcount.
      
      When btrfs_remove_block_group() returns, "block_group" becomes invalid,
      so the refcount should be decreased to keep refcount balanced.
      
      The reference counting issue happens in several exception handling paths
      of btrfs_remove_block_group(). When one of those error scenarios occurs,
      for example when btrfs_alloc_path() returns NULL, the function forgets to
      drop the reference taken by btrfs_lookup_block_group(), which causes a
      refcount leak.
      
      Fix this issue by jumping to "out_put_group" label and calling
      btrfs_put_block_group() when those error scenarios occur.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      btrfs: fix memory leak of transaction when deleting unused block group · 5150bf19
      Filipe Manana committed
      When cleaning pinned extents right before deleting an unused block group,
      we check if there's still a previous transaction running and if so we
      increment its reference count before using it for cleaning pinned ranges
      in its pinned extents iotree. However we ended up never decrementing the
      reference count after using the transaction, resulting in a memory leak.
      
      Fix it by decrementing the reference count.
      
      Fixes: fe119a6e ("btrfs: switch to per-transaction pinned extents")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 09 Apr, 2020 1 commit
    • F
      btrfs: fix reclaim counter leak of space_info objects · d611add4
      Filipe Manana committed
      Whenever we add a ticket to a space_info object we increment the object's
      reclaim_size counter with the ticket's bytes, and we decrement it by
      the corresponding amount only when we are able to grant the requested
      space to the ticket. When we are not able to grant the space to a ticket,
      or when the ticket is removed due to a signal (e.g. an application has
      received SIGTERM from the terminal), we never decrement the counter by
      the corresponding bytes from the ticket. This leak can cause the
      space reclaim code to later do much more work than necessary. So fix it
      by decrementing the counter when those two cases happen as well.
      
      Fixes: db161806 ("btrfs: account ticket size at add/delete time")
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
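The counter balance that this fix restores can be shown with a toy model. This is a sketch only: `toy_space_info`, `toy_ticket`, `toy_add_ticket()` and `toy_finish_ticket()` are hypothetical names, and the three outcomes mirror the cases named in the commit message (space granted, space not granted, ticket removed by a signal).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy space_info that only tracks the bytes wanted by queued tickets. */
struct toy_space_info {
    uint64_t reclaim_size;
};

struct toy_ticket {
    uint64_t bytes;
};

enum toy_outcome { TOY_GRANTED, TOY_FAILED, TOY_INTERRUPTED };

/* Queueing a ticket accounts its bytes in reclaim_size. */
static void toy_add_ticket(struct toy_space_info *si,
                           const struct toy_ticket *t)
{
    si->reclaim_size += t->bytes;
}

/*
 * Finishing a ticket must undo the accounting on every outcome, not
 * only when the space was granted; otherwise reclaim_size leaks.
 */
static int toy_finish_ticket(struct toy_space_info *si,
                             const struct toy_ticket *t,
                             enum toy_outcome outcome)
{
    assert(si->reclaim_size >= t->bytes);
    si->reclaim_size -= t->bytes;

    switch (outcome) {
    case TOY_GRANTED:
        return 0;
    case TOY_FAILED:
        return -ENOSPC;
    default:
        return -EINTR;
    }
}
```

After every queued ticket has finished, by whichever path, `reclaim_size` is back to 0 in this model, so later reclaim work is not inflated by stale bytes.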
  11. 24 Mar, 2020 4 commits
  12. 21 Mar, 2020 1 commit
  13. 31 Jan, 2020 1 commit