- 05 Dec 2019, 1 commit
Submitted by Filipe Manana

[ Upstream commit 0e6ec385b55f6001da8c6b1532494241e52c550d ]

We can have a lot of freed extents during the life span of a transaction, so the red black tree that keeps track of the ranges of each freed extent (fs_info->freed_extents[]) can get quite big. When finishing a transaction commit we find each range, process it (discard the extents, unpin them) and then remove it from the red black tree.

We can use an extent state record as a cache when searching for a range, so that when we clean the range we can use the cached extent state we passed to the search function instead of iterating the red black tree again. Doing things as fast as possible when finishing a transaction (in state TRANS_STATE_UNBLOCKED) is convenient as it reduces the time we block another task that wants to commit the next transaction.

So change clear_extent_dirty() to allow an optional extent state record to be passed as an argument, which will be passed down to __clear_extent_bit.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
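As a rough sketch of the caller pattern this enables (assuming the long-standing find_first_extent_bit()/free_extent_state() helpers; this is illustrative, not the upstream diff):

    struct extent_state *cached_state = NULL;
    u64 start, end;

    /* find_first_extent_bit() can hand back the extent_state it found... */
    while (!find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY,
                                  &cached_state)) {
            /* ...discard/unpin the range... */

            /* ...and clear_extent_dirty() reuses that record instead of
             * walking the red-black tree a second time */
            clear_extent_dirty(unpin, start, end, &cached_state);
            free_extent_state(cached_state);
            cached_state = NULL;
    }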
- 06 Nov 2019, 1 commit
Submitted by Qu Wenruo

[ Upstream commit 8702ba9396bf7bbae2ab93c94acd4bd37cfa4f09 ]

[Background]
Btrfs qgroup uses two types of reserved space for METADATA space: PERTRANS and PREALLOC.

PERTRANS is metadata space reserved for each transaction started by btrfs_start_transaction(), while PREALLOC is for delalloc, where we reserve space before joining a transaction; it is finally converted to PERTRANS after the writeback is done.

[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space. The most obvious one is:

In btrfs_buffered_write():

    btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);

We always free qgroup PREALLOC meta space.

While in btrfs_truncate_block():

    btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));

We only free qgroup PREALLOC meta space when something went wrong.

[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(): we should always free PREALLOC metadata space.

The reason is that the btrfs_delalloc_* mechanism works by:

- Reserving metadata first, even if it's not necessary, in btrfs_delalloc_reserve_metadata().

- Freeing the unused metadata space, normally in:

    btrfs_delalloc_release_extents()
    |- btrfs_inode_rsv_release()
       Here we do the calculation on whether we should release or not.

E.g. for a 64K buffered write, the metadata rsv works like:

    /* The first page */
    reserve_meta:  num_bytes=calc_inode_reservations()
    free_meta:     num_bytes=0
    total:         num_bytes=calc_inode_reservations()
    /* The first page caused one outstanding extent, thus needs metadata rsv */

    /* The 2nd page */
    reserve_meta:  num_bytes=calc_inode_reservations()
    free_meta:     num_bytes=calc_inode_reservations()
    total:         not changed
    /* The 2nd page doesn't cause a new outstanding extent, needs no new
       meta rsv, so we free what we have reserved */

    /* The 3rd~16th pages */
    reserve_meta:  num_bytes=calc_inode_reservations()
    free_meta:     num_bytes=calc_inode_reservations()
    total:         not changed (still space for one outstanding extent)

This means, if btrfs_delalloc_release_extents() determines to free some space, then that space should be freed NOW. So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() rather than btrfs_qgroup_convert_reserved_meta().

The good news is:

- The callers are not that hot. The hottest caller is btrfs_buffered_write(), which is already fixed by commit 336a8bb8 ("btrfs: Fix wrong btrfs_delalloc_release_extents parameter"). Thus it's not that easy to cause a false EDQUOT.

- The early transaction commit for qgroup would hide the bug. Since commit f5fef4593653 ("btrfs: qgroup: Make qgroup async transaction commit more aggressive"), when btrfs qgroup metadata free space is low, it will try to commit the transaction and free the wrongly converted PERTRANS space, so it's not that easy to hit such a bug.

[FIX]
So to fix the problem, remove the @qgroup_free parameter from btrfs_delalloc_release_extents(), and always pass true to btrfs_inode_rsv_release().

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 43b18595 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
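A prototype-level sketch of the interface change in [FIX]; the exact signatures may differ between stable branches:

    /* before: each call site decided whether PREALLOC qgroup space was freed */
    void btrfs_delalloc_release_extents(struct btrfs_inode *inode,
                                        u64 num_bytes, bool qgroup_free);

    /* after: the flag is gone, PREALLOC qgroup space is always freed */
    void btrfs_delalloc_release_extents(struct btrfs_inode *inode,
                                        u64 num_bytes);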
- 29 Oct 2019, 1 commit
Submitted by Qu Wenruo

commit 4b654acdae850f48b8250b9a578a4eaa518c7a6f upstream.

In btrfs_read_block_groups(), if we have an invalid block group which has mixed type (DATA|METADATA) while the fs doesn't have the MIXED_GROUPS feature, we error out without freeing the block group cache.

This patch adds the missing btrfs_put_block_group() to prevent a memory leak.

Note for stable backports: the file to patch in versions <= 5.3 is fs/btrfs/extent-tree.c

Fixes: 49303381 ("Btrfs: bail out if block group has different mixed flag")
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
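A sketch of the fixed error path, reconstructed from the description (the condition and message are illustrative, not copied from the tree):

    if ((cache->flags & BTRFS_BLOCK_GROUP_DATA) &&
        (cache->flags & BTRFS_BLOCK_GROUP_METADATA) && !mixed) {
            btrfs_err(info,
                      "bg %llu is a mixed block group but filesystem hasn't enabled mixed block groups",
                      cache->key.objectid);
            btrfs_put_block_group(cache);   /* the previously missing drop */
            ret = -EINVAL;
            goto error;
    }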
- 05 Oct 2019, 1 commit
Submitted by Qu Wenruo

[ Upstream commit 2a28468e525f3924efed7f29f2bc5a2926e7e19a ]

[BUG]
With a fuzzed image and the MIXED_GROUPS super flag, we can hit the following BUG_ON():

    kernel BUG at fs/btrfs/delayed-ref.c:491!
    invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
    CPU: 0 PID: 1849 Comm: sync Tainted: G O 5.2.0-custom #27
    RIP: 0010:update_existing_head_ref.cold+0x44/0x46 [btrfs]
    Call Trace:
     add_delayed_ref_head+0x20c/0x2d0 [btrfs]
     btrfs_add_delayed_tree_ref+0x1fc/0x490 [btrfs]
     btrfs_free_tree_block+0x123/0x380 [btrfs]
     __btrfs_cow_block+0x435/0x500 [btrfs]
     btrfs_cow_block+0x110/0x240 [btrfs]
     btrfs_search_slot+0x230/0xa00 [btrfs]
     ? __lock_acquire+0x105e/0x1e20
     btrfs_insert_empty_items+0x67/0xc0 [btrfs]
     alloc_reserved_file_extent+0x9e/0x340 [btrfs]
     __btrfs_run_delayed_refs+0x78e/0x1240 [btrfs]
     ? kvm_clock_read+0x18/0x30
     ? __sched_clock_gtod_offset+0x21/0x50
     btrfs_run_delayed_refs.part.0+0x4e/0x180 [btrfs]
     btrfs_run_delayed_refs+0x23/0x30 [btrfs]
     btrfs_commit_transaction+0x53/0x9f0 [btrfs]
     btrfs_sync_fs+0x7c/0x1c0 [btrfs]
     ? __ia32_sys_fdatasync+0x20/0x20
     sync_fs_one_sb+0x23/0x30
     iterate_supers+0x95/0x100
     ksys_sync+0x62/0xb0
     __ia32_sys_sync+0xe/0x20
     do_syscall_64+0x65/0x240
     entry_SYSCALL_64_after_hwframe+0x49/0xbe

[CAUSE]
This situation is caused by several factors:

- Fuzzed image: the extent tree of this fs is missing one backref for the extent tree root, so we can allocate space from that slot.

- MIXED_BG feature: the super block has the MIXED_BG flag set.

- No mixed block groups exist: all block groups are just regular ones.

This makes the data space_info->block_groups[] contain metadata block groups, and when we reserve space for data, we can use space in a metadata block group.

Then we hit the following file operations:

- fallocate: we need to allocate data extents. find_free_extent() chooses a metadata block group to allocate from, and picks the space of the extent tree root, since its backref is missing. This generates one delayed ref head with is_data = 1.

- extent tree update: we need to update the extent tree at run_delayed_ref time. This generates one delayed ref head with is_data = 0, for the same bytenr of the old extent tree root.

Then we trigger the BUG_ON().

[FIX]
The quick fix here is to check block_group->flags before using it. The problem can only happen for a MIXED_GROUPS fs. Regular filesystems won't have a space_info with the DATA|METADATA flag, and there is no way to hit the bug.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203255
Reported-by: Jungyeon Yoon <jungyeon.yoon@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
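An illustrative reading of the [FIX] paragraph only, not the exact patch (the mask name exists in the uapi headers, but where and how the check lands is an assumption):

    /* refuse a candidate block group whose type doesn't match the request */
    if (!(block_group->flags & (flags & BTRFS_BLOCK_GROUP_TYPE_MASK))) {
            /* on a corrupted MIXED_GROUPS fs this slot is not what we
             * asked for; skip it instead of mixing data/metadata refs */
            goto loop;
    }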
- 16 Sep 2019, 1 commit
Submitted by Filipe Manana

[ Upstream commit 8eaf40c0e24e98899a0f3ac9d25a33aafe13822a ]

If a task is removing the block group that currently has the highest start offset amongst all existing block groups, there is a short time window where it races with a concurrent block group allocation, resulting in a transaction abort with an error code of EEXIST.

The following diagram explains the race in detail:

    Task A                                   Task B

    btrfs_remove_block_group(bg offset X)

      remove_extent_mapping(em offset X)
      -> removes extent map X from the tree
         of extent maps (fs_info->mapping_tree),
         so the next call to find_next_chunk()
         will return offset X

                                             btrfs_alloc_chunk()
                                               find_next_chunk()
                                               --> returns offset X
                                               __btrfs_alloc_chunk(offset X)
                                                 btrfs_make_block_group()
                                                   btrfs_create_block_group_cache()
                                                   --> creates btrfs_block_group_cache
                                                       object with a key corresponding
                                                       to the block group item in the
                                                       extent tree, the key is:
                                                       (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
                                                   --> adds the btrfs_block_group_cache
                                                       object to the new_bgs list of
                                                       the transaction handle

                                             btrfs_end_transaction(trans handle)
                                               __btrfs_end_transaction()
                                                 btrfs_create_pending_block_groups()
                                                 --> sees the new btrfs_block_group_cache
                                                     in the new_bgs list of the
                                                     transaction handle
                                                 --> its call to btrfs_insert_item()
                                                     fails with -EEXIST when attempting
                                                     to insert the block group item key
                                                     (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)
                                                     because task A has not removed
                                                     that key yet
                                                 --> aborts the running transaction
                                                     with error -EEXIST

      btrfs_del_item()
      -> removes the block group's key from the
         extent tree, key is
         (offset X, BTRFS_BLOCK_GROUP_ITEM_KEY, 1G)

A sample transaction abort trace:

    [78912.403537] ------------[ cut here ]------------
    [78912.403811] BTRFS: Transaction aborted (error -17)
    [78912.404082] WARNING: CPU: 2 PID: 20465 at fs/btrfs/extent-tree.c:10551 btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
    (...)
    [78912.405642] CPU: 2 PID: 20465 Comm: btrfs Tainted: G W 5.0.0-btrfs-next-46 #1
    [78912.405941] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
    [78912.406586] RIP: 0010:btrfs_create_pending_block_groups+0x196/0x250 [btrfs]
    (...)
    [78912.407636] RSP: 0018:ffff9d3d4b7e3b08 EFLAGS: 00010282
    [78912.407997] RAX: 0000000000000000 RBX: ffff90959a3796f0 RCX: 0000000000000006
    [78912.408369] RDX: 0000000000000007 RSI: 0000000000000001 RDI: ffff909636b16860
    [78912.408746] RBP: ffff909626758a58 R08: 0000000000000000 R09: 0000000000000000
    [78912.409144] R10: ffff9095ff462400 R11: 0000000000000000 R12: ffff90959a379588
    [78912.409521] R13: ffff909626758ab0 R14: ffff9095036c0000 R15: ffff9095299e1158
    [78912.409899] FS: 00007f387f16f700(0000) GS:ffff909636b00000(0000) knlGS:0000000000000000
    [78912.410285] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [78912.410673] CR2: 00007f429fc87cbc CR3: 000000014440a004 CR4: 00000000003606e0
    [78912.411095] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [78912.411496] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [78912.411898] Call Trace:
    [78912.412318]  __btrfs_end_transaction+0x5b/0x1c0 [btrfs]
    [78912.412746]  btrfs_inc_block_group_ro+0xcf/0x160 [btrfs]
    [78912.413179]  scrub_enumerate_chunks+0x188/0x5b0 [btrfs]
    [78912.413622]  ? __mutex_unlock_slowpath+0x100/0x2a0
    [78912.414078]  btrfs_scrub_dev+0x2ef/0x720 [btrfs]
    [78912.414535]  ? __sb_start_write+0xd4/0x1c0
    [78912.414963]  ? mnt_want_write_file+0x24/0x50
    [78912.415403]  btrfs_ioctl+0x17fb/0x3120 [btrfs]
    [78912.415832]  ? lock_acquire+0xa6/0x190
    [78912.416256]  ? do_vfs_ioctl+0xa2/0x6f0
    [78912.416685]  ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
    [78912.417116]  do_vfs_ioctl+0xa2/0x6f0
    [78912.417534]  ? __fget+0x113/0x200
    [78912.417954]  ksys_ioctl+0x70/0x80
    [78912.418369]  __x64_sys_ioctl+0x16/0x20
    [78912.418812]  do_syscall_64+0x60/0x1b0
    [78912.419231]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [78912.419644] RIP: 0033:0x7f3880252dd7
    (...)
    [78912.420957] RSP: 002b:00007f387f16ed68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    [78912.421426] RAX: ffffffffffffffda RBX: 000055f5becc1df0 RCX: 00007f3880252dd7
    [78912.421889] RDX: 000055f5becc1df0 RSI: 00000000c400941b RDI: 0000000000000003
    [78912.422354] RBP: 0000000000000000 R08: 00007f387f16f700 R09: 0000000000000000
    [78912.422790] R10: 00007f387f16f700 R11: 0000000000000246 R12: 0000000000000000
    [78912.423202] R13: 00007ffda49c266f R14: 0000000000000000 R15: 00007f388145e040
    [78912.425505] ---[ end trace eb9bfe7c426fc4d3 ]---

Fix this by calling remove_extent_mapping(), at btrfs_remove_block_group(), only at the very end, after removing the block group item key from the extent tree (and removing the free space tree entry if we are using the free space tree feature).

Fixes: 04216820 ("Btrfs: fix race between fs trimming and block group remove/allocation")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
- 31 May 2019, 2 commits
Submitted by David Sterba

This reverts commit 8b13bb91. There is currently no corresponding patch in master due to additional changes that would be significantly different from a plain revert in the respective stable branch.

The range argument was not handled correctly and could cause trim to overlap allocated areas or reach beyond the end of the device. The address space that fitrim normally operates on is in logical coordinates, while the discards are done on the physical device extents. This distinction cannot be made with the current ioctl interface and caused the confusion.

The bug depends on the layout of block groups and does not always happen. The whole-fs trim (run by default by the fstrim tool) is not affected.

Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Submitted by Tobin C. Harding

commit 450ff8348808a89cc27436771aa05c2b90c0eef1 upstream.

If a call to kobject_init_and_add() fails we must call kobject_put(), otherwise we leak memory. Calling kobject_put() when kobject_init_and_add() fails drops the refcount back to 0 and calls the ktype release method (which in turn calls the percpu destroy and kfree).

Add a call to kobject_put() in the error path of the call to kobject_init_and_add().

Cc: stable@vger.kernel.org # v4.4+
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
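The general pattern, as a minimal sketch with illustrative names (example_obj/example_ktype are not btrfs's actual objects; the real ktype release method also destroys a percpu counter):

    #include <linux/kobject.h>
    #include <linux/slab.h>

    struct example_obj {
            struct kobject kobj;
    };

    /* assumed defined elsewhere, with a release method that kfree()s
     * the containing example_obj */
    extern struct kobj_type example_ktype;

    static int example_register(struct kobject *parent)
    {
            struct example_obj *obj;
            int ret;

            obj = kzalloc(sizeof(*obj), GFP_KERNEL);
            if (!obj)
                    return -ENOMEM;

            ret = kobject_init_and_add(&obj->kobj, &example_ktype, parent,
                                       "%s", "example");
            if (ret) {
                    /*
                     * Even on failure kobject_init_and_add() leaves us
                     * holding a reference; kobject_put() drops it to zero
                     * so the ktype release method runs and frees obj.
                     * A bare kfree() here would bypass release and could
                     * double-free later.
                     */
                    kobject_put(&obj->kobj);
                    return ret;
            }
            return 0;
    }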
- 22 May 2019, 1 commit
Submitted by Nikolay Borisov

commit c2d1b3aae33605a61cbab445d8ae1c708ccd2698 upstream.

Up until now, trimming the free space was done irrespective of what the arguments of the FITRIM ioctl were. For example, fstrim's -o/-l arguments would be entirely ignored. Fix it by correctly handling those parameters.

This requires breaking out if the found free space extent is after the end of the passed range, as well as completing the trim after trimming fstrim_range::len bytes.

Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
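A rough sketch of the two conditions described above, with illustrative names (extent_start/extent_end are a found free extent, range is the user's fstrim_range); this is a reading of the description, not the actual diff:

    u64 range_end = range->start + range->len;

    if (extent_start >= range_end)
            break;          /* free extent lies past the range: stop */

    bytes = min(extent_end, range_end) - max(extent_start, range->start);
    trimmed += bytes;
    if (trimmed >= range->len)
            break;          /* fulfilled the requested length */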
- 03 Apr 2019, 1 commit
Submitted by Nikolay Borisov

commit 139a56170de67101791d6e6c8e940c6328393fe9 upstream.

qgroup_rsv_size is calculated as the product of outstanding_extents * fs_info->nodesize. The product is calculated with 32 bit precision since both variables are defined as u32, yet qgroup_rsv_size expects a 64 bit result.

Avoid possible multiplication overflow by casting outstanding_extents to u64. Such an overflow would in the worst case (64K nodesize) require more than 65536 extents, which is quite large and it's not likely that it would happen in practice.

Fixes-coverity-id: 1435101
Fixes: ff6bc37e ("btrfs: qgroup: Use independent and accurate per inode qgroup rsv")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
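The overflow class is easy to demonstrate in a self-contained userspace program (values chosen to trip the wrap, not taken from the patch):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint32_t outstanding_extents = 70000;   /* > 65536 */
            uint32_t nodesize = 65536;              /* 64K nodes */

            /* the 32-bit product wraps before being widened to 64 bits */
            uint64_t wrong = outstanding_extents * nodesize;
            /* casting one operand first forces 64-bit multiplication */
            uint64_t right = (uint64_t)outstanding_extents * nodesize;

            printf("wrong=%llu right=%llu\n",
                   (unsigned long long)wrong, (unsigned long long)right);
            return 0;
    }

Here 70000 * 65536 = 4587520000 exceeds 2^32, so the unfixed form yields 292552704.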
- 10 Jan 2019, 1 commit
Submitted by Josef Bacik

commit 0568e82dbe2510fc1fa664f58e5c997d3f1e649e upstream.

With my delayed refs patches in place we started seeing a large amount of aborts in __btrfs_free_extent:

    BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 35964 owner 1 offset 0
    Call Trace:
     ? btrfs_merge_delayed_refs+0xaf/0x340
     __btrfs_run_delayed_refs+0x6ea/0xfc0
     ? btrfs_set_path_blocking+0x31/0x60
     btrfs_run_delayed_refs+0xeb/0x180
     btrfs_commit_transaction+0x179/0x7f0
     ? btrfs_check_space_for_delayed_refs+0x30/0x50
     ? should_end_transaction.isra.19+0xe/0x40
     btrfs_drop_snapshot+0x41c/0x7c0
     btrfs_clean_one_deleted_snapshot+0xb5/0xd0
     cleaner_kthread+0xf6/0x120
     kthread+0xf8/0x130
     ? btree_invalidatepage+0x90/0x90
     ? kthread_bind+0x10/0x10
     ret_from_fork+0x35/0x40

This was because btrfs_drop_snapshot depends on the root not being modified while it's dropping the snapshot. It will unlock the root node (and really every node) as it walks down the tree, only to re-lock it when it needs to do something. This is a problem because if we modify the tree we could cow a block in our path, which frees our reference to that block. Then once we get back to that shared block we'll free our reference to it again, and get ENOENT when trying to lookup our extent reference to that block in __btrfs_free_extent.

This is ultimately happening because we have delayed items left to be processed for our deleted snapshot _after_ all of the inodes are closed for the snapshot. We only run the delayed inode item if we're deleting the inode, and even then we do not run the delayed insertions or delayed removals. These can be run at any point after our final inode does its last iput, which is what triggers the snapshot deletion. We can end up with the snapshot deletion happening and then have the delayed items run on that file system, resulting in the above problem.

This problem has existed forever; however, my patches made it much easier to hit as I wake up the cleaner much more often to deal with delayed iputs, which made us more likely to start the snapshot dropping work before the transaction commits, which is when the delayed items would generally be run. Before, generally speaking, we would run the delayed items, commit the transaction, and wake up the cleaner thread to start deleting snapshots, which means we were less likely to hit this problem. You could still hit it if you had multiple snapshots to be deleted and ended up with lots of delayed items, but it was definitely harder.

Fix for now by simply running all the delayed items before starting to drop the snapshot. We could make this smarter in the future by making the delayed items per-root, and then simply drop any delayed items for roots that we are going to delete. But for now just a quick and easy solution is the safest.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
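Shape of the quick fix, as a sketch (btrfs_run_delayed_items() is a real helper of this era; the exact placement and error handling in the patch are not reproduced here):

    /* flush every pending delayed insertion/removal first, so none of
     * them can modify the tree while the snapshot walk is in progress */
    ret = btrfs_run_delayed_items(trans);
    if (ret)
            goto out;
    /* ...now it is safe to start walking the snapshot down and drop it... */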
- 14 Nov 2018, 13 commits
Submitted by Josef Bacik

commit 80ee54bfe8a3850015585ebc84e8d207fcae6831 upstream.

We were not handling the reserved byte accounting properly for data references. Metadata was mostly fine: if it errored out, the error paths would free the bytes_reserved count and pin the extent, but even there one of the error cases was missed. So instead move this handling up into run_one_delayed_ref so we are sure that both cases are properly cleaned up in case of a transaction abort.

CC: stable@vger.kernel.org # 4.18+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Submitted by Josef Bacik

commit fb5c39d7a887108087de6ff93d3f326b01b4ef41 upstream.

max_extent_size is supposed to be the largest contiguous range for the space info, and ctl->free_space is the total free space in the block group. We need to keep track of these separately and _only_ use the max_free_space if we don't have a max_extent_size, as that means our original request was too large to search any of the block groups for and therefore wouldn't have a max_extent_size set.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Submitted by Josef Bacik

commit 21a94f7acf0f748599ea552af5d9ee7d7e41c72f upstream.

If we use up our block group before allocating a new one we'll easily get a max_extent_size that's set really low, which will result in a lot of fragmentation. We need to make sure we're resetting the max_extent_size when we add a new chunk or add new space.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Submitted by Filipe Manana

commit 5ce555578e0919237fa4bda92b4670e2dd176f85 upstream.

When writing out a block group free space cache we can end up deadlocking with ourselves on an extent buffer lock, resulting in a warning like the following:

    [245043.379979] WARNING: CPU: 4 PID: 2608 at fs/btrfs/locking.c:251 btrfs_tree_lock+0x1be/0x1d0 [btrfs]
    [245043.392792] CPU: 4 PID: 2608 Comm: btrfs-transacti Tainted: G W I 4.16.8 #1
    [245043.395489] RIP: 0010:btrfs_tree_lock+0x1be/0x1d0 [btrfs]
    [245043.396791] RSP: 0018:ffffc9000424b840 EFLAGS: 00010246
    [245043.398093] RAX: 0000000000000a30 RBX: ffff8807e20a3d20 RCX: 0000000000000001
    [245043.399414] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff8807e20a3d20
    [245043.400732] RBP: 0000000000000001 R08: ffff88041f39a700 R09: ffff880000000000
    [245043.402021] R10: 0000000000000040 R11: ffff8807e20a3d20 R12: ffff8807cb220630
    [245043.403296] R13: 0000000000000001 R14: ffff8807cb220628 R15: ffff88041fbdf000
    [245043.404780] FS: 0000000000000000(0000) GS:ffff88082fc80000(0000) knlGS:0000000000000000
    [245043.406050] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [245043.407321] CR2: 00007fffdbdb9f10 CR3: 0000000001c09005 CR4: 00000000000206e0
    [245043.408670] Call Trace:
    [245043.409977]  btrfs_search_slot+0x761/0xa60 [btrfs]
    [245043.411278]  btrfs_insert_empty_items+0x62/0xb0 [btrfs]
    [245043.412572]  btrfs_insert_item+0x5b/0xc0 [btrfs]
    [245043.413922]  btrfs_create_pending_block_groups+0xfb/0x1e0 [btrfs]
    [245043.415216]  do_chunk_alloc+0x1e5/0x2a0 [btrfs]
    [245043.416487]  find_free_extent+0xcd0/0xf60 [btrfs]
    [245043.417813]  btrfs_reserve_extent+0x96/0x1e0 [btrfs]
    [245043.419105]  btrfs_alloc_tree_block+0xfb/0x4a0 [btrfs]
    [245043.420378]  __btrfs_cow_block+0x127/0x550 [btrfs]
    [245043.421652]  btrfs_cow_block+0xee/0x190 [btrfs]
    [245043.422979]  btrfs_search_slot+0x227/0xa60 [btrfs]
    [245043.424279]  ? btrfs_update_inode_item+0x59/0x100 [btrfs]
    [245043.425538]  ? iput+0x72/0x1e0
    [245043.426798]  write_one_cache_group.isra.49+0x20/0x90 [btrfs]
    [245043.428131]  btrfs_start_dirty_block_groups+0x102/0x420 [btrfs]
    [245043.429419]  btrfs_commit_transaction+0x11b/0x880 [btrfs]
    [245043.430712]  ? start_transaction+0x8e/0x410 [btrfs]
    [245043.432006]  transaction_kthread+0x184/0x1a0 [btrfs]
    [245043.433341]  kthread+0xf0/0x130
    [245043.434628]  ? btrfs_cleanup_transaction+0x4e0/0x4e0 [btrfs]
    [245043.435928]  ? kthread_create_worker_on_cpu+0x40/0x40
    [245043.437236]  ret_from_fork+0x1f/0x30
    [245043.441054] ---[ end trace 15abaa2aaf36827f ]---

This is because at write_one_cache_group(), when we are COWing a leaf from the extent tree, we end up allocating a new block group (chunk) and, because we have hit a threshold on the number of bytes reserved for system chunks, we attempt to finalize the creation of new block groups from the current transaction by calling btrfs_create_pending_block_groups(). However, here we also need to modify the extent tree in order to insert a block group item, and if the location for this new block group item happens to be in the same leaf that we were COWing earlier, we deadlock, since btrfs_search_slot() tries to write lock the extent buffer that we locked before at write_one_cache_group().

We have already hit similar cases in the past, and commit d9a0540a ("Btrfs: fix deadlock when finalizing block group creation") fixed some of those cases by delaying the creation of pending block groups at the known specific spots that could lead to a deadlock. This change reworks that commit to be more generic, so that we don't have to add similar logic to every possible path that can lead to a deadlock.

This is done by making __btrfs_cow_block() disallow the creation of new block groups (setting the transaction's can_flush_pending_bgs to false) before it attempts to allocate a new extent buffer for either the extent, chunk or device trees, since those are the trees that pending block group creation modifies. Once the new extent buffer is allocated, it allows creation of pending block groups to happen again.

This change depends on a recent patch from Josef which is not yet in Linus' tree, named "btrfs: make sure we create all new block groups", in order to avoid occasional warnings at btrfs_trans_release_chunk_metadata().

Fixes: d9a0540a ("Btrfs: fix deadlock when finalizing block group creation")
CC: stable@vger.kernel.org # 4.4+
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199753
Link: https://lore.kernel.org/linux-btrfs/CAJtFHUTHna09ST-_EEiyWmDH6gAqS6wa=zMNMBsifj8ABu99cw@mail.gmail.com/
Reported-by: E V <eliventer@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
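The guard described above, sketched with the field name taken from the commit text (surrounding calls abbreviated; not the exact diff):

    /* no pending block group creation while we COW into the extent,
     * chunk or device trees */
    trans->can_flush_pending_bgs = false;
    cow = btrfs_alloc_tree_block(trans, root, ...);  /* may allocate a chunk */
    trans->can_flush_pending_bgs = true;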
Submitted by Josef Bacik

commit 545e3366db823dc3342ca9d7fea803f829c9062f upstream.

Allocating new chunks modifies both the extent and chunk tree, which can trigger new chunk allocations. So instead of doing list_for_each_safe, just do while (!list_empty()) so we make sure we don't exit with other pending bg's still on our list.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
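A sketch of the resulting loop shape, using the new_bgs/bg_list names from this era of the code (abbreviated, not the exact diff):

    struct btrfs_block_group_cache *block_group;

    /*
     * Inserting one block group item can allocate a new chunk, which
     * appends more entries to trans->new_bgs. Re-checking list_empty()
     * every iteration picks those up; list_for_each_entry_safe() would
     * stop at the snapshot it took when the loop was entered.
     */
    while (!list_empty(&trans->new_bgs)) {
            block_group = list_first_entry(&trans->new_bgs,
                                           struct btrfs_block_group_cache,
                                           bg_list);
            /* ...insert the block group item and chunk item... */
            list_del_init(&block_group->bg_list);
    }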
Submitted by Josef Bacik

commit 3aa7c7a31c26321696b92841d5103461c6f3f517 upstream.

While testing my backport I noticed there was a panic if I ran generic/416 generic/417 generic/418 all in a row. This just happened to uncover a race where we had outstanding IO after we destroy all of our workqueues, and then we'd go to queue the endio work on those freed workqueues.

This is because we aren't waiting for the caching threads to be done before freeing everything up, so to fix this make sure we wait on any outstanding caching that's being done before we free up the block group, so we're sure to be done with all IO by the time we get to btrfs_stop_all_workers(). This fixes the panic I was seeing consistently in testing.

    ------------[ cut here ]------------
    kernel BUG at fs/btrfs/volumes.c:6112!
    SMP PTI
    Modules linked in:
    CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
    Workqueue: btrfs-cache btrfs_cache_helper
    RIP: 0010:btrfs_map_bio+0x346/0x370
    RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000
    RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000
    RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000
    R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00
    FS: 0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     btree_submit_bio_hook+0x8a/0xd0
     submit_one_bio+0x5d/0x80
     read_extent_buffer_pages+0x18a/0x320
     btree_read_extent_buffer_pages+0xbc/0x200
     ? alloc_extent_buffer+0x359/0x3e0
     read_tree_block+0x3d/0x60
     read_block_for_search.isra.30+0x1a5/0x360
     btrfs_search_slot+0x41b/0xa10
     btrfs_next_old_leaf+0x212/0x470
     caching_thread+0x323/0x490
     normal_work_helper+0xc5/0x310
     process_one_work+0x141/0x340
     worker_thread+0x44/0x3c0
     kthread+0xf8/0x130
     ? process_one_work+0x340/0x340
     ? kthread_bind+0x10/0x10
     ret_from_fork+0x35/0x40
    RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0
    ---[ end trace 827eb13e50846033 ]---
    Kernel panic - not syncing: Fatal exception
    Kernel Offset: disabled
    ---[ end Kernel panic - not syncing: Fatal exception

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Submitted by Jeff Mahoney

commit fee7acc361314df6561208c2d3c0882d663dd537 upstream.

Commit 499f377f (btrfs: iterate over unused chunk space in FITRIM) fixed free space trimming, but introduced latency when it was running. This is due to it pinning the transaction using both an incremented refcount and holding the commit root sem for the duration of a single trim operation.

This was to ensure safety, but it's unnecessary. We already hold the chunk mutex, so we know that the chunk we're using can't be allocated while we're trimming it.

In order to check against chunks allocated already in this transaction, we need to check the pending chunks list. To do that safely, without joining the transaction (or attaching to it and then having to commit it), we need to ensure that the dev root's commit root doesn't change underneath us and that the pending chunk list stays around until we're done with it.

We can ensure the former by holding the commit root sem and the latter by pinning the transaction. We do this now, but the critical section covers the trim operation itself and we don't need to do that.

This patch moves the pinning and unpinning logic into helpers and unpins the transaction after performing the search and check for pending chunks. Limiting the critical section of the transaction pinning improves the latency substantially on slower storage (e.g. image files over NFS).

Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Submitted by Jeff Mahoney

commit 0be88e367fd8fbdb45257615d691f4675dda062f upstream.

We check whether any device the file system is using supports discard in the ioctl call, but then we attempt to trim free extents on every device regardless of whether discard is supported. Due to the way we mask off EOPNOTSUPP, we can end up issuing the trim operations on each free range on devices that don't support it, just wasting time.

Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Submitted by Jeff Mahoney

commit d4e329de5e5e21594df2e0dd59da9acee71f133b upstream.

btrfs_trim_fs iterates over the fs_devices->alloc_list while holding the device_list_mutex. The problem is that ->alloc_list is protected by the chunk mutex, and we don't want to hold the chunk mutex over the trim of the entire file system.

Fortunately, the ->dev_list list is protected by the dev_list mutex and, while it will give us all devices, including read-only devices, we already just skip the read-only devices. Then we can continue to take and release the chunk mutex while scanning each device.

Fixes: 499f377f ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Submitted by Qu Wenruo

commit 6ba9fc8e628becf0e3ec94083450d089b0dec5f5 upstream.

[BUG]
fstrim on some btrfs filesystems only trims the unallocated space, not trimming any space in existing block groups.

[CAUSE]
Before fstrim_range is passed to btrfs_trim_fs(), it gets truncated to the range [0, super->total_bytes), so later btrfs_trim_fs() will only be able to trim block groups in the range [0, super->total_bytes).

While for btrfs any bytenr aligned to sectorsize is valid, since btrfs uses its logical address space, there is nothing limiting the location where we put block groups.

For a filesystem with frequent balance, it's quite easy to relocate all block groups so that their bytenrs start beyond super->total_bytes. In that case, btrfs will not trim existing block groups.

[FIX]
Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs() can get the unmodified range, which is normally set to [0, U64_MAX].

Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: f4c697e6 ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl")
CC: <stable@vger.kernel.org> # v4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
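An illustrative reconstruction of the removed clamp (variable names assumed from the description, not copied from the tree):

    /* before the fix: the ioctl clamped the range to the device-size
     * coordinate space, cutting off relocated block groups */
    range.len = min(range.len, total_bytes - range.start);   /* removed */

    /* after: hand btrfs_trim_fs() the caller's range unmodified,
     * normally [0, U64_MAX], so block groups whose logical bytenr lies
     * beyond super->total_bytes are trimmed too */
    ret = btrfs_trim_fs(fs_info, &range);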
Submitted by Qu Wenruo

commit 93bba24d4b5ad1e5cd8b43f64e66ff9d6355dd20 upstream.

Function btrfs_trim_fs() doesn't handle errors in a consistent way. If an error happens when trimming existing block groups, it will skip the remaining block groups and continue to trim unallocated space for each device. The return value will only reflect the final error from device trimming.

This patch fixes such behavior by:

1) Recording the last error from block group or device trimming, so the return value will also reflect the last error during trimming, making developers more aware of the problem.

2) Continuing trimming if possible. If we failed to trim one block group or device, we can still try the next block group or device.

3) Reporting the number of failures during block group and device trimming. This is less noisy, but still gives the user a brief summary of what went wrong.

Such behavior can avoid confusion for cases like failure to trim the first block group, after which only unallocated space would be trimmed.

Reported-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add bg_ret and dev_ret to the messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
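The "remember the last error, keep going, summarize at the end" shape, as a sketch with assumed names (the loop, trim_one_block_group() and the message text are illustrative, not the actual diff):

    int bg_ret = 0, dev_ret = 0;
    u64 bg_failed = 0;

    list_for_each_entry(cache, &block_groups, list) {   /* hypothetical list */
            int err = trim_one_block_group(cache);      /* hypothetical helper */
            if (err) {
                    bg_failed++;       /* count it for the summary... */
                    bg_ret = err;      /* ...remember the last error... */
                    continue;          /* ...but try the next block group */
            }
    }
    if (bg_failed)
            btrfs_warn(fs_info,
                       "failed to trim %llu block group(s), last error %d",
                       bg_failed, bg_ret);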
Submitted by Qu Wenruo

commit b72c3aba09a53fc7c1824250d71180ca154517a7 upstream.

[BUG]
For a certain crafted image, whose csum root leaf has a missing backref, if we try to trigger a write with data csum, it could cause a deadlock with the following kernel WARN_ON():

    WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
    CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
    Workqueue: btrfs-endio-write btrfs_endio_write_helper
    RIP: 0010:btrfs_tree_lock+0x3e2/0x400
    Call Trace:
     btrfs_alloc_tree_block+0x39f/0x770
     __btrfs_cow_block+0x285/0x9e0
     btrfs_cow_block+0x191/0x2e0
     btrfs_search_slot+0x492/0x1160
     btrfs_lookup_csum+0xec/0x280
     btrfs_csum_file_blocks+0x2be/0xa60
     add_pending_csums+0xaf/0xf0
     btrfs_finish_ordered_io+0x74b/0xc90
     finish_ordered_fn+0x15/0x20
     normal_work_helper+0xf6/0x500
     btrfs_endio_write_helper+0x12/0x20
     process_one_work+0x302/0x770
     worker_thread+0x81/0x6d0
     kthread+0x180/0x1d0
     ret_from_fork+0x35/0x40

[CAUSE]
That crafted image has a missing backref for the csum tree root leaf. When we try to allocate a new tree block, since there is no EXTENT/METADATA_ITEM for the csum tree root, btrfs considers it a free slot and uses it.

The extent tree of the image looks like:

    Normal image                      |  This fuzzed image
    ----------------------------------+--------------------------------
    BG 29360128                       |  BG 29360128
     One empty slot                   |   One empty slot
    29364224: backref to UUID tree    |  29364224: backref to UUID tree
     Two empty slots                  |   Two empty slots
    29376512: backref to CSUM tree    |   One empty slot (bad type) <<<
    29380608: backref to D_RELOC tree |  29380608: backref to D_RELOC tree
    ...                               |  ...

Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs tries to allocate a tree block, it's a valid slot for btrfs.

And for finish_ordered_write, when we need to insert a csum, we try to CoW the csum tree root.

By accident, the empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K and BG_OFFSET + 12K are already used by tree block COW for other trees; the next empty slot is BG_OFFSET + 16K, which should be the backref for the CSUM tree. But due to the bad type, btrfs fails to recognize it as used and still considers it an empty slot, and will try to use it for the csum tree CoW.

Then in the following call trace, we will try to lock the new tree block, which turns out to be the old csum tree root, which is already locked:

    btrfs_search_slot() called on the csum tree root, which is at 29376512
    |- btrfs_cow_block()
       |- btrfs_set_lock_block()
       |  |- Now locks tree block 29376512 (old csum tree root)
       |- __btrfs_cow_block()
          |- btrfs_alloc_tree_block()
             |- btrfs_reserve_extent()
             |     Now it returns tree block 29376512, which the extent
             |     tree shows as an empty slot, but which is already held
             |     by the csum tree
             |- btrfs_init_new_buffer()
                |- btrfs_tree_lock()
                |     Triggers WARN_ON(eb->lock_owner == current->pid)
                |- wait_event()
                      Waits for the lock owner to release the lock, but
                      it's locked by ourselves, so it will deadlock

[FIX]
This patch will do the lock_owner and current->pid check at btrfs_init_new_buffer(), so the above deadlock can be avoided.

Since such a problem can only happen on a crafted image, we will still trigger a kernel warning for the later aborted transaction, but with a little more meaningful warning message.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
Reported-by: Xu Wen <wen.xu@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
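A sketch of the added check, rebuilt from the [FIX] description (message text and error code are assumptions):

    /* in btrfs_init_new_buffer(), before taking the tree lock */
    if (buf->lock_owner == current->pid) {
            btrfs_err_rl(fs_info,
                         "tree block %llu handed out while we own its lock, extent tree corrupted",
                         buf->start);
            free_extent_buffer(buf);
            return ERR_PTR(-EUCLEAN);   /* abort instead of deadlocking */
    }
    btrfs_tree_lock(buf);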
Submitted by Qu Wenruo

commit 65c6e82becec33731f48786e5a30f98662c86b16 upstream.

[BUG]
When mounting a certain crafted image, btrfs will trigger a kernel BUG_ON() when trying to recover balance:

    kernel BUG at fs/btrfs/extent-tree.c:8956!
    invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
    CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
    RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
    RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
    Call Trace:
     walk_up_tree+0x172/0x1f0 [btrfs]
     btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
     merge_reloc_roots+0xe1/0x1d0 [btrfs]
     btrfs_recover_relocation+0x3ea/0x420 [btrfs]
     open_ctree+0x1af3/0x1dd0 [btrfs]
     btrfs_mount_root+0x66b/0x740 [btrfs]
     mount_fs+0x3b/0x16a
     vfs_kern_mount.part.9+0x54/0x140
     btrfs_mount+0x16d/0x890 [btrfs]
     mount_fs+0x3b/0x16a
     vfs_kern_mount.part.9+0x54/0x140
     do_mount+0x1fd/0xda0
     ksys_mount+0xba/0xd0
     __x64_sys_mount+0x21/0x30
     do_syscall_64+0x60/0x210
     entry_SYSCALL_64_after_hwframe+0x49/0xbe

[CAUSE]
Extent tree corruption. In this particular case, the reloc tree root's owner is DATA_RELOC_TREE (it should be TREE_RELOC), thus its backref is corrupted and we fail the owner check in walk_up_tree().

[FIX]
It's pretty hard to take care of every extent tree corruption, but at least we can remove such BUG_ON() and exit more gracefully.

And since in this particular image DATA_RELOC_TREE and TREE_RELOC share the same root (which is obviously invalid), we need to make __del_reloc_root() more robust to detect such invalid sharing, to avoid a possible NULL dereference, as root->node can be NULL in this case.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
Reported-by: Xu Wen <wen.xu@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
- 23 Aug 2018, 1 commit
Submitted by Lu Fengqi

After btrfs_qgroup_reserve_meta_prealloc(), num_bytes will be assigned again by btrfs_calc_trans_metadata_size(). Once block_rsv fails, we can't properly free the num_bytes of the previous qgroup_reserve. Use a separate variable to store the num_bytes of the qgroup_reserve.

Delete the comment for the qgroup_reserved that no longer exists, and add a comment about use_global_rsv.

Fixes: c4c129db ("btrfs: drop unused parameter qgroup_reserved")
CC: stable@vger.kernel.org # 4.18+
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
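The structure of the fix, as a sketch assuming the usual helper signatures of this era (the reservation sizes and the items count are illustrative):

    u64 num_bytes;
    u64 qgroup_num_bytes = 3 * fs_info->nodesize;  /* value illustrative */
    int ret;

    ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true);
    if (ret)
            return ret;

    /* num_bytes is overwritten here, which is why the error path must
     * not use it for the qgroup rollback */
    num_bytes = btrfs_calc_trans_metadata_size(fs_info, items);
    ret = btrfs_block_rsv_add(root, rsv, num_bytes, BTRFS_RESERVE_FLUSH_ALL);
    if (ret)
            btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
    return ret;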
- 06 Aug 2018, 16 commits
Submitted by Qu Wenruo

If a crafted image has missing block group items, it could cause unexpected behavior and break the assumption of the 1:1 chunk<->block group mapping.

Although we have the block group -> chunk mapping check, we still need the chunk -> block group mapping check. This patch will do the extra check to ensure each chunk has its corresponding block group.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199847
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Gu Jinxiang <gujx@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Submitted by Qu Wenruo

A crafted btrfs image with an incorrect chunk<->block group mapping will trigger a lot of unexpected things, as the mapping is essential.

Although the problem can be caught by the block group item checker added in "btrfs: tree-checker: Verify block_group_item", it's still not sufficient. A superficially valid block group item can pass the check added by the mentioned patch but could still fail to match the existing chunk.

This patch will add an extra block group -> chunk mapping check, to ensure we have a completely matching (start, len, flags) chunk for each block group at mount time. Here we reuse the original helper find_first_block_group(), which is already doing the basic bg -> chunk checks, adding further checks of the start/len and type flags.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=199837
Reported-by: Xu Wen <wen.xu@gatech.edu>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Submitted by Misono Tomohiro

There are no users of this function anymore. Removing it was forgotten in commit a575ceeb ("Btrfs: get rid of unused orphan infrastructure").

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Submitted by Lu Fengqi

It can be referenced from the passed transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Submitted by David Sterba

Leftover after commit e339a6b0 ("Btrfs: __btrfs_mod_ref should always use no_quota"), which removed it from the function calls but not from the structure.

Signed-off-by: David Sterba <dsterba@suse.com>
Submitted by Josef Bacik

If we're trying to make a data reservation and we have to allocate a data chunk, we could leak ret == 1, as do_chunk_alloc() will return 1 if it allocated a chunk. Since the end of the function is the success path, just return 0.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
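The shape of the fix, sketched with illustrative context (CHUNK_ALLOC_NO_FORCE is a real flag of this era; the surrounding code is abbreviated):

    ret = do_chunk_alloc(trans, alloc_target, CHUNK_ALLOC_NO_FORCE);
    if (ret < 0)
            goto out;   /* a real error */
    /* ...finish making the data reservation... */
    return 0;           /* not "return ret;", which could leak ret == 1 */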
Submitted by Nikolay Borisov

It can be referenced from the passed transaction handle.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Submitted by Qu Wenruo

In find_free_extent(), under the checks: label, we have the following code:

    search_start = ALIGN(offset, fs_info->stripesize);

    /* move on to the next group */
    if (search_start + num_bytes >
        block_group->key.objectid + block_group->key.offset) {
            btrfs_add_free_space(block_group, offset, num_bytes);
            goto loop;
    }

    if (offset < search_start)
            btrfs_add_free_space(block_group, offset,
                                 search_start - offset);
    BUG_ON(offset > search_start);

However ALIGN() rounds up, thus @search_start >= @offset and that BUG_ON() will never be triggered.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
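The rounding argument is easy to verify in a self-contained program (the macro below mirrors the kernel's ALIGN() for power-of-two alignments; values are arbitrary):

    #include <assert.h>
    #include <stdint.h>

    /* round x up to the next multiple of a (a must be a power of two) */
    #define ALIGN(x, a) (((x) + ((a) - 1)) & ~((uint64_t)(a) - 1))

    int main(void)
    {
            uint64_t offset = 12345;
            uint64_t stripesize = 4096;
            uint64_t search_start = ALIGN(offset, stripesize);  /* 16384 */

            /* rounding up guarantees search_start >= offset, so a
             * BUG_ON(offset > search_start) is dead code */
            assert(search_start >= offset);
            return 0;
    }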
Submitted by David Sterba

There are many places that open code the duplication factor of the block group profiles, so create a common helper. This can be easily extended for more copies.

Signed-off-by: David Sterba <dsterba@suse.com>
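A hedged sketch of what such a helper can look like (the name and profile coverage are assumptions, not the exact patch):

    static int btrfs_bg_type_to_factor(u64 flags)
    {
            if (flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
                         BTRFS_BLOCK_GROUP_RAID10))
                    return 2;   /* these profiles store two copies */
            return 1;           /* SINGLE/RAID0 and parity profiles count once */
    }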
Submitted by Lu Fengqi

The fs_info can be fetched from the transaction handle directly.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Submitted by Lu Fengqi

It can be fetched from the transaction handle.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Submitted by Qu Wenruo

Introduce a small helper, btrfs_mark_bg_unused(), to acquire the needed locks and add a block group to the unused_bgs list. No functional modification; only 3 callers are involved.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
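A sketch of the helper, mirroring the open-coded pattern it replaces (details may differ from the actual patch):

    void btrfs_mark_bg_unused(struct btrfs_block_group_cache *bg)
    {
            struct btrfs_fs_info *fs_info = bg->fs_info;

            spin_lock(&fs_info->unused_bgs_lock);
            if (list_empty(&bg->bg_list)) {
                    /* hold a reference while the bg sits on the list */
                    btrfs_get_block_group(bg);
                    list_add_tail(&bg->bg_list, &fs_info->unused_bgs);
            }
            spin_unlock(&fs_info->unused_bgs_lock);
    }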
Submitted by Nikolay Borisov

do_chunk_alloc implements logic to detect whether there is currently a pending chunk allocation (by means of space_info->chunk_alloc being set) and, if so, it loops around to the 'again' label. Additionally, based on the state of the space_info (e.g. whether it's full or not) and the return value of should_alloc_chunk(), it decides whether this is a "hard" error (ENOSPC) or we can just return 0.

This patch refactors all of this:

1. Put order to the scattered ifs handling the various cases in easy-to-read if {} else if {} branches. This makes clear the various cases we are interested in handling.

2. Call should_alloc_chunk() only once and use the result in the if/else if constructs. All of this is done under space_info->lock, so even before, multiple calls of should_alloc_chunk() were unnecessary.

3. Rewrite the "do {} while()" loop currently implemented via a label into an explicit loop construct.

4. Move the mutex locking for the case where the caller is the one doing the allocation. For the case where the caller needs to wait for a concurrent allocation, introduce a pair of mutex_lock/mutex_unlock to act as a barrier, and reword the comment.

5. Switch local vars to bool type where pertinent.

All in all this shouldn't introduce any functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Submitted by Ethan Lien

In commit b150a4f1 ("Btrfs: use a percpu to keep track of possibly pinned bytes") we use total_bytes_pinned to track how many bytes we are going to free in this transaction. When we are close to ENOSPC, we check it and know if we can make the allocation by committing the current transaction. For every data/metadata extent we are going to free, we add to total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and release it in unpin_extent_range() when we finish the transaction. So this is a variable we frequently update but rarely read, which is just the suitable use of percpu_counter.

But in that commit we update total_bytes_pinned with the default 32 batch size, making every update essentially a spin lock protected update. Since every spin lock/unlock operation involves syncing a globally used variable and some kind of barrier in an SMP system, this is more expensive than using total_bytes_pinned as a simple atomic64_t.

So fix this by using a customized batch size. Since we only read total_bytes_pinned when we are close to ENOSPC and fail to allocate a new chunk, we can use a really large batch size and have nearly no penalty in most cases.

[Test]
We tested the patch on a 4-core x86 machine:

1. fallocate a 16GiB test file
2. take a snapshot (so all following writes will be COW)
3. run a 180 sec, 4 jobs, 4K random write fio on the test file

We also added a temporary lockdep class on percpu_counter's spin lock used by total_bytes_pinned to track it via lock_stat.

[Results]
unpatched:

    lock_stat version 0.4
    -----------------------------------------------------------------------
    class name      con-bounces  contentions  waittime-min  waittime-max
                    waittime-total  waittime-avg  acq-bounces  acquisitions
                    holdtime-min  holdtime-max  holdtime-total  holdtime-avg

    total_bytes_pinned_percpu:
                    82  82  0.21  0.61  29.46  0.36  298340  635973
                    0.09  11.01  173476.25  0.27

patched:

    lock_stat version 0.4
    -----------------------------------------------------------------------
    class name      con-bounces  contentions  waittime-min  waittime-max
                    waittime-total  waittime-avg  acq-bounces  acquisitions
                    holdtime-min  holdtime-max  holdtime-total  holdtime-avg

    total_bytes_pinned_percpu:
                    1  1  0.62  0.62  0.62  0.62  13601  31542
                    0.14  9.61  11016.90  0.35

[Analysis]
Since the spin lock only protects a single in-memory variable, the contentions (number of lock acquisitions that had to wait) in both the unpatched and patched versions are low. But when we look at acquisitions and acq-bounces, we get much lower counts in the patched version. Here the most important metric is acq-bounces: it means how many times the lock gets transferred between different cpus, so the patch can really reduce cacheline bouncing of the spin lock (and also of the global counter of the percpu_counter) in an SMP system.

Fixes: b150a4f1 ("Btrfs: use a percpu to keep track of possibly pinned bytes")
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
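The mechanism, as a sketch (percpu_counter_add_batch() is the real API; the SZ_128M batch value here is an assumption for illustration, check the patch for the actual constant):

    #define BTRFS_TOTAL_BYTES_PINNED_BATCH  SZ_128M  /* assumed value */

    /* writers fold deltas into per-cpu slots and only take the shared
     * spinlock once a cpu's delta exceeds the batch, instead of at the
     * percpu_counter default batch of 32 */
    percpu_counter_add_batch(&space_info->total_bytes_pinned, num_bytes,
                             BTRFS_TOTAL_BYTES_PINNED_BATCH);

The trade-off is read accuracy: the global value can lag by up to (batch * nr_cpus), which is acceptable here because the counter is only consulted as a heuristic near ENOSPC.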
Submitted by David Sterba

Functions that get a btrfs inode can simply reach the fs_info by dereferencing the root, and this looks a bit more straightforward than the btrfs_sb(...) indirection. If the transaction handle is available and not NULL, it's used instead.

Signed-off-by: David Sterba <dsterba@suse.com>
Submitted by David Sterba

The v0 extent type checks are the right case for the unlikely annotations, as we don't expect to ever see them, so let's give the compiler a hint.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
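What the annotation buys, shown in a self-contained userspace program (the kernel's unlikely() is reproduced verbatim; check_item() is a made-up stand-in for the v0 item-size test):

    #include <stdio.h>

    #define unlikely(x) __builtin_expect(!!(x), 0)  /* as in the kernel */

    static int check_item(int item_size, int min_size)
    {
            if (unlikely(item_size < min_size)) {
                    /* cold path: legacy v0-style item, never expected on
                     * current filesystems */
                    return -1;
            }
            return 0;  /* hot path: compiler lays this out as fall-through */
    }

    int main(void)
    {
            printf("%d %d\n", check_item(53, 24), check_item(8, 24));
            return 0;
    }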