- 06 8月, 2018 1 次提交
-
-
由 Nikolay Borisov 提交于
It can be referenced from the passed bg cache. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 29 5月, 2018 2 次提交
-
-
由 Anand Jain 提交于
Looks like the original idea was to print the hex of the flags which is not coded with their flag name. So use the current buf pointer bp instead of buf. Reaching the uknown flags should never happen, it's there just in case. Fixes: ebce0e01 ("btrfs: make block group flags in balance printks human-readable") Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Gu Jinxiang 提交于
Since the commit c6100a4b ("Btrfs: replace tree->mapping with tree->private_data"), parameter fs_info in alloc_reloc_control is not used. So remove it. Signed-off-by: NGu Jinxiang <gujx@cn.fujitsu.com> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 26 4月, 2018 1 次提交
-
-
由 Qu Wenruo 提交于
Commit 581c1760 ("btrfs: Validate child tree block's level and first key") introduced new @first_key parameter for read_tree_block(), however caller in replace_path() is parasing wrong key to read_tree_block(). It should use parameter @first_key other than @key. Normally it won't expose problem as @key is normally initialzied to the same value of @first_key we expect. However in relocation recovery case, @key can be set to (0, 0, 0), and since no valid key in relocation tree can be (0, 0, 0), it will cause read_tree_block() to return -EUCLEAN and interrupt relocation recovery. Fix it by setting @first_key correctly. Fixes: 581c1760 ("btrfs: Validate child tree block's level and first key") Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 12 4月, 2018 1 次提交
-
-
由 David Sterba 提交于
Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 31 3月, 2018 2 次提交
-
-
由 Qu Wenruo 提交于
We have several reports about node pointer points to incorrect child tree blocks, which could have even wrong owner and level but still with valid generation and checksum. Although btrfs check could handle it and print error message like: leaf parent key incorrect 60670574592 Kernel doesn't have enough check on this type of corruption correctly. At least add such check to read_tree_block() and btrfs_read_buffer(), where we need two new parameters @level and @first_key to verify the child tree block. The new @level check is mandatory and all call sites are already modified to extract expected level from its call chain. While @first_key is optional, the following call sites are skipping such check: 1) Root node/leaf As ROOT_ITEM doesn't contain the first key, skip @first_key check. 2) Direct backref Only parent bytenr and level is known and we need to resolve the key all by ourselves, skip @first_key check. Another note of this verification is, it needs extra info from nodeptr or ROOT_ITEM, so it can't fit into current tree-checker framework, which is limited to node/leaf boundary. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Before this patch, btrfs qgroup is mixing per-transcation meta rsv with preallocated meta rsv, making it quite easy to underflow qgroup meta reservation. Since we have the new qgroup meta rsv types, apply it to delalloc reservation. Now for delalloc, most of its reserved space will use META_PREALLOC qgroup rsv type. And for callers reducing outstanding extent like btrfs_finish_ordered_io(), they will convert corresponding META_PREALLOC reservation to META_PERTRANS. This is mainly due to the fact that current qgroup numbers will only be updated in btrfs_commit_transaction(), that's to say if we don't keep such placeholder reservation, we can exceed qgroup limitation. And for callers freeing outstanding extent in error handler, we will just free META_PREALLOC bytes. This behavior makes callers of btrfs_qgroup_release_meta() or btrfs_qgroup_convert_meta() to be aware of which type they are. So in this patch, btrfs_delalloc_release_metadata() and its callers get an extra parameter to info qgroup to do correct meta convert/release. The good news is, even we use the wrong type (convert or free), it won't cause obvious bug, as prealloc type is always in good shape, and the type only affects how per-trans meta is increased or not. So the worst case will be at most metadata limitation can be sometimes exceeded (no convert at all) or metadata limitation is reached too soon (no free at all). Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 01 3月, 2018 1 次提交
-
-
由 Nikolay Borisov 提交于
Essentially duplicate the error handling from the above block which handles the !PageUptodate(page) case and additionally clear EXTENT_BOUNDARY. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 16 11月, 2017 1 次提交
-
-
由 Filipe Manana 提交于
The patch from commit a7e3b975 ("Btrfs: fix reported number of inode blocks") introduced a regression where if we do a buffered write starting at position equal to or greater than the file's size and then stat(2) the file before writeback is triggered, the number of used blocks does not change (unless there's a prealloc/unwritten extent). Example: $ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar $ du -h foobar 0 foobar $ sync $ du -h foobar 64K foobar The first version of that patch didn't had this regression and the second version, which was the one committed, was made only to address some performance regression detected by the intel test robots using fs_mark. This fixes the regression by setting the new delaloc bit in the range, and doing it at btrfs_dirty_pages() while setting the regular dealloc bit as well, so that this way we set both bits at once avoiding navigation of the inode's io tree twice. Doing it at btrfs_dirty_pages() is also the most meaninful place, as we should set the new dellaloc bit when if we set the delalloc bit, which happens only if we copied bytes into the pages at __btrfs_buffered_write(). This was making some of LTP's du tests fail, which can be quickly run using a command line like the following: $ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt Fixes: a7e3b975 ("Btrfs: fix reported number of inode blocks") Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 02 11月, 2017 1 次提交
-
-
由 Josef Bacik 提交于
Right now we do a lot of weird hoops around outstanding_extents in order to keep the extent count consistent. This is because we logically transfer the outstanding_extent count from the initial reservation through the set_delalloc_bits. This makes it pretty difficult to get a handle on how and when we need to mess with outstanding_extents. Fix this by revamping the rules of how we deal with outstanding_extents. Now instead everybody that is holding on to a delalloc extent is required to increase the outstanding extents count for itself. This means we'll have something like this btrfs_delalloc_reserve_metadata - outstanding_extents = 1 btrfs_set_extent_delalloc - outstanding_extents = 2 btrfs_release_delalloc_extents - outstanding_extents = 1 for an initial file write. Now take the append write where we extend an existing delalloc range but still under the maximum extent size btrfs_delalloc_reserve_metadata - outstanding_extents = 2 btrfs_set_extent_delalloc btrfs_set_bit_hook - outstanding_extents = 3 btrfs_merge_extent_hook - outstanding_extents = 2 btrfs_delalloc_release_extents - outstanding_extnets = 1 In order to make the ordered extent transition we of course must now make ordered extents carry their own outstanding_extent reservation, so for cow_file_range we end up with btrfs_add_ordered_extent - outstanding_extents = 2 clear_extent_bit - outstanding_extents = 1 btrfs_remove_ordered_extent - outstanding_extents = 0 This makes all manipulations of outstanding_extents much more explicit. Every successful call to btrfs_delalloc_reserve_metadata _must_ now be combined with btrfs_release_delalloc_extents, even in the error case, as that is the only function that actually modifies the outstanding_extents counter. The drawback to this is now we are much more likely to have transient cases where outstanding_extents is much larger than it actually should be. This could happen before as we manipulated the delalloc bits, but now it happens basically at every write. This may put more pressure on the ENOSPC flushing code, but I think making this code simpler is worth the cost. I have another change coming to mitigate this side-effect somewhat. I also added trace points for the counter manipulation. These were used by a bpf script I wrote to help track down leak issues. Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 30 10月, 2017 1 次提交
-
-
由 Josef Bacik 提交于
We need the actual root for the ref verifier tool to work, so change these functions to pass the root around instead. This will be used in a subsequent patch. Signed-off-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 26 9月, 2017 1 次提交
-
-
由 Naohiro Aota 提交于
__del_reloc_root should be called before freeing up reloc_root->node. If not, calling __del_reloc_root() dereference reloc_root->node, causing the system BUG. Fixes: 6bdf131f ("Btrfs: don't leak reloc root nodes on error") Cc: <stable@vger.kernel.org> # 4.9 Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 21 8月, 2017 3 次提交
-
-
由 Liu Bo 提交于
The BUG_ON() can be triggered when the caller is processing an invalid extent inline ref, e.g. a shared data ref is offered instead of an extent data ref, such that it tries to find a non-existent tree block and then btrfs_search_slot returns 1 for no such item. This replaces the BUG_ON() with a WARN() followed by calling btrfs_print_leaf() to show more details about what's going on and returning -EINVAL to upper callers. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
Now that we have a helper to report invalid value of extent inline ref type, we need to quit gracefully instead of throwing out a kernel panic. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Liu Bo 提交于
Since we have a helper which can do sanity check, this converts all btrfs_extent_inline_ref_type to it. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 16 8月, 2017 1 次提交
-
-
由 Nikolay Borisov 提交于
One of the error handling paths in __add_reloc_root contains btrfs_panic() followed by some other code. As the name implies what it does is print some error message and call BUG, naturally what follow afterwards is not invoked. So remove this extra code. Signed-off-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 30 6月, 2017 3 次提交
-
-
由 Chris Mason 提交于
Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with v4.12-rc6. This was because commit 70e7af24 made it possible for calc_reclaim_items_nr() to return a negative number. It's not really a bug in that commit, it just didn't go far enough down the stack to find all the possible 64->32 bit overflows. This switches calc_reclaim_items_nr() to return a u64 and changes everyone that uses the results of that math to u64 as well. Reported-by: NDave Jones <davej@codemonkey.org.uk> Fixes: 70e7af24 ("Btrfs: fix delalloc accounting leak caused by u32 overflow") Signed-off-by: NChris Mason <clm@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BUG] For the following case, btrfs can underflow qgroup reserved space at an error path: (Page size 4K, function name without "btrfs_" prefix) Task A | Task B ---------------------------------------------------------------------- Buffered_write [0, 2K) | |- check_data_free_space() | | |- qgroup_reserve_data() | | Range aligned to page | | range [0, 4K) <<< | | 4K bytes reserved <<< | |- copy pages to page cache | | Buffered_write [2K, 4K) | |- check_data_free_space() | | |- qgroup_reserved_data() | | Range alinged to page | | range [0, 4K) | | Already reserved by A <<< | | 0 bytes reserved <<< | |- delalloc_reserve_metadata() | | And it *FAILED* (Maybe EQUOTA) | |- free_reserved_data_space() |- qgroup_free_data() Range aligned to page range [0, 4K) Freeing 4K (Special thanks to Chandan for the detailed report and analyse) [CAUSE] Above Task B is freeing reserved data range [0, 4K) which is actually reserved by Task A. And at writeback time, page dirty by Task A will go through writeback routine, which will free 4K reserved data space at file extent insert time, causing the qgroup underflow. [FIX] For btrfs_qgroup_free_data(), add @reserved parameter to only free data ranges reserved by previous btrfs_qgroup_reserve_data(). So in above case, Task B will try to free 0 byte, so no underflow. Reported-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Tested-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Introduce a new parameter, struct extent_changeset for btrfs_qgroup_reserved_data() and its callers. Such extent_changeset was used in btrfs_qgroup_reserve_data() to record which range it reserved in current reserve, so it can free it in error paths. The reason we need to export it to callers is, at buffered write error path, without knowing what exactly which range we reserved in current allocation, we can free space which is not reserved by us. This will lead to qgroup reserved space underflow. Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 20 6月, 2017 1 次提交
-
-
由 Josef Bacik 提交于
For extent_io tree's we have carried the address_mapping of the inode around in the io tree in order to pull the inode back out for calling into various tree ops hooks. This works fine when everything that has an extent_io_tree has an inode. But we are going to remove the btree_inode, so we need to change this. Instead just have a generic void * for private data that we can initialize with, and have all the tree ops use that instead. This had a lot of cascading changes but should be relatively straightforward. Signed-off-by: NJosef Bacik <jbacik@fb.com> Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ minor reordering of the callback prototypes ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 28 2月, 2017 4 次提交
-
-
由 Nikolay Borisov 提交于
Signed-off-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Signed-off-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Signed-off-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Signed-off-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 17 2月, 2017 2 次提交
-
-
由 Jeff Mahoney 提交于
The free space cache APIs accept a root but always use the tree root. Also, btrfs_truncate_free_space_cache accepts a root AND an inode but the inode always points to the root anyway, so let's just pass the inode. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
btrfs_inc_block_group_ro is either passed the extent root or the dev root, but it doesn't do anything with the dev tree. Let's convert to passing an fs_info and using the extent root. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 14 2月, 2017 2 次提交
-
-
由 David Sterba 提交于
This goes as a separate patch because fixing that inside the patches caused too many many conflicts. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Nikolay Borisov 提交于
Currently btrfs_ino takes a struct inode and this causes a lot of internal btrfs functions which consume this ino to take a VFS inode, rather than btrfs' own struct btrfs_inode. In order to fix this "leak" of VFS structs into the internals of btrfs first it's necessary to eliminate all uses of struct inode for the purpose of inode. This patch does that by using BTRFS_I to convert an inode to btrfs_inode. With this problem eliminated subsequent patches will start eliminating the passing of struct inode altogether, eventually resulting in a lot cleaner code. Signed-off-by: NNikolay Borisov <n.borisov.lkml@gmail.com> [ fix btrfs_get_extent tracepoint prototype ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 06 12月, 2016 5 次提交
-
-
由 Jeff Mahoney 提交于
Now we only use the root parameter to print the root objectid in a tracepoint. We can use the root parameter from the transaction handle for that. It's also used to join the transaction with async commits, so we remove the comment that it's just for checking. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
There are loads of functions in btrfs that accept a root parameter but only use it to obtain an fs_info pointer. Let's convert those to just accept an fs_info pointer directly. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
In routines where someptr->fs_info is referenced multiple times, we introduce a convenience variable. This makes the code considerably more readable. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
We track the node sizes per-root, but they never vary from the values in the superblock. This patch messes with the 80-column style a bit, but subsequent patches to factor out root->fs_info into a convenience variable fix it up again. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Jeff Mahoney 提交于
There are many functions that are always called with the same root argument. Rather than passing the same root every time, we can pass an fs_info pointer instead and have the function get the root pointer itself. Signed-off-by: NJeff Mahoney <jeffm@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 30 11月, 2016 4 次提交
-
-
由 Qu Wenruo 提交于
Commit 62b99540 (btrfs: relocation: Fix leaking qgroups numbers on data extents) only fixes the problem partly. The previous fix is to trace all new data extents at transaction commit time when balance finishes. However balance is not done in a large transaction, every path replacement can happen in its own transaction. This makes the fix useless if transaction commits during relocation. For example: relocate_block_group() |-merge_reloc_roots() | |- merge_reloc_root() | |- btrfs_start_transaction() <- Trans X | |- replace_path() <- Cause leak | |- btrfs_end_transaction_throttle() <- Trans X commits here | | Leak not fixed | | | |- btrfs_start_transaction() <- Trans Y | |- replace_path() <- Cause leak | |- btrfs_end_transaction_throttle() <- Trans Y ends | but not committed |-btrfs_join_transaction() <- Still trans Y |-qgroup_fix() <- Only fixes data leak | in trans Y |-btrfs_commit_transaction() <- Trans Y commits In that case, qgroup fixup can only fix data leak in trans Y, data leak in trans X is out of fix. So the correct fix should happen in the same transaction of replace_path(). This patch fixes it by tracing both subtrees of tree block swap, so it can fix the problem and ensure all leaking and fix are in the same transaction, so no leak again. Reported-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-and-Tested-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Rename btrfs_qgroup_insert_dirty_extent(_nolock) to btrfs_qgroup_trace_extent(_nolock), according to the new reserve/trace/account naming schema. Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-and-Tested-by: NGoldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
The only memset we do is to 0, so sink the parameter to the function and simplify all calls. Rename the function to reflect the behaviour. Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Adam Borowski 提交于
They're not even documented anywhere, letting users with no recourse but to RTFS. It's no big burden to output the bitfield as words. Also, display unknown flags as hex. Signed-off-by: NAdam Borowski <kilobyte@angband.pl> Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 19 11月, 2016 2 次提交
-
-
由 Filipe Manana 提交于
In commit 5bc7247a (Btrfs: fix broken nocow after balance) we started abusing the rtransid and otransid fields of root items from relocation trees to fix some issues with nodatacow mode. However later in commit ba8b0289 (Btrfs: do not reset last_snapshot after relocation) we dropped the code that made use of those fields but did not remove the code that sets those fields. So just remove them to avoid confusion. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NJosef Bacik <jbacik@fb.com>
-
由 Filipe Manana 提交于
During relocation of a data block group we create a relocation tree for each fs/subvol tree by making a snapshot of each tree using btrfs_copy_root() and the tree's commit root, and then setting the last snapshot field for the fs/subvol tree's root to the value of the current transaction id minus 1. However this can lead to relocation later dropping references that it did not create if we have qgroups enabled, leaving the filesystem in an inconsistent state that keeps aborting transactions. Lets consider the following example to explain the problem, which requires qgroups to be enabled. We are relocating data block group Y, we have a subvolume with id 258 that has a root at level 1, that subvolume is used to store directory entries for snapshots and we are currently at transaction 3404. When committing transaction 3404, we have a pending snapshot and therefore we call btrfs_run_delayed_items() at transaction.c:create_pending_snapshot() in order to create its dentry at subvolume 258. This results in COWing leaf A from root 258 in order to add the dentry. Note that leaf A also contains file extent items referring to extents from some other block group X (we are currently relocating block group Y). Later on, still at create_pending_snapshot() we call qgroup_account_snapshot(), which switches the commit root for root 258 when it calls switch_commit_roots(), so now the COWed version of leaf A, lets call it leaf A', is accessible from the commit root of tree 258. At the end of qgroup_account_snapshot(), we call record_root_in_trans() with 258 as its argument, which results in btrfs_init_reloc_root() being called, which in turn calls relocation.c:create_reloc_root() in order to create a relocation tree associated to root 258, which results in assigning the value of 3403 (which is the current transaction id minus 1 = 3404 - 1) to the last_snapshot field of root 258. When creating the relocation tree root at ctree.c:btrfs_copy_root() we add a shared reference for leaf A', corresponding to the relocation tree's root, when we call btrfs_inc_ref() against the COWed root (a copy of the commit root from tree 258), which is at level 1. So at this point leaf A' has 2 references, one normal reference corresponding to root 258 and one shared reference corresponding to the root of the relocation tree. Transaction 3404 finishes its commit and transaction 3405 is started by relocation when calling merge_reloc_root() for the relocation tree associated to root 258. In the meanwhile leaf A' is COWed again, in response to some filesystem operation, when we are still at transaction 3405. However when we COW leaf A', at ctree.c:update_ref_for_cow(), we call btrfs_block_can_be_shared() in order to figure out if other trees refer to the leaf and if any such trees exists, add a full back reference to leaf A' - but btrfs_block_can_be_shared() incorrectly returns false because the following condition is false: btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item) which evaluates to 3404 <= 3403. So after leaf A' is COWed, it stays with only one reference, corresponding to the shared reference we created when we called btrfs_copy_root() to create the relocation tree's root and btrfs_inc_ref() ends up not being called for leaf A' nor we end up setting the flag BTRFS_BLOCK_FLAG_FULL_BACKREF in leaf A'. This results in not adding shared references for the extents from block group X that leaf A' refers to with its file extent items. Later, after merging the relocation root we do a call to to btrfs_drop_snapshot() in order to delete the relocation tree. This ends up calling do_walk_down() when path->slots[1] points to leaf A', which results in calling btrfs_lookup_extent_info() to get the number of references for leaf A', which is 1 at this time (only the shared reference exists) and this value is stored at wc->refs[0]. After this walk_up_proc() is called when wc->level is 0 and path->nodes[0] corresponds to leaf A'. Because the current level is 0 and wc->refs[0] is 1, it does call btrfs_dec_ref() against leaf A', which results in removing the single references that the extents from block group X have which are associated to root 258 - the expectation was to have each of these extents with 2 references - one reference for root 258 and one shared reference related to the root of the relocation tree, and so we would drop only the shared reference (because leaf A' was supposed to have the flag BTRFS_BLOCK_FLAG_FULL_BACKREF set). This leaves the filesystem in an inconsistent state as we now have file extent items in a subvolume tree that point to extents from block group X without references in the extent tree. So later on when we try to decrement the references for these extents, for example due to a file unlink operation, truncate operation or overwriting ranges of a file, we fail because the expected references do not exist in the extent tree. This leads to warnings and transaction aborts like the following: [ 588.965795] ------------[ cut here ]------------ [ 588.965815] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:1625 lookup_inline_extent_backref+0x432/0x5b0 [btrfs] [ 588.965816] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg [ 588.965831] CPU: 2 PID: 2479 Comm: kworker/u8:7 Not tainted 4.7.3-3-default-fdm+ #1 [ 588.965832] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 588.965844] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] [ 588.965845] 0000000000000000 ffff8802263bfa28 ffffffff813af542 0000000000000000 [ 588.965847] 0000000000000000 ffff8802263bfa68 ffffffff81081e8b 0000065900000000 [ 588.965848] ffff8801db2af000 000000012bbe2000 0000000000000000 ffff880215703b48 [ 588.965849] Call Trace: [ 588.965852] [<ffffffff813af542>] dump_stack+0x63/0x81 [ 588.965854] [<ffffffff81081e8b>] __warn+0xcb/0xf0 [ 588.965855] [<ffffffff81081f7d>] warn_slowpath_null+0x1d/0x20 [ 588.965863] [<ffffffffa0175042>] lookup_inline_extent_backref+0x432/0x5b0 [btrfs] [ 588.965865] [<ffffffff81143220>] ? trace_clock_local+0x10/0x30 [ 588.965867] [<ffffffff8114c5df>] ? rb_reserve_next_event+0x6f/0x460 [ 588.965875] [<ffffffffa0175215>] insert_inline_extent_backref+0x55/0xd0 [btrfs] [ 588.965882] [<ffffffffa017531f>] __btrfs_inc_extent_ref.isra.55+0x8f/0x240 [btrfs] [ 588.965890] [<ffffffffa017acea>] __btrfs_run_delayed_refs+0x74a/0x1260 [btrfs] [ 588.965892] [<ffffffff810cb046>] ? cpuacct_charge+0x86/0xa0 [ 588.965900] [<ffffffffa017e74f>] btrfs_run_delayed_refs+0x9f/0x2c0 [btrfs] [ 588.965908] [<ffffffffa017ea04>] delayed_ref_async_start+0x94/0xb0 [btrfs] [ 588.965918] [<ffffffffa01c799a>] btrfs_scrubparity_helper+0xca/0x350 [btrfs] [ 588.965928] [<ffffffffa01c7c5e>] btrfs_extent_refs_helper+0xe/0x10 [btrfs] [ 588.965930] [<ffffffff8109b323>] process_one_work+0x1f3/0x4e0 [ 588.965931] [<ffffffff8109b658>] worker_thread+0x48/0x4e0 [ 588.965932] [<ffffffff8109b610>] ? process_one_work+0x4e0/0x4e0 [ 588.965934] [<ffffffff810a1659>] kthread+0xc9/0xe0 [ 588.965936] [<ffffffff816f2f1f>] ret_from_fork+0x1f/0x40 [ 588.965937] [<ffffffff810a1590>] ? kthread_worker_fn+0x170/0x170 [ 588.965938] ---[ end trace 34e5232c933a1749 ]--- [ 588.966187] ------------[ cut here ]------------ [ 588.966196] WARNING: CPU: 2 PID: 2479 at fs/btrfs/extent-tree.c:2966 btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs] [ 588.966196] BTRFS: Transaction aborted (error -5) [ 588.966197] Modules linked in: af_packet iscsi_ibft iscsi_boot_sysfs xfs libcrc32c ppdev acpi_cpufreq button tpm_tis e1000 i2c_piix4 pcspkr parport_pc parport tpm qemu_fw_cfg joydev btrfs xor raid6_pq sr_mod cdrom ata_generic virtio_scsi ata_piix virtio_pci bochs_drm virtio_ring drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops virtio ttm serio_raw drm floppy sg [ 588.966206] CPU: 2 PID: 2479 Comm: kworker/u8:7 Tainted: G W 4.7.3-3-default-fdm+ #1 [ 588.966207] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 588.966217] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] [ 588.966217] 0000000000000000 ffff8802263bfc98 ffffffff813af542 ffff8802263bfce8 [ 588.966219] 0000000000000000 ffff8802263bfcd8 ffffffff81081e8b 00000b96345ee000 [ 588.966220] ffffffffa021ae1c ffff880215703b48 00000000000005fe ffff8802345ee000 [ 588.966221] Call Trace: [ 588.966223] [<ffffffff813af542>] dump_stack+0x63/0x81 [ 588.966224] [<ffffffff81081e8b>] __warn+0xcb/0xf0 [ 588.966225] [<ffffffff81081eff>] warn_slowpath_fmt+0x4f/0x60 [ 588.966233] [<ffffffffa017e93c>] btrfs_run_delayed_refs+0x28c/0x2c0 [btrfs] [ 588.966241] [<ffffffffa017ea04>] delayed_ref_async_start+0x94/0xb0 [btrfs] [ 588.966250] [<ffffffffa01c799a>] btrfs_scrubparity_helper+0xca/0x350 [btrfs] [ 588.966259] [<ffffffffa01c7c5e>] btrfs_extent_refs_helper+0xe/0x10 [btrfs] [ 588.966260] [<ffffffff8109b323>] process_one_work+0x1f3/0x4e0 [ 588.966261] [<ffffffff8109b658>] worker_thread+0x48/0x4e0 [ 588.966263] [<ffffffff8109b610>] ? process_one_work+0x4e0/0x4e0 [ 588.966264] [<ffffffff810a1659>] kthread+0xc9/0xe0 [ 588.966265] [<ffffffff816f2f1f>] ret_from_fork+0x1f/0x40 [ 588.966267] [<ffffffff810a1590>] ? kthread_worker_fn+0x170/0x170 [ 588.966268] ---[ end trace 34e5232c933a174a ]--- [ 588.966269] BTRFS: error (device sda2) in btrfs_run_delayed_refs:2966: errno=-5 IO failure [ 588.966270] BTRFS info (device sda2): forced readonly This was happening often on openSUSE and SLE systems using btrfs as the root filesystem (with its default layout where multiple subvolumes are used) where balance happens in the background triggered by a cron job and snapshots are automatically created before/after package installations, upgrades and removals. The issue could be triggered simply by running the following loop on the first system boot post installation: while true; do zypper -n in nfs-kernel-server zypper -n rm nfs-kernel-server done (If we were fast enough and made that loop before the cron job triggered a balance operation and the balance finished) So fix by setting the last_snapshot field of the root to the value of the generation of its commit root. Like this btrfs_block_can_be_shared() behaves correctly for the case where the relocation root is created during a transaction commit and for the case where it's created before a transaction commit. Fixes: 6426c7ad (btrfs: qgroup: Fix qgroup accounting when creating snapshot) Cc: stable@vger.kernel.org # 4.7+ Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NJosef Bacik <jbacik@fb.com>
-
- 17 10月, 2016 1 次提交
-
-
由 Liu Bo 提交于
While updating btree, we try to push items between sibling nodes/leaves in order to keep height as low as possible. But we don't memset the original places with zero when pushing items so that we could end up leaving stale content in nodes/leaves. One may read the above stale content by increasing btree blocks' @nritems. One case I've come across is that in fs tree, a leaf has two parent nodes, hence running balance ends up with processing this leaf with two parent nodes, but it can only reach the valid parent node through btrfs_search_slot, so it'd be like, do_relocation for P in all parent nodes of block A: if !P->eb: btrfs_search_slot(key); --> get path from P to A. if lowest: BUG_ON(A->bytenr != bytenr of A recorded in P); btrfs_cow_block(P, A); --> change A's bytenr in P. After btrfs_cow_block, P has the new bytenr of A, but with the same @key, we get the same path again, and get panic by BUG_ON. Note that this is only happening in a corrupted fs, for a regular fs in which we have correct @nritems so that we won't read stale content in any case. Reviewed-by: NJosef Bacik <jbacik@fb.com> Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-