- 18 4月, 2023 18 次提交
-
-
由 Qu Wenruo 提交于
Implement the only missing part for scrub: RAID56 P/Q stripe scrub. The workflow is pretty straightforward for the new function, scrub_raid56_parity_stripe(): - Go through the regular scrub path for each data stripe - Wait for the verification and repair to finish - Writeback the repaired sectors to data stripes - Make sure all stripes are properly repaired If we have sectors unrepaired, we cannot continue, or we could further corrupt the P/Q stripe. - Submit the rbio for P/Q stripe The dev-replace would be handled inside raid56_parity_submit_scrub_rbio() path. - Wait for the above bio to finish Although the old code is no longer used, we still keep the declaration, as the cleanup can be several times larger than this patch itself. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Switch scrub_simple_mirror() to the new scrub_stripe infrastructure. Since scrub_simple_mirror() is the core part of scrub (only RAID56 P/Q stripes don't utilize it), we can get rid of a big chunk of code, mostly scrub_extent(), scrub_sectors() and directly called functions. There is a functionality change: - Scrub speed throttle now only affects read on the scrubbing device Writes (for repair and replace), and reads from other mirrors won't be limited by the set limits. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The new helper, queue_scrub_stripe(), would try to queue a stripe for scrub. If all stripes are already in use, we will submit all the existing ones and wait for them to finish. Currently we would queue up to 8 stripes, to enlarge the blocksize to 512KiB to improve the performance. Sectors repaired on zoned need to be relocated instead of in-place fix. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The new helper, scrub_stripe_report_errors(), will report the result of the scrub to system log. The main reporting is done by introducing a new helper, scrub_print_common_warning(), which is mostly the same content from scrub_print_wanring(), but without the need for a scrub_block. Since we're reporting the errors, it's the perfect time to update the scrub stats too. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Add a new helper, scrub_write_sectors(), to submit write bios for specified sectors to the target disk. There are several differences compared to read path: - Utilize btrfs_submit_scrub_write() Now we still rely on the @mirror_num based writeback, but the requirement is also a little different than regular writeback or read, thus we have to call btrfs_submit_scrub_write(). - We cannot write the full stripe back We can only write the sectors we have. There will be two call sites later, one for repaired sectors, one for all utilized sectors of dev-replace. Thus the callers should specify their own write_bitmap. This function only submit the bios, will not wait for them unless for zoned case. Caller must explicitly wait for the IO to finish. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The new helper, scrub_stripe_read_repair_worker(), would handle the read-repair part: - Wait for the previous submitted read IO to finish - Verify the contents of the stripe - Go through the remaining mirrors, using as large blocksize as possible At this stage, we just read out all the failed sectors from each mirror and re-verify. If no more failed sector, we can exit. - Go through all mirrors again, sector-by-sector This time, we read sector by sector, this is to address cases where one bad sector mismatches the drive's internal checksum, and cause the whole read range to fail. We put this recovery method as the last resort, as sector-by-sector reading is slow, and reading from other mirrors may have already fixed the errors. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The new helper, scrub_verify_stripe(), shares the same main workflow of the old scrub code. The major differences are: - How pages/page_offset is grabbed Everything can be grabbed from scrub_stripe easily. - When error report happens Currently the helper only verifies the sectors, not really doing any error reporting. The error reporting would be done after we have done the repair. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The new helper, scrub_verify_one_metadata(), is almost the same as scrub_checksum_tree_block(). The difference is in how we grab the pages from other structures. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The new helper will search the extent tree to find the first extent of a logical range, then fill the sectors array by two loops: - Loop 1 to fill common bits and metadata generation - Loop 2 to fill csum data (only for data bgs) This loop will use the new btrfs_lookup_csums_bitmap() to fill the full csum buffer, and set scrub_sector_verification::csum. With all the needed info filled by this function, later we only need to submit and verify the stripe. Here we temporarily export the helper to avoid warning on unused static function. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
This patch introduces the following structures: - scrub_sector_verification Contains all the needed info to verify one sector (data or metadata). - scrub_stripe Contains all needed members (mostly bitmap based) to scrub one stripe (with a length of BTRFS_STRIPE_LEN). The basic idea is, we keep the existing per-device scrub behavior, but merge all the scrub_bio/scrub_bio into one generic structure, and read the full BTRFS_STRIPE_LEN stripe on the first try. This means we will read some sectors which are not scrub target, but that's fine. At dev-replace time we only writeback the utilized and good sectors, and for read-repair we only writeback the repaired sectors. With every read submitted in BTRFS_STRIPE_LEN, the need for complex bio form shaping would be gone. Although to get the same performance of the old scrub behavior, we would need to submit the initial read for two stripes at once. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
There is really no need to go through the super complex scrub_sectors() to just handle super blocks. Introduce a dedicated function to handle super block scrubbing. This new function will introduce a behavior change, instead of using the complex but concurrent scrub_bio system, here we just go submit-and-wait. There is really not much sense to care the performance of super block scrubbing. It only has 3 super blocks at most, and they are all scattered around the devices already. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
We don't need to pass the roots as arguments, reading them from the rb-tree is cheap. Thus there is really not much need to pre-fetch it and pass it all the way from scrub_stripe(). And we already have more than enough arguments in scrub_simple_mirror() and scrub_simple_stripe(), it's better to remove them and only grab those roots in scrub_simple_mirror(). Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The variable @path is no longer passed into any call sites after commit 18d30ab9 ("btrfs: scrub: use scrub_simple_mirror() to handle RAID56 data stripe scrub"), thus we can remove the variable completely. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BUG] Although dev replace ioctl has a way to specify the mode on whether we should read from the source device, it's not properly followed. # mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2 # mount $dev1 $mnt # xfs_io -f -c "pwrite 0 32M" $mnt/file # sync # btrfs replace start -r -f 1 $dev3 $mnt And one extra trace is added to scrub_submit(), showing the detail about the bio: btrfs-11569 [005] ... 37.0270: scrub_submit.part.0: devid=1 logical=22036480 phy=22036480 len=16384 btrfs-11569 [005] ... 37.0273: scrub_submit.part.0: devid=1 logical=30457856 phy=30457856 len=32768 btrfs-11569 [005] ... 37.0274: scrub_submit.part.0: devid=1 logical=30507008 phy=30507008 len=49152 btrfs-11569 [005] ... 37.0274: scrub_submit.part.0: devid=1 logical=30605312 phy=30605312 len=32768 btrfs-11569 [005] ... 37.0275: scrub_submit.part.0: devid=1 logical=30703616 phy=30703616 len=65536 btrfs-11569 [005] ... 37.0281: scrub_submit.part.0: devid=1 logical=298844160 phy=298844160 len=131072 ... btrfs-11569 [005] ... 37.0762: scrub_submit.part.0: devid=1 logical=322961408 phy=322961408 len=131072 btrfs-11569 [005] ... 37.0762: scrub_submit.part.0: devid=1 logical=323092480 phy=323092480 len=131072 One can see that all the reads are submitted to devid 1, even if we have specified "-r" option to avoid reading from the source device. [CAUSE] The dev-replace read mode is only set but not followed by scrub code at all. In fact, only common read path is properly following the read mode, but scrub itself has its own read path, thus not following the mode. [FIX] Here we enhance scrub_find_good_copy() to also follow the read mode. The idea is pretty simple, in the first loop, we avoid the following devices: - Missing devices This is the existing condition - The source device if the replace wants to avoid it. And if above loop found no candidate (e.g. replace a single device), then we discard the 2nd condition, and try again. Since we're here, also enhance the function scrub_find_good_copy() by: - Remove the forward declaration - Makes it return int To indicates errors, e.g. no good mirror found. - Add extra error messages Now with the same trace, "btrfs replace start -r" works as expected: btrfs-1213 [000] ... 991.9059: scrub_submit.part.0: devid=2 logical=22036480 phy=1064960 len=16384 btrfs-1213 [000] ... 991.9062: scrub_submit.part.0: devid=2 logical=30457856 phy=9486336 len=32768 btrfs-1213 [000] ... 991.9063: scrub_submit.part.0: devid=2 logical=30507008 phy=9535488 len=49152 btrfs-1213 [000] ... 991.9064: scrub_submit.part.0: devid=2 logical=30605312 phy=9633792 len=32768 btrfs-1213 [000] ... 991.9065: scrub_submit.part.0: devid=2 logical=30703616 phy=9732096 len=65536 btrfs-1213 [000] ... 991.9073: scrub_submit.part.0: devid=2 logical=298844160 phy=277872640 len=131072 btrfs-1213 [000] ... 991.9075: scrub_submit.part.0: devid=2 logical=298975232 phy=278003712 len=131072 btrfs-1213 [000] ... 991.9078: scrub_submit.part.0: devid=2 logical=299106304 phy=278134784 len=131072 ... btrfs-1213 [000] ... 991.9474: scrub_submit.part.0: devid=2 logical=318504960 phy=297533440 len=131072 btrfs-1213 [000] ... 991.9476: scrub_submit.part.0: devid=2 logical=318636032 phy=297664512 len=131072 btrfs-1213 [000] ... 991.9479: scrub_submit.part.0: devid=2 logical=318767104 phy=297795584 len=131072 Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
In btrfs_io_context structure, we have a pointer raid_map, which indicates the logical bytenr for each stripe. But considering we always call sort_parity_stripes(), the result raid_map[] is always sorted, thus raid_map[0] is always the logical bytenr of the full stripe. So why we waste the space and time (for sorting) for raid_map? This patch will replace btrfs_io_context::raid_map with a single u64 number, full_stripe_start, by: - Replace btrfs_io_context::raid_map with full_stripe_start - Replace call sites using raid_map[0] to use full_stripe_start - Replace call sites using raid_map[i] to compare with nr_data_stripes. The benefits are: - Less memory wasted on raid_map It's sizeof(u64) * num_stripes vs sizeof(u64). It'll always save at least one u64, and the benefit grows larger with num_stripes. - No more weird alloc_btrfs_io_context() behavior As there is only one fixed size + one variable length array. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
For btrfs dev-replace, we have to duplicate writes to the source device into the target device. For non-RAID56, all writes into the same mapped ranges are sharing the same content, thus they don't really need to bother anything. (E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the same write to all involved devices). But for RAID56, all stripes contain different content, thus we must have a clear mapping of which stripe is duplicated from which original stripe. Currently we use a complex way using tgtdev_map[] array, e.g: num_tgtdevs = 1 tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace. tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace, and it's duplicated to stripes[3]. tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace. But this is wasting some space, and ignores one important thing for dev-replace, there is at most one running replace. Thus we can change it to a fixed array to represent the mapping: replace_nr_stripes = 1 replace_stripe_src = 1 <- Means stripes[1] is involved in replace. thus the extra stripe is a copy of stripes[1] By this we can save some space for bioc on RAID56 chunks with many devices. And we get rid of one variable sized array from bioc. Thus the patch involves the following changes: - Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes and @replace_stripe_src. @num_tgtdevs is just renamed to @replace_nr_stripes. While the mapping is completely changed. - Add extra ASSERT()s for RAID56 code - Only add two more extra stripes for dev-replace cases. As we have an upper limit on how many dev-replace stripes we can have. - Unify the behavior of handle_ops_on_dev_replace() Previously handle_ops_on_dev_replace() go two different paths for WRITE and GET_READ_MIRRORS. Now unify them by always going the WRITE path first (with at most 2 replace stripes), then if we're doing GET_READ_MIRRORS and we have 2 extra stripes, just drop one stripe. - Remove the @real_stripes argument from alloc_btrfs_io_context() As we don't need the old variable length array any more. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
There are quite some div64 calls inside btrfs_map_block() and its variants. Such calls are for @stripe_nr, where @stripe_nr is the number of stripes before our logical bytenr inside a chunk. However we can eliminate such div64 calls by just reducing the width of @stripe_nr from 64 to 32. This can be done because our chunk size limit is already 10G, with fixed stripe length 64K. Thus a U32 is definitely enough to contain the number of stripes. With such width reduction, we can get rid of slower div64, and extra warning for certain 32bit arch. This patch would do: - Add a new tree-checker chunk validation on chunk length Make sure no chunk can reach 256G, which can also act as a bitflip checker. - Reduce the width from u64 to u32 for @stripe_nr variables - Replace unnecessary div64 calls with regular modulo and division 32bit division and modulo are much faster than 64bit operations, and we are finally free of the div64 fear at least in those involved functions. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Currently btrfs doesn't support stripe lengths other than 64KiB. This is already set in the tree-checker. There is really no meaning to record that fixed value in map_lookup for now, and can all be replaced with BTRFS_STRIPE_LEN. Furthermore we can use the fix stripe length to do the following optimization: - Use BTRFS_STRIPE_LEN_SHIFT to replace some 64bit division Now we only need to do a right shift. And the value of BTRFS_STRIPE_LEN itself is already too large for bit shift, thus if we accidentally use BTRFS_STRIPE_LEN to do bit shift, a compiler warning would be triggered. Thus this bit shift optimization would be safe. - Use BTRFS_STRIPE_LEN_MASK to calculate the offset inside a stripe Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 16 2月, 2023 1 次提交
-
-
由 Colin Ian King 提交于
There quite a few spelling mistakes as found using codespell. Fix them. Signed-off-by: NColin Ian King <colin.i.king@gmail.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 14 2月, 2023 1 次提交
-
-
由 Qu Wenruo 提交于
[BUG] When debugging a scrub related metadata error, it turns out that our metadata error reporting is not ideal. The only 3 error messages are: - BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 0, gen 1 Showing we have metadata generation mismatch errors. - BTRFS error (device dm-2): unable to fixup (regular) error at logical 7110656 on dev /dev/mapper/test-scratch1 Showing which tree blocks are corrupted. - BTRFS warning (device dm-2): checksum/header error at logical 24772608 on dev /dev/mapper/test-scratch2, physical 3801088: metadata node (level 1) in tree 5 Showing which physical range the corrupted metadata is at. We have to combine the above 3 to know we have a corrupted metadata with generation mismatch. And this is already the better case, if we have other problems, like fsid mismatch, we can not even know the cause. [CAUSE] The problem is caused by the fact that, scrub_checksum_tree_block() never outputs any error message. It just return two bits for scrub: sblock->header_error, and sblock->generation_error. And later we report error in scrub_print_warning(), but unfortunately we only have two bits, there is not really much thing we can done to print any detailed errors. [FIX] This patch will do the following to enhance the error reporting of metadata scrub: - Add extra warning (ratelimited) for every error we hit This can help us to distinguish the different types of errors. Some errors can help us to know what's going wrong immediately, like bytenr mismatch. - Re-order the checks Currently we check bytenr first, then immediately generation. This can lead to false generation mismatch reports, while the fsid mismatches. Here is the new output for the bug I'm debugging (we forgot to writeback tree blocks for commit roots): BTRFS warning (device dm-2): tree block 24117248 mirror 1 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc BTRFS warning (device dm-2): tree block 24117248 mirror 0 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc Now we can immediately know it's some tree blocks didn't even get written back, other than the original confusing generation mismatch. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 06 12月, 2022 12 次提交
-
-
由 Qu Wenruo 提交于
Although we have an existing function, btrfs_lookup_csums_range(), to find all data checksums for a range, it's based on a btrfs_ordered_sum list. For the incoming RAID56 data checksum verification at RMW time, we don't want to waste time by allocating temporary memory. So this patch will introduce a new helper, btrfs_lookup_csums_bitmap(). It will use bitmap based result, which will be a perfect fit for later RAID56 usage. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BUG] If dev-replace failed to re-construct its data/metadata, the kernel message would be incorrect for the missing device: BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev (efault) Note the above "dev (efault)" of the second line. While the first line is properly reporting "<missing disk>". [CAUSE] Although dev-replace is using btrfs_dev_name(), the heavy lifting work is still done by scrub (scrub is reused by both dev-replace and regular scrub). Unfortunately scrub code never uses btrfs_dev_name() helper, as it's only declared locally inside dev-replace.c. [FIX] Fix the output by: - Move the btrfs_dev_name() helper to volumes.h - Use btrfs_dev_name() to replace open-coded rcu_str_deref() calls Only zoned code is not touched, as I'm not familiar with degraded zoned code. - Constify return value and parameter Now the output looks pretty sane: BTRFS info (device dm-1): dev_replace from <missing disk> (devid 2) to /dev/mapper/test-scratch2 started BTRFS error (device dm-1): failed to rebuild valid logical 38862848 for dev <missing disk> Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
The public backref walking functions have quite a lot of arguments that are passed down the call stack to find_parent_nodes(), the core function of the backref walking code. The next patches in series will need to add even arguments to these functions that should be passed not only to find_parent_nodes(), but also to other functions used by the later (directly or even lower in the call stack). So create a structure to hold all these arguments and state used by the main backref walking function, find_parent_nodes(), and use it as the argument for the public backref walking functions iterate_extent_inodes(), btrfs_find_all_leafs() and btrfs_find_all_roots(). Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
The interface for find_parent_nodes() has two extent offset related arguments: 1) One u64 pointer argument for the extent offset; 2) One boolean argument to tell if the extent offset should be ignored or not. These are confusing, becase the extent offset pointer can be NULL and in some cases callers pass a NULL value as a way to tell the backref walking code to ignore offsets in file extent items (and simply consider all file extent items that point to the target data extent). The boolean argument was added in commit c995ab3c ("btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents"), but it was never really necessary, it was enough if it could find a way to get a NULL value passed to the "extent_item_pos" argument of find_parent_nodes(). The arguments are also passed to functions called by find_parent_nodes() and respective helper functions, which further makes everything more complicated than needed. Then we have several backref walking related functions that end up calling find_parent_nodes(), either directly or through some other function that they call, and for many we have to use an "extent_item_pos" (u64) argument and a boolean "ignore_offset" argument too. This is confusing and not really necessary. So use a single argument to specify the extent offset, as a simple u64 and not as a pointer, but using a special value of (u64)-1, defined as a documented constant, to indicate when the extent offset should be ignored. This is also preparation work for the upcoming patches in the series that add other arguments to find_parent_nodes() and other related functions that use it. Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
Currently send does not do the best decisions when it comes to decide between multiple clone sources, which results in clone operations for partial extent ranges, which has the following disadvantages: 1) We get less shared extents at the destination; 2) We have to read more data during the send operation and emit more write commands. Besides not being optimal behaviour, it also breaks user expectations and is often reported by users, with a recent example in the Link tag at the bottom of this change log. Part of the reason for this non-optimal behaviour is that the backref walking code does not provide information about the length of the file extent items that were found for each backref, so send is blind about which backref is the best to chose as a cloning source. The other existing reasons are just silliness, namely always prefering the inode with the lowest number when multiple are found for the same root and when we can clone from multiple roots, always prefer the send root over any of the other clone roots. This does not make any sense since any inode or root is fine and as good as any other inode/root. Fix this by making backref walking pass information about the number of bytes referenced by each file extent item and then have send's backref callback pick the inode with the highest number of bytes for each root. Finally select the root from which we can clone more bytes from. Example reproducer: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT xfs_io -f -c "pwrite -S 0xab -b 2M 0 2M" $MNT/foo cp --reflink=always $MNT/foo $MNT/bar cp --reflink=always $MNT/foo $MNT/baz sync # Overwrite the second half of file foo. xfs_io -c "pwrite -S 0xcd -b 1M 1M 1M" $MNT/foo sync echo echo "*** fiemap in the original filesystem ***" echo xfs_io -c "fiemap -v" $MNT/foo xfs_io -c "fiemap -v" $MNT/bar xfs_io -c "fiemap -v" $MNT/baz echo btrfs filesystem du $MNT btrfs subvolume snapshot -r $MNT $MNT/snap btrfs send -f /tmp/send_stream $MNT/snap umount $MNT mkfs.btrfs -f $DEV &> /dev/null mount $DEV $MNT btrfs receive -f /tmp/send_stream $MNT echo echo "*** fiemap in the new filesystem ***" echo xfs_io -r -c "fiemap -v" $MNT/snap/foo xfs_io -r -c "fiemap -v" $MNT/snap/bar xfs_io -r -c "fiemap -v" $MNT/snap/baz echo btrfs filesystem du $MNT rm -f /tmp/send_stream rm -f /tmp/snap.fssum umount $MNT Before this change: $ ./test.sh (...) *** fiemap in the original filesystem *** /mnt/sdi/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x1 /mnt/sdi/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 Total Exclusive Set shared Filename 2.00MiB 1.00MiB - /mnt/sdi/foo 2.00MiB 0.00B - /mnt/sdi/bar 2.00MiB 0.00B - /mnt/sdi/baz 6.00MiB 1.00MiB 2.00MiB /mnt/sdi Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap' At subvol /mnt/sdi/snap At subvol snap *** fiemap in the new filesystem *** /mnt/sdi/snap/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/snap/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x1 /mnt/sdi/snap/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 32768..34815 2048 0x1 Total Exclusive Set shared Filename 2.00MiB 0.00B - /mnt/sdi/snap/foo 2.00MiB 1.00MiB - /mnt/sdi/snap/bar 2.00MiB 1.00MiB - /mnt/sdi/snap/baz 6.00MiB 2.00MiB - /mnt/sdi/snap 6.00MiB 2.00MiB 2.00MiB /mnt/sdi We end up with two 1M extents that are not shared for files bar and baz. After this change: $ ./test.sh (...) *** fiemap in the original filesystem *** /mnt/sdi/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x1 /mnt/sdi/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 Total Exclusive Set shared Filename 2.00MiB 1.00MiB - /mnt/sdi/foo 2.00MiB 0.00B - /mnt/sdi/bar 2.00MiB 0.00B - /mnt/sdi/baz 6.00MiB 1.00MiB 2.00MiB /mnt/sdi Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap' At subvol /mnt/sdi/snap At subvol snap *** fiemap in the new filesystem *** /mnt/sdi/snap/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..4095]: 26624..30719 4096 0x2001 /mnt/sdi/snap/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x2001 /mnt/sdi/snap/baz: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2047]: 26624..28671 2048 0x2000 1: [2048..4095]: 30720..32767 2048 0x2001 Total Exclusive Set shared Filename 2.00MiB 0.00B - /mnt/sdi/snap/foo 2.00MiB 0.00B - /mnt/sdi/snap/bar 2.00MiB 0.00B - /mnt/sdi/snap/baz 6.00MiB 0.00B - /mnt/sdi/snap 6.00MiB 0.00B 3.00MiB /mnt/sdi Now there's a much better sharing, files bar and baz share 1M of the extent of file foo and the second extent of files bar and baz is shared between themselves. This will later be turned into a test case for fstests. Link: https://lore.kernel.org/linux-btrfs/20221008005704.795b44b0@crass-HP-ZBook-15-G2/Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
Move these out of ctree.h into scrub.h to cut down on code in ctree.h. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
Move these prototypes out of ctree.h and into file-item.h. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
All callers pas GFP_KERNEL as parameter so we can use it directly in alloc_scrub_sector. Reviewed-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 David Sterba 提交于
There's only one caller that calls scrub_setup_recheck_block in the memalloc_nofs_save/_restore protection so it's effectively already GFP_NOFS and it's safe to use GFP_KERNEL. Reviewed-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NAnand Jain <anand.jain@oracle.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
This is only used locally in scrub.c, move it out of ctree.h into scrub.c. Reviewed-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 07 11月, 2022 2 次提交
-
-
由 Johannes Thumshirn 提交于
If we're doing device replace on a zoned filesystem and discover in scrub_enumerate_chunks() that we don't have to copy the block group it is unlocked before it gets skipped. But as the block group hasn't yet been locked before it leads to a locking imbalance. To fix this simply remove the unlock. This was uncovered by fstests' testcase btrfs/163. Fixes: 9283b9e0 ("btrfs: remove lock protection for BLOCK_GROUP_FLAG_TO_COPY") Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
This reverts commit 786672e9. [BUG] Since commit 786672e9 ("btrfs: scrub: use larger block size for data extent scrub"), btrfs scrub no longer reports errors if the corruption is not in the first sector of a STRIPE_LEN. The following script can expose the problem: mkfs.btrfs -f $dev mount $dev $mnt xfs_io -f -c "pwrite -S 0xff 0 8k" $mnt/foobar umount $mnt # 13631488 is the logical bytenr of above 8K extent btrfs-map-logical -l 13631488 -b 4096 $dev mirror 1 logical 13631488 physical 13631488 device /dev/test/scratch1 # Corrupt the 2nd sector of that extent xfs_io -f -c "pwrite -S 0x00 13635584 4k" $dev mount $dev $mnt btrfs scrub start -B $mnt scrub done for 54e63f9f-0c30-4c84-a33b-5c56014629b7 Scrub started: Mon Nov 7 07:18:27 2022 Status: finished Duration: 0:00:00 Total to scrub: 536.00MiB Rate: 0.00B/s Error summary: no errors found <<< [CAUSE] That offending commit enlarges the data extent scrub size from sector size to BTRFS_STRIPE_LEN, to avoid extra scrub_block to be allocated. But unfortunately the data extent scrub is still heavily relying on the fact that there is only one scrub_sector per scrub_block. Thus it will only check the first sector, and ignoring the remaining sectors. Furthermore the error reporting is not able to handle multiple sectors either. [FIX] For now just revert the offending commit. The consequence is just extra memory usage during scrub. We will need a proper change to make the remaining data scrub path to handle multiple sectors before we enlarging the data scrub size. Reported-by: NLi Zhang <zhanglikernel@gmail.com> Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 29 9月, 2022 1 次提交
-
-
由 Josef Bacik 提交于
If we have NOWAIT specified on our IOCB and we're writing into a PREALLOC or NOCOW extent then we need to be able to tell can_nocow_extent that we don't want to wait on any locks or metadata IO. Fix can_nocow_extent to allow for NOWAIT. Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NStefan Roesch <shr@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 26 9月, 2022 5 次提交
-
-
由 Christoph Hellwig 提交于
The parity raid write/recover functionality is currently not very well abstracted from the bio submission and completion handling in volumes.c: - the raid56 code directly completes the original btrfs_bio fed into btrfs_submit_bio instead of dispatching back to volumes.c - the raid56 code consumes the bioc and bio_counter references taken by volumes.c, which also leads to special casing of the calls from the scrub code into the raid56 code To fix this up supply a bi_end_io handler that calls back into the volumes.c machinery, which then puts the bioc, decrements the bio_counter and completes the original bio, and updates the scrub code to also take ownership of the bioc and bio_counter in all cases. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NAnand Jain <anand.jain@oracle.com> Tested-by: NNikolay Borisov <nborisov@suse.com> Tested-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[PROBLEM] The existing scrub code for data extents always limit the block size to sectorsize. This causes quite some extra scrub_block being allocated: (there is a data extent at logical bytenr 298844160, length 64KiB) alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1 alloc_scrub_block: new block: logical=298848256 physical=298848256 mirror=1 alloc_scrub_block: new block: logical=298852352 physical=298852352 mirror=1 alloc_scrub_block: new block: logical=298856448 physical=298856448 mirror=1 alloc_scrub_block: new block: logical=298860544 physical=298860544 mirror=1 alloc_scrub_block: new block: logical=298864640 physical=298864640 mirror=1 alloc_scrub_block: new block: logical=298868736 physical=298868736 mirror=1 alloc_scrub_block: new block: logical=298872832 physical=298872832 mirror=1 alloc_scrub_block: new block: logical=298876928 physical=298876928 mirror=1 alloc_scrub_block: new block: logical=298881024 physical=298881024 mirror=1 alloc_scrub_block: new block: logical=298885120 physical=298885120 mirror=1 alloc_scrub_block: new block: logical=298889216 physical=298889216 mirror=1 alloc_scrub_block: new block: logical=298893312 physical=298893312 mirror=1 alloc_scrub_block: new block: logical=298897408 physical=298897408 mirror=1 alloc_scrub_block: new block: logical=298901504 physical=298901504 mirror=1 alloc_scrub_block: new block: logical=298905600 physical=298905600 mirror=1 ... scrub_block_put: free block: logical=298844160 physical=298844160 len=4096 mirror=1 scrub_block_put: free block: logical=298848256 physical=298848256 len=4096 mirror=1 scrub_block_put: free block: logical=298852352 physical=298852352 len=4096 mirror=1 scrub_block_put: free block: logical=298856448 physical=298856448 len=4096 mirror=1 scrub_block_put: free block: logical=298860544 physical=298860544 len=4096 mirror=1 scrub_block_put: free block: logical=298864640 physical=298864640 len=4096 mirror=1 scrub_block_put: free block: logical=298868736 physical=298868736 len=4096 mirror=1 scrub_block_put: free block: logical=298872832 physical=298872832 len=4096 mirror=1 scrub_block_put: free block: logical=298876928 physical=298876928 len=4096 mirror=1 scrub_block_put: free block: logical=298881024 physical=298881024 len=4096 mirror=1 scrub_block_put: free block: logical=298885120 physical=298885120 len=4096 mirror=1 scrub_block_put: free block: logical=298889216 physical=298889216 len=4096 mirror=1 scrub_block_put: free block: logical=298893312 physical=298893312 len=4096 mirror=1 scrub_block_put: free block: logical=298897408 physical=298897408 len=4096 mirror=1 scrub_block_put: free block: logical=298901504 physical=298901504 len=4096 mirror=1 scrub_block_put: free block: logical=298905600 physical=298905600 len=4096 mirror=1 This behavior will waste a lot of memory, especially after we have moved quite some members from scrub_sector to scrub_block. [FIX] To reduce the allocation of scrub_block, and to reduce memory usage, use BTRFS_STRIPE_LEN instead of sectorsize as the block size to scrub data extents. This results only one scrub_block to be allocated for above data extent: alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1 scrub_block_put: free block: logical=298844160 physical=298844160 len=65536 mirror=1 This would greatly reduce the memory usage (even it's just transient) for larger data extents scrub. For above example, the memory usage would be: Old: num_sectors * (sizeof(scrub_block) + sizeof(scrub_sector)) 16 * (408 + 96) = 8065 New: sizeof(scrub_block) + num_sectors * sizeof(scrub_sector) 408 + 16 * 96 = 1944 A good reduction of 75.9%. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Currently we store the following members in scrub_sector: - logical - physical - physical_for_dev_replace - dev - mirror_num However the current scrub code has ensured that scrub_blocks never cross stripe boundary. This is caused by the entry functions (scrub_simple_mirror, scrub_simple_stripe), thus every scrub_block will not cross stripe boundary. Thus this makes it possible to move those members into scrub_block other than putting them into scrub_sector. This should save quite some memory, as a scrub_block can be as large as 64 sectors, even for metadata it's 16 sectors byte default. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Although scrub currently works for subpage (PAGE_SIZE > sectorsize) cases, it will allocate one page for each scrub_sector, which can cause extra unnecessary memory usage. Utilize scrub_block::pages[] instead of allocating page for each scrub_sector, this allows us to integrate larger extents while using less memory. For example, if our page size is 64K, sectorsize is 4K, and we got an 32K sized extent. We will only allocate one page for scrub_block, and all 8 scrub sectors will point to that page. To do that properly, here we introduce several small helpers: - scrub_page_get_logical() Get the logical bytenr of a page. We store the logical bytenr of the page range into page::private. But for 32bit systems, their (void *) is not large enough to contain a u64, so in that case we will need to allocate extra memory for it. For 64bit systems, we can use page::private directly. - scrub_block_get_logical() Just get the logical bytenr of the first page. - scrub_sector_get_page() Return the page which the scrub_sector points to. - scrub_sector_get_page_offset() Return the offset inside the page which the scrub_sector points to. - scrub_sector_get_kaddr() Return the address which the scrub_sector points to. Just a wrapper using scrub_sector_get_page() and scrub_sector_get_page_offset() - bio_add_scrub_sector() Please note that, even with this patch, we're still allocating one page for one sector for data extents. This is because in scrub_extent() we split the data extent using sectorsize. The memory usage reduction will need extra work to make scrub to work like data read to only use the correct sector(s). Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BACKGROUND] Currently for scrub, we allocate one page for one sector, this is fine for PAGE_SIZE == sectorsize support, but can waste extra memory for subpage support. [CODE CHANGE] Make scrub_block contain all the pages, so if we're scrubbing an extent sized 64K, and our page size is also 64K, we only need to allocate one page. [LIFESPAN CHANGE] Since now scrub_sector no longer holds a page, but is using scrub_block::pages[] instead, we have to ensure scrub_block has a longer lifespan for write bio. The lifespan for read bio is already large enough. Now scrub_block will only be released after the write bio finished. [COMING NEXT] Currently we only added scrub_block::pages[] for this purpose, but scrub_sector is still utilizing the old scrub_sector::page. The switch will happen in the next patch. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-