- 26 9月, 2022 13 次提交
-
-
由 Christoph Hellwig 提交于
The parity raid write/recover functionality is currently not very well abstracted from the bio submission and completion handling in volumes.c: - the raid56 code directly completes the original btrfs_bio fed into btrfs_submit_bio instead of dispatching back to volumes.c - the raid56 code consumes the bioc and bio_counter references taken by volumes.c, which also leads to special casing of the calls from the scrub code into the raid56 code To fix this up supply a bi_end_io handler that calls back into the volumes.c machinery, which then puts the bioc, decrements the bio_counter and completes the original bio, and updates the scrub code to also take ownership of the bioc and bio_counter in all cases. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NAnand Jain <anand.jain@oracle.com> Tested-by: NNikolay Borisov <nborisov@suse.com> Tested-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[PROBLEM] The existing scrub code for data extents always limit the block size to sectorsize. This causes quite some extra scrub_block being allocated: (there is a data extent at logical bytenr 298844160, length 64KiB) alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1 alloc_scrub_block: new block: logical=298848256 physical=298848256 mirror=1 alloc_scrub_block: new block: logical=298852352 physical=298852352 mirror=1 alloc_scrub_block: new block: logical=298856448 physical=298856448 mirror=1 alloc_scrub_block: new block: logical=298860544 physical=298860544 mirror=1 alloc_scrub_block: new block: logical=298864640 physical=298864640 mirror=1 alloc_scrub_block: new block: logical=298868736 physical=298868736 mirror=1 alloc_scrub_block: new block: logical=298872832 physical=298872832 mirror=1 alloc_scrub_block: new block: logical=298876928 physical=298876928 mirror=1 alloc_scrub_block: new block: logical=298881024 physical=298881024 mirror=1 alloc_scrub_block: new block: logical=298885120 physical=298885120 mirror=1 alloc_scrub_block: new block: logical=298889216 physical=298889216 mirror=1 alloc_scrub_block: new block: logical=298893312 physical=298893312 mirror=1 alloc_scrub_block: new block: logical=298897408 physical=298897408 mirror=1 alloc_scrub_block: new block: logical=298901504 physical=298901504 mirror=1 alloc_scrub_block: new block: logical=298905600 physical=298905600 mirror=1 ... scrub_block_put: free block: logical=298844160 physical=298844160 len=4096 mirror=1 scrub_block_put: free block: logical=298848256 physical=298848256 len=4096 mirror=1 scrub_block_put: free block: logical=298852352 physical=298852352 len=4096 mirror=1 scrub_block_put: free block: logical=298856448 physical=298856448 len=4096 mirror=1 scrub_block_put: free block: logical=298860544 physical=298860544 len=4096 mirror=1 scrub_block_put: free block: logical=298864640 physical=298864640 len=4096 mirror=1 scrub_block_put: free block: logical=298868736 physical=298868736 len=4096 mirror=1 scrub_block_put: free block: logical=298872832 physical=298872832 len=4096 mirror=1 scrub_block_put: free block: logical=298876928 physical=298876928 len=4096 mirror=1 scrub_block_put: free block: logical=298881024 physical=298881024 len=4096 mirror=1 scrub_block_put: free block: logical=298885120 physical=298885120 len=4096 mirror=1 scrub_block_put: free block: logical=298889216 physical=298889216 len=4096 mirror=1 scrub_block_put: free block: logical=298893312 physical=298893312 len=4096 mirror=1 scrub_block_put: free block: logical=298897408 physical=298897408 len=4096 mirror=1 scrub_block_put: free block: logical=298901504 physical=298901504 len=4096 mirror=1 scrub_block_put: free block: logical=298905600 physical=298905600 len=4096 mirror=1 This behavior will waste a lot of memory, especially after we have moved quite some members from scrub_sector to scrub_block. [FIX] To reduce the allocation of scrub_block, and to reduce memory usage, use BTRFS_STRIPE_LEN instead of sectorsize as the block size to scrub data extents. This results only one scrub_block to be allocated for above data extent: alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1 scrub_block_put: free block: logical=298844160 physical=298844160 len=65536 mirror=1 This would greatly reduce the memory usage (even it's just transient) for larger data extents scrub. For above example, the memory usage would be: Old: num_sectors * (sizeof(scrub_block) + sizeof(scrub_sector)) 16 * (408 + 96) = 8065 New: sizeof(scrub_block) + num_sectors * sizeof(scrub_sector) 408 + 16 * 96 = 1944 A good reduction of 75.9%. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Currently we store the following members in scrub_sector: - logical - physical - physical_for_dev_replace - dev - mirror_num However the current scrub code has ensured that scrub_blocks never cross stripe boundary. This is caused by the entry functions (scrub_simple_mirror, scrub_simple_stripe), thus every scrub_block will not cross stripe boundary. Thus this makes it possible to move those members into scrub_block other than putting them into scrub_sector. This should save quite some memory, as a scrub_block can be as large as 64 sectors, even for metadata it's 16 sectors byte default. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Although scrub currently works for subpage (PAGE_SIZE > sectorsize) cases, it will allocate one page for each scrub_sector, which can cause extra unnecessary memory usage. Utilize scrub_block::pages[] instead of allocating page for each scrub_sector, this allows us to integrate larger extents while using less memory. For example, if our page size is 64K, sectorsize is 4K, and we got an 32K sized extent. We will only allocate one page for scrub_block, and all 8 scrub sectors will point to that page. To do that properly, here we introduce several small helpers: - scrub_page_get_logical() Get the logical bytenr of a page. We store the logical bytenr of the page range into page::private. But for 32bit systems, their (void *) is not large enough to contain a u64, so in that case we will need to allocate extra memory for it. For 64bit systems, we can use page::private directly. - scrub_block_get_logical() Just get the logical bytenr of the first page. - scrub_sector_get_page() Return the page which the scrub_sector points to. - scrub_sector_get_page_offset() Return the offset inside the page which the scrub_sector points to. - scrub_sector_get_kaddr() Return the address which the scrub_sector points to. Just a wrapper using scrub_sector_get_page() and scrub_sector_get_page_offset() - bio_add_scrub_sector() Please note that, even with this patch, we're still allocating one page for one sector for data extents. This is because in scrub_extent() we split the data extent using sectorsize. The memory usage reduction will need extra work to make scrub to work like data read to only use the correct sector(s). Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BACKGROUND] Currently for scrub, we allocate one page for one sector, this is fine for PAGE_SIZE == sectorsize support, but can waste extra memory for subpage support. [CODE CHANGE] Make scrub_block contain all the pages, so if we're scrubbing an extent sized 64K, and our page size is also 64K, we only need to allocate one page. [LIFESPAN CHANGE] Since now scrub_sector no longer holds a page, but is using scrub_block::pages[] instead, we have to ensure scrub_block has a longer lifespan for write bio. The lifespan for read bio is already large enough. Now scrub_block will only be released after the write bio finished. [COMING NEXT] Currently we only added scrub_block::pages[] for this purpose, but scrub_sector is still utilizing the old scrub_sector::page. The switch will happen in the next patch. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The allocation and initialization is shared by 3 call sites, and we're going to change the initialization of some members in the upcoming patches. So factor out the allocation and initialization of scrub_sector into a helper, alloc_scrub_sector(), which will do the following work: - Allocate the memory for scrub_sector - Allocate a page for scrub_sector::page - Initialize scrub_sector::refs to 1 - Attach the allocated scrub_sector to scrub_block The attachment is bidirectional, which means scrub_block::sectorv[] will be updated and scrub_sector::sblock will also be updated. - Update scrub_block::sector_count and do extra sanity check on it Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Although there are only two callers, we are going to add some members for scrub_block in the incoming patches. Factoring out the initialization code will make later expansion easier. One thing to note is, even scrub_handle_errored_block() doesn't utilize scrub_block::refs, we still use alloc_scrub_block() to initialize sblock::ref, allowing us to use scrub_block_put() to do cleanup. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
In function scrub_handle_errored_block(), we use @sblocks_for_recheck pointer to hold one scrub_block for each mirror, and uses kcalloc() to allocate an array. But this one pointer for an array is not readable due to the member offsets done by addition and not []. Change this pointer to struct scrub_block *[BTRFS_MAX_MIRRORS], this will slightly increase the stack memory usage. Since function scrub_handle_errored_block() won't get iterative calls, this extra cost would completely be acceptable. And since we're here, also set sblock->refs and use scrub_block_put() to clean them up, as later we will add extra members in scrub_block, which needs scrub_block_put() to clean them up. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
There are several sanity checks which are no longer possible to trigger inside btrfs_scrub_dev(). Since we have mount time check against super block nodesize/sectorsize, and our fixed macro is hardcoded to handle even the worst combination. Thus those sanity checks are no longer needed, can be easily removed. But this patch still uses some ASSERT()s as a safe net just in case we change some features in the future to trigger those impossible combinations. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
We use this during device replace for zoned devices, we were simply taking the lock because it was in a bit field and we needed the lock to be safe with other modifications in the bitfield. With the bit helpers we no longer require that locking. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
We use a bit field in the btrfs_block_group for different flags, however this is awkward because we have to hold the block_group->lock for any modification of any of these fields, and makes the code clunky for a few of these flags. Convert these to a properly flags setup so we can utilize the bit helpers. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BUG] The following script shows that, although scrub can detect super block errors, it never tries to fix it: mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2 xfs_io -c "pwrite 67108864 4k" $dev2 mount $dev1 $mnt btrfs scrub start -B $dev2 btrfs scrub start -Br $dev2 umount $mnt The first scrub reports the super error correctly: scrub done for f3289218-abd3-41ac-a630-202f766c0859 Scrub started: Tue Aug 2 14:44:11 2022 Status: finished Duration: 0:00:00 Total to scrub: 1.26GiB Rate: 0.00B/s Error summary: super=1 Corrected: 0 Uncorrectable: 0 Unverified: 0 But the second read-only scrub still reports the same super error: Scrub started: Tue Aug 2 14:44:11 2022 Status: finished Duration: 0:00:00 Total to scrub: 1.26GiB Rate: 0.00B/s Error summary: super=1 Corrected: 0 Uncorrectable: 0 Unverified: 0 [CAUSE] The comments already shows that super block can be easily fixed by committing a transaction: /* * If we find an error in a super block, we just report it. * They will get written with the next transaction commit * anyway */ But the truth is, such assumption is not always true, and since scrub should try to repair every error it found (except for read-only scrub), we should really actively commit a transaction to fix this. [FIX] Just commit a transaction if we found any super block errors, after everything else is done. We cannot do this just after scrub_supers(), as btrfs_commit_transaction() will try to pause and wait for the running scrub, thus we can not call it with scrub_lock hold. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[PROBLEM] Unlike data/metadata corruption, if scrub detected some error in the super block, the only error message is from the updated device status: BTRFS info (device dm-1): scrub: started on devid 2 BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0 This is not helpful at all. [CAUSE] Unlike data/metadata error reporting, there is no visible report in kernel dmesg to report supper block errors. In fact, return value of scrub_checksum_super() is intentionally skipped, thus scrub_handle_errored_block() will never be called for super blocks. [FIX] Make super block errors to output an error message, now the full dmesg would looks like this: BTRFS info (device dm-1): scrub: started on devid 2 BTRFS warning (device dm-1): super block error on device /dev/mapper/test-scratch2, physical 67108864 BTRFS error (device dm-1): bdev /dev/mapper/test-scratch2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 BTRFS info (device dm-1): scrub: finished on devid 2 with status: 0 BTRFS info (device dm-1): scrub: started on devid 2 This fix involves: - Move the super_errors reporting to scrub_handle_errored_block() This allows the device status message to show after the super block error message. But now we no longer distinguish super block corruption and generation mismatch, now all counted as corruption. - Properly check the return value from scrub_checksum_super() - Add extra super block error reporting for scrub_print_warning(). Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 25 7月, 2022 4 次提交
-
-
由 Christoph Hellwig 提交于
Always consume the bio and call the end_io handler on error instead of returning an error and letting the caller handle it. This matches what the block layer submission does and avoids any confusion on who needs to handle errors. Also use the proper bool type for the generic_io argument. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Tested-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Christoph Hellwig 提交于
The raid56 code assumes a fixed stripe length BTRFS_STRIPE_LEN but there are functions passing it as arguments, this is not necessary. The fixed value has been used for a long time and though the stripe length should be configurable by super block member stripesize, this hasn't been implemented and would require more changes so we don't need to keep this code around until then. Partially based on a patch from Qu Wenruo. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Tested-by: NNikolay Borisov <nborisov@suse.com> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> [ update changelog ] Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
For scrub_stripe() we can easily calculate the dev extent length as we have the full info of the chunk. Thus there is no need to pass @dev_extent_len from the caller, and we introduce a helper, btrfs_calc_stripe_length(), to do the calculation from extent_map structure. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Previously we use "unsigned long *" for those two bitmaps. But since we only support fixed stripe length (64KiB, already checked in tree-checker), "unsigned long *" is really a waste of memory, while we can just use "unsigned long". This saves us 8 bytes in total for scrub_parity. To be extra safe, add an ASSERT() making sure calclulated @nsectors is always smaller than BITS_PER_LONG. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 16 5月, 2022 19 次提交
-
-
由 Qu Wenruo 提交于
[SUSPICIOUS CODE] When refactoring scrub code, I noticed a very strange behavior around scrub_remap_extent(): if (sctx->is_dev_replace) scrub_remap_extent(fs_info, cur_logical, scrub_len, &cur_physical, &target_dev, &cur_mirror); As replace target is a 1:1 copy of the source device, thus physical offset inside the target should be the same as physical inside source, thus this remap call makes no sense to me. [REAL FUNCTIONALITY] After more investigation, the function name scrub_remap_extent() doesn't tell anything of the truth, nor does its if () condition. The real story behind this function is that, for scrub_pages() we never expect missing device, even for replacing missing device. What scrub_remap_extent() is really doing is to find a live mirror, and make later scrub_pages() to read data from the good copy, other than from the missing device and increase error counters unnecessarily. [IMPROVEMENT] We have no need to bother scrub_remap_extent() in scrub_simple_mirror() at all, we only need to call it before we call scrub_pages(). And rename the function to scrub_find_live_copy(), add extra comments on them. By this we can remove one parameter from scrub_extent(), and reduce the unnecessary calls to scrub_remap_extent() for regular replace. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Since we have find_first_extent_item() to iterate the extent items of a certain range, there is no need to use the open-coded version. Replace the final scrub call site with find_first_extent_item(). Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Currently scrub_raid56_parity() has a large double loop, handling the following things at the same time: - Iterate each data stripe - Iterate each extent item in one data stripe Refactor this by: - Introduce a new helper to handle data stripe iteration The new helper is scrub_raid56_data_stripe_for_parity(), which only has one while() loop handling the extent items inside the data stripe. The code is still mostly the same as the old code. - Call cond_resched() for each extent Previously we only call cond_resched() under a complex if () check. I see no special reason to do that, and for other scrub functions, like scrub_simple_mirror() we're already doing the same cond_resched() after scrubbing one extent. - Add more comments Please note that, this patch is only to address the double loop, there are incoming patches to do extra cleanup. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Although RAID56 has complex repair mechanism, which involves reading the whole full stripe, but inside one data stripe, it's in fact no different than SINGLE/RAID1. The point here is, for data stripe we just check the csum for each extent we hit. Only for csum mismatch case, our repair paths divide. So we can still reuse scrub_simple_mirror() for RAID56 data stripes, which saves quite some code. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Since we have moved all other profiles handling into their own functions, now the main body of scrub_stripe() is just handling RAID56 profiles. There is no need to address other profiles in the main loop of scrub_stripe(), so we can remove those dead branches. Since we're here, also slightly change the timing of initialization of variables like @offset, @increment and @logical. Especially for @logical, we don't really need to initialize it for btrfs_extent_root()/btrfs_csum_root(), we can use bg->start for that purpose. Now those variables are only initialize for RAID56 branches. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The new entrance will iterate through each data stripe which belongs to the target device. And since inside each data stripe, RAID0 is just SINGLE, while RAID10 is just RAID1, we can reuse scrub_simple_mirror() to do the scrub properly. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The new helper, scrub_simple_mirror(), will scrub all extents inside a range which only has simple mirror based duplication. This covers every range of SINGLE/DUP/RAID1/RAID1C*, and inside each data stripe for RAID0/RAID10. Currently we will use this function to scrub SINGLE/DUP/RAID1/RAID1C* profiles. As one can see, the new entrance for those simple-mirror based profiles can be small enough (with comments, just reach 100 lines). This function will be the basis for the incoming scrub refactor. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The new helper, find_first_extent_item(), will locate an extent item (either EXTENT_ITEM or METADATA_ITEM) which covers any byte of the search range. This helper will later be used to refactor scrub code. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The variable @physical_end is the exclusive stripe end, currently it's calculated using @physical + @dev_extent_len / map->stripe_len * map->stripe_len. And since at allocation time we ensured dev_extent_len is stripe_len aligned, the result is the same as @physical + @dev_extent_len. So this patch will just assign @physical and @physical_end early, without using @nstripes. This is especially helpful for any possible out: label user, as now we only need to initialize @offset before going to out: label. Since we're here, also make @physical_end constant. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Christoph Hellwig 提交于
All three scrub workqueues don't need ordered execution or thread disabling threshold (as the thresh parameter is less than DFT_THRESHOLD). Just switch to the normal workqueues that use a lot less resources, especially in the work_struct vs btrfs_work structures. Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
This requires one extra parameter @pgoff for the function. In the current code base, scrub is still one page per sector, thus the new parameter will always be 0. It needs the extra subpage scrub optimization code to fully take advantage. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Christoph Hellwig 提交于
All the scrub bios go straight to the block device or the raid56 code, none of which looks at the btrfs_bio. Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Christoph Hellwig 提交于
The I/O in repair_io_failue is synchronous and doesn't need a btrfs_bio, so just use an on-stack bio. Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Christoph Hellwig 提交于
The I/O in repair_io_failue is synchronous and doesn't need a btrfs_bio, so just use an on-stack bio. Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Christoph Hellwig 提交于
Require a separate call to the integrity checking helpers from the actual bio submission. Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Yu Zhe 提交于
Explicit type casts are not necessary when it's void* to another pointer type. Signed-off-by: NYu Zhe <yuzhe@nfschina.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Since the subpage support for scrub, one page no longer always represents one sector, thus scrub_bio::pagev and scrub_bio::sector_count are no longer accurate. Rename them to scrub_bio::sectors and scrub_bio::sector_count respectively. This also involves scrub_ctx::pages_per_bio and other macros involved. Now the renaming of pages involved in scrub is be finished. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Since the subpage support of scrub, scrub_sector is in fact just representing one sector. Thus the name scrub_page is no longer correct, rename it to scrub_sector. This also involves the following renames: - spage -> sector Normally we would just replace "page" with "sector" and result something like "ssector". But the repeating 's' is not really eye friendly. So here we just simple use "sector", as there is nothing from MM layer called "sector" to cause any confusion. - scrub_parity::spages -> sectors_list Normally we use plural to indicate an array, not a list. Rename it to @sectors_list to be more explicit on the list part. - Also reformat and update comments that get changed Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The following will be renamed in this patch: - scrub_block::pagev -> sectors - scrub_block::page_count -> sector_count - SCRUB_MAX_PAGES_PER_BLOCK -> SCRUB_MAX_SECTORS_PER_BLOCK - page_num -> sector_num to iterate scrub_block::sectors For now scrub_page is not yet renamed to keep the patch reasonable and it will be updated in a followup. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 21 4月, 2022 1 次提交
-
-
由 Filipe Manana 提交于
During a scrub, or device replace, we can race with block group removal and allocation and trigger the following assertion failure: [7526.385524] assertion failed: cache->start == chunk_offset, in fs/btrfs/scrub.c:3817 [7526.387351] ------------[ cut here ]------------ [7526.387373] kernel BUG at fs/btrfs/ctree.h:3599! [7526.388001] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [7526.388970] CPU: 2 PID: 1158150 Comm: btrfs Not tainted 5.17.0-rc8-btrfs-next-114 #4 [7526.390279] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [7526.392430] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs] [7526.393520] Code: f3 48 c7 c7 20 (...) [7526.396926] RSP: 0018:ffffb9154176bc40 EFLAGS: 00010246 [7526.397690] RAX: 0000000000000048 RBX: ffffa0db8a910000 RCX: 0000000000000000 [7526.398732] RDX: 0000000000000000 RSI: ffffffff9d7239a2 RDI: 00000000ffffffff [7526.399766] RBP: ffffa0db8a911e10 R08: ffffffffa71a3ca0 R09: 0000000000000001 [7526.400793] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa0db4b170800 [7526.401839] R13: 00000003494b0000 R14: ffffa0db7c55b488 R15: ffffa0db8b19a000 [7526.402874] FS: 00007f6c99c40640(0000) GS:ffffa0de6d200000(0000) knlGS:0000000000000000 [7526.404038] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [7526.405040] CR2: 00007f31b0882160 CR3: 000000014b38c004 CR4: 0000000000370ee0 [7526.406112] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [7526.407148] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [7526.408169] Call Trace: [7526.408529] <TASK> [7526.408839] scrub_enumerate_chunks.cold+0x11/0x79 [btrfs] [7526.409690] ? do_wait_intr_irq+0xb0/0xb0 [7526.410276] btrfs_scrub_dev+0x226/0x620 [btrfs] [7526.410995] ? preempt_count_add+0x49/0xa0 [7526.411592] btrfs_ioctl+0x1ab5/0x36d0 [btrfs] [7526.412278] ? __fget_files+0xc9/0x1b0 [7526.412825] ? kvm_sched_clock_read+0x14/0x40 [7526.413459] ? lock_release+0x155/0x4a0 [7526.414022] ? __x64_sys_ioctl+0x83/0xb0 [7526.414601] __x64_sys_ioctl+0x83/0xb0 [7526.415150] do_syscall_64+0x3b/0xc0 [7526.415675] entry_SYSCALL_64_after_hwframe+0x44/0xae [7526.416408] RIP: 0033:0x7f6c99d34397 [7526.416931] Code: 3c 1c e8 1c ff (...) [7526.419641] RSP: 002b:00007f6c99c3fca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [7526.420735] RAX: ffffffffffffffda RBX: 00005624e1e007b0 RCX: 00007f6c99d34397 [7526.421779] RDX: 00005624e1e007b0 RSI: 00000000c400941b RDI: 0000000000000003 [7526.422820] RBP: 0000000000000000 R08: 00007f6c99c40640 R09: 0000000000000000 [7526.423906] R10: 00007f6c99c40640 R11: 0000000000000246 R12: 00007fff746755de [7526.424924] R13: 00007fff746755df R14: 0000000000000000 R15: 00007f6c99c40640 [7526.425950] </TASK> That assertion is relatively new, introduced with commit d04fbe19 ("btrfs: scrub: cleanup the argument list of scrub_chunk()"). The block group we get at scrub_enumerate_chunks() can actually have a start address that is smaller then the chunk offset we extracted from a device extent item we got from the commit root of the device tree. This is very rare, but it can happen due to a race with block group removal and allocation. For example, the following steps show how this can happen: 1) We are at transaction T, and we have the following blocks groups, sorted by their logical start address: [ bg A, start address A, length 1G (data) ] [ bg B, start address B, length 1G (data) ] (...) [ bg W, start address W, length 1G (data) ] --> logical address space hole of 256M, there used to be a 256M metadata block group here [ bg Y, start address Y, length 256M (metadata) ] --> Y matches W's end offset + 256M Block group Y is the block group with the highest logical address in the whole filesystem; 2) Block group Y is deleted and its extent mapping is removed by the call to remove_extent_mapping() made from btrfs_remove_block_group(). So after this point, the last element of the mapping red black tree, its rightmost node, is the mapping for block group W; 3) While still at transaction T, a new data block group is allocated, with a length of 1G. When creating the block group we do a call to find_next_chunk(), which returns the logical start address for the new block group. This calls returns X, which corresponds to the end offset of the last block group, the rightmost node in the mapping red black tree (fs_info->mapping_tree), plus one. So we get a new block group that starts at logical address X and with a length of 1G. It spans over the whole logical range of the old block group Y, that was previously removed in the same transaction. However the device extent allocated to block group X is not the same device extent that was used by block group Y, and it also does not overlap that extent, which must be always the case because we allocate extents by searching through the commit root of the device tree (otherwise it could corrupt a filesystem after a power failure or an unclean shutdown in general), so the extent allocator is behaving as expected; 4) We have a task running scrub, currently at scrub_enumerate_chunks(). There it searches for device extent items in the device tree, using its commit root. It finds a device extent item that was used by block group Y, and it extracts the value Y from that item into the local variable 'chunk_offset', using btrfs_dev_extent_chunk_offset(); It then calls btrfs_lookup_block_group() to find block group for the logical address Y - since there's currently no block group that starts at that logical address, it returns block group X, because its range contains Y. This results in triggering the assertion: ASSERT(cache->start == chunk_offset); right before calling scrub_chunk(), as cache->start is X and chunk_offset is Y. This is more likely to happen of filesystems not larger than 50G, because for these filesystems we use a 256M size for metadata block groups and a 1G size for data block groups, while for filesystems larger than 50G, we use a 1G size for both data and metadata block groups (except for zoned filesystems). It could also happen on any filesystem size due to the fact that system block groups are always smaller (32M) than both data and metadata block groups, but these are not frequently deleted, so much less likely to trigger the race. So make scrub skip any block group with a start offset that is less than the value we expect, as that means it's a new block group that was created in the current transaction. It's pointless to continue and try to scrub its extents, because scrub searches for extents using the commit root, so it won't find any. For a device replace, skip it as well for the same reasons, and we don't need to worry about the possibility of extents of the new block group not being to the new device, because we have the write duplication setup done through btrfs_map_block(). Fixes: d04fbe19 ("btrfs: scrub: cleanup the argument list of scrub_chunk()") CC: stable@vger.kernel.org # 5.17 Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 14 3月, 2022 1 次提交
-
-
由 Jiapeng Chong 提交于
increment is being initialized to map->stripe_len but this is never read as increment is overwritten later on. Remove the redundant initialization. Cleans up the following clang-analyzer warning: fs/btrfs/scrub.c:3193:6: warning: Value stored to 'increment' during its initialization is never read [clang-analyzer-deadcode.DeadStores]. Reported-by: NAbaci Robot <abaci@linux.alibaba.com> Signed-off-by: NJiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 07 1月, 2022 2 次提交
-
-
由 Qu Wenruo 提交于
The argument list of btrfs_stripe() has similar problems of scrub_chunk(): - Duplicated and ambiguous @base argument Can be fetched from btrfs_block_group::bg. - Ambiguous argument @length It's again device extent length - Ambiguous argument @num The instinctive guess would be mirror number, but in fact it's stripe index. Fix it by: - Remove @base parameter - Rename @length to @dev_extent_len - Rename @num to @stripe_index Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The argument list of scrub_chunk() has the following problems: - Duplicated @chunk_offset It is the same as btrfs_block_group::start. - Confusing @length The most instinctive guess is chunk length, and one may want to delete it, but the truth is, it's the device extent length. Fix this by: - Remove @chunk_offset Use btrfs_block_group::start instead. - Rename @length to @dev_extent_len Also rename the caller to remove the ambiguous naming. - Rename @cache to @bg The "_cache" suffix for btrfs_block_group has been removed for a while. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-