- 07 Jan, 2022 1 commit
-
-
Committed by Qu Wenruo
Use BTRFS_MAX_METADATA_BLOCKSIZE and SZ_4K (the minimal sectorsize) to calculate this value.

And remove one stale comment on the value; in fact, with recent subpage support, BTRFS_MAX_METADATA_BLOCKSIZE * PAGE_SIZE is already beyond BTRFS_STRIPE_LEN, we just don't use the full page.

Also, since we're here, update the BUG_ON() related to SCRUB_MAX_PAGES_PER_BLOCK to ASSERT(), as those ASSERT()s are really only for developers to catch obvious bugs early, not to let end users suffer.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
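A minimal sketch of the calculation described above, assuming "this value" refers to SCRUB_MAX_PAGES_PER_BLOCK and that the two constants are 64KiB and 4KiB as in the kernel headers of that period. The helper `scrub_block_has_room()` is hypothetical and only illustrates where such a limit would be checked.

```c
#include <assert.h>

/* Assumed values: 64KiB maximum nodesize, 4KiB minimal sectorsize. */
#define SZ_4K                        0x1000
#define BTRFS_MAX_METADATA_BLOCKSIZE 65536

/* One scrub "page" per minimal sector of the largest possible tree block. */
#define SCRUB_MAX_PAGES_PER_BLOCK (BTRFS_MAX_METADATA_BLOCKSIZE / SZ_4K)

/* Hypothetical helper: can one more page be attached to a scrub block? */
static inline int scrub_block_has_room(unsigned int page_count)
{
	return page_count < SCRUB_MAX_PAGES_PER_BLOCK;
}
```

Deriving the limit from the two constants keeps it correct if either one changes, instead of relying on a hard-coded number and a comment that can go stale.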
-
- 03 Jan, 2022 3 commits
-
-
Committed by Josef Bacik
We are going to have multiple csum roots in the future, so convert all users of ->csum_root to btrfs_csum_root() and rename ->csum_root to ->_csum_root so we can easily find remaining users in the future.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Josef Bacik
When we start having multiple extent roots we'll need to use a helper to get to the correct extent_root. Rename fs_info->extent_root to _extent_root and convert all of the users of the extent root to using the btrfs_extent_root() helper. This will allow us to easily clean up the remaining direct accesses in the future.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Josef Bacik
Now that all call sites are using the slot number to modify item values, rename the SETGET helpers to raw_item_*(), then rework the _nr() helpers to be the btrfs_item_*() and btrfs_set_item_*() helpers, and then rename all of the callers to the new helpers.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
- 16 Nov, 2021 1 commit
-
-
Committed by Colin Ian King
The bitfields have_csum and io_error are currently signed, which is not recommended as the representation is implementation-defined behaviour. Fix this by making the bit-fields unsigned ints.

Fixes: 2c363954 ("btrfs: scrub: remove the anonymous structure from scrub_page")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
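As a rough illustration of why the unsigned form is preferred: a one-bit field declared as plain `int` may be treated as signed, in which case it can only hold 0 and -1, so a test such as `field == 1` can fail. The struct below is a hypothetical stand-in, not the real scrub_page layout.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the flag members. With "unsigned int",
 * each one-bit field holds exactly 0 or 1 on every compiler.
 * Declared as plain "int", the signedness of the field would be
 * implementation-defined.
 */
struct scrub_flags {
	unsigned int have_csum:1;
	unsigned int io_error:1;
};

/* Store 1 in a flag and read it back; returns nonzero on success. */
static inline int unsigned_flag_roundtrip(void)
{
	struct scrub_flags f = { .have_csum = 1, .io_error = 0 };

	return f.have_csum == 1 && f.io_error == 0;
}
```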
-
- 27 Oct, 2021 6 commits
-
-
Committed by Josef Bacik
We have a lot of device lookup functions that all do something slightly different. Clean this up by adding a struct to hold the different lookup criteria, and then pass this around to btrfs_find_device() so it can do the proper matching based on the lookup criteria.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
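The pattern of bundling lookup criteria into one struct can be sketched as follows. All names here (`struct dev`, `struct dev_lookup_args`, `find_device()`) are hypothetical simplifications for illustration and are not the kernel's definitions; the sketch assumes a NULL uuid means "do not match on uuid".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UUID_SIZE 16

/* Hypothetical, simplified device record. */
struct dev {
	uint64_t devid;
	uint8_t uuid[UUID_SIZE];
};

/* All lookup criteria in one place; a NULL uuid is not matched. */
struct dev_lookup_args {
	uint64_t devid;
	const uint8_t *uuid;
};

/* One lookup function that applies whichever criteria are set. */
static inline struct dev *find_device(struct dev *devs, size_t n,
				      const struct dev_lookup_args *args)
{
	for (size_t i = 0; i < n; i++) {
		if (devs[i].devid != args->devid)
			continue;
		if (args->uuid &&
		    memcmp(devs[i].uuid, args->uuid, UUID_SIZE) != 0)
			continue;
		return &devs[i];
	}
	return NULL;
}
```

Adding a new criterion then means adding one member and one comparison, rather than another near-duplicate lookup function.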
-
Committed by Josef Bacik
We have a few flags that are inconsistently used to describe the fs in different states of failure. As of 5963ffca ("btrfs: always abort the transaction if we abort a trans handle") we will always set BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED and ERROR to see if things have gone wrong. Add a helper to check BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to use the helper.

The TRANS_ABORTED bit check was added in af722733 ("Btrfs: clean up resources during umount after trans is aborted") but is not actually specific.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
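A single-flag check helper of the kind described might look like the sketch below. The bit number, the struct and the helper name `fs_error()` are assumptions made for illustration; the kernel uses its own bit operations and definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit number for the error state. */
enum { FS_STATE_ERROR = 0 };

/* Hypothetical, reduced fs_info carrying only the state bits. */
struct fs_info {
	unsigned long fs_state;
};

/* The one place callers ask "has this filesystem hit an error?". */
static inline bool fs_error(const struct fs_info *fs_info)
{
	return (fs_info->fs_state >> FS_STATE_ERROR) & 1UL;
}
```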
-
Committed by Qu Wenruo
We can grab fs_info reliably from btrfs_raid_bio::bioc, as the bioc is always passed into alloc_rbio(), and only gets released when the raid bio is released.

Remove the btrfs_raid_bio::fs_info member, and clean up all the @fs_info parameters for alloc_rbio() callers.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Qu Wenruo
Previously we had "struct btrfs_bio", which records IO context for mirrored IO and RAID56, and "struct btrfs_io_bio", which records extra btrfs specific info for logical bytenr bio.

With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename "btrfs_io_bio" to "btrfs_bio", which is a more suitable name now.

The struct btrfs_bio changes meaning by this commit. There was a suggested name like btrfs_logical_bio but it's a bit long and we'd prefer to use a shorter name.

This could be a concern for backports to older kernels where the different meaning could possibly cause confusion or bugs. Comparing the new and old structures, there's no overlap among the struct members, so a build would break in case of an incorrect backport.

We haven't had many backports to bio code anyway, so this is more of a theoretical cause of bugs and a matter of precaution, but we'll need to keep the semantic change in mind.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Qu Wenruo
The helper btrfs_bio_alloc() is almost the same as btrfs_io_bio_alloc(), except it allocates using BIO_MAX_VECS as @nr_iovecs, and initializes bio->bi_iter.bi_sector.

However, the name itself does not use "btrfs_io_bio" to indicate that its parameter is "struct btrfs_io_bio", and it can easily be confused with "struct btrfs_bio".

Considering that assigning bio->bi_iter.bi_sector is such simple work and there are already tons of call sites doing that manually, there is no need to do that in a helper.

Remove the btrfs_bio_alloc() helper, and enhance the btrfs_io_bio_alloc() function to provide a fail-safe value for its @nr_iovecs. Then replace all btrfs_bio_alloc() callers with btrfs_io_bio_alloc().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Qu Wenruo
The structure btrfs_bio is used by two different sites:

  - bio->bi_private for mirror based profiles
    For those profiles (SINGLE/DUP/RAID1*/RAID10), this structure records how many mirrors are still pending, and saves the original endio function of the bio.

  - RAID56 code
    In that case, RAID56 only utilizes the stripes info, and no longer uses that to trace the pending mirrors.

So btrfs_bio is not always bound to a bio, and contains more info for IO context, thus renaming it will make the naming less confusing.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
- 22 Jun, 2021 1 commit
-
-
Committed by David Sterba
Fix typos that have snuck in since the last round. Found by codespell.

Signed-off-by: David Sterba <dsterba@suse.com>
-
- 21 Jun, 2021 3 commits
-
-
Committed by Qu Wenruo
[BUG]
For the following file layout, scrub will not be able to repair these two repairable errors, and will in fact make one corruption unrepairable:

	inode offset 0      4k     8K
	Mirror 1               |XXXXXX|      |
	Mirror 2               |      |XXXXXX|

[CAUSE]
The root cause is the hard coded PAGE_SIZE, which makes scrub repair go crazy for subpage.

For the above case, when reading the first sector, we use PAGE_SIZE rather than sectorsize to read, which makes us read the full range [0, 64K). In fact, after 8K there may be no data at all, so we can just get some garbage.

Then when doing the repair, we also write back a full page from mirror 2. This means we will also write the corrupted data in mirror 2 back to mirror 1, leaving the range [4K, 8K) unrepairable.

[FIX]
This patch replaces the following PAGE_SIZE uses with sectorsize:

  - scrub_print_warning_inode()
    Remove the min() and replace PAGE_SIZE with sectorsize. The min() makes no sense, as csum is done for the full sector with padding.

    This fixes a bug where subpage reports extra length like:

      checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test, physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file)

    Where the error is only 1 sector.

  - scrub_handle_errored_block()
    Comments with PAGE|page involved, all changed to sector.

  - scrub_setup_recheck_block()
  - scrub_repair_page_from_good_copy()
  - scrub_add_page_to_wr_bio()
  - scrub_wr_submit()
  - scrub_add_page_to_rd_bio()
  - scrub_block_complete()
    Replace PAGE_SIZE with sectorsize. This solves several problems where we read/write extra range for the subpage case.

RAID56 code is excluded intentionally, as RAID56 has extra PAGE_SIZE usage, and is not really safe enough. Thus we will reject RAID56 for subpage in a later commit.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
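The effect of the wrong IO unit can be shown with a small model. This is only a sketch under the layout from the report (4K sectors, 64K pages); `repair_range()` and `range_covers()` are hypothetical helpers, not scrub functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct io_range {
	uint64_t start;
	uint64_t len;
};

/* Range touched when repairing the IO unit that contains @offset. */
static inline struct io_range repair_range(uint64_t offset, uint32_t unit)
{
	struct io_range r = {
		.start = offset - (offset % unit),
		.len = unit,
	};

	return r;
}

/* Does @r include the byte at @offset? */
static inline bool range_covers(struct io_range r, uint64_t offset)
{
	return offset >= r.start && offset < r.start + r.len;
}
```

With a 4K unit, repairing the sector at offset 0 leaves the sector at 4K alone. With a 64K unit, the same repair also covers offset 4K, which is how the corrupted sector from mirror 2 ended up overwriting the good copy on mirror 1.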
-
Committed by David Sterba
There are common values set for the stripe constraints, some of them are already factored out. Do that for increment and mirror_num as well.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by David Sterba
Add a sysfs interface to limit IO during scrub. We relied on the ionice interface to do that, eg. the idle class left the system usable while scrub was running. This changed when mq-deadline became widespread and did not implement the scheduling classes. That was a CFQ thing that got deleted. We've got numerous complaints from users about degraded performance.

Currently only BFQ supports that, but it's not a common scheduler and we can't ask everybody to switch to it.

Alternatively, the cgroup IO limiting can be used, but that is also a non-trivial setup (v2 required, the controller must be enabled on the system). This can still be used if desired.

Other ideas that have been explored: piggy-back on ionice (that is set per-process and is accessible) and interpret the class and classdata as bandwidth limits, but this does not have enough flexibility as there are only 8 allowed and we'd have to map fixed limits to each value. Also, adjusting the value would need to look up the process that currently runs scrub on the given device, and the value is not sticky so it would have to be adjusted each time scrub runs.

Running out of options, sysfs does not look that bad:

  - it's accessible from scripts, or udev rules
  - the name is similar to what MD-RAID has (/proc/sys/dev/raid/speed_limit_max or /sys/block/mdX/md/sync_speed_max)
  - the value is sticky at least for filesystem mount time
  - adjusting the value has immediate effect
  - sysfs is available in constrained environments (eg. system rescue)
  - the limit also applies to device replace

Sysfs:

  - raw value is in bytes
  - values written to the file accept suffixes like K, M
  - file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max
  - 0 means use default priority of IO

The scheduler is a simple deadline one and the accuracy is up to the nearest 128K.

Signed-off-by: David Sterba <dsterba@suse.com>
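To make the value format concrete, here is a user-space sketch of how a string such as "100M" could be turned into a byte count. It assumes the suffixes are binary multiples (K = 1024, M = 1024 * 1024); `parse_limit()` is a hypothetical helper and does not reproduce the kernel's own parsing code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical parser for a limit string: a decimal number with an
 * optional K or M suffix. Assumes binary multiples. A result of 0
 * stands for "no limit, use default IO priority".
 */
static inline uint64_t parse_limit(const char *s)
{
	char *end;
	uint64_t value = strtoull(s, &end, 10);

	switch (*end) {
	case 'K':
	case 'k':
		value <<= 10;
		break;
	case 'M':
	case 'm':
		value <<= 20;
		break;
	default:
		break;
	}
	return value;
}
```

Under these assumptions, writing "100M" to the scrub_speed_max file would correspond to a raw value of 104857600 bytes.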
-
- 21 Apr, 2021 1 commit
-
-
Committed by Filipe Manana
When doing a device replace on a zoned filesystem, if we find a block group with ->to_copy == 0, we jump to the label 'done', which will result in later calling btrfs_unfreeze_block_group(), even though at this point we never called btrfs_freeze_block_group().

Since at this point we have neither turned the block group to RO mode nor made any progress, we don't need to jump to the label 'done'. So fix this by jumping instead to the label 'skip' and dropping our reference on the block group before the jump.

Fixes: 78ce9fc2 ("btrfs: zoned: mark block groups to copy for device-replace")
CC: stable@vger.kernel.org # 5.12
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
- 19 Apr, 2021 1 commit
-
-
Committed by Anand Jain
Drop function declarations at the beginning of the file scrub.c. These functions are defined before they are used in the same file and don't need forward declarations.

No functional changes.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
- 11 Mar, 2021 1 commit
-
-
Committed by Christoph Hellwig
Ever since the addition of multipage bio_vecs, BIO_MAX_PAGES has been horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop confusing users of the bio API.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 23 Feb, 2021 1 commit
-
-
Committed by Filipe Manana
When we activate a swap file, at btrfs_swap_activate(), we acquire the exclusive operation lock to prevent the physical location of the swap file extents from being changed by operations such as balance and device replace/resize/remove. We also call can_nocow_extent() there which, among other things, checks if the block group of a swap file extent is currently RO, and if it is we can not use the extent, since a write into it would result in COWing the extent.

However, we have no protection against a scrub operation running after we activate the swap file, which can result in the swap file extents being COWed while the scrub is running and operating on the respective block group, because scrub turns a block group into RO before it processes it and then back again to RW mode after processing it. That means an attempt to write into a swap file extent while scrub is processing the respective block group will result in COWing the extent, changing its physical location on disk.

Fix this by making sure that block groups that have extents that are used by active swap files can not be turned into RO mode, therefore making it not possible for a scrub to turn them into RO mode. When a scrub finds a block group that can not be turned to RO due to the existence of extents used by swap files, it proceeds to the next block group and logs a warning message that mentions the block group was skipped due to active swap files; this is the same approach we currently use for balance.

Fixes: ed46ff3d ("Btrfs: support swap files")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
- 09 Feb, 2021 4 commits
-
-
Committed by Naohiro Aota
When a bad checksum is found and the filesystem has a mirror of the damaged data, we read the correct data from the mirror and write it to the damaged blocks. This, however, violates the sequential write constraints of a zoned block device.

We can consider three methods to repair an IO failure in zoned filesystems:

  (1) Reset and rewrite the damaged zone
  (2) Allocate a new device extent and replace the damaged device extent with the new extent
  (3) Relocate the corresponding block group

Method (1) is most similar to the behavior with regular devices. However, it also wipes non-damaged data in the same device extent, and so it unnecessarily degrades non-damaged data.

Method (2) is much like device replacing but done in the same device. It is safe because it keeps the device extent until the replacing finishes. However, extending device replacing is non-trivial. It assumes "src_dev->physical == dst_dev->physical". Also, the extent mapping replacing function would have to be extended to support replacing a device extent position within one device.

Method (3) invokes relocation of the damaged block group and is straightforward to implement. It relocates all the mirrored device extents, so it is potentially a more costly operation than method (1) or (2). But it relocates only used extents, which reduces the total IO size.

Let's apply method (3) for now. In the future, we can extend device-replace and apply method (2).

To protect a block group from being relocated multiple times by multiple IO errors, this commit introduces the "relocating_repair" bit to show it's now relocating to repair IO failures. It also uses a new kthread, "btrfs-relocating-repair", so as not to block the IO path with the relocating process.

This commit also supports repairing in the scrub process.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
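The "only the first error starts a relocation" idea can be modelled with a test-and-set flag. This sketch uses C11 atomics and hypothetical names (`struct block_group`, `start_repair_relocation()`); the kernel's own synchronization around the relocating_repair bit is not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, reduced block group with only the repair marker. */
struct block_group {
	atomic_flag relocating_repair;
};

static inline void block_group_init(struct block_group *bg)
{
	atomic_flag_clear(&bg->relocating_repair);
}

/*
 * Returns true only for the first caller. Later IO errors on the
 * same block group see the flag already set and do not queue a
 * second relocation.
 */
static inline bool start_repair_relocation(struct block_group *bg)
{
	return !atomic_flag_test_and_set(&bg->relocating_repair);
}
```

A caller that gets `true` would hand the block group to the background relocation thread; one that gets `false` can simply return.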
-
Committed by Naohiro Aota
This is the 4/4 patch to implement device-replace on zoned filesystems.

Even after the copying is done, the write pointers of the source device and the destination device may not be synchronized. For example, when the last allocated extent is freed before the device-replace process, the extent is not copied, leaving a hole there.

Synchronize the write pointers by writing zeroes to the destination device.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Naohiro Aota
This is the 3/4 patch to implement device-replace on zoned filesystems.

This commit implements copying. To do this, it tracks the write pointer during the device replace process. As device-replace's copy process is smart enough to only copy used extents on the source device, we have to fill the gap to honor the sequential write requirement in the target device.

The device-replace process on zoned filesystems must copy or clone all the extents in the source device exactly once. So, we need to ensure that allocations started just before the dev-replace process have their corresponding extent information in the B-trees. finish_extent_writes_for_zoned() implements that functionality, which basically is the code removed in commit 042528f8 ("Btrfs: fix block group remaining RO forever after error during device replace").

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Naohiro Aota
This is the 1/4 patch to support device-replace on zoned filesystems.

We have two types of IOs during the device replace process. One is an IO to "copy" (by the scrub functions) all the device extents from the source device to the destination device. The other one is an IO to "clone" (by handle_ops_on_dev_replace()) new incoming write IOs from users to the source device into the target device.

Cloning incoming IOs can break the sequential write rule on the target device. When a write is mapped in the middle of a block group, the IO is directed to the middle of a target device zone, which breaks the sequential write requirement.

However, the cloning function cannot be disabled, since incoming IOs targeting already copied device extents must be cloned so that the IO is executed on the target device.

We cannot use dev_replace->cursor_{left,right} to determine whether a bio is going to a not yet copied region. Since we have a time gap between finishing btrfs_scrub_dev() and rewriting the mapping tree in btrfs_dev_replace_finishing(), we can have a newly allocated device extent which is never cloned nor copied.

So the point is to copy only already existing device extents. This patch introduces mark_block_group_to_copy() to mark existing block groups as a target of copying. Then, handle_ops_on_dev_replace() and dev-replace can check the flag to do their job.

Also, btrfs_finish_block_group_to_copy() will check if the copied stripe is the last stripe in the block group. With the last stripe copied, the to_copy flag is finally disabled. Afterwards we can safely clone incoming IOs on this block group.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
- 25 Jan, 2021 1 commit
-
-
Committed by Christoph Hellwig
Replace the gendisk pointer in struct bio with a pointer to the newly improved struct block device. From that the gendisk can be trivially accessed with an extra indirection, but it also allows directly looking up all information related to partition remapping.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 10 Dec, 2020 6 commits
-
-
Committed by Qu Wenruo
Since btrfs scrub is utilizing its own infrastructure to submit reads and writes, scrub is independent from all other routines.

This brings one very neat feature: it allows us to read 4K of data into offset 0 of a 64K page. The same goes for the writeback routine.

This makes scrub on subpage sector size much easier to implement, and thanks to previous commits which just changed the implementation to always do scrub based on sector size, scrub can now handle subpage filesystems without any problem.

This patch just removes the restriction on (sectorsize != PAGE_SIZE), to make scrub finally work on subpage filesystems.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Qu Wenruo
Btrfs scrub is more flexible than the buffered data write path, as we can read unaligned subpage data into page offset 0.

This ability makes subpage support much easier: we just need to check each scrub_page::page_len and ensure we only calculate the hash for [0, page_len) of a page.

There is a small thing to notice: for the subpage case, we still do sector by sector scrub. This means we will submit a read bio for each sector to scrub, resulting in the same number of read bios as on 4K page systems.

This behavior can be considered a good thing, if we want everything to be the same as on 4K page systems. But this also means we're wasting the possibility of submitting larger bios using the 64K page size. This is another problem to consider in the future.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Qu Wenruo
To support subpage tree block scrub, scrub_checksum_tree_block() only needs to learn 2 new tricks:

  - Follow sector size
    Now that scrub_page only represents one sector, we need to follow it properly.

  - Run checksum on all sectors
    Since scrub_page only represents one sector, we need to run checksum on all sectors, not only (nodesize >> PAGE_SIZE).

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
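A simplified sketch of "run checksum on all sectors": the loop count comes from nodesize divided by sectorsize, and every sector buffer is fed into one running checksum. The byte-sum checksum and the function names are stand-ins chosen for illustration, and the sketch ignores that the real code skips the stored checksum area at the start of the block.

```c
#include <assert.h>
#include <stdint.h>

/* Toy running checksum (plain byte sum), standing in for the real hash. */
static inline uint32_t toy_csum_update(uint32_t acc, const uint8_t *buf,
				       uint32_t len)
{
	for (uint32_t i = 0; i < len; i++)
		acc += buf[i];
	return acc;
}

/*
 * Checksum a tree block whose data is spread over one buffer per
 * sector. The number of buffers depends on sectorsize, not on the
 * page size.
 */
static inline uint32_t csum_tree_block(const uint8_t **sectors,
				       uint32_t nodesize, uint32_t sectorsize)
{
	const uint32_t num_sectors = nodesize / sectorsize;
	uint32_t acc = 0;

	for (uint32_t i = 0; i < num_sectors; i++)
		acc = toy_csum_update(acc, sectors[i], sectorsize);
	return acc;
}
```

For a 16K node with 4K sectors this walks four buffers, regardless of whether the page size is 4K or 64K.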
-
Committed by Qu Wenruo
For scrub_pages() and scrub_pages_for_parity(), we currently allocate one scrub_page structure for one page.

This is fine if we only read/write one sector at a time. But for cases like scrubbing RAID56, we need to read/write the full stripe, which is 64K in size for now.

For subpage size, we will submit the read in just one page, which is normally a good thing, but in the RAID56 case, the endio function only expects to see one sector, not the full stripe. This could lead to a wrong parity checksum for RAID56 on subpage.

To make the existing code work well for the subpage case, here we take a shortcut by always allocating a full page for one sector.

This should provide the base to make RAID56 work for the subpage case.

The cost is pretty obvious: for one RAID56 stripe we now always need 16 pages. For the subpage situation (64K page size, 4K sector size), this means we need a full megabyte to scrub just one RAID56 stripe.

And for data scrub, each 4K sector will also need one 64K page.

This is mostly just a workaround. The proper fix for this is a much larger project: using scrub_block to replace scrub_page, and allowing scrub_block to handle multiple pages, csums, and csum_bitmap to avoid allocating one page for each sector.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
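The memory cost quoted above follows from simple arithmetic, reproduced in this sketch. `scrub_stripe_memory()` is a hypothetical helper used only to show the calculation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Memory needed to scrub one stripe when every sector gets its own
 * full page: (stripe_len / sectorsize) pages of page_size bytes each.
 */
static inline uint64_t scrub_stripe_memory(uint32_t stripe_len,
					   uint32_t sectorsize,
					   uint32_t page_size)
{
	return (uint64_t)(stripe_len / sectorsize) * page_size;
}
```

With a 64K stripe and 4K sectors there are 16 sectors. On a 64K-page system that is 16 * 64K = 1MiB per stripe, while on a 4K-page system the same stripe needs only 64K.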
-
Committed by Qu Wenruo
The btrfs on-disk format chose to use u64 for almost everything, but there are other restrictions that won't let us use more than u32 for things like extent length (the maximum length is 128MiB for non-hole extents), or stripe length (we have a device number limit).

This means that if we don't have extra handling to convert u64 to u32, we will always have some questionable operations like "u32 = u64 >> sectorsize_bits" in the code.

This patch tries to address the problem by reducing the width of the following members/parameters:

  - scrub_parity::stripe_len
  - @len of scrub_pages()
  - @extent_len of scrub_remap_extent()
  - @len of scrub_parity_mark_sectors_error()
  - @len of scrub_parity_mark_sectors_data()
  - @len of scrub_extent()
  - @len of scrub_pages_for_parity()
  - @len of scrub_extent_for_parity()

Members extracted from on-disk structures, like map->stripe_len, will be kept as is, since that modification would require an on-disk format change.

There will be cases like "u32 = u64 - u64" or "u32 = u64"; for such call sites, an extra ASSERT() is added to be extra safe for debug builds.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
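A checked narrowing of this kind could look like the sketch below, with the standard `assert()` standing in for the kernel's ASSERT(). The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Narrow a 64-bit length, asserting that it really fits in 32 bits. */
static inline uint32_t len_to_u32(uint64_t len)
{
	assert(len <= UINT32_MAX);
	return (uint32_t)len;
}

/* The "u32 = u64 - u64" case: length of the range [start, end). */
static inline uint32_t range_len_u32(uint64_t start, uint64_t end)
{
	assert(end >= start);
	return len_to_u32(end - start);
}
```

Because the check compiles away when assertions are disabled, it costs nothing in production builds while still catching a wrong assumption during development.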
-
Committed by Naohiro Aota
The superblock (and its copies) is the only data structure in btrfs which has a fixed location on a device. Since we cannot overwrite in a sequential write required zone, we cannot place the superblock in such a zone. One easy solution is limiting the superblock and its copies to be placed only in conventional zones. However, this method has two downsides: one is a reduced number of superblock copies. The location of the second copy of the superblock is 256GB, which is in a sequential write required zone on typical devices on the market today. So, the number of superblocks and copies is limited to two. The second downside is that we cannot support devices which have no conventional zones at all.

To solve these two problems, we employ superblock log writing. It uses two adjacent zones as a circular buffer to write updated superblocks. Once the first zone is filled up, it starts writing into the second one. Then, when both zones are filled up and before starting to write to the first zone again, it resets the first zone.

We can determine the position of the latest superblock by reading write pointer information from a device. One corner case is when both zones are full. For this situation, we read out the last superblock of each zone, and compare them to determine which zone is older.

The following zones are reserved as the circular buffer on ZONED btrfs:

  - The primary superblock: zones 0 and 1
  - The first copy: zones 16 and 17
  - The second copy: zones 1024 or zone at 256GB which is minimum, and next to it

If these reserved zones are conventional, the superblock is written fixed at the start of the zone without logging.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
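The two-zone circular log can be modelled as a small decision function. This is a simplified model of the description above, not the kernel implementation: the zone states, the `last_generation` field used to decide which zone is older, and `sb_log_next_zone()` are all assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum zone_state { ZONE_EMPTY, ZONE_PARTIAL, ZONE_FULL };

/* Hypothetical view of one log zone. */
struct sb_zone {
	enum zone_state state;
	uint64_t last_generation; /* generation of the last superblock in it */
};

/*
 * Pick the zone (0 or 1) that receives the next superblock. *reset is
 * set when the chosen zone must be reset before it can be written.
 */
static inline int sb_log_next_zone(const struct sb_zone z[2], bool *reset)
{
	*reset = false;

	/* Corner case: both full. Reuse the zone holding the older data. */
	if (z[0].state == ZONE_FULL && z[1].state == ZONE_FULL) {
		*reset = true;
		return z[0].last_generation < z[1].last_generation ? 0 : 1;
	}

	/* Keep appending to zone 0 unless zone 1 is the one in progress. */
	if (z[0].state != ZONE_FULL && z[1].state != ZONE_PARTIAL)
		return 0;

	return 1;
}
```

The newest superblock always sits just before the write position in the zone being filled, which is why reading the write pointers is enough to locate it in every case except the both-full one.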
-
- 08 Dec, 2020 9 commits
-
-
Committed by Qu Wenruo
That anonymous structure serves no special purpose, just replace it with regular members.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Anand Jain
Commit 343694eee8d8 ("btrfs: switch seed device to list api") missed checking whether the parameter seed is true in the function btrfs_find_device(). This parameter tells it whether to traverse the seed device list or not.

After this commit, the argument is unused and can be removed.

In device_list_add() it's not necessary because fs_devices always points to the device's fs_devices. So with the devid+uuid matching, it will find the right device and return, thus not needing to traverse seed devices.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Qu Wenruo
Function scrub_find_csum() locates the csum for bytenr @logical from sctx->csum_list.

However, it lacks comments explaining things like how the csum_list is organized and why we need to drop the csum range which is before us.

Refactor the function by:

  - Adding more comments explaining the behavior
  - Adding a comment explaining why we need to drop the csum range
  - Putting the csum copy in the main loop

This is mostly for the incoming patches to make scrub_find_csum() able to find multiple checksums.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Qu Wenruo
The @force parameter for scrub_pages() indicates whether we want to force bio submission. Currently it's only used for the super block, and it can be easily determined from the @flags, so we can remove the parameter.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Qu Wenruo
There are several call sites where we declare something like "struct scrub_page *page". This is confusing as we also use regular pages in this code, so rename it to 'spage' where applicable.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by David Sterba
The context structure unnecessarily stores a copy of the checksum size, which can now be easily obtained from fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by David Sterba
btrfs_get_16 shows up in the system performance profiles (it is the helper to read 16bit values from on-disk structures). This is partially because of the checksum size that's frequently read along with data reads/writes; other u16 uses are from item size or directory entries.

Replace all calls to btrfs_super_csum_size by the cached value from fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by David Sterba
We do a lot of calculations where we divide or multiply by sectorsize. We also know and make sure that sectorsize is a power of two, so this means all divisions can be turned into shifts, avoiding eg. expensive u64/u32 divisions.

The type is u32 as it's more register friendly on x86_64 compared to u8 and the resulting assembly is smaller (movzbl vs movl).

There's also superblock s_blocksize_bits but it's usually one more pointer dereference farther than fs_info.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
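The division-to-shift replacement relies on sectorsize being a power of two, as the following sketch shows. Both helpers are hypothetical; the idea is that the bit count is computed once and cached, then reused at every call site.

```c
#include <assert.h>
#include <stdint.h>

/* Compute log2(sectorsize) once; sectorsize must be a power of two. */
static inline uint32_t sectorsize_to_bits(uint32_t sectorsize)
{
	uint32_t bits = 0;

	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);
	while ((1U << bits) < sectorsize)
		bits++;
	return bits;
}

/* bytes / sectorsize expressed as a shift by the cached bit count. */
static inline uint64_t bytes_to_sectors(uint64_t bytes, uint32_t bits)
{
	return bytes >> bits;
}
```

For a 4K sectorsize the cached value is 12, and shifting a byte count right by 12 gives the same result as dividing it by 4096.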
-
Committed by Filipe Manana
When scrubbing a stripe of a block group we always start readahead for the checksums btree and wait for it to complete. However, when the block group is not a data block group (or a mixed block group) it is a waste of time to do it, since there are no checksums for metadata extents in that btree.

So skip that when the block group does not have the data flag set, saving some time doing memory allocations, queueing a job in the readahead work queue, waiting for it to complete and potentially avoiding some IO as well (when csum tree extents are not in memory already).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
- 05 Nov, 2020 1 commit
-
-
Committed by David Sterba
Based on user feedback, update the message printed when scrub fails to start due to write requirements. To make a distinction, add a device id to the messages.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-