- 14 3月, 2022 6 次提交
-
-
由 Josef Bacik 提交于
The submit helper will always run bio_endio() on the bio if it fails to submit, so cleaning up the bio just leads to a variety of use-after-free and NULL pointer dereference bugs because we race with the endio function that is cleaning up the bio. Instead just return BLK_STS_OK as the repair function has to continue to process the rest of the pages, and the endio for the repair bio will do the appropriate cleanup for the page that it was given. Reviewed-by: NBoris Burkov <boris@bur.io> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
If we fail to submit a bio for whatever reason, we may not have setup a mirror_num for that bio. This means we shouldn't try to do the repair workflow, if we do we'll hit an BUG_ON(!failrec->this_mirror) in clean_io_failure. Instead simply skip the repair workflow if we have no mirror set, and add an assert to btrfs_check_repairable() to make it easier to catch what is happening in the future. Reviewed-by: NBoris Burkov <boris@bur.io> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
After commit 92082d40 ("btrfs: integrate page status update for data read path into begin/end_page_read"), the 'nr' counter at btrfs_do_readpage() is no longer used, we increment it but we never read from it. So just remove it. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
At btrfs_do_readpage(), if we get an error when trying to lookup for an extent map, we end up marking the page with the error bit, clearing the uptodate bit on it, and doing everything else that should be done. However we return success (0) to the caller, when we should return the error encoded in the extent map pointer. So fix that by returning the error encoded in the pointer. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Filipe Manana 提交于
At extent_io.c, in the read page and write page code paths, we are testing if the return value from btrfs_get_extent() can be NULL. However that is not possible, as btrfs_get_extent() always returns either an error pointer or a (non-NULL) pointer to an extent map structure. Everywhere else outside extent_io.c we never check for NULL, we always treat any returned value as a non-NULL pointer if it does not encode an error. So check only for the IS_ERR() case at extent_io.c. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Johannes Thumshirn 提交于
In get_extent_skip_holes() we're checking the return of btrfs_get_extent_fiemap() for an error pointer or NULL, but btrfs_get_extent_fiemap() will never return NULL, only error pointers or a valid extent_map. The other caller of btrfs_get_extent_fiemap(), find_desired_extent(), correctly only checks for error-pointers. Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 02 3月, 2022 1 次提交
-
-
由 Josef Bacik 提交于
Whenever we do any extent buffer operations we call assert_eb_page_uptodate() to complain loudly if we're operating on an non-uptodate page. Our overnight tests caught this warning earlier this week WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50 CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G W 5.17.0-rc3+ #564 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Workqueue: btrfs-cache btrfs_work_helper RIP: 0010:assert_eb_page_uptodate+0x3f/0x50 RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246 RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000 RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0 RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000 R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1 R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000 FS: 0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0 Call Trace: extent_buffer_test_bit+0x3f/0x70 free_space_test_bit+0xa6/0xc0 load_free_space_tree+0x1f6/0x470 caching_thread+0x454/0x630 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? lock_release+0x1f0/0x2d0 btrfs_work_helper+0xf2/0x3e0 ? lock_release+0x1f0/0x2d0 ? finish_task_switch.isra.0+0xf9/0x3a0 process_one_work+0x26d/0x580 ? process_one_work+0x580/0x580 worker_thread+0x55/0x3b0 ? process_one_work+0x580/0x580 kthread+0xf0/0x120 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 This was partially fixed by c2e39305 ("btrfs: clear extent buffer uptodate when we fail to write it"), however all that fix did was keep us from finding extent buffers after a failed writeout. It didn't keep us from continuing to use a buffer that we already had found. In this case we're searching the commit root to cache the block group, so we can start committing the transaction and switch the commit root and then start writing. After the switch we can look up an extent buffer that hasn't been written yet and start processing that block group. Then we fail to write that block out and clear Uptodate on the page, and then we start spewing these errors. Normally we're protected by the tree lock to a certain degree here. If we read a block we have that block read locked, and we block the writer from locking the block before we submit it for the write. However this isn't necessarily fool proof because the read could happen before we do the submit_bio and after we locked and unlocked the extent buffer. Also in this particular case we have path->skip_locking set, so that won't save us here. We'll simply get a block that was valid when we read it, but became invalid while we were using it. What we really want is to catch the case where we've "read" a block but it's not marked Uptodate. On read we ClearPageError(), so if we're !Uptodate and !Error we know we didn't do the right thing for reading the page. Fix this by checking !Uptodate && !Error, this way we will not complain if our buffer gets invalidated while we're using it, and we'll maintain the spirit of the check which is to make sure we have a fully in-cache block while we're messing with it. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 22 1月, 2022 1 次提交
-
-
由 Christoph Hellwig 提交于
Patch series "remove Xen tmem leftovers". Since the removal of the Xen tmem driver in 2019, the cleancache hooks are entirely unused, as are large parts of frontswap. This series against linux-next (with the folio changes included) removes cleancaches, and cuts down frontswap to the bits actually used by zswap. This patch (of 13): The cleancache subsystem is unused since the removal of Xen tmem driver in commit 814bbf49 ("xen: remove tmem driver"). [akpm@linux-foundation.org: remove now-unreachable code] Link: https://lkml.kernel.org/r/20211224062246.1258487-1-hch@lst.de Link: https://lkml.kernel.org/r/20211224062246.1258487-2-hch@lst.deSigned-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJuergen Gross <jgross@suse.com> Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org> Cc: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 1月, 2022 5 次提交
-
-
由 Yang Li 提交于
The warnings were found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/btrfs/extent_io.c:3210: warning: Function parameter or member 'bio_ctrl' not described in 'btrfs_bio_add_page' fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'bio' description in 'btrfs_bio_add_page' fs/btrfs/extent_io.c:3210: warning: Excess function parameter 'prev_bio_flags' description in 'btrfs_bio_add_page' fs/btrfs/space-info.c:1602: warning: Excess function parameter 'root' description in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1602: warning: Function parameter or member 'fs_info' not described in 'btrfs_reserve_metadata_bytes' Note: this is fixing only the warnings regarding parameter list, the first line is not strictly conforming to the kdoc format as the btrfs codebase does not stick to that and keeps the first line more free form (because it's only for internal use). Reported-by: NAbaci Robot <abaci@linux.alibaba.com> Signed-off-by: NYang Li <yang.lee@linux.alibaba.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> [ add note ] Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Currently there is only one user for btrfs metadata readahead, and that's scrub. But even for the single user, it's not providing the correct functionality it needs, as scrub needs reada for commit root, which current readahead can't provide. (Although it's pretty easy to add such feature). Despite this, there are some extra problems related to metadata readahead: - Duplicated feature with btrfs_path::reada - Partly duplicated feature of btrfs_fs_info::buffer_radix Btrfs already caches its metadata in buffer_radix, while readahead tries to read the tree block no matter if it's already cached. - Poor layer separation Metadata readahead works kinda at device level. This is definitely not the correct layer it should be, since metadata is at btrfs logical address space, it should not bother device at all. This brings extra chance for bugs to sneak in, while brings unnecessary complexity. - Dead code In the very beginning of scrub.c we have #undef DEBUG, rendering all the debug related code useless and unable to test. Thus here I purpose to remove the metadata readahead mechanism completely. [BENCHMARK] There is a full benchmark for the scrub performance difference using the old btrfs_reada_add() and btrfs_path::reada. For the worst case (no dirty metadata, slow HDD), there could be a 5% performance drop for scrub. For other cases (even SATA SSD), there is no distinguishable performance difference. The number is reported scrub speed, in MiB/s. The resolution is limited by the reported duration, which only has a resolution of 1 second. Old New Diff SSD 455.3 466.332 +2.42% HDD 103.927 98.012 -5.69% Comprehensive test methodology is in the cover letter of the patch. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Johannes Thumshirn 提交于
REQ_OP_ZONE_APPEND can only work on zoned devices, so it is redundant to check if the filesystem is zoned when REQ_OP_ZONE_APPEND is set as the bio's bio_op. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Johannes Thumshirn 提交于
Sink zone check into btrfs_repair_one_zone() so we don't need to do it in all callers. Also as btrfs_repair_one_zone() doesn't return a sensible error, make it a boolean function and return false in case it got called on a non-zoned filesystem and true on a zoned filesystem. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Johannes Thumshirn 提交于
Encapsulate the inode lock needed for serializing the data relocation writes on a zoned filesystem into a helper. This streamlines the code reading flow and hides special casing for zoned filesystems. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 03 1月, 2022 1 次提交
-
-
由 Qu Wenruo 提交于
We use @nr_written to record how many pages have been started by btrfs_run_delalloc_range(). Currently there are only two cases that would populate @nr_written: - Inline extent creation - Compressed write But both cases will also set @page_started to one. In fact, in writepage_delalloc() we have the following code, showing that @nr_written is really only utilized for above two cases: /* did the fill delalloc function already unlock and start * the IO? */ if (page_started) { /* * we've unlocked the page, so we can't update * the mapping's writeback index, just update * nr_to_write. */ wbc->nr_to_write -= nr_written; return 1; } But for such cases, writepage_delalloc() will return 1, and exit __extent_writepage() without going through __extent_writepage_io(). Thus this means, inside __extent_writepage_io(), we always get @nr_written as 0. So this patch is going to remove the unnecessary parameter from the following functions: - writepage_delalloc() As @nr_written passed in is always the initial value 0. Although inside that function, we still need a local @nr_written to update wbc->nr_to_write. - __extent_writepage_io() As explained above, @nr_written passed in can only be 0. This also means we can remove one update_nr_written() call. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 16 12月, 2021 1 次提交
-
-
由 Josef Bacik 提交于
Filipe reported a hang when we have errors on btrfs. This turned out to be a side-effect of my fix c2e39305 ("btrfs: clear extent buffer uptodate when we fail to write it") which made it so we clear EXTENT_BUFFER_UPTODATE on an eb when we fail to write it out. Below is a paste of Filipe's analysis he got from using drgn to debug the hang """ btree readahead code calls read_extent_buffer_pages(), sets ->io_pages to a value while writeback of all pages has not yet completed: --> writeback for the first 3 pages finishes, we clear EXTENT_BUFFER_UPTODATE from eb on the first page when we get an error. --> at this point eb->io_pages is 1 and we cleared Uptodate bit from the first 3 pages --> read_extent_buffer_pages() does not see EXTENT_BUFFER_UPTODATE() so it continues, it's able to lock the pages since we obviously don't hold the pages locked during writeback --> read_extent_buffer_pages() then computes 'num_reads' as 3, and sets eb->io_pages to 3, since only the first page does not have Uptodate bit set at this point --> writeback for the remaining page completes, we ended decrementing eb->io_pages by 1, resulting in eb->io_pages == 2, and therefore never calling end_extent_buffer_writeback(), so EXTENT_BUFFER_WRITEBACK remains in the eb's flags --> of course, when the read bio completes, it doesn't and shouldn't call end_extent_buffer_writeback() --> we should clear EXTENT_BUFFER_UPTODATE only after all pages of the eb finished writeback? or maybe make the read pages code wait for writeback of all pages of the eb to complete before checking which pages need to be read, touch ->io_pages, submit read bio, etc writeback bit never cleared means we can hang when aborting a transaction, at: btrfs_cleanup_one_transaction() btrfs_destroy_marked_extents() wait_on_extent_buffer_writeback() """ This is a problem because our writes are not synchronized with reads in any way. We clear the UPTODATE flag and then we can easily come in and try to read the EB while we're still waiting on other bio's to complete. We have two options here, we could lock all the pages, and then check to see if eb->io_pages != 0 to know if we've already got an outstanding write on the eb. Or we can simply check to see if we have WRITE_ERR set on this extent buffer. We set this bit _before_ we clear UPTODATE, so if the read gets triggered because we aren't UPTODATE because of a write error we're guaranteed to have WRITE_ERR set, and in this case we can simply return -EIO. This will fix the reported hang. Reported-by: NFilipe Manana <fdmanana@suse.com> Fixes: c2e39305 ("btrfs: clear extent buffer uptodate when we fail to write it") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 08 12月, 2021 2 次提交
-
-
由 Josef Bacik 提交于
generic/484 fails sometimes with compression on because the write ends up small enough that it goes into the btree. This means that we never call mapping_set_error() on the inode itself, because the page gets marked as fine when we inline it into the metadata. When the metadata writeback happens we see it and abort the transaction properly and mark the fs as readonly, however we don't do the mapping_set_error() on anything. In syncfs() we will simply return 0 if the sb is marked read-only, so we can't check for this in our syncfs callback. The only way the error gets returned if we called mapping_set_error() on something. Fix this by calling mapping_set_error() on the btree inode mapping. This allows us to properly return an error on syncfs and pass generic/484 with compression on. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
I got dmesg errors on generic/281 on our overnight fstests. Looking at the history this happens occasionally, with errors like this WARNING: CPU: 0 PID: 673217 at fs/btrfs/extent_io.c:6848 assert_eb_page_uptodate+0x3f/0x50 CPU: 0 PID: 673217 Comm: kworker/u4:13 Tainted: G W 5.16.0-rc2+ #469 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Workqueue: btrfs-cache btrfs_work_helper RIP: 0010:assert_eb_page_uptodate+0x3f/0x50 RSP: 0018:ffffae598230bc60 EFLAGS: 00010246 RAX: 0017ffffc0002112 RBX: ffffebaec4100900 RCX: 0000000000001000 RDX: ffffebaec45733c7 RSI: ffffebaec4100900 RDI: ffff9fd98919f340 RBP: 0000000000000d56 R08: ffff9fd98e300000 R09: 0000000000000000 R10: 0001207370a91c50 R11: 0000000000000000 R12: 00000000000007b0 R13: ffff9fd98919f340 R14: 0000000001500000 R15: 0000000001cb0000 FS: 0000000000000000(0000) GS:ffff9fd9fbc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f549fcf8940 CR3: 0000000114908004 CR4: 0000000000370ef0 Call Trace: extent_buffer_test_bit+0x3f/0x70 free_space_test_bit+0xa6/0xc0 load_free_space_tree+0x1d6/0x430 caching_thread+0x454/0x630 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? lock_release+0x1f0/0x2d0 btrfs_work_helper+0xf2/0x3e0 ? lock_release+0x1f0/0x2d0 ? finish_task_switch.isra.0+0xf9/0x3a0 process_one_work+0x270/0x5a0 worker_thread+0x55/0x3c0 ? process_one_work+0x5a0/0x5a0 kthread+0x174/0x1a0 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 This happens because we're trying to read from a extent buffer page that is !PageUptodate. This happens because we will clear the page uptodate when we have an IO error, but we don't clear the extent buffer uptodate. If we do a read later and find this extent buffer we'll think its valid and not return an error, and then trip over this warning. Fix this by also clearing uptodate on the extent buffer when this happens, so that we get an error when we do a btrfs_search_slot() and find this block later. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 27 10月, 2021 16 次提交
-
-
由 Qu Wenruo 提交于
The member btrfs_bio::logical is only initialized by two call sites: - btrfs_repair_one_sector() No corresponding site to utilize it. - btrfs_submit_direct() The corresponding site to utilize it is btrfs_check_read_dio_bio(). However for btrfs_check_read_dio_bio(), we can grab the file_offset from btrfs_dio_private::file_offset directly. Thus it turns out we don't really need that btrfs_bio::logical member at all. For btrfs_bio, the logical bytenr can be fetched from its bio->bi_iter.bi_sector directly. So let's just remove the member to save 8 bytes for structure btrfs_bio. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
We have a few flags that are inconsistently used to describe the fs in different states of failure. As of 5963ffca ("btrfs: always abort the transaction if we abort a trans handle") we will always set BTRFS_FS_STATE_ERROR if we abort, so we don't have to check both ABORTED and ERROR to see if things have gone wrong. Add a helper to check BTRFS_FS_STATE_ERROR and then convert all checkers of FS_STATE_ERROR to use the helper. The TRANS_ABORTED bit check was added in af722733 ("Btrfs: clean up resources during umount after trans is aborted") but is not actually specific. Reviewed-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BUG] With experimental subpage compression enabled, a simple fsstress can lead to self deadlock on page 720896: mkfs.btrfs -f -s 4k $dev > /dev/null mount $dev -o compress $mnt $fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156 [CAUSE] If we have a file layout looks like below: 0 32K 64K 96K 128K |//| |///////////////| 4K Then we run delalloc range for the inode, it will: - Call find_lock_delalloc_range() with @delalloc_start = 0 Then we got a delalloc range [0, 4K). This range will be COWed. - Call find_lock_delalloc_range() again with @delalloc_start = 4K Since find_lock_delalloc_range() never cares whether the range is still inside page range [0, 64K), it will return range [64K, 128K). This range meets the condition for subpage compression, will go through async COW path. And async COW path will return @page_started. But that @page_started is now for range [64K, 128K), not for range [0, 64K). - writepage_dellloc() returned 1 for page [0, 64K) Thus page [0, 64K) will not be unlocked, nor its page dirty status will be cleared. Next time when we try to lock page [0, 64K) we will deadlock, as there is no one to release page [0, 64K). This problem will never happen for regular page size as one page only contains one sector. After the first find_lock_delalloc_range() call, the @delalloc_end will go beyond @page_end no matter if we found a delalloc range or not Thus this bug only happens for subpage, as now we need multiple runs to exhaust the delalloc range of a page. [FIX] Fix the problem by ensuring the delalloc range we ran at least started inside @locked_page. So that we will never get incorrect @page_started. And to prevent such problem from happening again: - Make find_lock_delalloc_range() return false if the found range is beyond @end value passed in. Since @end will be utilized now, add an ASSERT() to ensure we pass correct @end into find_lock_delalloc_range(). This also means, for selftests we needs to populate @end before calling find_lock_delalloc_range(). - New ASSERT() in find_lock_delalloc_range() Now we will make sure the @start/@end passed in at least covers part of the page. - New ASSERT() in run_delalloc_range() To make sure the range at least starts inside @locked page. - Use @delalloc_start as proper cursor, while @delalloc_end is always reset to @page_end. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Pages passed to __extent_writepage() are always locked, but they may be locked by different functions. There are two types of locked page for __extent_writepage(): - Page locked by plain lock_page() It should not have any subpage::writers count. Can be unlocked by unlock_page(). This is the most common locked page for __extent_writepage() called inside extent_write_cache_pages() or extent_write_full_page(). Rarer cases include the @locked_page from extent_write_locked_range(). - Page locked by lock_delalloc_pages() There is only one caller, all pages except @locked_page for extent_write_locked_range(). In this case, we have to call subpage helper to handle the case. So here we introduce a helper, btrfs_page_unlock_writer(), to allow __extent_writepage() to unlock different locked pages. And since for all other callers of __extent_writepage() their pages are ensured to be locked by lock_page(), also add an extra check for epd::extent_locked to unlock such pages directly. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
There are two sites are not subpage compatible yet for extent_write_locked_range(): - How @nr_pages are calculated For subpage we can have the following range with 64K page size: 0 32K 64K 96K 128K | |////|/////| | In that case, although 96K - 32K == 64K, thus it looks like one page is enough, but the range spans two pages, not one. Fix it by doing proper round_up() and round_down() to calculate @nr_pages. Also add some extra ASSERT()s to ensure the range passed in is already aligned. - How the page end is calculated Currently we just use cur + PAGE_SIZE - 1 to calculate the page end. Which can't handle the above range layout, and will trigger ASSERT() in btrfs_writepage_endio_finish_ordered(), as the range is no longer covered by the page range. Fix it by taking page end into consideration. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
There are several cleanups for extent_write_locked_range(), most of them are pure cleanups, but with some preparation for future subpage support. - Add a proper comment for which call sites are suitable Unlike regular synchronized extent write back, if async COW or zoned COW happens, we have all pages in the range still locked. Thus for those (only) two call sites, we need this function to submit page content into bios and submit them. - Remove @mode parameter All the existing two call sites pass WB_SYNC_ALL. No need for @mode parameter. - Better error handling Currently if we hit an error during the page iteration loop, we overwrite @ret, causing only the last error can be recorded. Here we add @found_error and @first_error variable to record if we hit any error, and the first error we hit. So the first error won't get lost. - Don't reuse @start as the cursor We reuse the parameter @start as the cursor to iterate the range, not a big problem, but since we're here, introduce a proper @cur as the cursor. - Remove impossible branch Since all pages are still locked after the ordered extent is inserted, there is no way that pages can get its dirty bit cleared. Remove the branch where page is not dirty and replace it with an ASSERT(). Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BUG] If we remove the subpage limitation in add_ra_bio_pages(), then read a compressed extent which has part of its range in next page, like the following inode layout: 0 32K 64K 96K 128K |<--------------|-------------->| Btrfs will trigger ASSERT() in endio function: assertion failed: atomic_read(&subpage->readers) >= nbits ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3431! Internal error: Oops - BUG: 0 [#1] SMP Workqueue: btrfs-endio btrfs_work_helper [btrfs] Call trace: assertfail.constprop.0+0x28/0x2c [btrfs] btrfs_subpage_end_reader+0x148/0x14c [btrfs] end_page_read+0x8c/0x100 [btrfs] end_bio_extent_readpage+0x320/0x6b0 [btrfs] bio_endio+0x15c/0x1dc end_workqueue_fn+0x44/0x64 [btrfs] btrfs_work_helper+0x74/0x250 [btrfs] process_one_work+0x1d4/0x47c worker_thread+0x180/0x400 kthread+0x11c/0x120 ret_from_fork+0x10/0x30 ---[ end trace c8b7b552d3bb408c ]--- [CAUSE] When we read the page range [0, 64K), we find it's a compressed extent, and we will try to add extra pages in add_ra_bio_pages() to avoid reading the same compressed extent. But when we add such page into the read bio, it doesn't follow the behavior of btrfs_do_readpage() to properly set subpage::readers. This means, for page [64K, 128K), its subpage::readers is still 0. And when endio is executed on both pages, since page [64K, 128K) has 0 subpage::readers, it triggers above ASSERT() [FIX] Function add_ra_bio_pages() is far from subpage compatible, it always assume PAGE_SIZE == sectorsize, thus when it skip to next range it always just skip PAGE_SIZE. Make it subpage compatible by: - Skip to next page properly when needed If we find there is already a page cache, we need to skip to next page. For that case, we shouldn't just skip PAGE_SIZE bytes, but use @pg_index to calculate the next bytenr and continue. - Only add the page range covered by current extent map We need to calculate which range is covered by current extent map and only add that part into the read bio. - Update subpage::readers before submitting the bio - Use proper cursor other than confusing @last_offset - Calculate the missed threshold based on sector size It's no longer using missed pages, as for 64K page size, we have at most 3 pages to skip. (If aligned only 2 pages) - Add ASSERT() to make sure our bytenr is always aligned - Add comment for the function Add a special note for subpage case, as the function won't really work well for subpage cases. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
In function __extent_writepage() we always pass page start to @delalloc_start for writepage_delalloc(). Thus we don't really need @delalloc_start parameter as we can extract it from @page. Remove @delalloc_start parameter and make __extent_writepage() to declare @page_start and @page_end as const. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Previously we had "struct btrfs_bio", which records IO context for mirrored IO and RAID56, and "strcut btrfs_io_bio", which records extra btrfs specific info for logical bytenr bio. With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now. The struct btrfs_bio changes meaning by this commit. There was a suggested name like btrfs_logical_bio but it's a bit long and we'd prefer to use a shorter name. This could be a concern for backports to older kernels where the different meaning could possibly cause confusion or bugs. Comparing the new and old structures, there's no overlap among the struct members so a build would break in case of incorrect backport. We haven't had many backports to bio code anyway so this is more of a theoretical cause of bugs and a matter of precaution but we'll need to keep the semantic change in mind. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The helper btrfs_bio_alloc() is almost the same as btrfs_io_bio_alloc(), except it's allocating using BIO_MAX_VECS as @nr_iovecs, and initializes bio->bi_iter.bi_sector. However the naming itself is not using "btrfs_io_bio" to indicate its parameter is "strcut btrfs_io_bio" and can be easily confused with "struct btrfs_bio". Considering assigned bio->bi_iter.bi_sector is such a simple work and there are already tons of call sites doing that manually, there is no need to do that in a helper. Remove btrfs_bio_alloc() helper, and enhance btrfs_io_bio_alloc() function to provide a fail-safe value for its @nr_iovecs. And then replace all btrfs_bio_alloc() callers with btrfs_io_bio_alloc(). Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
The structure btrfs_bio is used by two different sites: - bio->bi_private for mirror based profiles For those profiles (SINGLE/DUP/RAID1*/RAID10), this structures records how many mirrors are still pending, and save the original endio function of the bio. - RAID56 code In that case, RAID56 only utilize the stripes info, and no long uses that to trace the pending mirrors. So btrfs_bio is not always bind to a bio, and contains more info for IO context, thus renaming it will make the naming less confusing. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Johannes Thumshirn 提交于
Don't allow more than one process to add pages to a relocation inode on a zoned filesystem, otherwise we cannot guarantee the sequential write rule once we're filling preallocated extents on a zoned filesystem. Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Function repair_io_failure() is no longer used out of extent_io.c since commit 8b9b6f25 ("btrfs: scrub: cleanup the remaining nodatasum fixup code"), which removes the last external caller. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Anand Jain 提交于
In preparation to fix a bug in btrfs_show_devname(). Convert fs_devices::latest_bdev type from struct block_device to struct btrfs_device and, rename the member to fs_devices::latest_dev. So that btrfs_show_devname() can use fs_devices::latest_dev::name. Tested-by: NSu Yue <l@damenly.su> Signed-off-by: NAnand Jain <anand.jain@oracle.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Naohiro Aota 提交于
If we have written to the zone capacity, the device automatically deactivates the zone. Sync up block group side (the active BG list and zone_is_active flag) with it. We need to do it both on data BGs and metadata BGs. On data side, we add a hook to btrfs_finish_ordered_io(). On metadata side, we use end_extent_buffer_writeback(). To reduce excess lookup of a block group, we mark the last extent buffer in a block group with EXTENT_BUFFER_ZONE_FINISH flag. This cannot be done for data (ordered_extent), because the address may change due to REQ_OP_ZONE_APPEND. Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
Currently we use u16 bitmap to make 4k sectorsize work for 64K page size. But this u16 bitmap is not large enough to contain larger page size like 128K, nor is space efficient for 16K page size. To handle both cases, here we pack all subpage bitmaps into a larger bitmap, now btrfs_subpage::bitmaps[] will be the ultimate bitmap for subpage usage. Each sub-bitmap will has its start bit number recorded in btrfs_subpage_info::*_start, and its bitmap length will be recorded in btrfs_subpage_info::bitmap_nr_bits. All subpage bitmap operations will be converted from using direct u16 operations to bitmap operations, with above *_start calculated. For 64K page size with 4K sectorsize, this should not cause much difference. While for 16K page size, we will only need 1 unsigned long (u32) to store all the bitmaps, which saves quite some space. Furthermore, this allows us to support larger page size like 128K and 258K. Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 26 10月, 2021 2 次提交
-
-
由 Qu Wenruo 提交于
The existing calling convention of btrfs_alloc_subpage() is pretty awful. Change it to a more common pattern by returning struct btrfs_subpage directly and let the caller to determine if the call succeeded. Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
There are two call sites of btrfs_alloc_subpage(): - btrfs_attach_subpage() We have ensured sectorsize is smaller than PAGE_SIZE - alloc_extent_buffer() We call btrfs_alloc_subpage() unconditionally. The alloc_extent_buffer() forces us to check the sectorsize size against page size inside btrfs_alloc_subpage(). Since the function name, btrfs_alloc_subpage(), already indicates it should only get called for subpage cases, do the check in alloc_extent_buffer() and add an ASSERT() in btrfs_alloc_subpage(). Reviewed-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 23 8月, 2021 5 次提交
-
-
由 Naohiro Aota 提交于
btrfs_lookup_ordered_extent() is supposed to query the offset in a file instead of the logical address. Pass the file offset from submit_extent_page() to calc_bio_boundaries(). Also, calc_bio_boundaries() relies on the bio's operation flag, so move the call site after setting it. Fixes: 390ed29b ("btrfs: refactor submit_extent_page() to make bio and its flag tracing easier") Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Boris Burkov 提交于
Add support for fsverity in btrfs. To support the generic interface in fs/verity, we add two new item types in the fs tree for inodes with verity enabled. One stores the per-file verity descriptor and btrfs verity item and the other stores the Merkle tree data itself. Verity checking is done in end_page_read just before a page is marked uptodate. This naturally handles a variety of edge cases like holes, preallocated extents, and inline extents. Some care needs to be taken to not try to verity pages past the end of the file, which are accessed by the generic buffered file reading code under some circumstances like reading to the end of the last page and trying to read again. Direct IO on a verity file falls back to buffered reads. Verity relies on PageChecked for the Merkle tree data itself to avoid re-walking up shared paths in the tree. For this reason, we need to cache the Merkle tree data. Since the file is immutable after verity is turned on, we can cache it at an index past EOF. Use the new inode ro_flags to store verity on the inode item, so that we can enable verity on a file, then rollback to an older kernel and still mount the file system and read the file. Since we can't safely write the file anymore without ruining the invariants of the Merkle tree, we mark a ro_compat flag on the file system when a file has verity enabled. Acked-by: NEric Biggers <ebiggers@google.com> Co-developed-by: NChris Mason <clm@fb.com> Signed-off-by: NChris Mason <clm@fb.com> Signed-off-by: NBoris Burkov <boris@bur.io> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
When btrfs_run_delalloc_range() failed, we will error out. But there is a strange comment mentioning that btrfs_run_delalloc_range() could have returned value >0 to indicate the IO has already started. Commit 40f76580 ("Btrfs: split up __extent_writepage to lower stack usage") introduced the comment, but unfortunately at that time, we were already using @page_started to indicate that case, and still return 0. Furthermore, even if that comment was right (which is not), we would return -EIO if the IO had already started. By all means the comment is incorrect, just remove the comment along with the dead check. Just to be extra safe, add an ASSERT() in btrfs_run_delalloc_range() to make sure we either return 0 or error, no positive return value. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Chaitanya Kulkarni 提交于
The offset and can never be negative use unsigned int instead of int type for them. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Qu Wenruo 提交于
[BUG] When running btrfs/160 in a loop for subpage with experimental compression support, it has a high chance to crash (~20%): BTRFS critical (device dm-7): panic in __btrfs_add_ordered_extent:238: inconsistency in ordered tree at offset 0 (errno=-17 Object already exists) ------------[ cut here ]------------ kernel BUG at fs/btrfs/ordered-data.c:238! Internal error: Oops - BUG: 0 [#1] SMP pc : __btrfs_add_ordered_extent+0x550/0x670 [btrfs] lr : __btrfs_add_ordered_extent+0x550/0x670 [btrfs] Call trace: __btrfs_add_ordered_extent+0x550/0x670 [btrfs] btrfs_add_ordered_extent+0x2c/0x50 [btrfs] run_delalloc_nocow+0x81c/0x8fc [btrfs] btrfs_run_delalloc_range+0xa4/0x390 [btrfs] writepage_delalloc+0xc0/0x1ac [btrfs] __extent_writepage+0xf4/0x370 [btrfs] extent_write_cache_pages+0x288/0x4f4 [btrfs] extent_writepages+0x58/0xe0 [btrfs] btrfs_writepages+0x1c/0x30 [btrfs] do_writepages+0x60/0x110 __filemap_fdatawrite_range+0x108/0x170 filemap_fdatawrite_range+0x20/0x30 btrfs_fdatawrite_range+0x34/0x4dc [btrfs] __btrfs_write_out_cache+0x34c/0x480 [btrfs] btrfs_write_out_cache+0x144/0x220 [btrfs] btrfs_start_dirty_block_groups+0x3ac/0x6b0 [btrfs] btrfs_commit_transaction+0xd0/0xbb4 [btrfs] btrfs_sync_fs+0x64/0x1cc [btrfs] sync_fs_one_sb+0x3c/0x50 iterate_supers+0xcc/0x1d4 ksys_sync+0x6c/0xd0 __arm64_sys_sync+0x1c/0x30 invoke_syscall+0x50/0x120 el0_svc_common.constprop.0+0x4c/0xd4 do_el0_svc+0x30/0x9c el0_svc+0x2c/0x54 el0_sync_handler+0x1a8/0x1b0 el0_sync+0x198/0x1c0 ---[ end trace 336f67369ae6e0af ]--- [CAUSE] For subpage case, we can have multiple sectors inside a page, this makes it possible for __extent_writepage() to have part of its page submitted before returning. In btrfs/160, we are using dm-dust to emulate write error, this means for certain pages, we could have everything running fine, but at the end of __extent_writepage(), one of the submitted bios fails due to dm-dust. Then the page is marked Error, and we change @ret from 0 to -EIO. This makes the caller extent_write_cache_pages() to error out, without submitting the remaining pages. Furthermore, since we're erroring out for free space cache, it doesn't really care about the error and will update the inode and retry the writeback. Then we re-run the delalloc range, and will try to insert the same delalloc range while previous delalloc range is still hanging there, triggering the above error. [FIX] The proper fix is to handle errors from __extent_writepage() properly, by ending the remaining ordered extent. But that fix needs the following changes: - Know at exactly which sector the error happened Currently __extent_writepage_io() works for the full page, can't return at which sector we hit the error. - Grab the ordered extent covering the failed sector As a hotfix for subpage case, here we unify the error paths in __extent_writepage(). In fact, the "if (PageError(page))" branch never get executed if @ret is still 0 for non-subpage cases. As for non-subpage case, we never submit current page in __extent_writepage(), but only add current page into bio. The bio can only get submitted in next page. Thus we never get PageError() set due to IO failure, thus when we hit the branch, @ret is never 0. By simply removing that @ret assignment, we let subpage case ignore the IO failure, thus only error out for fatal errors just like regular sectorsize. So that IO error won't be treated as fatal error not trigger the hanging OE problem. Signed-off-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-