1. 27 October 2021, 11 commits
    • btrfs: cleanup for extent_write_locked_range() · 2bd0fc93
      Committed by Qu Wenruo
      There are several cleanups for extent_write_locked_range(), most of them
      are pure cleanups, but with some preparation for future subpage support.
      
      - Add a proper comment for which call sites are suitable
        Unlike regular synchronized extent write back, if async COW or zoned
        COW happens, we have all pages in the range still locked.
      
        Thus for those (only) two call sites, we need this function to
        assemble the page contents into bios and submit them.
      
      - Remove @mode parameter
        Both existing call sites pass WB_SYNC_ALL, so the @mode parameter
        is not needed.
      
      - Better error handling
        Currently if we hit an error during the page iteration loop, we
        overwrite @ret, so only the last error gets recorded.

        Here we add the @found_error and @first_error variables to record
        whether we hit any error, and the first error we hit, so the first
        error won't get lost.
      
      - Don't reuse @start as the cursor
        The parameter @start is reused as the cursor to iterate the range.
        Not a big problem, but since we're here, introduce a proper @cur
        as the cursor.
      
      - Remove impossible branch
        Since all pages are still locked after the ordered extent is inserted,
        there is no way a page can get its dirty bit cleared.
        Remove the branch where the page is not dirty and replace it with an
        ASSERT().
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: make add_ra_bio_pages() compatible · 6a404910
      Committed by Qu Wenruo
      [BUG]
      If we remove the subpage limitation in add_ra_bio_pages() and then read
      a compressed extent which has part of its range in the next page, like
      the following inode layout:
      
      	0	32K	64K	96K	128K
      	|<--------------|-------------->|
      
      Btrfs will trigger an ASSERT() in the endio function:
      
        assertion failed: atomic_read(&subpage->readers) >= nbits
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3431!
        Internal error: Oops - BUG: 0 [#1] SMP
        Workqueue: btrfs-endio btrfs_work_helper [btrfs]
        Call trace:
         assertfail.constprop.0+0x28/0x2c [btrfs]
         btrfs_subpage_end_reader+0x148/0x14c [btrfs]
         end_page_read+0x8c/0x100 [btrfs]
         end_bio_extent_readpage+0x320/0x6b0 [btrfs]
         bio_endio+0x15c/0x1dc
         end_workqueue_fn+0x44/0x64 [btrfs]
         btrfs_work_helper+0x74/0x250 [btrfs]
         process_one_work+0x1d4/0x47c
         worker_thread+0x180/0x400
         kthread+0x11c/0x120
         ret_from_fork+0x10/0x30
        ---[ end trace c8b7b552d3bb408c ]---
      
      [CAUSE]
      When we read the page range [0, 64K), we find it's a compressed extent,
      and we will try to add extra pages in add_ra_bio_pages() to avoid
      reading the same compressed extent.
      
      But when we add such page into the read bio, it doesn't follow the
      behavior of btrfs_do_readpage() to properly set subpage::readers.
      
      This means, for page [64K, 128K), its subpage::readers is still 0.
      
      And when endio is executed on both pages, since page [64K, 128K) has 0
      subpage::readers, it triggers the above ASSERT().
      
      [FIX]
      Function add_ra_bio_pages() is far from subpage compatible; it always
      assumes PAGE_SIZE == sectorsize, so when it skips to the next range it
      always just skips PAGE_SIZE bytes.
      
      Make it subpage compatible by:
      
      - Skip to next page properly when needed
        If we find there is already a page cache, we need to skip to next page.
        For that case, we shouldn't just skip PAGE_SIZE bytes, but use
        @pg_index to calculate the next bytenr and continue.
      
      - Only add the page range covered by current extent map
        We need to calculate which range is covered by current extent map and
        only add that part into the read bio.
      
      - Update subpage::readers before submitting the bio
      
      - Use proper cursor other than confusing @last_offset
      
      - Calculate the miss threshold based on sector size
        It no longer counts missed pages, as for 64K page size we have at
        most 3 pages to skip (only 2 if aligned).
      
      - Add ASSERT() to make sure our bytenr is always aligned
      
      - Add comment for the function
        Add a special note for subpage case, as the function won't really
        work well for subpage cases.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unnecessary parameter delalloc_start for writepage_delalloc() · cf3075fb
      Committed by Qu Wenruo
      In function __extent_writepage() we always pass page start to
      @delalloc_start for writepage_delalloc().
      
      Thus we don't really need @delalloc_start parameter as we can extract it
      from @page.
      
      Remove the @delalloc_start parameter and make __extent_writepage()
      declare @page_start and @page_end as const.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename struct btrfs_io_bio to btrfs_bio · c3a3b19b
      Committed by Qu Wenruo
      Previously we had "struct btrfs_bio", which records IO context for
      mirrored IO and RAID56, and "struct btrfs_io_bio", which records extra
      btrfs specific info for logical bytenr bios.
      
      With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
      "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.
      
      This commit changes the meaning of struct btrfs_bio. A name like
      btrfs_logical_bio was suggested, but it's a bit long and we'd prefer
      to use a shorter name.
      
      This could be a concern for backports to older kernels where the
      different meaning could possibly cause confusion or bugs. Comparing the
      new and old structures, there's no overlap among the struct members so a
      build would break in case of incorrect backport.
      
      We haven't had many backports to bio code anyway so this is more of a
      theoretical cause of bugs and a matter of precaution but we'll need to
      keep the semantic change in mind.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove btrfs_bio_alloc() helper · cd8e0cca
      Committed by Qu Wenruo
      The helper btrfs_bio_alloc() is almost the same as btrfs_io_bio_alloc(),
      except it's allocating using BIO_MAX_VECS as @nr_iovecs, and initializes
      bio->bi_iter.bi_sector.
      
      However the naming itself does not use "btrfs_io_bio" to indicate that
      it deals with "struct btrfs_io_bio", and it can easily be confused with
      "struct btrfs_bio".

      Considering that assigning bio->bi_iter.bi_sector is such simple work
      and there are already tons of call sites doing it manually, there is no
      need to do it in a helper.
      
      Remove btrfs_bio_alloc() helper, and enhance btrfs_io_bio_alloc()
      function to provide a fail-safe value for its @nr_iovecs.
      
      And then replace all btrfs_bio_alloc() callers with
      btrfs_io_bio_alloc().
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename btrfs_bio to btrfs_io_context · 4c664611
      Committed by Qu Wenruo
      The structure btrfs_bio is used at two different sites:

      - bio->bi_private for mirror based profiles
        For those profiles (SINGLE/DUP/RAID1*/RAID10), this structure records
        how many mirrors are still pending, and saves the original endio
        function of the bio.

      - RAID56 code
        In that case, RAID56 only utilizes the stripe info, and no longer
        uses it to track the pending mirrors.

      So btrfs_bio is not always bound to a bio, and contains more info about
      the IO context, thus renaming it will make the naming less confusing.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: only allow one process to add pages to a relocation inode · 35156d85
      Committed by Johannes Thumshirn
      Don't allow more than one process to add pages to a relocation inode on
      a zoned filesystem, otherwise we cannot guarantee the sequential write
      rule once we're filling preallocated extents on a zoned filesystem.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: unexport repair_io_failure() · 38d5e541
      Committed by Qu Wenruo
      Function repair_io_failure() is no longer used out of extent_io.c since
      commit 8b9b6f25 ("btrfs: scrub: cleanup the remaining nodatasum
      fixup code"), which removes the last external caller.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: convert latest_bdev type to btrfs_device and rename · d24fa5c1
      Committed by Anand Jain
      In preparation to fix a bug in btrfs_show_devname().
      
      Convert the fs_devices::latest_bdev type from struct block_device to
      struct btrfs_device and rename the member to fs_devices::latest_dev,
      so that btrfs_show_devname() can use fs_devices::latest_dev::name.
      Tested-by: Su Yue <l@damenly.su>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: finish fully written block group · be1a1d7a
      Committed by Naohiro Aota
      If we have written to the zone capacity, the device automatically
      deactivates the zone. Sync up block group side (the active BG list and
      zone_is_active flag) with it.
      
      We need to do it both on data BGs and metadata BGs. On data side, we add a
      hook to btrfs_finish_ordered_io(). On metadata side, we use
      end_extent_buffer_writeback().
      
      To reduce excess lookup of a block group, we mark the last extent buffer in
      a block group with EXTENT_BUFFER_ZONE_FINISH flag. This cannot be done for
      data (ordered_extent), because the address may change due to
      REQ_OP_ZONE_APPEND.
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: pack all subpage bitmaps into a larger bitmap · 72a69cd0
      Committed by Qu Wenruo
      Currently we use a u16 bitmap to make 4K sectorsize work for 64K page
      size.

      But this u16 bitmap is not large enough for larger page sizes like
      128K, nor is it space efficient for 16K page size.
      
      To handle both cases, here we pack all subpage bitmaps into a larger
      bitmap, now btrfs_subpage::bitmaps[] will be the ultimate bitmap for
      subpage usage.
      
      Each sub-bitmap will have its start bit number recorded in
      btrfs_subpage_info::*_start, and its bitmap length recorded in
      btrfs_subpage_info::bitmap_nr_bits.
      
      All subpage bitmap operations will be converted from using direct u16
      operations to bitmap operations, with above *_start calculated.
      
      For 64K page size with 4K sectorsize, this should not cause much
      difference.
      
      While for 16K page size, we will only need 1 unsigned long (u32) to
      store all the bitmaps, which saves quite some space.
      
      Furthermore, this allows us to support larger page sizes like 128K and
      256K.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 26 October 2021, 2 commits
  3. 23 August 2021, 11 commits
    • btrfs: zoned: fix ordered extent boundary calculation · 939c7feb
      Committed by Naohiro Aota
      btrfs_lookup_ordered_extent() is supposed to query the offset in a file
      instead of the logical address. Pass the file offset from
      submit_extent_page() to calc_bio_boundaries().
      
      Also, calc_bio_boundaries() relies on the bio's operation flag, so move
      the call site after setting it.
      
      Fixes: 390ed29b ("btrfs: refactor submit_extent_page() to make bio and its flag tracing easier")
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: initial fsverity support · 14605409
      Committed by Boris Burkov
      Add support for fsverity in btrfs. To support the generic interface in
      fs/verity, we add two new item types in the fs tree for inodes with
      verity enabled. One stores the per-file verity descriptor and btrfs
      verity item and the other stores the Merkle tree data itself.
      
      Verity checking is done in end_page_read just before a page is marked
      uptodate. This naturally handles a variety of edge cases like holes,
      preallocated extents, and inline extents. Some care needs to be taken
      to not try to verify pages past the end of the file, which are accessed
      by the generic buffered file reading code under some circumstances,
      like reading to the end of the last page and trying to read again.
      Direct IO on a verity file falls back to buffered reads.
      
      Verity relies on PageChecked for the Merkle tree data itself to avoid
      re-walking up shared paths in the tree. For this reason, we need to
      cache the Merkle tree data. Since the file is immutable after verity is
      turned on, we can cache it at an index past EOF.
      
      Use the new inode ro_flags to store verity on the inode item, so that we
      can enable verity on a file, then rollback to an older kernel and still
      mount the file system and read the file. Since we can't safely write the
      file anymore without ruining the invariants of the Merkle tree, we mark
      a ro_compat flag on the file system when a file has verity enabled.
      Acked-by: Eric Biggers <ebiggers@google.com>
      Co-developed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the dead comment in writepage_delalloc() · 7361b4ae
      Committed by Qu Wenruo
      When btrfs_run_delalloc_range() fails, we error out.

      But there is a strange comment mentioning that
      btrfs_run_delalloc_range() could return a value > 0 to indicate that
      the IO has already started.
      
      Commit 40f76580 ("Btrfs: split up __extent_writepage to lower stack
      usage") introduced the comment, but unfortunately at that time, we were
      already using @page_started to indicate that case, and still return 0.
      
      Furthermore, even if that comment were right (which it is not), we
      would return -EIO if the IO had already started.

      By all means the comment is incorrect; just remove it along with the
      dead check.
      
      Just to be extra safe, add an ASSERT() in btrfs_run_delalloc_range() to
      make sure we either return 0 or error, no positive return value.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix argument type of btrfs_bio_clone_partial() · 21dda654
      Committed by Chaitanya Kulkarni
      The offset and size can never be negative, so use unsigned int instead
      of int for them.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: unify regular and subpage error paths in __extent_writepage() · 963e4db8
      Committed by Qu Wenruo
      [BUG]
      When running btrfs/160 in a loop for subpage with experimental
      compression support, it has a high chance of crashing (~20%):
      
       BTRFS critical (device dm-7): panic in __btrfs_add_ordered_extent:238: inconsistency in ordered tree at offset 0 (errno=-17 Object already exists)
       ------------[ cut here ]------------
       kernel BUG at fs/btrfs/ordered-data.c:238!
       Internal error: Oops - BUG: 0 [#1] SMP
       pc : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
       lr : __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
       Call trace:
        __btrfs_add_ordered_extent+0x550/0x670 [btrfs]
        btrfs_add_ordered_extent+0x2c/0x50 [btrfs]
        run_delalloc_nocow+0x81c/0x8fc [btrfs]
        btrfs_run_delalloc_range+0xa4/0x390 [btrfs]
        writepage_delalloc+0xc0/0x1ac [btrfs]
        __extent_writepage+0xf4/0x370 [btrfs]
        extent_write_cache_pages+0x288/0x4f4 [btrfs]
        extent_writepages+0x58/0xe0 [btrfs]
        btrfs_writepages+0x1c/0x30 [btrfs]
        do_writepages+0x60/0x110
        __filemap_fdatawrite_range+0x108/0x170
        filemap_fdatawrite_range+0x20/0x30
        btrfs_fdatawrite_range+0x34/0x4dc [btrfs]
        __btrfs_write_out_cache+0x34c/0x480 [btrfs]
        btrfs_write_out_cache+0x144/0x220 [btrfs]
        btrfs_start_dirty_block_groups+0x3ac/0x6b0 [btrfs]
        btrfs_commit_transaction+0xd0/0xbb4 [btrfs]
        btrfs_sync_fs+0x64/0x1cc [btrfs]
        sync_fs_one_sb+0x3c/0x50
        iterate_supers+0xcc/0x1d4
        ksys_sync+0x6c/0xd0
        __arm64_sys_sync+0x1c/0x30
        invoke_syscall+0x50/0x120
        el0_svc_common.constprop.0+0x4c/0xd4
        do_el0_svc+0x30/0x9c
        el0_svc+0x2c/0x54
        el0_sync_handler+0x1a8/0x1b0
        el0_sync+0x198/0x1c0
       ---[ end trace 336f67369ae6e0af ]---
      
      [CAUSE]
      For subpage case, we can have multiple sectors inside a page, this makes
      it possible for __extent_writepage() to have part of its page submitted
      before returning.
      
      In btrfs/160, we are using dm-dust to emulate write error, this means
      for certain pages, we could have everything running fine, but at the end
      of __extent_writepage(), one of the submitted bios fails due to dm-dust.
      
      Then the page is marked Error, and we change @ret from 0 to -EIO.
      
      This makes the caller extent_write_cache_pages() error out, without
      submitting the remaining pages.
      
      Furthermore, since we're erroring out for free space cache, it doesn't
      really care about the error and will update the inode and retry the
      writeback.
      
      Then we re-run the delalloc range and will try to insert the same
      delalloc range while the previous delalloc range is still hanging
      there, triggering the above error.
      
      [FIX]
      The proper fix is to handle errors from __extent_writepage() properly,
      by ending the remaining ordered extent.
      
      But that fix needs the following changes:
      
      - Know exactly at which sector the error happened
        Currently __extent_writepage_io() works on the full page and can't
        report at which sector we hit the error.
      
      - Grab the ordered extent covering the failed sector
      
      As a hotfix for subpage case, here we unify the error paths in
      __extent_writepage().
      
      In fact, the "if (PageError(page))" branch never gets executed if @ret
      is still 0 for non-subpage cases.

      For the non-subpage case, we never submit the current page in
      __extent_writepage(), but only add the current page into a bio.
      The bio can only get submitted when handling the next page.

      Thus we never get PageError() set due to IO failure, and when we hit
      the branch, @ret is never 0.
      
      By simply removing that @ret assignment, we let the subpage case ignore
      the IO failure, and only error out for fatal errors, just like the
      regular sectorsize case.

      That way an IO error won't be treated as a fatal error and won't
      trigger the hanging OE problem.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: allow submit_extent_page() to do bio split · e0eefe07
      Committed by Qu Wenruo
      Currently submit_extent_page() just checks if the current page range
      can be fitted into the current bio, and if not, submits the bio and
      then re-adds the range.
      
      But this behavior can't handle subpage case at all.
      
      For subpage case, the problem is in the page size, 64K, which is also
      the same size as stripe size.
      
      This means, if we can't fit a full 64K into a bio, due to stripe limit,
      then it won't fit into next bio without crossing stripe either.
      
      The proper way to handle it is:
      
      - Check how many bytes can be put into the current bio
      - Put as many bytes as possible into current bio first
      - Submit current bio
      - Create a new bio
      - Add the remaining bytes into the new bio
      
      Refactor submit_extent_page() so that it does the above iteration.
      
      The main loop inside submit_extent_page() will look like this:
      
      	cur = pg_offset;
      	while (cur < pg_offset + size) {
      		u32 offset = cur - pg_offset;
      		int added;
      		if (!bio_ctrl->bio) {
      			/* Allocate new bio if needed */
      		}
      
      		/* Add as many bytes into the bio */
      		added = btrfs_bio_add_page();
      
      		if (added < size - offset) {
      			/* The current bio is full, submit it */
      		}
      		cur += added;
      	}
      
      Also, since we're doing new bio allocation deep inside the main loop,
      extract that code into a new helper, alloc_new_bio().
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: fix writeback which does not have ordered extent · cc1d0d93
      Committed by Qu Wenruo
      [BUG]
      When running fsstress with subpage RW support, there are random
      BUG_ON()s triggered with the following trace:
      
       kernel BUG at fs/btrfs/file-item.c:667!
       Internal error: Oops - BUG: 0 [#1] SMP
       CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43
       Hardware name: Radxa ROCK Pi 4B (DT)
       Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
       pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
       pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
       lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
       Call trace:
        btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
        btrfs_submit_bio_start+0x20/0x30 [btrfs]
        run_one_async_start+0x28/0x44 [btrfs]
        btrfs_work_helper+0x128/0x1b4 [btrfs]
        process_one_work+0x22c/0x430
        worker_thread+0x70/0x3a0
        kthread+0x13c/0x140
        ret_from_fork+0x10/0x30
      
      [CAUSE]
      The above BUG_ON() means there is some bio range which doesn't have an
      ordered extent, which indeed is worth a BUG_ON().
      
      Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra
      subpage dirty bitmap to record which range is dirty and should be
      written back.
      
      This means, if we submit a bio for a subpage range, we not only need to
      clear the page dirty flag, but also the subpage dirty bits.
      
      In __extent_writepage_io(), we call btrfs_page_clear_dirty() for any
      range for which we submit a bio.
      
      But there is a loophole: if we hit a range which is beyond i_size, we
      just call btrfs_writepage_endio_finish_ordered() to finish the ordered
      io, then break out, without clearing the subpage dirty bits.
      
      This means, if we hit the above branch, the subpage dirty bits are
      still there; if another range of the page gets dirtied and we need to
      write back that page again, we will submit a bio for the old range,
      leaving a wild bio range which doesn't have an ordered extent.
      
      [FIX]
      Fix it by always calling btrfs_page_clear_dirty() in
      __extent_writepage_io().
      
      Also to avoid such problem from happening again, add a new assert,
      btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage
      dirty bits are cleared before exiting __extent_writepage_io().
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reset this_bio_flag to avoid inheriting old flags · 4c37a793
      Committed by Qu Wenruo
      In btrfs_do_readpage(), we never reset @this_bio_flag after we hit a
      compressed extent.
      
      This is fine, as for PAGE_SIZE == sectorsize case, we can only have one
      sector for one page, thus @this_bio_flag will only be set at most once.
      
      But for the subpage case, after hitting a compressed extent,
      @this_bio_flag will always have the EXTENT_BIO_COMPRESSED bit, even if
      we're reading a regular extent.
      
      This will lead to various read errors, and trigger the new ASSERT() in
      incoming subpage patches, which add stricter checks in
      btrfs_submit_compressed_read().
      
      Fix it by declaring @this_bio_flag inside the main loop and reset its
      value for each iteration.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: switch uptodate to bool in btrfs_writepage_endio_finish_ordered · 25c1252a
      Committed by David Sterba
      The uptodate parameter should be bool, change the type.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused start and end parameters from btrfs_run_delalloc_range() · a129ffb8
      Committed by Qu Wenruo
      Since commit d75855b4 ("btrfs: Remove
      extent_io_ops::writepage_start_hook") removed the writepage_start_hook()
      and added the btrfs_writepage_cow_fixup() function, there is no need to
      follow the old hook parameters.

      Remove the @start and @end parameters: the fixup check is currently a
      full page check and doesn't need them.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: remove max_zone_append_size logic · 5a80d1c6
      Committed by Johannes Thumshirn
      There used to be a patch in the original series for zoned support which
      limited the extent size to max_zone_append_size, but this patch has been
      dropped somewhere around v9.
      
      We've decided to go the opposite direction, instead of limiting extents
      in the first place we split them before submission to comply with the
      device's limits.
      
      Remove the related code, btrfs_fs_info::max_zone_append_size and
      btrfs_zoned_device_info::max_zone_append_size.
      
      This also removes the workaround for dm-crypt introduced in
      1d68128c ("btrfs: zoned: fail mount if the device does not support
      zone append") because the fix has been merged as f34ee1dc ("dm
      crypt: Fix zoned block device support").
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 21 June 2021, 16 commits
    • btrfs: subpage: fix a rare race between metadata endio and eb freeing · 3d078efa
      Committed by Qu Wenruo
      [BUG]
      There is a very rare ASSERT() triggering during full fstests run for
      subpage rw support.
      
      No other reproducer so far.
      
      The ASSERT() gets triggered for metadata read in
      btrfs_page_set_uptodate() inside end_page_read().
      
      [CAUSE]
      There is still a small race window for metadata only; the race could
      happen like this:
      
                      T1                  |              T2
      ------------------------------------+-----------------------------
      end_bio_extent_readpage()           |
      |- btrfs_validate_metadata_buffer() |
      |  |- free_extent_buffer()          |
      |     Still have 2 refs             |
      |- end_page_read()                  |
         |- if (unlikely(PagePrivate())   |
         |  The page still has Private    |
         |                                | free_extent_buffer()
         |                                | |  Only one ref 1, will be
         |                                | |  released
         |                                | |- detach_extent_buffer_page()
         |                                |    |- btrfs_detach_subpage()
         |- btrfs_set_page_uptodate()     |
            The page no longer has Private|
            >>> ASSERT() triggered <<<    |
      
      This race window is super small, thus pretty hard to hit, even with so
      many runs of fstests.
      
      But the race window is still there; we have to solve it another way
      rather than relying on a random PagePrivate() check.
      
      Data path is not affected, as it will lock the page before reading,
      while unlocking the page after the last read has finished, thus no race
      window.
      
      [FIX]
      This patch will fix the bug by repurposing btrfs_subpage::readers.
      
      Now btrfs_subpage::readers will be a member shared by both metadata and
      data.
      
      For metadata path, we don't do the page unlock as metadata only relies
      on extent locking.
      
      At the same time, teach page_range_has_eb() to take
      btrfs_subpage::readers into consideration.
      
      So that even if the last eb of a page gets freed, page::private won't be
      detached as long as there still are pending end_page_read() calls.
      
      By this we eliminate the race window. It will slightly increase the
      metadata memory usage, as the page may not be released as frequently
      as usual, but it should not be a big deal.
      
      The code got introduced in ("btrfs: submit read time repair only for
      each corrupted sector"), but the fix is in a separate patch to keep the
      problem description and the crash is rare so it should not hurt
      bisectability.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make __extent_writepage_io() only submit dirty range for subpage · c5ef5c6c
      Committed by Qu Wenruo
      __extent_writepage_io() function originally just iterates through all
      the extent maps of a page, and submits any regular extents.
      
      This is fine for sectorsize == PAGE_SIZE case, as if a page is dirty, we
      need to submit the only sector contained in the page.
      
      But for subpage case, one dirty page can contain several clean sectors
      with at least one dirty sector.
      
      If __extent_writepage_io() still submits all regular extent maps, it
      can submit data which is already written to disk.
      And since such already written data won't have corresponding ordered
      extents, it will trigger a BUG_ON() in btrfs_csum_one_bio().
      
      Change the behavior of __extent_writepage_io() by finding the first
      dirty byte in the page, and only submitting the dirty range rather
      than the full extent.
      
      While we are here, also make the following calls subpage
      compatible:
      
      - SetPageError()
      - end_page_writeback()
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c5ef5c6c
    • Q
      btrfs: make btrfs_set_range_writeback() subpage compatible · d2a91064
      Committed by Qu Wenruo
      Function btrfs_set_range_writeback() currently just sets the page
      writeback unconditionally.
      
      Change it to call the subpage helper so that we can handle both cases
      well.
      
      Since the subpage helpers need btrfs_fs_info, also change the
      parameter to accept btrfs_inode.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d2a91064
    • Q
      btrfs: update locked page dirty/writeback/error bits in __process_pages_contig · a33a8e9a
      Committed by Qu Wenruo
      When __process_pages_contig() gets called for
      extent_clear_unlock_delalloc(), if we hit the locked page, only the
      Private2 bit is updated, while the dirty/writeback/error bits are all
      skipped.
      
      There are several call sites that call extent_clear_unlock_delalloc()
      with locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK
      
      - cow_file_range()
      - run_delalloc_nocow()
      - cow_file_range_async()
        All for their error handling branches.
      
      For those call sites, since we skip the locked page for
      dirty/error/writeback bit update, the locked page will still have its
      subpage dirty bit remaining.
      
      Normally it is up to the call site which locked the page to handle
      it, but it won't hurt if we also do the update here.

      Especially since there are already other call sites doing the same
      thing by manually passing NULL as locked_page.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a33a8e9a
    • Q
      btrfs: make page Ordered bit to be subpage compatible · b945a463
      Committed by Qu Wenruo
      This involves the following modifications:
      
      - Ordered extent creation
        This is done in process_one_page(), now PAGE_SET_ORDERED will call
        subpage helper to do the work.
      
      - endio functions
        This is done in btrfs_mark_ordered_io_finished().
      
      - btrfs_invalidatepage()
      
      - btrfs_cleanup_ordered_extents()
        Use the subpage helper, and add an extra branch to exit if the
        locked page has covered the full range.
      
      Now the usage of page Ordered flag for ordered extent accounting is fully
      subpage compatible.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b945a463
    • Q
      btrfs: make process_one_page() to handle subpage locking · 1e1de387
      Committed by Qu Wenruo
      Introduce a new data-inode-specific subpage member, writers, to
      record how many sectors are under page lock for delalloc writing.
      
      This member acts pretty much the same as readers, except it's only for
      delalloc writes.
      
      This is important for the delalloc code to trace which page can
      really be freed, as we have cases like run_delalloc_nocow() where we
      may stop processing a nocow range partway through a page and need to
      fall back to cow.
      In that case, we need a way to determine if we can really unlock a
      full page.
      
      With the new btrfs_subpage::writers, there is a new requirement:
      - A page locked by process_one_page() must be unlocked by
        process_one_page()
        There are still tons of call sites that manually lock and unlock a
        page without updating btrfs_subpage::writers.
        So if we lock a page through process_one_page(), it must be
        unlocked by process_one_page() to keep btrfs_subpage::writers
        consistent.
      
        This will be handled in next patch.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1e1de387
    • Q
      btrfs: make end_bio_extent_writepage() to be subpage compatible · 9047e317
      Committed by Qu Wenruo
      Now in end_bio_extent_writepage(), the only subpage-incompatible code
      is the end_page_writeback() call.

      Just call the subpage helpers instead.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      9047e317
    • Q
      btrfs: make __process_pages_contig() to handle subpage dirty/error/writeback status · e38992be
      Committed by Qu Wenruo
      For __process_pages_contig() and process_one_page() to handle
      subpage, we only need to pass the bytenr in and call the subpage
      helpers to handle the dirty/error/writeback status.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e38992be
    • Q
      btrfs: only require sector size alignment for end_bio_extent_writepage() · 321a02db
      Committed by Qu Wenruo
      Just like the read page path, for subpage support we only require
      sector size alignment.
      
      So change the error message condition to only require sector alignment.
      
      This should not affect existing code, as for regular sectorsize ==
      PAGE_SIZE case, we are still requiring page alignment.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      321a02db
    • Q
      btrfs: refactor page status update into process_one_page() · ed8f13bf
      Committed by Qu Wenruo
      In __process_pages_contig() we update page status according to page_ops.
      
      That update process is a bunch of 'if' branches buried inside two
      loops, which makes it pretty hard to extend for later subpage
      operations.
      
      So this patch extracts these operations into their own function,
      process_one_page().
      
      Also, since we're refactoring __process_pages_contig(), move the new
      helper and __process_pages_contig() before their first caller, to
      remove the forward declaration.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ed8f13bf
    • Q
      btrfs: pass bytenr directly to __process_pages_contig() · 98af9ab1
      Committed by Qu Wenruo
      As a preparation for incoming subpage support, we need bytenr passed to
      __process_pages_contig() directly, not the current page index.
      
      So change the parameter and all callers to pass bytenr in.
      
      With the modification, here we need to replace the old @index_ret with
      @processed_end for __process_pages_contig(), but this brings a small
      problem.
      
      Normally we follow the inclusive return value convention, meaning
      @processed_end should be the last byte we processed.

      If parameter @start is 0 and we failed to lock any page, we would
      return @processed_end as -1, causing more problems for
      __unlock_for_delalloc().
      
      So here for @processed_end, we use two different return value
      patterns.  If we have locked any page, @processed_end will be the
      last byte of the last locked page; otherwise it will be @start.
      
      This change impacts lock_delalloc_pages(), so it needs to check
      @processed_end and only unlock the range if we have locked any pages.
      
      Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
      Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      98af9ab1
    • Q
      btrfs: rename PagePrivate2 to PageOrdered inside btrfs · f57ad937
      Committed by Qu Wenruo
      Inside btrfs we use Private2 page status to indicate we have an ordered
      extent with pending IO for the sector.
      
      But the page status name, Private2, tells us nothing about the bit
      itself, so this patch renames it to Ordered.
      An extra comment about the bit is added, so readers who are still
      uncertain about the page Ordered status can find it easily.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f57ad937
    • Q
      btrfs: pass btrfs_inode to btrfs_writepage_endio_finish_ordered() · 38a39ac7
      Committed by Qu Wenruo
      There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
      end_compressed_bio_write().
      
      It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
      which is only supposed to accept inode pages.
      
      Thankfully the important info here is the inode, so let's pass
      btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
      make @page parameter optional.
      
      By this, end_compressed_bio_write() can happily pass page=NULL while
      still getting everything done properly.
      
      Also, to go along with this modification, replace the @page parameter
      of trace_btrfs_writepage_end_io_hook() with btrfs_inode.
      Although this removes the page_index info, the existing start/len
      should be enough for most use cases.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      38a39ac7
    • Q
      btrfs: make subpage metadata write path call its own endio functions · fa04c165
      Committed by Qu Wenruo
      For subpage metadata, we're reusing two functions for subpage metadata
      write:
      
      - end_bio_extent_buffer_writepage()
      - write_one_eb()
      
      But the truth is, for subpage we just call
      end_bio_subpage_eb_writepage() without using any part of
      end_bio_extent_buffer_writepage().
      
      For write_one_eb() it's pretty similar, with only a small part of the
      code reused.

      There is really no need to pollute the existing code path if we're
      not really using most of it.
      
      So this patch separates the subpage metadata write path from the
      regular write path by:
      
      - Use end_bio_subpage_eb_writepage() directly as endio in
        write_one_subpage_eb()
      - Directly call write_one_subpage_eb() in submit_eb_subpage()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      fa04c165
    • Q
      btrfs: refactor submit_extent_page() to make bio and its flag tracing easier · 390ed29b
      Committed by Qu Wenruo
      There is a lot of code inside extent_io.c that needs both "struct bio
      **bio_ret" and "unsigned long prev_bio_flags", along with parameters
      like "unsigned long bio_flags".
      
      Such strange parameters are here for bio assembly.
      
      For example, we have such inode page layout:
      
        0       4K      8K      12K
        |<-- Extent A-->|<- EB->|
      
      Then what we do is:
      
      - Page [0, 4K)
        *bio_ret = NULL
        So we allocate a new bio to bio_ret,
        Add page [0, 4K) to *bio_ret.
      
      - Page [4K, 8K)
        *bio_ret != NULL
        We find this page is contiguous with *bio_ret,
        and if we're not at a stripe boundary, we
        add page [4K, 8K) to *bio_ret.
      
      - Page [8K, 12K)
        *bio_ret != NULL
        But this page is not contiguous, so
        we submit *bio_ret, then allocate a new bio,
        and add page [8K, 12K) to the new bio.
      
      This means we need to record both the bio and its bio_flags, but we
      record them manually through that strange parameter list, rather than
      encapsulating them into their own structure.
      
      So this patch will introduce a new structure, btrfs_bio_ctrl, to record
      both the bio, and its bio_flags.
      
      Also, in the above case, for all pages added to the bio we need to
      check if the new page crosses a stripe boundary.  This check itself
      can be time consuming, and we don't really need to do it for each
      page.
      
      This patch also integrates the stripe boundary check into btrfs_bio_ctrl.
      When a new bio is allocated, the stripe and ordered extent boundary is
      also calculated, so no matter how large the bio will be, we only
      calculate the boundaries once, to save some CPU time.
      
      The following functions/structures are affected:
      
      - struct extent_page_data
        Replace its bio pointer with structure btrfs_bio_ctrl (embedded
        structure, not pointer)
      
      - end_write_bio()
      - flush_write_bio()
        Just change how bio is fetched
      
      - btrfs_bio_add_page()
        Use pre-calculated boundaries instead of re-calculating them.
        And use @bio_ctrl to replace @bio and @prev_bio_flags.
      
      - calc_bio_boundaries()
        New function
      
      - submit_extent_page() callers
      - btrfs_do_readpage() callers
      - contiguous_readpages() callers
        Use @bio_ctrl to replace @bio and @prev_bio_flags, and change how
        the bio is grabbed.
      
      - btrfs_bio_fits_in_ordered_extent()
        Removed, as now the ordered extent size limit is done at bio
        allocation time, no need to check for each page range.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      390ed29b
    • D
      btrfs: clean up header members offsets in write helpers · 24880be5
      Committed by David Sterba
      Move header offsetof() to the expression that calculates the address so
      it's part of get_eb_offset_in_page where the 2nd parameter is the member
      offset.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      24880be5